Spam Email Classifier

The document describes a system for classifying spam emails using machine learning techniques. It discusses collecting email data, cleaning and preprocessing the text, extracting features, splitting the data, and using various machine learning algorithms like Naive Bayes, Random Forest, and Support Vector Machines to classify emails as spam or legitimate.

Uploaded by

mm8805


SPAM EMAIL CLASSIFIER

Keshav Jayakrishnan [RA2111030010078]

Pranav Mathilakath [RA2111030010078]

Ananya M [RA2111030010071]
INTRODUCTION

Email is vulnerable to spam, which includes advertisements, phishing attempts, and malicious content. Around 55% of all emails are spam, causing inconvenience and potential harm.

Text classification is a cost-effective way to impose structure on unstructured, messy text such as documents and spam messages. Machine learning can make accurate predictions in real time and can replace slow manual review with faster analysis of large volumes of data. For a company in particular, analyzing text data helps inform business decisions and can even automate business processes.

In this project, machine learning techniques are used to detect spam in email messages. Machine learning lets computers learn to perform a task without being explicitly programmed for it.

PROBLEM STATEMENT
Email communication has become an integral part of modern-day personal and professional lives. However, the
widespread use of email has also given rise to a significant problem – spam emails. Spam emails refer to
unsolicited bulk emails that often contain advertisements, phishing attempts, malicious content, or fraudulent
activities.

Statistics indicate that approximately 55% of all emails are classified as spam, highlighting the pervasive nature
of this problem. As the volume of spam emails continues to grow, the need for effective and efficient spam
detection and classification solutions becomes increasingly crucial.

Literature Survey

An overview of several notable studies in the field:

1. M. Raza, N. D. Jayasinghe, and M. M. A. Muslam's research:
• Compared the performance of Naive Bayes and Support
Vector Machines (SVMs) for spam email classification.
• Found that both Naive Bayes and SVMs consistently
exhibited superior accuracy, achieving up to 91%
classification accuracy across various experimental
setups.

2. S. Gadde, A. Lakshmanarao, and S. Satyanarayana's paper:

• Explored the effectiveness of Long Short-Term Memory
(LSTM) networks for spam email classification.
• The LSTM system achieved an impressive accuracy rate
of 98%, demonstrating the potential of deep learning
approaches.
3. P. Sethi, V. Bhandari, and B. Kohli:
• Investigated the performance variations of machine
learning algorithms based on different attributes in the
dataset.
• Highlighted the importance of careful feature selection
and engineering for optimizing classification accuracy.
4. H. Karamollaoglu, İ. A. Dogru, and M. Dorterler:
• Focused on spam classification for Turkish messages
and emails using Naive Bayes and SVMs.
• Both models achieved high accuracies around 90%,
demonstrating their applicability to multi-lingual spam
detection.
5. P. Navaney, G. Dubey, and A. Rana:
• Conducted a comparative analysis of Support Vector
Machines (SVM), Naive Bayes, and the entropy method
for spam email classification.
• Found that the SVM model consistently achieved the
highest accuracy of 97.5%, outperforming the other
methods.
SYSTEM ARCHITECTURE OVERVIEW

The system architecture comprises three main modules: User Interface (UI), Data Processing, and Machine Learning. Each module
plays a crucial role in the overall functionality of the spam email classifier.

User Interface (UI) Module:

• Designed using the Streamlit library for Python

• Responsible for acquiring user input (email text or documents)

• Displays the processed output and classification results in a user-friendly manner

• Provides a seamless and intuitive interface for interaction

Data Processing Module:

• Performs data cleaning tasks, such as removing outliers, handling null values, and eliminating irrelevant features

• Merges multiple datasets into a single dataset with relevant features (text and label)

• Applies text pre-processing techniques, including tag removal, tokenization, stop word removal, and lemmatization

• Converts textual data into numerical feature vectors using techniques like Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF)


Machine Learning Module:

• This is the central module of the three.

• It handles everything related to machine learning and results analysis.
• Its main functions are:
i. Training the machine learning models
ii. Testing the models
iii. Determining the parameter values for each model
iv. Keyword extraction
v. Calculating the final output
• The output of this module is forwarded to the UI module, which presents it visually to the user.

METHODOLOGY

DATA COLLECTION

• Data plays a crucial role in prediction and classification tasks.
• Open-source datasets containing a large number of emails and their corresponding labels (spam or legitimate) were collected from various resources like Kaggle and UCI.
• Multiple datasets were combined to ensure diversity and generalization in the training data, with approximately 6,000 spam emails included.

DATA CLEANING

• This step involves preprocessing the data to improve its quality and suitability for training.
• Tasks include removing outliers, handling null values, and eliminating irrelevant features from the data.
• After cleaning, the datasets were merged into a single dataset containing only two essential features: text (email content) and label (spam or legitimate).

TEXT PRE-PROCESSING

• Tag removal: Removing HTML tags and unknown characters using regular expressions.
• Tokenization: Breaking down the text into sentences and individual tokens (words).
• Stop word removal: Eliminating common stop words (e.g., 'the', 'a', 'is') that provide little information.
• Lemmatization: Converting words to their base forms (lemmas) using part-of-speech tagging and chunking techniques.
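
The pre-processing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's code: the `preprocess` function, the regular expressions, and the tiny `STOP_WORDS` set are all assumptions made for the example. Lemmatization is left out because it needs an external resource such as a part-of-speech tagger and a lemma dictionary.

```python
import html
import re

# Tiny illustrative stop word list (an assumption; a real project would use a fuller list).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "you"}

TAG_RE = re.compile(r"<[^>]+>")        # matches HTML tags such as <b> or </p>
TOKEN_RE = re.compile(r"[a-z0-9']+")   # runs of lowercase letters, digits, apostrophes


def preprocess(text):
    """Strip HTML tags, lowercase, tokenize into words, and drop stop words."""
    no_tags = TAG_RE.sub(" ", text)
    unescaped = html.unescape(no_tags)  # turn entities like &amp; back into characters
    tokens = TOKEN_RE.findall(unescaped.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess("<p>You are the <b>WINNER</b> of a prize!</p>")
```

Here `tokens` holds only the content-bearing words of the message; punctuation, markup, and stop words are gone before feature extraction begins.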

FEATURE EXTRACTION

• The preprocessed text data was converted into numerical feature vectors suitable for machine learning algorithms.
• Two language models were employed: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
• BoW represents text as a "bag" containing the count of each unique word, disregarding grammar and word order.
• TF-IDF measures the importance of a word by considering its frequency in the document and its rarity across the entire corpus.

DATA SPLITTING

• The dataset was split into training and testing subsets.
• 80% of the data was used for training the machine learning models, while the remaining 20% was reserved for testing and evaluating the models' performance.
• Care was taken to maintain a similar distribution of spam and legitimate emails in both the training and test sets, ensuring that the models were exposed to a diverse range of examples during training and evaluation.

MACHINE LEARNING

• Trains various machine learning models (Naive Bayes, Random Forest, Logistic Regression, K-Nearest Neighbors, Support Vector Machines) on the processed data.
• Performs classification of input text or documents as spam or legitimate.
• Extracts relevant keywords and entities from the input text.
• Analyzes and evaluates the performance of the trained models using appropriate metrics.
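
To make the two representations concrete, the sketch below computes BoW counts and a textbook TF-IDF weight (term frequency times the log of inverse document frequency) with the standard library only. The function names and the three toy documents are invented for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ from this plain formula.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Count each unique token, ignoring grammar and word order."""
    return Counter(tokens)


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        weights.append({
            # tf = share of the document taken by the term; idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, already tokenized (invented for illustration).
docs = [
    ["free", "prize", "free"],
    ["meeting", "agenda"],
    ["free", "meeting"],
]
weights = tf_idf(docs)
```

In the first document, "prize" ends up with a higher weight than "free" even though "free" occurs twice, because "prize" appears in only one document of the corpus while "free" appears in two.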

CODING AND TESTING

Importing Required Libraries:

▪ CountVectorizer and TfidfVectorizer from scikit-learn for text vectorization.
▪ NumPy for numerical computing and array manipulation.
▪ train_test_split from scikit-learn for splitting the data into training and testing sets.
▪ Classifiers like MultinomialNB, SVC, LogisticRegression, KNeighborsClassifier, and RandomForestClassifier from scikit-learn.
▪ pandas for data manipulation and analysis.

Loading and Processing Data:

▪ The code loaded the cleaned and preprocessed dataset from a file or database.
▪ Data processing tasks, such as merging datasets, handling missing values, and encoding categorical variables, were performed at this stage.
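
As an illustration of how the loaded data might be divided, the sketch below uses scikit-learn's `train_test_split` for an 80/20 split that preserves the class ratio. The toy `texts` and `labels` lists stand in for the merged dataset and are invented for the example, as is the seed value; the slides do not show the project's actual call.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the merged dataset (invented): 1 = spam, 0 = legitimate.
texts = [f"spam message {i}" for i in range(10)] + [f"normal message {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 80% for training, 20% for testing
    stratify=labels,   # keep the spam/legitimate ratio the same in both subsets
    random_state=42,   # arbitrary seed so the split is reproducible
)
```

Passing `stratify=labels` is what keeps the spam share similar in both subsets; without it, a random split of a small or imbalanced dataset can leave the test set with very few spam examples.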

Implementing Training Algorithms:

• The selected machine learning algorithms (Naive Bayes, Random Forest, Logistic Regression, KNN, and SVM) were instantiated and trained on the preprocessed data.
• Techniques like grid search and cross-validation were employed to fine-tune the hyperparameters of the models, optimizing their performance.

Retrieving Predictions:

• Once the models were trained, they were used to make predictions on the test set.
• The predicted labels (spam or legitimate) were compared with the true labels to evaluate the models' performance.
• Metrics such as accuracy, precision, recall, and F1-score were calculated to assess the models' effectiveness.
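
The training and evaluation steps above might look like the following sketch for a single model. It assumes a TF-IDF plus Naive Bayes pipeline tuned with `GridSearchCV`; the toy messages, the parameter grid, and the choice of 2-fold cross-validation are illustrative assumptions, not the project's actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for illustration; 1 = spam, 0 = legitimate.
train_texts = [
    "win a free prize now", "free cash offer click here",
    "claim your free reward", "cheap loans win money",
    "meeting agenda for monday", "please review the attached report",
    "lunch at noon tomorrow", "project status update attached",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["free prize offer", "monday project meeting"]
test_labels = [1, 0]

# Vectorization and classification chained so the grid search tunes them together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Grid search over the Naive Bayes smoothing parameter with 2-fold cross-validation.
search = GridSearchCV(pipeline, {"clf__alpha": [0.1, 1.0]}, cv=2, scoring="f1")
search.fit(train_texts, train_labels)

# Predict on held-out messages and compare with the true labels.
predicted = search.predict(test_texts)
metrics = {
    "accuracy": accuracy_score(test_labels, predicted),
    "precision": precision_score(test_labels, predicted, zero_division=0),
    "recall": recall_score(test_labels, predicted, zero_division=0),
    "f1": f1_score(test_labels, predicted, zero_division=0),
}
```

The same pattern applies to the other classifiers named above: swap the `"clf"` step and the parameter grid, then compare the resulting metric dictionaries.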
SCREENSHOTS AND RESULTS

RESULT ANALYSIS
Various experiments and evaluations were conducted to assess the performance of the spam email classifier and to compare the effectiveness of different approaches. This slide presents the key results and the analysis derived from them.
• Language Model Selection: Two language models were evaluated for representing the textual data: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). A comparative analysis was performed to determine which model is better suited to the spam classification task.
• Model Performance Evaluation: The spam email classifier was evaluated using three key metrics: accuracy, precision, and F1-score. The proposed ensemble model, which combines multiple individual classifiers, achieved the following results:
Accuracy: 99.0% Precision: 98.5% F1-score: 98.6%
• Comparison with Individual Models: The performance of the proposed ensemble model was compared against the individual machine learning algorithms used in the project. A tabular or graphical representation can be used to illustrate the metric values (accuracy, precision, F1-score) for each model, highlighting the superior performance of the ensemble approach.
• Analysis and Interpretation: The results underscore the effectiveness of the TF-IDF language model in capturing the important features and characteristics of textual data for spam classification. The ensemble approach, which combines the strengths of multiple algorithms, proved to be a powerful technique for improving classification performance and robustness.
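
The slides do not say how the ensemble combines its classifiers. One common scheme is hard majority voting, sketched below under that assumption; the `majority_vote` helper and the example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label list by majority vote."""
    combined = []
    # zip(*...) groups the votes that all models cast for the same email.
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Predictions from three hypothetical models for three emails (1 = spam, 0 = legitimate).
model_predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
ensemble_labels = majority_vote(model_predictions)
```

With an odd number of models and two classes there are no ties, and a single model's mistake on an email is outvoted whenever the other two agree.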

CONCLUSION

From the results obtained, we conclude that an ensemble machine learning model is more effective at detecting and classifying spam than any individual algorithm. We also conclude that the TF-IDF (term frequency-inverse document frequency) language model is more effective than the Bag of Words model for spam classification when combined with several algorithms. Finally, spam detection can improve further when machine learning algorithms are combined and tuned to the task at hand.
Future enhancements include extending the system to bulk text classification and online learning, adapting to new spam trends, incorporating multi-lingual support, and integrating with email services for real-time spam filtering.
Overall, this project demonstrated the power of machine learning and natural language processing techniques in combating the ever-evolving challenge of spam emails. The spam email classifier developed in this project has the potential to be integrated into email services, providing users with a cleaner and more secure email experience while mitigating the risks associated with spam.

THANK YOU
