Spam Email Classifier

The document describes a system for classifying spam emails using machine learning techniques. It discusses collecting email data, cleaning and preprocessing the text, extracting features, splitting the data, and using various machine learning algorithms like Naive Bayes, Random Forest, and Support Vector Machines to classify emails as spam or legitimate.

Uploaded by

mm8805

SPAM EMAIL CLASSIFIER

BY

Keshav Jayakrishnan [RA2111030010078]

Pranav Mathilakath [RA2111030010078]

Ananya M [RA2111030010071]
INTRODUCTION

Emails have become vulnerable to spam, which includes advertisements, phishing attempts, and malicious content. Around 55% of all emails are spam, causing inconvenience and potential harm.
Text classification is important for structuring unstructured, messy text such as documents and spam messages in a cost-effective way. Machine learning can make more accurate predictions in real time and can replace slow manual review with faster analysis of large volumes of data. Analyzing text data is especially important to companies, because it helps inform business decisions and can automate business processes.
In this project, machine learning techniques are used to detect spam email messages. Machine learning is the field in which computers learn to perform a task without being explicitly programmed for it.

PROBLEM STATEMENT
Email communication has become an integral part of modern-day personal and professional lives. However, the
widespread use of email has also given rise to a significant problem – spam emails. Spam emails refer to
unsolicited bulk emails that often contain advertisements, phishing attempts, malicious content, or fraudulent
activities.

Statistics indicate that approximately 55% of all emails are classified as spam, highlighting the pervasive nature
of this problem. As the volume of spam emails continues to grow, the need for effective and efficient spam
detection and classification solutions becomes increasingly crucial.

Literature Survey

An overview of several notable studies in the field:


1. M. Raza, N. D. Jayasinghe, and M. M. A. Muslam:
• Compared the performance of Naive Bayes and Support
Vector Machines (SVMs) for spam email classification.
• Found that both Naive Bayes and SVMs consistently
exhibited superior accuracy, achieving up to 91%
classification accuracy across various experimental
setups.
2. S. Gadde, A. Lakshmanarao, and S. Satyanarayana:
• Explored the effectiveness of Long Short-Term Memory
(LSTM) networks for spam email classification.
• The LSTM system achieved an accuracy rate of 98%,
demonstrating the potential of deep learning
approaches.
3. P. Sethi, V. Bhandari, and B. Kohli:
• Investigated the performance variations of machine
learning algorithms based on different attributes in the
dataset.
• Highlighted the importance of careful feature selection
and engineering for optimizing classification accuracy.
4. H. Karamollaoglu, İ. A. Dogru, and M. Dorterler:
• Focused on spam classification for Turkish messages
and emails using Naive Bayes and SVMs.
• Both models achieved high accuracies around 90%,
demonstrating their applicability to multi-lingual spam
detection.
5. P. Navaney, G. Dubey, and A. Rana:
• Conducted a comparative analysis of Support Vector
Machines (SVM), Naive Bayes, and the entropy method
for spam email classification.
• Found that the SVM model consistently achieved the
highest accuracy of 97.5%, outperforming the other
methods.
SYSTEM ARCHITECTURE OVERVIEW

The system architecture comprises three main modules: User Interface (UI), Data Processing, and Machine Learning. Each module
plays a crucial role in the overall functionality of the spam email classifier.

User Interface (UI) Module:

• Designed using the Streamlit library for Python

• Responsible for acquiring user input (email text or documents)

• Displays the processed output and classification results in a user-friendly manner

• Provides a seamless and intuitive interface for interaction

Data Processing Module:

• Performs data cleaning tasks, such as removing outliers, handling null values, and eliminating irrelevant features

• Merges multiple datasets into a single dataset with relevant features (text and label)

• Applies text pre-processing techniques, including tag removal, tokenization, stop word removal, and lemmatization

• Converts textual data into numerical feature vectors using techniques like Bag of Words (BoW) and Term Frequency-Inverse Document Frequency
(TF-IDF)


Machine Learning Module:

• This is the central module of the three.
• It performs everything related to machine learning and results analysis.
• Its main functions are:
i. Training the machine learning models
ii. Testing the models
iii. Determining the parameter values for each model
iv. Keyword extraction
v. Final output calculation
• The output from this module is forwarded to the UI module, which presents it visually to the user.

METHODOLOGY

DATA COLLECTION

• Data plays a crucial role in prediction and classification tasks.
• Open-source datasets containing a large number of emails and their corresponding labels (spam or legitimate) were collected from various resources like Kaggle and UCI.
• Multiple datasets were combined to ensure diversity and generalization in the training data, with approximately 6,000 spam emails included.

DATA CLEANING

• This step involves preprocessing the data to improve its quality and suitability for training.
• Tasks include removing outliers, handling null values, and eliminating irrelevant features from the data.
• After cleaning, the datasets were merged into a single dataset containing only two essential features: text (email content) and label (spam or legitimate).

TEXT PRE-PROCESSING

• Tag removal: Removing HTML tags and unknown characters using regular expressions.
• Tokenization: Breaking down the text into sentences and individual tokens (words).
• Stop word removal: Eliminating common stop words (e.g., 'the', 'a', 'is') that provide little information.
• Lemmatization: Converting words to their base forms (lemmas) using part-of-speech tagging and chunking techniques.
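
The first three pre-processing steps can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the stop word list, the regular expressions, and the function name `preprocess` are assumptions made for the example. Lemmatization is left out because it needs an NLP library such as NLTK or spaCy.

```python
import html
import re

# Tiny illustrative stop word list (an assumption); a real project would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "you"}

TAG_RE = re.compile(r"<[^>]+>")            # matches HTML tags
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")   # matches anything except lowercase letters, digits, whitespace


def preprocess(raw_email: str) -> list[str]:
    """Return cleaned tokens: tags stripped, lowercased, stop words removed."""
    text = TAG_RE.sub(" ", raw_email)        # tag removal
    text = html.unescape(text).lower()       # decode entities such as &amp;, then lowercase
    text = NON_TEXT_RE.sub(" ", text)        # drop punctuation and unknown characters
    tokens = text.split()                    # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]  # stop word removal


sample = "<p>You are the <b>WINNER</b> of a free prize &amp; cash!</p>"
tokens = preprocess(sample)
print(tokens)  # ['winner', 'free', 'prize', 'cash']
```

Whitespace splitting is the simplest possible tokenizer; a library tokenizer would handle contractions, URLs, and sentence boundaries more carefully.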

FEATURE EXTRACTION

• The preprocessed text data was converted into numerical feature vectors suitable for machine learning algorithms.
• Two language models were employed: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
• BoW represents text as a "bag" containing the count of each unique word, disregarding grammar and word order.
• TF-IDF measures the importance of a word by considering its frequency in the document and its rarity across the entire corpus.

DATA SPLITTING

• The dataset was split into training and testing subsets.
• 80% of the data was used for training the machine learning models, while the remaining 20% was reserved for testing and evaluating the models' performance.
• Care was taken to maintain a similar distribution of spam and legitimate emails in both the training and test sets, ensuring that the models were exposed to a diverse range of examples during training and evaluation.

MACHINE LEARNING

• Trains various machine learning models (Naive Bayes, Random Forest, Logistic Regression, K-Nearest Neighbors, Support Vector Machines) on the processed data.
• Performs classification of input text or documents as spam or legitimate.
• Extracts relevant keywords and entities from the input text.
• Analyzes and evaluates the performance of the trained models using appropriate metrics.
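
The feature extraction, 80/20 split, and training steps can be sketched with scikit-learn as follows. The toy corpus, the `random_state` value, and the choice of `MultinomialNB` as the single model shown are assumptions for illustration; the slides do not give the actual code or data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus invented for illustration: 1 = spam, 0 = legitimate.
texts = [
    "win free cash prize now", "free prize claim now", "cheap loans win cash",
    "claim free cash offer", "urgent offer win prize",
    "meeting agenda for monday", "project report attached", "lunch on monday",
    "please review the report", "agenda for project meeting",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80/20 split; stratify keeps the spam/legitimate ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

results = {}
for name, vectorizer in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    # Learn the vocabulary from the training text only, then reuse it for the test text.
    train_vectors = vectorizer.fit_transform(X_train)
    test_vectors = vectorizer.transform(X_test)
    model = MultinomialNB().fit(train_vectors, y_train)
    results[name] = model.score(test_vectors, y_test)  # accuracy on the held-out 20%

print(len(X_train), len(X_test), sorted(y_test))
print(results)
```

Fitting the vectorizer on the training subset only keeps test-set vocabulary from leaking into training. On a corpus this small the two scores say nothing about which representation is better; the sketch only shows the mechanics.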

CODING AND TESTING

Importing Required Libraries:

▪ CountVectorizer and TfidfVectorizer from scikit-learn for text vectorization.
▪ NumPy for numerical computing and array manipulation.
▪ train_test_split from scikit-learn for splitting the data into training and testing sets.
▪ Classifiers like MultinomialNB, SVC, LogisticRegression, KNeighborsClassifier, and RandomForestClassifier from scikit-learn.
▪ pandas for data manipulation and analysis.

Loading and Processing Data:

▪ The code loaded the cleaned and preprocessed dataset from a file or database.
▪ Data processing tasks, such as merging datasets, handling missing values, and encoding categorical variables, were performed at this stage.
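
A rough sketch of the loading and processing stage with pandas is shown below. The two in-memory DataFrames stand in for files that would normally be read with `pd.read_csv`; their column names, the `ham` label value, and the duplicate-removal step are assumptions for the example rather than details taken from the project.

```python
import pandas as pd

# Two hypothetical source datasets with different column names.
dataset_a = pd.DataFrame({
    "message": ["Win a free prize", "Meeting at noon", None],
    "category": ["spam", "ham", "ham"],
    "sender_id": [101, 102, 103],   # irrelevant feature, dropped below
})
dataset_b = pd.DataFrame({
    "text": ["Claim your cash now", "Win a free prize"],
    "label": ["spam", "spam"],
})

# Bring both datasets to the same two columns: text and label.
dataset_a = dataset_a.rename(columns={"message": "text", "category": "label"})
merged = pd.concat(
    [dataset_a[["text", "label"]], dataset_b[["text", "label"]]],
    ignore_index=True,
)

merged = merged.dropna(subset=["text", "label"])   # handle null values
merged = merged.drop_duplicates(subset="text")     # extra step (assumption): drop repeated emails
merged["label"] = merged["label"].map({"ham": 0, "spam": 1})  # encode the categorical label
merged = merged.reset_index(drop=True)

print(merged)
```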

Implementing Training Algorithms:

• The selected machine learning algorithms (Naive Bayes, Random Forest, Logistic Regression, KNN, and SVM) were instantiated and trained on the preprocessed data.
• Techniques like grid search and cross-validation were employed to fine-tune the hyperparameters of the models, optimizing their performance.

Retrieving Predictions:

• Once the models were trained, they were used to make predictions on the test set.
• The predicted labels (spam or legitimate) were compared with the true labels to evaluate the models' performance.
• Metrics such as accuracy, precision, recall, and F1-score were calculated to assess the models' effectiveness.
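
Grid search with cross-validation, followed by metric calculation on held-out data, might look like the sketch below. It tunes only an SVM for brevity; the parameter grid, the 3-fold setting, the `f1` scoring choice, and the toy data are all assumptions, since the slides do not list the values that were actually searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data invented for illustration: 1 = spam, 0 = legitimate.
train_texts = [
    "win free cash prize now", "free prize claim now", "cheap loans win cash",
    "claim free cash offer", "urgent offer win prize", "free loans urgent claim",
    "meeting agenda for monday", "project report attached", "lunch on monday",
    "please review the report", "agenda for project meeting", "notes from team lunch",
]
train_labels = [1] * 6 + [0] * 6
test_texts = ["claim your free prize", "monday project meeting"]
test_labels = [1, 0]

# Vectorizer and classifier in one pipeline, so each CV fold refits the vocabulary.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

# 3-fold cross-validated grid search, selecting the combination with the best F1-score.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(train_texts, train_labels)

# Compare predicted labels with the true labels on the held-out examples.
predictions = search.predict(test_texts)
metrics = {
    "accuracy": accuracy_score(test_labels, predictions),
    "precision": precision_score(test_labels, predictions, zero_division=0),
    "recall": recall_score(test_labels, predictions, zero_division=0),
    "f1": f1_score(test_labels, predictions, zero_division=0),
}

print(search.best_params_)
print(metrics)
```

The same pattern applies to the other classifiers by swapping the final pipeline step and its parameter grid.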
SCREENSHOTS AND RESULTS

RESULT ANALYSIS
In this project, various experiments and evaluations were conducted to assess the performance of the spam email classifier and compare the effectiveness of different approaches. This slide presents the key results and analysis derived from these experiments.
• Language Model Selection: Two language models were evaluated for representing the textual data: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). A comparative analysis was performed to determine the superior model for the spam classification task.
• Model Performance Evaluation: The spam email classifier was evaluated using three key metrics: accuracy, precision, and F1-score. The proposed ensemble model, which combined multiple individual classifiers, achieved the following results:
Accuracy: 99.0%
Precision: 98.5%
F1-score: 98.6%
• Comparison with Individual Models: The performance of the proposed ensemble model was compared against the individual machine learning algorithms used in the project. A tabular or graphical representation can be used to illustrate the metric values (accuracy, precision, F1-score) for each model, highlighting the superior performance of the ensemble approach.
• Analysis and Interpretation: The results underscore the effectiveness of the TF-IDF language model in capturing the important features and characteristics of textual data for spam classification. The ensemble approach, which combines the strengths of multiple algorithms, proved to be a powerful technique for improving classification performance and robustness.
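
The slides do not say how the individual classifiers were combined. One common way to build such an ensemble is majority voting, sketched here with scikit-learn's `VotingClassifier`. The choice of hard voting, the three member models, their settings, and the toy corpus are assumptions for illustration and may differ from the project's ensemble.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration: 1 = spam, 0 = legitimate.
texts = [
    "win free cash prize now", "free prize claim now", "cheap loans win cash",
    "claim free cash offer", "urgent offer win prize",
    "meeting agenda for monday", "project report attached", "lunch on monday",
    "please review the report", "agenda for project meeting",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# TF-IDF features feed three classifiers; each casts one vote per email
# and the majority label wins (hard voting).
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="hard",
    ),
)
ensemble.fit(texts, labels)

prediction = ensemble.predict(["claim your free cash prize"])
print(prediction)
```

Using an odd number of voters avoids ties in a two-class problem. Soft voting, which averages predicted probabilities, is an alternative when every member model can output probabilities.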

CONCLUSION

From the results obtained, we can conclude that an ensemble machine learning model is more effective at detecting and classifying spam than any individual algorithm. We can also conclude that the TF-IDF (Term Frequency-Inverse Document Frequency) language model is more effective than the Bag of Words model for spam classification when combined with several algorithms. Finally, spam detection can improve further when machine learning algorithms are combined and tuned to the task.
Future enhancements include extending the system to bulk text classification, adding online learning so that it adapts to new spam trends, incorporating multi-lingual support, and integrating with email services for real-time spam filtering.
Overall, this project demonstrated the power of machine learning and natural language processing techniques in combating the ever-evolving challenge of spam emails. The spam email classifier developed in this project has the potential to be integrated into email services, providing users with a cleaner and more secure email experience while mitigating the risks associated with spam.

THANK YOU
