Minor Project Report
Candidate’s Declaration
It is hereby certified that the work which is being presented in the B. Tech Minor Project
Report entitled "Fake News Article Detection using NLP" in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology, and submitted in the
Department of Computer Science & Engineering of JAMIA HAMDARD UNIVERSITY,
New Delhi (affiliated to AICTE), is an authentic record of our own work carried out
during the period from March 2021 to May 2021 under the guidance of Mr. Piyush Kumar
Gupta. The matter presented in the B.Tech Minor Project Report has not been submitted by
us for the award of any other degree of this or any other institute.
CERTIFICATE
This is to certify that the project entitled "Fake News Detection using NLP" is a bonafide
work carried out by Mr. Ritwik Bahukhandi under my guidance and supervision, and
submitted in partial fulfillment of the B. Tech. degree in CSE of JAMIA HAMDARD
UNIVERSITY, affiliated to AICTE, Delhi. The work embodied in this project has not been
submitted for any other degree or diploma.
Acknowledgements
I express my deep gratitude to Mr. Piyush Kumar Gupta, Assistant Professor, Department
of Computer Science and Engineering, for his valuable guidance and suggestions throughout
my project work.
I would also like to extend my sincere thanks to Prof. MD. Afshar Alam, Professor & HOD,
Computer Science & Engineering, for his time and suggestions to complete this project
work.
TABLE OF CONTENTS

Candidate’s Declaration
Certificate
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abstract
Chapter-1. Introduction
Chapter-2. Literature Review
Chapter-3. Proposed Solution
    3.2 Methodology
        3.2.1 Dataset
        3.2.2 Text Pre-Processing
        3.2.3 Feature Extraction
        3.2.4 Classifiers
Chapter-4. Project Screenshots
Chapter-5. Results
    5.2 Comparison
Chapter-6. Conclusion
References
List of Figures

1.1 NLP as a subset of AI, ML and DL
1.2 Applications of NLP
2.1 N-gram
3.1 Workflow of Project
3.2 Punctuation characters in the string module (Python 3)
3.5 Tokenization
3.7 TF-IDF Output
3.9 Random Forest
3.10 KNN Classification
4.1 Home Page
5.1 Confusion matrix: Naïve Bayes with CountVectorizer and TF-IDF vectorizer
5.2 Confusion matrix: Logistic Regression with CountVectorizer and TF-IDF vectorizer
5.3 Confusion matrix: SGD with CountVectorizer and TF-IDF vectorizer
5.4 Confusion matrix: KNN with CountVectorizer and TF-IDF vectorizer
5.5 Confusion matrix: Decision Tree with CountVectorizer and TF-IDF vectorizer
5.6 Confusion matrix: Random Forest with CountVectorizer and TF-IDF vectorizer

List of Tables
Abstract
Fake news is false or misleading information presented as news. It often has the aim of
damaging the reputation of a person or entity, or making money through advertising
revenue. Media scholar Dr. Nolan Higdon has offered a broader definition of fake news as
“false or misleading content presented as news and communicated in formats spanning
spoken, written, printed, electronic, and digital communication.” This leads to
misinformation and problems in society. Hence, it is important to make an effort to detect
such content and prevent it from spreading.
AIM
This project aims to tackle this issue using computational machine learning models built
with the Scikit-learn library in Python. Features are extracted from the dataset using
CountVectorizer and the Term Frequency-Inverse Document Frequency (TF-IDF)
Transformer. Various linear and non-linear classifiers, namely Logistic Regression,
K-nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), Naïve Bayes, Decision
Tree and Random Forest, were used to predict fake news. Simulation results indicate that
Logistic Regression with TF-IDF gave the most accurate prediction, with a score of 0.93.
Chapter-1
Introduction
Today most people depend on news sourced primarily from WhatsApp, Facebook, Twitter,
Reddit and many other platforms; social media plays a huge role in this matter. Fake news
has always been a problem, but it was not widely exposed to the public until 2019. The
Economic Times, an Indian business-focused daily newspaper, published an article “2019 –
The year of fake news” on December 20, 2019, stating that “The menace of the fake news is
showing no sign of waning despite a government crackdown and a swell of self-regulation
measures taken by social media platforms such as WhatsApp and Facebook. Spread of fake
news has hit a new high in 2019 with every major events”.
Fake news is a major problem in our connected world. Although misinformation and
propaganda have been around for ages, ‘fake news’ is now becoming a real threat, partly due
to the ease of creating, diffusing, and consuming content online. What makes Fake News a
hard problem to solve is the difficulty in identifying, tracking and controlling unreliable
content. Even when there is early evidence of a fake story being circulated online,
removing it or preventing people from sharing it could be perceived as an attempt at
intervention and censorship. People, websites, blogs and social media are all, to some
extent, part of the problem, intentionally or not. False or misleading stories can be easily created
and diffused via the global online networks with a few clicks — in many cases silently
impacting the public opinion. With the so-called ‘deep fakes’, it is already extremely difficult
to tell if what you see is true or not: latest technologies enable hacking real videos or creating
artificial ones, presenting people saying things they never did — in a very realistic way.
Moreover, synthesized speech matching the voice of a known person can be used to claim
statements or words never said. The times when something was perceived as true just
because it was ‘seen on TV’ or in a photo or in a video, are gone.
Fake news is a neologism. Fake news, or fake news websites, have no basis in fact, but are
presented as being factually accurate. Media scholar Dr. Nolan Higdon has argued that the
definition of fake news has been applied too narrowly to select mediums and political
ideologies. Fake news also refers to fabricated stories that contain little to no verifiable
fact. In this report, we present our preliminary experiments on applying machine
learning techniques for fake news detection. In particular, we studied and developed methods
and tools for detecting fake news.
Fake news influences or even shapes public opinion and (re)sets the agenda. It is
distributed by platforms and users, both intentionally and unintentionally. ‘Unintentional
sharing’ is rooted in a general lack of awareness of the problem: people do not realize how
often they are exposed to fake news; they don’t know if they are influenced by misleading
content, or if they are part of the problem itself by unintentionally sharing fake news and
influencing others.
There are ongoing efforts within news corporations and social media companies to mitigate
the problem, and some of them may prove somewhat effective. But the fake news problem
is bigger: it goes beyond corporate boundaries.
This report describes a global, immutable registry of labeled ‘fake news’ as the basis of a
universal solution to the problem — on top of news organizations, social media, and search
engines. Utilizing technologies such as Blockchain, IPFS and Natural Language Processing, the
described platform can empower a global network of evaluators establishing continuous
feedback, labeling, and evaluation of a representative random sample of our global content.
The objective of the system is to quantify the problem and raise global awareness by
systematically taking snapshots of online content which is assessed and labeled by humans.
Furthermore, the platform can offer specialized APIs to expose the patterns and knowledge
extracted from the ongoing analysis of content, in order to enable third parties to predict
the trustworthiness of new content at ‘publish time’ or ‘share time’.
Fig 1.1 NLP as a subset of AI, ML and DL [17]
Natural Language Processing, or NLP for short, is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer science and computational linguistics, in its pursuit to bridge
the gap between human communication and computer understanding. It refers to the AI
method of communicating with an intelligent system using a natural language such as
English. Processing of natural language is required when you want an intelligent system like
a robot to perform as per your instructions, when you want to hear a decision from a
dialogue-based clinical expert system, etc.
Natural Language Understanding (NLU) aims to understand written or spoken language the
way a human would. NLU is used in natural language processing (NLP) tasks like topic
classification, language detection, and sentiment analysis:
a) Sentiment analysis automatically interprets emotions within a text and categorizes them as
positive, negative, or neutral. By quickly understanding, processing, and analyzing thousands
of online conversations, sentiment analysis tools can deliver valuable insights about how
customers view your brand and products.
b) Topic classification is able to understand natural language to automatically sort texts into
predefined groups or topics. Software company Atlassian, for example, uses the tags
Reliability, Usability and Functionality to sort incoming customer support tickets,
enabling them to deal with customer issues efficiently.
A related capability is Natural Language Generation, whose first step, text planning,
includes retrieving the relevant content from the knowledge base.
Applications of NLP:

Fig 1.2 Applications of NLP
There are many applications of NLP; text classification is one of them, and it is used in this
project to classify whether a text is real or fake.
Text Classification: Text classification, also known as text tagging or text categorization, is
the process of categorizing text into organized groups. By using Natural Language
Processing (NLP), text classifiers can automatically analyze text and then assign a set of
pre-defined tags or categories based on its content.
These days unstructured text is everywhere: emails, chat conversations, websites, and
social media. But it’s hard to extract value from this data unless it’s organized in a certain
way. Doing so used to be a difficult and expensive process, since it required spending time
and resources to manually sort the data or creating handcrafted rules that are difficult to
maintain. Text classifiers with NLP have proven to be a great alternative for structuring
textual data in a fast, cost-effective, and scalable way.
We use the Bag-of-Words approach, which gives a frequency matrix of the words, and the
Term Frequency-Inverse Document Frequency (TF-IDF) Transformer for transforming that
matrix into a TF-IDF frequency matrix. After getting the TF-IDF matrix we split the
matrices into random train and test subsets in an 80:20 ratio, in which 80% of the data is
used for training the models and the remaining 20% for testing them. By combining these
approaches, we built a fake news detection tool. It has a minimal user interface allowing the
user to enter the title and text of the particular article which they would like to verify. The
entered title and text are then combined, and the analysis is made based on the article title
and content.
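As an illustration, this pipeline can be sketched with Scikit-learn as follows; the file name, column names and hyper-parameters here are assumptions for illustration, not the exact code of this project.

```python
# Minimal sketch of the pipeline described above, assuming a combined CSV
# with "title", "text" and "label" columns (names are illustrative).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("news.csv")
articles = df["title"] + " " + df["text"]          # title and text combined

counts = CountVectorizer(stop_words="english").fit_transform(articles)
tfidf = TfidfTransformer().fit_transform(counts)   # Bag-of-Words -> TF-IDF

# 80:20 random train/test split
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, df["label"], test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```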
The experimental results with the tool developed are promising, giving an accuracy score
greater than 0.8 for content and title for almost all the classifiers. The Logistic Regression
classifier works best with TF-IDF and gives the highest accuracy score (0.93). The lowest
accuracy score (0.73) was obtained by the non-linear K-nearest Neighbors classifier with
CountVectorizer.
Chapter-2
Literature Review
The term fake news emerged during the last two years; however, fake news and misleading
information have been present in all time periods. Most works which discuss the detection
of fake news and biased information are relatively recent. Some of them are based on
studying the credibility of a news source regardless of the news content. This approach is
not reliable, because a news source could be classified as untrusted and at the same time
publish a true fact.
Natural language processing (NLP) is a field of computer science responsible for the study
and analysis of raw text. The purpose of this field is to enhance human-computer
communication by constructing systems that are capable of understanding raw text and of
incorporating interaction interfaces based on textual messages. Some of the main topics of
NLP are learning syntactic and semantic rules, determining concepts, topics, and sentiment
from a document, automatic summarization, machine translation, natural language
generation, information extraction, etc.
In paper [1] the authors proposed a hybrid approach, which combines the advantages of
linguistic models and network models and intuitively outperforms either of them. Models
based on network analysis recognize the importance of taking various background
information into account, instead of inspecting solely the articles themselves. They perform
generally well most of the time, but when related information is missing or scarce, their
performance drops. In their paper they evaluated a fake news detector, Fakebox, on
adversarial attacks, including fact-distortion, subject-object exchange and cause-confounding
attacks. They believed that similar models based solely on linguistic characteristics will
perform much less effectively in the real world and are especially vulnerable to tampering
attacks. They also suggested the use of a crowdsourced knowledge graph to collect facts
about news events; the timely information collected can be compared to the information
extracted from news articles and help generate a label of veracity.
In paper [3] the approach is similar to that of paper [2]. They take a web link to a news
article as input, parse that link and extract the whole HTML code of the webpage. From the
HTML, the name of the author, the title of the article and the date are extracted. After
getting all the information they analyze the article content/title and check whether it is
fake/real or clickbait/non-clickbait by reading the dataset with fake and real news or with
clickbait/non-clickbait titles and splitting it into train and test sets. Then they build the text
representation model using Bag-of-Words, Term Frequency – Inverse Document Frequency
and bi-grams from the train and test data. Then they fit the data to two machine learning
classifiers, Naïve Bayes (probabilistic classifier) and Linear Support Vector Machine (linear
classifier), and predict the label (fake/real) of the article content. They implemented this
application in Python using the following libraries: Scikit-learn, Pandas, Beautiful Soup 4,
PyQt5 and two external APIs: Google Cloud Natural Language Processing API and
NewsAPI.
In paper [4] the authors' approach to detecting fake news is based on article content; the
authors present an algorithm that uses data mining to detect fake news. The idea is that the
verification of a news article depends on many factors, such as the publisher of the article,
the content of the article, the time of posting on social media websites and the number of
engagements between different users and the article. First, the algorithm extracts the text
features (the characteristics of the content and the publisher). Then it performs linguistic
and visual studies on the extracted features (source, headline, body text, image and video).
The linguistic study includes studying the lexical features and the syntactic features: lexical
features like different writing systems and sensational headlines, syntactic features such as
sentence characteristics and word frequencies. The extracted information is used to
construct a machine learning model to classify the articles as fake or real.
A similar approach, detecting the similarity between headline and article content using the
same dataset offered by the fake news challenge website, is [10]. The authors present a
system of classifiers consisting of two stacked layers. The first layer consists of five
independent classifiers (slave classifiers) developed using Natural Language Processing
(NLP) modules. The second layer has one master classifier which uses the output of the
weak classifiers (slave predictions) as input. The master classifier uses the predictions of
the slave classifiers to give a final prediction as output. The classification techniques used
are a multi-layer perceptron and the ReLU activation function.
The entire research of paper [13] is defined by the following steps: Data Retrieval, Data
Preprocessing, Data Visualization, Tokenization, Feature Extraction, Machine Learning
Algorithms, Training & Testing the Model and Evaluation Metrics. Their main source of
data retrieval is Twitter posts. After creating a Twitter Developer account they started
scraping tweets and used multiple keywords to download tweets, like Donald Trump,
politics, coronavirus. User and tweet information was also extracted, like ID, username,
tweet text and whether the user is verified. Then the entire dataset was converted into
lower-case letters and punctuation was removed. After cleaning the data, feature extraction
was done by a TF-IDF vectorizer. Once the network of each algorithm is built, the data
undergo training, where they perform supervised learning. The algorithm uses the training
data to learn features from the dataset. Later, testing is performed to evaluate the
performance of the built model using performance metrics like accuracy, precision, F1
score and recall, which are derived from the confusion matrix.
In the research of [13], real-time tweets were pulled, and features were preprocessed to
extract valuable characteristics. Four types of algorithms were analyzed under multiple
evaluation metrics. Logistic Regression and Support Vector Machine presented the best
results, with more than a 0.90 accuracy score. LSTM works best with unstructured data like
images and videos. As there are very few absolutely credible sources of information, the
process of detecting fake news becomes more challenging.
[5]’s report provides a fairly simple approach to encoding texts and shows how the presence
of words in general impacts the classification of texts as real and fake. They achieved high
accuracy results in most of their algorithms, and in particular neural networks generally do
better than the others. What’s worth noting is that their LSTMs only use a timestep of 1 and
are essentially multi-layer perceptrons. Still, as mentioned in the LSTM method section, the
LSTMs with real recurrence are trained by using Word2Vec for representations at the word
level. In this case, each word has its own vector, and a sample will be a collection of
vectors and thus a 2-D matrix.
Their approach involves pre-processing the data and then extracting features from the
pre-processed data. They used unigrams, bigrams and n-grams.
N-grams are simply all combinations of adjacent words or letters of length n that you can
find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo
and ox. We may also count the word boundary – that would expand the list of 2-grams to #f,
fo, ox, and x#, where # denotes a word boundary.
We can do the same on the word level. As an example, “the hello, world! ” text contains the
following word-level bigrams: # hello, hello world, world #.
The basic point of n-grams is that they capture the language structure from the statistical
point of view, like what letter or word is likely to follow the given one. The longer the
n-gram (the higher the n), the more context you have to work with. Optimum length really
depends on the application – if your n-grams are too short, you may fail to capture important
differences. On the other hand, if they are too long, you may fail to capture the “general
knowledge” and only stick to particular cases.
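For illustration, a minimal helper (not from any of the surveyed papers) that produces such n-grams:

```python
# Illustrative helper producing n-grams from a sequence of tokens.
def ngrams(tokens, n):
    """Return all contiguous n-grams from a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Character-level bigrams of "fox", with '#' marking the word boundary:
print(ngrams("#fox#", 2))  # [('#','f'), ('f','o'), ('o','x'), ('x','#')]

# Word-level bigrams of "hello world", again with boundary markers:
print(ngrams(["#", "hello", "world", "#"], 2))
# [('#','hello'), ('hello','world'), ('world','#')]
```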
Fig 2.1 N-gram [18]
Among the three pre-training models, CountVectorizer achieves in general the best
performance, and Word2Vec performs relatively poorly among the three. The essential idea
behind both CountVectorizer and TF-IDF is computing a score which depends on the
frequency of the word belonging to the vocabulary. However, compared to CountVectorizer,
TF-IDF includes an extra inverse document frequency term that “penalizes” words that
appear more frequently across documents; the scores represent the importance of a word
within a document. The results may imply that even though the penalization is smoothed by
a log function, the punishment may be too high. The results also show that in general neural
networks do the best consistently, as neural networks serve as powerful universal
approximators.
In [12] the authors compared clickbait and non-clickbait headlines, and highlighted many
interesting differences between these two categories. They utilized these differences as
features to detect clickbait. They also proposed personalized approaches which can block
certain clickbait according to user interests. Finally, using these two components, they
developed a Chrome extension which warns the users of different media websites about the
presence of clickbait on those websites. The extension also gives users the option to block
clickbait, and it automatically blocks similar ones in subsequent visits.
1) Blocking based on topical similarity: Their first approach to blocking clickbait articles is
to first extract a set of topics from a clickbait article, and find the similarity between this set
and the topics previously extracted from blocked and clicked articles.
2) Blocking based on linguistic patterns: In the second approach, they identified the
linguistic patterns in the articles that the user clicks on or chooses to block. The pattern is
formed by normalizing the words in the headlines in the following ways. First, numbers and
quotes are replaced by the tags <D> and <QUOTE>; then the top 200 most commonly
occurring words in the clickbait corpus, including English stop words, are retained in their
original form; after that, content words such as nouns, adjectives, adverbs and verb
inflections are replaced by their POS tags. For instance, “Which Dead ‘Grey’s Anatomy’
Character Are You” reduces to “which JJ <QUOTE> character are you” and “Which
‘Inside Amy Schumer’ Character Are You” reduces to “which <QUOTE> character are
you”. They convert each headline into such patterns, and thus they get a set of patterns for
both blocked and clicked articles. To compute the similarity between two patterns, they use
the word-level edit distance. Using a mechanism similar to the topical similarity case, they
make the block/do-not-block decision.
3) Hybrid approach: They finally experimented with a hybrid approach which takes into
account both topical similarity and linguistic patterns. For a test article, its tags are
extracted, a nugget is formed and compared with the Block-Nuggets and Click-Nuggets;
this gives the topical similarity scores. Similarly, they get the linguistic similarity scores.
The hybrid scores are obtained using a weighted combination of both topical and linguistic
similarity scores, and finally they make the block/do-not-block decision based on the hybrid
scores.
Their work is the first attempt to provide a comprehensive solution to deter the prevalence
of clickbait. However, their future work lies in improving the classification and blocking
performance further and tuning the extension according to further user feedback. They
believe that combating the prevalence of clickbait should be a community initiative, and
towards that end they also plan to make all the data and source code public, so that the
researcher and developer communities can come forward and collectively make the effort a
grand success.
In [14] the authors built a model that takes news events as input and, based on Twitter
reviews and classification algorithms, predicts the percentage chance of the news being fake
or real. The basic idea of their project is to build a model that can predict the credibility of
real-time news events. Their proposed framework consists of four major steps: data
collection, data preprocessing, classification and analysis of results, the same as
[1][2][3][4]. They first take key phrases of the news event that the individual needs to
authenticate as input. After that, live data is collected from the Twitter Streaming API. The
filtered data is stored in the database (MongoDB). The data preprocessing unit is
responsible for preparing the data for further processing. Classification is based on various
news features, Twitter reviews (like sentiment score, number of tweets, number of
followers, number of hashtags, whether the user is verified, number of retweets) and NLP
techniques. They describe a fake news detection method based on one artificial intelligence
algorithm, the Naïve Bayes classifier. The sentiment score is calculated using a text
vectorization algorithm and NLTK (Natural Language Toolkit). By evaluating the effects
acquired from classification and analysis, they were able to decide the share of news being
fake or real.
In the work of [15], the authors demonstrate the use and effectiveness of pre-trained
Transformer-based language models in retrieving and classifying fake news in the highly
specialized domain of COVID-19. The architecture consists of a two-stage model; the
authors refer to the first model as “Model A” and the second as “Model B”. The objective of
Model A is to fetch the candidate “true facts”, or explanations, for a given claim, which are
then evaluated for entailment using Model B. Next, they describe the training procedure as
well as the intended run-time behaviour for both Model A and Model B.
Model A fetches relevant explanations. They train the Transformer model on a binary
sentence entailment task, where the claim and explanation are the two sentences fed in as
input, separated by a [SEP] tag. They then generate negative claim-explanation pairs
through random sampling to ensure that equal proportions of positive and negative pairs are
present. Training the multi-attention network with the COVID-19-specific data enables the
model to capture long-range correlations between the vector representations of claims and
explanations of similar contexts. After that, they train models with a base encoder and a
sequence classification head on top for binary classification of the labels. The model is
trained to optimise the cross-entropy loss. For faster, near real-time performance, they
cache the embeddings for all explanations (the knowledge base) beforehand, and compute
the cosine similarity between the claim and the cached embeddings of the explanations. The
vector of the [CLS] (start of sentence) token of the final layer works as a strong
representation of the entire sentence, although they found that taking the element-wise
mean over all the token vectors leads to better performance. After all this, they fetch the top
explanations for any given claim exceeding a certain threshold of sentence similarity, as
there could be several explanations relevant to a given claim. This threshold is determined
on the basis of the summary statistics of the cosine similarity metric between the claim and
relevant explanations in the validation set.
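The retrieval step of Model A can be sketched roughly as below; the embeddings, names and threshold are hypothetical stand-ins for the Transformer encoder and tuning described in [15].

```python
# Rough sketch of Model A's retrieval step: cosine similarity between a
# claim embedding and pre-cached explanation embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fetch_explanations(claim_vec, cached_vecs, explanations, threshold=0.8):
    """Return (explanation, score) pairs exceeding the similarity threshold."""
    scores = [cosine_similarity(claim_vec, v) for v in cached_vecs]
    pairs = [(e, s) for e, s in zip(explanations, scores) if s >= threshold]
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```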
Model B: The second part of the pipeline is to identify the veracity of a given claim. Model
A fetches the candidate explanations, while Model B is used to verify whether the given
claim aligns with the set of candidate explanations or not; they can therefore treat the task as
a textual entailment problem. To train Model B, a smaller subset of “false claim” and
“explanation” pairs from their original dataset was used, and each sample was
cross-validated with a “true claim”, in other words a claim that aligns with the factual
explanation. However, this small annotated dataset is not sufficient to train the model
effectively. Therefore, the parameters of Model A, which was trained on a much larger
dataset, were used as initial parameters for Model B, and fine-tuned further using their
cross-validated dataset. The training of Model B was done in a similar way to Model A.
They used the statistics of the mean probability score and standard deviation of aligning and
non-aligning claim-explanation pairs in the validation set to determine the thresholds for
Model B classification. They trained and evaluated both Model A and Model B using
several approaches based on classical NLP methods as well as more sophisticated
pre-trained Transformer models.
In paper [16], the authors proposed a novel deep learning model for fake news detection in
social media. Because of the fast propagation of fake news, timely detection is critical to
mitigate its effects. However, usually very few data samples can be labeled in a short time,
which in turn makes supervised learning models infeasible. Hence, a deep semi-supervised
learning model is proposed. The two paths in the proposed model generate a supervised loss
(cross-entropy) and an unsupervised loss (mean squared error), respectively. Training is
then performed by jointly optimizing these two losses. Their experimental results indicate
that the proposed model can detect fake news from the PHEME datasets effectively by
using limited labeled data and lots of unlabeled data.
The authors of [16] propose the model in the context of fake news detection. Suppose the
training data consist of N inputs in total, out of which M are labeled. The inputs are tweets
that contain sentences related to fake news. They employ word embeddings to represent
each input xi (i ∈ 1...N) as an “image”, where each row in the “image” represents one word
in the tweet as an embedding and the number of rows is the number of words in the tweet. S
is the set of labeled inputs, |S| = M. For every i ∈ S, there is a known correct label
yi ∈ 1...C, where C is the number of different classes. The framework of the proposed
model and the corresponding learning procedure evaluate the network for each training
input xi with the supervised path and the unsupervised path, resulting in prediction vectors
zi and zi', respectively. Those two vectors are then used to calculate the loss, which consists
of two components: li is the standard cross-entropy loss, evaluated for labeled inputs only.
On the other hand li', evaluated for all inputs, penalizes different predictions for the same
training input xi by taking the mean squared error between zi and zi'. To combine the
supervised loss li and unsupervised loss li', the latter is scaled by a time-dependent
weighting function w(t) that ramps up, starting from zero, along a Gaussian curve.
Furthermore, instead of using a one-path CNN, they apply two independent CNNs to
generate the two outputs. They also plan to examine the proposed model on other NLP
tasks, such as sentiment analysis, in the future.
In paper [2] the authors took the dataset from Kaggle, by Andrew Thompson; it consists of
almost 150,000 posts, from which they randomly selected 13,000. After getting the posts
they pre-processed the text and removed the stopwords. After preprocessing and removing
the stopwords they used a pre-trained TF-IDF model to convert the text (posts) into
features. The authors of paper [2] used multiple classifiers, like Naïve Bayes, SVM,
Logistic Regression, Decision Tree and NN, for testing whether a text is real or fake. The
SVM outperformed all the other classifiers with an accuracy score of 0.923, and the worst
performing was the NN with an accuracy score of 0.499. Both the logistic regression and
SVM models give better scores with larger datasets. Naive Bayes and Decision Tree did not
improve their scores by a considerable margin with an increase in sample size. However,
Naive Bayes gave the highest accuracy scores for smaller datasets and hence has
considerable importance. Decision Trees, on the other hand, performed poorly throughout
and hence seem to be a bad candidate for fake news classification.
Chapter – 3
Proposed Solution
The objective is to find the relation between the words and the context/sentiments in which
the words appear within the title & content (text), and how it can be used to classify texts as
real or fake.
3.2. Methodology
3.2.1 Dataset
The dataset was taken from Kaggle. Gathering data is often considered the first step towards
classification of fake news, because it cascades into how good the model will be: the more
and better data we get, the better our model will perform. Three different datasets from
Kaggle were taken. The first dataset, named REAL.csv, contains real news articles from
before 2020 and has 4 features: title, text, subject and date. The second dataset, named
Fake.csv, contains fake news articles from before 2020 and has 4 features: title, text, subject
and date. The third dataset, named train.csv, contains fake as well as real news articles from
between 2016 and 2017 and has 5 features: id, title, author, text and label.
This results in a 65,812 × 2 matrix with 2 features: the article (a combination of title, subject
and content) and the label (whether the article is real or fake); as this is supervised learning,
the label feature is required.
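A minimal sketch of assembling such a combined dataset with pandas follows; only the two binary-labeled files are shown (train.csv, which already carries a label column, would be appended similarly), and the column handling is an assumption.

```python
# Sketch of assembling the combined article/label dataset described above.
import pandas as pd

real = pd.read_csv("REAL.csv")   # title, text, subject, date
fake = pd.read_csv("Fake.csv")   # title, text, subject, date
real["label"], fake["label"] = "REAL", "FAKE"

combined = pd.concat([real, fake], ignore_index=True)
combined["article"] = (combined["title"].fillna("") + " "
                       + combined["subject"].fillna("") + " "
                       + combined["text"].fillna(""))
data = combined[["article", "label"]]   # the two-feature matrix used later
print(data.shape)
```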
3.2.2 Text Pre-Processing
Text cleaning, or text pre-processing, is a mandatory step when working with text in
Natural Language Processing (NLP). In real life, human-written text data contains various
words with wrong spellings, short forms, special symbols, emojis, etc. We need to clean this
kind of noisy text data before feeding it to the machine learning model.
Fig. 3.1 Workflow of Project
A) Lower Case
The first step before starting the pre-processing is to convert the whole document into
strings and lowercase all of the content.
B) Remove Punctuations
As part of pre-processing the textual data, we removed all characters that are not textual
(not alphabets, such as punctuation, extra delimiters etc.). We used the string.punctuation
constant in Python to find all punctuation characters. We remove all those punctuation
characters from every word in the texts, with the exception of the symbols ‘#’ and ‘@’;
because these characters are used for Twitter hashtags and mentions, we handle them later.
Next, we removed an assortment of special characters that don’t appear on traditional
American keyboards and don’t contribute to the meaning of the text. The long dash (“–”),
single and double Asian quotations, ellipsis characters (. . .), and bullet points (•) were all
removed for this reason.
Proceeding further, we make all of our texts lowercase and then remove all rows that have
foreign-language characters in their text, since we are only interested in identifying fake
news in English. To do this we used Python’s string module to flag texts containing
non-English characters, and removed all rows with foreign characters. This ensures the text
we preserve consists only of English words with no non-alpha characters.
Fig. 3.2 Punctuation characters in the string module (Python 3)
C) Remove Stopwords
Stop words are the most common words in a language, such as “a”, “be”, “quite”, “should”,
etc. They are often void of meaning and do not add anything to the content, and they are the
most frequently present words in any text. Hence, we presumed that removal of stop words
can have multiple advantages. For one, it decreases memory overhead, since we cut down a
huge amount of text (and hence narrow down the number of features to train our models
on). Second, it reduces noise, since by eliminating stop words we are able to focus on the
more meaningful content (the more distinct features between the two classes).
Removing stop words is not always optimal; sometimes the information that we are looking
for may be included in the stop words that we removed. For example, in most cases of
language modeling or translation it is important that we keep all the stop words. However,
in our circumstances, we are using the semantics of the text to make a decision. In this case,
we can safely remove stop words to observe the more meaningful context words.
D) Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so
they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings
context to the words, so it links words with similar meaning to one word. Text
preprocessing includes both stemming and lemmatization. Many people find these two
terms confusing, and some treat them as the same. Lemmatization is preferred over
stemming because lemmatization does a morphological analysis of the words. One major
difference from stemming is that a lemmatizer takes a part-of-speech parameter, “pos”; if
not supplied, the default is “noun”.
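A minimal sketch of steps A–D with NLTK follows; the project's exact token handling may differ.

```python
# Sketch of the pre-processing steps A-D above (lowercasing, punctuation
# removal, stop-word removal, lemmatization) using NLTK.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
KEEP = {"#", "@"}  # kept for Twitter hashtags and mentions, as noted above
PUNCT = "".join(c for c in string.punctuation if c not in KEEP)

def clean(text):
    text = text.lower()                                  # A) lower case
    text = text.translate(str.maketrans("", "", PUNCT))  # B) punctuation
    words = [w for w in text.split() if w not in STOP]   # C) stop words
    return " ".join(LEMMATIZER.lemmatize(w) for w in words)  # D) lemmatize

print(clean("The foxes were running quickly!"))  # -> "fox running quickly"
```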
3.2.3 Feature Extraction
Features in machine learning are basically numerical attributes on which anyone can
perform mathematical operations such as matrix factorization, dot product etc. But in many
cases a dataset contains numerical values, string values, character values, categorical values,
or connections (one user connected to another user). Conversion of these types of features
into numerical features is called featurization.
In this project, feature extraction is done using CountVectorizer and the Term
Frequency-Inverse Document Frequency (TF-IDF) Transformer.
A) Tokenization
Tokenization refers to splitting up a larger body of text into smaller lines or words, or even
creating words for a non-English language. The various tokenization functions are built into
the nltk module itself. We have used the Regexp Tokenizer, which extracts tokens either by
using the provided regex pattern to split the text (default) or by repeatedly matching the
regex (if gaps are false). Fig. 3.5 shows the tokenized words of our dataset. Optional
parameters also allow filtering tokens by a minimum length. It returns an array of strings
that can be empty.
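For example, a regex tokenizer that repeatedly matches word characters (the pattern is illustrative, not necessarily the one used in this project):

```python
# Example of regex-based tokenization with NLTK's RegexpTokenizer.
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # repeatedly match word characters
print(tokenizer.tokenize("Fake news detection, using NLP!"))
# ['Fake', 'news', 'detection', 'using', 'NLP']
```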
B) CountVectorizer
The bag-of-words model is a way of representing text data when modeling text with
machine learning algorithms. The bag-of-words model is simple to understand and
implement, and has seen great success in problems such as language modeling and
document classification. A bag-of-words is a representation of text that describes the
occurrence of words within a document. It involves two things: a vocabulary of known
words, and a measure of the presence of those known words.
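A toy illustration of the resulting document-term matrix:

```python
# Toy illustration of the matrix produced by CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news spreads fast", "real news spreads slowly"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)        # sparse document-term matrix
print(cv.get_feature_names_out())      # learned vocabulary
print(counts.toarray())                # per-document word counts
```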
C) TF-IDF
A problem with scoring word frequency is that highly frequent words start to dominate in
the document (i.e. have a larger score), but may not contain as much “informational
content” for the model as rarer but perhaps domain-specific words.
One approach is to rescale the frequency of words by how often they appear in all
documents, so that the scores for frequent words like “the”, which are also frequent across
all documents, are penalized.
This approach to scoring is called Term Frequency – Inverse Document Frequency, or
TF-IDF for short, where:
The term frequency tf(t, d) is the proportion of times that the term t ∈ V(d) appears in the
document d: tf(t, d) = n(t, d) / Σ_{t' ∈ V(d)} n(t', d), where n(t, d) is the count of term t in d
and the vocabulary V(d) is constructed from the document d. Thus, if a term t' does not
appear in a document d', the term frequency tf(t', d') is zero. The idea of the term frequency
is essentially the same as CountVectorizer.
Given a document collection D, the inverse document frequency idf(t, D) = log(N / df(t, D))
is the log of the number of documents N divided by df(t, D), the number of documents
d ∈ D containing the term t. As a result, common words in D will have a low idf score,
while infrequent words will have a high idf. Thus, TF-IDF will be very likely to separate
fake news, which often contains less common (even ungrammatical) words, from real news,
which usually consists of common words.
In summary, the TF-IDF score w(t, d) = tf(t, d) · idf(t, D) for a word increases with its
count, but is counteracted if the word appears in too many documents.
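Continuing the toy example from the CountVectorizer sketch, TfidfTransformer re-weights the count matrix exactly as described:

```python
# TfidfTransformer re-weighting the bag-of-words count matrix.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["fake news spreads fast", "real news spreads slowly"]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
# Words shared by both documents ("news", "spreads") receive lower weights
# than words unique to one document ("fake", "fast", "real", "slowly").
print(tfidf.toarray().round(2))
```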
Fig. 3.7 TF-IDF Output
3.2.4 Classifiers
A) Naïve Bayes
Naive Bayes classifiers are probabilistic classifiers (a probabilistic classifier is a classifier
that can predict the category of a given observation based on the observation’s probability
distribution over the set of categories) which apply Bayes’ theorem in their decision rule,
assuming strong independence between features. Naive Bayes classifiers are useful for text
classification problems, because they can be trained efficiently in the supervised learning
setting.
Given a class variable (label) y and a vector of features x = (x1, x2, ..., xn), where n is the
number of features, the probability of y given the feature vector x can be calculated using
the formula:

P(y | x1, ..., xn) = P(y) · P(x1 | y) · ... · P(xn | y) / P(x1, ..., xn)

Since the value of P(x1, ..., xn) is the same for all class variables (labels), the predicted
class is the one with the maximum probability among those calculated for all class
variables:

ŷ = argmax_y P(y) · Π_i P(xi | y)
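A sketch of fitting such a classifier on the TF-IDF features follows; the report does not state the exact variant, so MultinomialNB, the usual choice for count-based text features, is assumed, with X_train etc. as in the earlier pipeline sketch.

```python
# Sketch: assumed MultinomialNB on TF-IDF features from the pipeline sketch.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

nb = MultinomialNB().fit(X_train, y_train)
print(confusion_matrix(y_test, nb.predict(X_test)))
```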
B) Logistic Regression
This refers to a regression model that is used for classification. The method is widely used
for binary classification problems and can also be extended to multi-class classification
problems. Here, the dependent variable is categorical: y ∈ {0, 1}. A binary dependent
variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick, etc.
In this case, you model the probability that the output y is 1 or 0. This is called the sigmoid
probability (σ). If σ(θᵀx) > 0.5, set y = 1, else set y = 0. Unlike Linear Regression (and its
Normal Equation solution), there is no closed-form solution for finding the optimal weights
of Logistic Regression. Instead, you must solve this with maximum likelihood estimation (a
probability model to detect the maximum likelihood of something happening). It can be
used to calculate the probability of a given outcome in a binary model, like the probability
of being classified as sick or of passing an exam.

Sigmoid Probability
The probability in logistic regression is often represented by the sigmoid function (also
called the logistic function or the S-curve):

S(t) = 1 / (1 + e^(−t))

In this equation, t represents a data value (e.g. the number of hours studied) and S(t)
represents the corresponding probability (e.g. of passing the exam). For the same function
written as g(z) = 1 / (1 + e^(−z)):
g(z) tends toward 1 as z → ∞, and g(z) tends toward 0 as z → −∞.
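Written out in code, the sigmoid decision rule looks like this (theta and x are illustrative placeholders):

```python
# The sigmoid function and the 0.5-threshold decision rule written out.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Return 1 if sigma(theta^T x) > 0.5, else 0."""
    return int(sigmoid(theta @ x) > 0.5)

print(sigmoid(0.0))                    # 0.5 at the decision boundary
print(sigmoid(10.0), sigmoid(-10.0))   # approaches 1 and 0 at the extremes
```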
C) Decision Tree
Decision Trees (DT) can be used both for classification and regression. The advantage of
decision trees is that they require very little data preparation; they do not require feature
scaling or centering at all. They are also the fundamental components of Random Forests,
one of the most powerful ML algorithms. Unlike Random Forests and Neural Networks
(which do black-box modeling), Decision Trees are white-box models, which means that
the inner workings of these models are clearly understood. In the case of classification, the
data is segregated based on a series of questions, and any new data point is assigned to the
selected leaf node.
Start at the tree root and split the data on the feature that results in the largest information
gain (IG). This splitting procedure is then repeated in an iterative process at each child node
until the leaves are pure, meaning that the samples at each node all belong to the same class.
In practice, you can set a limit on the depth of the tree to prevent overfitting; purity is
compromised here, as the final leaves may still have some impurity.
D) Random Forest
A Random Forest (RF) is an ensemble of decision trees. The advantages outweigh its
limitations, since you do not have to worry about hyper-parameters except k, which stands
for the number of decision trees to be created from subsets of samples. RF is quite robust to
noise from the individual decision trees; hence, you need not prune individual decision
trees. The larger the number of decision trees, the more accurate the Random Forest
prediction is (this, however, comes with a higher computation cost).
Fig. 3.9 Random Forest [23]
E) K-nearest Neighbors
The K-nearest Neighbors algorithm is used to assign a data point to clusters based on a
similarity measurement, using a supervised method for classification. To write a k-NN
algorithm, we first need to choose the number k and a distance metric (k = 5 is common).
Then, find the k nearest neighbors of the sample that you want to classify and assign the
class label by majority vote.
KNN Classification
A new input point is classified into the category in which it has the greatest number of
neighbors. For example:
Fig. 3.10 KNN Classification [24]
F) Stochastic Gradient Descent (SGD)
This estimator implements regularized linear models with stochastic gradient descent
(SGD) learning: the gradient of the loss is estimated one sample at a time and the model is
updated along the way with a decreasing strength schedule (aka learning rate). SGD allows
minibatch (online/out-of-core) learning via the partial_fit method. For best results using the
default learning-rate schedule, the data should have zero mean and unit variance. This
implementation works with data represented as dense or sparse arrays of floating-point
values for the features. The model it fits can be controlled with the loss parameter; by
default, it fits a linear support vector machine (SVM). The regularizer is a penalty added to
the loss function that shrinks model parameters towards the zero vector, using either the
squared Euclidean norm (L2), the absolute norm (L1), or a combination of both (Elastic
Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is
truncated to 0.0, allowing learning of sparse models and online feature selection.
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear
classifiers and regressors under convex loss functions such as (linear) Support Vector
Machines and Logistic Regression. Even though SGD has been around in the machine
learning community for a long time, it has received a considerable amount of attention only
recently, in the context of large-scale learning. SGD has been successfully applied to
large-scale and sparse machine learning problems often encountered in text classification
and natural language processing. Given that the data is sparse, the classifiers in this module
easily scale to problems with more than 10^5 training examples and more than 10^5
features. Like other classifiers, SGD has to be fitted with two arrays: an array X of shape
(n_samples, n_features) holding the training samples, and an array y of shape (n_samples,)
holding the target values (class labels) for the training samples.
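A sketch of using SGDClassifier accordingly; hinge loss and L2 penalty are its scikit-learn defaults (i.e. a linear SVM), and X_train etc. are as in the earlier pipeline sketch.

```python
# Sketch of scikit-learn's SGDClassifier as described above.
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="hinge", penalty="l2")   # defaults: linear SVM + L2
sgd.fit(X_train, y_train)            # X: (n_samples, n_features)
print(sgd.score(X_test, y_test))     # mean accuracy on the test set

# Mini-batch / out-of-core learning is possible via partial_fit, e.g.:
# sgd.partial_fit(X_batch, y_batch, classes=["FAKE", "REAL"])
```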
Chapter – 4
Project Screenshots
The screenshots in this chapter show the user interface through which the user can check
whether an article is real or fake: the user enters the article’s title and content/text, then
clicks the ‘Check’ button.
Chapter – 5
Results
A) Naïve Bayes
Fig. 5.1 Confusion matrix: Naïve Bayes with CountVectorizer and TF-IDF vectorizer
B) Logistic Regression
Fig. 5.2 Confusion matrix: Logistic Regression with CountVectorizer and TF-IDF
vectorizer
C) SGD
Fig. 5.3 Confusion matrix: SGD with CountVectorizer and TF-IDF vectorizer
D) KNN
Fig. 5.4 Confusion matrix: KNN with CountVectorizer and TF-IDF vectorizer
E) Decision Tree
Fig. 5.5 Confusion matrix: Decision Tree with CountVectorizer and TF-IDF vectorizer
F) Random Forest
Fig. 5.6 Confusion matrix: Random Forest with CountVectorizer and TF-IDF vectorizer
5.2 Comparison
With this approach we got the highest score, 0.99, from Logistic Regression and SGD with
TF-IDF, and the lowest, 0.76, from KNN with CountVectorizer.
Chapter – 6
Conclusion
In this report we evaluated a fake news detector and came to an understanding of how the
presence of words can impact the classification of a text (an article, in this case) as FAKE or
REAL. We achieved high accuracy with most of the algorithms; in particular, Logistic
Regression and SGD give the best results, with an accuracy score of 0.99. The essential idea
behind both CountVectorizer and TF-IDF is computing a score which depends on the
frequency of the word belonging to the vocabulary. However, compared to CountVectorizer,
TF-IDF includes an extra inverse document frequency term that “penalizes” (effectively
masks) words that appear more frequently across documents; the scores represent the
importance of a word within a document.
References
[1] Zhixuan Zhou, Huankang Guan, Meghana Moorthy Bhat and Justin Hsu, “Fake News
Detection via NLP is Vulnerable to Adversarial Attacks”, January 5, 2019
[3] Bashar Al Asaad, Madalina Erascu, “A Tool for Fake News Detection”, 2018 20th
International Symposium on Symbolic and Numeric Algorithms for Scientific Computing
(SYNASC), September 2018
[4] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A
data mining perspective”, August 2017
[5] Sairamvinay Vijayaraghavan, Zhiyuan Guo, Ye Wang, John Voong, Wenda Xu, Armand
Nasseri, Jiaru Cai, Linda Li, Kevin Vuong, Eshan Wadhwa, “Fake News Detection with
Different Models”, February 2020
[6] Bourgonje, P., Schneider, J. M., and Rehm, G, “From clickbait to fake news detection”,
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets
Journalism, January 2017
[7] Shu K., Sliva A., Wang S., Tang J. and Liu H., “Fake News Detection on Social Media:
A Data Mining Perspective”, 2017
[8] Brownlee J., “How to Prepare Text Data for Machine Learning with scikit-learn”, 2015
[9] Bajaj S., “The Pope Has a New Baby Fake News Detection Using Deep Learning”, 2017
[10] J. Thorne, M. Chen, G. Myrianthous, J. Pu, X. Wang, and A. Vlachos, “Fake news
stance detection using stacked ensemble of classifiers”, Proceedings of the 2017 EMNLP
Workshop: Natural Language Processing meets Journalism, January 2017
[11] Fernando Cardoso Durier da Silva, Rafael Vieira da Costa Alves, Ana Cristina Bicharra
Garcia, “Can Machines Learn to Detect Fake News? A Survey Focused on Social Media”,
Proceedings of the 52nd Hawaii International Conference on System Sciences, January 2019
[12] Chakraborty, A., Paranjape, B., Kakarla, S., Ganguly, N., “Stop clickbait: Detecting and
preventing clickbaits in online news media”, 2016 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), August 2016
[13] Harika Kudarvalli and Dr. Jinan Fiaidhi, “Detecting Fake News using Machine
Learning Algorithms”, November 2020
[14] Subhadra Gurav, Swati Sase, Supriya Shinde, Prachi Wabale, Sumit Hirve, “Survey on
Automated System for Fake News Detection using NLP & Machine Learning Approach”,
International Research Journal of Engineering and Technology (IRJET), January 1, 2019
[15] Rutvik Vijjali, Prathyush Potluri, Siddharth Kumar, Sundeep Teki, “Two Stage
Transformer Model for COVID-19 Fake News Detection and Fact Checking”, November 26,
2020
[16] Xishuang Dong, Uboho Victor, Shanta Chowdhury, Lijun Qian, “Deep Two-path Semi-
supervised Learning for Fake News Detection”, June 10, 2019
[17] https://devopedia.org/images/article/187/4949.1560446213.png
[18] https://deepai.org/machine-learning-glossary-and-terms/n-gram
[19] https://x7r6b9v3.rocketcdn.me/wp-content/uploads/2015/11/Stop-Words.jpg.webp
[20] https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
[21] https://www.kdnuggets.com/wp-content/uploads/text-tokens-tokenization-manning.jpg
[22] https://miro.medium.com/max/875/1*WerHJ14JQAd3j8ASaVjAhw.jpeg
[23] https://en.wikipedia.org/wiki/File:Random_forest_diagram_complete.png
[24] https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG