Minor Project Report
Candidate’s Declaration
It is hereby certified that the work which is being presented in the B. Tech Minor Project
Report entitled "Fake News Article Detection using NLP" in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology, and submitted in the
Department of Computer Science & Engineering of JAMIA HAMDARD UNIVERSITY,
New Delhi (affiliated to AICTE), is an authentic record of our own work carried out
during the period from March 2021 to May 2021 under the guidance of Mr. Piyush Kumar
Gupta. The matter presented in the B.Tech Minor Project Report has not been submitted by
us for the award of any other degree of this or any other institute.
CERTIFICATE
This is to certify that the project entitled "Fake News Detection using NLP" is a bonafide
work carried out by Mr. Ritwik Bahukhandi under my guidance and supervision, and
submitted in partial fulfillment of the B. Tech. degree in CSE of JAMIA HAMDARD
UNIVERSITY, affiliated to AICTE, Delhi. The work embodied in this project has not been
submitted for any other degree or diploma.
Acknowledgements
I express my deep gratitude to Mr. Piyush Kumar Gupta, Assistant Professor, Department
of Computer Science and Engineering, for his valuable guidance and suggestions throughout
my project work.
I would also like to extend my sincere thanks to Prof. MD. Afshar Alam, Professor & HOD,
Computer Science & Engineering, for his time and suggestions to complete this project
work.
TABLE OF CONTENTS

Candidate’s Declaration
Certificate
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abstract
Chapter-1. Introduction
Chapter-2. Literature Review
Chapter-3. Proposed Solution
    3.2 Methodology
        3.2.1 Dataset
        3.2.2 Text Pre-Processing
        3.2.3 Feature Extraction
        3.2.4 Classifiers
Chapter-4. Project Screenshots
Chapter-5. Results
    5.2 Comparison
Chapter-6. Conclusion
References
List of Figures

1.1 NLP as a subset of AI, ML and DL
1.2 Applications of NLP
2.1 N-gram
3.1 Workflow of Project
3.2 Punctuation characters in the string module (Python 3)
3.5 Tokenization
3.7 TF-IDF Output
3.9 Random Forest
3.10 KNN Classification
4.1 Home Page
5.1 Confusion matrix: Naïve Bayes with CountVectorizer and TF-IDF vectorizer
5.2 Confusion matrix: Logistic Regression with CountVectorizer and TF-IDF vectorizer
5.3 Confusion matrix: SGD with CountVectorizer and TF-IDF vectorizer
5.4 Confusion matrix: KNN with CountVectorizer and TF-IDF vectorizer
5.5 Confusion matrix: Decision Tree with CountVectorizer and TF-IDF vectorizer
5.6 Confusion matrix: Random Forest with CountVectorizer and TF-IDF vectorizer

List of Tables
Abstract
Fake news is false or misleading information presented as news. It often has the aim of
damaging the reputation of a person or entity, or making money through advertising
revenue. Media scholar Dr. Nolan Higdon has offered a broader definition of fake news as
“false or misleading content presented as news and communicated in formats spanning
spoken, written, printed, electronic, and digital communication.” This leads to
misinformation and problems in society. Hence, it is important to make an effort to detect
such content and prevent it from spreading.
AIM
This project aims to tackle this issue using computational machine learning models built
with the Scikit-learn library in Python. Features are extracted from the dataset using
CountVectorizer and the Term Frequency-Inverse Document Frequency (TF-IDF)
Transformer. Various linear and non-linear classifiers, namely Logistic Regression,
K-nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), Naïve Bayes, Decision
Tree and Random Forest, were used to predict fake news. Simulation results indicate that
Logistic Regression with TF-IDF gave the most accurate prediction, with a score of 0.93.
Chapter-1
Introduction
Today most people depend on news sourced primarily from WhatsApp, Facebook, Twitter,
Reddit and many other platforms; social media plays a huge role in this matter. Fake news
has always been a problem, but it was not widely exposed to the public until 2019. The
Economic Times, an Indian business-focused daily newspaper, published an article “2019 –
The year of fake news” on December 20, 2019, stating that “The menace of the fake news is
showing no sign of waning despite a government crackdown and a swell of self-regulation
measures taken by social media platforms such as WhatsApp and Facebook. Spread of fake
news has hit a new high in 2019 with every major events”.
Fake news is a major problem in our connected world. Although misinformation and
propaganda have been around for ages, ‘fake news’ is now becoming a real threat, partly due
to the ease of creating, diffusing, and consuming content online. What makes Fake News a
hard problem to solve is the difficulty in identifying, tracking and controlling unreliable
content. Even when there is early evidence of a fake story being circulated online,
removing it or preventing people from sharing it could be perceived as an attempt at
intervention and censorship. People, websites, blogs and social media are all, to some
extent, part of the problem, intentionally or not. False or misleading stories can be easily created
and diffused via the global online networks with a few clicks — in many cases silently
impacting the public opinion. With the so-called ‘deep fakes’, it is already extremely difficult
to tell if what you see is true or not: latest technologies enable hacking real videos or creating
artificial ones, presenting people saying things they never did — in a very realistic way.
Moreover, synthesized speech matching the voice of a known person can be used to claim
statements or words never said. The times when something was perceived as true just
because it was ‘seen on TV’ or in a photo or in a video, are gone.
Fake news is a neologism. Fake news, or fake news websites, have no basis in fact, but are
presented as being factually accurate. Media scholar Dr. Nolan Higdon has argued that the
definition of fake news has been applied too narrowly to select mediums and political
ideologies. Fake news also refers to fabricated stories that contain little to no verifiable
fact. In this report, we present our preliminary experiments on applying machine
learning techniques for fake news detection. In particular, we studied and developed methods
and tools for detecting fake news.
Fake news influences or even shapes public opinion and (re)sets the agenda. It is
distributed by platforms and users, both intentionally and unintentionally. ‘Unintentional
sharing’ is rooted in a general lack of awareness of the problem: people do not realize how
often they are exposed to fake news; they don’t know if they are influenced by misleading
content, or if they are part of the problem itself by unintentionally sharing fake news and
influencing others.
There are ongoing efforts within news corporations and social media companies to mitigate
the problem, and some of them may prove somewhat effective. But the fake news problem
is bigger: it goes beyond corporate boundaries.
This report describes a global, immutable registry of labeled ‘fake news’ as the basis of a
universal solution to the problem — on top of news organizations, social media, and search
engines. Utilizing technologies such as Blockchain, IPFS and Natural Language Processing, the
described platform can empower a global network of evaluators establishing continuous
feedback, labeling, and evaluation of a representative random sample of our global content.
The objective of the system is to quantify the problem and raise global awareness by
systematically taking snapshots of online content which is assessed and labeled by humans.
Furthermore, the platform can offer specialized APIs to expose the patterns and knowledge
extracted from the ongoing analysis of content, in order to enable third parties to predict
the trustworthiness of new content at ‘publish time’ or ‘share time’.
Fig 1.1 NLP as a subset of AI, ML and DL [17]
Natural Language Processing, or NLP for short, is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer science and computational linguistics, in its pursuit to bridge
the gap between human communication and computer understanding. It refers to the AI
method of communicating with an intelligent system using a natural language such as
English. Processing of natural language is required when you want an intelligent system like
a robot to perform as per your instructions, when you want to hear a decision from a
dialogue-based clinical expert system, etc.
Natural Language Understanding (NLU) aims to understand written or spoken language the
way a human would. NLU is used in natural language processing (NLP) tasks like topic
classification, language detection, and sentiment analysis:
a) Sentiment analysis automatically interprets emotions within a text and categorizes them as
positive, negative, or neutral. By quickly understanding, processing, and analyzing thousands
of online conversations, sentiment analysis tools can deliver valuable insights about how
customers view your brand and products.
b) Topic classification is able to understand natural language to automatically sort texts into
predefined groups or topics. Software company Atlassian, for example, uses the tags
Reliability, Usability and Functionality to sort incoming customer support tickets,
enabling them to deal with customer issues efficiently.
A related capability is Natural Language Generation, whose first step, text planning,
includes retrieving the relevant content from the knowledge base.
Applications of NLP:

Fig 1.2 Applications of NLP
There are many applications of NLP; text classification is one of them, and it is used in this
project to classify whether a text is real or fake.
Text Classification: Text classification, also known as text tagging or text categorization, is
the process of categorizing text into organized groups. By using Natural Language
Processing (NLP), text classifiers can automatically analyze text and then assign a set of
pre-defined tags or categories based on its content.
These days unstructured text is everywhere: emails, chat conversations, websites, and
social media. But it’s hard to extract value from this data unless it’s organized in a certain
way. Doing so used to be a difficult and expensive process, since it required spending time
and resources to manually sort the data or creating handcrafted rules that are difficult to
maintain. Text classifiers with NLP have proven to be a great alternative for structuring
textual data in a fast, cost-effective, and scalable way.
We use the Bag-of-Words approach, which gives a frequency matrix of the words, and the
Term Frequency-Inverse Document Frequency (TF-IDF) Transformer for transforming that
matrix into a TF-IDF frequency matrix. After getting the TF-IDF matrix we split the
matrices into random train and test subsets in an 80:20 ratio, in which 80% of the data is
used for training the models and the remaining 20% for testing them. By combining these
approaches, we built a fake news detection tool. It has a minimal user interface allowing the
user to enter the title and text of the particular article which they would like to verify. The
entered title and text are then combined, and the analysis is made based on the article title
and content.
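As an illustration, this pipeline can be sketched with Scikit-learn as follows; the file name, column names and hyper-parameters here are assumptions for illustration, not the exact code of this project.

```python
# Minimal sketch of the pipeline described above, assuming a combined CSV
# with "title", "text" and "label" columns (names are illustrative).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("news.csv")
articles = df["title"] + " " + df["text"]          # title and text combined

counts = CountVectorizer(stop_words="english").fit_transform(articles)
tfidf = TfidfTransformer().fit_transform(counts)   # Bag-of-Words -> TF-IDF

# 80:20 random train/test split
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, df["label"], test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```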
The experimental results with the tool developed are promising, giving an accuracy score
greater than 0.8 for content and title for almost all the classifiers. The Logistic Regression
classifier works best with TF-IDF and gives the highest accuracy score (0.93). The lowest
accuracy score (0.73) was obtained by the non-linear K-nearest Neighbors classifier with
CountVectorizer.
Chapter-2
Literature Review
The term fake news emerged during the last two years; however, fake news and misleading
information have been present in all time periods. Most works which discuss the detection
of fake news and biased information are relatively recent. Some of them are based on
studying the credibility of a news source regardless of the news content. This approach is
not reliable, because a news source could be classified as untrusted and at the same time
publish a true fact.
Natural language processing (NLP) is a field of computer science responsible for the study
and analysis of raw text. The purpose of this field is to enhance human-computer
communication by constructing systems that are capable of understanding raw text and of
incorporating interaction interfaces based on textual messages. Some of the main topics of
NLP are learning syntactic and semantic rules, determining concepts, topics, and sentiment
from a document, automatic summarization, machine translation, natural language
generation, information extraction, etc.
In paper [1] the authors proposed a hybrid approach, which combines the advantages of
linguistic models and network models and intuitively outperforms either of them. Models
based on network analysis recognize the importance of taking various background
information into account, instead of inspecting solely the articles themselves. They perform
generally well most of the time, but when related information is missing or scarce, their
performance drops. In their paper they evaluated a fake news detector, Fakebox, on
adversarial attacks, including fact-distortion, subject-object exchange and cause-confounding
attacks. They believed that similar models based solely on linguistic characteristics will
perform much less effectively in the real world and are especially vulnerable to tampering
attacks. They also suggested the use of a crowdsourced knowledge graph to collect facts
about news events; the timely information collected can be compared to the information
extracted from news articles and help generate a label of veracity.
In paper [3] the approach is similar to that of paper [2]. They take a web link to a news
article as input, parse that link and extract the whole HTML code of the webpage. From the
HTML, the name of the author, the title of the article and the date are extracted. After
getting all the information they analyze the article content/title and check whether it is
fake/real or clickbait/non-clickbait by reading the dataset with fake and real news or with
clickbait/non-clickbait titles and splitting it into train and test sets. Then they build the text
representation model using Bag-of-Words, Term Frequency – Inverse Document Frequency
and bi-grams from the train and test data. Then they fit the data to two machine learning
classifiers, Naïve Bayes (probabilistic classifier) and Linear Support Vector Machine (linear
classifier), and predict the label (fake/real) of the article content. They implemented this
application in Python using the following libraries: Scikit-learn, Pandas, Beautiful Soup 4,
PyQt5 and two external APIs: Google Cloud Natural Language Processing API and
NewsAPI.
In paper [4] the authors' approach to detecting fake news is based on article content; the
authors present an algorithm that uses data mining to detect fake news. The idea is that the
verification of a news article depends on many factors, such as the publisher of the article,
the content of the article, the time of posting on social media websites and the number of
engagements between different users and the article. First, the algorithm extracts the text
features (the characteristics of the content and the publisher). Then it performs linguistic
and visual studies on the extracted features (source, headline, body text, image and video).
The linguistic study includes studying the lexical features and the syntactic features: lexical
features like different writing systems and sensational headlines, syntactic features such as
sentence characteristics and word frequencies. The extracted information is used to
construct a machine learning model to classify the articles as fake or real.
A similar approach, detecting the similarity between headline and article content using the
same dataset offered by the fake news challenge website, is [10]. The authors present a
system of classifiers consisting of two stacked layers. The first layer consists of five
independent classifiers (slave classifiers) developed using Natural Language Processing
(NLP) modules. The second layer has one master classifier which uses the output of the
weak classifiers (slave predictions) as input. The master classifier uses the predictions of
the slave classifiers to give a final prediction as output. The classification techniques used
are a multi-layer perceptron and the ReLU activation function.
The entire research of paper [13] is defined by the following steps: Data Retrieval, Data
Preprocessing, Data Visualization, Tokenization, Feature Extraction, Machine Learning
Algorithms, Training & Testing the Model and Evaluation Metrics. Their main source of
data retrieval is Twitter posts. After creating a Twitter Developer account they started
scraping tweets and used multiple keywords to download tweets, like Donald Trump,
politics, coronavirus. User and tweet information was also extracted, like ID, username,
tweet text and whether the user is verified. Then the entire dataset was converted into
lower-case letters and punctuation was removed. After cleaning the data, feature extraction
was done by a TF-IDF vectorizer. Once the network of each algorithm is built, the data
undergo training, where they perform supervised learning. The algorithm uses the training
data to learn features from the dataset. Later, testing is performed to evaluate the
performance of the built model using performance metrics like accuracy, precision, F1
score and recall, which are derived from the confusion matrix.
In the research of [13], real-time tweets were pulled, and features were preprocessed to
extract valuable characteristics. Four types of algorithms were analyzed under multiple
evaluation metrics. Logistic Regression and Support Vector Machine presented the best
results, with more than a 0.90 accuracy score. LSTM works best with unstructured data like
images and videos. As there are very few absolutely credible sources of information, the
process of detecting fake news becomes more challenging.
[5]’s report provides a fairly simple approach to encoding texts and shows how the presence
of words in general impacts the classification of texts as real and fake. They achieved high
accuracy results in most of their algorithms, and in particular neural networks generally do
better than the others. What’s worth noting is that their LSTMs only use a timestep of 1 and
are essentially multi-layer perceptrons. Still, as mentioned in the LSTM method section, the
LSTMs with real recurrence are trained by using Word2Vec for representations at the word
level. In this case, each word has its own vector, and a sample will be a collection of
vectors and thus a 2-D matrix.
Their approach involves pre-processing the data and then extracting features from the
pre-processed data. They used unigrams, bigrams and n-grams.
N-grams are simply all combinations of adjacent words or letters of length n that you can
find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo
and ox. We may also count the word boundary – that would expand the list of 2-grams to #f,
fo, ox, and x#, where # denotes a word boundary.
We can do the same on the word level. As an example, “the hello, world! ” text contains the
following word-level bigrams: # hello, hello world, world #.
The basic point of n-grams is that they capture the language structure from the statistical
point of view, like what letter or word is likely to follow the given one. The longer the
n-gram (the higher the n), the more context you have to work with. Optimum length really
depends on the application – if your n-grams are too short, you may fail to capture important
differences. On the other hand, if they are too long, you may fail to capture the “general
knowledge” and only stick to particular cases.
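For illustration, a minimal helper (not from any of the surveyed papers) that produces such n-grams:

```python
# Illustrative helper producing n-grams from a sequence of tokens.
def ngrams(tokens, n):
    """Return all contiguous n-grams from a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Character-level bigrams of "fox", with '#' marking the word boundary:
print(ngrams("#fox#", 2))  # [('#','f'), ('f','o'), ('o','x'), ('x','#')]

# Word-level bigrams of "hello world", again with boundary markers:
print(ngrams(["#", "hello", "world", "#"], 2))
# [('#','hello'), ('hello','world'), ('world','#')]
```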
Fig 2.1 N-gram [18]
Among the three pre-training models, CountVectorizer achieves in general the best
performance, and Word2Vec performs relatively poorly among the three. The essential idea
behind both CountVectorizer and TF-IDF is computing a score which depends on the
frequency of the word belonging to the vocabulary. However, compared to CountVectorizer,
TF-IDF includes an extra inverse document frequency term that “penalizes” words that
appear more frequently across documents; the scores represent the importance of a word
within a document. The results may imply that even though the penalization is smoothed by
a log function, the punishment may be too high. The results also show that in general neural
networks do the best consistently, as neural networks serve as powerful universal
approximators.
In [12] the authors compared clickbait and non-clickbait headlines, and highlighted many
interesting differences between these two categories. They utilized these differences as
features to detect clickbait. They also proposed personalized approaches which can block
certain clickbait according to user interests. Finally, using these two components, they
developed a Chrome extension which warns the users of different media websites about the
presence of clickbait on those websites. The extension also gives users the option to block
clickbait, and it automatically blocks similar ones in subsequent visits.
1) Blocking based on topical similarity: Their first approach to blocking clickbait articles is
to first extract a set of topics from a clickbait article, and find the similarity between this set
and the topics previously extracted from blocked and clicked articles.
2) Blocking based on linguistic patterns: In the second approach, they identified the
linguistic patterns in the articles that the user clicks on or chooses to block. The pattern is
formed by normalizing the words in the headlines in the following ways. First, numbers and
quotes are replaced by the tags <D> and <QUOTE>; then the top 200 most commonly
occurring words in the clickbait corpus, including English stop words, are retained in their
original form; after that, content words such as nouns, adjectives, adverbs and verb
inflections are replaced by their POS tags. For instance, “Which Dead ‘Grey’s Anatomy’
Character Are You” reduces to “which JJ <QUOTE> character are you” and “Which
‘Inside Amy Schumer’ Character Are You” reduces to “which <QUOTE> character are
you”. They convert each headline into such patterns, and thus they get a set of patterns for
both blocked and clicked articles. To compute the similarity between two patterns, they use
the word-level edit distance. Using a mechanism similar to the topical similarity case, they
make the block/do-not-block decision.
3) Hybrid approach: They finally experimented with a hybrid approach which takes into
account both topical similarity and linguistic patterns. For a test article, its tags are
extracted, a nugget is formed and compared with the Block-Nuggets and Click-Nuggets;
this gives the topical similarity scores. Similarly, they get the linguistic similarity scores.
The hybrid scores are obtained using a weighted combination of both topical and linguistic
similarity scores, and finally they make the block/do-not-block decision based on the hybrid
scores.
Their work is the first attempt to provide a comprehensive solution to deter the prevalence
of clickbait. However, their future work lies in improving the classification and blocking
performance further and tuning the extension according to further user feedback. They
believe that combating the prevalence of clickbait should be a community initiative, and
towards that end they also plan to make all the data and source code public, so that the
researcher and developer communities can come forward and collectively make the effort a
grand success.
In [14] the authors built a model that takes news events as input and, based on Twitter
reviews and classification algorithms, predicts the percentage chance of the news being fake
or real. The basic idea of their project is to build a model that can predict the credibility of
real-time news events. Their proposed framework consists of four major steps: data
collection, data preprocessing, classification and analysis of results, the same as
[1][2][3][4]. They first take key phrases of the news event that the individual needs to
authenticate as input. After that, live data is collected from the Twitter Streaming API. The
filtered data is stored in the database (MongoDB). The data preprocessing unit is
responsible for preparing the data for further processing. Classification is based on various
news features, Twitter reviews (like sentiment score, number of tweets, number of
followers, number of hashtags, whether the user is verified, number of retweets) and NLP
techniques. They describe a fake news detection method based on one artificial intelligence
algorithm, the Naïve Bayes classifier. The sentiment score is calculated using a text
vectorization algorithm and NLTK (Natural Language Toolkit). By evaluating the effects
acquired from classification and analysis, they were able to decide the share of news being
fake or real.
In the work of [15], the authors demonstrate the use and effectiveness of pre-trained
Transformer-based language models in retrieving and classifying fake news in the highly
specialized domain of COVID-19. The architecture consists of a two-stage model; the
authors refer to the first model as “Model A” and the second as “Model B”. The objective of
Model A is to fetch the candidate “true facts”, or explanations, for a given claim, which are
then evaluated for entailment using Model B. Next, they describe the training procedure as
well as the intended run-time behaviour for both Model A and Model B.
Model A fetches relevant explanations. They train the Transformer model on a binary
sentence entailment task, where the claim and explanation are the two sentences fed in as
input, separated by a [SEP] tag. They then generate negative claim-explanation pairs
through random sampling to ensure that equal proportions of positive and negative pairs are
present. Training the multi-attention network with the COVID-19-specific data enables the
model to capture long-range correlations between the vector representations of claims and
explanations of similar contexts. After that, they train models with a base encoder and a
sequence classification head on top for binary classification of the labels. The model is
trained to optimise the cross-entropy loss. For faster, near real-time performance, they
cache the embeddings for all explanations (the knowledge base) beforehand, and compute
the cosine similarity between the claim and the cached embeddings of the explanations. The
vector of the [CLS] (start of sentence) token of the final layer works as a strong
representation of the entire sentence, although they found that taking the element-wise
mean over all the token vectors leads to better performance. After all this, they fetch the top
explanations for any given claim exceeding a certain threshold of sentence similarity, as
there could be several explanations relevant to a given claim. This threshold is determined
on the basis of the summary statistics of the cosine similarity metric between the claim and
relevant explanations in the validation set.
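The retrieval step of Model A can be sketched roughly as below; the embeddings, names and threshold are hypothetical stand-ins for the Transformer encoder and tuning described in [15].

```python
# Rough sketch of Model A's retrieval step: cosine similarity between a
# claim embedding and pre-cached explanation embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fetch_explanations(claim_vec, cached_vecs, explanations, threshold=0.8):
    """Return (explanation, score) pairs exceeding the similarity threshold."""
    scores = [cosine_similarity(claim_vec, v) for v in cached_vecs]
    pairs = [(e, s) for e, s in zip(explanations, scores) if s >= threshold]
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```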
Model B: The second part of the pipeline is to identify the veracity of a given claim. Model
A fetches the candidate explanations, while Model B is used to verify whether the given
claim aligns with the set of candidate explanations or not; they can therefore treat the task as
a textual entailment problem. To train Model B, a smaller subset of “false claim” and
“explanation” pairs from their original dataset was used, and each sample was
cross-validated with a “true claim”, in other words a claim that aligns with the factual
explanation. However, this small annotated dataset is not sufficient to train the model
effectively. Therefore, the parameters of Model A, which was trained on a much larger
dataset, were used as initial parameters for Model B, and fine-tuned further using their
cross-validated dataset. The training of Model B was done in a similar way to Model A.
They used the statistics of the mean probability score and standard deviation of aligning and
non-aligning claim-explanation pairs in the validation set to determine the thresholds for
Model B classification. They trained and evaluated both Model A and Model B using
several approaches based on classical NLP methods as well as more sophisticated
pre-trained Transformer models.
In paper [16], the authors proposed a novel deep learning model for fake news detection in
social media. Because of the fast propagation of fake news, timely detection is critical to
mitigate its effects. However, usually very few data samples can be labeled in a short time,
which in turn makes supervised learning models infeasible. Hence, a deep semi-supervised
learning model is proposed. The two paths in the proposed model generate a supervised loss
(cross-entropy) and an unsupervised loss (mean squared error), respectively. Training is
then performed by jointly optimizing these two losses. Their experimental results indicate
that the proposed model can detect fake news from the PHEME datasets effectively by
using limited labeled data and lots of unlabeled data.
The authors of [16] propose the model in the context of fake news detection. Suppose the
training data consist of N inputs in total, out of which M are labeled. The inputs are tweets
that contain sentences related to fake news. They employ word embeddings to represent
each input xi (i ∈ 1...N) as an “image”, where each row in the “image” represents one word
in the tweet as an embedding and the number of rows is the number of words in the tweet. S
is the set of labeled inputs, |S| = M. For every i ∈ S, there is a known correct label
yi ∈ 1...C, where C is the number of different classes. The framework of the proposed
model and the corresponding learning procedure evaluate the network for each training
input xi with the supervised path and the unsupervised path, resulting in prediction vectors
zi and zi', respectively. Those two vectors are then used to calculate the loss, which consists
of two components: li is the standard cross-entropy loss, evaluated for labeled inputs only.
On the other hand li', evaluated for all inputs, penalizes different predictions for the same
training input xi by taking the mean squared error between zi and zi'. To combine the
supervised loss li and unsupervised loss li', the latter is scaled by a time-dependent
weighting function w(t) that ramps up, starting from zero, along a Gaussian curve.
Furthermore, instead of using a one-path CNN, they apply two independent CNNs to
generate the two outputs. They also plan to examine the proposed model on other NLP
tasks, such as sentiment analysis, in the future.
In paper [2] the authors took the dataset from Kaggle, by Andrew Thompson; it consists of
almost 150,000 posts, from which they randomly selected 13,000. After getting the posts
they pre-processed the text and removed the stopwords. After preprocessing and removing
the stopwords they used a pre-trained TF-IDF model to convert the text (posts) into
features. The authors of paper [2] used multiple classifiers, like Naïve Bayes, SVM,
Logistic Regression, Decision Tree and NN, for testing whether a text is real or fake. The
SVM outperformed all the other classifiers with an accuracy score of 0.923, and the worst
performing was the NN with an accuracy score of 0.499. Both the logistic regression and
SVM models give better scores with larger datasets. Naive Bayes and Decision Tree did not
improve their scores by a considerable margin with an increase in sample size. However,
Naive Bayes gave the highest accuracy scores for smaller datasets and hence has
considerable importance. Decision Trees, on the other hand, performed poorly throughout
and hence seem to be a bad candidate for fake news classification.
Chapter – 3
Proposed Solution
The objective is to find the relation between the words and the context/sentiments in which
the words appear within the title & content (text), and how it can be used to classify texts as
real or fake.
3.2. Methodology
3.2.1 Dataset
The dataset was taken from Kaggle. Gathering data is often considered the first step towards
classification of fake news, because it cascades into how good the model will be: the more
and better data we get, the better our model will perform. Three different datasets from
Kaggle were taken. The first dataset, named REAL.csv, contains real news articles from
before 2020 and has 4 features: title, text, subject and date. The second dataset, named
Fake.csv, contains fake news articles from before 2020 and has 4 features: title, text, subject
and date. The third dataset, named train.csv, contains fake as well as real news articles from
between 2016 and 2017 and has 5 features: id, title, author, text and label.
This results in a 65,812 × 2 matrix with 2 features: the article (a combination of title, subject
and content) and the label (whether the article is real or fake); as this is supervised learning,
the label feature is required.
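A minimal sketch of assembling such a combined dataset with pandas follows; only the two binary-labeled files are shown (train.csv, which already carries a label column, would be appended similarly), and the column handling is an assumption.

```python
# Sketch of assembling the combined article/label dataset described above.
import pandas as pd

real = pd.read_csv("REAL.csv")   # title, text, subject, date
fake = pd.read_csv("Fake.csv")   # title, text, subject, date
real["label"], fake["label"] = "REAL", "FAKE"

combined = pd.concat([real, fake], ignore_index=True)
combined["article"] = (combined["title"].fillna("") + " "
                       + combined["subject"].fillna("") + " "
                       + combined["text"].fillna(""))
data = combined[["article", "label"]]   # the two-feature matrix used later
print(data.shape)
```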
3.2.2 Text Pre-Processing
Text cleaning, or text pre-processing, is a mandatory step when working with text in
Natural Language Processing (NLP). In real life, human-written text data contains various
words with wrong spellings, short forms, special symbols, emojis, etc. We need to clean this
kind of noisy text data before feeding it to the machine learning model.
Fig. 3.1 Workflow of Project
A) Lower Case
The first step before starting the pre-processing is to convert the whole document into
strings and lowercase all of the content.
B) Remove Punctuations
As part of pre-processing the textual data, we removed all characters that are not textual
(not alphabets, such as punctuation, extra delimiters etc.). We used the string.punctuation
constant in Python to find all punctuation characters. We remove all those punctuation
characters from every word in the texts, with the exception of the symbols ‘#’ and ‘@’;
because these characters are used for Twitter hashtags and mentions, we handle them later.
Next, we removed an assortment of special characters that don’t appear on traditional
American keyboards and don’t contribute to the meaning of the text. The long dash (“–”),
single and double Asian quotations, ellipsis characters (. . .), and bullet points (•) were all
removed for this reason.
Proceeding further, we make all of our texts lowercase and then remove all rows that have
foreign-language characters in their text, since we are only interested in identifying fake
news in English. To do this we used Python’s string module to flag texts containing
non-English characters, and removed all rows with foreign characters. This ensures the text
we preserve consists only of English words with no non-alpha characters.
Fig. 3.2 Punctuation characters in the string module (Python 3)
C) Remove Stopwords
Stop words are the most common words in a language, such as “a”, “be”, “quite”, “should”,
etc. They are often void of meaning and do not add anything to the content, and they are the
most frequently present words in any text. Hence, we presumed that removal of stop words
can have multiple advantages. For one, it decreases memory overhead, since we cut down a
huge amount of text (and hence narrow down the number of features to train our models
on). Second, it reduces noise, since by eliminating stop words we are able to focus on the
more meaningful content (the more distinct features between the two classes).
Removing stop words is not always optimal; sometimes the information that we are looking
for may be included in the stop words that we removed. For example, in most cases of
language modeling or translation it is important that we keep all the stop words. However,
in our circumstances, we are using the semantics of the text to make a decision. In this case,
we can safely remove stop words to observe the more meaningful context words.
D) Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so
they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings
context to the words, so it links words with similar meaning to one word. Text
preprocessing includes both stemming and lemmatization. Many people find these two
terms confusing, and some treat them as the same. Lemmatization is preferred over
stemming because lemmatization does a morphological analysis of the words. One major
difference from stemming is that a lemmatizer takes a part-of-speech parameter, “pos”; if
not supplied, the default is “noun”.
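A minimal sketch of steps A–D with NLTK follows; the project's exact token handling may differ.

```python
# Sketch of the pre-processing steps A-D above (lowercasing, punctuation
# removal, stop-word removal, lemmatization) using NLTK.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
KEEP = {"#", "@"}  # kept for Twitter hashtags and mentions, as noted above
PUNCT = "".join(c for c in string.punctuation if c not in KEEP)

def clean(text):
    text = text.lower()                                  # A) lower case
    text = text.translate(str.maketrans("", "", PUNCT))  # B) punctuation
    words = [w for w in text.split() if w not in STOP]   # C) stop words
    return " ".join(LEMMATIZER.lemmatize(w) for w in words)  # D) lemmatize

print(clean("The foxes were running quickly!"))  # -> "fox running quickly"
```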
3.2.3 Feature Extraction
Features in machine learning are basically numerical attributes on which anyone can
perform mathematical operations such as matrix factorization, dot product etc. But in many
cases a dataset contains numerical values, string values, character values, categorical values,
or connections (one user connected to another user). Conversion of these types of features
into numerical features is called featurization.
In this project, feature extraction is done using CountVectorizer and the Term
Frequency-Inverse Document Frequency (TF-IDF) Transformer.
A) Tokenization
Tokenization refers to splitting up a larger body of text into smaller lines or words, or even
creating words for a non-English language. The various tokenization functions are built into
the nltk module itself. We have used the Regexp Tokenizer, which extracts tokens either by
using the provided regex pattern to split the text (default) or by repeatedly matching the
regex (if gaps are false). Fig. 3.5 shows the tokenized words of our dataset. Optional
parameters also allow filtering tokens by a minimum length. It returns an array of strings
that can be empty.
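For example, a regex tokenizer that repeatedly matches word characters (the pattern is illustrative, not necessarily the one used in this project):

```python
# Example of regex-based tokenization with NLTK's RegexpTokenizer.
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # repeatedly match word characters
print(tokenizer.tokenize("Fake news detection, using NLP!"))
# ['Fake', 'news', 'detection', 'using', 'NLP']
```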
B) CountVectorizer
The bag-of-words model is a way of representing text data when modeling text with
machine learning algorithms. The bag-of-words model is simple to understand and
implement, and has seen great success in problems such as language modeling and
document classification. A bag-of-words is a representation of text that describes the
occurrence of words within a document. It involves two things: a vocabulary of known
words, and a measure of the presence of those known words.
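A toy illustration of the resulting document-term matrix:

```python
# Toy illustration of the matrix produced by CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news spreads fast", "real news spreads slowly"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)        # sparse document-term matrix
print(cv.get_feature_names_out())      # learned vocabulary
print(counts.toarray())                # per-document word counts
```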
C) TF-IDF
A problem with scoring word frequency is that highly frequent words start to dominate in
the document (i.e. have a larger score), but may not contain as much “informational
content” for the model as rarer but perhaps domain-specific words.
One approach is to rescale the frequency of words by how often they appear in all
documents, so that the scores for frequent words like “the”, which are also frequent across
all documents, are penalized.
This approach to scoring is called Term Frequency – Inverse Document Frequency, or
TF-IDF for short, where:
The term frequency tf(t, d) is the proportion of times that the term t ∈ V(d) appears in the
document d: tf(t, d) = n(t, d) / Σ_{t' ∈ V(d)} n(t', d), where n(t, d) is the count of term t in d
and the vocabulary V(d) is constructed from the document d. Thus, if a term t' does not
appear in a document d', the term frequency tf(t', d') is zero. The idea of the term frequency
is essentially the same as CountVectorizer.
Given a document collection D, the inverse document frequency idf(t, D) = log(N / df(t, D))
is the log of the number of documents N divided by df(t, D), the number of documents
d ∈ D containing the term t. As a result, common words in D will have a low idf score,
while infrequent words will have a high idf. Thus, TF-IDF will be very likely to separate
fake news, which often contains less common (even ungrammatical) words, from real news,
which usually consists of common words.
In summary, the TF-IDF score w(t, d) = tf(t, d) · idf(t, D) for a word increases with its
count, but is counteracted if the word appears in too many documents.
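Continuing the toy example from the CountVectorizer sketch, TfidfTransformer re-weights the count matrix exactly as described:

```python
# TfidfTransformer re-weighting the bag-of-words count matrix.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["fake news spreads fast", "real news spreads slowly"]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
# Words shared by both documents ("news", "spreads") receive lower weights
# than words unique to one document ("fake", "fast", "real", "slowly").
print(tfidf.toarray().round(2))
```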
Fig. 3.7 TF-IDF Output
3.2.4 Classifiers
A) Naïve Bayes
Naive Bayes classifiers are probabilistic classifiers (a probabilistic classifier is a classifier
that can predict the category of a given observation based on the observation’s probability
distribution over the set of categories) which apply Bayes’ theorem in their decision rule,
assuming strong independence between features. Naive Bayes classifiers are useful for text
classification problems, because they can be trained efficiently in the supervised learning
setting.
Given a class variable (label) y and a vector of features x = (x1, x2, ..., xn), where n is the
number of features, the probability of y given the feature vector x can be calculated using
the formula:

P(y | x1, ..., xn) = P(y) · P(x1 | y) · ... · P(xn | y) / P(x1, ..., xn)

Since the value of P(x1, ..., xn) is the same for all class variables (labels), the predicted
class is the one with the maximum probability among those calculated for all class
variables:

ŷ = argmax_y P(y) · Π_i P(xi | y)
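A sketch of fitting such a classifier on the TF-IDF features follows; the report does not state the exact variant, so MultinomialNB, the usual choice for count-based text features, is assumed, with X_train etc. as in the earlier pipeline sketch.

```python
# Sketch: assumed MultinomialNB on TF-IDF features from the pipeline sketch.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

nb = MultinomialNB().fit(X_train, y_train)
print(confusion_matrix(y_test, nb.predict(X_test)))
```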
B) Logistic Regression
This refers to a regression model that is used for classification. The method is widely used
for binary classification problems and can also be extended to multi-class classification
problems. Here, the dependent variable is categorical: y ∈ {0, 1}. A binary dependent
variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick, etc.
In this case, you model the probability that the output y is 1 or 0. This is called the sigmoid
probability (σ). If σ(θᵀx) > 0.5, set y = 1, else set y = 0. Unlike Linear Regression (and its
Normal Equation solution), there is no closed-form solution for finding the optimal weights
of Logistic Regression. Instead, you must solve this with maximum likelihood estimation (a
probability model to detect the maximum likelihood of something happening). It can be
used to calculate the probability of a given outcome in a binary model, like the probability
of being classified as sick or of passing an exam.

Sigmoid Probability
The probability in logistic regression is often represented by the sigmoid function (also
called the logistic function or the S-curve):

S(t) = 1 / (1 + e^(−t))

In this equation, t represents a data value (e.g. the number of hours studied) and S(t)
represents the corresponding probability (e.g. of passing the exam). For the same function
written as g(z) = 1 / (1 + e^(−z)):
g(z) tends toward 1 as z → ∞, and g(z) tends toward 0 as z → −∞.
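Written out in code, the sigmoid decision rule looks like this (theta and x are illustrative placeholders):

```python
# The sigmoid function and the 0.5-threshold decision rule written out.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Return 1 if sigma(theta^T x) > 0.5, else 0."""
    return int(sigmoid(theta @ x) > 0.5)

print(sigmoid(0.0))                    # 0.5 at the decision boundary
print(sigmoid(10.0), sigmoid(-10.0))   # approaches 1 and 0 at the extremes
```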
C) Decision Tree
Decision Trees (DT) can be used both for classification and regression. The advantage of
decision trees is that they require very little data preparation; they do not require feature
scaling or centering at all. They are also the fundamental components of Random Forests,
one of the most powerful ML algorithms. Unlike Random Forests and Neural Networks
(which do black-box modeling), Decision Trees are white-box models, which means that
the inner workings of these models are clearly understood. In the case of classification, the
data is segregated based on a series of questions, and any new data point is assigned to the
selected leaf node.
Start at the tree root and split the data on the feature that results in the largest information
gain (IG). This splitting procedure is then repeated in an iterative process at each child node
until the leaves are pure, meaning that the samples at each node all belong to the same class.
In practice, you can set a limit on the depth of the tree to prevent overfitting; purity is
compromised here, as the final leaves may still have some impurity.
D) Random Forest
A Random Forest (RF) is an ensemble of decision trees. The advantages outweigh its
limitations, since you do not have to worry about hyper-parameters except k, which stands
for the number of decision trees to be created from subsets of samples. RF is quite robust to
noise from the individual decision trees; hence, you need not prune individual decision
trees. The larger the number of decision trees, the more accurate the Random Forest
prediction is (this, however, comes with a higher computation cost).
Fig. 3.9 Random Forest [23]
E) K-nearest Neighbors
The K-nearest Neighbors algorithm is used to assign a data point to clusters based on a
similarity measurement, using a supervised method for classification. To write a k-NN
algorithm, we first need to choose the number k and a distance metric (k = 5 is common).
Then, find the k nearest neighbors of the sample that you want to classify and assign the
class label by majority vote.
KNN Classification
A new input point is classified into the category in which it has the greatest number of
neighbors. For example:
Fig. 3.10 KNN Classification [24]
F) Stochastic Gradient Descent (SGD)
This estimator implements regularized linear models with stochastic gradient descent
(SGD) learning: the gradient of the loss is estimated one sample at a time and the model is
updated along the way with a decreasing strength schedule (aka learning rate). SGD allows
minibatch (online/out-of-core) learning via the partial_fit method. For best results using the
default learning-rate schedule, the data should have zero mean and unit variance. This
implementation works with data represented as dense or sparse arrays of floating-point
values for the features. The model it fits can be controlled with the loss parameter; by
default, it fits a linear support vector machine (SVM). The regularizer is a penalty added to
the loss function that shrinks model parameters towards the zero vector, using either the
squared Euclidean norm (L2), the absolute norm (L1), or a combination of both (Elastic
Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is
truncated to 0.0, allowing learning of sparse models and online feature selection.
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear
classifiers and regressors under convex loss functions such as (linear) Support Vector
Machines and Logistic Regression. Even though SGD has been around in the machine
learning community for a long time, it has received a considerable amount of attention only
recently, in the context of large-scale learning. SGD has been successfully applied to
large-scale and sparse machine learning problems often encountered in text classification
and natural language processing. Given that the data is sparse, the classifiers in this module
easily scale to problems with more than 10^5 training examples and more than 10^5
features. Like other classifiers, SGD has to be fitted with two arrays: an array X of shape
(n_samples, n_features) holding the training samples, and an array y of shape (n_samples,)
holding the target values (class labels) for the training samples.
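A sketch of using SGDClassifier accordingly; hinge loss and L2 penalty are its scikit-learn defaults (i.e. a linear SVM), and X_train etc. are as in the earlier pipeline sketch.

```python
# Sketch of scikit-learn's SGDClassifier as described above.
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="hinge", penalty="l2")   # defaults: linear SVM + L2
sgd.fit(X_train, y_train)            # X: (n_samples, n_features)
print(sgd.score(X_test, y_test))     # mean accuracy on the test set

# Mini-batch / out-of-core learning is possible via partial_fit, e.g.:
# sgd.partial_fit(X_batch, y_batch, classes=["FAKE", "REAL"])
```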
Chapter – 4
Project Screenshots
The screenshots in this chapter show the user interface through which the user can check
whether an article is real or fake: the user enters the article’s title and content/text, then
clicks the ‘Check’ button.
Chapter – 5
Results
A) Naïve Bayes
Fig. 5.1 Confusion matrix: Naïve Bayes with CountVectorizer and TF-IDF vectorizer
B) Logistic Regression
Fig. 5.2 Confusion matrix: Logistic Regression with CountVectorizer and TF-IDF
vectorizer
C) SGD
Fig. 5.3 Confusion matrix: SGD with CountVectorizer and TF-IDF vectorizer
D) KNN
Fig. 5.4 Confusion matrix: KNN with CountVectorizer and TF-IDF vectorizer
E) Decision Tree
Fig. 5.5 Confusion matrix: Decision Tree with CountVectorizer and TF-IDF vectorizer
F) Random Forest
Fig. 5.6 Confusion matrix: Random Forest with CountVectorizer and TF-IDF vectorizer
5.2 Comparison
With this approach we got the highest score, 0.99, from Logistic Regression and SGD with
TF-IDF, and the lowest, 0.76, from KNN with CountVectorizer.
Chapter – 6
Conclusion
In this report we evaluated a fake news detector and came to an understanding of how the
presence of words can impact the classification of a text (an article, in this case) as FAKE or
REAL. We achieved high accuracy with most of the algorithms; in particular, Logistic
Regression and SGD give the best results, with an accuracy score of 0.99. The essential idea
behind both CountVectorizer and TF-IDF is computing a score which depends on the
frequency of the word belonging to the vocabulary. However, compared to CountVectorizer,
TF-IDF includes an extra inverse document frequency term that “penalizes” (effectively
masks) words that appear more frequently across documents; the scores represent the
importance of a word within a document.
References
[1] Zhixuan Zhou, Huankang Guan, Meghana Moorthy Bhat and Justin Hsu, “Fake News
Detection via NLP is Vulnerable to Adversarial Attacks”, January 5, 2019
[3] Bashar Al Asaad, Madalina Erascu, “A Tool for Fake News Detection”, 2018 20th
International Symposium on Symbolic and Numeric Algorithms for Scientific Computing
(SYNASC), September 2018
[4] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A
data mining perspective”, August 2017
[5] Sairamvinay Vijayaraghavan, Zhiyuan Guo, Ye Wang, John Voong, Wenda Xu, Armand
Nasseri, Jiaru Cai, Linda Li, Kevin Vuong, Eshan Wadhwa, “Fake News Detection with
Different Models”, February 2020
[6] Bourgonje, P., Schneider, J. M., and Rehm, G, “From clickbait to fake news detection”,
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets
Journalism, January 2017
[7] Shu K., Sliva A., Wang S., Tang J. and Liu H., “Fake News Detection on Social Media:
A Data Mining Perspective”, 2017
[8] Brownlee J., “How to Prepare Text Data for Machine Learning with scikit-learn”, 2015
[9] Bajaj S., “The Pope Has a New Baby Fake News Detection Using Deep Learning”, 2017
[10] J. Thorne, M. Chen, G. Myrianthous, J. Pu, X. Wang, and A. Vlachos, “Fake news
stance detection using stacked ensemble of classifiers”, Proceedings of the 2017 EMNLP
Workshop: Natural Language Processing meets Journalism, January 2017
[11] Fernando Cardoso Durier da Silva, Rafael Vieira da Costa Alves, Ana Cristina Bicharra
Garcia, “Can Machines Learn to Detect Fake News? A Survey Focused on Social Media”,
Proceedings of the 52nd Hawaii International Conference on System Sciences, January 2019
[12] Chakraborty, A., Paranjape, B., Kakarla, S., Ganguly, N., “Stop clickbait: Detecting and
preventing clickbaits in online news media”, 2016 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), August 2016
[13] Harika Kudarvalli and Dr. Jinan Fiaidhi, “Detecting Fake News using Machine
Learning Algorithms”, November 2020
[14] Subhadra Gurav, Swati Sase, Supriya Shinde, Prachi Wabale, Sumit Hirve, “Survey on
Automated System for Fake News Detection using NLP & Machine Learning Approach”,
International Research Journal of Engineering and Technology (IRJET), January 1, 2019
[15] Rutvik Vijjali, Prathyush Potluri, Siddharth Kumar, Sundeep Teki, “Two Stage
Transformer Model for COVID-19 Fake News Detection and Fact Checking”, November 26,
2020
[16] Xishuang Dong, Uboho Victor, Shanta Chowdhury, Lijun Qian, “Deep Two-path Semi-
supervised Learning for Fake News Detection”, June 10, 2019
[17] https://devopedia.org/images/article/187/4949.1560446213.png
[18] https://deepai.org/machine-learning-glossary-and-terms/n-gram
[19] https://x7r6b9v3.rocketcdn.me/wp-content/uploads/2015/11/Stop-Words.jpg.webp
[20] https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
[21] https://www.kdnuggets.com/wp-content/uploads/text-tokens-tokenization-manning.jpg
[22] https://miro.medium.com/max/875/1*WerHJ14JQAd3j8ASaVjAhw.jpeg
[23] https://en.wikipedia.org/wiki/File:Random_forest_diagram_complete.png
[24] https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG