


The Journal of Supercomputing
https://doi.org/10.1007/s11227-019-02948-w

An enhanced graph-based semi-supervised learning algorithm to detect fake users on Twitter

M. BalaAnand1 · N. Karthikeyan2 · S. Karthik3 · R. Varatharajan4 · Gunasekaran Manogaran5 · C. B. Sivaparthipan3

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
Over the past decade, social networking services (SNS) have proliferated on
the web. The nature of such sites makes identity deception easy, providing a fast
means for creating and managing identities, and then connecting with and deceiv-
ing others. Fake users are those accounts specifically created for purposes such as
stalking or abuse of another user, for slander, or for marketing. The current system
for detecting deception depends on behavioral, non-behavioral and user-generated
content (UGC) information gathered from users. Although these methods have high
detection accuracy, they cannot be implemented in databases with massive volumes
of data. To address this issue, this paper proposes an enhanced graph-based semi-
supervised learning algorithm (EGSLA) to detect fake users from a large volume
of Twitter data. The proposed method encompasses four modules: data collection,
feature extraction, classification and decision making. Data collected from Twitter
using Scrapy is utilized for the evaluation. The performance of the proposed algo-
rithm is tested with existing game theory, k-nearest neighbor (KNN), support vector
machine (SVM) and decision tree techniques. The results show that the proposed
EGSLA algorithm achieves 90.3% accuracy in spotting fake users.

Keywords  Fake user detection · Identity deception · Sock puppets · Semi-supervised learning algorithm

1 Introduction

Online social networks (OSNs) have become one of the most important ways for
people to interact with their friends, to grow a business or to obtain information. By
some estimates, up to 67% of people aged 18–29 use social networking at some level, and around 22% of the overall population uses social network websites for some purpose.
* M. BalaAnand
[email protected]
Extended author information available on the last page of the article


These days, social networks such as Twitter and Facebook are driving
different sorts of interaction, collaboration, exchange and dialogue [1].
Typically, individuals tend to use different OSNs concurrently. For example, an individual may use Facebook to connect with friends, Twitter to post news, and Foursquare for location-centered social activities. As a result, that person's social activities and associations are scattered across several networks, and the online information available on any single site is frequently incomplete. If these sites are precisely integrated, a more complete profile of the individual can be built, which in turn improves online services such as community discovery and recommendation [2].
Commonly, in a social network, users post their personal traits on their personal
pages, for example, their colleges, majors and interests. This profile data can be used
as a reason for grouping users, sharing content, and suggesting friends. However, in
reality, not every user on a social network is real [3].
Fake users are defined as any account that consistently involves automation over
an observed period, such as the use of the Twitter API or other third-party tools,
or performing actions such as automated likes, tweets, and retweets. Fake users
include marketing bots, social bots, stalker bots and malicious bots [4]. Every bot
has a unique purpose. Marketing bots tweet about a specific product or company in a predefined pattern. A social bot retweets, likes and follows in response to certain keywords (e.g., "smile"). A stalker bot revolves around one or two users. Malicious bots spread spam links [5]. A tweet is an original status and not a retweet, a retweet is a tweet
with “RT” in the text, and status is either a tweet or a retweet. Content on Twitter is
limited to whatever is contained in a tweet: text, URL, picture, and video [6].
The fundamental difficulty in detecting fake users is that they are rapidly refining
their spam strategies to stay ahead of advances in detection measures [7]. For meth-
ods that use essential features focused upon user profiles and message content, fake
users can evade detection by purchasing followers or using tools to post messages
with a similar meaning but with different words [8, 9]. These bots tend to be interconnected, forming account groups. Supporter accounts help fake users evade detection by inflating their follower counts. Moreover, a small percentage of real users, perceived as social entrepreneurs, will follow back anyone who follows them, with the specific goal of building their own reputation. Only a limited number of edges maintain reciprocal social connections between fake users and real users, yet it has been confirmed that real users follow fake users more often than anticipated [10].
Fake users registered on social networks are able to perform a variety of actions (depending on the specific social network), as follows:

• log onto the social network home page and access a personal page;
• modify personal settings using the graphical interface of the social network by providing personal information (full name, virtual communities of interest, educational institutions, list of employers, vacation destinations, areas of interest, recreational activities, and musical and other interests);
• change (create, modify or delete) the current "status" bar values on the personal page (typically containing the current activity, disposition or geolocation indicated by the user or the user's device);
• post text, photo or audio materials on the personal page of the social network;
• form or cut off friendly relations (both one-way and two-way, usually confirmed by the user);
• send personal messages to a specific social network user, which are undetectable by other users;
• comment on text, photo or audio materials published by other users by posting instant messages immediately underneath them;
• express approval or disapproval regarding other users' posts or comments by marking them positive or negative using the graphical interface of the social network community;
• finish the session in the social network as a registered user at any point, breaking the authorization session by using the "log out" soft key in the social network graphical interface.
Consider that there is access to open data aggregated on random social network
users, including personal details available on their social network accounts, registra-
tion date and time, actions performed at certain time points, a set of distributed text
messages in open access, a set of open, friendly relationships developed by social net-
work users and stored in a social database [11]. The relationship between users can be
represented in the form of graph arcs connecting the corresponding nodes of the graph,
and so these relations can similarly be described as a vertex matrix. Formal markers
can also indicate the presence of the social bot subset in the social networks. Big data
concerns assume a key role in the accurate analysis of data from OSNs. The analysis
of data from OSNs can aid communications media in various respects [12]. Some key
challenges with respect to big data analysis can be highlighted using a few examples.
Algorithms that can be used to process big data include ensemble learning in order to
improve performance by reducing computational cost, model unpredictability methods
to optimize time complexity, local learning frameworks to reduce computational com-
plexity, semi-parametric approximations for global models, deep learning to increase
chip processing ability and further lower computing hardware cost, big data processing
with MapReduce, GraphLab for batch processing, and Storm and SAMOA for stream
processing [13, 23].
In this paper, we introduce an enhanced graph-based semi-supervised learning algo-
rithm (EGSLA) to detect fake users by examining the activity of users over a broadened
time span. Intuitively, although various spammers can make their accounts appear like real user accounts at fixed time points to evade detection, it is not possible for them to control the dynamically changing set of features over an extended time period because of the high cost. Scalable streaming, which includes both batch and stream processing,
is a real-time model for scalable processing of social data for classification along with
prediction. Batch processing is used to train machine learning (ML) models, which is
not sufficient for real-time applications. The EGSLA model performs classification,
makes predictions and gives recommendations.
The remainder of this paper is structured as follows: Sect. 2 covers related works
on fake user detection in OSNs; Sect.  3 covers the proposed method (EGSLA);
Sect. 4 presents the results and analysis of the proposed EGSLA; and Sect. 5 con-
cludes the paper.

2 Related works

Boshmaf et  al. [14] introduced Integro, a scalable defense system that helps OSN
administrators to detect fake accounts by means of a large-scale user ranking sys-
tem. OSNs today are confronted with the problem of identifying fake accounts under


extremely adversarial conditions. The issue is becoming even more challenging as


these accounts have become increasingly sophisticated and adept at concealing their
activity with patterns that look like real user behavior. The authors showed that
SybilRank, the state of the art in fake account detection, was ineffective when the
fake users infiltrated the target OSN by befriending various real users. Integro, how-
ever, exhibited greater resilience to this effect by making novel use of information
about benign victim accounts that befriended fake users. The authors implemented
Integro on the standard data processing platforms Mahout and Giraph, which are
scalable and easily deployed in present-day server farms.
Escalante et  al. [15] proposed a methodology for profile-based detection of
deception along with aggressiveness in written text. Profile-based representations
utilize class term occurrence in order to deduce a non-sparse, low-dimensional, dis-
criminative representation for documents, where profiles can be further divided into
subprofiles or subclasses. Because these representations can be estimated even when
only a single term is available, they are suitable for analysis where documents have
missing or incomplete information. In light of this advantage, the authors developed
early recognition methods based on profile representations and assessed them in data
sets for distinguishing deception and aggressiveness, where the use of a minimum
amount of information was critical for prevention. The experimental results showed
that these representations could be used in a preventive way to perform detection
tasks. When classifying full-length documents, profile-based representations did not
appear to contribute to improved recognition, although in many cases they improved
the performance of alternative methods, achieving superior performance. In early
recognition, profile-based representations enhanced the performance of the reference method, chiefly on the sexual predator detection and Kaggle data sets [20]. In these
instances, utilizing profile-based representations enhanced both the jump-start and
classification performance when reading a small amount of data, which is an espe-
cially desirable feature for any early detection method. Subprofile-based represen-
tations performed better than profile-based representations in both sexual predator
detection and aggressive text identification. Profile and subprofile representations
were sensitive to the amount of data available in the training data set, in terms of
both sparseness and number of samples. This issue affects subprofile-based repre-
sentations significantly. Aggressive text classification is a more complex problem
than sexual predator detection for ML techniques. Among the classifiers that were
considered, neural networks demonstrated the most consistent performance across
data sets, achieving state-of-the-art performance. However, the early recognition
capacity of this strategy was limited.
Tsikerdekis et al. [16] noted that to use these systems rationally, one must consider three basic factors that affect their capability: the number of users on a platform, the technology infrastructure and the speed of the information. Their work categorized detection methods by approach and identified factors that will continuously influence the adequacy and capability of these procedures [21]. Improvements that can control computational costs were proposed. To facilitate real-time identity deception detection, the authors produced a taxonomy of existing deception detection approaches and proposed factors, as well as improvements, that should be considered for the continuous use of these systems in web-based social networking. These proposals require a reevaluation of detection approaches and their potential application to online networking [22]. Given the rapid advances in online networking, together with increasing data speed, it is likely that technology alone will not save the day for these detection approaches; rather, creative new methods are needed to make them practical for real-time applications.
Kuruvilla et al. [17] proposed a system for identifying multiple accounts owned
by a single user based upon verbal and nonverbal behavior. This system identified
multiple accounts maintained by a single user by utilizing both the user’s verbal and
nonverbal behavior. It distinguished nonverbal behavior by utilizing ML calcula-
tions. Wikipedia has page revision history, which contains the actions of every user,
for example, posts made by users, posts deleted/edited by users, and the time taken
to create a modification. Every user has separate correction log documents. It rec-
ognized the fake users based on the execution time of user activity and individual
writing style. This system also had high detection and prediction accuracy. However,
it did not think about the parameters for the users protection.
Gera and Singh [18] proposed a system for identifying the authenticity of each review posted on product review sites. They demonstrated a parameterized approach for recognizing questionable reviews, review spammers and their groups, considering geographical metrics alongside networking parameters [24]. The proposed hostile review spam detection framework combines a two-stage methodology for detecting opinion spammers and their groups [22, 25]. In future work, the proposed parameterized approach will be extended to produce results with good accuracy. It was found that this method would also diminish the impact of spam and thereby help to produce better rating quality for a product, which will aid many users in making better purchase decisions.
Jiang et  al. [19] proposed QuickSquad, a graph computing system on a single
server that is specific to social network graph structure optimization. The authors
first divided the graph structure data into heavy and light sets according to the out-
degree of vertices. They then stored them with different formats, processed them
with the appropriate edge-based and vertex-based updating in a two-phase process-
ing model, applied two selective scheduler strategies of different levels, and pro-
vided four cache priorities when the memory was insufficient to cache all data. They
then implemented two detection methods, dSybilRank and dCOLOR. In the experi-
mental evaluation, this approach was compared with existing approaches.

3 Proposed methodology for fake user detection in OSN

The proposed strategy is an EGSLA that uses graph-based classification techniques


to distinguish fake users from a pool of Twitter users. Here, we outline a system to automatically gather user-generated content (UGC) from Twitter and to store it in a database. The collected data are processed to extract features. The features are utilized in the EGSLA classification module, and based on the weighted graph, the data are labeled. Based on the label, the credibility of the user account is determined. This innovative fake user detection method comprises three
steps, as enumerated below and illustrated in Fig. 1.


1. Data collection
• Python web-scraping framework
2. Feature extraction
• Fraction of retweets
• Standard tweet length
• Fraction of URLs
• Average time between tweets
3. Classification and decision making
• EGSLA algorithm
Graphs are mathematical structures utilized for the modeling of pairwise relation-
ships between objects. Graph models can be classified into directed graphs and undi-
rected graphs. An undirected graph implies that there is no distinction between two
vertices related to every edge. For instance, if each vertex signifies a person at a social event and an edge between two individuals implies that they have shaken hands, then the graph is undirected. By contrast, a directed graph incorporates directional edges, with the direction shown by arrows. For instance, if each vertex signifies a web page and edges signify hyperlinks from one page to another, then the graph is directed. The most commonly
utilized graphs in computer science are weighted graphs. The fundamental compo-
nents of weighted graphs are vertex, edge and weight. In its mathematical represen-
tation, a graph is an ordered triple G = (V, E, W), where V is a set of vertices or points
or nodes, E is a collection of edges or lines or arcs that are two-component subsets
of V, and W is a set of weights of the vertices or the edges.
A graph-based model is utilized for modeling user behavior. Typically, the nodes
of such a graph are associated with individuals, and the edges connecting the nodes
are associated with the type of communication or connection between people. All fake users display similar patterns, and these parameters can be conditionally combined to detect the similarity between users.
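To make the G = (V, E, W) formulation concrete, the following minimal sketch (our own illustration, not code from the paper) stores a small directed, weighted follower graph in a plain Python dictionary; the user names and edge weights are hypothetical.

```python
# A minimal sketch (not the authors' code): a directed, weighted graph
# G = (V, E, W) stored as an adjacency dictionary. Node names and edge
# weights below are hypothetical examples.
from collections import defaultdict

class WeightedDiGraph:
    def __init__(self):
        self.adj = defaultdict(dict)   # adj[u][v] = weight of edge u -> v

    def add_edge(self, u, v, weight=1.0):
        self.adj[u][v] = weight        # e.g. similarity or interaction strength

    def vertices(self):
        verts = set(self.adj)
        for nbrs in self.adj.values():
            verts.update(nbrs)
        return verts

    def neighbors(self, u):
        return self.adj[u].items()

# Example: "alice" follows "bob" with interaction weight 0.8
g = WeightedDiGraph()
g.add_edge("alice", "bob", 0.8)
g.add_edge("bob", "carol", 0.3)
print(sorted(g.vertices()))            # ['alice', 'bob', 'carol']
print(dict(g.neighbors("alice")))      # {'bob': 0.8}
```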

Fig. 1  Proposed fake user detection system


• Fake users post the same content from multiple accounts. The set of messages posted by such a user may consist entirely of messages taken from other users. Accordingly, for any message of the ith user, an identical message appears in the social database that was published earlier by some jth user. In a similar way, a membership function for the social bot subset that relies upon weighted factors can be assembled.
• Fake users perform activities at regular intervals. Different users perform the same activities at regular, predefined time intervals.
• Fake user accounts have similar patterns. When such a typical pattern holds for the individual data of the ith and jth users, it is reasonable to assume that the registration of the ith and jth virtual accounts was mostly administered by sophisticated software. A typical example is the sequential naming of accounts ("kumar_2000", "kumar_2001", "kumar_2002", etc.), as though they were generated by a single algorithm; such regular naming patterns can serve as a rule for flagging these accounts (see the sketch after this list).
• Fake users have either links or media content in all their posts. Likewise, they utilize trending hashtags. This is done to expand their tweet reach. However, the content of the post and the hashtags or multimedia are not really related.
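As an illustration of the naming-pattern heuristic above, the following sketch (our own, with hypothetical screen names) groups accounts whose names differ only by a trailing numeric counter:

```python
# A minimal sketch (not the authors' code): group screen names that differ
# only by a trailing counter, e.g. "kumar_2000", "kumar_2001", "kumar_2002".
# The usernames below are hypothetical.
import re
from collections import defaultdict

SEQ_NAME = re.compile(r"^(?P<stem>.*?)(?P<counter>\d+)$")

def sequential_name_groups(usernames, min_group_size=3):
    groups = defaultdict(list)
    for name in usernames:
        m = SEQ_NAME.match(name)
        if m:
            groups[m.group("stem")].append(name)
    # Keep only stems with several numbered accounts, a hint of automated registration.
    return {stem: sorted(names)
            for stem, names in groups.items()
            if len(names) >= min_group_size}

users = ["kumar_2000", "kumar_2001", "kumar_2002", "alice", "bob_7"]
print(sequential_name_groups(users))
# {'kumar_': ['kumar_2000', 'kumar_2001', 'kumar_2002']}
```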

3.1 Data collection

The initial step is gathering the necessary data for the analysis. Here, 2 months of
Twitter data have been collected, and as many tweets as possible are collected from
the Twitter handles that tweet with links and trending topics. It is assumed that fake
users tweet with links in every tweet. In addition, fake users like to tweet with trend-
ing hashtags to increase their visibility. Trending topics and external links are the
fundamental data sources. In this approach, the data are collected using the Python
web-scraping framework. Web scraping is an automated method utilized to extract
large quantities of data from websites. The data on the websites are unstructured.
Web scraping aids in collecting this unstructured data and storing it in a structured
form. A crawler is implemented in the Python Scrapy framework to download the
above-mentioned parameters in Twitter. The raw data gathered by the Scrapy system
is stored in a Neo4j instance. Neo4J is a graph database that stores nodes and the
relationships between nodes. Neo4J is the most popular graph database according to
DB-Engines Ranking, and the 22nd most popular database overall. Neo4J has sev-
eral types of graphs, one of which is a directed graph, where nodes and relationships
by default are not bidirectional. Twitter relationships are of this type, where a user
can follow certain profiles in this social network without them following him. It can
scale limitlessly and can efficiently perform graph calculations (for example, “dis-
cover the nodes that are linked with a node that is linked with another node through
a specific relationship”). Finally, the Twitter data collection consists of user id, sta-
tus, followers list, following information, posting time, retweet status, tweet length
and time between tweets.
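A minimal Scrapy spider skeleton of the kind described above might look as follows (our illustration, not the authors' crawler). The start URL, CSS selectors and field names are assumptions, and real Twitter pages are dynamic, so a production crawler would need additional handling.

```python
# A minimal Scrapy spider skeleton (not the authors' crawler). The start URL,
# CSS selectors and yielded fields are assumptions for illustration only.
import scrapy

class TrendingTweetsSpider(scrapy.Spider):
    name = "trending_tweets"
    start_urls = ["https://twitter.com/search?q=%23budget2018"]  # hypothetical entry point

    def parse(self, response):
        for tweet in response.css("article"):            # assumed container selector
            yield {
                "user_id": tweet.css("[data-user-id]::attr(data-user-id)").get(),
                "text": " ".join(tweet.css("::text").getall()).strip(),
                "posted_at": tweet.css("time::attr(datetime)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()  # assumed pagination link
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```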
The data are preprocessed utilizing Apache Spark to format and standardize the raw
data to create data sets. Apache Spark is a fast computing technology; it increases the


processing speed of the application and also supports multiple languages. The preproc-
essor comprises two classes, one for connecting with the Neo4j instance and gather-
ing the data based on a Cypher query, and a second class that performs all the
formatting and normalizing. The preprocessed data contain the graph of user id, status,
follower list and following information for the users. The data were gathered periodi-
cally at 1-h intervals to extract all trending data from Twitter. The crawler utilized the
breadth-first search algorithm to crawl Twitter. A total of 21,473 users and 2,915,147
tweets were ultimately acquired.
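As a sketch of how the preprocessor's first class might pull records out of Neo4j with a Cypher query (our own illustration, not the authors' code), assuming hypothetical node labels, property names and connection settings:

```python
# A minimal sketch (not the authors' code) of the preprocessor's first step:
# pulling user/tweet records from Neo4j with a Cypher query.
# Requires the official driver:  pip install neo4j
from neo4j import GraphDatabase

CYPHER = """
MATCH (u:User)-[:POSTED]->(t:Tweet)
RETURN u.user_id AS user_id,
       t.text AS text,
       t.posted_at AS posted_at,
       t.is_retweet AS is_retweet
"""

def fetch_tweets(uri="bolt://localhost:7687", user="neo4j", password="password"):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        records = [r.data() for r in session.run(CYPHER)]
    driver.close()
    return records

# The returned list of dicts can then be formatted and normalized
# (e.g. with Apache Spark) before feature extraction.
if __name__ == "__main__":
    print(len(fetch_tweets()))
```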

3.2 Feature extraction

This section exhibits the formulation for all of the features employed for fake user
detection. A user typically posts his UGC on different accounts at similar time periods.
Such temporal distribution reveals his online action pattern. For instance, some users
like to post UGC toward the start of the day, while other users conduct their activity at
night. Such online patterns are very helpful for user identification. For a given user, one can extract the posting time from his public UGC and acquire a sequence of posting times. Similar content found on various accounts demonstrates that fake users
generally publish their data on numerous different accounts. This UGC contains a con-
siderable number of similar words or synonyms. The similarity of UGC content can be
indicated by the semantic similarity of words or the numbers of the same words.

3.2.1 Fraction of retweets

This measures the number of times the user distributed a retweet divided by the number
of tweets the user has distributed, calculated as:
\[ \mathrm{Retweet}(u) = \frac{\lvert \{\, x \mid x \in \mathrm{tweets}_u,\; x = \text{retweet} \,\} \rvert}{\lvert \mathrm{tweets}_u \rvert} \tag{1} \]
Whether a tweet is a retweet or not is decided by looking at the “retweeted status”
field of the tweet’s data when returned through the Scrapy system. If the field contains
tweet information, the tweet is treated as a retweet.

3.2.2 Average tweet length

A tweet’s text is limited to 280 characters, but it is possible that bots will post less than
that, as they could simply be advancing a URL or a hashtag. To take this into account,
the average length of the user's tweets is checked. It is the total number of characters across all the tweets the user has distributed divided by the number of tweets the user has distributed, formally:
\[ \mathrm{Length}(u) = \frac{\sum_{i=1}^{\lvert \mathrm{tweets}_u \rvert} \lvert \mathrm{tweets}_{u_i} \rvert}{\lvert \mathrm{tweets}_u \rvert} \tag{2} \]


where \(\lvert \mathrm{tweets}_{u_i} \rvert\) is the length of user u's ith tweet, estimated by the number of characters in the tweet.

3.2.3 Fraction of URLs

This measures the number of times the user published a tweet containing a Uniform
Resource Locator (URL) divided by the total number of tweets the user has published:
\[ \mathrm{URL}(u) = \frac{\lvert \{\, y \mid y \in \mathrm{tweets}_u,\; y = \text{URL} \,\} \rvert}{\lvert \mathrm{tweets}_u \rvert} \tag{3} \]
Whether a tweet contains a URL is determined by looking at the “entities” field of
the tweet returned by the application programming interface. Bots attempt to per-
suade genuine users to go to external sites that are operated by their controller. This
component allows one to distinguish bots that are attempting to promote URLs in
their tweets. The entry point of the URL is found for each tweet. It gauges the fre-
quency with which every URL occurs in the tweet.

3.2.4 Average time between tweets

It was discovered that various bots tweet in a “bursty” nature, distributing an extensive
proportion of their tweets within a short period. This behavior is measured as:

\[ \mathrm{Time}(u) = \frac{1}{\lvert \mathrm{tweets}_u \rvert - 1} \sum_{i=2}^{N} \left( t_i - t_j \right) \tag{4} \]

where ti is the time stamp of the ith tweet in user u’s timeline, tj is the time stamp
of the jth tweet in user u’s timeline, sorted chronologically in ascending order (i.e.
ti > tj).
The similarity of posting times can be expressed as the difference in time. A weight \(w^1_{im}\) is assigned to posting time \(t^1_{im}\). For two users \((u^1_i, u^2_k)\), the difference between their two sequences of posting times can be calculated to obtain a time similarity matrix

\[ M^{t}_{ik} = \{\, M_{mn} \times w^1_{im} \times w^2_{kn} \,\} \tag{5} \]

where \(M_{mn}\) indicates the difference between the mth posting time of user \(u^1_i\) and the nth posting time of user \(u^2_k\). To account for differences in users' online behavior patterns, the time similarity is computed at two granularities, date and time, yielding two related matrices \(M^{t1}_{ik}\) and \(M^{t2}_{ik}\). To represent the temporal features, the time difference distributions are extracted from \(M^{t1}_{ik}\) and \(M^{t2}_{ik}\) and combined into the feature vector \(V^{t}_{ik}\). Likewise, for two users \((u^1_i, u^2_k)\), the similarity between any two sets of UGC can be calculated to obtain a content similarity matrix \(M^{p}_{ik} = \{M_{mn}\}\).
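To make the feature definitions above concrete, here is a small sketch (our own, not the authors' code) that computes the four per-user features from a list of tweet records; the field names in the tweet dictionaries are assumptions.

```python
# A minimal sketch (not the authors' code) computing the four per-user features.
# Each tweet is assumed to be a dict with hypothetical fields:
#   "text", "is_retweet", "has_url", "timestamp" (seconds since epoch).
def user_features(tweets):
    n = len(tweets)
    if n == 0:
        return None

    retweet_fraction = sum(t["is_retweet"] for t in tweets) / n          # Eq. (1)
    average_length   = sum(len(t["text"]) for t in tweets) / n           # Eq. (2)
    url_fraction     = sum(t["has_url"] for t in tweets) / n             # Eq. (3)

    # Average time between consecutive tweets, Eq. (4); zero for a single tweet.
    times = sorted(t["timestamp"] for t in tweets)
    gaps = [b - a for a, b in zip(times, times[1:])]
    average_gap = sum(gaps) / len(gaps) if gaps else 0.0

    return {
        "retweet_fraction": retweet_fraction,
        "average_length": average_length,
        "url_fraction": url_fraction,
        "average_gap_seconds": average_gap,
    }

# Hypothetical example record set:
sample = [
    {"text": "hello world", "is_retweet": False, "has_url": False, "timestamp": 0},
    {"text": "RT check https://example.com", "is_retweet": True, "has_url": True, "timestamp": 60},
]
print(user_features(sample))
```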


3.3 Classification and decision making

The task of distinguishing true from false stories is frequently seen as a binary clas-
sification task, although, depending upon the application, it may be perceived as a
task evaluating the probability of deception. The graph-based approach depends on
user activity and their account details. This graph-based learning provides a helpful
approach for modeling data on classification problems. The reason for choosing a
graph-based approach is that some data sets are naturally represented by a graph,
which represents heterogeneous data uniformly, and is easily parallelizable, scalable
to large data and effective in practice. One can classify each account as fake or real
by recognizing the unique features of the account. A test data set is created with
plausible positive and negative sentiment scores. With regard to the sentiment score
range, for example, if it ranges from − 2 to 2, and − 2 is a negative score, then − 1
is a negative score and 1 is a positive score. To determine the similarity, tweets about a trending hashtag from 800 randomly selected users are utilized. The
extracted features are trained using an EGSLA. It uses labeled and unlabeled sam-
ples to train itself, which circumvents the cost of labeling large groups of training samples and significantly increases the number of available training samples.
Assuming a set of l labeled samples and u unlabeled samples are provided:

\[ \{(x_i, y_i)\}_{i=1}^{l} \quad \text{Labeled samples} \tag{6} \]

\[ \{(x_i)\}_{i=l+1}^{l+u} \quad \text{Unlabeled samples} \tag{7} \]

\[ X_{\mathrm{test}} = \{ x_i \}_{i=1}^{l} \quad \text{Test samples} \tag{8} \]

A binary classification problem is analyzed here, for instance, where

\[ y_i \in \{-1, 1\} \quad \text{Class label} \tag{9} \]

Consequently, the label allocation on the labeled sample set is described by the vector

\[ y_l \in \{-1, 1\}^{l} \tag{10} \]
The learning algorithm can be represented by a function f. Let \(F \in \mathbb{R}^{n}\) denote the vector of f's values on the l + u points, where n = l + u. Then F contains two sections: \(F_l \in \mathbb{R}^{l}\) for the l labeled samples and \(F_u \in \mathbb{R}^{u}\) for the u unlabeled samples. The learning algorithm is then transformed into the following optimization problem:

\[ \min_{F} \; F^{\mathsf{T}} S F + K_1\, L(F_l, y_l) + K_2\, \chi(f) \tag{11} \]
where \(K_1\) and \(K_2\) are regularization parameters. S is a matrix that captures the smoothness of the predicted samples with respect to the data's manifold structure. By minimizing the first term, smoothness is guaranteed. Regularization parameters keep the model vector from growing arbitrarily for negligible decreases in the error, and are
selected based upon the real application and for fitting interpretation of the data
structure. The characteristics of regularization parameters are to reduce the error


by fitting a function appropriately on the given training set and to avoid overfitting.
The regularization function varies over high-density regions and is given as

\[ \langle f, Lf \rangle = \langle f, (D - K) f \rangle = \sum_{i,j=1}^{n} w_{ij} \left( f_i - f_j \right)^{2} \tag{12} \]

where \(L = D - K\) is the graph Laplacian and \(w_{ij}\) denotes the weight function.
Various techniques can be employed for the selection of S and \(\chi\). For instance, the graph Laplacian matrix (the Laplacian matrix is used to compute the number of connected components of a graph and can represent the graph structure) \(S = D - K\) can be used for S, where K is the kernel matrix of the graph, expressed as

\[ K = \sum_{i} \eta_i \psi_i \psi_i^{\mathsf{T}} \tag{13} \]

where the \(\psi_i\) are the eigenvectors of the graph Laplacian, \(\eta_i \ge 0\) are the eigenvalues of the kernel matrix, \(\mathsf{T}\) denotes the transpose of the eigenvector obtained from the related kernel function, and D is the corresponding degree matrix.
Instead, the normalized graph Laplacian matrix can be engaged:

\[ S = I - D^{-\frac{1}{2}} K D^{-\frac{1}{2}} \tag{14} \]

where I is the identity matrix. Once S is determined, \(\chi(f)\) can be deduced by utilizing

\[ \chi(f) = \lVert F_u \rVert \tag{15} \]
The local and global consistency error function is

\[ E(F) = \frac{1}{2} \sum_{i,j} \sum_{c} w_{ij} \left( \frac{F_{ic}}{\sqrt{D_{ii}}} - \frac{F_{jc}}{\sqrt{D_{jj}}} \right)^{2} + \left( \frac{1}{\alpha} - 1 \right) \sum_{i} \sum_{c} \left( F_{ic} - B_{ic} \right)^{2} \tag{16} \]

where c denotes the class index over which the local and global consistency of the label score F is evaluated, \(F_{ic}\) and \(F_{jc}\) represent the label scores of nodes i and j for class c, and \(B_{ic}\) represents the initial label assignments. The label
scores are assigned for each node in the graph; the label scores differ smoothly over
the graph structure. The parameter 0 ≤ 𝛼 ≤ 1 controls the rate at which the label
information is propagated around the network. To reduce computational and memory cost, the operation uses the power method to iteratively update F:

\[ F_{t+1} = Z^{-1} \left( (1 - \alpha) B + \alpha L F_t \right) \tag{17} \]

where \(F_t\) denotes the current label score, \(F_{t+1}\) denotes the updated label score, \(Z^{-1}\) is a diagonal matrix used to normalize the rows of F, and B denotes the initial label assignments. The proposed EGSLA pseudo code is exhibited in Fig. 2.

Fig. 2  Pseudo code for EGSLA algorithm
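The update in Eq. (17) can be prototyped in a few lines of NumPy. The sketch below is ours (not the paper's implementation): it builds a symmetrically normalized affinity matrix and iterates the propagation until convergence; the toy affinity matrix and seed labels are hypothetical.

```python
# A minimal sketch (not the authors' code) of graph-based label propagation in
# the spirit of Eq. (17): F <- normalize((1 - alpha) * B + alpha * S @ F).
# The toy affinity matrix W and seed labels below are hypothetical.
import numpy as np

def propagate_labels(W, B, alpha=0.9, n_iter=100, tol=1e-6):
    """W: (n, n) symmetric non-negative affinity matrix.
    B: (n, c) initial label assignments (one-hot rows for labeled nodes, zeros otherwise)."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Symmetrically normalized affinity (the normalized Laplacian in Eq. (14) is I minus this).
    S = d_inv_sqrt @ W @ d_inv_sqrt

    F = B.copy().astype(float)
    for _ in range(n_iter):
        F_new = (1.0 - alpha) * B + alpha * (S @ F)
        F_new /= np.maximum(F_new.sum(axis=1, keepdims=True), 1e-12)  # row normalization (Z^-1)
        if np.abs(F_new - F).max() < tol:
            break
        F = F_new
    return F.argmax(axis=1)                  # predicted class per node

# Toy chain graph: node 0 labeled "real" (class 0), node 3 labeled "fake" (class 1).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
print(propagate_labels(W, B))                # e.g. [0 0 1 1]
```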

Table 1  Overview of data set

Date accounts created    December 3, 2017
Date analysis started    December 3, 2017
Date analysis ended      February 3, 2018
Users                    21,473
Tweets                   2,915,147

4 Results and analysis

The proposed approach is assessed based on real posts from Twitter. Using the pro-
posed method, 2,915,147 tweets from 21,473 users during the period from Decem-
ber 2017 to February 2018 were recovered. A review of the acquired data set is
shown in Table 1.

4.1 Performance analysis

The performance of the proposed method is examined by performing statistical test-


ing for sensitivity, accuracy and specificity. The statistical metrics of sensitivity,
accuracy and specificity can be expressed in terms of true-positive (TP), false-positive (FP), false-negative (FN) and true-negative (TN) values. Once the supervised ML models are trained, their effectiveness is assessed. The significance of each indicator is summarized by the confusion matrix in Table 2, where each column represents the instances in the predicted class, while each row represents the instances in the actual class.


To gauge how each rule applies to the records in the benchmark data set, the following standard assessment metrics are considered:

• True positive (TP) the number of fake users perceived by the rule as fake users
• True negative (TN) the number of real users perceived by the rule as real users
• False positive (FP) the number of real users perceived by the rule as fake users
• False negative (FN) the number of fake users perceived by the rule as real users

The following accuracy, precision, sensitivity, specificity and Matthews correlation coefficient metrics are used to determine the model's efficiency.

(a) Accuracy
  Accuracy is a measure of the accounts that are correctly classified:
\[ \frac{TN + TP}{TN + FP + FN + TP} \tag{18} \]
(b) Precision
  Precision is calculated as the number of correct positive predictions divided
by the total number of positive predictions. The precision calculation is given as
\[ \frac{TP}{TP + FP} \tag{19} \]
(c) Sensitivity
  Sensitivity is calculated as the number of correct positive predictions divided
by the total number of positives. The sensitivity measure is mathematically
expressed as
\[ \frac{TP}{TP + FN} \tag{20} \]
(d) Specificity
  Specificity is calculated as the number of correct negative predictions divided
by the total number of negatives. The specificity calculation is given as
\[ \frac{TN}{TN + FP} \tag{21} \]
(e) Matthews correlation coefficient (MCC)
  The estimator of the correlation between the predicted class and the real class
of the samples is characterized as:

Table 2  The confusion matrix

Actual class     Predicted class
                 Real user    Fake user
Real user        TN           FP
Fake user        FN           TP


Table 3  Accuracy of the proposed EGSLA on negative, positive and neutral labels (for 800 users)

Train sentiment    Test sentiment    Accuracy
Positive           Positive          90.3
Positive           Negative          84.1
Negative           Positive          83.4
Negative           Negative          85.9
Neutral            Neutral           87.2

Table 4  Precision of the proposed EGSLA on negative, positive and neutral labels (for 800 users)

Train sentiment    Test sentiment    Precision
Positive           Positive          92.3
Positive           Negative          82.6
Negative           Positive          84.5
Negative           Negative          88.6
Neutral            Neutral           88.9

\[ \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{22} \]

Each of the above measures captures a different aspect of the prediction quality for the samples belonging to the majority class. Accuracy measures how many samples are correctly recognized in both classes; however, it does not express whether the important class is recognized better than the other one. Also, there are cases where some prediction models perform better than others despite having lower accuracy. MCC attempts to convey in a single value the quality of a prediction, combining the other metrics. Additionally, MCC uses all four quadrants of the confusion matrix.
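For reference, the sketch below (ours, not from the paper) computes accuracy, precision, sensitivity, specificity and MCC directly from the four confusion-matrix counts; the example counts are hypothetical.

```python
# A minimal sketch (not the authors' code): evaluation metrics from a
# confusion matrix, following Eqs. (18)-(22). The counts below are hypothetical.
from math import sqrt

def evaluation_metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}

# Hypothetical counts for illustration only.
print(evaluation_metrics(tp=430, tn=290, fp=40, fn=40))
```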
The experiment was conducted for every rule considering two classes of accounts,
the fake users and the real users. To summarize the outcome of every experiment, assessment metrics based on two standard indicators, accuracy and MCC, were considered. The estimation results are shown in Tables 3 and 7, and the
estimation results for the precision, specificity and sensitivity are shown in Tables 4,
5 and 6.
The accuracy of the proposed EGSLA based on positive, negative and neutral
labels in train sentiment and test sentiment is given in Table 3. The neutral value is
87.2 in both the train and test sentiment. In the positive label, the positive value is
90.3 and the negative value is 84.1. Figure 3 shows the proposed EGSLA accuracy
on negative, positive and neutral labels.
Table 4 shows the precision of the proposed EGSLA on negative, positive, and
neutral labels for 800 users. The highest precision is 92.3, obtained for the positive test sentiment. The graphical representation of the precision value of the proposed EGSLA is shown in Fig. 4.


Table 5  Specificity of the proposed EGSLA on negative, positive and neutral labels (for 800 users)

Train sentiment    Test sentiment    Specificity
Positive           Positive          89.88
Positive           Negative          82.1
Negative           Positive          79.20
Negative           Negative          83.4
Neutral            Neutral           87

Table 6  Sensitivity of the proposed EGSLA on negative, positive and neutral labels (for 800 users)

Train sentiment    Test sentiment    Sensitivity
Positive           Positive          90.8
Positive           Negative          84.3
Negative           Positive          83.7
Negative           Negative          87.9
Neutral            Neutral           88

Fig. 3  Accuracy of proposed EGSLA on negative, positive and neutral labels

Table 5 shows the specificity of the proposed EGSLA on negative, positive and
neutral labels for 800 users. The performance is shown in train sentiment and test
sentiment phases. A pictorial representation of the specificity value is shown in
Fig. 5.
Table 6 shows the sensitivity of the proposed EGSLA on negative, positive, and
neutral labels for 800 users. The sensitivity value is 90.8 in positive phase. In the
train sentiment positive label, the positive value is 90.8 and the negative value is
84.3. In the train sentiment negative label, the negative value is 87.9. The sensitivity
value is represented as shown in Fig. 6.
Table 7 shows the MCC performance of the proposed EGSLA on negative, posi-
tive and neutral labels. In the train sentiment and test sentiment, the neutral value is
85.5. From this observation, the positive value is better than the negative value. Fig-
ure 7 shows MCC of the proposed EGSLA on negative, positive and neutral labels.


Fig. 4  Precision of proposed EGSLA on negative, positive and neutral labels

Fig. 5  Specificity of proposed EGSLA on negative, positive and neutral labels

The tweets are organized into real user and fake user groups, and based on
these two groups, the tweets are labeled for examining the transition sequences.
The tweets are characterized into five labels utilizing EGSLA. A group of 800
users is considered for assessment of #budget2018. The average values of each label for the 800 users are shown in Table 8.
The labels are given in increasing order of the sum of the values over each dimension of the feature vector of the label's centroid. Label 1 is the group whose sum over the dimensions of the centroid's feature vector is the lowest. Label 5 is the group whose sum over the dimensions of the centroid's feature vector is the greatest.


Fig. 6  Sensitivity of proposed EGSLA on negative, positive and neutral labels

Table 7  MCC measure of the proposed EGSLA on negative, positive and neutral labels (for 800 users)

Train sentiment    Test sentiment    MCC
Positive           Positive          89.4
Positive           Negative          83.2
Negative           Positive          81.8
Negative           Negative          82.7
Neutral            Neutral           85.5

Fig. 7  MCC of proposed EGSLA on negative, positive and neutral labels

4.2 Comparative analysis

To comprehensively assess the EGSLA, we compare the detection results with


those acquired by other state-of-the-art fake user recognition techniques: game the-
ory, k-nearest neighbor (KNN), support vector machine (SVM) and decision tree.
Indeed, the EGSLA detection method outperforms the other methods for all metrics,


Table 8  Average values of each label (800 users)

Label    Post       Reply    RT       Users
1        6.253      0.138    0.130    346
2        18.238     0.309    0.106    211
3        63.917     0.409    0.080    124
4        65.471     0.318    0.144    93
5        198.817    0.371    0.117    26

Table 9  Accuracy in detecting fake users

Percentage of labeled data (%)    EGSLA    Game theory    KNN (k = 10)    SVM     Decision tree
8                                 63.5     58.2           43.1            41.2    42.1
10                                70.9     67.1           44.2            43.0    43.2
15                                78.5     72.0           57.2            54.1    54.8
20                                84.4     74.9           61.1            62.1    61.2
30                                90.3     75.1           63.0            63.2    63.1

Fig. 8  Comparison of accuracy of fake user detection algorithms

achieving 89.4% MCC, 88.2% true F-measure and 89.7% false F-measure. Table 9
shows the performance comparison between different ML algorithms used to detect
the fake users on OSNs.
Figure  8 shows the accuracy of various ML algorithms compared with the
EGSLA algorithm.
In this experiment, it was found that if the percentage of labeled data for the predicted attribute is very large, the prediction accuracy can be increased by gradually increasing the percentage of labeled data. If the propensity value of an attribute is greater than 10, users having a similar value on this attribute tend to become friends more readily. The time and retweet similarity were found to be the best similarity measures for inferring this attribute.


Therefore, increasing the weight of retweet and time closeness can increase predic-
tion accuracy.

5 Conclusion

Semi-supervised learning algorithms demonstrate their advantages in various real-


life applications because of their capacity to develop patterns with great generalizability from an extremely limited labeled sample set by making use of the unlabeled samples alongside the labeled ones for training purposes. The fake user issue
is approached utilizing EGSLA. Huge quantities of feature sets are extracted by
analyzing the tweets of the users. For decision making, the extracted features are
classified and labeled. The results demonstrate that the proposed EGSLA algo-
rithm distinguishes fake users in the pool of Twitter users. The EGSLA technique
was compared with other ML techniques, including game theory, KNN, SVM, and
decision tree methods, to assess the performance of the proposed method, which
showed that 90.3% accuracy, 88.2% true F-measure and 89.7% false F-measure were
achieved using the EGSLA. This compares with accuracy of 75.1%, 63.0%, 63.2%
and 63.1% for game theory, KNN, SVM and decision tree, respectively.
In the future, the proposed method can be enhanced by the analysis of user behav-
ior patterns, and a hybrid classifier can be used in order to further improve the accu-
racy of fake account detection.

References
1. Hanna R, Rohm A, Crittenden VL (2011) We’re all connected: the power of the social media eco-
system. Bus Horiz 54(3):265–273
2. Doan A, Ramakrishnan R, Halevy AY (2011) Crowdsourcing systems on the World-Wide Web.
Commun ACM 54(4):86–96
3. Ding Y, Yan S, Zhang Y, Dai W, Dong L (2016) Predicting the attributes of social network users
using a graph-based machine learning method. Comput Commun 73:3–11
4. Krombholz K, Merkl D, Weippl E (2012) Fake identities in social media: a case study on the sus-
tainability of the Facebook business model. J Serv Sci Res 4(2):175–212
5. Chu Z, Gianvecchio S, Wang H, Jajodia S (2012) Detecting automation of Twitter accounts: are you
a human, bot, or cyborg? IEEE Trans Dependable Secure Comput 9(6):811–824
6. Gilani Z, Farahbakhsh R, Tyson G, Wang L, Crowcroft J (2017) Of bots and humans (on Twitter),
In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining 2017, pp 349–354. ACM
7. Qiang F, Feng B, Guo D, Li Q (2018) Combating the evolving spammers in online social networks.
Comput Secur 72:60–73
8. Yang C, Harkreader R, Guofei G (2013) Empirical evaluation and new design for fighting evolving
Twitter spammers. IEEE Trans Inf Forensics Secur 8(8):1280–1293
9. Aggarwal A, Rajadesingan A, Kumaraguru P (2012) PhishAri: Automatic Realtime Phishing Detec-
tion on Twitter, eCrime Researchers Summit (eCrime), 2012, pp 1–12, IEEE
10. Yan G (2013) Peri-Watchdog: hunting for hidden botnets in the periphery of online social networks.
Comput Netw 57(2):540–555
11. Drevs Y, Svodtsev A (2016) Formalization of criteria for social bots detection systems. Procedia-
Soc Behav Sci 236:9–13
12. Farasat A, Nikolaev A, Srihari NS, Blair RH (2015) Probabilistic graphical models in modern social
network analysis. Soc Netw Anal Min 5(1):5–62


13. Ramalingam D, Chinnaiah V (2017) Fake profile detection techniques in large-scale online social
networks: a comprehensive review. Comput Electr Eng 65:165–177
14. Boshmaf Y, Logothetis D, Siganos G, Lería J, Lorenzo J, Ripeanu M, Beznosov K, Halawa H
(2016) Íntegro: leveraging victim prediction for robust fake account detection in large scale OSNs.
Comput Secur 61:142–168
15. Escalante HJ, Villatoro-Tello E, Garza SE, López-Monroy AP, Montes-y-Gómez M, Villaseñor-
Pineda L (2017) Early detection of deception and aggressiveness using profile-based representa-
tions. Expert Syst Appl 89:99–111
16. Tsikerdekis M (2017) Real-time identity deception detection techniques for social media: optimiza-
tions and challenges. IEEE Internet Comput 99:1–11
17. Kuruvilla AM, Varghese S (2015) A detection system to counter identity deception in social media
applications, In: International Conference Circuit, Power and Computing Technologies (ICCPCT),
2015, pp 1–5, IEEE
18. Gera T, Singh J (2015) A parameterized approach to deal with sock puppets, In: 2015 Third Interna-
tional Conference Computer, Communication, Control and Information Technology (C3IT), pp 1–6,
IEEE
19. Jiang X, Li Q, Ma Z, Dong M, Wu J, Guo D (2018) QuickSquad: a new single-machine graph
computing framework for detecting fake accounts in large-scale social networks. Peer-to-Peer Netw
Appl 1–18
20. Yuan W, Yang M, Li H, Wang C, Wang B (2018) End-to-end learning for high-precision lane keep-
ing via multi-state model. CAAI Trans Intell Technol 3:185–190
21. Shi Q, Lam HK, Xiao B, Tsai SH (2018) Adaptive PID controller based on Q-learning algorithm.
CAAI Trans Intell Technol 3(4):235–244
22. Wang K, Zhu N, Cheng Y, Li R, Zhou T, Long X (2018) Fast feature matching based on r-nearest
k-means searching. CAAI Trans Intell Technol 3(4):198–207
23. BalaAnand M, Karthikeyan N, Karthik S (2018) Designing a framework for communal software:
based on the assessment using relation modelling. Int J Parallel Prog. https://doi.org/10.1007/s10766-018-0598-2
24. Solomon Z, Sivaparthipan CB, Punitha P, BalaAnand M, Karthikeyan N (2018) Certain investiga-
tion on power preservation in sensor networks. In: 2018 International Conference on Soft-Comput-
ing and Network Security (ICSNS), Coimbatore, India. https://doi.org/10.1109/icsns.2018.8573688
25. Sivaparthipan CB, Karthikeyan N, Karthik S (2018) Designing statistical assessment healthcare
information system for diabetics analysis using big data. Multimed Tools Appl

Publisher’s Note  Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.

Affiliations

M. BalaAnand1 · N. Karthikeyan2 · S. Karthik3 · R. Varatharajan4 · Gunasekaran Manogaran5 · C. B. Sivaparthipan3
N. Karthikeyan
[email protected]
S. Karthik
[email protected]
R. Varatharajan
[email protected]
Gunasekaran Manogaran
[email protected]
C. B. Sivaparthipan
[email protected]


1. Department of Computer Science & Engineering, V.R.S. College of Engineering & Technology, Viluppuram, India
2. Department of MCA, SNS College of Engineering, Coimbatore, India
3. Department of Computer Science & Engineering, SNS College of Technology, Coimbatore, India
4. Department of Computer Science & Engineering, Anna University, Chennai, India
5. Department of Computer Science, University of California, Davis, USA
