Machine Learning A Review On Binary Classification
International Journal of Computer Applications (0975 – 8887)
Volume 160 – No 7, February 2017
detection features. Dhanyasree P et al. [6] worked on detecting identity deception on social networking sites, where one person may create multiple accounts for malicious use, which has become a serious problem. The authors tried to detect such accounts on the basis of verbal and nonverbal behavior. Their approach comprises calculation of nonverbal variables, model testing using the Random Forest method, and identification of the time window using PSO. They found that detecting multiple accounts through nonverbal behavior is more accurate, and that the automated system for detecting multiple accounts performs well. Verbal and nonverbal behavior can also be combined for sockpuppet detection, in which binary classification separates sockpuppet from non-sockpuppet cases. M BalaaNand et al. [7] proposed a method to detect multiple accounts and fake identities on social media such as Wikipedia using nonverbal behavior (user activity and user movement). The authors worked with time-independent nonverbal behavior, using data from Wikipedia and SVM, RF and ADA techniques for binary classification. They found that Adaptive Boosting gives the best balance between recall and precision, with high accuracy. Sheetal Antony et al. [8] proposed a system that uses verbal and nonverbal behavioral patterns to detect identity deception. An admin manages each user account. The details and activities of the user are analyzed to detect deception, and the details are verified against a database. If deception is detected, the user is asked security questions. Zaher Yamak et al. [9] proposed a detection method with the following steps: data are crawled from Wikipedia, sockpuppet accounts are detected, a set of nonverbal behavior features is created, the values of the proposed features are calculated, and finally machine learning algorithms are used for classification. SVM, RF, Naive Bayes, K nearest neighbor, Bayesian Network and Adaptive Boosting are compared. The best accuracy for sockpuppet detection was given by Random Forest (99.8%) and Bayesian Network (99.6%).

Malware detection is an important issue in protecting computer systems and communication infrastructure. Anti-virus technology is a key player in tackling malware files and is based on two methods: signature-based and heuristic-based. Asaf Shabtai et al. [10] addressed several challenges, namely the file representation method, the feature selection method and the classification algorithm. Additional issues mentioned in that paper include weighting algorithms (ensembles), the imbalance problem, active learning and chronological evaluation. The authors proposed that a framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining low false positives.

Antu Mary et al. [11] proposed a method for detecting identity deception by a single user based on nonverbal behavior. Nonverbal behavior describes the activities of each user separately. Some Wikipedia users create multiple accounts and use them for various malicious purposes. The relevant activities include the number of articles generated, the number of searches done for the same articles, the number of bytes added and removed, the number of times the same spelling mistakes are made, the time taken between revisions, creating fraudulent articles, and damaging existing article text. Such deceptions cannot easily be detected by any authority. Numerous methods have been proposed that can help in detecting multiple accounts owned by the same person. Using the verbal and nonverbal behavior of a user, a sockpuppet can be detected easily within a limited amount of time.

Ashkan Sami et al. [12] provided a framework for analyzing and classifying PE files based on data mining techniques. The Windows application programming interface (API) can be used to extract knowledge describing the behavior of executables, and each API call is used as a feature. A Fisher score based feature selection process is used. The top four categories by Fisher score are File Management, Process and Thread, Console, and Registry. The dataset contained 34,820 PE files, of which 31,869 were malicious and 2,951 were benign Windows PE files. RF, NB and DT techniques were used, and Random Forest performed well. G. Ganesh Sundarkumar et al. [13] used text mining for feature selection and then mutual information to extract the most influential features. Data mining models such as decision tree (DT), neural network (NN), SVM, probabilistic neural network (PNN) and group method of data handling (GMDH) were then applied. All 59 models were compared on the basis of accuracy, sensitivity and specificity. The dataset was then balanced using oversampling and the models were tested again. After balancing, sensitivity/accuracy improved. Prasha Shrestha et al. [14] explained a malware family identification process using string information. Classifying malware into the correct family is an important task for antivirus vendors. Classification in this paper is done using term frequency and inverse document frequency (tf-idf) and prominent string extraction. To check accuracy, vendor agreement is compared with the accuracy achieved by the algorithms used. The techniques used are Exact Match: Global Vocabulary, Exact Match: Prominent Strings, Prominent Strings Set, and Absence of Prominent String. The data come from a university's malware database (1504 malware files). On the basis of these experiments, malware family files can be detected easily. Exact Match: Global Vocabulary gives the best result. Michael Bailey et al. [15] explained that anti-virus is incomplete in that it fails to detect or provide labels for malware samples. When these systems do provide labels, the labels do not have consistent meaning across families and variants within a single naming convention or across multiple vendors. Finally, they demonstrated that these systems lack conciseness in that they provide either too little or too much information about a specific piece of malware. The authors proposed a novel technique to overcome these problems: automated malware classification based on behavioral fingerprints of the malware's activity. Single-linkage hierarchical clustering is applied to compare and combine these fingerprints.

Gaston L’Huillier et al. [16] explained phishing mail classification. Phishing email fraud attempts to gain personal or sensitive information such as usernames, passwords and credit card details. Algorithms such as support vector machines, naïve Bayes and random forest are used to classify phishing emails. Classification of phishing emails is an extension of text mining. In this paper, the feature extraction methodology for phishing emails is enhanced by using latent semantic analysis features and keyword extraction techniques. SVMs, the naïve Bayes model and the logistic regression method are used in the Weka tool to improve accuracy.

Rafiqul Islam et al. [17] classified malware using static and dynamic features. Because static techniques for malware classification have some drawbacks, the work also focuses on dynamic features, which are very useful in the classification process. The static features require two kinds of information: function length frequency and printable string information. The dynamic features use API function names. SVM, DT, RF and Naive Bayes techniques are used in the WEKA tool with 10-fold cross-validation for classification. Random Forest gives the highest accuracy, with TP, FN and accuracy as the evaluation parameters.

Ali Danesh et al. [18] proposed a classifier fusion method to improve text classification. The proposed approach combines Naive Bayes, K-NN and Rocchio methods using voting algorithms and achieves a better classification rate. Experimental results show that the classification error decreases by 15%. 2000 documents from 20 different newsgroups were used for the experiment. Aytug Onan et al. [19] proposed ensemble approaches such as AdaBoost, Bagging, Dagging, Random Subspace and majority voting. A two-way ANOVA test was conducted. The experimental analysis shows that the bagging ensemble of Random Forest with the most frequent based keyword extraction method yields promising results for text classification, and that keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes. Baoxun Xu et al. [20] proposed an improved Random Forest classifier for text categorization. They considered the improved random forest method with both feature weighting and tree selection (WTRF), Breiman's Random Forest (BRF), and the random forest with only tree selection (TRF). Comparisons are based on accuracy and F-measures. All three random forest methods are compared with other widely used text categorization methods, namely support vector machines (SVM), Naive Bayes (NB) and K-nearest neighbor (KNN). M. Sivakumar et al. [21] proposed a hybrid text classification approach using KNN and SVM. The proposed SVM-KNN approach aims to reduce the impact of parameters on classification accuracy. The performance analysis shows that the accuracy of the SVM-KNN method remains optimal even for large parameter values, and that its accuracy is higher than that of the KNN method. Unlike the conventional KNN classification approach, the SVM-KNN approach is only weakly affected by the choice of parameters. Sundus Hassan et al. [22] proposed a method for text categorization in which they compared Support Vector Machine (SVM) and Naive Bayes (NB) classifiers. The baseline for the experiment was set up by removing stopwords and stemming the dataset with the Porter Stemmer. They used micro-averaged and macro-averaged F-measure. Experiments show an improvement in both micro-averaged and macro-averaged F-measure for both methods, i.e., SVM and NB.
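The voting-based classifier fusion reported for [18] and [19] can be illustrated with a minimal sketch. It only shows plain majority voting over label predictions. The function name majority_vote, the tie-breaking rule and the toy predictions are assumptions made for illustration and are not taken from the reviewed papers.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier label lists into one label per sample.

    predictions: list of lists, one inner list per base classifier,
    each holding one label per sample.
    Ties go to the earliest classifier in the list (an assumption;
    the reviewed papers do not specify a tie-breaking rule here).
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")

    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # First label, in classifier order, that reaches the top count.
        fused.append(next(v for v in votes if counts[v] == best))
    return fused


# Hypothetical outputs of three base classifiers on five documents.
nb_pred = ["spam", "ham", "spam", "ham", "ham"]
knn_pred = ["spam", "spam", "ham", "ham", "spam"]
rocchio_pred = ["ham", "spam", "spam", "ham", "ham"]

fused = majority_vote([nb_pred, knn_pred, rocchio_pred])
print(fused)
```

With an odd number of base classifiers and two classes, as in binary classification, a tie cannot occur, so the tie-breaking rule only matters for even-sized ensembles or multi-class problems.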
Table 1: Summary of related work on sockpuppet detection, multiple identity deception detection and text categorization by performing binary classification
Table 1: Continued...
M. Sivakumar et al. [2014] | Reuters-21578 R8 dataset used | KNN and SVM | Accuracy | Proposed SVM-KNN method provides high accuracy.
Sundus Hassan et al. [2000] | Dataset from 20 Newsgroup with 1000 documents | NB and SVM | Macro F-Measures and Micro F-Measures | NB gives better performance over SVM.
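The micro-averaged and macro-averaged F-measures listed for [22] differ in how per-class counts are combined. Micro-averaging pools the true positives, false positives and false negatives of all classes before computing F1, while macro-averaging computes F1 per class and takes the unweighted mean. The sketch below is a minimal illustration of that difference. The helper names and the toy labels are assumptions, not material from the reviewed paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        per_class.append((tp, fp, fn))

    # Macro: average the per-class F1 scores, every class weighted equally.
    macro = sum(f1(*counts) for counts in per_class) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    total_tp, total_fp, total_fn = (sum(col) for col in zip(*per_class))
    micro = f1(total_tp, total_fp, total_fn)
    return micro, macro


# Hypothetical labels: class "a" is more frequent than class "b".
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the macro average is lower than the micro average because the rarer class "b" is classified less well and counts as much as class "a" in the macro mean.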
IJCATM : www.ijcaonline.org