
A Deep Learning Based Static Taint Analysis Approach

Journal Pre-proofs

A Deep Learning Based Static Taint Analysis Approach for IoT Software
Vulnerability Location

Weina Niu, Xiaosong Zhang, Xiaojiang Du, Lingyuan Zhao, Rong Cao, Mohsen
Guizani

PII: S0263-2241(19)31005-X
DOI: https://doi.org/10.1016/j.measurement.2019.107139
Reference: MEASUR 107139

To appear in: Measurement

Accepted Date: 7 October 2019

Please cite this article as: W. Niu, X. Zhang, X. Du, L. Zhao, R. Cao, M. Guizani, A Deep Learning Based Static
Taint Analysis Approach for IoT Software Vulnerability Location, Measurement (2019),
doi: https://doi.org/10.1016/j.measurement.2019.107139

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover
page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will
undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing
this version to give early visibility of the article. Please note that, during the production process, errors may be
discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Ltd.


A Deep Learning Based Static Taint Analysis Approach
for IoT Software Vulnerability Location

Weina Niu^a, Xiaosong Zhang^{b,c,∗}, Xiaojiang Du^d, Lingyuan Zhao^b, Rong Cao^b, Mohsen Guizani^e

a College of Cybersecurity, Sichuan University, Chengdu, Sichuan, 610065, China
b Institute for Cyber Security, School of Computer Science and Engineering, University of
Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
c Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, 518040, China
d Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
e College of Engineering, Qatar University, Doha, 2713, Qatar

Abstract

Computer system vulnerabilities, computer viruses, and cyber attacks are rooted in software vulnerabilities. Reducing software defects and improving software reliability and security are urgent problems in software development. The core task is the discovery and location of software vulnerabilities. However, traditional approaches that rely on human experts are labor-intensive and time-consuming. Thus, some automatic detection approaches have been proposed to solve the problem, but they have a high false negative rate. In this paper, a deep learning based static taint analysis approach is proposed to automatically locate Internet of Things (IoT) software vulnerabilities, which can relieve tedious manual analysis and improve detection accuracy. Deep learning is used to detect vulnerabilities since it considers the program context. Firstly, rules are designed for selecting taints from the difference file between the source program and its patched program. Secondly, the taint propagation paths are obtained using static taint analysis. Finally, a detection model based on a two-stage Bidirectional Long Short Term Memory (BLSTM) is applied to discover and locate software vulnerabilities. The Code Gadget Database, which includes two types of vulnerabilities in C/C++ programs, buffer error vulnerability (CWE-119) and resource management error vulnerability (CWE-399), is used to evaluate the proposed approach. Experimental results show that our proposed approach achieves an accuracy of 0.9732 for CWE-119 and 0.9721 for CWE-399, which is higher than that of the other three models (the accuracy of RNN, LSTM, and BLSTM is below 0.97), and achieves a lower false negative rate and false positive rate than the other approaches.
Keywords: IoT software vulnerability location, deep learning, software patching, static taint analysis
2010 MSC: 00-01, 99-00

∗ Corresponding author
Email address: [email protected] (Xiaosong Zhang)

Preprint submitted to Journal of LaTeX Templates, August 29, 2019

1. Introduction

Over the next four years, 10 billion Internet of Things (IoT) and connected devices will be deployed worldwide, according to a report from Strategy Analytics [1]. The popularity and rapid development of IoT technology have also brought new security and privacy risks, such as IoT botnets, cryptocurrency mining, and ransomware attacks. Several papers (e.g., [2, 3, 4, 5]) have studied related security issues. IoT devices are closer to the user than traditional PCs or servers, and they hold more associated privacy and property data than traditional devices, which makes them attractive to attackers. These attacks may shorten the battery lifetime of devices in an IoT network or ruin the energy supply system, which would affect the basic functions of the devices and even cause huge economic losses [6]. For example, the notorious Mirai botnet [7], which exploits login vulnerabilities in unsecured IoT devices such as webcams and home routers, launched the largest DDoS attack known to date. Moreover, the autonomous working mechanism and the limited energy resources of IoT devices make them vulnerable to energy resources exhaustion (ERE) attacks [8, 9]. Many research studies have been carried out on this topic based on analysis of the attack methods. For example, Boubiche et al. [10] analyzed Sleep Deprivation, Barrage, Collision, and Synchronization attacks at the intersection of the physical and data link layers. Goudar et al. [11] discussed Denial-of-Sleep attacks in WSN, which were caused by manipulating network packets.
Applications, network protocols, operating systems, and cryptographic algorithms ultimately exist in the form of software. However, it is difficult to guarantee reliable and safe software development, since the design and development of computer software require intensive mental work and rich experience. Moreover, the number of software vulnerabilities registered in the Common Vulnerabilities and Exposures (CVE) [12] has continued to grow since 1999 and reached 14,714 in 2017, as shown in Figure 1. In addition, the US Department of Defense's Advanced Research Projects Agency (DARPA) has hosted the Cyber Grand Challenge (CGC) [13] since 2015 to improve the capabilities of a new generation of fully automated cyberspace defense systems. The five aspects of the CGC emphasize that all competitions are automated [13]. Therefore, fully automated methods are the solution to future cyber warfare.

Figure 1: The number of vulnerabilities registered in CVE per year (bar chart; x-axis: year, y-axis: CVE Num).

Detecting and locating software vulnerabilities is the foundation for building a cyber defense system [14]. Many software vulnerability detection methods have been proposed, which can be divided into two categories: pattern-based [15] and code similarity-based [16, 17]. The first group of methods relies on human experts to build a vulnerability feature database. Therefore, they are labor-intensive and sometimes error-prone. Moreover, they cannot discover the precise locations of vulnerabilities due to independent program representation. The second group of methods can solve this problem by representing each piece of software at an abstract level, and these methods consider contextual information as well. However, they have high false negative and false positive rates.
Deep learning has dramatically changed the way that computing devices handle human-centric content such as images, video, and audio [18]. The widespread use of IoT and Cyber-Physical Systems (CPS) in industry can benefit from the introduction of deep learning models. For example, images of production vehicles on the assembly line and their annotations are fed into a deep learning system such as AlexNet or GoogLeNet to achieve visual inspection. Deep learning can also enable next-generation applications on IoT devices, which can perform complex sensing and recognition tasks [19, 20].
However, to the best of our knowledge, it is not common to use deep learning to detect software vulnerabilities. Deep learning is mainly used for software defect prediction and related tasks, such as software language modeling [21], code clone detection [22], API learning [23], binary function boundary recognition [24], and the detection of malicious URLs, file paths, and registry keys [25], which are different from software vulnerability detection. Li et al. [26] used the deep learning model BLSTM to detect software vulnerabilities and achieved an accuracy of 0.949. Their work is also used as a comparative experiment for our proposed method.
In response to the above situation, and to reduce false positive and false negative rates, a deep learning based static taint analysis approach is proposed in this paper. The designed IoT software vulnerability location system uses a patch-based taint propagation path method and a deep learning-based vulnerability discovery and location method, which can enhance the automation level and accuracy of software vulnerability discovery and location.
Our contributions. This paper introduces a deep learning based static taint analysis approach for IoT software vulnerability location. Our three main contributions are as follows.
First, we propose three taint selection principles to determine the original taints. The first is to select the variables shared by the deleted and added lines in the diff file, the second is to select the parameters of the known vulnerability function or the ordinary function, and the third is to select the restricted variable in the if conditional statement.
Second, we propose a taint weight calculation method to select taints with high weight. A large number of initial taints are generated using the three taint selection principles, but many initial taints do not actually trigger the vulnerabilities. To further improve the accuracy of the selected taints, further taint screening is performed in combination with the frequency of real taints.
Third, we develop the deep learning-based IoT software vulnerability location system and evaluate its effectiveness using the Code Gadget Database.
Paper organization. The rest of the paper is organized as follows. Section 2 introduces related work on software vulnerability detection. Section 3 presents some preliminaries. Section 4 elaborates the proposed deep learning-based vulnerability location approach in detail. Section 5 describes our experimental evaluation and results. Section 6 concludes the paper.

2. Related work

In this section, we discuss how traditional software complexity and quality metrics (such as entropy) contribute to software vulnerability analysis, and we review studies related to software vulnerability detection and location. Some of the latest developments in related fields are also tracked.
Three parameters determine software complexity: overall complexity, input and output complexity, and rectification complexity. The indicators of overall complexity include the number of code lines, the number of functions, the number of code lines declaring functions and variables, the complexity of some key algorithms, the complexity of loops, and the depth of recursive calls. The input and output complexity contains metrics such as the global variables used by functions, the parameters of functions, and the heaps and stacks of function calls. Intuitively, the rectification complexity is the number of code lines that are annotated.
The quality indicators of the software are as follows:
1) The number of bugs in each code segment/module/time period. Coverity and Checkmarx use this indicator to judge their ability to detect vulnerabilities.
2) Code coverage (the proportion and extent to which the source code was tested). HP Fortify uses this indicator to judge its ability to detect vulnerabilities.
3) Design/development constraints (number of methods/properties in a class).
4) Software complexity.
In summary, the software complexity and quality indicators contribute to vulnerability analysis in three main ways: anticipating the limitations of vulnerability reproduction and controlling its cost, adjusting and improving the vulnerability tracking solution, and evaluating and revising the vulnerability repair strategy.
Flawfinder [27] is an open-source code analysis tool that primarily performs simple text pattern matching against a built-in database of C/C++ functions to discover well-known problems. However, Flawfinder has high false negative rates since it does not perform control flow or data flow analysis. To improve universality, the Rough Auditing Tool for Security (RATS) [28] provides a list of potential trouble spots in C, C++, Perl, PHP, and Python source code. It checks for risky built-in/library function calls according to its rules. Unfortunately, RATS has high false negative and false positive rates since it performs only a rough analysis of the source code. Moreover, manual inspection is still necessary even with the aid of RATS. To support interactive programming environments in real time, ITS4 [29] uses a parse tree generated with a context-free parser to represent the program. It breaks a non-preprocessed file into a series of lexical tokens and then matches them against the vulnerability database. However, ITS4 has high false positive rates since it cannot understand the program context.
CxSAST from Checkmarx [30] is an accurate and flexible source code analysis solution that supports all major languages. Checkmarx uses a unique lexical analysis technique and its patented CxQL query technology to perform static analysis. However, Checkmarx has high false negative rates. To improve detection accuracy and reduce cost, Coverity [31] offers integrations with key development tools and CI/CD systems. Moreover, Coverity supports multiple programming languages and frameworks. Compared to other static analysis tools, Coverity has the following characteristics: it provides deep, full path coverage accuracy, and it uses interprocedural analysis. However, it also has high false positive rates. For example, Coverity may report a risk when the pointer pNext performs a pNext++ operation without having been assigned, or after having been assigned NULL. However, if pNext is assigned in a while loop below pNext = NULL, this report can be ignored. To reduce the time and effort required, HP Fortify [32] statically analyzes the source code through its five main built-in analysis engines: data flow, semantics, structure, control flow, and configuration flow. Unfortunately, it cannot effectively locate the vulnerability. The comparison of mainstream commercial static analysis tools is shown in Table 1.
Neuhaus et al. [33] developed Vulture, which mines a vulnerability database, a version archive, and a code base, and maps past vulnerabilities to components. Vulture [33] is able to predict vulnerabilities of new components based on their imports and function calls [34]. However, the relationships captured by Vulture are still at the component level. Based on an empirical study of 3,241 Red Hat packages, Neuhaus et al. [35] used support vector machines on Red Hat dependency data to predict vulnerable packages [36]. To further reduce the vulnerability analysis granularity, Yamaguchi et al. [37] embedded code in a vector space and automatically determined API usage patterns using machine learning. However, false negatives still occur. Yamaguchi et al. [38] extracted abstract syntax trees from the source code and searched for vulnerabilities based on the idea of vulnerability extrapolation, but they cannot identify vulnerabilities automatically. Grieco et al. [15] proposed a machine-learning-based approach to discover software vulnerabilities through lightweight static and dynamic features. Unfortunately, the test prediction error is high as well. Li et al. [39] presented an automatic software vulnerability detection system, Vulnerability Pecker (VulPecker). VulPecker [39] generates the signature of the target program and then detects vulnerabilities using code-similarity algorithms. However, the effectiveness of the approach needs to be further improved because it relies on some heuristics. Kim et al. [16] proposed a scalable approach for vulnerable code clone discovery, VUDDY, which leverages function-level granularity and a length-filtering technique to reduce the number of signature comparisons. However, VUDDY [16] focuses only on discovering vulnerable code clones.

Table 1: The comparison of mainstream commercial static analysis tools.

Tool name             Checkmarx [30]         Coverity [31]        HP Fortify [32]
Platform              Windows                multi-platform       multi-platform
Programming language  multi-language         C/C++, Java          multi-language
Vulnerability types   multiple               multiple             multiple
Development agency    Checkmarx              Coverity             HP
Release time          2003                   2002                 2012
Key technology        Lexical Analysis and   SAT engine and       data flow, semantics, structure,
                      CxQL Patent Query      software DNA map     control flow, configuration flow, etc.
The academic community has also published many articles on vulnerability detection and location, which promote the development of this field. Huang et al. [40] proposed a new automatic vulnerability classification model (TFI-DNN). The proposed TFI-DNN model outperformed others in accuracy, precision, and F1-score and performed well in recall rate. It was also superior to SVM, Naive Bayes, and KNN on comprehensive evaluation indexes. Jurn et al. [41] proposed a hybrid fuzzing method based on binary complexity analysis and introduced an automatic patch technique that modifies the PLT/GOT table to translate vulnerable functions into safe functions. The experimental results showed that the proposed model performs well on open-source binaries. Spanos et al. [42] proposed a model combining text analysis and multi-target classification techniques to estimate vulnerability characteristics. They considered the vulnerability characteristics as a vector of six targets and estimated these characteristics using multi-target classification. Experimental results showed that the proposed methodology could achieve comparable results. Aakanshi et al. [43] proposed a mathematical model to predict bad smells using information theory. Bad smells were collected using a detection tool from sub-components of the Apache Abdera project, and different measures of entropy (Shannon, Rényi, and Tsallis entropy) were used to identify bad smells. The experimental results showed that all three entropy approaches are sufficient to predict bad smells in software. Madhu et al. [44] proposed bug dependency-based mathematical models by considering the summary descriptions of bugs and the comments submitted by users in terms of entropy-based measures. The models mainly followed exponential curves, S-shaped curves, or mixtures of both. However, the summary entropy and comment entropy metric-based models could be further improved by using other project data to make them more general.
In addition to the above work, other researchers have made contributions in this or similar fields. Ali Hassan et al. [45] proposed a Hybrid Adaptive Bandwidth and Power Algorithm and a Delay-tolerant Streaming Algorithm to significantly optimize power drain, battery lifetime, and standard deviation. Ali et al. [46] proposed an optimization scheme aimed at achieving quality of customer experience for the Internet of Vehicles. Abdul et al. [47] evaluated the quality of service computing in health care applications, proposed the AQCA algorithm, which is more suitable for quality of service computing, and analyzed the impact of each QoE parameter on medical data processing by estimating QoE perception. Sandeep et al. [48] improved and managed M-QoS by prioritizing telemedicine services using a decisive and intelligent tool called the Analytic Hierarchy Process (AHP). Hina et al. [49] presented a detailed survey of how 5G, with the help of IoT, has revolutionized medical healthcare by enhancing the quality and efficiency of wearable devices. They also proposed a state-of-the-art 5G-based sensor node architecture for comfortable health monitoring of patients. Ali Hassan et al. [50] proposed a novel framework based on joint transmission power control (TPC) and duty-cycle adaptation, comprising an adaptive energy-efficient transmission power control (AETPC) algorithm, a feedback control-based duty-cycle algorithm, and system-level battery and energy harvesting models, to minimize charge and energy depletion of wearable devices. Ali Hassan et al. [51] proposed a forward central dynamic and available approach (FCDAA), a system-level battery model, and a data reliability model for edge AI-based IoT devices over a hybrid TPC and duty-cycle network to use resources appropriately. Ali Hassan et al. [52] proposed a novel energy-efficient adaptive power control (APC) algorithm to overcome the problem that constant transmission power and typical conventional transmission power control (TPC) methods are not suitable for WBAN because of the large temporal variations in the wireless channel. Muhammad et al. [53] put forward Wireless Body Sensor Networks, which provide ways to monitor individual activity in a variety of scenarios. Yezhi Lin et al. [54] developed an efficient, simple, and unified way to increase the potential speed of multicore systems. Chandio et al. [55] proposed a system named integration of inter-connectivity of information system (i3), based on service-oriented architecture (SOA) with web services, to monitor and exchange students' information. Lodro et al. [56] proposed channel modeling of 5G mmWave cellular communication for an urban microcell, which was simulated in the LOS condition at an operating frequency of 28 GHz with multiple antenna elements at the transmitter and receiver. Different parameters affecting the channel were considered in the simulation using the NYUSIM software.

3. Preliminaries

This section introduces the preliminaries of our deep-learning-based IoT software vulnerability location approach.

3.1. Static taint analysis


Static taint propagation analysis [57] refers to determining whether data can be transmitted from a taint source to a taint sink (convergence point) by analyzing data dependencies between program variables without running or modifying the code.
The object of static taint analysis is usually the source code or an intermediate representation of the program. The workflow of static taint analysis is as follows: first, a call graph (CG) is constructed according to the function call relationships in the program; then, specific data flow propagation analysis is performed within or between functions according to different program characteristics. Common explicit-flow taint propagation methods include direct assignment propagation, propagation through function (procedure) calls, and propagation through aliases (pointers).
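As a rough illustration of explicit-flow propagation through direct assignment, the following Python sketch tracks taint over a list of simplified assignment statements. The statement format (`target = expression`), the identifier pattern, and the function name `propagate_taint` are assumptions made for this example only. It ignores function calls, aliases, and control flow, and it is not the analysis implemented in this paper.

```python
import re

# Hypothetical, simplified identifier pattern (assumption for this sketch).
IDENT = re.compile(r"[A-Za-z_]\w*")

def propagate_taint(statements, sources):
    """Track taint through simple 'target = expression' statements.

    Returns the set of tainted variables after the last statement and the
    1-based line numbers at which taint was propagated.
    """
    tainted = set(sources)
    path = []
    for lineno, stmt in enumerate(statements, start=1):
        if "=" not in stmt:
            continue
        target, expr = (part.strip() for part in stmt.split("=", 1))
        used = set(IDENT.findall(expr))
        if used & tainted:
            # The right-hand side reads tainted data: the target becomes tainted.
            tainted.add(target)
            path.append(lineno)
        elif target in tainted:
            # The target is overwritten with untainted data.
            tainted.discard(target)
    return tainted, path

statements = [
    "n = user_len",
    "size = n * 4",
    "k = 10",
    "buf_len = size + k",
    "n = 0",
]
tainted, path = propagate_taint(statements, {"user_len"})
```

Here `user_len` plays the role of the taint source, and `path` records the lines along which the taint flows.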
In recent years, researchers have developed a number of tools to conduct taint analysis on other languages such as Java, but only a few tools are available for C/C++. Some well-known open-source tools, such as Saint [58], proposed in 2015, and Tanalysis [59], built as a plugin for the Frama-C platform, are no longer available. Some tools are still available, such as that of Marcelo [60], which modifies the Clang static analyzer to perform static taint analysis, but Clang has the disadvantages that it cannot analyze multiple source files and does not have access to LLVM, which could help with the analysis. The lack of an extensible and configurable static taint analysis tool remains an open opportunity that academia has overlooked.

3.2. CNN-BLSTM

Neural networks [61] have achieved great success in image processing [62], speech recognition, and NLP [63], but they are rarely used in vulnerability detection. This means that many neural network models may not be suitable for vulnerability detection. Therefore, some principles are needed to guide the selection of neural network models for vulnerability detection. Whether a vulnerability is present in the code is determined by the context, so a neural network that is able to handle context can be used for vulnerability detection [64]. The neural networks used for NLP also need to consider context. It is therefore feasible to use a deep learning model to conduct software vulnerability discovery and location [26]. The structure of the convolutional neural network includes a convolution layer and a pooling layer.
Convolution layer: the input to each node in the convolutional layer is just a small block of the previous layer of the neural network, selected by what is usually called the kernel. The convolutional layer attempts to further analyze each small block in the neural network to obtain more abstract features. Let $w^{i}_{x,y,z}$ denote the weight of filter input node $(x, y, z)$ for the $i$th node in the output unit node matrix, $a_{x,y,z}$ denote the value of filter input node $(x, y, z)$, and $b^{i}$ denote the bias term corresponding to the $i$th output node. Then the value $g(i)$ of the $i$th node in the unit matrix is defined by formula (1), where $f$ is the activation function:

$$ g(i) = f\left( \sum_{x=1}^{a} \sum_{y=1}^{b} \sum_{z=1}^{c} a_{x,y,z} \times w^{i}_{x,y,z} + b^{i} \right) \qquad (1) $$
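To make formula (1) concrete, the following Python sketch computes the value of a single output node from a small input block and one filter. The choice of ReLU as the activation function f, the function name `conv_node`, and the numeric values are assumptions for illustration. The paper does not specify them here.

```python
def relu(v):
    """ReLU activation, used here only as an assumed example of f."""
    return max(0.0, v)

def conv_node(a, w, bias, f=relu):
    """Compute g(i) = f(sum over x, y, z of a[x][y][z] * w[x][y][z] + bias)."""
    total = 0.0
    for x in range(len(a)):
        for y in range(len(a[x])):
            for z in range(len(a[x][y])):
                total += a[x][y][z] * w[x][y][z]
    return f(total + bias)

# A 2 x 2 x 1 input block and the filter weights for output node i
# (hypothetical values).
a = [[[1.0], [2.0]], [[3.0], [4.0]]]
w_i = [[[0.5], [-1.0]], [[0.0], [1.0]]]
b_i = 0.5
g_i = conv_node(a, w_i, b_i)
```

With these values the weighted sum is 2.5, so after adding the bias of 0.5 the node value is 3.0.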

Pooling layer: it can effectively reduce the size of the matrix, thus reducing the number of parameters in the final fully connected layer. Using the pooling layer can both speed up the calculation and help prevent over-fitting. The calculation performed by the pooling filter is not a weighted sum of nodes, but a simpler maximum or average operation.
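The maximum and average operations can be sketched in a few lines of Python. The non-overlapping square window and the names `pool2d` and `mean` are assumptions chosen for this example.

```python
def pool2d(matrix, size, op=max):
    """Apply non-overlapping size x size pooling; op reduces each window."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [matrix[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            out_row.append(op(window))
        pooled.append(out_row)
    return pooled

def mean(values):
    """Average of a window, used for average pooling."""
    return sum(values) / len(values)

m = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 1, 5, 6],
     [2, 2, 7, 8]]
max_pooled = pool2d(m, 2)            # maximum of each 2 x 2 window
avg_pooled = pool2d(m, 2, op=mean)   # average of each 2 x 2 window
```

In both cases the 4 × 4 matrix shrinks to 2 × 2, which is the size reduction described above.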
The Recurrent Neural Network (RNN) [65] is used to mine the time series information in the data and the deep representation of semantic information. It is often used in speech recognition, language modeling, machine translation, and timing analysis. An RNN differs from an ordinary fully connected neural network in that the nodes between the hidden layers of the RNN are connected. The input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
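The dependence of the hidden layer on its own previous output can be illustrated with a one-unit recurrent layer. This is a minimal sketch: the scalar weights, the tanh activation, and the function name `rnn_forward` are assumptions, not the configuration used in this paper.

```python
import math

def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent layer: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    h = h0
    states = []
    for x in xs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states stay non-zero because
# each hidden state feeds into the next one.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```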
Long Short Term Memory (LSTM) [66] is a special type of RNN that can learn long-term dependency information. Unlike a standard RNN, whose repeating module contains a single layer, the LSTM repeating module has four layers that interact in a very special way. LSTM is a special network structure with three "gate" structures. Bidirectional Long Short Term Memory (BLSTM) [67] uses a bidirectional LSTM structure, taking into account the impact of context in both directions.

Figure 2: A brief review of the CNN-BLSTM neural network (blocks shown: convolution layer, BLSTM, FC, softmax; output: whether the IoT software is vulnerable and the vulnerability type).

In addition, the RNN model has a vanishing gradient problem [68], which may lead to ineffective model training. The vanishing gradient problem is addressed by introducing memory cells into RNNs (as in LSTM and GRU), but LSTM is unidirectional, which is not enough to detect software vulnerabilities, because function parameters may be affected by both preceding and following statements. Therefore, it is feasible to use the BLSTM model to conduct software vulnerability discovery and location. To further improve detection accuracy, a CNN-BLSTM neural network is applied in this paper. The input data size is 100 × 150. After the convolution layer and the pooling layer, the data size is 9 × 128. After the LSTM layer, the data size is 64. Finally, the data is classified by the fully connected layer. Figure 2 shows the structure of the CNN-BLSTM neural network, which has a convolution layer, a max pooling layer, a number of BLSTM layers, a fully connected (FC) layer, and a softmax layer.
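The data sizes quoted above can be checked with simple shape arithmetic. The sketch below assumes a one-dimensional convolution over the 100 positions without padding. The kernel size, the pooling size, and the number of units per LSTM direction are hypothetical values chosen only because they reproduce the stated sizes. The paper does not give them in this section.

```python
def conv1d_out_len(n, kernel, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (n - kernel) // stride + 1

def pool1d_out_len(n, pool):
    """Output length of non-overlapping pooling (incomplete last window dropped)."""
    return n // pool

seq_len, embed_dim = 100, 150          # input size stated in the text
kernel, filters, pool = 10, 128, 10    # ASSUMED values, not taken from the paper
units_per_direction = 32               # ASSUMED; both directions are concatenated

after_conv = (conv1d_out_len(seq_len, kernel), filters)
after_pool = (pool1d_out_len(after_conv[0], pool), filters)
after_blstm = 2 * units_per_direction
```

Under these assumptions the convolution yields 91 × 128, the pooling yields 9 × 128, and the bidirectional layer yields a vector of size 64. Other hyperparameter combinations can produce the same sizes.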

4. Deep-learning-based IoT software vulnerability location approach

This section introduces the proposed deep-learning-based IoT software vulnerability location approach. As highlighted in Figure 3, the proposed approach includes four components: patching comparison, static taint analysis, taint propagation paths transforming, and IoT software vulnerability location. Figure 4 shows the specific technical process. There are six steps in our patching-based approach: using difflib to obtain the Diff file between the source code and the patched code; labeling taints according to the taint selection principles; generating taint propagation paths using static taint propagation; transforming taint propagation paths into symbolic representations; encoding the symbolic representations into vectors; and applying the trained CNN-BLSTM neural network to locate two common types of IoT software vulnerability.

4.1. Patching comparison


Figure 3: Overview of our proposed approach (flow: source code (*.cpp/*.c) -> patching comparison -> static taint analysis, guided by the taint selection principles -> taint propagation paths transforming -> IoT software vulnerability location by CNN-BLSTM -> output: lines where the two types of common vulnerabilities, CWE-119 and CWE-399, appear).

Figure 4: Technique flow chart of our proposed approach (flow: training programs, i.e., source programs and patched programs -> patching comparison: using difflib to obtain the Diff file between the source program and the patched program -> static taint analysis: Step 1, labelling taints according to the taint selection principles; Step 2, generating taint propagation paths using static taint analysis -> taint propagation paths transforming: Step 1, transforming taint propagation paths into symbolic representations; Step 2, encoding the symbolic representations into vectors, with a Word2vec embedding matrix -> IoT software vulnerability location: CNN-BLSTM neural network with fine-tuned model parameters -> output: lines where CWE-119 and CWE-399 vulnerabilities appear).

The patching comparison obtains a Diff file with difference marks by comparing the source code of the vulnerable software with that of the patched software. The Diff file is the input to the static taint analysis module. Existing source code comparison tools include DiffMerge [69], Textdiff [70], Meld [71], Git diff [72], and so on. Most of these open-source tools have a graphical interface, so calling them directly from our code would require time-consuming manual operation. In addition, the source code of most of these tools has runtime errors, which makes them hard to use. Since the difflib package of Python provides the same functionality as these tools, it is used directly in this paper to obtain the Diff file with difference marks.
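A minimal sketch of this step with Python's standard difflib module is shown below. The two C snippets, the file names, and the choice of the unified diff format are assumptions made for illustration. The paper does not state which difflib function or output format it uses.

```python
import difflib

# Hypothetical vulnerable and patched versions of the same function.
vulnerable = [
    "void copy(char *src) {",
    "    char buf[16];",
    "    strcpy(buf, src);",
    "}",
]
patched = [
    "void copy(char *src) {",
    "    char buf[16];",
    "    strncpy(buf, src, sizeof(buf) - 1);",
    "    buf[sizeof(buf) - 1] = '\\0';",
    "}",
]

diff = list(difflib.unified_diff(vulnerable, patched,
                                 fromfile="vulnerable.c",
                                 tofile="patched.c",
                                 lineterm=""))

# Skip the two file header lines, then split the marked lines.
body = diff[2:]
deleted = [line[1:] for line in body if line.startswith("-")]
added = [line[1:] for line in body if line.startswith("+")]
```

The `deleted` and `added` lists correspond to the lines marked `-` and `+` in the diff, which is the information the taint selection step works on.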

4.2. Static taint analysis

As highlighted in Figure 5, the static taint analysis module generates taint propagation paths by performing lexical and syntactic analysis on the Diff file. The module performs two tasks: (1) determining the taints according to the taint selection principles (see below); (2) generating the taint propagation paths from the tainted sources.
The principles of taint selection are as follows:

1. Selecting the variables common to the deleted and added rows;
2. Selecting the parameters of a known vulnerability function or an ordinary function;
3. Selecting the variable constrained in an if conditional statement.
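As an illustration of the first principle, the following minimal sketch intersects the identifiers found in the deleted rows with those found in the added rows. The keyword set, the regular expression, and the function names are assumptions made for this example; a real implementation would rely on proper lexical analysis rather than a regular expression.

```python
import re

# Small, non-exhaustive set of C keywords and type names to exclude (assumption).
C_KEYWORDS = {
    "if", "else", "for", "while", "do", "return", "sizeof", "break", "continue",
    "int", "char", "void", "long", "short", "unsigned", "float", "double",
}

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def identifiers(lines):
    """Collect identifiers that are neither keywords nor called function names."""
    found = set()
    for line in lines:
        for match in IDENTIFIER.finditer(line):
            name = match.group()
            rest = line[match.end():].lstrip()
            # A name directly followed by "(" is treated as a function call.
            if name in C_KEYWORDS or rest.startswith("("):
                continue
            found.add(name)
    return found


def common_variables(deleted, added):
    """Principle 1: variables shared by the deleted and the added rows."""
    return identifiers(deleted) & identifiers(added)


deleted = ["    strcpy(buf, src);"]
added = ["    strncpy(buf, src, sizeof(buf) - 1);", "    buf[sizeof(buf) - 1] = '\\0';"]
candidates = common_variables(deleted, added)
```

Here the candidate taints are buf and src, the two variables that occur on both sides of the patch.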

According to the principles of taint selection, candidate taints are initially selected, but many of these initial taints do not actually trigger vulnerabilities. To further improve the accuracy of the selected taints, we then rank the taints using the following taint weight calculation method: 1) if a taint is a parameter of a function associated with CWE-119 or CWE-399 vulnerabilities, its weight is 1; 2) if a taint is a parameter of an ordinary function, its weight is 2; 3) if a taint is constrained by an if statement, its weight is 3; 4) otherwise, its weight is 4. Finally, we generate taint propagation paths based on static taint analysis and the line number at which the taint first appears.
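The weight rules above can be sketched as a small function. The set of sensitive functions below is a hypothetical placeholder (the paper takes its list from VulDeePecker [26]), and sorting with weight 1 first is an assumption, since the text only states that taints are ranked.

```python
# Hypothetical examples of functions associated with CWE-119/CWE-399.
SENSITIVE_FUNCTIONS = {"strcpy", "memcpy", "malloc", "free"}


def taint_weight(call_name=None, in_if_condition=False):
    """Return the weight of a taint following the four rules in the text.

    call_name: name of the function the taint is passed to, or None.
    in_if_condition: True if the taint is constrained by an if statement.
    """
    if call_name in SENSITIVE_FUNCTIONS:
        return 1            # parameter of a vulnerability-related function
    if call_name is not None:
        return 2            # parameter of an ordinary function
    if in_if_condition:
        return 3            # constrained by an if statement
    return 4                # everything else


# (variable, function it is passed to, constrained by an if statement)
taints = [
    ("len", None, True),
    ("buf", "strcpy", False),
    ("tmp", None, False),
    ("msg", "log_message", False),
]

# Assumption: rank with weight 1 first.
ranked = sorted(taints, key=lambda t: taint_weight(t[1], t[2]))
ranked_names = [t[0] for t in ranked]
```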

Figure 5: The workflow of the static taint analysis module. (The source program and the patched program yield the Diff file; the taint selection principles produce the initial taints; the taint weight calculation method produces the final taints; static taint analysis, together with the line number at which each taint first appears, produces the taint propagation path code blocks.)

4.3. Taint propagation paths transforming

There are two steps in this module. The first step transforms taint propagation paths into symbolic representations, and the second step encodes the symbolic representations into vectors. The output of this module is the input of the IoT software vulnerability location module.

4.3.1. Transforming symbolic representation


Processing a code segment is treated as analogous to a Natural Language Processing (NLP) problem, so the training data must first be segmented into words. The purpose of word segmentation is to convert the text into a sequence of words that the model can read. Generating a word sequence requires two operations: (1) using the dataset to build a word dictionary; (2) replacing each word in the original code snippet with the number assigned to that word in the dictionary. The detailed steps are as follows.
Step 1: after filtering special symbols such as !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n from all the files, the text in the dataset is segmented and all words are extracted.
Step 2: build the tokenizer: the frequency of every word in the dataset is counted, and each word is assigned a number according to its frequency rank. Finally, a dictionary (word to frequency-rank number) is formed.
Step 3: use the generated dictionary to replace the words in all code snippets in the dataset with their numbers for model input. Take print("hello world") as an example: after filtering special symbols, it becomes print hello world. The model input is [2, 40, 66], which is obtained from the dictionary {"int": 1, "print": 2, ..., "hello": 40, ..., "world": 66, ...}.
The purpose of this module is to transform each code block from text into a numeric representation and pad it. keras.preprocessing.text.Tokenizer is used to segment the input data, which is then represented as a sequence of indices. The tokenizer works as follows: it counts the number of times each word appears in the programs and sorts the words by count in descending order; the most frequent word is numbered 1, the next 2, and so on. A numerical representation of each dataset sample is obtained by segmenting it with keras.preprocessing.text.text_to_word_sequence. Special symbols such as !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n are ignored during segmentation.
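The three steps can be sketched in plain Python. This is an approximation of the behaviour described above, not the Keras implementation: it does not lowercase words, and the two snippets, the helper names, and the resulting numbers are illustrative only.

```python
from collections import Counter

# Characters treated as separators (same set as the filter string in the text).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: " " for ch in FILTERS})


def split_words(text):
    """Step 1: replace special symbols with spaces and split into words."""
    return text.translate(_TABLE).split()


def build_dictionary(snippets):
    """Step 2: number words by descending frequency, starting at 1."""
    counts = Counter(word for s in snippets for word in split_words(s))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(snippet, dictionary):
    """Step 3: replace each known word with its number."""
    return [dictionary[w] for w in split_words(snippet) if w in dictionary]


# Hypothetical miniature dataset.
snippets = ['int x; int y; print("hello world")', 'int z; print("hello")']
dictionary = build_dictionary(snippets)
encoded = encode('print("hello world")', dictionary)
```

In this toy dataset int is the most frequent word and therefore receives the number 1, mirroring the {"int": 1, "print": 2, ...} dictionary in the example above.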

4.3.2. Encoding vectors


The goal of this module is to generate a set of word vectors using Word2Vec from the gensim package and to embed the processed data in vector form. Word2Vec is an efficient tool for representing words as real-valued vectors and includes two models: CBOW and Skip-gram. CBOW is used in this paper; it uses a distributed representation to map each word into a K-dimensional real-valued vector. The implementation of Word2Vec is essentially a two-layer deep neural network (DNN) that predicts adjacent words from input words.
Word2Vec is a model for learning semantic knowledge in an unsupervised way from a large text corpus and is widely used in NLP. By learning from text, Word2Vec uses word vectors to represent the semantic information of words, so that semantically similar words lie close to each other in the embedding space. An embedding is a mapping that takes words from the original space to a new multidimensional space; in other words, the space in which the original words are located is embedded in a new space. The working principle of embedding is shown in Figure 6. Because vectors encoded with the one-hot method are high-dimensional and sparse, embedding is chosen to solve this problem. Suppose we encounter a dictionary containing 2,000 words in NLP. With one-hot encoding, each word is represented by a vector of 2,000 integers, which not only takes up a lot of storage space but also cannot express the similarity between words. An embedding instead expresses the word "deep" with a fixed-length vector such as [32, 02, 48, 21, 56, 15]. In practice, a word is not replaced by its one-hot vector; instead, the index of the word is used to look up the corresponding vector in the embedding matrix. As an example, Figure 6 shows the computation for the training sample (input word: "ants", output word: "car").

Figure 6: The working principle of embedding [73]. (The 300-feature word vector for "ants" is multiplied by the 300-feature output weights for "car", and a softmax gives the probability that a word randomly picked near "ants" is "car".)
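The relation between one-hot encoding, the embedding lookup, and the softmax output can be illustrated with a toy example. The three-word vocabulary, the two-feature vectors (instead of 300), and all numeric values below are arbitrary assumptions chosen only to keep the sketch small.

```python
import math


def one_hot(index, size):
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(len(matrix[0]))]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = {"ants": 0, "car": 1, "deep": 2}

# Embedding matrix: one row (word vector) per word; arbitrary toy values.
embedding = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
# Output weights: one column per word in the vocabulary; arbitrary toy values.
output_weights = [[0.1, 0.6, -0.2], [0.3, -0.4, 0.5]]

# Multiplying the one-hot vector by the embedding matrix selects a single row,
# so an index lookup gives the same word vector without the large sparse product.
hidden_via_one_hot = vec_times_matrix(one_hot(vocab["ants"], len(vocab)), embedding)
hidden_via_lookup = embedding[vocab["ants"]]

# Probability that a word picked near "ants" is "car", as in Figure 6.
probabilities = softmax(vec_times_matrix(hidden_via_lookup, output_weights))
p_car = probabilities[vocab["car"]]
```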

4.4. IoT software vulnerability location

This module has two parts: the training phase and the testing (detection) phase.
1) The training phase: generating the Diff file between the source program and the patched program; selecting taints according to the taint selection principles and the taint weight calculation method; obtaining taint propagation paths using static taint analysis; transforming the taint propagation paths into symbolic representations; encoding the symbolic representations into vectors; and training a CNN-BLSTM neural network. The trained CNN-BLSTM neural network is shown in Figure 7.
2) The detection phase: given one or more target programs, Diff files are generated between the source programs and the patched programs. Taints are labeled using the taint selection principles and the taint weight calculation method, and taint propagation paths are obtained from the labeled taints. Each taint propagation path is transformed into a symbolic representation and encoded by Word2Vec. Finally, the lines of code where a vulnerability exists are located by applying the trained CNN-BLSTM model.
The pseudo code of our proposed method is as follows:

Figure 7: The trained CNN-BLSTM neural network.
Pseudo code of our proposed method
Get the Diff file between the source file and the patched file through difflib
for diff in Diff file:
    Remove comments (// and /*...*/) and header imports (lines beginning with #)
    Select variables that are shared in the difference rows
    Remove keywords
    if there are taints in the function line:
        if taint in sensitive function:
            for taint in sensitive functions from VulDeePecker [26]:
                The vulnerability exists in the line
                Extract the taint propagation path of the variable
                break
        else:
            for taint in normal function:
                Extract the taint propagation path of the variable
                break
    elif taint in if statement:
        for taint in taints in the if statement:
            Extract the taint propagation path of the variable
            break
    else:
        for taint in taints:
            Extract the taint propagation path of the variable
            break
if training:
    Establish a CNN-BLSTM network and use the extracted taint propagation paths to train the network to obtain a model
elif testing:
    Use the obtained model to detect whether a taint propagation path is a vulnerability and, if so, its type
5. Experimental evaluation and results

In this section, the dataset used to verify the validity of the proposed approach is described. The experimental setting and evaluation metrics are given, and then the experimental results are analyzed.

5.1. Experimental dataset


The proposed IoT software vulnerability location system focuses on two types of common vulnerabilities in this paper: buffer error (i.e., CWE-119) and resource management error (i.e., CWE-399). The open-source software programs in the experimental dataset come from the National Vulnerability Database (NVD) and the NIST Software Assurance Reference Dataset (SARD) project. The distribution of taint propagation paths from the software programs in our experimental dataset is listed in Table 2, where TPP denotes taint propagation paths. A taint propagation path is a sequence of semantically related lines of code.

Table 2: The distribution of software programs in the experiment.

Category   Normal TPP   TPP with CWE-119   TPP with CWE-399
Number     43,913       10,439             7,285

5.2. Experimental setting


The deep-learning-based IoT software vulnerability location system has been implemented with Python 3.6.5, Numpy 1.14.3, TensorFlow 1.8.0, Keras 2.1.6, Joblib 0.11, Gensim 3.4.0, Scikit-learn 0.19.1, Nltk 3.3, and Reportlab 3.4.0. All experiments are run on an off-the-shelf computer with an Intel Core i7 at 3.7 GHz, 32 GB of RAM, and a GeForce GTX 1080. To evaluate the true positive rates and false positive rates of our software vulnerability location approach, we use the parameters shown in Table 3. In the experiment, the ratio of the training set to the test set is 4:1.

Table 3: Experimental parameters settings.

Parameter                        Description                                          Value
max_num_len                      size of the program word dictionary                  20000
max_sequence_len                 maximum length of taint propagation path fragment    100
Word2Vec: size                   word vector dimension                                150
fit: batch_size                  the size of batch                                    32
LSTM: dropout & Dense: dropout   parameters used to prevent overfitting               0.2
pool_size                        size of pool window                                  3
kernel_size                      size of convolution window                           3
fit: nb_epoch                    number of training epochs                            5
LSTM: return_sequences           whether data is returned at each time step           True
loss                             loss function                                        binary_crossentropy
optimizer                        optimization function                                adam

The minimum length of the taint propagation paths in the training set is 2, the maximum length is 2,698, and the average length is 48. A reasonable value for the parameter max_sequence_len is 100. For English text, the embedding length is usually 150; moreover, the larger the embedding length, the larger the computational overhead. Therefore, the value of the parameter embedding_dim is 150. The value of dropout is normally set to 0.5. However, because the number of training epochs is only 5, the results show no overfitting, so the parameter dropout is set to 0.2. Setting the parameter return_sequences to True makes the LSTM output a result at each time step, and all the outputs are finally concatenated. The final result therefore contains information from every time step, and the test results also indicate that it is better to set this parameter to True. Since the task is a binary classification task, the output of the final model uses the sigmoid activation function, and the corresponding loss function is binary_crossentropy. Moreover, commonly used optimization algorithms such as sgd are prone to getting stuck in local minima, so the adam optimization algorithm is used; it is faster and makes the gradient descent process smoother.
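Since path lengths range from 2 to 2,698 while max_sequence_len is 100, every encoded path has to be brought to the same length before it reaches the network. The sketch below pads on the left and keeps the last 100 items of longer paths, which matches the default behaviour of the Keras pad_sequences utility; whether the authors used these defaults is an assumption, and the function name is hypothetical.

```python
MAX_SEQUENCE_LEN = 100  # value of max_sequence_len in Table 3


def pad_sequence(sequence, max_len=MAX_SEQUENCE_LEN, value=0):
    """Left-pad short sequences with `value`; keep the last `max_len` items of long ones."""
    if len(sequence) >= max_len:
        return list(sequence[-max_len:])
    return [value] * (max_len - len(sequence)) + list(sequence)


# A very short encoded path and one as long as the longest path in the training set.
short_path = pad_sequence([2, 40, 66])
long_path = pad_sequence(list(range(2698)))
```

With an average path length of 48, most paths are padded rather than truncated under this setting.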

5.3. Evaluation metrics and experiment analysis

The following evaluation metrics are chosen to evaluate the proposed IoT software vulnerability location system based on patching comparison. $TP$ is the number of normal programs correctly labeled as normal, $FP$ is the number of programs with vulnerabilities labeled as normal programs, $FN$ is the number of normal programs labeled as programs with vulnerabilities, and $TN$ is the number of programs with vulnerabilities correctly detected. $FP_O$ and $FN_O$ denote the FP and FN of the other deep learning models, such as RNN, LSTM, and BLSTM. $FP_{CB}$ and $FN_{CB}$ denote the FP and FN of the CNN-BLSTM model, respectively.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$FP_I = \frac{FP_O - FP_{CB}}{FP_{CB}}$$
$$FN_I = \frac{FN_O - FN_{CB}}{FN_{CB}}$$

In general, the higher the Accuracy, the better the recognition effect. Positive values of $FP_I$ and $FN_I$ indicate that the CNN-BLSTM model is better; otherwise, the other model is better.
Experiments are performed to evaluate the performance of the proposed approach for detecting IoT software CWE-399/CWE-119 vulnerabilities. They mainly compare CNN-BLSTM with other deep learning models, namely RNN, LSTM, and BLSTM.
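As a worked check of these metrics, the sketch below recomputes the Accuracy of the RNN and the two relative measures from the RNN and CNN-BLSTM counts reported in Table 4. The function names are chosen for this example only; the relative measures divide the difference between the other model's count and the CNN-BLSTM count by the CNN-BLSTM count.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def relative_improvement(other_count, cnn_blstm_count):
    """(other - CNN-BLSTM) / CNN-BLSTM; positive when CNN-BLSTM makes fewer errors."""
    return (other_count - cnn_blstm_count) / cnn_blstm_count


# Counts for CWE-399 identification, copied from Table 4.
rnn = {"tn": 1239, "fp": 210, "fn": 159, "tp": 2769}
cnn_blstm = {"tn": 1416, "fp": 33, "fn": 89, "tp": 2839}

rnn_accuracy = accuracy(**rnn)
cnn_blstm_accuracy = accuracy(**cnn_blstm)
fp_i = relative_improvement(rnn["fp"], cnn_blstm["fp"])
fn_i = relative_improvement(rnn["fn"], cnn_blstm["fn"])
```

Up to rounding, these values agree with the RNN column of Table 4 (Accuracy 0.9157, FP_I 5.364, FN_I 0.786).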

5.3.1. Experimental results of different models for CWE-399 identification


Table 4 shows the experimental results of different deep learning models for locating software vulnerabilities of type CWE-399. The results show that the CNN-BLSTM-based classifier is the most effective at CWE-399 identification, with an Accuracy of 0.9721. First, the convolutional neural network (CNN) is used to train the morphological character-level vectors of code gadgets, and the word vectors with semantic feature information are obtained by training on the large-scale background corpus. The two are then combined as input to construct a BLSTM deep neural network model suited to software vulnerability identification, which can further improve recognition performance. The Accuracy of the RNN is 0.9157. RNN has the worst recognition effect because of its vanishing gradient problem. LSTM performs worse than BLSTM, mainly because the network structure of BLSTM constitutes an acyclic graph, and its output takes into account the context both before and after the current position. Moreover, as seen from FP_I and FN_I, CNN-BLSTM greatly improves on RNN. Although the FP_I of BLSTM and the FN_I of LSTM are negative, CNN-BLSTM is still better overall.

Table 4: Experimental results of different models for CWE-399 identification.

Model      RNN      LSTM     BLSTM    CNN-BLSTM
TN         1239     1362     1419     1416
FP         210      87       30       33
FN         159      76       107      89
TP         2769     2582     2821     2839
Accuracy   0.9157   0.9628   0.9687   0.9721
FP_I       5.364    1.636    -0.09    1
FN_I       0.786    -0.146   0.2      1

5.3.2. Experimental results of different models for CWE-119 identification

Table 5 shows the experimental results of different deep learning models for locating software vulnerabilities of type CWE-119. The results show that the CNN-BLSTM-based classifier performs best at CWE-119 identification, with an Accuracy of 0.9732. The Accuracy of the RNN is 0.9098. RNN has the worst recognition effect because of its vanishing gradient problem. LSTM performs worse than BLSTM, mainly because BLSTM considers the context both before and after the current position. As can be seen from Table 5, the CNN-BLSTM model has a lower false positive rate than the other deep learning models, but its false negative rate is higher than that of the LSTM and BLSTM models. Moreover, as seen from FP_I and FN_I, CNN-BLSTM greatly improves on RNN. Although the FN_I of LSTM and the FN_I of BLSTM are negative, the improvement in FP_I is greater, and the CNN-BLSTM model is still better overall.

Table 5: Experimental results of different models for CWE-119 identification.

Model      RNN      LSTM     BLSTM    CNN-BLSTM
TN         5170     5763     5833     5954
FP         1183     482      454      306
FN         969      314      274      334
TP         16529    17292    17290    17257
Accuracy   0.9098   0.9666   0.9695   0.9732
FP_I       2.866    0.575    0.484    1
FN_I       1.9      -0.06    -0.18    1

Finally, we trained on a total of 31,802 samples using CNN-BLSTM, which took 3097.2 seconds, and tested a total of 7,950 samples using CNN-BLSTM, which took 34.2 seconds.

6. Conclusion

In recent years, vulnerabilities have frequently been exposed in various kinds of commercial software, seriously affecting enterprise security. Thus, the security of third-party applications has received much attention. Existing dynamic detection methods consume a lot of CPU resources, and their level of automation is low. This work uses static analysis and deep learning algorithms to automatically locate vulnerabilities. The proposed approach generates the Diff file between the source code and the patched program, labels taint sources according to the designed taint selection principles, obtains the lines where taints first appear and the taint propagation paths using static taint analysis, transforms the taint propagation paths into symbolic representations, encodes the symbolic representations into vectors, discovers CWE-119/CWE-399 vulnerabilities with the trained CNN-BLSTM model, and finds the lines where they occur. The deep-learning-based vulnerability locator is evaluated on a dataset consisting of 17,725 programs with vulnerabilities and 43,913 benign programs. Experimental results show that the proposed approach achieves an accuracy of 0.9732 for CWE-119 and 0.9721 for CWE-399, which is higher than that of the other three models (the accuracy of RNN, LSTM, and BLSTM is below 0.97).
In the future, our work can be applied to industrial control security, smart car security, smart home security, and other fields to ensure the safety of Internet of Things equipment. It can be used as a detection system to check these devices before they leave the factory, or as a chip embedded in an Internet of Things device to perform detection.

7. Acknowledgment

We thank the anonymous reviewers for their comments, which helped us improve the paper. This work was supported in part by the National Key R&D Plan under Grant CNS 2016QY06X1205, in part by the Basic research business fees of central colleges under Grant CNS 20826041B4252, in part by the National Natural Science Foundation (NSFC) under Grant CNS 61572115, and in part by the Science and Technology Project of State Grid Corporation of China under Grant CNS 522722180007. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.

References

[1] D. Mercer, Smart home will drive internet of things to 50 billion devices, says strategy analytics, https://www.strategyanalytics.com/strategy-analytics/news/strategy-analytics-press-releases/2017/10/26/smart-home-will-drive-internet-of-things-to-50-billion-devices-says-strategy-analytics (2017).

[2] X. Du, M. Guizani, Y. Xiao, H.-H. Chen, Transactions papers a routing-driven elliptic curve cryptography based key management scheme for heterogeneous sensor networks, IEEE Transactions on Wireless Communications 8 (3) (2009) 1223–1229.

[3] Y. Xiao, V. K. Rayi, B. Sun, X. Du, F. Hu, M. Galloway, A survey of key management schemes in wireless sensor networks, Computer communications 30 (11-12) (2007) 2314–2341.

[4] X. Du, Y. Xiao, M. Guizani, H.-H. Chen, An effective key management scheme for heterogeneous sensor networks, Ad Hoc Networks 5 (1) (2007) 24–34.

[5] X. Du, H.-H. Chen, Security in wireless sensor networks, IEEE Wireless Communications 15 (4) (2008) 60–66.

[6] A. Laszka, A. Dubey, M. Walker, D. Schmidt, Providing privacy, safety, and security in iot-based transactive energy systems using distributed ledgers, in: Proceedings of the Seventh International Conference on the Internet of Things, ACM, 2017, p. 13.

[7] C. Kolias, G. Kambourakis, A. Stavrou, J. Voas, Ddos in the iot: Mirai and other botnets, Computer 50 (7) (2017) 80–84.

[8] A. T. Capossele, V. Cervo, C. Petrioli, D. Spenza, Counteracting denial-of-sleep attacks in wake-up-radio-based sensing systems, in: 2016 13th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), IEEE, 2016, pp. 1–9.

[9] T. Farzana, A. Babu, A light weight plgp based method for mitigating vampire attacks in wireless sensor networks, Int. J. Eng. Comput. Sci 3 (7).

[10] D. E. Boubiche, A. Bilami, A defense strategy against energy exhausting attacks in wireless sensor networks, Journal Of Emerging Technologies In Web Intelligence 5 (1) (2013) 18–27.

[11] C. Goudar, S. Kulkarni, Mechanisms for detecting and preventing denial of sleep attacks and strengthening signals in wireless sensor networks, Int. J. Emerg. Res. Manag. Technol 4 (6).

[12] CVE, Cve details, https://www.cvedetails.com/browse-by-date.php (2017).

[13] D. Fraze, Cyber grand challenge (cgc), https://www.darpa.mil/program/cyber-grand-challenge (2017).

[14] W.-C. Lin, S.-W. Ke, C.-F. Tsai, Cann: An intrusion detection system based on combining cluster centers and nearest neighbors, Knowledge-based systems 78 (2015) 13–21.

[15] G. Grieco, G. L. Grinblat, L. Uzal, S. Rawat, J. Feist, L. Mounier, Toward large-scale vulnerability discovery using machine learning, in: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, ACM, 2016, pp. 85–96.

[16] S. Kim, S. Woo, H. Lee, H. Oh, Vuddy: A scalable approach for vulnerable code clone discovery, in: 2017 IEEE Symposium on Security and Privacy (SP), IEEE, 2017, pp. 595–614.

[17] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, C. V. Lopes, Sourcerercc: Scaling code clone detection to big-code, in: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 2016, pp. 1157–1168.

[18] F. Wu, J. Wang, J. Liu, W. Wang, Vulnerability detection with deep learning, in: 2017 3rd IEEE International Conference on Computer and Communications (ICCC), IEEE, 2017, pp. 1298–1302.

[19] M. Mohammadi, A. Al-Fuqaha, S. Sorour, M. Guizani, Deep learning for iot big data and streaming analytics: A survey, IEEE Communications Surveys & Tutorials 20 (4) (2018) 2923–2960.

[20] H. Li, K. Ota, M. Dong, Learning iot in edge: Deep learning for the internet of things with edge computing, IEEE Network 32 (1) (2018) 96–101.

[21] M. White, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, Toward deep learning software repositories, in: Proceedings of the 12th Working Conference on Mining Software Repositories, IEEE Press, 2015, pp. 334–345.

[22] M. White, M. Tufano, C. Vendome, D. Poshyvanyk, Deep learning code fragments for code clone detection, in: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ACM, 2016, pp. 87–98.

[23] X. Gu, H. Zhang, D. Zhang, S. Kim, Deep api learning, in: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, 2016, pp. 631–642.

[24] E. C. R. Shin, D. Song, R. Moazzezi, Recognizing functions in binaries with neural networks, in: 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 611–626.

[25] J. Saxe, K. Berlin, expose: A character-level convolutional neural network with embeddings for detecting malicious urls, file paths and registry keys, arXiv preprint arXiv:1702.08568 (2017) 1–18.

[26] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, Y. Zhong, Vuldeepecker: A deep learning-based system for vulnerability detection, arXiv preprint arXiv:1801.01681 (2018) 1–15.

[27] D. A. Wheeler, Flawfinder, https://www.dwheeler.com/flawfinder/ (2017).

[28] S. S. Solutions, Rough auditing tool for security (rats), https://code.google.com/archive/p/rough-auditing-tool-for-security/ (2017).

[29] J. Viega, J.-T. Bloch, Y. Kohno, G. McGraw, Its4: A static vulnerability scanner for c and c++ code, in: Proceedings 16th Annual Computer Security Applications Conference (ACSAC'00), IEEE, 2000, pp. 257–267.

[30] Checkmarx, Securing uncompiled code with cxsast, https://www.checkmarx.com/products/static-application-security-testing/ (2017).

[31] Coverity, Coverity scan static analysis, https://scan.coverity.com/ (2017).

[32] HP, Fortify static code analyzer: Static application security testing, https://software.microfocus.com/es-es/products/static-code-analysis-sast/overview (2017).

[33] S. Neuhaus, T. Zimmermann, The beauty and the beast: Vulnerabilities in red hat's packages, in: USENIX Annual Technical Conference, 2009.

[34] G. Abaei, A. Selamat, H. Fujita, An empirical study based on semi-supervised hybrid self-organizing map for software fault prediction, Knowledge-Based Systems 74 (2015) 28–39.

[35] S. Neuhaus, T. Zimmermann, C. Holler, A. Zeller, Predicting vulnerable software components, in: ACM Conference on computer and communications security, Citeseer, 2007, pp. 529–540.

[36] S. S. Rathore, S. Kumar, Linear and non-linear heterogeneous ensemble methods to predict the number of faults in software systems, Knowledge-Based Systems 119 (2017) 232–256.

[37] F. Yamaguchi, F. Lindner, K. Rieck, Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning, in: Proceedings of the 5th USENIX conference on Offensive technologies, USENIX Association, 2011, pp. 13–13.

[38] F. Yamaguchi, M. Lottmann, K. Rieck, Generalized vulnerability extrapolation using abstract syntax trees, in: Proceedings of the 28th Annual Computer Security Applications Conference, ACM, 2012, pp. 359–368.

[39] J. Li, M. D. Ernst, Cbcd: Cloned buggy code detector, in: Proceedings of the 34th International Conference on Software Engineering, IEEE Press, 2012, pp. 310–320.

[40] G. Huang, Y. Li, Q. Wang, J. Ren, Y. Cheng, X. Zhao, Automatic classification method for software vulnerability based on deep neural network, IEEE Access 7 (2019) 28291–28298.

[41] J. Jurn, T. Kim, H. Kim, An automated vulnerability detection and remediation method for software security, Sustainability 10 (5) (2018) 1652.

[42] G. Spanos, L. Angelis, A multi-target approach to estimate software vulnerability characteristics and severity scores, Journal of Systems and Software 146 (2018) 152–166.

[43] A. Gupta, B. Suri, V. Kumar, S. Misra, T. Blažauskas, R. Damaševičius, Software code smell prediction model using shannon, rényi and tsallis entropies, Entropy 20 (5) (2018) 372.

[44] M. Kumari, A. Misra, S. Misra, L. Fernandez Sanz, R. Damasevicius, V. Singh, Quantitative quality evaluation of software products by considering summary and comments entropy of a reported bug, Entropy 21 (1) (2019) 91.

[45] A. H. Sodhro, S. Pirbhulal, Z. Luo, V. H. C. de Albuquerque, Towards an optimal resource management for iot based green and sustainable smart cities, Journal of Cleaner Production 220 (2019) 1167–1179.

[46] A. H. Sodhro, Z. Luo, G. H. Sodhro, M. Muzamal, J. J. Rodrigues, V. H. C. de Albuquerque, Artificial intelligence based qos optimization for multimedia communication in iov systems, Future Generation Computer Systems 95 (2019) 667–680.

[47] A. H. Sodhro, A. S. Malokani, G. H. Sodhro, M. Muzammal, L. Zongwei, An adaptive qos computation for medical data processing in intelligent healthcare applications, Neural Computing and Applications (2019) 1–12.

[48] A. H. Sodhro, F. K. Shaikh, S. Pirbhulal, M. M. Lodro, M. A. Shah, Medical-qos based telemedicine service selection using analytic hierarchy process, in: Handbook of Large-Scale Distributed Computing in Smart Healthcare, Springer, 2017, pp. 589–609.

[49] H. Magsi, A. H. Sodhro, F. A. Chachar, S. A. K. Abro, G. H. Sodhro, S. Pirbhulal, Evolution of 5g in internet of medical things, in: 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), IEEE, 2018, pp. 1–7.

[50] A. H. Sodhro, S. Pirbhulal, G. H. Sodhro, A. Gurtov, M. Muzammal, Z. Luo, A joint transmission power control and duty-cycle approach for smart healthcare system, IEEE Sensors Journal (2018) 1–1.

[51] A. H. Sodhro, S. Pirbhulal, V. H. C. de Albuquerque, Artificial intelligence driven mechanism for edge computing based industrial applications, IEEE Transactions on Industrial Informatics (2019) 4235–4243.

[52] A. H. Sodhro, Y. Li, M. A. Shah, Energy-efficient adaptive transmission power control for wireless body area networks, IET Communications 10 (1) (2016) 81–90.

[53] M. Muzammal, R. Talat, A. H. Sodhro, S. Pirbhulal, A multi-sensor data fusion enabled ensemble approach for medical data from body sensor networks, Information Fusion 53 (2020) 155–164.

[54] Y. Lin, X. Jin, J. Chen, A. H. Sodhro, Z. Pan, An analytic computation-driven algorithm for decentralized multicore systems, Future Generation Computer Systems 96 (2019) 101–110.

[55] A. A. Chandio, D. Zhu, A. H. Sodhro, M. U. Syed, An implementation of web services for inter-connectivity of information systems, arXiv preprint arXiv:1407.8320 (2014) 1–7.

[56] M. M. Lodro, N. Majeed, A. A. Khuwaja, A. H. Sodhro, S. Greedy, Statistical channel modelling of 5g mmwave mimo wireless communication, in: 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), IEEE, 2018, pp. 1–5.

[57] I. Medeiros, N. Neves, M. Correia, Detecting and removing web application vulnerabilities with static analysis and data mining, IEEE Transactions on Reliability 65 (1) (2016) 54–69.

[58] X. N. Noundou, Saint: Simple static taint analysis tool users manual, https://archive.org/details/saint (2015).

[59] djn3m0, tanalysis, https://github.com/djn3m0/tanalysis (2015).

[60] M. Arroyo, F. Chiotta, F. Bavera, An user configurable clang static analyzer taint checker, in: 2016 35th International Conference of the Chilean Computer Science Society (SCCC), IEEE, 2016, pp. 1–12.

[61] J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks 61 (2015) 85–117.

[62] H. Xu, C. Huang, D. Wang, Enhancing semantic image retrieval with limited labeled examples via deep learning, Knowledge-Based Systems 163 (2019) 252–266.

[63] H. Peng, Y. Ma, Y. Li, E. Cambria, Learning multi-grained aspect target sequence for chinese sentiment analysis, Knowledge-Based Systems 148 (2018) 167–176.

[64] J. Leng, Q. Chen, N. Mao, P. Jiang, Combining granular computing technique with deep learning for service planning under social manufacturing contexts, Knowledge-Based Systems 143 (2018) 295–306.

[65] H. Liu, B. Lang, M. Liu, H. Yan, Cnn and rnn based payload classification methods for attack detection, Knowledge-Based Systems (2018) S0950705118304325–.

[66] G. Lin, J. Zhang, W. Luo, L. Pan, Y. Xiang, Poster: Vulnerability discovery with function representation learning from unlabeled projects, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM, 2017, pp. 2539–2541.

[67] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Z. Chen, S. Wang, J. Wang, Sysevr: a framework for using deep learning to detect software vulnerabilities, arXiv preprint arXiv:1807.06756 (2018) 1–13.

[68] R. Jozefowicz, W. Zaremba, I. Sutskever, An empirical exploration of recurrent network architectures, in: International Conference on Machine Learning, 2015, pp. 2342–2350.

[69] L. SourceGear, Diffmerge, https://sourcegear.com/diffmerge/.

[70] Textdiff, http://www.angusj.com/delphi/textdiff.html.

[71] Meld, http://meldmerge.org/.

[72] Git diff, https://www.atlassian.com/git/tutorials/saving-changes/git-diff.

[73] C. McCormick, Word2vec tutorial - the skip-gram model, http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/ (2017).
Highlights
Three taint selection principles are proposed to determine the original taints.
The taint weight calculation method is put forward to select final taint with high weight.
The deep learning-based IoT software vulnerability location system is developed.
The effectiveness of our developed system is evaluated using the Code Gadget Database.
