The Basic Building Blocks of Malware
Abstract
Many security attacks accomplish their goals by controlling the inputs
to a target program. Based on this insight, we have developed a novel and
highly effective approach to developing malware signatures.¹ These
signatures, also called “basic building blocks” of malware, possess the
essential elements common to all malware of a certain class. The key to
the success of our approach is that it captures the global semantics of
malware. Experimental evaluation shows that our algorithm can detect
syntactic malware variants with low error rates, and outperforms currently
popular malware detection systems, such as McAfee VirusScan Enterprise.

¹ Our B³ approach has a patent pending. This project was supported by an ONR URI grant.
1 Introduction
The objective of our research is to detect malware, such as a virus, by recog-
nizing its underlying goals. Rather than identifying and representing malware
patterns syntactically, we adopt a semantic approach that discovers and cap-
tures the true underlying attack goal(s) of the malware. Why is a semantic
approach preferable? First, different syntactic representations may have the
same meaning. Second, it is easy for an attacker to obscure the program’s real
goals by inserting irrelevant function calls or changing the binary code’s superfi-
cial appearance. Third, a significant portion of the malware might be incidental
or irrelevant to its attack goal. For example, some code may merely perform
normal operations, such as memory allocation and initialization, to set up for a
subsequent real attack. In contrast, our approach detects all malware variants
with semantically identical attack goals.
Christodorescu et al. [5] proposed the first approach to semantics-aware mal-
ware detection. However, their approach is manual and depends on local pat-
terns; an attacker may easily mount her attack by avoiding such local attack
patterns. To avoid this problem, we focus on global, rather than local seman-
tics. Global semantics refers to the structure of the entire program as a whole,
whereas local semantics refers to individual system calls. To the best of our
knowledge, our approach is the first to capture global semantics in a malware
signature, and it does so with almost complete automation.
Our malware signature, called a basic building block (b³), is constructed by
translating code to graphical structures, abstraction, extraction of semantics,
and finally inductive inference. Our contributions are the b³ signature itself
and the almost fully automated algorithm that discovers it.
2 Attack Goals
Initially, a malware program has no control over the target program. But it
does have control over the input to the target program. It takes control of the
target program via malicious input.
The input to the target is produced as malicious output by the malware
program; we call the malware program’s outputs its attack goals. We coarsely
divide security attacks into memory-based (such as buffer overflow or format
string attacks) and function-call-based (or simply function-based) attacks, ac-
cording to the main strategy utilized by the malware program.
The focus here is on function-based attacks, which produce hostile actions
on the victim’s system, such as opening a TCP port to send a copy of itself
to remote machines, dropping a backdoor, deleting or intercepting sensitive
information, modifying system configurations of the victim’s machine, and so
on. The goal of function-based attacks is usually expressed as hostile actions,
e.g., by invoking certain function calls. Most of today’s viruses, worms, Trojans,
backdoors, DoS tools, and other hacking tools, which are written in high-level
languages such as C/C++, are function-based attacks.
The fundamental difference between previous research and our work in find-
ing a malware signature is that we do not rely on local attack patterns, such as
binary pattern matching. Instead, we take a global view as to what the ultimate
attack goals are and how they relate to each other. To this end, we analyze out-
puts of a malware program – because they represent potential attack goals. For
example, Figure 1 shows two example programs that are syntactically different,
but have semantically identical goals. A typical modern malware detection sys-
tem would create two signatures – one for each program. Our system recognizes
their identical semantics and generates a single semantic malware signature.
As an aside, note that we are not claiming to have solved the program
equivalence problem, which is undecidable. We have instead addressed the
problem of determining whether a new unseen program belongs to a particular
class of malware, which is a machine learning problem in the standard supervised
learning paradigm [11].
3 B³ Discovery Algorithm
3.1 Overview
A basic building block, which is a model of malware attacks, is constructed
from a set of attack and non-attack programs as follows:
1. Convert each program (attack or non-attack) into a graph. A
graphical representation is used because it is easier to generalize over.
Since an attack program’s source code is often unavailable, its executable
binary must first be transformed into a tractable high-level representation
for the graph. The IDA Pro disassembler [6] is used to automate this
process; it obtains assembly code from binary code. Because IDA Pro is
unable to unpack/decrypt binary code, we first manually unpack and/or
decrypt the program. The assembly code is then converted to a graph
that is a hybrid of control flow and data dependence graphs.
2. Partition the graph into subgraphs. For abstraction, the overall
graph is divided into subgraphs, each containing a program subgoal or
terminal function.
3. Semantic abstraction. Semantic abstraction is the key to making our
approach scalable. With abstraction, the graph is boiled down to its
skeletal semantic essence. Our abstraction algorithm inputs a graph that
has been divided into subgraphs, and outputs a finite-state machine (FSM)
that captures global program semantics. An FSM representation has been
chosen because it simplifies the induction process.
int main(void){
    FILE* fp = NULL;                //file pointer
    char* data = "abcde";
    fp = fopen("test", "w");        //opens a file
    if(fp == NULL) exit(1);         //if fopen() fails, then exit
    fputs(data, fp);                //writes data in the file
    fclose(fp);                     //closes the file
    int c;
    fp = fopen("foo", "r");         //opens a file
    c = fgetc(fp);                  //reads data
    fclose(fp);                     //closes the file
}

Program A

int main(void){
    HANDLE h;                       //file handle
    char buffer[1024];
    strncpy(buffer, "abcde", 5);
    h = CreateFile("test"...);      //opens a file
    if(h == INVALID_HANDLE_VALUE)
        ExitProcess(1);             //if CreateFile() fails, then exit
    WriteFile(h, buffer, 5,...);    //writes data in the file
    CloseHandle(h);                 //closes the file
}

Program B

Figure 1: Two syntactically different programs, A and B, with semantically identical goals.
3.2 Graph Construction and Pruning
Malware assembly code is converted to a graph. This graph is composed of
both a control flow graph (CFG) and a data dependence graph (DDG). CFGs
enable us to logically interconnect subgoals in the later abstraction phase, and
DDGs help recover function call arguments, also used in abstraction.
DDGs are used to recover function call arguments. For each function call,
our algorithm identifies a function-call node in the graph. The algorithm then
follows reverse paths in the graph from the function-call node to its data sources
in the data dependence graph. It halts when it gets to graph nodes containing
values that are statically known, and it removes all subgraphs earlier than these
nodes. This procedure, which is a form of backward slicing, significantly prunes
the graph size. In static data flow analysis, some data values, such as function
pointers, are impossible to recover because they are statically unknown. Each
statically unknown value is replaced with a question mark.
Backward slicing [7] is then performed with the CFG. The algorithm begins
toward the end of the program, at the location where the program emits a ma-
licious output intended to be sent to the target program as input. We predefine
output-emitting functions and terminal functions to identify these locations in
the program. The algorithm then follows the reverse control flow edges in the
graph. During this backward CFG traversal, every subgraph identified as “not
semantically critical” (i.e., not output-emitting in terms of security attacks) is
pruned from the graph [15].
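A hedged sketch of this backward-slicing prune, assuming a node-indexed graph where each node stores its reverse data-dependence edges; the structure and field names are our assumptions, not the paper's implementation.

#include <queue>
#include <set>
#include <string>
#include <vector>

struct Node {
    std::string value;            // "?" if the value is statically unknown
    bool staticallyKnown = false; // halt the backward walk at such nodes
    std::vector<int> dataSources; // reverse DDG edges (argument definitions)
};

// Walk backward from a function-call node to its statically known data
// sources. Everything not reached by this walk lies "earlier" than the
// slice and is pruned from the graph.
std::set<int> backwardSlice(const std::vector<Node>& g, int callNode) {
    std::set<int> keep;
    std::queue<int> work;
    work.push(callNode);
    while (!work.empty()) {
        int n = work.front(); work.pop();
        if (!keep.insert(n).second) continue; // already visited
        if (g[n].staticallyKnown) continue;   // boundary of the slice
        for (int src : g[n].dataSources) work.push(src);
    }
    return keep;                              // nodes outside 'keep' are pruned
}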
3.3 Subgraphs
After graph construction and pruning, the next step is to prepare for ab-
straction. The graph is divided into multiple subgraphs, called “subblocks.”
Each subblock will become an abstract element (indivisible unit/node) in an
abstract graph, called a “finite-state machine/transducer.” The fundamental
basis of each subblock is either a security-critical or terminal function, to be
defined next.
A security-critical function is one that generates a suboutput, i.e., an action
that is critical from a security standpoint, and which can be used to formu-
late the final (attack) program’s (malicious) output. Examples include creat-
ing/deleting a file, sending network data, or modifying system configurations.
Any function that is not security-critical is called non-security-critical. A termi-
nal function is one that causes a program to terminate, e.g., exit. An example
of a non-terminal function is send.
The key to embedding semantics into our approach is that we group security-
critical and terminal functions according to their semantic properties. For ex-
ample, functions such as fputc, fputs and write all share the same semantic
functional meaning to the system: they write data to a file stream. A unique
function group number is assigned to each group of semantically similar func-
tions.
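As a small illustration, a grouping table might look like the sketch below; the particular entries and group numbers are our assumptions for this example.

#include <map>
#include <string>

// Map each function name to its semantic group number; functions with the
// same functional meaning (e.g., writing to a file stream) share a group.
int functionGroup(const std::string& callee) {
    static const std::map<std::string, int> groups = {
        {"fputc", 1}, {"fputs", 1}, {"write", 1},   // write data to a stream
        {"fopen", 2}, {"CreateFile", 2},            // open a file
        {"exit", 3},  {"ExitProcess", 3},           // terminal functions
    };
    auto it = groups.find(callee);
    return it == groups.end() ? 0 : it->second;     // 0 = no semantic group
}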
To detect subblock boundaries, a semantic prologue (SP) and semantic epilogue
(SE) pair is defined. A semantic prologue is the set of functions that must
be executed before a security-critical function call, and a semantic epilogue is
the set of functions that must be executed after the function call.
In summary, a subblock consists of a security-critical or terminal function as
its basis, and an SP-SE pair for subblock boundary delineation. For example,
Figure 2 shows three subblocks for program A in Figure 1. Formally, a subblock
is its basis function together with the SP-SE pair that delineates its boundary.
Figure 2: Subblocks of program A in Figure 1. Dotted lines indicate data depen-
dencies and solid lines control flows. Note that there are only three subblocks
in the graph because fgetc is a non-security-critical function.
Note that some subblocks may be both security-critical and terminal. If this
is the case, we split the subblock into two subblocks (a security-critical subblock
and a terminal subblock) and serially connect them. If there is a control branch
from a subblock, an empty subblock is inserted at the branching point, with
ε-transitions as connections. For example, Figure 3 shows the abstract
finite-state transducer (abstract-FST) for programs A and B in Figure 1. Note
that both translate semantically to the same abstract-FST.
Figure 3: Abstract-FST for programs A and B in Figure 1. Bold circles represent
the final subblocks. Since subblock 3 is both security-critical and terminal, an
extra final subblock is appended at the end of it. An empty subblock (the second
subblock from the left) is inserted for a control branch.
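A sketch of the splitting rule described before Figure 3, under a hypothetical state-list representation of the abstract-FST; all names and fields are ours, and an empty INPUT stands for an ε-transition.

#include <string>
#include <vector>

struct State {
    std::string input;             // function-group symbol; "" means epsilon
    bool securityCritical = false;
    bool terminal = false;
    std::vector<int> next;         // indices of successor states
};

// Split every state that is both security-critical and terminal into a
// security-critical state serially connected to a new, purely terminal state.
void splitCriticalTerminals(std::vector<State>& fst) {
    const size_t n = fst.size();
    for (size_t i = 0; i < n; ++i) {
        if (fst[i].securityCritical && fst[i].terminal) {
            State term;                     // new final state; emits nothing
            term.terminal = true;
            fst.push_back(term);
            fst[i].terminal = false;        // original keeps only criticality
            fst[i].next.push_back(static_cast<int>(fst.size()) - 1);
        }
    }
}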
3.5.1 Creation and Alignment of Examples
Concept learning is done over positive and negative examples, which are
collectively called training examples for the learner.
To convert a string, x, to a positive or negative training example, it is nec-
essary to augment the string with a frequency vector, V . The purpose of this
vector is to give higher weight to more frequent attack patterns. In particular,
the frequency vector encodes information regarding which INPUT symbols ap-
pear more often in the positive examples and less often in the negative examples.
It is initialized to be all 0s, and it is updated during induction (as described in
the following subsection).
Prior to induction, examples are aligned according to their INPUT symbols.
We use a sequence alignment technique from [12] to find an optimal alignment
between strings (and, later, between a string and the model) and to calculate
a similarity score. These scores are divided by the maximum string (sequence)
length – to express similarity as a percentage. All aligned strings are padded
to the same length; an underscore (_) denotes the padding placeholder. When
strings are aligned, this placeholder often appears opposite an ε-transition.
Note that |x| is defined to be the length of string x, where ε’s are included.
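For concreteness, here is a minimal global-alignment similarity in the style of Needleman–Wunsch [12], with one character per INPUT symbol; the scoring scheme (+1 per match, 0 for gaps and mismatches) is our assumption rather than the paper's, and the score is divided by the longer string's length.

#include <algorithm>
#include <string>
#include <vector>

// Global alignment score between two symbol strings, normalized by the
// maximum string length to express similarity as a fraction of 1.
double similarity(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::max({ d[i - 1][j],            // gap in b
                                 d[i][j - 1],            // gap in a
                                 d[i - 1][j - 1] + (a[i - 1] == b[j - 1]) });
    size_t maxLen = std::max(a.size(), b.size());
    return maxLen ? static_cast<double>(d[a.size()][b.size()]) / maxLen : 1.0;
}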
There is only one terminal subblock for any sub-machine and this terminal
subblock does not emit any output. Recall that if a subblock is both output-
emitting and terminal, then we split the subblock into two subblocks and serially
connect them, so that the last subblock is always terminal, not output-emitting.
Therefore, when we align two sub-machines, the terminal subblocks are always
merged into a single final state.
Next, we define the union operator that performs generalization. To simplify
our formal definitions of the union and difference operators, training examples
are expressed using the same notation as models. In fact, they are actually
degenerate models (i.e., models without disjunction), so this is reasonable.
The union operation merges equivalent aligned states in the following manner.
Define any pair of aligned states q1 ∈ a1 and q2 ∈ a2 whose INPUTs (on the
outgoing edges) are equal to be equivalent. These equivalent states merge into
a single state q in a1 ∪ a2. Any transitions leading into/out of q1 or q2 in a1
or a2 now lead into/out of this single state q in a1 ∪ a2. The OUTPUT of this
state becomes a function of the product of the OUTPUTs of the two original
states, i.e., γ = Fg (γ1 , γ2 ), where γ1 ∈ Γ1 , γ2 ∈ Γ2 , γ ∈ Γ, and Γ = Fg (Γ1 × Γ2 ).
Also, a state with ε INPUT is considered equivalent to any state aligned with
it during the union (but not difference) operation; its INPUT becomes that of
the state with which it is aligned. If an ε OUTPUT symbol appears in either
q1 or q2 (or both), then Fg returns either ε or an implementation-specific value.
Finally, Qe is the set of all aligned states that are equivalent in a1 and a2 (e.g.,
q ∈ Qe ); Se is the subset of these that are start states.
During the union operation, the frequency vector V is updated with the
following sequence of steps:
1. V = 0.
2. V = V1 + V2 .
3. If the nth INPUT symbols in a1 and a2 match, then increase the nth
element in V by one.
After generalization has completed over all positive examples, specialization
is performed over all negative examples. Specialization subtracts each negative
example, one-by-one, from the model via a difference operator – to omit elements
specific to non-attacks.
During the difference operation, V is updated with the following rule:
1. V = V1 .
2. If the nth INPUT symbols in a1 and a2 match, then remove the nth
element in V .
The idea of deleting the nth element is to give no weight to any subblock that
also exists in non-attacks. Figure 4 gives an example of generalization followed
by specialization.
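The two frequency-vector updates can be sketched as follows, assuming already-aligned, equal-length INPUT strings (one character per symbol) and index-aligned vectors; we interpret "removing" the nth element as zeroing it, per the stated intent of giving it no weight.

#include <string>
#include <vector>

// Union (generalization): V = V1 + V2, plus one for every aligned position
// whose INPUT symbols match.
std::vector<int> unionUpdate(const std::vector<int>& v1,
                             const std::vector<int>& v2,
                             const std::string& in1, const std::string& in2) {
    std::vector<int> v(v1.size(), 0);      // step 1: V = 0
    for (size_t n = 0; n < v.size(); ++n) {
        v[n] = v1[n] + v2[n];              // step 2: V = V1 + V2
        if (in1[n] == in2[n]) ++v[n];      // step 3: matching INPUT symbols
    }
    return v;
}

// Difference (specialization): start from V1 and zero every position whose
// INPUT symbol also occurs in the aligned negative (non-attack) example.
std::vector<int> differenceUpdate(std::vector<int> v,
                                  const std::string& in1,
                                  const std::string& in2) {
    for (size_t n = 0; n < v.size(); ++n)
        if (in1[n] == in2[n]) v[n] = 0;    // give this subblock no weight
    return v;
}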
A separate b³, with this parameter optimization, is constructed for each attack
group (class).
4 Classification
The following approach is used to classify new, unseen examples as “ATTACK”
or “NON-ATTACK.” Each new example is compared with the b³; recall that
partial matching is used. To calculate a similarity (or matching) score between
the learned b³ and a previously unseen example, we first obtain β and γ for the
learned b³ and use those parameters to compute a new similarity score α′
between the b³ and the unseen sub-machines. This score is used to classify new
examples: a new example is labeled “ATTACK” only if α′ exceeds the threshold
α from the learned b³.
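The decision rule itself reduces to a simple threshold test; a minimal sketch, with parameter names assumed:

#include <string>

// Label a new example "ATTACK" only if its similarity score against the
// learned b3 exceeds the learned threshold.
std::string classify(double alphaPrime, double alpha) {
    return alphaPrime > alpha ? "ATTACK" : "NON-ATTACK";
}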
5 Experimental Results
We tested our algorithm against all variants of 23 attack groups (see Table 1).
For each group, we divided the attack variants into two subgroups for training
(i.e., to construct a b³) and testing (i.e., to test the b³’s classification accuracy
on unseen test examples). We performed induction using the attack training
examples plus 120 randomly-chosen benign programs from a fresh Windows in-
stallation. For all attack groups, we tried a token translator, a length translator,
and a character distribution translator as translation functions. A token
translator extracts character strings from the arguments, a length translator
encodes the argument length in bytes, and a character distribution translator
encodes the character distribution of the arguments.
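The three translators can be sketched as follows; the concrete encodings are our assumptions based on the descriptions above.

#include <array>
#include <cstddef>
#include <string>

// Token translator: extract the character string from the argument.
std::string tokenTranslate(const std::string& arg) { return arg; }

// Length translator: encode the argument length in bytes.
std::size_t lengthTranslate(const std::string& arg) { return arg.size(); }

// Character-distribution translator: relative frequency of each byte value.
std::array<double, 256> distributionTranslate(const std::string& arg) {
    std::array<double, 256> freq{};                  // zero-initialized
    for (unsigned char c : arg) freq[c] += 1.0;
    if (!arg.empty())
        for (double& f : freq) f /= static_cast<double>(arg.size());
    return freq;
}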
Table 1: Attack groups used in the experiments.

Attack Type    Attack Group Names
Worm           Donghe, Vorgon, Deborm, Klez, Libertine,
               Nimda, Gizer, Energy, Kelino, Shorm
Virus          CIH, Emotion, Belod, Evul, Mooder,
               Team, Inrar, Eva, Lash, Resur, Spit
Hacking Tool   Auha
DoS Tool       Lanxue
After the learning phase, our algorithm was evaluated on the testing ex-
amples of the aforementioned attack groups. While testing, we excluded any
examples that could not be processed by IDA Pro. Since we assume that disas-
sembly can be performed successfully by IDA Pro before detection, we do not
take into account those failed examples. Out of 79 testing examples, our algo-
rithm missed one instance of Auha but detected the rest of the attack variants
in the testing group (98.73% detection). We also tested the b³s against 1032
randomly-chosen normal programs. The system flagged 2 normal programs
(telnet.exe, wupdmgr.exe) as attacks (a 0.19% false-positive rate).
To see whether our algorithm is resilient to minor binary changes, we generated
basic building blocks from randomly chosen CIH samples from [2] and tested
them against the original copies of CIH.1010b and CIH.2690 as well as against
signature-removed versions of both. Our algorithm successfully detected all of
them. Also, we tested McAfee [1] VirusScan Enterprise ver. 8.0.0
with the latest virus definition against CIH.1010b and CIH.2690 virus samples
obtained from [2]. VirusScan successfully detected the original copies but it
failed to detect them after we manually removed (zeroed-out) the CIH.1010b
and CIH.2690 signature from the virus body with a binary editor.
One disadvantage of our approach is that our system may classify benign
programs as attacks if the benign programs are semantically similar to attacks.
This is a possible explanation for the two misclassified examples (telnet.exe,
wupdmgr.exe).
Malware detection must be efficient. Table 2 shows the low average CPU
time taken to transform unknown programs into examples and then classify
them as attack or non-attack.³

Table 2: Average CPU time for transformation and classification, by target size.

Target Size (KB)   4∼40   40∼100   100∼400   400∼1024
Time (sec)           52      210       288        381

³ These times were taken using an Intel Pentium M 1.0 GHz CPU with 512 MB of memory.
6 Related Work
There are two complementary approaches to malware detection: static and
dynamic, each approach having both strengths and weaknesses. This paper
focuses on a static approach. One very effective and popular static approach is
that of Sung et al., called SAVE [18]. Their malware signature is derived from
an API calling sequence; each API is mapped to an integer to encode the
sequence, and a sequence alignment algorithm computes a similarity score for
comparing malware variants. However, the resulting API sequence is nothing
more than a piece of syntactic information – so an adversary may defeat the
system by creating another malware variant in such a way that the program
has a totally different API sequence but still has the same semantic attack
goal. Furthermore, an attacker can randomize the API
sequences by inserting arbitrary APIs in the middle of the sequence that do not
affect the original attack goal.
In response to the problems with syntactic signatures, there has been a very
recent but growing trend toward semantic malware detection. A handful of
publications on the topic have appeared in the last couple of years. In 2005,
Christodorescu et al. [5] were the first researchers to provide a formal seman-
tics for malware detection. They manually developed a template that describes
malware semantic properties and demonstrated that their algorithm can detect
all variants of certain malware using the template with no false positives. Wang
et al. [20] proposed a system called Shield, which has vulnerability-specific,
exploit-generic network filters for preventing exploits against security vulner-
abilities. Shield is resilient to polymorphic or metamorphic worms. Sokolsky
et al. [17] used bisimulation to capture some of the semantics, but not at an
abstract level. Bruschi et al. [3] invented a semantic approach to handling au-
tomated obfuscations. Kinder et al. [9] used model checking to semantically
identify malware that deviates from a temporal logic correctness specification.
Scheirer and Chuah [14] developed a semantics-aware NIDS to detect buffer
overflow exploits.
There are two major reasons why our approach presents an advance beyond
these prior approaches. First, other than the model checker, all of these other
approaches look for local, rather than global, semantic attack patterns. Second,
they all require significant manual intervention, e.g., to develop a template,
graph, or other data structure representing desirable (or attack) behavior. The
problem with using local attack patterns is that an attacker can, at any time,
exploit this fact and mount her attack simply by avoiding those patterns in her
program. Furthermore, by capturing global semantics, a regular grammar
(rather than a context-free grammar as needed by Wagner and Dean [19]) suffices
for signatures. This results in a substantial computational advantage. The
problem with manual intervention is that it is time-consuming and impractical.
In contrast to these prior semantic approaches, ours looks for global patterns
and is almost fully automated.
Some researchers have focused on automating the generation of attack
signatures. For example, Autograph [8], Honeycomb [10], and EarlyBird [16]
analyze network streams and automatically produce signatures by extracting
common byte patterns; these signatures are then used to detect unknown
Internet worms. Unfortunately, these approaches are syntactic and local.
The most relevant prior work to our inductive inference approach is that of
Kephart et al., who developed a statistical method for automatically extracting
signatures from a corpus of machine code viruses [13]. Their approach differs
from ours because it is syntactic, which, as mentioned above, is problematic.
7 Conclusion

Although our approach cannot handle some of the more challenging malware,
such as code that self-mutates at run-time, or is specially packed/encrypted or
obfuscated, it is nevertheless broadly applicable. In particular, experimental
evaluation has demonstrated that our algorithm can detect a wide variety of
unknown attacks (viruses, worms) with low error rates, and that it is resilient to minor
binary changes.
Future work will focus primarily on optimizing the speed of our approach,
further testing over more examples, and methods for recovery after a malware
attack has been identified by a b³.
Acknowledgement
The authors would like to thank Insup Lee for suggesting the problem of
identifying the basic building blocks of malware.
References
[1] McAfee – antivirus software and intrusion prevention solutions. http://
www.mcafee.com/. Last accessed 10 Nov. 2005.
[3] Danilo Bruschi, Lorenzo Martignoni, and Mattia Monga. Using code nor-
malization for fighting self-mutating malware. In Proceedings of the Confer-
ence on Detection of Intrusions and Malware and Vulnerability Assessment.
IEEE Computer Society, 2006.
[4] Mihai Christodorescu and Somesh Jha. Testing malware detectors. In IS-
STA ’04: Proceedings of the 2004 ACM SIGSOFT international symposium
on Software testing and analysis, pages 34–44, New York, NY, USA, 2004.
ACM Press.
[5] Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Ran-
dal E. Bryant. Semantics-aware malware detection. In Proceedings of the
2005 IEEE Symposium on Security and Privacy, pages 32–46, Washington,
DC, USA, 2005. IEEE Computer Society.
[9] Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut
Veith. Detecting malicious code by model checking. In Lecture Notes in
Computer Science 3548, pages 174–187. Springer Verlag, 2005.
[10] Christian Kreibich and Jon Crowcroft. Honeycomb - Creating intrusion de-
tection signatures using honeypots. In Proceedings of the Second Workshop
on Hot Topics in Networks (Hotnets II), Boston, November 2003.
[12] S.B. Needleman and C.D. Wunsch. A general method applicable to the
search for similarities in the amino acid sequence of two proteins. J. Mol.
Biol., 48:443–453, 1970.
[14] Walter Scheirer and Mooi Chuah. Network intrusion detection with
semantics-aware capability. In Proceedings of the Second International
Conference on Security and Systems in Networks. IEEE Computer Soci-
ety, 2006.
[15] Jinwook Shin. The basic building blocks of attacks. Master’s thesis, Uni-
versity of Wyoming, 2006.
[16] Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Auto-
mated worm fingerprinting. In OSDI, pages 45–60, 2004.
[17] Oleg Sokolsky, Sampath Kannan, and Insup Lee. Simulation-based graph
similarity. In Lecture Notes in Computer Science 3920. Springer Verlag,
2006.
[18] Andrew H. Sung, Jianyun Xu, Patrick Chavez, and Srinivas Mukkamala.
Static analyzer of vicious executables (save). In ACSAC, pages 326–334,
2004.
[19] D. Wagner and D. Dean. Intrusion detection via static analysis. In SP ’01:
Proceedings of the 2001 IEEE Symposium on Security and Privacy, pages
156–169, Washington, DC, USA, 2001. IEEE Computer Society.
[20] Helen J. Wang, Chuanxiong Guo, Daniel R. Simon, and Alf Zugenmaier.
Shield: vulnerability-driven network filters for preventing known vulnera-
bility exploits. In SIGCOMM ’04: Proceedings of the 2004 conference on
Applications, technologies, architectures, and protocols for computer com-
munications, pages 193–204, New York, NY, USA, 2004. ACM Press.