The Basic Building Blocks of Malware
Abstract
Many security attacks accomplish their goals by controlling the inputs
to a target program. Based on this insight, we have developed a novel and
highly effective approach to developing malware signatures.¹ These
signatures, also called “basic building blocks” of malware, possess the
essential elements common to all malware of a certain class. The key to
the success of our approach is that it captures the global semantics of
malware. Experimental evaluation shows that our algorithm can detect
syntactic malware variants with low error rates, and outperforms currently
popular malware detection systems, such as McAfee VirusScan Enterprise.

¹ Our B³ approach has a patent pending. This project was supported by an ONR URI grant.
1 Introduction
The objective of our research is to detect malware, such as a virus, by recog-
nizing its underlying goals. Rather than identifying and representing malware
patterns syntactically, we adopt a semantic approach that discovers and cap-
tures the true underlying attack goal(s) of the malware. Why is a semantic
approach preferable? First, different syntactic representations may have the
same meaning. Second, it is easy for an attacker to obscure the program’s real
goals by inserting irrelevant function calls or changing the binary code’s superfi-
cial appearance. Third, a significant portion of the malware might be incidental
or irrelevant to its attack goal. For example, some code may merely perform
normal operations, such as memory allocation and initialization, to set up for a
subsequent real attack. In contrast, our approach detects all malware variants
with semantically identical attack goals.
Christodorescu et al. [5] proposed the first approach to semantics-aware mal-
ware detection. However, their approach is manual and depends on local pat-
terns; an attacker may easily mount her attack by avoiding such local attack
patterns. To avoid this problem, we focus on global, rather than local seman-
tics. Global semantics refers to the structure of the entire program as a whole,
whereas local semantics refers to individual system calls. To the best of our
knowledge, our approach is the first to capture global semantics in a malware
signature, and it does so with almost complete automation.
Our malware signature, called a basic building block (b³), is constructed by
translating code to graphical structures, abstraction, extraction of semantics,
and finally inductive inference. Our contributions are the b³ signature itself
and the almost fully automated algorithm that discovers it.
2 Attack Goals
Initially, a malware program has no control over the target program. But it
does have control over the input to the target program. It takes control of the
target program via malicious input.
The input to the target is produced as malicious output by the malware
program; we call the malware program’s outputs its attack goals. We coarsely
divide security attacks into memory-based (such as buffer overflow or format
string attacks) and function-call-based (or simply function-based) attacks, ac-
cording to the main strategy utilized by the malware program.
The focus here is on function-based attacks, which produce hostile actions
on the victim’s system, such as opening a TCP port to send a copy of itself
to remote machines, dropping a backdoor, deleting or intercepting sensitive
information, modifying system configurations of the victim’s machine, and so
on. The goal of function-based attacks is usually expressed as hostile actions,
e.g., by invoking certain function calls. Most of today’s viruses, worms, Trojans,
backdoors, DoS tools, and other hacking tools, which are written in high-level
languages such as C/C++, are function-based attacks.
The fundamental difference between previous research and our work in find-
ing a malware signature is that we do not rely on local attack patterns, such as
binary pattern matching. Instead, we take a global view as to what the ultimate
attack goals are and how they relate to each other. To this end, we analyze out-
puts of a malware program – because they represent potential attack goals. For
example, Figure 1 shows two example programs that are syntactically different,
but have semantically identical goals. A typical modern malware detection sys-
tem would create two signatures – one for each program. Our system recognizes
their identical semantics and generates a single semantic malware signature.
As an aside, note that we are not claiming to have solved the program
equivalence problem, which is undecidable. We have instead addressed the
problem of determining whether a new unseen program belongs to a particular
class of malware, which is a machine learning problem in the standard supervised
learning paradigm [11].
3 B³ Discovery Algorithm
3.1 Overview
A basic building block, which is a model of malware attacks, is constructed
from a set of attack and non-attack programs as follows:
1. Convert each program (attack or non-attack) into a graph. A
graphical representation is used because it is easier to generalize over.
Since an attack program’s source code is often unavailable, its executable
binary must first be transformed into a tractable high-level representation
for the graph. The IDA Pro disassembler [6] is used to automate this
process; it obtains assembly code from binary code. Because IDA Pro is
unable to unpack/decrypt binary code, we first manually unpack and/or
decrypt the program. The assembly code is then converted to a graph
that is a hybrid of control flow and data dependence graphs.
2. Partition the graph into subgraphs. For abstraction, the overall
graph is divided into subgraphs, each containing a program subgoal or
terminal function.
3. Semantic abstraction. Semantic abstraction is the key to making our
approach scalable. With abstraction, the graph is boiled down to its
skeletal semantic essence. Our abstraction algorithm inputs a graph that
has been divided into subgraphs, and outputs a finite-state machine (FSM)
that captures global program semantics. An FSM representation has been
chosen because it simplifies the induction process.
int main(void){
    FILE* fp = NULL;                //file pointer
    char* data = "abcde";
    fp = fopen("test", "w");        //opens a file
    if(fp == NULL) exit(1);         //if fopen() fails, then exit
    fputs(data, fp);                //writes data in the file
    fclose(fp);                     //closes the file
    int c;
    fp = fopen("foo", "r");         //opens a file
    c = fgetc(fp);                  //reads data
    fclose(fp);                     //closes the file
}

Program A

int main(void){
    HANDLE h;                       //file handle
    char buffer[1024];
    strncpy(buffer, "abcde", 5);
    h = CreateFile("test"...);      //opens a file
    if(h == INVALID_HANDLE_VALUE)
        ExitProcess(1);             //if CreateFile() fails, then exit
    WriteFile(h, buffer, 5,...);    //writes data in the file
    CloseHandle(h);                 //closes the file
}

Program B

Figure 1: Two syntactically different programs, A and B, with semantically identical goals.
3.2 Graph Construction and Pruning
Malware assembly code is converted to a graph. This graph is composed of
both a control flow graph (CFG) and a data dependence graph (DDG). CFGs
enable us to logically interconnect subgoals in the later abstraction phase, and
DDGs help recover function call arguments, also used in abstraction.
DDGs are used to recover function call arguments. For each function call,
our algorithm identifies a function-call node in the graph. The algorithm then
follows reverse paths in the graph from the function-call node to its data sources
in the data dependence graph. It halts when it gets to graph nodes containing
values that are statically known, and it removes all subgraphs earlier than these
nodes. This procedure, which is a form of backward slicing, significantly prunes
the graph size. In static data flow analysis, some data values, such as function
pointers, are impossible to recover because they are statically unknown. Each
statically unknown value is replaced with a question mark.
Backward slicing [7] is then performed with the CFG. The algorithm begins
toward the end of the program, at the location where the program emits a ma-
licious output intended to be sent to the target program as input. We predefine
output-emitting functions and terminal functions to identify these locations in
the program. The algorithm then follows the reverse control flow edges in the
graph. During this backward CFG traversal, every subgraph identified as “not
semantically critical” (i.e., not output-emitting in terms of security attacks) is
pruned from the graph [15].
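A hedged sketch of this backward-slicing prune, assuming a node-indexed graph where each node stores its reverse data-dependence edges; the structure and field names are our assumptions, not the paper's implementation.

#include <queue>
#include <set>
#include <string>
#include <vector>

struct Node {
    std::string value;            // "?" if the value is statically unknown
    bool staticallyKnown = false; // halt the backward walk at such nodes
    std::vector<int> dataSources; // reverse DDG edges (argument definitions)
};

// Walk backward from a function-call node to its statically known data
// sources. Everything not reached by this walk lies "earlier" than the
// slice and is pruned from the graph.
std::set<int> backwardSlice(const std::vector<Node>& g, int callNode) {
    std::set<int> keep;
    std::queue<int> work;
    work.push(callNode);
    while (!work.empty()) {
        int n = work.front(); work.pop();
        if (!keep.insert(n).second) continue; // already visited
        if (g[n].staticallyKnown) continue;   // boundary of the slice
        for (int src : g[n].dataSources) work.push(src);
    }
    return keep;                              // nodes outside 'keep' are pruned
}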
3.3 Subgraphs
After graph construction and pruning, the next step is to prepare for ab-
straction. The graph is divided into multiple subgraphs, called “subblocks.”
Each subblock will become an abstract element (indivisible unit/node) in an
abstract graph, called a “finite-state machine/transducer.” The fundamental
basis of each subblock is either a security-critical or terminal function, to be
defined next.
A security-critical function is one that generates a suboutput, i.e., an action
that is critical from a security standpoint, and which can be used to formu-
late the final (attack) program’s (malicious) output. Examples include creat-
ing/deleting a file, sending network data, or modifying system configurations.
Any function that is not security-critical is called non-security-critical. A termi-
nal function is one that causes a program to terminate, e.g., exit. An example
of a non-terminal function is send.
The key to embedding semantics into our approach is that we group security-
critical and terminal functions according to their semantic properties. For ex-
ample, functions such as fputc, fputs and write all share the same semantic
functional meaning to the system: they write data to a file stream. A unique
function group number is assigned to each group of semantically similar func-
tions.
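As a small illustration, a grouping table might look like the sketch below; the particular entries and group numbers are our assumptions for this example.

#include <map>
#include <string>

// Map each function name to its semantic group number; functions with the
// same functional meaning (e.g., writing to a file stream) share a group.
int functionGroup(const std::string& callee) {
    static const std::map<std::string, int> groups = {
        {"fputc", 1}, {"fputs", 1}, {"write", 1},   // write data to a stream
        {"fopen", 2}, {"CreateFile", 2},            // open a file
        {"exit", 3},  {"ExitProcess", 3},           // terminal functions
    };
    auto it = groups.find(callee);
    return it == groups.end() ? 0 : it->second;     // 0 = no semantic group
}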
To detect subblock boundaries, a semantic prologue (SP) and semantic epilogue
(SE) pair is defined. A semantic prologue is the set of functions that must
be executed before a security-critical function call, and a semantic epilogue is
the set of functions that must be executed after the function call.
In summary, a subblock consists of a security-critical or terminal function as
its basis, and an SP-SE pair for subblock boundary delineation. For example,
Figure 2 shows three subblocks for program A in Figure 1. Formally, a subblock
is its basis function together with the SP-SE pair that delineates its boundary.
Figure 2: Subblocks of program A in Figure 1. Dotted lines indicate data depen-
dencies and solid lines control flows. Note that there are only three subblocks
in the graph because fgetc is a non-security-critical function.
Note that some subblocks may be both security-critical and terminal. If this
is the case, we split the subblock into two subblocks (a security-critical subblock
and a terminal subblock) and serially connect them. If there is a control branch
from a subblock, an empty subblock is inserted at the branching point, with
ε-transitions as connections. For example, Figure 3 shows the abstract
finite-state transducer (abstract-FST) for programs A and B in Figure 1. Note
that both translate semantically to the same abstract-FST.
Figure 3: Abstract-FST for programs A and B in Figure 1. Bold circles represent
the final subblocks. Since subblock 3 is both security-critical and terminal, an
extra final subblock is appended at the end of it. An empty subblock (the second
subblock from the left) is inserted for a control branch.
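A sketch of the splitting rule described before Figure 3, under a hypothetical state-list representation of the abstract-FST; all names and fields are ours, and an empty INPUT stands for an ε-transition.

#include <string>
#include <vector>

struct State {
    std::string input;             // function-group symbol; "" means epsilon
    bool securityCritical = false;
    bool terminal = false;
    std::vector<int> next;         // indices of successor states
};

// Split every state that is both security-critical and terminal into a
// security-critical state serially connected to a new, purely terminal state.
void splitCriticalTerminals(std::vector<State>& fst) {
    const size_t n = fst.size();
    for (size_t i = 0; i < n; ++i) {
        if (fst[i].securityCritical && fst[i].terminal) {
            State term;                     // new final state; emits nothing
            term.terminal = true;
            fst.push_back(term);
            fst[i].terminal = false;        // original keeps only criticality
            fst[i].next.push_back(static_cast<int>(fst.size()) - 1);
        }
    }
}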
3.5.1 Creation and Alignment of Examples
Concept learning is done over positive and negative examples, which are
collectively called training examples for the learner.
To convert a string, x, to a positive or negative training example, it is nec-
essary to augment the string with a frequency vector, V . The purpose of this
vector is to give higher weight to more frequent attack patterns. In particular,
the frequency vector encodes information regarding which INPUT symbols ap-
pear more often in the positive examples and less often in the negative examples.
It is initialized to be all 0s, and it is updated during induction (as described in
the following subsection).
Prior to induction, examples are aligned according to their INPUT symbols.
We use a sequence alignment technique from [12] to find an optimal alignment
between strings (and, later, between a string and the model) and to calculate
a similarity score. These scores are divided by the maximum string (sequence)
length – to express similarity as a percentage. All aligned strings are padded
to the same length; an underscore (_) denotes the padding placeholder. When
strings are aligned, this placeholder often appears opposite an ε-transition.
Note that |x| is defined to be the length of string x, where ε’s are included.
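For concreteness, here is a minimal global-alignment similarity in the style of Needleman–Wunsch [12], with one character per INPUT symbol; the scoring scheme (+1 per match, 0 for gaps and mismatches) is our assumption rather than the paper's, and the score is divided by the longer string's length.

#include <algorithm>
#include <string>
#include <vector>

// Global alignment score between two symbol strings, normalized by the
// maximum string length to express similarity as a fraction of 1.
double similarity(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::max({ d[i - 1][j],            // gap in b
                                 d[i][j - 1],            // gap in a
                                 d[i - 1][j - 1] + (a[i - 1] == b[j - 1]) });
    size_t maxLen = std::max(a.size(), b.size());
    return maxLen ? static_cast<double>(d[a.size()][b.size()]) / maxLen : 1.0;
}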
There is only one terminal subblock for any sub-machine and this terminal
subblock does not emit any output. Recall that if a subblock is both output-
emitting and terminal, then we split the subblock into two subblocks and serially
connect them, so that the last subblock is always terminal, not output-emitting.
Therefore, when we align two sub-machines, the terminal subblocks are always
merged into a single final state.
Next, we define the union operator that performs generalization. To simplify
our formal definitions of the union and difference operators, training examples
are expressed using the same notation as models. In fact, they are actually
degenerate models (i.e., models without disjunction), so this is reasonable.
The union operation merges equivalent aligned states in the following manner.
Define any pair of aligned states q1 ∈ a1 and q2 ∈ a2 whose INPUTs (on the
outgoing edges) are equal to be equivalent. These equivalent states merge into
a single state q in a1 ∪ a2. Any transitions leading into/out of q1 or q2 in a1
or a2 now lead into/out of this single state q in a1 ∪ a2. The OUTPUT of this
state becomes a function of the product of the OUTPUTs of the two original
states, i.e., γ = Fg (γ1 , γ2 ), where γ1 ∈ Γ1 , γ2 ∈ Γ2 , γ ∈ Γ, and Γ = Fg (Γ1 × Γ2 ).
Also, a state with ε INPUT is considered equivalent to any state aligned with
it during the union (but not difference) operation; its INPUT becomes that of
the state with which it is aligned. If an ε OUTPUT symbol appears in either
q1 or q2 (or both), then Fg returns either ε or an implementation-specific value.
Finally, Qe is the set of all aligned states that are equivalent in a1 and a2 (e.g.,
q ∈ Qe ); Se is the subset of these that are start states.
During the union operation, the frequency vector V is updated with the
following sequence of steps:
1. V = 0.
2. V = V1 + V2 .
3. If the nth INPUT symbols in a1 and a2 match, then increase the nth
element in V by one.
After generalization has completed over all positive examples, specialization
is performed over all negative examples. Specialization subtracts each negative
example, one-by-one, from the model via a difference operator – to omit elements
specific to non-attacks.
During the difference operation, V is updated with the following rule:
1. V = V1 .
2. If the nth INPUT symbols in a1 and a2 match, then remove the nth
element in V .
The idea of deleting the nth element is to give no weight to any subblock that
also exists in non-attacks. Figure 4 gives an example of generalization followed
by specialization.
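The two frequency-vector updates can be sketched as follows, assuming already-aligned, equal-length INPUT strings (one character per symbol) and index-aligned vectors; we interpret "removing" the nth element as zeroing it, per the stated intent of giving it no weight.

#include <string>
#include <vector>

// Union (generalization): V = V1 + V2, plus one for every aligned position
// whose INPUT symbols match.
std::vector<int> unionUpdate(const std::vector<int>& v1,
                             const std::vector<int>& v2,
                             const std::string& in1, const std::string& in2) {
    std::vector<int> v(v1.size(), 0);      // step 1: V = 0
    for (size_t n = 0; n < v.size(); ++n) {
        v[n] = v1[n] + v2[n];              // step 2: V = V1 + V2
        if (in1[n] == in2[n]) ++v[n];      // step 3: matching INPUT symbols
    }
    return v;
}

// Difference (specialization): start from V1 and zero every position whose
// INPUT symbol also occurs in the aligned negative (non-attack) example.
std::vector<int> differenceUpdate(std::vector<int> v,
                                  const std::string& in1,
                                  const std::string& in2) {
    for (size_t n = 0; n < v.size(); ++n)
        if (in1[n] == in2[n]) v[n] = 0;    // give this subblock no weight
    return v;
}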
A separate b³, with this parameter optimization, is constructed for each attack
group (class).
4 Classification
The following approach is used to classify new, unseen examples as “ATTACK”
or “NON-ATTACK.” Each new example is compared with the b³; recall that
partial matching is used. To calculate a similarity (or matching) score between
the learned b³ and a previously unseen example, we first obtain β and γ for the
learned b³ and use those parameters to compute a new similarity score α′
between the b³ and the unseen sub-machines. This score is used to classify new
examples: a new example is labeled “ATTACK” only if α′ exceeds the threshold
α from the learned b³.
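The decision rule itself reduces to a simple threshold test; a minimal sketch, with parameter names assumed:

#include <string>

// Label a new example "ATTACK" only if its similarity score against the
// learned b3 exceeds the learned threshold.
std::string classify(double alphaPrime, double alpha) {
    return alphaPrime > alpha ? "ATTACK" : "NON-ATTACK";
}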
5 Experimental Results
We tested our algorithm against all variants of 23 attack groups (see Table 1).
For each group, we divided the attack variants into two subgroups for training
(i.e., to construct a b³) and testing (i.e., to test the b³’s classification accuracy
on unseen test examples). We performed induction using the attack training
examples plus 120 randomly-chosen benign programs from a fresh Windows in-
stallation. For all attack groups, we tried a token translator, a length translator,
and a character distribution translator as translation functions. A token
translator extracts character strings from the arguments, a length translator
encodes the argument length in bytes, and a character distribution translator
encodes the character distribution of the arguments.
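The three translators can be sketched as follows; the concrete encodings are our assumptions based on the descriptions above.

#include <array>
#include <cstddef>
#include <string>

// Token translator: extract the character string from the argument.
std::string tokenTranslate(const std::string& arg) { return arg; }

// Length translator: encode the argument length in bytes.
std::size_t lengthTranslate(const std::string& arg) { return arg.size(); }

// Character-distribution translator: relative frequency of each byte value.
std::array<double, 256> distributionTranslate(const std::string& arg) {
    std::array<double, 256> freq{};                  // zero-initialized
    for (unsigned char c : arg) freq[c] += 1.0;
    if (!arg.empty())
        for (double& f : freq) f /= static_cast<double>(arg.size());
    return freq;
}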
Table 1: Attack groups used in the experiments.

Attack Type    Attack Group Names
Worm           Donghe, Vorgon, Deborm, Klez, Libertine,
               Nimda, Gizer, Energy, Kelino, Shorm
Virus          CIH, Emotion, Belod, Evul, Mooder,
               Team, Inrar, Eva, Lash, Resur, Spit
Hacking Tool   Auha
DoS Tool       Lanxue
After the learning phase, our algorithm was evaluated on the testing ex-
amples of the aforementioned attack groups. While testing, we excluded any
examples that could not be processed by IDA Pro. Since we assume that disas-
sembly can be performed successfully by IDA Pro before detection, we do not
take into account those failed examples. Out of 79 testing examples, our algo-
rithm missed one instance of Auha but detected the rest of the attack variants
in the testing group (98.73% detection). We also tested the b³s against 1032
randomly-chosen normal programs. The system flagged 2 normal programs
(telnet.exe, wupdmgr.exe) as attacks (a 0.19% false-positive rate).
To see whether our algorithm is resilient to minor binary changes, we generated
basic building blocks from randomly chosen CIH samples from [2] and tested
them against the original copies of CIH.1010b and CIH.2690 as well as against
signature-removed versions of both. Our algorithm successfully detected all of
them. Also, we tested McAfee [1] VirusScan Enterprise ver. 8.0.0
with the latest virus definition against CIH.1010b and CIH.2690 virus samples
obtained from [2]. VirusScan successfully detected the original copies but it
failed to detect them after we manually removed (zeroed-out) the CIH.1010b
and CIH.2690 signature from the virus body with a binary editor.
One disadvantage of our approach is that our system may classify benign
programs as attacks if the benign programs are semantically similar to attacks.
This is a possible explanation for the two misclassified examples (telnet.exe,
wupdmgr.exe).
Malware detection must be efficient. Table 2 shows the low average CPU
time taken to transform unknown programs into examples and then classify
them as attack or non-attack.³

Table 2: Average CPU time for transformation and classification, by target size.

Target Size (KB)   4∼40   40∼100   100∼400   400∼1024
Time (sec)           52      210       288        381

³ These times were taken using an Intel Pentium M 1.0 GHz CPU with 512 MB of memory.
6 Related Work
There are two complementary approaches to malware detection: static and
dynamic, each approach having both strengths and weaknesses. This paper
focuses on a static approach. One very effective and popular static approach is
that of Sung et al., called SAVE [18]. Their malware signature is derived from
an API calling sequence; each API is mapped to an integer to encode the
sequence, and a sequence alignment algorithm computes a similarity score for
comparing malware variants. However, the resulting API sequence is nothing
more than a piece of syntactic information – so an adversary may defeat the
system by creating another malware variant in such a way that the program
has a totally different API sequence but still has the same semantic attack
goal. Furthermore, an attacker can randomize the API
sequences by inserting arbitrary APIs in the middle of the sequence that do not
affect the original attack goal.
In response to the problems with syntactic signatures, there has been a very
recent but growing trend toward semantic malware detection. A handful of
publications on the topic have appeared in the last couple of years. In 2005,
Christodorescu et al. [5] were the first researchers to provide a formal seman-
tics for malware detection. They manually developed a template that describes
malware semantic properties and demonstrated that their algorithm can detect
all variants of certain malware using the template with no false positives. Wang
et al. [20] proposed a system called Shield, which has vulnerability-specific,
exploit-generic network filters for preventing exploits against security vulner-
abilities. Shield is resilient to polymorphic or metamorphic worms. Sokolsky
et al. [17] used bisimulation to capture some of the semantics, but not at an
abstract level. Bruschi et al. [3] invented a semantic approach to handling au-
tomated obfuscations. Kinder et al. [9] used model checking to semantically
identify malware that deviates from a temporal logic correctness specification.
Scheirer and Chuah [14] developed a semantics-aware NIDS to detect buffer
overflow exploits.
There are two major reasons why our approach presents an advance beyond
these prior approaches. First, other than the model checker, all of these other
approaches look for local, rather than global, semantic attack patterns. Second,
they all require significant manual intervention, e.g., to develop a template,
graph, or other data structure representing desirable (or attack) behavior. The
problem with using local attack patterns is that an attacker can, at any time,
exploit this fact and mount her attack simply by avoiding those patterns in her
program. Furthermore, by capturing global semantics, a regular grammar
(rather than a context-free grammar as needed by Wagner and Dean [19]) suffices
for signatures. This results in a substantial computational advantage. The
problem with manual intervention is that it is time-consuming and impractical.
In contrast to these prior semantic approaches, ours looks for global patterns
and is almost fully automated.
Some researchers have focused on automating the generation of attack
signatures. For example, Autograph [8], Honeycomb [10], and EarlyBird [16]
analyze network streams and automatically produce signatures by extracting
common byte patterns; these signatures are then used to detect unknown
Internet worms. Unfortunately, these approaches are syntactic and local.
The most relevant prior work to our inductive inference approach is that of
Kephart et al., who developed a statistical method for automatically extracting
signatures from a corpus of machine code viruses [13]. Their approach differs
from ours because it is syntactic, which, as mentioned above, is problematic.
7 Conclusion

Although our approach cannot handle some of the more challenging malware,
such as code that self-mutates at run-time, or is specially packed/encrypted or
obfuscated, it is nevertheless broadly applicable. In particular, experimental
evaluation has demonstrated that our algorithm can detect a wide variety of
unknown attacks (viruses, worms) with low error rates, and that it is resilient to minor
binary changes.
Future work will focus primarily on optimizing the speed of our approach,
further testing over more examples, and methods for recovery after a malware
attack has been identified by a b³.
Acknowledgement
The authors would like to thank Insup Lee for suggesting the problem of
identifying the basic building blocks of malware.
References
[1] McAfee – antivirus software and intrusion prevention solutions. http://
www.mcafee.com/. Last accessed 10 Nov. 2005.
[3] Danilo Bruschi, Lorenzo Martignoni, and Mattia Monga. Using code nor-
malization for fighting self-mutating malware. In Proceedings of the Confer-
ence on Detection of Intrusions and Malware and Vulnerability Assessment.
IEEE Computer Society, 2006.
[4] Mihai Christodorescu and Somesh Jha. Testing malware detectors. In IS-
STA ’04: Proceedings of the 2004 ACM SIGSOFT international symposium
on Software testing and analysis, pages 34–44, New York, NY, USA, 2004.
ACM Press.
[5] Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Ran-
dal E. Bryant. Semantics-aware malware detection. In Proceedings of the
2005 IEEE Symposium on Security and Privacy, pages 32–46, Washington,
DC, USA, 2005. IEEE Computer Society.
[9] Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut
Veith. Detecting malicious code by model checking. In Lecture Notes in
Computer Science 3548, pages 174–187. Springer Verlag, 2005.
[10] Christian Kreibich and Jon Crowcroft. Honeycomb - Creating intrusion de-
tection signatures using honeypots. In Proceedings of the Second Workshop
on Hot Topics in Networks (Hotnets II), Boston, November 2003.
[12] S.B. Needleman and C.D. Wunsch. A general method applicable to the
search for similarities in the amino acid sequence of two proteins. J. Mol.
Biol., 48:443–453, 1970.
[14] Walter Scheirer and Mooi Chuah. Network intrusion detection with
semantics-aware capability. In Proceedings of the Second International
Conference on Security and Systems in Networks. IEEE Computer Soci-
ety, 2006.
[15] Jinwook Shin. The basic building blocks of attacks. Master’s thesis, Uni-
versity of Wyoming, 2006.
[16] Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Auto-
mated worm fingerprinting. In OSDI, pages 45–60, 2004.
[17] Oleg Sokolsky, Sampath Kannan, and Insup Lee. Simulation-based graph
similarity. In Lecture Notes in Computer Science 3920. Springer Verlag,
2006.
[18] Andrew H. Sung, Jianyun Xu, Patrick Chavez, and Srinivas Mukkamala.
Static analyzer of vicious executables (save). In ACSAC, pages 326–334,
2004.
[19] D. Wagner and D. Dean. Intrusion detection via static analysis. In SP ’01:
Proceedings of the 2001 IEEE Symposium on Security and Privacy, pages
156–169, Washington, DC, USA, 2001. IEEE Computer Society.
[20] Helen J. Wang, Chuanxiong Guo, Daniel R. Simon, and Alf Zugenmaier.
Shield: vulnerability-driven network filters for preventing known vulnera-
bility exploits. In SIGCOMM ’04: Proceedings of the 2004 conference on
Applications, technologies, architectures, and protocols for computer com-
munications, pages 193–204, New York, NY, USA, 2004. ACM Press.