Detecting Computer Viruses using GPUs
Alexandre Nuno Vicente Dias
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Examination Committee
Chairperson: Prof. Pedro Manuel Moreira Vaz Antunes de Sousa
Supervisor: Prof. José Carlos Alves Pereira Monteiro
Member: Prof. Alexandre Paulo Lourenço Francisco
November 2012
Acknowledgments
After five years at IST, I am a different person. More rational, above all. But in many respects I remain the same as I have always been. I would not change anything about the years spent at this institution. I learned a great deal, not only at a technical level but also at a personal level.
I think the most valuable lesson I learned is that everything is possible, as long as one dedicates the necessary time to it.
At this point, with my future still uncertain, I hope to follow a path that allows me, throughout my life, to apply as much as possible the knowledge I acquired during my degree.
I want to thank, above all, my parents and my sister: without you and your support I would never have gotten to where I am. It is because of you that I am the person I am.
Secondly, my closest friends during these last years. You know who you are. I will never forget the times we spent here, nor the support you gave me this past year.
Finally, my supervisor and co-supervisor, respectively Prof. José Monteiro and Nuno Lopes, who guided and helped me so much during this work. I learned a lot this year and I am very grateful for it.
Let the next stage of this journey begin.
Lisbon, November 7, 2012
Alexandre Dias
Resumo
Anti-virus products are the main defense against malicious programs, which are becoming increasingly common and advanced. A large part of the process of scanning a file or program is devoted to checking whether any signature corresponding to a virus is contained in that file or program. Since it is important that the scanning process be as fast as possible, the time spent checking whether a signature is contained in a file must be minimized as much as possible. Recently, graphics cards have grown in popularity for computation with high performance requirements, due to their parallel architecture. One of the possible applications of graphics cards is string matching, which is the process performed to check whether a signature is contained in a file. In this work we present the details of the implemented system, a parallel string matching system based on deterministic finite automata that runs on a graphics card. Due to the space problems inherent to these automata, our system only performs matching using part of each signature, which leads it to serve as a pre-filtering stage for the files that have to be scanned. Multiple optimizations were implemented to reduce its execution time. In our tests with sets of test files, our system achieved a speedup of approximately 28 when compared to the pattern matching portion of ClamAV, an anti-virus used in industry. However, on other sets of test files, our system does not achieve as good an improvement. Future work is presented to improve the system under these conditions.
Keywords: security, viruses, string matching, GPU computing, CUDA
Abstract
Anti-virus software is the main defense mechanism against malware, which is becoming more common
and advanced. A significant part of the virus scanning process is dedicated to scanning a given file against
a set of virus signatures. As it is important that the overall scanning process be as fast as possible, effort must be made to minimize the time spent on signature matching. Recently, graphics processing units have
increased in popularity in high performance computation, due to their inherently parallel architecture.
One of their possible applications is performing matching of multiple signatures in parallel. In this
work, we present details on the implemented multiple string searching algorithm based on deterministic
finite automata which runs on a graphics processing unit. Due to space concerns inherent to DFAs,
our algorithm only scans for a substring of each signature, thereby serving as a high-speed pre-filtering
mechanism. Multiple optimizations were implemented in order to increase its performance. In our
experiments with sets of test files, the implemented solution was found to have a speedup of around 28
when compared to the pattern matching portion of ClamAV, an open-source anti-virus engine. On other
sets of test files with different characteristics, the solution does not perform as well, but
future work is described to improve it in these situations.
Keywords: network security, virus, string matching, GPGPU Computing, CUDA
Contents

List of Figures
List of Tables
Acronyms
1 Introduction
2 General Purpose Computing on Graphics Processing Units
  2.1 Comparing CPUs and GPUs
    2.1.1 SIMD Model
    2.1.2 Threads
    2.1.3 Memory
  2.2 Hardware Architecture
  2.3 Programming Interfaces and CUDA
    2.3.1 CUDA Model
    2.3.2 CUDA Workflow
  2.4 Reflections
3 Related Work
  3.1 Threat Detection
    3.1.1 SNORT
    3.1.2 ClamAV
  3.2 Pattern Matching
    3.2.1 Single-pattern Matching
    3.2.2 Multiple-pattern Matching
    3.2.3 A Sub-linear Scanning Complexity Algorithm
    3.2.4 Rearranging the Aho-Corasick States
    3.2.5 Re-writing some Regular Expression Rules
    3.2.6 Symbolic Finite State Transducers
  3.3 Pattern Matching on GPUs
    3.3.1 Using SNORT Signatures
    3.3.2 Using ClamAV Signatures
  3.4 Summary
4 System Architecture
  4.1 Algorithm Decisions
  4.2 Design Decisions
    4.2.1 Reducing Memory Space
    4.2.2 GPU Memory Space for the Files and Work Division
    4.2.3 Mapping to the architecture
  4.3 Optimizations
    4.3.1 Memory Accesses
    4.3.2 Texture Memory
    4.3.3 Final States and Signature Types
    4.3.4 Coalescing
    4.3.5 Double-buffering
  4.4 Summary
5 Evaluation
  5.1 Experiment Setup
  5.2 Results
6 Future Work and Conclusions
  6.1 Problematic Signatures
  6.2 Scanning Method for Large Files
  6.3 Incremental Automata Changes
  6.4 Transposing the Files using SSE Instructions
  6.5 Verification Mechanism
  6.6 Conclusions
Bibliography
List of Figures

2.1 Comparison of single precision performance between modern CPUs and GPUs.
2.2 Comparison of double precision performance between modern CPUs and GPUs.
2.3 The Fermi architecture.
2.4 The Fermi shared multiprocessor hardware architecture.
2.5 Mapping of CUDA's software abstractions to hardware.
3.1 Example SNORT rule.
3.2 An example of performing naive string matching.
3.3 NFA for the patterns ab*c, c and ad built from applying Thompson's algorithm.
3.4 DFA for the patterns ab*c, bc and ad.
3.5 Complete DFA for the patterns ab*c, bc and ad.
4.1 Processing the signatures and sending them to the GPUs.
4.2 Matching process.
4.3 Structure of the automata.
4.4 Automata after the re-structuring for texture memory use.
4.5 Contiguous and interleaved files. Blocks marked with an X have no meaningful data in them, as they are out of a file's boundaries.
4.6 Difference between sending a whole file and sending it in two parts.
4.7 Difference between execution times of sending a single file parts and sending several parts.
5.1 Execution times for random files with the same size.
5.2 Execution times for files with a same size.
5.3 Execution times for different numbers of false positives.
List of Tables

3.1 Distribution of signature types in ClamAV.
3.2 Results of profiling ClamAV by scanning a large file.
3.3 Transition table for the DFA in Figure 3.4.
5.1 Execution times for files with different sizes.
Acronyms

CPU     Central Processing Unit
GPU     Graphics Processing Unit
SDKs    Software Development Kits
CUDA    Compute Unified Device Architecture
BWT     Burrows-Wheeler Transform
IDS     Intrusion Detection System
GFLOPS  Giga Floating Point Operations per Second
MIMD    Multiple Instruction Multiple Data
SIMD    Single Instruction Multiple Data
ECC     Error-Correcting Code
SSE     Streaming SIMD Extensions
NFA     Non-deterministic Finite Automaton
DFA     Deterministic Finite Automaton
Chapter 1
Introduction
As technology gets more advanced, malware writers are using a whole new set of techniques to build
malware that is more efficient and harder to detect. Malware is also becoming more prevalent on the web: for instance, anti-virus companies can receive malware samples at a rate of about one sample every 5 seconds. Due to this growth, IT professionals and regular users alike face various challenges when it comes to protection. One example of an effort to strengthen the defenses against viruses is the cooperation between anti-virus companies to detect a new variation of a virus (or a new virus itself): samples of recent viruses are placed into a shared repository to which the anti-virus companies have access. Yet, there is still a time window (from the moment the virus first appears to the moment an anti-virus company updates its product) in which users are vulnerable.
Anti-virus products try to minimize the chance of infection of a machine, employing various techniques
to do so. For instance, they might take a sample and analyze its behavior at run-time, in order to check if the sample does anything that it is not supposed to. However, this run-time analysis should not be the first thing done to a newly arrived sample, as it is very time-consuming. The first step in identifying a virus is usually to scan the file, matching its body against a set of rules or patterns called signatures. Each signature can correspond to a different virus or strain of one, and if it occurs in the body of a file, then we know that the file is malicious.
As stated above, viruses are becoming more and more common on the web, and due to this, signature
sets can only increase in size. Scanning a single file for all of them can be very time-consuming, and if the search were done in a naive manner (repeatedly comparing the text characters against all of the signatures), the process would surely not be completed in an acceptable time frame. In anti-virus products, this process has to be especially fast, because if it is too slow, the anti-virus application becomes the bottleneck of the system, which can cause a significant decrease in the overall
system performance.
Anti-virus products and intrusion detection systems (which scan network packets against a rule set) employ various algorithms to speed up the scanning process. Such algorithms address a problem called string matching, which, as the name indicates, is the problem of finding a given string in a text. These products have an entire signature set, so they must scan for multiple strings inside the same text (which corresponds to the body of a file). This problem is a generalization of the previous one, and it is called multi-pattern matching. Many techniques have been implemented to try to speed up multi-pattern matching, some of them successfully. These techniques are mainly small changes to the algorithms, but some of them focus on implementing the algorithms on specialized hardware, rather than on a Central Processing Unit (CPU). A common hardware choice for such implementations is the Graphics Processing Unit (GPU).
The utilization of GPUs for general purpose computing is rapidly gaining popularity, due to the
massive parallelism inherent to these devices. GPUs are specialized for computationally intensive and
highly parallel tasks (namely graphics rendering), and as such, they are designed so that more transistors
are devoted to data processing rather than data caching and flow control. A factor that helped GPUs
gain popularity in general purpose computing was the release of Software Development Kits (SDKs) by
NVIDIA and AMD, which are to be used with their respective GPUs. These SDKs provide much-needed help for developers wanting to start programming on GPUs.
Currently, GPUs are used for a whole range of applications, from biomedical calculations to brute-forcing MD5 hashes. When it comes to multi-pattern matching, research has shown that GPUs can bring significant speedups to the process. It is due to such conclusions, and due to the parallel nature of GPUs, that we have decided to adopt them in order to speed up the multi-pattern matching part of an
anti-virus engine.
As we will see in Chapter 3, anti-virus engines are complex programs composed of different parts
which have different computational costs. We have determined that the most computationally expensive
component of an anti-virus engine is its pattern matching process. Pattern matching refers to searching
for a given string (or multiple strings) in a given input text. In anti-virus engines, pattern matching is
used to search for known virus signatures inside a file (to verify whether that file is malicious or not according to those signatures).
Considering that pattern matching is a process that can potentially be sped up, we decided to have
as our goal the improvement of such a component in an anti-virus product (we chose ClamAV because it is open-source), so as to decrease the execution time of the overall virus scanning process.
We have found that our system provided a good improvement: speed-ups ranged from around 12 to 28 on different sets of files. On some other sets of files the speedup was lower: if, for example, our system were to scan only a single large file, it would have a larger execution time than ClamAV. Future work is described to improve the performance in these situations, though.
This document is structured as follows. Chapter 2 gives an overview of GPUs and their use in general
purpose programming: their architecture is shown, along with the programming model used in the work.
This chapter should help the reader understand the challenges that programming for GPUs entails. Chapter 3 provides some general information about virus scanning and describes related work in the area, ranging from pattern matching algorithms, to speeding up those algorithms by modifying them, to speeding them up by implementing them on GPUs. Chapter 4 describes our system's architecture in
detail. Chapter 5 evaluates the system against various factors, and compares it against existing solutions.
Finally, Chapter 6 discusses some future work for the system and brings a conclusion to this work.
Chapter 2
General Purpose Computing on
Graphics Processing Units
This chapter describes the architecture of a graphics processing unit and explains its advantages and
disadvantages for performing general purpose computing.
Graphics Processing Units (GPUs) are hardware units specially designed to perform graphics operations (or rather, graphics rendering), which are very important for video games. Graphics rendering is
by nature a parallel operation, as the value of a given pixel does not depend on the values of other pixels. Due to this, GPUs have been designed to perform these parallel operations as fast as possible. They are essentially highly parallel, multithreaded, many-core processors. As they evolve, they become more and
more powerful, and thus also more and more attractive to programmers in need of implementing a highly
parallel algorithm.
The computing power of a given hardware unit is usually determined by measuring the number of
floating point operations it can perform in a second. The charts in Figures 2.1 and 2.2 show the current measurements of Giga Floating Point Operations per Second (GFLOPS) for single-precision and double-precision operations on modern GPUs and CPUs.
From these charts, it is easy to see why GPUs are so attractive to programmers building high-performance applications. A modern high-end GPU can handle around 10 times the GFLOPS of an
equivalent CPU.
However, it is to be noted that these numbers are provided by the GPU vendors. In practice, achieving
such high performance values is immensely difficult (they may never be reached at all), as many algorithms are, for instance, limited by bandwidth rather than by computing power (being limited by bandwidth means that the memory cannot provide data as fast as an execution core can process it).
In our work, we have used a Tesla C2050, which is a workstation card designed specifically for graphics
processing and computing applications (so it is not necessarily suited for video games, which need very
good performance on graphics rendering). Its architecture is further described in Section 2.2.
2.1
Comparing CPUs and GPUs
As seen in the charts in Figures 2.1 and 2.2, GPUs are faster on some performance metrics than CPUs. A few years ago, CPUs increased their performance by increasing their core frequency. However, increasing the frequency is limited by physical restrictions such as heat and power consumption. Nowadays, CPU performance is increased by adding more cores. In computing hardware, the focus is slowly shifting from CPUs that are optimized to minimize latency towards GPUs that are optimized to
Figure 2.1: Comparison of single precision performance between modern CPUs and GPUs.
http://www.feko.info/about-us/News/gpu_single_precision_performance
Figure 2.2: Comparison of double precision performance between modern CPUs and GPUs.
http://www.feko.info/about-us/News/gpu_double_precision_performance
augment the total throughput. CPUs that are optimized towards latency use the Multiple Instruction Multiple Data (MIMD) model, meaning that each of their cores works independently from the others, executing instructions for several processes.
GPUs, on the other hand, use a Single Instruction Multiple Data (SIMD) model, where multiple processing elements execute the same instruction on multiple data simultaneously.
2.1.1
SIMD Model
The SIMD (Single Instruction Multiple Data) model is suited to the notion that the computation
stages on a GPU are executed sequentially (as in a pipeline) on a flow of data. Each of the computation stages, however, is executed in parallel by the GPU's processors. In an NVIDIA GPU, processors execute
groups of 32 threads, called warps [5]. Individual threads in the same warp start together, but can then
execute and branch independently (although with some restrictions).
Although threads can generally execute independent instructions, threads belonging to the same warp
can only execute one common instruction at a time. If any thread in a warp diverges, then the threads in
that warp are serially executed for each of the independent instructions, with barriers being inserted after
the execution of such instructions so that threads are then re-synchronized. Thus, efficiency is achieved
when all 32 threads of a warp share the same execution path.
This kind of divergence, which forces the hardware to serialize execution, can only exist inside a warp: different warps execute independently of one another, regardless of whether they share an execution path.
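As an illustration of warp divergence, consider the following CUDA kernel (a hypothetical sketch, not part of the thesis code; the kernel and variable names are invented). Threads with even and odd indices take different branches, so every warp executes both branches one after the other:

// Minimal sketch of branch divergence inside a warp (illustrative only).
__global__ void divergent_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Threads in the same warp take different branches here, so the warp
    // executes both branches one after the other, masking off the inactive
    // threads in each pass.
    if (i % 2 == 0)
        out[i] = in[i] * 2;      // even-numbered threads
    else
        out[i] = in[i] + 1;      // odd-numbered threads
}

// A divergence-free variant would branch on a per-warp quantity instead,
// e.g. on (i / warpSize) % 2, so that all 32 threads of a warp agree.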
In order to maximize the utilization of a GPU's resources, thread-level parallelism is needed. Occupancy (the ratio of threads that are running or ready to run to the total number of possible threads during the program's life cycle) is related to the number of warps resident on a multiprocessor.
Each time an instruction needs to be issued on a multiprocessor, a warp scheduler takes a warp that
is ready to run its next instruction and issues that instruction to the warp’s threads. Some time is needed
for a warp to be ready to execute its next instruction: this is called latency. So thread-level parallelism
effectively targets hiding the latency, and occupancy is essentially a measure of how much latency is
hidden. Maximum occupancy is achieved when the warp scheduler always has an instruction to issue for
some warp at every clock cycle.
2.1.2
Threads
Another main difference between CPUs and GPUs is threads and how they are organized. On a
typical CPU, a small number of threads can run concurrently on a core. On a GPU however, thousands can run concurrently per multiprocessor (our Tesla C2050 card supports up to 1536 threads per
multiprocessor).
Switching between threads on a CPU can also take significant time, whereas on a GPU this switching is done seamlessly and almost instantly. This is due to the fact that on a CPU, the operating system controls the thread switching behavior. On a GPU, thousands of threads are put into a queue
(in groups of warps). If the hardware must wait on one warp of threads (because for instance it needed
to perform a slow access to memory), then it just swaps that warp with another one that is ready to
execute.
This is mainly possible due to the large number of registers present on a GPU: separate registers are allocated to all of the threads, so no swapping of registers or state needs to be done when thread switching occurs (unlike on a CPU, which has fewer registers whose state must therefore be saved on thread switches).
Figure 2.3: The Fermi architecture. http://files.tested.com/uploads/0/5/16360-gf100_core.jpg
2.1.3
Memory
Both CPU and GPU have dedicated memory chips available to them. On a CPU, the RAM is
accessible to all code (given that it stays within some boundaries related to processes, boundaries enforced
by the operating system). On a GPU, the RAM is divided virtually (and physically) into several types
(described later in this chapter), each of which has its own characteristics. The main differences are that the CPU's RAM has lower access latency and is backed by multiple, large caches, while a GPU's RAM can be slow depending on the access pattern and its caches are small, as they are designed for special uses (namely texture filtering) related to the aforementioned types into which the RAM is divided.
2.2
Hardware Architecture
The Tesla C2050 is an NVIDIA workstation GPU which is based on the Fermi architecture. The Fermi
architecture provides many improvements over previous ones, such as Error-Correcting Code (ECC) across
all memory types (registers, DRAM, etc.) and the ability to execute concurrent kernels. The C2050
is composed of 14 shared multiprocessors. A shared multiprocessor is basically a small processor: it has
an execution engine, and plenty of registers. It also has its own small memory and L1 cache. These
14 multiprocessors are then connected to a DRAM memory. In the C2050, this memory has 3GB of
capacity (reduced to around 2.7GB when memory error correction is on). However, its latency is quite large, so accesses to it are something programmers must take into consideration when designing and implementing programs.
In Figure 2.3, we can see an overall look of the architecture present in Fermi GPUs.
Figure 2.4: The Fermi shared multiprocessor hardware architecture. http://www.geeks3d.com/public/jegx/201001/fermi_gt100_sm.jpg
A Fermi card can have up to 16 shared multiprocessors (our Tesla card has 14), each of which contains 32 cores (or execution units), meaning that at most 16 x 32 (512) threads can be executing in parallel. They are all inter-connected to a global L2 cache and to memory controllers that are used to
access the GPU’s DRAM. There is also a host interface to communicate with the CPU, and some other
special engines related to graphics operations.
A shared multiprocessor’s architecture is shown in Figure 2.4.
There is a large register file that is shared between all of the multiprocessor's active threads. A
multiprocessor also contains a small, fast memory (called the shared memory), and an L1 cache for the
global memory. There is also a special cache, the texture cache. This cache is essentially another cache
for the global memory, but it is optimized for spatial locality and can be populated independently from
the L1 cache.
When it comes to memory, an NVIDIA GPU has different types of physical memory, which can also
be virtually abstracted. There are three main memories on a GPU: device memory, shared memory, and
registers. Device memory is the largest memory (3GB in our Tesla GPU), but also the one with the
largest access latency. It is equivalent to the DRAM of a CPU. Despite being slow, its main use comes
from the fact that it is the only physical memory on the GPU that can be populated by the CPU before
a kernel is launched.
Shared memory is a fast memory that is present on every multiprocessor, and it is shared by all of
that multiprocessor's threads. Its speed comes from the fact that it is present on-chip rather than off-chip: unlike global memory, no special bus needs to be used to access it. Uncached shared memory latency can be up to 100 times lower
than the global memory’s latency (as long as there are no conflicts in the thread accesses). In order to
achieve such low latency, shared memory is divided into banks. Any read or write request made up of addresses that fall in distinct memory banks can therefore be serviced simultaneously, effectively increasing the bandwidth. However, if different threads need to access the same bank, then conflicts arise and the accesses are serialized (not a big issue, as the accesses remain on-chip).
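As a small illustration (a hypothetical sketch, not thesis code; kernel and variable names are invented), the following CUDA kernel stages a tile of data in shared memory; consecutive threads touch consecutive 32-bit words, which fall into distinct banks and therefore avoid conflicts:

// Illustrative sketch: staging a 256-element tile in shared memory and
// reading it back reversed.  Assumes the kernel is launched with 256 threads
// per block and that n is a multiple of 256.
__global__ void reverse_tiles(const int *in, int *out, int n)
{
    __shared__ int tile[256];                 // one 32-bit word per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                // conflict-free: one word per bank

    __syncthreads();                          // wait until the whole tile is loaded

    // Consecutive threads read consecutive (descending) words: still conflict-free.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}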
Registers reside in a register file on each shared multiprocessor. Every thread has its own set of registers
allocated to it, and register accesses are immediate. Some delays may occur due to read-after-write
dependencies (when an instruction’s operands are not available yet because previous instructions still
need to write them) or bank conflicts, but usually accesses are the fastest that they can be on a GPU.
As for the virtual abstractions, they are all related to the device memory:
• Global memory - Global memory is quite slow, but it can be accessed by 32-, 64- or 128-byte transactions. These transactions need to be aligned, though: only segments of global memory that are aligned to their own size (32, 64 or 128 bytes) can be read or written by memory transactions. When a given warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more memory transactions, depending on the size of the word accessed and the requested memory addresses. If the requested memory addresses are far from each other, the warp must effectively perform a separate transaction for each memory address, which causes them to be serialized (and unused words to be transferred). If they are instead completely sequential, then a single transaction can usually be made for all of the addresses: the hardware performs that transaction and then delivers the values to the requesting threads (see the sketch after this list). In a worst-case scenario with a small number of threads and a very random access pattern, the bottleneck of the execution becomes the global memory: threads are continuously waiting for others to finish their memory accesses so that they can perform their own.
• Local memory - Local memory space has the same latency and bandwidth as global memory due to
being in the same physical memory. However, this type of memory is organized in such a way that
consecutive 32-bit words are accessed by consecutive threads, and in this way the accesses become fully coalesced. Local memory, unlike global memory, cannot be populated by the CPU or even the programmer: (large) variables are automatically placed into it by the compiler when they do not fit into registers.
• Constant memory - This memory space, although residing in device memory, has a small space
allocated to it: 64 KB. But even though it resides on device memory, it is cached on-chip. This
means that a read from constant memory is equal to a read from the cache (which is much faster
than a global memory read). In fact, if threads in a half-warp (which is a group of 16 threads that is
running alongside another one - thus composing a group of 32, a warp - in a shared multiprocessor)
read the same address, that read can be as fast as a read from a register. Reads from different addresses by threads in a half-warp, however, are serialized. Also, this memory is read-only, unlike global memory.
• Texture memory - The texture memory has basically the same properties as the constant memory,
except that its size is not bound to a fixed value, although the data placed into it must respect some size limits. It is also optimized for 2-dimensional spatial locality, and it has a dedicated cache.
So threads that read texture addresses close together will achieve good performance. While also
read-only, texture memory can be greatly advantageous for algorithms that exhibit spatial locality
in their access patterns.
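The following CUDA fragment (an illustrative sketch, not taken from the thesis; kernel names are invented) contrasts the coalesced and strided access patterns mentioned in the global memory item above:

// Coalesced: thread k of a warp reads word k of a contiguous segment, so the
// warp's 32 reads are combined into a small number of aligned transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighbouring threads read words that are `stride` elements apart,
// so a warp touches many distinct memory segments and the hardware must issue
// (up to) one transaction per thread.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}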
2.3
Programming Interfaces and CUDA
Developers that want to program GPUs have a choice to make: which application programming
interface to use. The most used ones are:
• Compute Unified Device Architecture (CUDA)[4] - CUDA is a parallel computing architecture
developed by NVIDIA to be used in their GPUs. It provides both a low-level and a high-level API, and is composed of simple language extensions to C, so as to provide a lower entry barrier to developers already used to C or C++. CUDA is also scalable, meaning that it can be seamlessly
applied to NVIDIA GPUs with different characteristics.
• Stream[1] - This architecture is AMD’s response to CUDA. It uses the ”Brook” language, which
is a variant of C designed for parallel computing. CUDA can only be used in NVIDIA cards, and
likewise, Stream can only be used in AMD cards. Stream entered the market later than CUDA, so
its community and environment are significantly smaller than CUDA’s.
• Open Computing Language (OpenCL)[6] - As both CUDA and Stream are proprietary, an open
standard was launched for parallel programming: OpenCL. It can be used not only on GPUs, but also on every other kind of parallel computing device, be it a CPU in a server or in a mobile device.
OpenCL is fairly new though, so it is not nearly as mature as the alternatives (especially CUDA).
As we are working with an NVIDIA GPU, our decision of API to use was between CUDA and OpenCL.
CUDA was an easy choice: its developer environment and community were (and still are) much more
mature than OpenCL's. As programming for GPUs is still relatively recent, it is natural for new programmers to run into challenges, so having an environment with as much support as possible was
important. Also, CUDA provides some much-needed performance tools, such as a profiler which reports various data such as occupancy analysis, helping to get the maximum performance out of the
GPU.
2.3.1
CUDA Model
The main idea behind CUDA's model is that it provides abstractions that guide the programmer to partition the problem into sub-problems that can be solved independently by blocks of threads
(which will then be executed on a shared multiprocessor) and each sub-problem into smaller ones, each
of which will be solved by a different thread in a thread block.
Threads and thread blocks are, however, software entities, while shared multiprocessors are pieces of
actual hardware. Here are the software entities and how they map to the hardware (also shown in Figure
2.5).
• Device - The device is the GPU card that is running the code.
• Host - The host is the CPU that controls the GPU and might also run some code.
• Kernel - A kernel is a function that runs on the device. As mentioned earlier, NVIDIA’s GPUs use
the Single Instruction Multiple Data model. This means that the kernel is executed in parallel by
every thread processor. Recent GPUs have the ability to run multiple kernels concurrently.
• Thread and Thread Processor - A thread runs a kernel which is executed on the thread processor
(or CUDA core).
• Thread block and Multiprocessor - Every thread block contains some predefined number of threads.
A multiprocessor executes the thread blocks allocated to it by the GPU, but effectively no more than one warp of threads (32 threads) is executing on a multiprocessor's cores at any single instant.
• Grid - A grid is a set of thread blocks, and a kernel is launched with a given number (specified by
the programmer) of blocks to be run on the device.
Figure 2.5: Mapping of CUDA's software abstractions to hardware. http://www.sciencedirect.com/science/article/pii/S1384107609001432#gr1
By allowing the programmer to specify the number of thread blocks at kernel launch time, programs can transparently scale to different NVIDIA GPUs with different hardware configurations. The hardware schedules thread blocks on any shared multiprocessor, as long as it is free. However, having the hardware do the scheduling also means that we have no guarantee whether a given thread block has finished, is executing, or has not begun executing yet.
A kernel’s configuration is an important aspect to consider, as it has an impact on the execution
performance. When a multiprocessor is given one or more thread blocks to execute, it partitions them
into warps that in turn get scheduled for execution by a warp scheduler. A block can have a size up to
a given limit (which depends on the device), but some values are better than others. If a block has 33 threads, two warps will be created from that block: one with 32 threads and one with a single thread (there are 32 CUDA cores in a shared multiprocessor, so the scheduler always builds warps of 32 threads when possible). This leads to some inefficiency, as whenever the second warp is issued, only a single thread is doing useful work. Blocks should therefore optimally have a size
that is a multiple of 32.
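As a small illustration (hypothetical kernel and variable names, not from the thesis), a launch configuration typically rounds the block size to a multiple of 32 and derives the grid size from the problem size:

// Illustrative launch configuration: the block size is a multiple of the warp
// size (32) and the grid size is derived from the problem size n.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

void launch_add_one(int *d_data, int n)
{
    const int threadsPerBlock = 256;                         // 8 warps per block
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_one<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);  // cover all n elements
}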
2.3.2
CUDA Workflow
In order to execute a given algorithm on a GPU using CUDA, some steps need to be performed by the programmer. An example workflow for an algorithm that is to run on the GPU is described below:
• First, the programmer implements the algorithm using CUDA’s libraries and C extensions. A kernel
must be defined in order for some work to be performed by the GPU.
• The program is then compiled by the CUDA compiler, which translates it into both CPU and GPU
instructions.
• The executable generated in the above step is run.
• Data is read or generated by the CPU and sent to the GPU. The kernel defined in the first step is
then called, which leads to its instructions being sent to the GPU, so it can start executing them.
• The GPU runs the kernel, performing some operations on the data sent by the CPU. When it is
done, the CPU then reads the results back to its memory.
This is a sketch, but it provides a backbone for GPU applications. Even though a small number of
steps is performed, we can see that running a program on a GPU is not as simple as running it on a
CPU: data needs to be sent and read back, which is not an instantaneous process.
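A minimal sketch of this workflow is shown below (illustrative only; error checking is omitted and the kernel, function and variable names are invented):

#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;                                     // work done on the GPU
}

void scale_on_gpu(float *h_data, int n, float factor)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_data, bytes);                                // allocate device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // CPU -> GPU

    int block = 256;
    int grid  = (n + block - 1) / block;
    scale_kernel<<<grid, block>>>(d_data, n, factor);          // launch the kernel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU
    cudaFree(d_data);
}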
2.4
Reflections
As we have seen during this chapter, GPUs have characteristics that set them apart from CPUs:
various memory abstractions, number of cores, differences in threads, etc.
But blindly porting an algorithm to a GPU does not necessarily lead to an increase in performance. If, for instance, the algorithm needs constant memory transfers from the CPU, then the performance improvement of running it on the GPU might be offset by the cost of performing those transfers.
The algorithm itself needs to be analyzed as well: if it has highly divergent execution flows between
threads, then the CPU might be a better fit to process it.
GPUs can provide great performance improvements, but some effort is needed in order for them to
do so.
In the following chapter, we will see some examples of pattern matching algorithms which were adapted
to run on GPUs.
Chapter 3
Related Work
In this chapter, we present some related work done in the area: standard pattern matching algorithms,
improvements performed on them, and implementing them on GPUs.
3.1
Threat Detection
Threat detection products can be found on a multitude of systems, from personal computers to servers.
Most of them are commercial and require a paid license, having patent-protected algorithms and signature
sets.
SNORT [8] is a widely accepted open-source Intrusion Detection System (IDS). Intrusion detection
systems are a category of threat detection systems designed to scan incoming network packets, in order to detect if an intrusion attempt is underway.
ClamAV [3] is an open-source anti-virus product. Anti-virus solutions scan files instead of packets,
but they also target the detection of possible intrusions or attacks, which might occur if a malicious file
is executed. In order to detect malicious files, anti-virus products employ a plethora of techniques, from
matching a file’s body to a signature set corresponding to known virus footprints, to analyzing their
behavior in run-time.
The fact that they are open-source, together with the support they receive from their parent company
(SourceFire [9]) and their widespread use in the industry, makes them natural candidates for research in the area.
3.1.1
SNORT
As just mentioned, SNORT is an IDS. Intrusion detection systems work by applying deep packet
inspection techniques to incoming network packets, in order to detect if an attack is being executed.
These deep packet inspection techniques basically verify whether a given packet matches a signature
from a signature set (taking into account other data such as the protocol that the packet belongs to, for
instance).
Being an intrusion detection system, SNORT's signature set differs from those of anti-virus products. While the latter are typically just a string (with or without wildcards) that must be matched in a file, SNORT's rules are a little more complex. SNORT also contains fewer rules than anti-virus solutions, but each rule contains more elements to process.
Each of SNORT’s rules contains an action (or alert), a protocol, a source netmask and port, a
destination netmask and port, and some other options. Of these options, the one named content states a string that must be matched in the packet. This string, just like regular anti-virus signatures, may
alert tcp !$HOME_NET any -> $HOME_NET 80
(msg:"IDS219 - WEB-CGI-Perl access attempt"; flags:PA; content:"perl.exe"; nocase;)
Figure 3.1: Example SNORT rule.
contain wildcards. In contrast, anti-virus signatures usually only contain this string, along with optionally
some other fields such as the offset at which the signature must be present, and the file type in which the
signature must be matched.
There are two main operations that need to be performed on each packet to be scanned: packet classification,
and string matching on the packet’s content. According to Roesch [25], the string matching process is
the most computationally expensive one. Thus it is not surprising that research work done with SNORT
typically focuses on improving the performance of this process.
SNORT mainly uses the Aho-Corasick algorithm (described later in this chapter) for pattern matching
purposes. SNORT’s configuration does permit changing the algorithm’s characteristics (such as the type
of the automaton used), but the possible changes in performance due to such different configurations are
not very significant.
Work in the area (also described later on) thus tries to improve SNORT’s pattern matching process,
either by improving the Aho-Corasick algorithm itself, or implementing a new pattern matching algorithm
for SNORT.
3.1.2
ClamAV
ClamAV is an open-source anti-virus engine, used by many people on desktop computers, servers and mail gateways alike. The signature set of ClamAV contains over one million signatures in total. However, they are not all of the same format. The set is divided into various types, the most prominent being MD5 hashes and static strings. As of April 2012, the distribution of the signature types was as indicated
by Table 3.1 (the “others” category refers to combinations of signatures, signatures related to archives,
signatures related to spam e-mails, etc).
MD5 Hashes    Basic Strings    Regular Expressions    Others    Total
1,215,718     88,391           9,421                  7,484     1,321,014
Table 3.1: Distribution of signature types in ClamAV.
As we can see, more than 90% of the signatures are MD5 hashes of files. However, MD5 hashes are
calculated by simply applying a hash function to a whole file. Scanning against this type of signature does not take a significant amount of time, in contrast to scanning for regular expressions and basic
strings, which must be scanned for inside the file.
This is confirmed by profiling ClamAV: we did so by scanning a large file (around 600 MBytes), and
found that two functions stood out from the others (considering their execution time). The results are
shown in Table 3.2.
All of the other functions that are not shown in the table had an execution time below 2 seconds
(except for a function related to the database, that took 13 seconds to execute). As such, the biggest
fraction of the running time that we can optimize is the one related to the two functions shown above.
These two functions shown in the table are related to the process of pattern matching, which is done by
execution time (s)    times called    % of execution time    function name
37.16                 2592            38.5%                  cli_ac_scanbuff
36.68                 2566            38%                    cli_bm_scanbuff
Table 3.2: Results of profiling ClamAV by scanning a large file.
an anti-virus product in order to determine if a given file contains a known signature (which corresponds
to a virus). As can be seen from their names, they are related to ClamAV's two pattern matching algorithms, Aho-Corasick and Wu-Manber (the function is actually named “bm”, short for Boyer-Moore; Wu-Manber
is a multi-pattern matching adaptation of Boyer-Moore) - these algorithms will be explained further in
this chapter.
As mentioned, ClamAV’s signature set is quite diverse, but some subset of it needs processing that
is computationally expensive. These signatures can have two different formats (the first of which can no
longer be used to describe new signatures - but old signatures described in it are still valid):
MalwareName=HexSignature
MalwareName:TargetType:Offset:HexSignature[:MinFL:[MaxFL]]
TargetType represents the filetype associated with this signature, Offset represents the location inside a
file where this signature must be present, and MinFL and MaxFL can be used to state that this signature
should be restricted to some detection engine versions.
While the TargetType and Offset parameters require some computation, that computation is actually only performed if a file matches the hex signature. And it is in this hex signature
that the main complexity of the pattern matching processing lies.
Hex signatures can be split into two different groups: signatures with wildcards, and signatures
without them. Signatures without wildcards are not especially problematic: despite being fairly large
both in number and in length, these can be handled by pattern matching algorithms which don't have to worry about matching regular expressions (such algorithms take advantage of that fact to achieve better overall runtime complexity). Signatures with wildcards, however, can be very problematic: ClamAV's wildcards include some of the most complex regular expression constructs. Here are some of them:
• ?? - Match any byte. This is equivalent to the “dot” wildcard in POSIX [7] regular expressions.
• * - Match any number of bytes. This is an especially problematic wildcard, as it is equivalent to the “dot-star” (or “.*”) wildcard in POSIX. As any number of bytes, with all possible byte values, can be matched here, the regular expression automaton needs to constantly keep considering the state in which this wildcard is represented.
• {n}, {-n}, {n-}, {n-m} - These are all combinations of repetitions of the ?? wildcard (in fact, {1}
translates to ?? in ClamAV’s internals). While not as problematic as the above wildcard, these
also bring complexity to an equivalent automaton.
Due to the complexity of modern viruses, these wildcards are widely used in signature sets. In the
ClamAV signature set used (the same as mentioned above), we have found 5,779 “*” wildcards. So not
only are such wildcards highly complex in terms of automaton representation (leading to being expensive
in terms of memory space utilized), but they also exist in a high number. This leads to automata-based
solutions having problems with ClamAV’s signature set, as will be evidenced further along this work.
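To make the correspondence concrete, the following toy function (an illustrative sketch, not ClamAV's actual parser; it handles only the two simplest wildcards, treats the signature as an already-decoded byte string, and ignores the bounded repetitions):

#include <string>

// Rewrite ClamAV-style "??" and "*" wildcards into POSIX-like syntax.
std::string wildcard_to_posix(const std::string &sig)
{
    std::string out;
    for (size_t i = 0; i < sig.size(); ) {
        if (sig.compare(i, 2, "??") == 0) { out += ".";  i += 2; }   // any single byte
        else if (sig[i] == '*')           { out += ".*"; i += 1; }   // any run of bytes
        else                              { out += sig[i]; i += 1; } // literal character
    }
    return out;
}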
3.2
Pattern Matching
We have already seen earlier in this chapter that pattern matching is the most computationally
intensive task of ClamAV. This is mostly due to the fact that most of the time, every byte of every file (or
packet, in intrusion detection systems) needs to be processed as part of the string searching algorithm.
Pattern matching is the process of finding a string or a set of strings inside a text. Pattern matching
is widely used in many different applications, from search engines to security solutions.
In order to match a given string ABAB in a text ACABABACAC, we could start by aligning both of the strings on their first character, and then perform character-by-character comparisons.
Figure 3.2: An example of performing naive string matching.
However, if, for instance, the text is very long, or if we want to perform matching of several strings
(thus leading to performing the above process for each of the strings), then such a simple process is not
suitable.
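The naive procedure of Figure 3.2 corresponds to the following sketch (illustrative code, not part of the thesis):

#include <stddef.h>
#include <string.h>

// Naive matching of a single pattern: try every alignment of the pattern in
// the text and compare character by character (O(n*m) in the worst case).
// Returns the index of the first occurrence of pat in text, or -1 if absent.
long naive_match(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    if (m > n)
        return -1;
    for (size_t i = 0; i + m <= n; i++) {      // each possible alignment
        size_t j = 0;
        while (j < m && text[i + j] == pat[j]) // compare character by character
            j++;
        if (j == m)
            return (long)i;                    // full match at position i
    }
    return -1;
}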
For the rest of this work, we will use ClamAV’s notation for wildcards: when a “*” is shown, it is
equivalent to the “.*” wildcard in the POSIX notation regularly used in the industry.
3.2.1
Single-pattern Matching
Single-pattern matching algorithms perform matching of one pattern at a time. For k patterns
to be matched, the algorithm must be repeated k times.
There are two common single-pattern matching algorithms: Knuth-Morris-Pratt [18] and Boyer-Moore
[12]. Both have a similar backbone idea: that unneeded character comparisons can be skipped by making
use of the information about previously matched characters.
In the Boyer-Moore algorithm, the larger the pattern to search for, the faster the algorithm can match
it. The algorithm compares the pattern to the text from right to left (starting at the rightmost character),
and two rules are used in order to determine how many characters can be skipped when a mismatch occurs (one is based on the current mismatched character, the other on the number of matched characters before the mismatch). These rules depend on the pattern, however, so on each mismatch that occurs, the distance to be skipped might not be as big as one would hope.
Boyer-Moore can be quite efficient: when the pattern to be searched for does not appear in the text,
its worst-case running time is O(n + m) (where n is the text length and m is the pattern length).
This is due to the fact that if the pattern does not appear in the text, the algorithm is constantly
skipping characters, without verifying if some possible match exists. When the pattern appears in the
text however, the worst-case running time complexity is O(nm): the algorithm needs to verify possible
matches, and this verification causes the complexity to increase.
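As a concrete illustration of the bad-character idea, the sketch below implements Horspool's well-known simplification of Boyer-Moore (only the bad-character rule, not the full two-rule algorithm described above):

#include <stddef.h>
#include <limits.h>

// Horspool variant: on a mismatch, shift the pattern according to the text
// character currently aligned with the pattern's last position.
long horspool_match(const unsigned char *text, size_t n,
                    const unsigned char *pat,  size_t m)
{
    if (m == 0 || m > n)
        return -1;

    size_t shift[UCHAR_MAX + 1];
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;                       // characters absent from the pattern
    for (size_t j = 0; j + 1 < m; j++)
        shift[pat[j]] = m - 1 - j;          // distance from last occurrence to the end

    size_t i = 0;
    while (i + m <= n) {
        size_t j = m;
        while (j > 0 && text[i + j - 1] == pat[j - 1])
            j--;                            // compare right to left
        if (j == 0)
            return (long)i;                 // match found at position i
        i += shift[text[i + m - 1]];        // bad-character shift
    }
    return -1;
}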
3.2.2
Multiple-pattern Matching
Multiple-pattern matching algorithms, in contrast to single-pattern ones, can perform matching of a
set of patterns in a single run. For k patterns to be matched, the algorithm only runs once.
Wu-Manber
The Wu-Manber algorithm [33] is very similar to Boyer-Moore, but it can be applied to multiple-pattern matching. The main difference is that instead of comparing characters one by one, the algorithm
compares blocks of B characters. In order to do so, it uses three tables: one is analogous to one of
Boyer-Moore’s rules - it is used to determine how many characters in the text can be skipped - and the
other two are used when no shifting, even by a single character, is possible - thus we need to verify which
pattern might be matching at the current position.
A limit m (the length of the smallest pattern) is imposed on the length of each pattern, so that they
all have the same length. A pitfall of this limitation is that if we have small patterns, we can never shift
by a distance greater than the length of the smallest one.
Each possible string of B characters over the alphabet is mapped (using a hash function) to an index into the table that determines the skipping distance (called the Shift table). In the scanning phase, there are two possible
situations: either the current string of size B is not a substring of any given pattern (so we can safely
shift m-B+1 characters), or it is. In this latter case, the rightmost occurrence of the string must be found
for all of the patterns, and the shifting distance is then given by the minimum pattern length minus the
greatest of these rightmost occurrences.
As for the other two tables, they are called Hash and Prefix. The Hash table uses the same hash
function as the Shift table, and it maps the last B characters of all patterns. Each entry contains a pointer to a list of the patterns whose last B characters hash into the corresponding index of the Hash table. Each entry is also, in turn, an index into the Prefix table.
This Prefix table maps the first B characters of all patterns into a hash value. It is mainly used to filter
patterns whose suffix is the same as the current string that was matched, but whose prefix is different.
Considering this, the scanning procedure is as follows: a hash value is calculated for the current B
text characters. The value of the Shift table for this hash value is then checked: if it is greater than 0, then the text is shifted by the given value, and the matching resumes. If not, then the hash value of the prefix of the text (starting m - the minimum pattern length - characters to the left of the current position) is calculated.
After this, for each pattern that is pointed to by the Hash table entry for the first hash value that was calculated,
we must check if its entry in the Prefix table is equal to the recently calculated hash value of the prefix.
If they are equal, then the pattern is checked directly (character by character) against the text, in order
to confirm the match.
The Shift table is constructed in O(M) time (where M is the total size of the patterns), and one hash value takes O(B) time (B is the block size - a value that can be chosen, but recommended by the
algorithm’s authors to be 2 or 3 for natural language texts) to compute. As such, the total amount of
work in the case of non-zero shifts is O(BN/m) (N is the text size, and m the minimum pattern length).
The algorithm is not only designed to concentrate on typical searches (rather than worst-case behavior), but is also designed to be used for natural language texts and queries, so it might not be as efficient in other applications whose pattern sets have different characteristics from those of natural
language. In Chapter 4, we will show that this holds true: not every application benefits (in terms of
execution time) from using the Wu-Manber algorithm.
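The sketch below illustrates the scanning procedure described above, under simplifying assumptions: the block size B is fixed to 2, every pattern is assumed to have at least B characters, and candidate positions are verified directly against every pattern instead of going through the Hash and Prefix tables:

#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define B 2
#define TABLE_SIZE 65536                        /* one entry per 2-byte block */

static size_t hash2(const unsigned char *s) { return (s[0] << 8) | s[1]; }

void wu_manber_scan(const unsigned char *text, size_t n,
                    const unsigned char **pats, size_t npats)
{
    size_t m = (size_t)-1;                      /* length of the shortest pattern */
    for (size_t k = 0; k < npats; k++) {
        size_t len = strlen((const char *)pats[k]);
        if (len < m) m = len;
    }

    static size_t shift[TABLE_SIZE];
    for (size_t h = 0; h < TABLE_SIZE; h++)
        shift[h] = m - B + 1;                   /* block absent from all patterns */
    for (size_t k = 0; k < npats; k++)
        for (size_t j = B; j <= m; j++) {       /* blocks within the first m chars */
            size_t h = hash2(pats[k] + j - B);
            if (m - j < shift[h]) shift[h] = m - j;
        }

    for (size_t i = 0; i + m <= n; ) {
        size_t s = shift[hash2(text + i + m - B)];
        if (s > 0) { i += s; continue; }        /* safe to skip s characters */
        for (size_t k = 0; k < npats; k++) {    /* verify every pattern at i */
            size_t len = strlen((const char *)pats[k]);
            if (i + len <= n && memcmp(text + i, pats[k], len) == 0)
                printf("pattern %zu matches at offset %zu\n", k, i);
        }
        i += 1;
    }
}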
Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) is a reversible transformation that can be performed on
a text, in order to achieve compression. It has been widely used in that area, but its application to
string matching was only discovered in 2000 [14]. In order to do string matching with BWT, one must
do pre-processing on the text (in contrast to other algorithms such as automata-based ones, which do
pre-processing on the queries).
The BWT for a given text T is constructed by first appending an “end of string” character (the $
character for instance) to the text (this character should of course not be present in any other position in
the text). Then, all of the cyclic permutations of the text are calculated and sorted. The BWT is then
given by the last character of each of the cyclic permutations, which are arranged in sorted order.
Two additional data structures must also be present: an array, which keeps track of the original
indexes of the permutations before being sorted (called the suffix array), and another one which keeps
the sorted characters of the string.
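A small sketch of the construction just described (illustrative only; real implementations build a suffix array directly instead of materializing all rotations):

#include <algorithm>
#include <string>
#include <vector>

struct BWTResult {
    std::string last_column;     // the BWT itself
    std::vector<int> suffix_arr; // original index of each sorted rotation
};

// Append '$', form all cyclic rotations, sort them, and keep the last column
// plus the array of original rotation indexes.
BWTResult build_bwt(std::string text)
{
    text += '$';                              // unique end-of-string marker
    const int n = (int)text.size();

    std::vector<int> idx(n);
    for (int i = 0; i < n; i++) idx[i] = i;   // rotation starting at position i

    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        for (int k = 0; k < n; k++) {         // compare rotations lexicographically
            char ca = text[(a + k) % n], cb = text[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });

    BWTResult r;
    r.suffix_arr = idx;
    for (int i : idx)
        r.last_column += text[(i + n - 1) % n];  // last character of each rotation
    return r;
}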
Searching is done as follows: for each character in the query, we find the corresponding cyclic permutations of the text whose first characters are equal to the one being considered from the query. We
continue on to consider the next character of the query: we again find its corresponding permutations, but this time we only consider the first n permutations, where n is the number of times that this character appeared in the set of last characters from the previously considered cyclic permutations. When we have done this for all of the characters of the query, the indexes at which the query is present in the text are given by the suffix array values corresponding to the cyclic permutations that were considered last.
While searching is done in a straightforward manner, BWT-based pattern matching algorithms have, as can be seen, a rather large memory usage: several data structures must be kept in memory, and these
can grow to be quite large depending on the text (around 3.7GB for the human genome [20]). Also,
BWT is more suited for applications which have a fixed text and changing queries (as it applies the
pre-processing on the text).
Finite automata
Other multi-pattern matching algorithms are based on finite automata. Finite automata used in
pattern matching algorithms are either non-deterministic (NFA) or deterministic (DFA). Both kinds are augmented directed graphs that have an alphabet, a set of states, and a transition function (which, for each state, computes a next state given a character from the alphabet).
Matching is done by starting at the initial state, computing the value of the transition function for the current input character, and using that value as the current state in the next iteration (where the transition function is again applied, this time to the next input character). When one of the states given by the transition function is final, the algorithm reports a match at the current input position.
Non-deterministic Finite Automata
A Non-deterministic Finite Automaton (NFA) consists of a finite set of states Q, a finite set of input symbols Σ called the alphabet, a transition relation ∆ : Q × (Σ ∪ {ε}) → P(Q) (where P(Q) represents the set of all subsets of Q and ε represents the empty string), a start state q0 ∈ Q and a set of final states F ⊆ Q. Let a = a1, a2, ..., an be an input string over Σ. The NFA recognizes a if there is a sequence of states s = s0, s1, ..., sn in Q that respects the following conditions:
• s0 ∈ E(q0) (the automaton starts in the states reachable by ε-transitions from q0)
• there is an s′ ∈ ∆(si, ai+1) such that si+1 ∈ E(s′), for i = 0, 1, ..., n − 1 (the next state is given by applying the transition relation to the current state and input character, and then following ε-transitions from the resulting state)
• sn ∈ F (the automaton recognises the input string if the sequence ends in one of the final states)
Figure 3.3: NFA for the patterns ab*c, bc and ad built from applying Thompson’s algorithm. [29]
The execution model of an NFA uses some auxiliary data when processing the input text. Since an NFA may be in more than one state at a single time, a vector representing the list of active states is usually kept. Each time one of the active states is final, the algorithm reports a match for the corresponding pattern.
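A minimal sketch of this execution model is shown below: the set of active states is updated for every input character, epsilon transitions are always followed, and the initial state is kept active so that matches may begin at any position. The encoding of the transition relation is an illustrative assumption.

    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    using State = int;
    using StateSet = std::set<State>;

    // All states reachable from 's' by following epsilon transitions only.
    StateSet eps_closure(const StateSet& s, const std::map<State, StateSet>& eps) {
        StateSet closure = s, frontier = s;
        while (!frontier.empty()) {
            StateSet next;
            for (State q : frontier) {
                auto it = eps.find(q);
                if (it == eps.end()) continue;
                for (State r : it->second)
                    if (closure.insert(r).second) next.insert(r);
            }
            frontier = next;
        }
        return closure;
    }

    // NFA simulation: 'delta' maps (state, character) to successor states. A match is reported
    // whenever one of the active states is final.
    void nfa_scan(const std::string& input, State start, const StateSet& finals,
                  const std::map<std::pair<State, char>, StateSet>& delta,
                  const std::map<State, StateSet>& eps,
                  std::vector<size_t>& match_positions) {
        StateSet active = eps_closure({start}, eps);
        for (size_t i = 0; i < input.size(); ++i) {
            StateSet next = {start};                     // the initial state is always active
            for (State q : active) {
                auto it = delta.find({q, input[i]});
                if (it != delta.end()) next.insert(it->second.begin(), it->second.end());
            }
            active = eps_closure(next, eps);
            for (State q : active)
                if (finals.count(q)) match_positions.push_back(i);  // a pattern ends at position i
        }
    }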
An example NFA for the patterns “ab*c”, “bc” and “ad” (where ”*” is the ClamAV wildcard, representing the possibility of zero or more characters being matched) is shown in Figure 3.3. A transition
labeled by Σ represents a group of transitions for all of the characters in the alphabet, which in this case
is Σ = {a, b, c, d}.
Matching on an NFA is a somewhat different (and more complex) process than matching on a DFA:
instead of only following a single transition on each input character, multiple transitions are followed,
which leads to several states being active at any one time.
The process of matching the NFA from Figure 3.3 to the text abc is:
• The only active state at the start of the algorithm is the initial state, state 0.
• On consuming the first character, a, the algorithm marks states 1 and 9 as active.
• The character b is consumed, leading to states 1 and 9 being marked as inactive, and states 2, 3, 5
and 7 marked as active (state 0 is always active, and epsilon transitions are always followed).
• The algorithm finally consumes the character c, which leads to the states 6 and 8 being marked as
active (and states 5 and 7 as inactive). These states are final, so matches are reported on each of
them.
Deterministic Finite Automata
A Deterministic Finite Automaton (DFA) consists of a finite set of states Q, a finite set of input
symbols Σ called the alphabet, a transition function δ : Q × Σ → Q, a start state q0 ∈ Q and a set of final
states F ⊆ Q. Let a = a1, a2, ..., an be an input string over Σ. The DFA recognizes a if there is a sequence of states s = s0, s1, ..., sn in Q that respects the following conditions:
• s0 = q0 (the automaton starts in the state q0 )
• si+1 = δ(si , ai+1 ) (i = 0, 1, ..., n−1) (the next state to transition to is given by the result of applying
the transition function with the current state and input character)
• sn ∈ F (the automaton recognises the input string if the last state of the sequence is part of the set of final states)

        a   b   c   d
    0   1   2   0   0
    1   0   3   0   4
    2   0   0   5   0
    3   3   3   6   3
    4   0   0   0   0
    5   0   0   0   0
    6   3   3   6   3

Table 3.3: Transition table for the DFA in Figure 3.4
The transition function δ can be represented in different ways, but the standard one is a 2-dimensional table. Each of the table’s cells contains the result of applying the transition function to the
state and input character given by the respective line and column. As an example, for the DFA in Figure
3.4, we can represent its transition function using Table 3.3. All of the cells in the first row represent the
transitions that will occur at state 0 for each of the input characters. Conversely, the cells in the first
column represent the transitions for the input character ‘a’ that will occur at every state. We can see
that the cell given by the first row and first column has the value 1. This means that if the automaton
is on state 0 and the input character is an ‘a’, then the automaton transitions to state 1.
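The scanning loop that results from this representation is very simple; the sketch below scans a text against a transition table such as Table 3.3 (the mapping of the characters ‘a’–‘d’ to column indexes is an illustrative assumption).

    #include <string>
    #include <vector>

    // DFA scan over a 2-dimensional transition table: a single table lookup per input byte.
    // table[state][symbol] gives the next state; is_final[state] marks the final states.
    std::vector<size_t> dfa_scan(const std::string& text,
                                 const std::vector<std::vector<int>>& table,
                                 const std::vector<bool>& is_final) {
        std::vector<size_t> match_positions;
        int state = 0;                                    // start state
        for (size_t i = 0; i < text.size(); ++i) {
            int symbol = text[i] - 'a';                   // map 'a'..'d' to columns 0..3 (illustrative)
            state = table[state][symbol];
            if (is_final[state]) match_positions.push_back(i);   // a pattern ends at position i
        }
        return match_positions;
    }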
Comparing DFAs with NFAs, a DFA has no transitions by the epsilon character and each state only
has one transition by a given character. This simple model has two main advantages: first, matching
is straightforward and fast, as it requires only a single table lookup (or pointer dereference) per input
character. Second, DFAs are composable, meaning that a set of DFAs can be composed into a single
composite DFA (eliminating the need to traverse multiple DFAs instead of a single one).
Unfortunately, DFAs often do not interact well when combined, yielding a composite DFA whose size may be exponential in the input size and often exceeds the available memory.
They do however provide a big advantage over NFAs: as computers are deterministic machines, NFAs
are harder to simulate in them (due to the possibility of being in more than one state at the same time),
which causes larger execution times and greater implementation complexity.
Regular expression matching with automata
Regular expressions are gaining widespread use in applications which use searches with expressive
and flexible queries (virus scanning and intrusion detection are examples of such applications). Using
Thompson’s algorithm [29], a regular expression can be converted into an NFA, which can then be converted into an equivalent DFA.
As regular expression use becomes more and more common in packet and file scanning applications,
it is imperative that regular expression matching be as fast as possible, as to keep up with line-speed
packet header processing. The inefficiency in regular expression matching is largely due to the fact that
the current matching solutions do not efficiently handle some complex features of regular expressions used
in such applications: many patterns use multiple wildcards (such as ‘.’ and “*”) and some patterns can
contain over ten such wildcards. As regular expressions are converted into NFAs for pattern matching,
large numbers of wildcards can cause the corresponding DFA to grow exponentially. Also, a majority of
the wildcards are used along with length restrictions (such as ‘?’ and ‘+’), and these restrictions can
increase the resource needs for expression matching. Finally, groups of characters are also commonly
used in regular expressions (such as one that matches all lowercase characters). Groups of characters can
intersect with one another or with wildcards, causing a highly complex automaton.
Converting an NFA to a DFA
The method to convert an NFA to a DFA is called Subset Construction [24]. Its general idea is that
each state of the DFA corresponds to a set of NFA states. So, when calculating the transition function for the input, each time the DFA is in a given state, it is effectively in a set of NFA states.
The algorithm constructs a transition table for the DFA to be built, which will represent for each
state and input character, what will be the next state to transition to. As we are building a DFA from
an NFA, these DFA states will be represented by sets of NFA states. This transition table is built so
that the resulting DFA will end up simulating all of the possible transitions that the NFA can make in a
given input string.
The algorithm defines some operations which effectively give us the resulting set of states that an
NFA has active after performing a transition. Remembering that the NFA can have transitions by the
empty character, which does not consume input, it is natural to define the set of NFA states that is
active after a given transition is the state s that is given by that transition, plus all of the states given
by epsilon transitions starting from s.
If we repeatedly consider each of these sets of NFA states for the transition table that we want to
build, we end up with a deterministic automaton that for each set of NFA states N and input character
c, gives us another set of NFA states which is composed of states reachable (either by just following a
transition or by following multiple epsilon transitions) by each of the states in N by transitions labeled
by c.
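A compact sketch of subset construction following this description is given below; it reuses the State, StateSet and eps_closure definitions from the NFA simulation sketch above, and the NFA encoding is again an illustrative assumption.

    #include <map>
    #include <queue>
    #include <set>
    #include <utility>
    #include <vector>

    // Subset construction: every DFA state corresponds to a set of NFA states.
    struct Dfa {
        std::vector<std::map<char, int>> trans;   // per-state transitions of the resulting DFA
        std::vector<StateSet> nfa_sets;           // the set of NFA states each DFA state represents
    };

    Dfa subset_construction(const StateSet& start_closure,
                            const std::map<std::pair<State, char>, StateSet>& delta,
                            const std::map<State, StateSet>& eps,
                            const std::set<char>& alphabet) {
        Dfa dfa;
        std::map<StateSet, int> id;               // maps a set of NFA states to its DFA state number
        std::queue<StateSet> work;
        id[start_closure] = 0;
        dfa.nfa_sets.push_back(start_closure);
        dfa.trans.emplace_back();
        work.push(start_closure);
        while (!work.empty()) {
            StateSet current = work.front(); work.pop();
            for (char c : alphabet) {
                StateSet moved;                    // NFA states reachable from 'current' on 'c'
                for (State q : current) {
                    auto it = delta.find({q, c});
                    if (it != delta.end()) moved.insert(it->second.begin(), it->second.end());
                }
                StateSet next = eps_closure(moved, eps);
                if (next.empty()) continue;        // no transition on this character
                if (!id.count(next)) {             // a new DFA state: number it and queue it
                    id[next] = (int)dfa.nfa_sets.size();
                    dfa.nfa_sets.push_back(next);
                    dfa.trans.emplace_back();
                    work.push(next);
                }
                dfa.trans[id[current]][c] = id[next];
            }
        }
        return dfa;
    }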
The result of applying the algorithm for the NFA in Figure 3.3 is shown in Figure 3.4 (consider that
the “*” wildcard is the ClamAV one, the equivalent in POSIX would be “.*”).
The conversion algorithm does have a nuance though: it does not give us an automaton capable of
continuous matching (for the situation in which we want to identify all matching patterns in a text, and
not only a single one).
This in turn affects the online complexity of scanning a text using the resulting DFA: if we reach a
given state and the transition for the current input character leads to the initial state, then we must
backtrack on the input (we must re-consider the failed input text). This causes the complexity of the
algorithm to no longer be O(n) in every case - if we need to backtrack, then we must read the input
character that caused the failure transition again; this of course is not ideal (we want an O(n) scanning
algorithm).
However, a simple modification (in implementation terms) can be made to the subset construction
algorithm to make it build a DFA capable of continuous matching. We just add a new transition by Σ (all of the input characters) from the initial state to itself. What this entails is that at each
point in the DFA transition-table building process, we are always considering transitions from the initial
state, and this will propagate throughout the process: we will end up simulating the NFA not just for its
valid input characters, but for every character in the alphabet.
While this allows the algorithm to provide us with an automaton capable of being used for continuous scanning, it also greatly increases the conversion algorithm’s complexity (there are now many more possible input characters to consider at every state, instead of just the ones present on the NFA transitions). The resulting DFA can be considerably more complex as well. Figure 3.5 shows a DFA built from this new
version of the algorithm. We can see that considering all of the input characters greatly increased the
DFA’s complexity, namely at the point representing the wildcard (state 3).
Figure 3.4: DFA for the patterns ab*c, bc and ad.
Figure 3.5: Complete DFA for the patterns ab*c, bc and ad.
To illustrate the advantage, consider what happens when the input is abad. In the first automaton,
the matching gets stuck on state 3 waiting for a c to come in the input, so no match will be reported,
which is incorrect. In the second automaton, if we just follow the transitions, then we see that a match
for ad is actually reported (on state 12). And if afterwards came a c, then the automaton would also
report a match of the ab*c pattern (this would also happen on the first automaton).
What is happening is, at the state that represents the star wildcard, the automaton actually represents
the possibility of every other pattern in the set being present in the input. If our pattern set contains
many wildcards and many patterns (as is the case in ClamAV, as shown in previous sections), we can see how
this becomes a huge problem.
Some approaches have been proposed to counter the state-space explosion that affects DFAs constructed from regular expressions with complex wildcards: HFA [11] (hybrid finite automata) split the DFA under construction at the point where state-space explosion would happen. Lazy DFAs [15] keep
a subset of the DFA that matches the most common patterns in memory; for uncommon strings, they
extend the subset from the corresponding NFA at runtime. This causes the Lazy DFA to be smaller
than the corresponding complete DFA, while providing good performance for common patterns. However, a malicious sender can input widely different patterns into the system, causing it to dynamically rebuild the DFA (and thus slowing down the overall system).
Smith et al. [26] proposed another alternative called XFA, which uses less overall memory than DFA
but requires, for each state, some auxiliary memory (used to track the matching progress) along with
computation. In order to reduce a DFA’s memory footprint, they apply the idea that on each state,
instead of having for instance two separate transitions to two different states, they can be grouped into
a single one, which goes into a single state. This state will then have some computation associated with
it, which will differentiate each of the two states that were grouped into this single one. By doing this,
DFAs are effectively compacted, at the cost of some more computation. While the memory cost is greatly
reduced, per-byte complexity is increased, which is not an ideal tradeoff for all applications.
Aho-Corasick
A special automata-based algorithm is the Aho-Corasick algorithm [10]. It uses a DFA in the form
of a trie, which is a data structure just like a tree, except that it is ordered and it is used to store an
associative array. The DFA used in Aho-Corasick is not the classical one: instead of having transitions
for each character of the alphabet on each state, it has transitions for characters that lead to other states
in the same expression (just like a regular DFA), along with failure transitions. These failure transitions
are followed each time a mismatch occurs, and they are designed to keep track of shared prefixes between
patterns (thus allowing the automaton to transition between matches without backtracking).
Using the Aho-Corasick algorithm for string matching consists of (roughly) two steps, like other
automata-based algorithms: first, a single composite automaton is generated from the pattern set; afterwards, that automaton is traversed in order to recognize the corresponding patterns in a text. Given a
current state and an input character, the Aho-Corasick algorithm first checks whether there is a valid
transition on the automaton for the next input character. If there is, then that transition is followed;
otherwise, the failure transition is the one followed and the same input character is considered until it
causes a valid transition. In other words, unlike valid transitions, failure transitions do not cause the
input character to be consumed.
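In code, this traversal might look like the sketch below, where the goto, failure and output tables are assumed to have been built during the usual Aho-Corasick preprocessing step (the table layout is an illustrative assumption).

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Aho-Corasick search phase. goto_fn[state] holds the valid (trie) transitions, fail[state]
    // the failure transition, and output[state] the identifiers of the patterns recognised when
    // 'state' is reached; all three are built during preprocessing.
    void aho_corasick_scan(const std::string& text,
                           const std::vector<std::map<char, int>>& goto_fn,
                           const std::vector<int>& fail,
                           const std::vector<std::vector<int>>& output,
                           std::vector<std::pair<int, size_t>>& matches) {  // (pattern id, end position)
        int state = 0;                                    // root of the trie
        for (size_t i = 0; i < text.size(); ++i) {
            char c = text[i];
            // Follow failure transitions until a valid transition exists or the root is reached;
            // the same input character is examined again after every failure step.
            while (state != 0 && goto_fn[state].find(c) == goto_fn[state].end())
                state = fail[state];
            auto it = goto_fn[state].find(c);
            state = (it != goto_fn[state].end()) ? it->second : 0;
            for (int pattern_id : output[state])          // report every pattern ending here
                matches.push_back({pattern_id, i});
        }
    }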
Note that in the latter case, the same input character is actually read more than once, once for each failure transition that is followed. This means that Aho-Corasick’s complexity is not strictly O(n) as we would hope.
Aho-Corasick also cannot be used for regular expression matching. ClamAV uses it for scanning files
against signatures with wildcards, but it does not actually represent the wildcards in the automaton
(which is what would happen in classical automata based algorithms). It splits the signature into parts
divided by its wildcards, and considers each part as an individual pattern. Then, when two patterns
match, ClamAV inspects what occurred between the matches, and compares that against the wildcard
that separates the matched parts. This causes ClamAV to not have the best efficiency in terms of running
time (as if matches occur then further verification is needed), but it also causes it to have a very low
memory footprint. As it is designed to run on headless servers (namely e-mail gateways) which might have very little available memory, memory efficiency is a bigger concern to ClamAV than raw scanning
performance.
The main reason that ClamAV uses Aho-Corasick is that many parts in the signatures containing
wildcards are short, and thus considering separate small parts in Boyer-Moore would further restrict the
shifting distance, which is already bound by the length of the smallest signature. As for the usage of
Wu-Manber, compared with the sparse automaton representation used by Aho-Corasick, the shift table
present in Wu-Manber is a more compact representation of a large number of signatures that don’t contain
wildcards, leading to lower memory usage.
We have presented the standard industry pattern matching algorithms. For the rest of this section,
we shall describe related work that focuses on improving them.
3.2.3
A Sub-linear Scanning Complexity Algorithm
Huang et al. [17] proposed a pattern matching algorithm which is sub-linear in scanning time complexity. It uses a shift table and hash technique (somewhat similar to Boyer-Moore) and inspects overlaps between pairs of patterns to ensure worst-case performance. Huang et al. calculated the average pattern length of ClamAV, and, as was shown earlier in this chapter, they found that it was lower than in other systems such as SNORT. It was due to that observation that they decided to use a shift and hash technique (also, ClamAV at the time had only used a shift technique in its algorithm). They also
found that the number of long overlaps between suffixes and prefixes of pattern pairs was limited and as
such, their algorithm can shift more bytes after a pattern is matched, ensuring good worst-case performance (even when many patterns are present in the text).
The proposed algorithm consists of two phases: an offline preprocessing phase, where the shift table
is constructed as in the Wu-Manber algorithm. Along with each pattern, a list is also stored, composed
of prefixes of other patterns, that overlap with the suffix of the one they are being associated with. This
allows the algorithm to continuously perform matching on the text, and detect multiple patterns that
overlap with each other.
The other phase of the algorithm is the online matching phase, in which a lookup is performed on the
shift table. If the shift value is zero, the prefix hash is calculated and compared with the value present
in the list. If the hash table entry is empty, then the matching continues starting from the prefix’s last
character. Otherwise, exact comparisons are made for each possible pattern (whose prefix hash is the
same as the one that was just calculated) to confirm a possible match.
If a pattern is matched in this phase, then the algorithm checks its data structures to try to find a
consecutive overlapped pattern (and thus match more than one pattern in a single algorithm iteration).
The proposed algorithm was found to have a speed-up of around 6 and 1.5 over the Aho-Corasick and
Boyer-Moore implementations in ClamAV, respectively, while consuming roughly the same amount of
memory.
3.2.4
Rearranging the Aho-Corasick States
When it comes to the Aho-Corasick algorithm, Nishimura et al. [23] proposed speeding it up by
rearranging its states.
The Aho-Corasick algorithm consists of two parts: first, a pattern matching automaton is constructed
from the given pattern set; then that automaton is used to search the text. The time complexities of
both parts of the algorithm are, respectively, linear in the total length of the patterns, and linear in the
length of the text.
The pattern matching machine is usually implemented as a two-dimensional array (with O(nm) space complexity, where n is the number of states and m is the alphabet size), which represents a table that is inspected at each input byte, in order to find which transition of the state machine should be taken.
The execution time of searching for the patterns in the text depends on the time that it takes to transmit
data from primary or secondary storage. In order to reduce this overhead, data compression can be
used to reduce the text’s and the patterns’ size, therefore decreasing the table’s size and the primary or
secondary storage transmit time. However, a high compression ratio of the patterns and text does not
necessarily lead to a speedup in the matching phase of the algorithm.
To make the frequently used states contiguous in RAM and thus improve CPU cache efficiency, Nishimura et al. proposed rearranging the states in breadth-first order: frequently used states are usually close to the initial state.
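A sketch of such a rearrangement - renumbering the states of a table-based automaton in breadth-first order from the initial state, so that the states nearest the root become contiguous in memory - could look like the following (illustrative only, not the authors' implementation).

    #include <queue>
    #include <vector>

    // Renumber the states of a table-based automaton in breadth-first order from state 0 and
    // rebuild the transition table accordingly; states close to the initial state (the most
    // frequently visited ones) become contiguous, which improves cache behaviour.
    std::vector<std::vector<int>> reorder_bfs(const std::vector<std::vector<int>>& table) {
        if (table.empty()) return {};
        size_t n = table.size(), alphabet = table[0].size();
        std::vector<int> new_id(n, -1);
        std::queue<int> q;
        new_id[0] = 0;
        q.push(0);
        int next = 1;
        while (!q.empty()) {
            int s = q.front(); q.pop();
            for (int t : table[s])
                if (new_id[t] == -1) { new_id[t] = next++; q.push(t); }
        }
        for (size_t s = 0; s < n; ++s)                    // keep any unreachable states at the end
            if (new_id[s] == -1) new_id[s] = next++;
        std::vector<std::vector<int>> reordered(n, std::vector<int>(alphabet));
        for (size_t s = 0; s < n; ++s)
            for (size_t c = 0; c < alphabet; ++c)
                reordered[new_id[s]][c] = new_id[table[s][c]];
        return reordered;
    }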
Using the rearranging techniques led to a reduction of about 71% in the matching time on a random
text (on an English natural language text, the reduction was 80%). Such an improvement was possible
due to the pattern set characteristics, causing mismatches to occur often and early. If that was not the
case, the reduction in matching time would not be so significant.
3.2.5
Re-writing some Regular Expression Rules
We have seen earlier in this chapter that some regular expression rules can be quite problematic for
DFA, leading to state-space explosion. Yu et al. [34] tried to optimize DFAs by re-writing some of these
rules. Remembering the problem of regular expressions, a pattern can be matched on the entire input
string or on specific substrings of the input, rather than being only matched on a simple string (of fixed
length). Without a priori knowledge of the starting and ending positions of those substrings, DFA created
for recognizing all substring matches can be highly complex, and such a feature (reporting all matching
substrings) may not be fully necessary for the application being considered. In fact, in payload scanning
applications (intrusion detection systems and anti-virus solutions) we are only interested in knowing if
certain attacks or patterns are present.
Some regular expressions have a special character (“^”) to indicate that the expression should be matched only at the beginning of a line. Without this character, there is no prior knowledge of whether or where a matching substring may appear. To handle such patterns, one can either perform repeated searches or pre-pend ".*" to each of these patterns (thus explicitly stating that the pattern can occur
anywhere in the input). As the input is scanned from start to end, the DFA can recognize all substring
matches that may start at different positions of the input.
However, patterns that do start with “^” may also bring some problems. If there is a class of characters that can overlap with the pattern’s prefix (such as, for instance, the rule “^B+[^\n]”, which means matching at the beginning of a line a positive number of Bs and then anything except a newline), then the resulting
DFA can be quite complex (because in the given example the DFA needs to remember the number of Bs
it has seen and their locations in order to make a correct decision with the next input character).
From an analysis done by the authors, a significant number of patterns in the SNORT rule set fell into this category. In order to solve this issue, they proposed re-writing such patterns. In the example given above, due to the fact that in payload scanning applications only non-overlapping matches are of interest, the pattern could be rewritten to “^B[^\n]”.
For patterns targeting buffer overflow attempts (existing on intrusion detection signature sets), that
have a length restriction and check if newlines are not present in the pattern, they proposed adding the “^” character, avoiding state-space explosion.
They also proposed algorithms to efficiently group patterns together, such that patterns in each group
do not adversely interact with each other. Two patterns interact with each other if their composite DFA
contains more states than the sum of the two individual ones. This can clearly happen in our work, as
plenty of patterns contain a wildcard equivalent to “.*”, and we wish to be able to transition between
states of different patterns (so as to guarantee O(n) performance).
In order to perform the grouping, they used the notion that if there is no interaction between any
pair selected from three regular expressions, the composite of those regular expressions is not likely to
exceed the sum of the individual ones. Grouping is done for every pair of regular expressions that do not interact, and continues until the group size limit (a manually set threshold) is reached.
Using these re-writing techniques, the authors found that the overall DFA reduction rate in the SNORT
signature set was about 98%, a large improvement. When it comes to grouping multiple patterns, they found that the signatures in another intrusion detection system, Bro [2], could all be grouped, so creating a composite DFA greatly improved the throughput for this signature set. Their implementation was overall 48 to 704 times faster than an NFA implementation. However, the memory usage was still about 3
to 8 times the one used by an NFA-based approach. Another disadvantage is that rewriting patterns is
something that has to be done specifically for each signature set, and as such cannot be automated.
So while in fact this set of techniques does bring benefits in terms of throughput increase and some
memory reduction, this reduction (which might not be sufficient in every application) comes at a cost of
having to be manually implemented, which may not be feasible for huge signature sets, like for instance
that of ClamAV.
3.2.6
Symbolic Finite State Transducers
One of the biggest security threats on the web is (and has been for some years) Cross-Site Scripting (or XSS). Cross-Site Scripting is possible due to the existence of a flaw in a server, allowing a client to send data containing malicious code, which will then in turn be sent to and executed by browsers belonging to other clients. This flaw is usually present on servers due to the non-escaping of some critical characters that are specific to code from web pages. If these characters are not modified in any way when
they are sent to other clients’ browsers, then the browsers will interpret them as valid code instead of
data to be viewed. Writing string manipulating functions to handle these specific characters is a hard
problem, as the functions must be efficient and bug-free.
Veanes et al. [22] addressed this issue by using symbolic finite state transducers (which extend classic
transducers - a special automaton allowing for both input and output data on each state - with symbolic
predicates). The key difference between these symbolic finite state transducers and classic finite state
transducers is that each transition, instead of being labeled by a character, is labeled by a formula.
Therefore, if we are at a given state of the transducer, we can only advance to a next state if the current
character respects the formula for the corresponding transition (usually these formulas specify a range that the character must be in, or a specific value that the character must have).
By having different possible execution flows at each state, the traversal of the transducer is not trivial
for a processor: on each state, instead of simply following the transition, the formula for each transition
must be checked, and if it is satisfied, then we can proceed to its corresponding state.
Chapter 2 showed that divergence is one of the main factors that can cause a great loss of performance
in GPUs. As such, symbolic finite state transducers are not at all appropriate for a GPU, as they have
different execution flows on each state.
So far, we have seen improvements made to the standard pattern matching algorithms introduced earlier in the chapter. We continue to present work that improves them, but doing so by
implementing them in GPUs.
3.3
Pattern Matching on GPUs
With the rising computational power of GPUs, research has been done on their use in several application areas. Many such areas require pattern matching algorithms, so implementing them on GPUs
is a key factor in improving the overall performance.
Kouzinopoulos et al. [19] performed a comparison of the execution times of parallel versions of
the Knuth-Morris-Pratt [18], Boyer-Moore Horspool [16] (an improvement over the classic Boyer-Moore
algorithm), and Quick-Search [28] algorithms on an NVIDIA GPU. These algorithms were run on several
bacteria sequences, as genome sequencing is one of the areas that benefits the most from GPUs being
used for pattern matching. They also ran a naive string searching algorithm, and they found that the
speedup compared to the CPU-only version was 2.4.
The speedup achieved in Knuth-Morris-Pratt, Boyer-Moore Horspool and Quick-Search was 18, 20 and 19, respectively, showing that GPUs can be quite efficient at pattern matching. They found that
the speedup could be further improved by only using the GPU’s shared memory (instead of using the
global memory).
They concluded that the GPU can run string matching algorithms in a very efficient manner, outperforming the CPU. However, in order to do so, these algorithms cannot be directly ported from the CPU
to the GPU: they have to be adapted in some way in order to fully use the GPU’s computational power.
3.3.1
Using SNORT Signatures
Extended Finite Automata on the GPU
Smith et al. [27] implemented a pattern matching system on a GPU based on XFA (introduced earlier
in this chapter). While XFA might seem like a good improvement, they introduce serious challenges for
SIMD architectures. The additional variable updates involve memory accesses in a manner different from the access patterns used for traversing an automaton. In a GPU, memory accesses should be as few and as simple as possible, as they are one of the most common bottlenecks of an algorithm.
In addition, some XFA use offset counters, which are just like standard counters but store the value in
an offset list. These special counters, while eliminating some increment instructions, require additional
work at each byte read.
Performing matching with a DFA has two levels of control-flow: a function is applied to every input
string, and then that function iterates over every character of the input string. XFA have a third level: for
each state, the interpreter must then execute every instruction for that state and then iterate over the
entries in the offset list (in order to increment the respective counters). It is clear that processing these
instructions causes many conditional branches.
XFA thus suffer from large amounts of control flow divergence in their third level: as a statement
must be evaluated on each state, different instructions may be executed after that evaluation.
In their implementation, the authors found that the percentage of SIMD instructions that diverged was
more than 50%. As mentioned in Chapter 2, modern GPUs do support divergence, but at a (sometimes
large) performance cost.
The authors concluded that signature matching requires memories that support efficient regular accesses, some fast accesses to a smaller memory and capabilities to hide the latency of irregular accesses.
These are all characteristics that are present on modern GPUs, so DFA processing (which has a somewhat
predictable control flow) should be efficiently supported.
As their proposed XFA alternative exhibits branching that is dependent on the input data, it cannot
be reasonably supported by GPUs. The authors did propose that as they found such branching infrequent
on real traffic, some mechanism could be implemented to deviate from SIMD instructions once data that
could cause divergence was found.
Measurements from their work proved the disadvantages of XFA: their measured speedup with XFA
was about 6.7, while a DFA implementation reached 8.6.
They proposed caching the XFA on the GPU in order to improve throughput, but this could also be
done for DFA, so the performance issue should not be tackled from an implementation point-of-view, but
rather from a design one.
Porting of Aho-Corasick to the GPU
Vasiliadis et al. proposed a solution in which the Aho-Corasick algorithm was ported to the GPU, in
order to perform high-speed intrusion detection [30].
Section 3.1.1 described the SNORT intrusion detection system, which organizes the signatures in
groups based on the source and destination port numbers of each rule (it therefore groups signatures
based on the protocol that the packets that contain such signatures use).
In the authors’ implementation, a separate detection engine instance is used to match the patterns
belonging to a rule group. In order to improve the CPU to GPU memory transfer throughput, each rule
group is copied using a buffer that was allocated using page-locked memory. The CPU to GPU transfer is
also performed using direct memory access, thus freeing the CPU during this time. They have also used
a double-buffering scheme, so that when the first buffer becomes full, it is sent to the texture memory so
that the GPU can process it. While this processing is taking place, the CPU is copying newly arrived packets into the second buffer.
The processing that the GPU performs is the execution of the Aho-Corasick algorithm. The automaton
used by Aho-Corasick is stored in a traditional two-dimensional array, where each cell contains not only
the information of the next state to move to, but also information regarding whether that state is final
or not. To improve memory access efficiency, the automaton is stored in the GPU’s texture memory.
Two different parallelization methods were implemented. In the first one, each packet was split into
parts of an equal (and fixed) size. Each thread then searches each part of the packet in parallel. In the
second method, each thread is assigned a whole packet to search in.
Every time a thread matches a pattern inside a given packet, it reports the occurrence by appending
the pattern to an array present in global memory. The reports for each packet will be written in a
separate row of the array, following the order that they were copied to texture memory. This means that
the array will have the same number of rows as the number of packets contained in the batch, which
can grow to be quite large. This array is also copied to the CPU memory when the pattern matching
execution has finished, so that the CPU can report the occurrences of matched patterns.
However, reporting an occurrence is not a simple task, as some corner cases must be handled. Case-sensitive patterns (Aho-Corasick is a case-insensitive algorithm) must be handled by performing an extra
search, offset-oriented rules (which state a location on the input string where the pattern must be present)
must be explicitly checked, and patterns with a common suffix must be confirmed.
Returning to their proposed parallelization methods as briefly described above, each of them has its
own advantages and disadvantages. In the approach of assigning parts of a packet to each thread (and
assigning a packet to each multiprocessor), all threads are assigned the same amount of work. This causes
the execution to have very little divergence (which as seen would hinder SIMD execution). Moreover, in
this case the texture cache is used entirely for the state tables, as no other data is present on the texture
memory.
A drawback of this approach is that extra processing is needed for solving the border detection
problem. What happens is that as each thread is assigned a different part of each packet, there is a
possibility that some patterns are not found, as they could start in a given part and finish in another one
(so a thread would never completely finish the matching of such a pattern). In order to solve this issue, all of the threads must scan a little further (by the largest pattern’s length) than the end of their assigned part. The authors mention that, as each packet is copied to the shared memory of each multiprocessor, communication costs between threads are avoided. But if a packet is so large that it cannot fit into the shared memory (which in our Tesla C2050 card is 48KB), then these costs are no longer avoided.
As for the second approach, assigning a single packet to each thread, its biggest disadvantage is that
different threads have different amounts of work (due to varying packet sizes). This means that threads
of a same warp will have to wait until all of the threads in that warp have finished searching in the packet
that was assigned to them. However, no additional computation will occur since every packet is processed
in isolation.
The authors found that overall, due to the cost of divergence in the execution, the best parallelization
method was assigning chunks of packets to different threads.
They found that the system’s performance was quite good, as it yielded a speedup of 2.4. The CPU
was not completely offloaded though, so this opens up the system to targeted attacks. An attacker could
craft packets that would need to be verified by the CPU (after the GPU processing), thus
slowing down the system.
Lin et al. [21] also proposed a port of the Aho-Corasick algorithm to the GPU, but with a somewhat
different approach than the one by the work described above. As stated, a problem with running the
Aho-Corasick algorithm on a GPU (while dividing the file(s) to be scanned amongst threads, which
is a direct implementation of the algorithm) is boundary detection. This problem can be resolved by
having threads process overlapped computation on the boundaries, but the overhead of such overlapped
computation can degrade the performance.
Due to this issue, the authors opted for another approach: instead of assigning a packet’s chunk to a thread, they assign each byte of the input stream to a thread, which then identifies any pattern that may start at its assigned byte.
This idea of allocating each byte of an input stream to a thread to identify any virus pattern starting at the thread’s starting location has an important implication for efficiency. First, in the conventional Aho-Corasick automaton, the failure transitions are used to back-track the state machine when a mismatch
occurs, so as to identify the patterns starting at any location of an input stream. Since in the proposed
algorithm a GPU thread only needs to check for patterns starting at a particular location, back-tracking
is not needed. Therefore, the failure transitions of the Aho-Corasick automaton can all be removed. And
also, as each thread is allocated a different input stream byte to search for, there is no need for border
detection.
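Conceptually, the work of a single thread in this scheme reduces to a plain forward traversal with no failure transitions; the CPU-side sketch below illustrates it, assuming a failure-less automaton stored as a full transition table (with -1 marking the absence of a transition).

    #include <string>
    #include <utility>
    #include <vector>

    // Work done by one thread in the failure-less Aho-Corasick scheme: starting at byte 'start',
    // only valid transitions are followed; as soon as none exists the thread stops, since
    // patterns beginning at other positions belong to other threads.
    void scan_from(const std::string& data, size_t start,
                   const std::vector<std::vector<int>>& goto_table,  // [state][byte] -> next or -1
                   const std::vector<bool>& is_final,
                   std::vector<std::pair<size_t, size_t>>& matches) {  // (start, end) pairs
        int state = 0;
        for (size_t i = start; i < data.size(); ++i) {
            state = goto_table[state][(unsigned char)data[i]];
            if (state < 0) return;                        // mismatch: no back-tracking is needed
            if (is_final[state]) matches.push_back({start, i});  // pattern starting at 'start' ends at 'i'
        }
    }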
The authors’ algorithm allocates a large number of threads, and most of these have a high probability
of terminating early. This means that threads will have to wait for others in order to proceed, which can
lead to some overall performance loss.
In the implementation, the authors used the maximum number of threads per thread block. The Aho-Corasick automaton was split so as to fit into each multiprocessor’s shared memory. The split
was done by dividing all of the patterns into several groups (patterns of a same group have some prefix
similarity), and separately compiling these groups into smaller state machines, leading to lower memory
usage. Due to the fact that all threads of a warp traverse the same state machine (which has no failure
transitions) but start at different bytes on the input stream, it is likely that many threads terminate much
earlier than others. In order to fully utilize the stream processors of the GPU, the authors evaluated
two different ways to assign the threads. A first approach was to equally distribute the bytes of an input
stream to each thread of the same block. A second one was to utilize a global variable that assigned new
jobs to threads that have already finished. As this variable can only be accessed by a single thread at
a time, its accesses should be atomic. Atomic accesses in a GPU can lead to a serious degradation in
performance, as threads need to wait for the same resource. The authors decided, based on experimental
results, that equally distributing the bytes of an input stream to each thread was the right approach to
use.
Their results found that the system had an overall speedup of 2 to 2.5 over the CPU version (taking into
account memory transfers). Although a nice improvement in performance, the authors’ proposed system
has a simplistic work division, leading to possible uneven work between threads (depending on the input
data).
Regular expression matching
Vasiliadis et al. [32] extended SNORT to support regular expression matching using a GPU. The
proposed system is based on DFA: all of the regular expressions contained in the SNORT signature set are
compiled and converted into DFA using Thompson’s algorithm. Unlike regular DFA-based solutions, in
the proposed work, each regular expression is compiled into a separate automaton (that is then transferred
to the GPU). This work effectively serves as an extension of one described above by the same authors,
but which performed regular expression matching on the CPU. In this one, packets are sent to the GPU
along with an identifier of the regular expression that they need to be matched against. So packets are
not scanned in parallel on the same automaton, but on different ones.
During the search phase, all of the automata reside in the GPU’s texture memory. Using the identifier
of the regular expression, each thread will scan the whole packet in isolation.
The automata used do not support all possible wildcards of a regular expression though (in order
to prevent state-space explosion). So in order to preserve accuracy and precision, all of the regular
expressions that fail to compile into DFA are matched in the CPU using the corresponding NFA.
While the speedup achieved by the authors was quite good in their measurements (about 48), an
alternative could be used: compiling all of the regular expressions into a single one, effectively increasing the overall system throughput (although also increasing, potentially exponentially, the memory required). And as in their previous work, the CPU was still not fully offloaded: it had to scan every packet for some
special rules that could not be converted to be processed on the GPU.
Cascarano et al. [13] proposed an approach based on NFA. The adoption of an NFA instead of a DFA
allows the system to efficiently store a very large regular expression set in a limited amount of memory.
Instead of the regular representation of automata (a table which in every cell contains the number of
the next state to transition to), the authors opted for a ”symbol-first” representation: a list of (source,
destination) tuples representing transitions is kept, sorted by their triggering symbol. This list can grow
to be very large, so it must be stored in the GPU’s global memory, along with an auxiliary data structure
that records the first transition for each symbol (in order to allow easy random look-ups).
To reduce the number of transitions to be stored (and also to speed up execution), the system adopts
a special representation for self-looping states (those with an outgoing transition to themselves for each
symbol of the alphabet). These states are marked as persistent in a dedicated bit-vector and, once reached
during a traversal and marked as active, will never be reset. To improve memory access throughput, the
bit-vectors containing current and future active state sets are stored in shared-memory.
As for the work division in the GPU, every thread in each block executes the main scanning algorithm.
After the execution of some instruction, threads explicitly wait for the others to reach the same point
before proceeding. Parallelism is exploited not only because at any given time multiple blocks are active
to process multiple packets, but also because for each packet, each thread examines a different transition
among those pending for the current symbol. In this way, a large number of transitions can be processed
in parallel, and if there are enough threads, then the time spent in processing a single symbol will be
effectively equal to what would be required for a DFA.
A comparison of the throughput achieved by the authors’ system was made against the one achieved
by an HFA (Section 3.2.2) solution. If the NFA was multistrided (instead of considering a single character
for a transition, pairs of characters are considered), then the throughput reached far better values.
When it comes to memory consumption, it was comparable to that of the corresponding HFA: multistriding does improve the run-time performance (mainly due to shortened input packets), but it also
results in larger automata (due to increased transition counts).
In the authors’ system, in contrast to other approaches, the bottleneck was not memory bandwidth,
but the processing speed of the execution cores. So as faster hardware arrives, the system’s throughput
increases.
The overall speedup when compared to a CPU execution was not as good as a typical DFA approach,
which can be explained by the fact that NFA still have a higher per-byte processing cost. The proposed
system becomes ideal when signature sets become extremely large (at which point DFA solutions have
their memory usage become too high).
3.3.2
Using ClamAV Signatures
Vasiliadis et al. [31] proposed a solution in which ClamAV was modified to run on a GPU-based
DFA matching engine. As mentioned above in this chapter, DFA are the automata with the highest
performance, being able to guarantee O(n) time complexity. However, they also have a high memory
cost, causing some researchers to find alternatives with lower performance.
The proposed system took into account the fact that most approaches rely on fast and accurate
filtering of the ”no-match” case, as usually the majority of network traffic and files are not malicious.
The system contains two major phases: a high-speed filtering engine on the GPU, and a lower speed
verification engine that is run on the CPU.
The system starts by preprocessing the DFA from the signature set of ClamAV. Due to the large size
of the virus signature set of ClamAV, constructing a single DFA is infeasible. In order to overcome this,
they chose to only use a portion from each virus signature.
At the scanning phase, the input data will be initially scanned by the DFA on the GPU. Due to not
using the complete signatures, the DFA may not be able to match an exact virus signature from a data
stream, as in many cases the length of the signature is longer than the length of the prefix used to create
the automaton. This causes the GPU scanning to become a first-level filtering, effectively offloading a
significant portion of work from the CPU. However, the longer the portion taken from each signature,
the fewer the false positives that the CPU will eventually have to scan.
When it comes to assigning input data to each thread, the simplest approach would be to use multiple data streams, one for each thread, each in a separate memory area. However, this would result in
asymmetrical processing for each shared multiprocessor, and would not scale very well (if the sizes of the
input vary, then the amount of work per thread will not be the same). As such, they decided to have
each thread scan a different portion of the input data stream. Using such a strategy means of course that
boundary detection is necessary.
The approach to this problem described in previous works, which is to scan each chunk and continue
on until reaching the length of the largest pattern, would not work in this case. As mentioned in Chapter
2, ClamAV’s signatures may contain wildcards such as ’*’, which means that we cannot determine that
signature’s size. And in that case, we no longer have a ”largest pattern” for which to calculate its length.
As such, the authors found that the best approach was to have each thread scan further from its chunk
until it reaches a fail or final-state. This avoids communication between threads regarding boundaries in
the input data buffer.
The measurements done reflect the efficiency of a DFA-based solution: they found that the system
had a maximum throughput of around 20 Gbps, which constituted a speedup of around 100 over the
CPU-only version.
3.4
Summary
In this chapter we have described relevant work performed in the area. Most of it focused on SNORT,
and as such the algorithmic choices were different from the ones we face. However, one piece of work is
close to ours, using GPUs to speed-up ClamAV. It brought a very important idea: that the GPU can be
used as a high-speed filtering mechanism. Using that idea, we built our system such that most of the GPU’s capabilities are taken advantage of. Our system differs from the one just mentioned in regard to
the work division: the solution by Vasiliadis et al. [31] was only capable of scanning a small number of
files in parallel, due to each thread being assigned to a different portion of a file. Our system however,
has its strength in scanning multiple files in parallel: that is where the biggest improvement is obtained (if only one file is scanned, then our system performs poorly).
Details of our implementation are described in the following chapter.
Chapter 4
System Architecture
In this chapter, we will explain our system’s architecture and the decisions behind it, along with some
optimizations done to improve overall performance.
4.1
Algorithm Decisions
ClamAV has a signature set that can basically be divided into basic strings and regular expressions,
and we want to scan files against these two signature subsets. String matching algorithms are varied, but
when it comes to regular expressions, the choice is between an NFA-based solution or a DFA-based one.
As we saw in the previous chapter, NFA-based solutions don’t work as well as DFA-based ones on GPUs:
NFAs need to backtrack each time a mismatch occurs, and as such a stack of already-processed states is needed (so as to know which state to backtrack to). This stack’s maximum size would be the length of the largest pattern, so it can grow to be quite large. Having multiple stacks per thread means that
this approach can be infeasible for GPUs.
Chapter 3 showed that the best possible scanning algorithm for regular expressions is a DFA-based
algorithm, as it has a complexity of O(n) (each byte is inspected once). It is also quite simple in its online
scanning phase (the automaton construction is the complex part), and as such its GPU implementation
should be somewhat straightforward.
However, for the signature set with no wildcards (or static strings), the choice is not clear. We can use
fixed-string algorithms like Boyer-Moore, or we can again use an automata-based one. As mentioned in
Chapter 3, ClamAV does use different algorithms for the different signature subsets (Aho-Corasick for the
signatures with wildcards, and Boyer-Moore for the ones without). As we are considering implementing
the system on a GPU, we cannot simply apply ClamAV’s techniques blindly: those algorithms were chosen due to ClamAV’s requirements for memory and speed. Namely, memory usage should be as low as possible - speed can be sacrificed to keep it low; in our system, we use all of the GPU’s available memory.
Also, as mentioned in Chapter 2, divergence is a big cause for performance loss in GPU programs, so we
need the simplest algorithm possible.
A Burrows-Wheeler-Transform-based algorithm would seem like a good choice, but it
also brings some issues. Such an algorithm works by applying the BWT not to the signature set, but to
the text that we want to scan in. In our use case, this would mean that for each and every different file
that we wish to scan, we would have to calculate its BWT, which means spending pre-processing time
that on other algorithms would be used for scanning. Other algorithms also do pre-processing, but they
do it on the signature data, which changes a lot less than the files to be scanned (signature data can
change once a day, which would entail doing the automata pre-processing phase once a day; with BWT,
the pre-processing has to be done for every file).
Besides, due to the algorithm that uses BWT, files might end up using much more memory space
(due to the suffix array that has to be kept as mentioned in Section 3.2.2) than what is available on a
GPU. BWT-based algorithms are popular for genome sequencing (due to the way sequencing works by splitting the DNA into small fragments and matching them), but our case is somewhat the inverse
of genome sequencing: we apply pre-processing to the queries, which are fixed (instead of on the text,
which is what BWT-based algorithms do). Due to this, BWT is not the best choice for use in our work.
We have also implemented the Wu-Manber algorithm (on the CPU) in order to compare its execution
time against a DFA-based solution, and we found that despite its complexity being theoretically better, it
actually performed worse on the ClamAV signature set. This might be due to the fact that the algorithm’s
performance is directly related to the length of the smallest signature, and ClamAV contains signatures
with only 3 bytes of length. This means that in the Wu-Manber algorithm, the biggest distance we can
shift by is 3.
Having such a small minimum signature length effectively limits the possible values for the shift table (256^3 entries at most). In our tests, performed with real-world files from the /usr/bin folder of a Linux
distribution, we have found that the algorithm spent most of its time trying to check if a match actually
occurred (the shift value was zero). As this process (checking if a match occurs) involves access to
different data structures, hash calculations and string comparisons, the more times we have to perform
it, the lesser the performance of our process. Given that the algorithm needs to run on a GPU, the thread
divergence caused by this process would be far too great.
Taking this into account, together with the fact that other static string matching algorithms didn’t
prove quite as fast as one would hope, we decided to use the same O(n), DFA-based matching algorithm
for both regular expressions and static strings. We also chose to represent the DFAs as 2-dimensional
transition tables. Other representations (such as adjacency lists) are more efficient in terms of space, but
less so in terms of processing complexity. As we are using GPUs, we want the processing of the state
transitions to be as simple as possible in order to minimize divergence.
4.2
Design Decisions
In this section, we explain some decisions made for our system.
4.2.1
Reducing Memory Space
As we saw in Chapter 3, ClamAV’s signature set can be very problematic for some automata-based
systems (designed for sub-string matching): the existence of a high number of “*” wildcards means
that if we are to build a full DFA, then at each point where we must represent the “*” wildcard there will
be a high number of additional states to represent the possibility of all of the other patterns occurring
at that point.
Due to the design choice of using an O(n) scanning algorithm, we found that the above problem was
present on our signature set, and as such we could not include the whole signature set to be scanned,
both in the case of regular expressions and static signatures. In our system we have two available GPUs,
and our total DFA size was larger than the sum of both GPUs’ available memory.
Truncating Signatures
We have thus decided that for each signature, only part of it would be considered, so as to reduce the
memory requirements. This will lead to occurrences of false-positives, but later in this section we will
describe a mechanism to greatly reduce them.
The choice of the signature subset to be used in the scanning is not trivial. Signatures have semantic meaning, and as such we cannot simply take the longest possible sub-string of each signature. As an example, consider the following signature: "ABCDEFGH*.rdata". Of the two signature parts, "ABCDEFGH" is the longer one, and if there were no semantic meaning it would be the one with the smaller chance of occurring in a file by accident (with no semantic meaning, the probability of a given string of size s occurring at a given position in a file is 1/256^s). However, if we are scanning Portable Executable files, the ".rdata" part has a much greater chance of occurring than "ABCDEFGH" (in fact, all PE files include the ".rdata" substring).
Our process for extracting substrings from signatures was the following:
• For static signatures, we simply chose a string of a given size (detailed later on) taken from the middle of the signature. The middle is usually the part with the most variance, as both the beginning and the end of a signature often contain bytes that also appear in benign files. For instance, a Portable Executable signature is very likely to contain the "DOS HEADER" substring at its beginning.
• For regular expressions, we first check whether the signature has a substring with no wildcards. If it does, we use that string. If it does not, we try to find the largest substring containing the fewest possible wildcards. A sketch of this extraction process is given below.
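The following is a minimal, host-side sketch of this extraction step. It assumes the signatures have already been decoded to byte strings and treats any '*' or '?' character as a wildcard; the function names and this simplified wildcard handling are illustrative and do not reproduce ClamAV's exact signature syntax.

    #include <string>

    // Static signatures: take a substring of length 'subLen' from the middle.
    std::string extractStatic(const std::string &sig, size_t subLen)
    {
        if (sig.size() <= subLen)
            return sig;
        size_t start = (sig.size() - subLen) / 2;    // centre of the signature
        return sig.substr(start, subLen);
    }

    // Regular-expression signatures: prefer the longest run without wildcards.
    std::string extractRegex(const std::string &sig)
    {
        std::string best, current;
        for (char c : sig) {
            if (c == '*' || c == '?') {              // simplified wildcard test
                if (current.size() > best.size())
                    best = current;
                current.clear();
            } else {
                current += c;
            }
        }
        if (current.size() > best.size())
            best = current;
        return best;  // the fallback to "fewest wildcards" is omitted for brevity
    }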
This mechanism of extracting smaller parts from signatures still yielded some false positives. In order to reduce them, we opted to introduce another step in the matching algorithm: recall from Chapter 3 that each signature has a file type associated with it. Instead of reporting every signature match regardless of file type, we check whether the file currently being matched has the same type as the signature that was just matched (we have modified ClamAV to expose its file type detection functionality, and we perform this detection before scanning a file). If it does, we report the match; if not, we carry on with the algorithm.
Even after this optimization to reduce false positives, some were still present. As a result, we manually chose the signature substrings with a low matching rate on a set of test files (corresponding to the /usr/bin folder of a Linux distribution). A more automatic method would be to choose a set of test files and, for each signature, enumerate all of its candidate substrings; then scan the test files with each candidate, and choose the one with the smallest matching rate. Such a process, although automatic, can be very costly in terms of execution time if the signature set is very large.
False positives were still not fully eliminated, though, because some signatures are especially problematic: some are very small, only 3 or 4 bytes in length, which means they have a very high probability of occurring anywhere in a file. In fact, all of these signatures have a file position associated with them, indicating that a match is only to be reported if it occurs at that position. Constantly checking positions in files is definitely not ideal for a GPU, as it would either massively increase the complexity (for each position in a file, checking the signatures associated with it) or bring divergence to the algorithm (some signatures can actually occur in intervals of positions, so checking them is not a trivial process, and would cause other threads to stall while waiting for that process to complete).
We have decided to keep this small number of remaining false positives in order to preserve matching efficiency. As they occur mainly in Portable Executable files, and the presence of such files as e-mail attachments should already be viewed with some suspicion, performing a full verification of them is not wasted time.
Some even more problematic signatures were left for future work; the problem is described in detail in Chapter 6.
Since we cannot use complete signatures, a false-positive verification mechanism has to be in place: it is possible that a given signature's part matches while the whole signature does not. If a match occurs, that signature needs to be verified, and in our system this is done by simply passing the corresponding file to ClamAV (which means we only scan a file up until a match is found). If no match occurs, which should be the case most of the time (considering that most files are not malicious), then no further verification is needed. If the files are in fact malicious, or are specifically constructed so that a match occurs in the pre-filtering phase, then the execution time greatly increases, as we shall see in Chapter 5.
Splitting the Automaton
The choice of using only a signature’s subset however is not enough to mitigate the memory space
issues (it could be enough, but in order to do so, we would have to consider a tiny subset of each
signature, which would massively increase the false-positive rate, at a point where manual interaction
would not yield less false positives). Another design choice done to resolve this issue was to divide the
final signatures DFA into several smaller DFAs. As these different DFAs are composed of a lesser number
of signatures, the subset construction algorithm doesn’t have to keep track of as many emulated states,
which not only allows it to run faster, but also causes the resulting automatons to be smaller.
In order to further reduce the number of states in each automaton corresponding to a signature subset,
we have decided to group signatures (in order to form a subset) based on shared prefixes. Signatures that
share the same prefix will also share the same DFA states.
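One simple way to perform such a grouping is sketched below: sorting the extracted substrings places those with common prefixes next to each other, so cutting the sorted list into contiguous slices tends to keep prefix-sharing signatures in the same automaton. The exact grouping heuristic used by our implementation may differ; names are illustrative.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Distributes the signature substrings over 'numGroups' automata so that
    // substrings sharing a prefix tend to end up in the same group.
    std::vector<std::vector<std::string>>
    groupByPrefix(std::vector<std::string> subs, size_t numGroups)
    {
        std::sort(subs.begin(), subs.end());          // common prefixes become adjacent

        std::vector<std::vector<std::string>> groups(numGroups);
        size_t perGroup = (subs.size() + numGroups - 1) / numGroups;
        for (size_t i = 0; i < subs.size(); ++i)
            groups[i / perGroup].push_back(subs[i]);  // contiguous slice per group
        return groups;                                // e.g. one group per multiprocessor
    }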
A diagram of the overall process is shown in Figure 4.1.
Figure 4.1: Processing the signatures and sending them to the GPUs.
4.2.2 GPU Memory Space for the Files and Work Division
Even though we want to utilize all of the GPU's memory, we cannot simply use all of it for the automatons. We want to scan files, and if we were to send these files to the GPU on each iteration of the algorithm (or even read them directly from the CPU's memory), the execution time would be severely impacted: memory transfers from the CPU to the GPU can be costly depending on the size of the data transferred.
It was straightforward to decide that the files to be scanned should reside in GPU memory for as long as the matching process takes place. This does limit the size of the automatons present on each GPU, but it is certainly a better option than constantly transferring the files from the CPU (or reading them from the CPU's DRAM, which has a large latency when accessed from the GPU).
Some space in the GPU’s global memory is therefore allocated to files. However it is possible that the
size of the files to be scanned is larger than this allocated space. In order to overcome this problem, we
chose to divide files into chunks (or parts). Each file to be scanned is divided into several parts, which
will separately be sent to the GPU to be scanned. Having files split into parts means that the GPUs
need to keep track of the states (and results) of the automatons in between the parts. For example, if the
matching process of a given file’s first part ends at state 1, then the next part should start to be matched
at state 1 instead of state 0 (the initial state), as to ensure that possible pattern matches between a file’s
parts are reported.
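A minimal sketch of this state carry-over is shown below, with one thread per file and a naive per-file-contiguous layout (the interleaved layout is introduced in Section 4.3.4). The kernel and buffer names are illustrative; states[f] holds the state at which file f's previous part ended.

    // Scans one part of every file; each file resumes from the state reached at
    // the end of its previous part and saves its new state for the next part.
    __global__ void scanPart(const unsigned char *parts, int partLen,
                             const int *table, int *states, int numFiles)
    {
        int f = blockIdx.x * blockDim.x + threadIdx.x;
        if (f >= numFiles)
            return;

        int state = states[f];                       // resume, not state 0
        for (int i = 0; i < partLen; ++i) {
            state = table[state * 256 + parts[f * partLen + i]];
            if (state < 0)                           // final state (Section 4.3.3)
                break;                               // recording the match is omitted
        }
        states[f] = state;                           // carried over to the next part
    }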
When it comes to the work division, taking into account our GPU’s (Tesla C2050) characteristics,
we have chosen to assign each of the smaller automata to a different multiprocessor: as they are already
independent from each other, we can exploit that fact to have threads process files in different automata
in parallel. Our GPU has 14 multiprocessors, so there are 14 smaller automata for each of the GPUs.
Having assigned each automaton to a different multiprocessor, and considering that most e-mail attachments at e-mail gateways are small files, we decided to assign each thread in a multiprocessor to a different file. A different work distribution would be possible, such as having multiple threads process different parts of the same file, but this would not be ideal, as only a small number of files could be scanned at a time; it would also make memory transfers from the CPU to the GPU a significant overhead, as they would be performed very often. Scanning multiple files at a time lets us perform fewer overall memory transfers (one transfer of a batch of files instead of multiple smaller transfers of single files).
In order to reduce possible divergence inside a warp in a multiprocessor, we have decided to order
the files sent to the GPU by their size. This way, if the files to be scanned have similar sizes, warps in a
multiprocessor will scan files with similar sizes, thus reducing the amount of time that threads wait for
the others before completing their work. We also avoid situations in which for instance there is a single
file whose size is bigger than all of the others. This would lead to the thread that is assigned to that
file having a much larger processing time than all of the others (which would have to wait for it before
completing).
Overall, the process of verifying a set of files is:
• 14 automatons are built for the static string signature subset and another 14 for the regular expression signature subset. The process takes around 2 hours for each, so it is expensive; signatures are not updated that often, though, so this is not a problem. The automatons reside in their GPU's texture memory until they are eventually replaced with new ones.
• At some point in time, the CPU is given a set of files. If the set’s total size is less than the available
remaining space in the GPU’s memory, then the files are simply sent to the GPU (in groups of
1024 - our GPU supports a maximum of 1024 threads per block, and remember that each block is
processing files for a different automaton). If the set’s size is bigger than the available size, then
files are divided into parts and groups of parts are sent to the GPU.
• Once the GPUs receive the data to be processed, they run the matching algorithm: states are
fetched from the state table resident in texture memory, and checked if they are final. If they
are, their corresponding signature file type is verified against the file type of the file present in the
current thread (these file types are saved in shared memory). If these match, then the current file
is marked as being suspicious and the thread responsible for it stops processing.
• Once all of the threads in both GPUs stop processing, then their results (files marked as suspicious)
are aggregated and sent to ClamAV for a full verification.
Figure 4.2 shows a diagram of the matching process.
Figure 4.2: Matching process.
4.2.3 Mapping to the Architecture
As each of our GPUs has 2.7 Gigabytes of available memory, we decided that having around 2
Gigabytes for the automatons was a good enough amount of space.
We found that the maximum number of states for each automaton was around 130,000 (recall that each state has 256 transitions, since the alphabet is 1 byte in size, and each state number is represented as a 4-byte integer; 14 automata × 130,000 states × 256 transitions × 4 bytes is roughly 1.9 GB). The total size of the automatons is in fact slightly less than 2 GB because there is another restriction on the automatons' maximum number of states: it is related to how the automatons are stored in the GPU's memory, and it limits the maximum number of states to 130,000. Details are presented in Section 4.3.2.
The remaining space is allocated to the files, amounting to around 700 MB. As the system was designed to run on an e-mail gateway, which handles mostly small files, this value does not pose any problems.
We mentioned earlier that in order to choose a substring from each of the static signatures, one was
taken from the middle of each signature. We have found the optimal value for this substring’s size to
be 22 bytes. This is the largest value that we can choose so that our automatons do not exceed the
aforementioned space limit.
For regular expressions, a similar mechanism was used: if there was a substring with no wildcards exceeding a given size, we chose that substring. That size was found to be 25 bytes: it is the smallest size that incurred only a small rate - around 3% - of false positives on files from /usr/bin. If there is no such substring, we find the largest substring with the fewest possible wildcards. We found that, in order to respect the size restrictions on the automata, we can only keep some "??" wildcards (around 1600 in a total of around 8500 regular expressions) and "OR" expressions (stating that one of several given characters can be present at a given position).
4.3 Optimizations
After implementing our system according to the decisions described in the previous section, we found that its performance was not satisfactory: little or no improvement was obtained over an equivalent CPU version of the algorithm. In order to improve the system's performance, we implemented some optimizations, which are described in this section.
4.3.1 Memory Accesses
Automata-based scanning algorithms are computationally quite simple: all they do is repeatedly fetch values from the state-transition table, with each value to be fetched depending on the previous one. But being computationally simple does not mean the algorithm cannot pose a problem to some architectures: a minimum of two memory accesses is required for every algorithm iteration (one to fetch the next input character and one to fetch the next state), which can be costly if the accessed positions are not present in the cache. Also, verifying whether a given state is final translates into yet another memory access.
As was seen in Chapter 2, the largest memory present on a GPU is also the slowest, and it does not have the support of a large cache as is the case on CPUs. Performing several memory accesses per algorithm iteration is quite costly, and effectively becomes the GPU's bottleneck.
In a first implementation of our algorithm, we performed all of these accesses on each iteration, and we found that the performance when processing 64 files in parallel was actually worse than that of a serial CPU version: threads were constantly waiting for their turn to access global memory.
4.3.2 Texture Memory
As also seen in Chapter 2, texture memory has a separate cache and is designed for algorithms with 2D locality. Automata-based algorithms can exhibit such locality (depending on the input data, the algorithm may constantly fetch the same states), so texture memory was a natural choice to hold our automatons instead of global memory. Global memory is in fact a poor fit for the automatons: they have random access patterns, as the states to be fetched depend on the input data, which cannot be predicted. Global memory does not handle random access patterns well: if two threads request values from two memory addresses that are distant from each other, the GPU performs two serialized memory accesses; however, if the two addresses are contiguous, a single memory transaction is performed and the two values are broadcast to the requesting threads. Global memory is therefore better suited to linear access patterns. One data structure in our algorithm has such a pattern: the input data. As we will see in a following section, performing memory accesses to contiguous addresses greatly improves the execution time over performing them to addresses that are far apart.
Using texture memory is not trivial though: the data structure to be used has to be bound to a
texture, and thus it must adhere to some restrictions on its dimensions’ size. Namely for 2D arrays, each
of the dimensions (x and y) must not exceed 65,536 elements in size.
Our automata (shown in Figure 4.3) do not adhere to these restrictions: the y dimension is larger
than 65,536.
Some re-structuring of the data is therefore needed: we decided to split each of the automaton parts (recall that we have a separate automaton for each multiprocessor in a GPU) into two separate sub-parts, so as to adhere to the 65,536-element restriction. This is shown in Figure 4.4.
This, however, means that we also need to resize each of the automata: if we are only to have two parts with 65,536 elements each (in a dimension), then we can only have 65,536 × 2 = 131,072 elements in total for that dimension.
Figure 4.3: Structure of the automata.
Figure 4.4: Automata after the re-structuring for texture memory use.

With this re-structuring, the addressing must also change. Previously, if we wanted to fetch, say, the state reached by the transition on character 'A' from state 70,000, we would go to the table cell with coordinates <65; 70,000> (65 is the decimal value of the 'A' character). Now, however, the table only has 65,536 values in the vertical dimension. Each time we want to access values for states numbered 65,536 or above, we must change coordinates: the vertical coordinate becomes its value modulo 65,536, and 256 is added to the horizontal coordinate.
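A sketch of this coordinate change is shown below, assuming the split transition table is bound to a 2D integer texture (512 columns by 65,536 rows) with point filtering and unnormalized coordinates; the texture reference and function names are illustrative.

    texture<int, 2, cudaReadModeElementType> stateTable;   // bound elsewhere

    // Fetches the transition for 'input' out of 'state', remapping the
    // coordinates for states numbered 65,536 and above.
    __device__ int nextState(int state, unsigned char input)
    {
        float x = input;                  // column: the input byte (0..255)
        float y = state;                  // row: the current state
        if (state >= 65536) {             // state lives in the second sub-part
            y = state % 65536;            // vertical coordinate wraps around
            x = input + 256;              // horizontal coordinate shifts by 256
        }
        return tex2D(stateTable, x, y);   // e.g. state 70,000, 'A' -> <321; 4,464>
    }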
4.3.3 Final States and Signature Types
At every iteration of the algorithm, we need to check whether the state fetched from the automaton is final. Performing this extra memory access is quite expensive and most of the time useless, as most files on an e-mail server (where the system is to be placed) are not malicious.
Also, as described above, we need to keep in memory the file type corresponding to each signature in order to reduce the false-positive rate. Keeping these values in global memory would not be ideal: they are only accessed when a final state has been found, and they are not accessed in an orderly fashion (so the accesses could not be coalesced).
To save these memory accesses, the same idea is applied to both: due to the choice of using texture memory, each automaton can only have 130,000 states, which means that no cell of the state-transition matrix holds a number larger than 130,000. However, each value still occupies 4 bytes, and many of its bits are unused (the value 130,000 does not need them all to be represented).
Taking this into account, we decided to represent final states as negative values. Checking whether a state is final is now a trivial operation, compared to accessing a value in the GPU's memory.
The file type corresponding to a given signature is also embedded in that signature's final state number: since ClamAV only has 10 file types, numbered 0 to 9, every time we reach a final state in the NFA-to-DFA conversion process we simply multiply that state's number by 10 and add the corresponding signature's type. As an example, say that state number 100 is associated with a signature whose file type is 5; that state's number becomes 100 × 10 + 5 = 1005.
Conversely, obtaining the file type for a given final state reduces to taking the remainder of the division of that state's number by 10. For the above example, the file type is recovered as 1005 mod 10 = 5.
This has a limitation, however: if new file types appear in ClamAV, these operations no longer work. A better solution would be to use a bitwise shift, which would reserve exactly the number of bits needed for the type. In our current solution, we use all of the 4 bytes.
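The encoding and decoding described above can be summarized by the small helpers sketched below; the names are illustrative, and the bit-shift variant mentioned as a better alternative is not shown.

    // A final state's number is multiplied by 10, the signature's ClamAV file
    // type (0..9) is added, and the result is negated so that finality can be
    // tested with a simple sign check instead of an extra memory access.
    __host__ __device__ inline int encodeFinalState(int stateNumber, int fileType)
    {
        return -(stateNumber * 10 + fileType);       // e.g. state 100, type 5 -> -1005
    }

    __host__ __device__ inline bool isFinal(int storedValue)
    {
        return storedValue < 0;
    }

    __host__ __device__ inline int finalStateFileType(int storedValue)
    {
        return (-storedValue) % 10;                  // 1005 mod 10 = 5
    }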
4.3.4 Coalescing
GPUs benefit greatly from aligned memory accesses. If the global memory accesses of a warp are not to sequential addresses, the GPU's hardware has to perform a separate memory transaction (occupying the bus) for each access. If they are aligned, a single transaction can be performed and the fetched values broadcast to the warp's threads.
In our system, as mentioned earlier in the chapter, each thread in a multiprocessor is assigned to a different file, and naturally they all start processing their file at its beginning. Having all threads access contiguous memory locations means that the files cannot be stored contiguously: they need to be interleaved, so that all threads access the same file offset at the same time, at contiguous addresses. Figure 4.5 shows both of these situations.
Figure 4.5: Contiguous and interleaved files. Blocks marked with an X have no meaningful data in them,
as they are out of a file’s boundaries.
Files are stored contiguously on disk, so in order to have them interleaved, we need to perform an operation on them. As a set of files can be represented by a matrix, interleaving them amounts to transposing that matrix.
This transpose is not a trivial task, though. For a set of test files totaling 600 megabytes, it takes around 2 seconds, which is a significant amount of time considering that the GPU's processing time was around 9 seconds.
An alternative was studied: the problem with non-coalesced accesses is that the transaction bus is not fully utilized. Threads must wait for others to use the bus to perform their accesses, which only fetch 4 bytes at a time (in order to reduce the total number of memory accesses, instead of having each access fetch a single byte, we group 4 consecutive bytes from each file into a single data structure, which is what is fetched on each access). We increased the amount of data that each memory access fetched (so the transaction bus occupation would increase) and found that coalesced accesses were still faster: having a single memory transaction broadcast values to every requesting thread provides a great benefit.
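The interleaved layout and the 4-byte grouping can be sketched as follows: the i-th 4-byte word of file t is stored at position i × numFiles + t, so the threads of a warp read consecutive words on every iteration. This is a simplified sketch; the per-multiprocessor automaton assignment, the texture fetches and the file-type check are omitted, and the names are illustrative.

    // One thread per file; files are stored column-wise (interleaved) so that
    // accesses within a warp are coalesced. Each access fetches 4 bytes (uchar4).
    __global__ void scanInterleaved(const uchar4 *data, int numFiles, int wordsPerFile,
                                    const int *table, int *suspicious)
    {
        int file = blockIdx.x * blockDim.x + threadIdx.x;
        if (file >= numFiles)
            return;

        int state = 0;
        for (int i = 0; i < wordsPerFile; ++i) {
            uchar4 w = data[i * numFiles + file];    // coalesced across the warp
            unsigned char bytes[4] = { w.x, w.y, w.z, w.w };
            for (int b = 0; b < 4; ++b) {
                state = table[state * 256 + bytes[b]];
                if (state < 0) {                     // final state reached
                    suspicious[file] = 1;
                    return;
                }
            }
        }
    }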
Figure 4.6: Difference between sending a whole file and sending it in two parts.
4.3.5 Double-buffering
Transposing a non-square matrix takes some time on the CPU. If we were to perform this transpose every time we want to send files to the GPU, some inefficiency would be present: the GPU would be waiting for the CPU to finish the transpose before being able to process the data.
Likewise, sending the data from the CPU to the GPU also takes some time, although far less than that needed for the transpose (on our card, sending around 600 MB takes around 40 milliseconds).
In order to cut these times, we have chosen to partition the data to be sent: the GPU effectively processes all of the data as before, but not all at once.
We divide each file into parts, and we only send parts of files at a time. The smaller the parts (and thus the greater the number of parts per file), the less time we spend performing each transpose, but the more kernel launches we perform on the GPU.
Separating the files into parts allows us to perform background execution on the CPU: while the GPU is processing a given batch of file parts, the CPU is already transposing the next one; the CPU even sends it to the GPU while the previous batch is still being processed (this is possible due to the use of page-locked memory on the CPU).
Once a batch is finished, the GPU then has to wait a very small amount of time (compared to waiting for a transpose) before it can start processing the next one. This wait is due to the fact that splitting files into parts means we have to keep track of states in between file parts, so as not to lose any possible match information. So each time a batch is finished, the GPU copies the last processed states and results over to the next batch.
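A simplified sketch of this double-buffering scheme is shown below. It assumes two page-locked host buffers (hostBatch) with matching device buffers (devBatch), a CPU-side transposeBatch() routine, and a scanBatch kernel; all of these names, as well as the exact stream and event handling, are illustrative rather than the precise code of our implementation.

    cudaStream_t copyStream, execStream;
    cudaEvent_t copied[2], scanned[2];
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&execStream);
    for (int b = 0; b < 2; ++b) {
        cudaEventCreate(&copied[b]);
        cudaEventCreate(&scanned[b]);
    }

    for (int k = 0; k < numBatches; ++k) {
        int buf = k % 2;
        if (k >= 2)
            cudaEventSynchronize(scanned[buf]);      // buffer no longer read by an older kernel
        transposeBatch(files, k, hostBatch[buf]);    // CPU transpose overlaps the running kernel
        cudaMemcpyAsync(devBatch[buf], hostBatch[buf], batchBytes,
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(copied[buf], copyStream);
        cudaStreamWaitEvent(execStream, copied[buf], 0);   // kernel waits only for its own copy
        scanBatch<<<14, 1024, 0, execStream>>>(devBatch[buf], devStates, devResults);
        cudaEventRecord(scanned[buf], execStream);
    }
    cudaDeviceSynchronize();                         // all batches processed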
Figure 4.6 shows the overall scanning process under this scheme and how it differs from simply sending the whole files.
As we can see, even though small copies need to be performed in between the processing of parts (because, as mentioned earlier in this chapter, we need to keep track of states between parts), being able to perform the transpose operation on the files in the background decreases the total execution time.
The chart in Figure 4.7 shows the improvement obtained by sending files in several parts with double-buffering, compared to sending the whole files at once.
The execution times were measured by scanning a set of 3431 test files of various sizes, corresponding to images, PDF files, videos, audio files and others. The time when each file is sent as a single part is our baseline, and what we wish to improve.
Figure 4.7: Difference between the execution times of sending each file as a single part and sending it in several parts.

We can see that the optimal time was reached when files were divided into 16 parts: as suggested by Figure 4.6, the execution time of the transpose of the first part decreases as the part's size decreases, while the rest of the total time is unchanged. So as we further divide each file into smaller parts, this initial transpose time becomes smaller. However, execution times for more than 16 parts depend on the files themselves: it is possible to divide the files into so many parts that each part contains only 4 bytes, which is the number of bytes that each thread requests on each iteration. Having such small parts can make the copying of states from a previous part to the current one the bottleneck of the system: with many small parts, the processing time for each one is very small, but more of the aforementioned state copies are needed between parts. These copies cannot be overlapped with execution, so the more of them we perform, the longer the execution time becomes.
Based on this set of test files, which corresponds to a real-world scenario, we have decided to divide
each of the files into 16 parts, the optimal value found for that set.
4.4 Summary
We have described our system’s architecture, along with the decisions that were behind it: truncating
the signatures and splitting the automaton in order to reduce memory space. We then presented some
implemented optimizations, which allowed the system to have good performance values. In the next
chapter, we will see how it compares to ClamAV’s pattern matching module on sets of test files.
Chapter 5
Evaluation
In previous chapters, we described related work in the area, along with our system's architecture. In this chapter, we present results on how our system performs when running on different sets of test files.
5.1 Experiment Setup
For our experiments, we use a computer equipped with two NVIDIA Tesla C2050 cards, an Intel i7 950
processor and 12GB of RAM. All of the times for ClamAV were obtained by calling the clamd daemon
and then scanning files using clamdscan. This way, the time used to initialize ClamAV’s data structures
is not counted in the total scan time. Additionally, we could not reliably profile the pattern matching portion of ClamAV on its own. Thus, considering that, by Table 3.2, the pattern matching portion of ClamAV accounts for approximately 80% of its time, we have taken the times given by clamdscan and multiplied them by 0.8 to obtain an approximation of the time spent in pattern matching. The execution times for the GPU are given by the larger of the two GPUs' execution times: if one GPU finishes processing before the other, it waits before the verification module is executed. The times for the CPU are measured with an equivalent CPU version of the GPU algorithm: two threads run in parallel, one processing files against the regular expression automatons and the other against the static signature automatons. When both are done, the verification module is called for any file in which a signature matched.
5.2 Results
Our algorithm is memory-bound, so the fewer memory accesses it performs, the faster it runs. This means that performance is highly dependent on the sizes of the files to be scanned.
The chart in Figure 5.1 shows just that. It shows the total execution time for the GPU when scanning constant-character files, which are composed of repetitions of a single character (so there are no sub-signature matches on either GPU, and the texture cache is always hit, as the algorithm continuously fetches the same state). The total size of the files is the same (around 670 Megabytes), but more files were progressively added (so each file is smaller than in the previous measurement). As we can see, having around 1024 files yielded the smallest total execution time: this is because each file is smaller than in any of the other measurements, and as such fewer memory accesses are performed in total.
Figure 5.1: Execution times for random files with the same size.

But comparing the execution time of our GPU algorithm with that of ClamAV on files that have no structure does not constitute a realistic scenario: users usually send pictures, documents, archives, and other files which have some structure. ClamAV performs some pre-analysis on the files, and files without any structure are skipped, leading to a very small scanning time.
We have thus performed the above comparison, but with real files (with equal sizes - again reducing
each of the file’s size when adding more files to be scanned): images, videos, archives, PDFs, and others
(including one executable).
The chart on Figure 5.2 shows the results.
As with the files without any structure, the biggest improvement was seen when scanning smaller files. There is one nuance in this set of test files, though: one of the files is a Portable Executable. As we mentioned in Chapter 4, some signatures are very small and occur very frequently in executable files. Thus, each time a Portable Executable file is scanned, those signatures match, causing the file to be marked as suspicious and subsequently verified by ClamAV. In our test set only one of the files is an executable, so the verification process is very fast (ClamAV takes a significant amount of time when scanning many files, not when scanning a single one).
But this set of test files is still not completely realistic, as all of the files have the same size, and thus there is no divergence. We tested our algorithm on a more realistic set of files: files from the /usr/bin folder of a Linux distribution, which contains files of different sizes. We varied the number of files scanned, along with the total amount of data and the size of the largest file, to see what impact each of these characteristics had on our system's performance.
Table 5.1 shows the results.
We can see that, as expected, the biggest speed-up was obtained when scanning a large number of files with relatively small sizes: it appears that ClamAV does not handle a large number of files very well, but our system does. Scanning a small number of files, however, led to our system performing worse than ClamAV. As such, we can conclude that our system should only be used in situations where many files have to be scanned and, ideally, these files should have limited sizes. This is the case in e-mail gateways: they handle many e-mail attachments, and e-mail servers usually limit attachments to a given size.
Figure 5.2: Execution times for files with the same size.

Number of files | Total size (bytes) | Size of largest file (bytes) | GPU (ms) | CPU (ms) | ClamAV (ms) | Speedup (vs ClamAV)
10              | 114,870            | 44,552                       | 40       | 11       | 34          | 0.9
100             | 17,815,178         | 3,974,440                    | 2,744    | 660      | 830         | 0.3
1000            | 57,702,247         | 663,936                      | 956      | 2,364    | 1,829       | 2.7
2000            | 85,604,681         | 340,376                      | 674      | 3,329    | 10,218      | 15.1
3431            | 97,291,831         | 170,176                      | 527      | 3,667    | 14,691      | 27.8
Table 5.1: Execution times for files with different sizes.

Figure 5.3: Execution times for different numbers of false positives.

In Chapter 4, we mentioned the possibility of false positives being present in the system, and we described a mechanism to verify them: we simply use ClamAV itself to verify any file in which a signature matched.
This characteristic of our system is in fact one of its weak points, as it can be exploited by a malicious user constructing specially crafted files. In order to see the impact of such an attack on the system, we measured the execution time for the above 1024 (real) files while adding a signature to more and more of them, so that a match would be reported by the GPU. The more files (out of 1024) contain a match, the more work the verification module (ClamAV) must perform.
The chart on Figure 5.3 shows the results.
In the case where only one false positive out of 1024 is present, our system performs well, as expected (ClamAV only has to scan a single file, which is a fast process). However, as the number of files needing to be verified increases, so does our system's total execution time. At the point where more than 512 files out of 1024 need to be verified, our system's performance is worse than ClamAV's, thus defeating the purpose of having implemented our system.
Despite being an extreme case (as the vast majority of files are not malicious), this test helps us understand what would happen in the case of a targeted attack.
Chapter 6
Future Work and Conclusions
Our system achieves a good speed-up over the CPU-only ClamAV in some cases. However, in other cases it does not perform as well. In this chapter, we propose future work that could be done to address some of the drawbacks present in our system.
6.1 Problematic Signatures
In our system, some false positives were manually resolved by choosing a different substring of a signature than the one that caused the false positive. However, for some signatures this is not possible. These signatures are small variations of one such as "62696e{-10}47455420{-50}2e657865{-5}627965". We cannot include wildcards such as those present in these signatures, because combining a full automaton with the relatively high number of regular expressions leads to state-space explosion. Including parts of this signature ("62696e" and "47455420" are two of its parts, for instance) would be the natural solution. However, this signature can match any file (it is not restricted to any file type), and all of its parts had a high occurrence in our test files. Additionally, considering the sub-signature "2e657865{-5}627965" was not helpful, as it was also present in the test files. This family of signatures was thus left for future work, as at the time the focus had to shift to implementing and optimizing our algorithm on the GPU. As for how these signatures could be included: we could either reduce the size of all of the other signatures in order to accommodate the wildcards present in this signature family, or we could have the matching algorithm separately report matches of each of the parts of these signatures; after processing, we could check whether all of the parts matched and, if so, further verify the signature with ClamAV.
6.2 Scanning Method for Large Files
Currently, the biggest problem with the algorithm is the way it handles large files. When a large file is sent to be processed by the GPU, a single thread takes responsibility for it and scans it in a serial manner. This causes a great deal of inefficiency, as all of the other threads are idle. This is an extreme example, but similar situations can happen in other cases: for instance, with 1025 files to be scanned, since our system scans files in groups of 1024, a single file could end up being scanned in isolation. If this file is large, then our system performs poorly.
As future work, we propose a new manner of scanning such large files: when a large file is sent to the GPU, instead of a single thread being assigned to it, we should assign several (optimally, the number of remaining available threads). Each of these threads should then scan its assigned piece of the file, while respecting some boundary conditions. These conditions relate to the patterns that may occur: as a pattern might span two pieces assigned to two different threads, each thread should not stop scanning at the end of its assigned piece, but instead continue past it by the length of the smallest pattern. Doing so, we can guarantee that any pattern occurring in between pieces is correctly detected and reported.
This is the approach taken by related work in the area (such as [31]), so it has been proven effective.
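A sketch of the proposed kernel is shown below: each thread scans one piece of the large file plus an overlap region past its boundary, sized according to the boundary condition described above, so that a pattern straddling two pieces is still seen in full by one of the threads. The names are illustrative and the result handling is simplified.

    // One thread per piece of a single large file; 'overlap' implements the
    // boundary condition so that patterns crossing piece boundaries are found.
    __global__ void scanLargeFile(const unsigned char *file, int fileLen, int pieceLen,
                                  int overlap, const int *table, int *matchFound)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int begin = t * pieceLen;
        if (begin >= fileLen)
            return;

        int end = min(begin + pieceLen + overlap, fileLen);   // scan past the boundary

        int state = 0;
        for (int i = begin; i < end; ++i) {
            state = table[state * 256 + file[i]];
            if (state < 0) {                         // final state: a sub-signature matched
                *matchFound = 1;
                return;
            }
        }
    }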
6.3 Incremental Automata Changes
Currently, each time the signature set changes (for instance, when a new virus signature is added), we re-calculate all of the automatons. This means that propagating a change in the signature set to our system can take an hour, as that is the time needed to rebuild the automata.
Alternatively, each time a signature is added, we could create a DFA for it and merge it with one of the existing ones. While this would also take some time (the back-transitions would have to be re-calculated for that automaton), it would take less time than rebuilding all of the automata.
6.4 Transposing the Files using SSE Instructions
In our system, we use coalesced accesses to global memory so as to increase the transaction bus occupation rate. However, using coalesced accesses means that the files need to be stored in an interleaved manner, which in turn means that we need to apply a transformation (a transpose) to them in order to interleave them. We have shown in Chapter 4 that this transpose takes a significant amount of time (which was reduced by splitting the files to be scanned into several parts).
In order to further speed up the transpose of the files, we propose using the Streaming SIMD Extensions (SSE) instructions available on modern CPUs to optimize the transpose process.
6.5 Verification Mechanism
In Chapter 4, we showed that in our system the verification process used to confirm whether a given signature matches in a file is simply to pass that file to ClamAV. This can be inefficient, though: if a given signature matches, we only need to verify the file against that signature.
An alternative to our current solution would be to associate all of a signature's details (file type, positions in a file where it occurs, etc.) with the sub-signature chosen to be used in the automaton building process.
When a sub-signature matches, we would then take its details and use the Aho-Corasick algorithm to completely verify whether the full signature is present in the file (this is what ClamAV does).
However, this means that we could no longer stop scanning when we find a match: if we did, some other occurring signature might be lost, as the verification process would only check for the first signature that matched. We would have to scan the entire file, but this is not an issue: it is effectively what already happens in our dual-GPU system. Even if all of the files to be scanned have a match on one GPU, that GPU still has to wait for the other one to finish its processing before the verification process can take place.
6.6 Conclusions
We started by presenting the problem we set out to solve, along with background on it. We gave an introduction to general-purpose computing on GPUs, and showed how they can be utilized to achieve high performance. We then presented related work in the area, which served as a basis for our own.
We then detailed our system, an O(n) algorithm running on two GPUs which pre-filters files at high speed in order to detect whether possible virus signatures are present in them. It differs from other related work using ClamAV in that it is capable of scanning a large number of files in parallel.
The system was found to have a good speed-up in the case it was designed for: an e-mail gateway with a large number of files of limited size. A number of optimizations were implemented in order to achieve this performance, which was not possible with a first implementation.
Our solution did not perform as well in some cases, and it is open to targeted attacks. Future work was then proposed to mitigate some of these issues and further improve the solution's performance.
Bibliography
[1] AMD Stream. www.amd.com/stream.
[2] Bro Network Security Monitor. http://www.bro-ids.org.
[3] ClamAV. http://www.clamav.net/lang/en/.
[4] NVIDIA CUDA. http://www.nvidia.com/object/cuda_home_new.html.
[5] NVIDIA CUDA Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA
[6] OpenCL. http://www.khronos.org/opencl/.
[7] Portable Operating System Interface. http://pubs.opengroup.org/onlinepubs/9699919799.
[8] Snort. http://www.snort.org/.
[9] Sourcefire. http://www.sourcefire.com.
[10] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search.
Commun. ACM, 18:333–340, June 1975.
[11] Michela Becchi and Patrick Crowley. A hybrid finite automaton for practical deep packet inspection.
In Proceedings of the 2007 ACM CoNEXT conference, CoNEXT ’07, pages 1:1–1:12, New York, NY,
USA, 2007. ACM.
[12] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20:762–
772, October 1977.
[13] Niccolo Cascarano, Pierluigi Rolando, Fulvio Risso, and Riccardo Sisto. iNFAnt: NFA pattern
matching on GPGPU devices. SIGCOMM Computer Communication Review, 40(5), oct 2010.
[14] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings of the
41st Annual Symposium on Foundations of Computer Science, FOCS ’00, pages 390–, Washington,
DC, USA, 2000. IEEE Computer Society.
[15] Todd J. Green, Ashish Gupta, Gerome Miklau, Makoto Onizuka, and Dan Suciu. Processing xml
streams with deterministic automata and stream indexes. ACM Trans. Database Syst., 29:752–788,
December 2004.
[16] R. Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501–
506, 1980.
[17] N.-F Huang and W.-Y Tsai. SHOCK: A Worst-Case Ensured Sub-Linear Time Pattern Matching
Algorithm for Inline Anti-Virus Scanning. In ICC 2010 - 2010 IEEE International Conference on
Communications, pages 1–5. IEEE.
[18] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings,
1974.
[19] Charalampos S Kouzinopoulos and Konstantinos G Margaritis. String Matching on a multicore GPU
using CUDA. In 2009 13th Panhellenic Conference on Informatics, pages 14–18. IEEE, 2009.
[20] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler
transform. Bioinformatics, 26(5):589–595, March 2010.
[21] Cheng-Hung Lin, Sheng-Yu Tsai, Chen-Hsiung Liu, Shih-Chieh Chang, and Jyuo-Min Shyu. Accelerating String Matching Using Multi-threaded Algorithm on GPU. In GLOBECOM 2010 - 2010
IEEE Global Communications Conference, pages 1–5. IEEE.
[22] Benjamin Livshits, Margus Veanes, David Molnar, and Lubomir Litchev. Generating Fast String
Manipulating Code Through Transducer Exploration and SIMD Integration. Technical Report MSR-TR-2011-124, Microsoft Research, 2011.
[23] S. Fukamachi, T. Nishimura, and T. Shinohara. Speed-up of Aho-Corasick pattern matching machines
by rearranging states. In SPIRE, 2001.
[24] M. O. Rabin and D. Scott. Finite automata and their decision problems. IBM Journal of Research
and Development, 1959.
[25] Martin Roesch. Snort - lightweight intrusion detection for networks. In Proceedings of the 13th
USENIX conference on System administration, LISA ’99, pages 229–238, Berkeley, CA, USA, 1999.
USENIX Association.
[26] Randy Smith, Cristian Estan, and Somesh Jha. Xfa: Faster signature matching with extended
automata. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 187–201,
Washington, DC, USA, 2008. IEEE Computer Society.
[27] Randy Smith, Neelam Goyal, Justin Ormont, Karthikeyan Sankaralingam, and Cristian Estan. Evaluating gpus for network packet signature matching. In Proceedings of the International Symposium
on Performance Analysis of Systems and Software (ISPASS), 2009.
[28] Daniel M. Sunday. A very fast substring search algorithm. Commun. ACM, 33(8):132–142, August
1990.
[29] Ken Thompson. Programming techniques: Regular expression search algorithm. Commun. ACM,
11(6):419–422, June 1968.
[30] G. Vasiliadis, S. Antonatos, and M. Polychronakis. Gnort: High performance network intrusion detection using graphics processors. In Recent Advances in Intrusion Detection, 2008.
[31] Giorgos Vasiliadis and Sotiris Ioannidis. GrAVity: a massively parallel antivirus engine. In RAID’10:
Proceedings of the 13th international conference on Recent advances in intrusion detection. Springer-Verlag, September 2010.
[32] Giorgos Vasiliadis, Michalis Polychronakis, Spiros Antonatos, Evangelos P. Markatos, and Sotiris
Ioannidis. Regular expression matching on graphics hardware for intrusion detection. In Proceedings
of the 12th International Symposium on Recent Advances in Intrusion Detection, RAID ’09, pages
265–283, Berlin, Heidelberg, 2009. Springer-Verlag.
[33] Sun Wu and Udi Manber. A fast algorithm for multi-pattern searching. Technical report, 1994.
[34] Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, and Randy H. Katz. Fast and memory-efficient
regular expression matching for deep packet inspection. In Proceedings of the 2006 ACM/IEEE
symposium on Architecture for networking and communications systems, ANCS ’06, pages 93–102,
New York, NY, USA, 2006. ACM.