PROCESS MINING WITH SEQUENCE CLUSTERING
Miguel Dias Garutti Malheiros
Dissertation for obtaining the Master of Science Degree in
Information Systems and Computer Engineering
Jury
President:
Prof. Pedro Sousa
Supervisor:
Prof. José Tribolet
Members:
Prof. Diogo Ferreira
Prof. Miguel Mira da Silva
July 2007
Acknowledgments
To professors Diogo Ferreira and Marielba Zacarias for the enormous support they gave me and their absolute availability. The effort they made so that this work would succeed went far beyond their obligations.
To professor José Tribolet for the guidance he gave me and for his suggestions and opinions on several fundamental aspects of my life as a student.
To my parents, my grandparents and my girlfriend for the unconditional support, motivation and confidence they always gave me. A special thank you to my mother for the example of strength and hard work she set for me.
To my colleague Pedro Ferreira for the exchange of ideas and the support he gave me.
Abstract
The goal of this work was to test an approach to process mining that from an event log containing
event records originating from different tasks or processes could identify sequences of related
events in a completely automatic way. We also proposed to apply a sequence clustering
algorithm to these sequences. The sequence clustering algorithm we used groups similar
sequences in clusters and constructs a model, in the form of a Markov chain, for each cluster.
In order to generate event logs, we developed an application that executed sequences of actions
over a small database and recorded these actions in logs. After capturing the event log it was
necessary to determine what sequences were in it. We analyzed all events and established
connections between them according to a specific criterion. The resulting graphs, where nodes are events and edges are connections between events, constitute sequences of actions when their nodes are sorted chronologically. These sequences were fed as input to the sequence clustering algorithm.
All the different types of sequences were identified correctly, except one whose model was slightly
different from what it was supposed to be. Moreover, the algorithm was able to correctly identify
which types of sequences were similar to others, thus indicating what types were variations of
other types.
We concluded that with careful preparation of the input data the sequence clustering algorithm
proved to be a good tool for modelling different types of sequences of actions and for making
explicit the differences between these types.
Keywords: process mining, sequence clustering, event log, log preprocessing
Resumo
O objectivo deste trabalho era testar uma abordagem ao process mining onde, a partir de um log
de eventos contendo eventos originados por diferentes tarefas ou processos, conseguíssemos
identificar sequências de eventos de um modo totalmente automático. Propusemos também
aplicar um algoritmo de sequence clustering a estas sequências. O algoritmo de sequence
clustering que usámos agrupa sequências semelhantes em clusters e constrói um modelo, na
forma de uma cadeia de Markov, para cada cluster.
De modo a gerar logs de eventos desenvolvemos uma aplicação que executa sequências de
acções sobre uma pequena base de dados e registámos estas acções em logs. Depois de
termos um log de eventos, era necessário determinar que sequências nele existiam. Analisámos
todos os eventos e criámos ligações entre eles segundo um critério por nós escolhido. Os grafos
resultantes, onde os estados são eventos e as arestas são ligações entre eventos, representam
sequências quando os seus nós são ordenados cronologicamente. Estas sequências
constituíram o input fornecido ao algoritmo de sequence clustering.
Todos os tipos de sequências foram identificados correctamente excepto um, cujo modelo era
ligeiramente diferente do que devia ser. O algoritmo foi também capaz de identificar
correctamente que tipos de sequências eram mais parecidos entre si, consequentemente
indicando que tipos eram variações de outros tipos.
Concluímos que, com uma cuidadosa preparação dos dados de entrada, o algoritmo de sequence
clustering provou ser uma boa ferramenta para modelar diferentes tipos de sequências de
acções e para explicitar as diferenças entre esses tipos.
Palavras-chave: process mining, sequence clustering, log de eventos, pré-processamento de
log
TABLE OF CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
RESUMO
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
CHAPTER 2: STATE OF THE ART
PROCESS MINING APPROACHES
JONATHAN E. COOK
RAKESH AGRAWAL
JOACHIM HERBST
GIANLUIGI GRECO
WIL VAN DER AALST
ANA KARLA ALVES DE MEDEIROS
VAN DONGEN
CONCLUSION
CHAPTER 3: SEQUENCE CLUSTERING
THE MICROSOFT SEQUENCE CLUSTERING ALGORITHM
USING SQL SERVER 2005 ANALYSIS SERVICES
CHAPTER 4: DATA COLLECTION
A BANKING DATABASE
SIMULATION OF BANKING OPERATIONS
CHECKING ACCOUNT CREATION
SAVINGS ACCOUNT CREATION
LOAN CREATION
LOAN PAYMENT
THE SIMULATOR APPLICATION
THE COLLECTED EVENT LOG
CHAPTER 5: LOG PREPROCESSING
FILTERING THE LOG
STORING QUERIES
DISCOVERING THE SEQUENCES
REMOVING LONG CONNECTIONS
BUILDING THE CASE AND NESTED TABLES
CHAPTER 6: RESULTS
CHAPTER 7: CONCLUSION AND FUTURE WORK
HIERARCHICAL SEQUENCE CLUSTERING
FINAL REMARKS
REFERENCES
LIST OF TABLES
Table 1 - Workflow log (Aalst 2004)
Table 2 - Causal matrix (Medeiros 2006)
Table 3 - Process log (Dongen 2004)
Table 4 - State transition matrix for states {rainy, nice, snow} (Grinstead 1997)
Table 5 - Case table example
Table 6 - Nested table example
Table 7 - Wizard steps and respective code for analysis services configuration (P. Ferreira 2007, adapted)
Table 8 - Checking account creation sequence: queries and meaning
Table 9 - Checking account creation sequence (version 2) queries
Table 10 - Checking account creation sequence (version 3) queries
Table 11 - Checking account creation sequence (version 4) queries
Table 12 - Savings account creation sequence: queries and meaning
Table 13 - Savings account creation sequence (version 2) queries
Table 14 - Loan creation sequence: queries and meaning
Table 15 - Loan payment sequence: queries and meaning
Table 16 - Loan payment sequence (version 2): queries and meaning
LIST OF FIGURES
Figure 1 - Workflow lifecycle (Aalst 2003)
Figure 2 - A Framework for Process Discovery (Cook 1995)
Figure 3 - First algorithm in action with log {ABCDE, ACDBE, ACBDE} (Agrawal 1998)
Figure 4 - Second algorithm in action with log {ABCF, ACDF, ADEF, AECF} (Agrawal 1998)
Figure 5 - The most specific HMM that covers the log L = {abcac, abcc, abcabcc, abcabcac} (Herbst 1998)
Figure 6 - The most general HMM that covers the log L (Herbst 1998)
Figure 7 - Hierarchy tree and details for its leaves (Greco 2005)
Figure 8 - Instance graph for case 1 (Dongen 2004)
Figure 9 - Markov chain (Grinstead 1997)
Figure 10 - Viewer and ZappingSequence tables
Figure 11 - Data Mining Object Hierarchy (Tang 2005, adapted)
Figure 12 - Banking database diagram
Figure 13 - Customer table example
Figure 14 - Account table example
Figure 15 - Depositor table example
Figure 16 - Simulator application interface
Figure 17 - Profiler event log
Figure 18 - Tables Query and Keys
Figure 19 - Query table for INSERT Customer example
Figure 20 - Keys table for INSERT Customer example
Figure 21 - Query table for Open Savings Account sequence example
Figure 22 - Keys table for the Open Savings Account sequence example
Figure 23 - Table Follows
Figure 24 - Query 2 follows query 1 example
Figure 25 - Follows table for the open savings account example
Figure 26 - Graph view of two banking sequences. Links have the names of the parameters whose values are the same in the queries they connect. (Ferreira 2007)
Figure 27 - Links between events that belong to different sequences. (Ferreira 2007)
Figure 28 - Tables Sequence and Action
Figure 29 - Sequence table example
Figure 30 - Action table example
Figure 31 - Sequence models for clusters 1, 2, 3 and 4
Figure 32 - Cluster 5's sequence model
Figure 33 - Sequence models for clusters 6, 7, 8 and 9
Figure 34 - Cluster diagram
Figure 35 - Sequence table for hierarchical clustering example
Figure 36 - Action table for hierarchical clustering example
Figure 37 - Second level clustering: cluster 1
Figure 38 - Second level clustering: cluster 2
Figure 39 - Second level clustering: cluster 3
List of Abbreviations
FSM: Finite State Machines
HMM: Hidden Markov Model
API: Application Programming Interface
AMO: Analysis Management Objects
ADOMD.NET: ActiveX Data Objects Multidimensional for .NET
ADO.NET: ActiveX Data Objects for .NET
DMX: Data Mining Extensions
SQL: Structured Query Language
CHAPTER 1: INTRODUCTION
Current transactional information systems (e.g. Enterprise Resource Planning, Supply Chain Management, Customer Relationship Management and Workflow Management systems) are capable of recording in great detail the interactions users have with them, storing enormous amounts of records in what are called logs. The implicit information these logs contain, and its potential use, has created the need for tools that are able to mine this information and make it explicit with the help of some formal representation. Process mining is
one such tool. This technique takes event logs from a business process as its input and tries to
generate a model for the business process that generated the logs. A business process is a set of
coordinated tasks and activities, conducted by both people and equipment, that will lead to
accomplishing a specific organizational goal (SearchCIO 2007).
The generated process models can be used to formalize a process which had never been
designed, to make a delta analysis of the process (i.e. evaluate the differences between the
designed process and the real process) or to improve an already known model. Contrary to the
traditional approach of first designing the process and then applying it process mining tries to first
capture the real process from real execution and then formalize it (Figure 1).
Figure 1 - Workflow lifecycle (Aalst 2003)
However, there are some obstacles to generating the process models. Some processes are intrinsically complex and therefore very difficult to mine. Also, incorrect log records (noise) can cause the mined model to contain errors or be inaccurate. Moreover, some processes contain activities which are not executed on computers and therefore cannot be automatically logged; we will abstract from this issue here.
In chapter 2 we present a survey of the most significant process mining approaches, including
proposed algorithms and their potential use as well as their applicability. Most authors assume
that an event log only contains events from a single business process, i.e., that each event has
an associated case or process instance. This comes as a major disadvantage since (1) the
classes of information systems that are able to generate such logs are restricted to process-aware systems, and (2) it becomes impossible to apply and benefit from process mining in
scenarios where the log data is not available in that form (Ferreira 2007). In this report we present
an approach that makes no such assumption.
Our goal was to test an approach to process mining that, from an event log containing event
records originating from different tasks or processes, could identify sequences of related events
in a completely automatic way. We also propose to apply a sequence clustering algorithm to
these sequences. The sequence clustering algorithm we used groups similar sequences in
clusters and constructs a model, in the form of a Markov chain, for each cluster. No input
information about the business logic of these events is provided or necessary. As a result, the
models found do not constitute perfect models of the behavior that generated the events, but
illustrate approximately what types of sequences were executed and their meaning. The
sequence clustering algorithm is discussed in chapter 3.
In order to generate event logs, we developed an application that executed sequences of actions
over a small database and recorded these actions in logs. By looking at these logs alone it is
impossible to know exactly what sequences originated them since they are just a long list of
events. However, as we know the types of sequences that originated the logs, we have a way of
assessing the quality of our approach by comparing the models found to the original sequences
we programmed, and therefore we were able to make adjustments accordingly. The simulator
application is described in chapter 4.
After capturing the event log it is necessary to determine what sequences are in it. In chapter 5
we present our approach to this problem. We analyze all events and establish connections
between them according to a specific criterion. The resulting graphs, where nodes are events and edges are connections between events, constitute sequences of actions when their nodes are sorted chronologically.
In chapter 6 we discuss and show the results obtained. General conclusions and future work are
presented in chapter 7.
CHAPTER 2: STATE OF THE ART
Process Mining Approaches
In this section we present a summary of various authors’ approaches to process mining. All of the
proposed algorithms use an event log as input and starting point for the discovery of underlying
processes. An event log (sometimes called process trace or audit trail) is list of records resulting
from the execution of some process where, in order to be “minable”, each record has to contain
(at least) the activity being executed, the process instance identifier and the time of execution. If
the activities are totally ordered the execution times of the activities are dispensable.
In the following section we use the terms process and workflow interchangeably. We also do not
differentiate between task and activity although most business process views consider that these
two terms are on different levels of granularity. Furthermore, we consider process mining,
process discovery and workflow mining to represent the same technique and, more importantly,
we consider this technique to be control-flow (or behavioral) process mining and not, for example
organizational process mining which is concerned with modeling social interactions. Thao Linh Ly
(Ly 2006), for example, used organizational models in conjunction with process mining in the
discovery of rules for assigning work items to agents.
Jonathan E. Cook
Cook (Cook 1995) defines process discovery as a method for deriving a formal process model from basic data collected on the process. His approach, however, aims at retrieving only the behavioral aspects of the process, ignoring others such as the relationships between artifacts
produced or relationships between the agents. The goal is not to discover a complete model of
the process but an approximation of the behavioral patterns in it. These patterns are modeled as
finite state machines (FSM).
Figure 2 - A Framework for Process Discovery (Cook 1995)
In order to achieve this goal, a trace of process execution is collected and analyzed (Figure 2).
This trace is in the form of an event stream, in other words, a sequence of atomic actions. Three
methods for the analysis of this event stream are proposed and compared. The first, RNet, is a
purely statistical approach to inference and uses neural networks. The main disadvantage
detected is that, given a known perfect sample, the network is unable to produce an FSM just for
that stream, i.e. the FSM also models behavior not present in the stream. An advantage of this
method is that, being statistical, it is robust to noise. The second method, KTail, is purely
algorithmic. KTail defines the current state from the future states that can occur from it. One
advantage of this method is that it can be parameterized by the value k, which indicates how far
in the “future” (relative to its current state) it looks in the sample stream. A disadvantage is that it
is not robust to noise. The third method, using Markov chains, was developed by Cook and uses
event-sequence probability tables to build an FSM that only accepts sequences of events whose
probabilities exceed some threshold. This threshold is a parameter of the method and the
method’s level of robustness to noise is controlled by it. The Markov method is a mix between an algorithmic and a statistical technique.
Rakesh Agrawal
Agrawal’s (Agrawal 1998) main goal is to construct a complete process model from logs of
executions of that process. As in (Cook 1995) the focus is on mining the behavioral - control-flow
- aspects of the process. Agrawal models processes as graphs where the vertices represent
activities and the edges represent control conditions. This is different from the FSM that Cook
used where the activities where transitions between states. A consequence of this difference is
that in Agrawal’s models each activity appears only once while in Cook’s FSM they could appear
multiple times. Agrawal associates each edge in the graph with a Boolean function (whose
argument is the state of the process) which determines if the edge is followed, however, in
(Agrawal 1998) the author is not concerned with finding these functions, only the graph itself. The
author’s notion of dependency is transitive: if B depends on A and C depends on B then C
depends on A.
Agrawal states that for a process model graph G to be conformal with a log L it has to fulfill three
requirements. First, each dependency in L has to be represented in G. Second, there can be no
dependencies in G between activities that are independent in L. Finally, it must be possible to
follow in G every execution that is in L.
Agrawal proposes two algorithms for finding a conformal directed acyclic graph given a log from an acyclic process. The first one assumes that each execution contains each activity exactly once and produces a minimal conformal graph. The algorithm starts by building a graph with a vertex for each activity and adds an edge for each pair of activities where one follows the other in some execution. It then removes edges that appear in both directions and finally computes the transitive reduction of the resulting graph, so that no redundant edges remain.
Figure 3 - First algorithm in action with log {ABCDE, ACDBE, ACBDE} (Agrawal 1998)
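To make this construction concrete, the following C# sketch applies the first two steps to the log of Figure 3: it builds an edge for every pair of consecutive activities in some execution (a simplification) and then discards the edges that appear in both directions. The final step, which reduces the result to a minimal conformal graph, is omitted, and the string encoding of edges is only an illustrative choice; this is a reconstruction of the idea, not Agrawal's original code.

using System;
using System.Collections.Generic;

class DependencyGraphSketch
{
    static void Main()
    {
        // Each execution contains every activity exactly once, as the first algorithm assumes.
        string[] log = { "ABCDE", "ACDBE", "ACBDE" };

        var edges = new HashSet<string>();
        foreach (string execution in log)
            for (int i = 0; i + 1 < execution.Length; i++)
                edges.Add(execution[i] + ">" + execution[i + 1]);

        // Discard edges that appear in both directions: such activities are independent.
        var kept = new List<string>();
        foreach (string e in edges)
        {
            string reversed = e[2] + ">" + e[0];
            if (!edges.Contains(reversed)) kept.Add(e);
        }

        // For this particular log the remaining edges are A>B, A>C, B>E, C>D and D>E,
        // which already contain no transitively implied dependencies.
        Console.WriteLine(string.Join(", ", kept.ToArray()));
    }
}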
The second one does not make this assumption and produces a conformal graph which may not be minimal. The differences from the first algorithm are that this one computes the strongly connected components of the resulting graph, removing the edges between the vertices of each component, and that it identifies and marks, for each execution, the minimal set of edges necessary to keep the graph consistent with that execution, removing all unmarked edges at the end.
Figure 4 - Second algorithm in action with log {ABCF, ACDF, ADEF, AECF} (Agrawal 1998)
The algorithms are able to mine processes with cycles by re-labeling repeated tasks and merging
them after the graph is produced. Noise is dealt with by associating a counter with each edge and
removing edges with a count below a certain threshold.
Agrawal states that the algorithm was able to find the underlying processes with both synthesized
and real data.
Joachim Herbst
Herbst (Herbst 2000, 1998) developed three machine learning algorithms able to induce process models from execution traces of instances of the same process. The first two are used to induce sequential process models and the third one is used to induce concurrent workflow models. The
focus is always on the structure of the transitions.
The first algorithm (Herbst 1998) is called Bayesian model merging and models processes as
hidden Markov models (HMM). It first creates the most specific HMM that is consistent with the
log of execution traces (Figure 5) and progressively generalizes it by merging states. This
algorithm stops when it has maximized a heuristic function that expresses how well the model
reflects the log. This function is based on Bayesian posterior probability P[model = m | log = L].
This algorithm has the disadvantage of making bad merge decisions when the underlying model
has loops.
Figure 5 - The most specific HMM that covers the log L = {abcac, abcc, abcabcc, abcabcac}
(Herbst 1998)
The second algorithm (Herbst 1998), model splitting, was created as an answer to the problem
explained above and works exactly in the opposite direction of the first one. It starts with the most
generic HMM that is consistent with the log (Figure 6) and tries to specialize it progressively, by splitting one state in two, until a heuristic log-likelihood function is maximized. The author states that this algorithm is faster and his explanation for this is that appropriate models of business
processes are much closer to the most general model than to the most specific one.
Figure 6 - The most general HMM that covers the log L (Herbst 1998)
Finally, the third algorithm (Herbst 2000) is very similar to the second one. The main difference is
that in a concurrent process model a split operation may have unpredictable global effects. To
avoid this problem the split operation of this algorithm is done at the instance level.
As HMMs are probabilistic models, Herbst’s approach is robust to noise.
Herbst created a prototype that implemented these algorithms and compared its results with
Cook’s and Agrawal’s. The prototype was able to discover the same models that Agrawal
discovered but it had problems with Cook’s.
Gianluigi Greco
The goal of Greco’s (Greco 2005) approach is to build a hierarchical process model with various
views of the same process model with different levels of detail (Figure 7). The first step of this
approach (which works in top-down fashion) is to recursively apply clustering to the execution traces of some process until some limit number of clusters is reached, resulting in a hierarchy of clusters of process traces. Then, a process model is built for each cluster, using some unspecified process mining algorithm. The second step of this approach (which works in bottom-up fashion) transforms each non-leaf model into an abstraction of its children. Common activities of
children nodes are preserved while specific activities are merged and substituted by special
activities.
Figure 7 - Hierarchy tree and details for its leaves (Greco 2005)
In (Greco 2004) the author proposes an algorithm that, besides discovering the control-flow, also discovers global constraints of a process. A global constraint is a rule of the type: if activity A has
been completed activity B cannot be activated. Greco states that control-flow models can express
local constraints (using the dependency relation between activities) but lack the power to express
global constraints. The solution the author proposes starts by building a control-flow model for
the process. Then, it discovers frequent patterns which are unexpected considering the local
constraints, and thus indicate the presence of global constraints. Finally, it creates a different
model for each one of these patterns. Although the global constraints are not explicitly expressed,
together the models implicitly reflect the existence of those constraints. For the simple example above there would be two models, one where it is possible to reach A but not B and another where
it is possible to reach B but not A.
Greco does not address the issue of noise in (Greco 2004, 2005), but since processes are
modeled as graphs some modifications would have to be made.
Wil van der Aalst
Van der Aalst (Aalst 2004) proposes an algorithm, called the α-algorithm, that can extract a process model from an event log and represent it as a Petri net. Van der Aalst also describes the class of
processes this algorithm is able to mine correctly.
The author assumes that each event in the log represents a task, that the event’s process
instance identifier is known and that the events are totally ordered (Table 1).
Table 1 - Workflow log (Aalst 2004)
The author defines the following relations between tasks based on an event log. Task B follows
task A in some case in the log if B is immediately after A in that case. Task A is the cause for task
B if and only if B follows A but A never follows B in the log. Tasks A and B are parallel if and only
if A follows B and B follows A in the log. Finally, A and B are not related if and only if A does not
follow B and B does not follow A in the log. Van der Aalst assumes that the log is complete, i.e. all
possible follows relations (w. r. t. the underlying process model) are present in the log. The author
also assumes that the log contains no noise.
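As an illustration of these four relations, the C# sketch below derives them from a small made-up log with two cases. The log contents and the string encoding of the follows pairs are illustrative choices of ours; they are not taken from (Aalst 2004).

using System;
using System.Collections.Generic;

class OrderingRelations
{
    static HashSet<string> follows = new HashSet<string>();

    static bool Follows(string a, string b) { return follows.Contains(a + ">" + b); }   // b immediately after a in some case
    static bool Causes(string a, string b) { return Follows(a, b) && !Follows(b, a); }
    static bool Parallel(string a, string b) { return Follows(a, b) && Follows(b, a); }
    static bool Unrelated(string a, string b) { return !Follows(a, b) && !Follows(b, a); }

    static void Main()
    {
        // Illustrative log: two cases of the same process, each a list of tasks in execution order.
        string[][] log = { new[] { "A", "B", "C", "D" }, new[] { "A", "C", "B", "D" } };

        foreach (string[] c in log)
            for (int i = 0; i + 1 < c.Length; i++)
                follows.Add(c[i] + ">" + c[i + 1]);

        Console.WriteLine("A causes B:       " + Causes("A", "B"));     // True
        Console.WriteLine("B parallel to C:  " + Parallel("B", "C"));   // True
        Console.WriteLine("A unrelated to D: " + Unrelated("A", "D"));  // True: they never directly follow each other
    }
}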
From the relations above Van der Aalst developed the α-algorithm. The author also states the types of processes which the algorithm cannot discover. The α-algorithm cannot mine processes with short loops or duplicate tasks, for example. Besides this, it is also not robust to noise as the
relations between tasks do not have an associated frequency.
The algorithm was applied to two real cases, a health care process and a fine processing
process. The algorithm was only able to successfully mine the second one.
Ana Karla Alves de Medeiros
Ana de Medeiros (Medeiros 2006) proposes using genetic algorithms in process mining. Genetic
algorithms try to mimic the way evolution works in order to find the best solutions to some
problem. More specifically, there is a population where each individual is a potential solution for
the problem. In each iteration (called generation) a fitness function is used to determine how good
a solution each individual represents. After this evaluation the fittest individuals are copied to the
next generation unchanged. The rest of the population suffers mutation (i.e. random change of an
individual’s components) or crossover (i.e. recombination of two individuals’ components). The
algorithm stops when some stop condition (like a limit number of generations) is reached. In the
author’s approach individuals are process models and the fitness function evaluates how
consistent the model is w.r.t. the log. The chosen model is the fittest individual of the last
generation.
The internal representation chosen for an individual is a causal matrix (Table 2). The causal
matrix shows the input and output condition functions of each activity so that it is possible to
understand the dependency relations between them. The fitness function evaluates individuals
according to their completeness (how many traces in the log the individual can parse) and
preciseness (not being able to parse traces which are not present in the log). Completeness,
however, is considered more important than preciseness. The goal is to find individuals which
model all the behavior present in the log while not modeling any extra behavior (i.e. not present in
the log). The crossover operator allows the following changes to happen to an individual:
lose/gain activities from/to subsets in its input/output condition functions; exchange causality
relations with other individuals; gain causality relations which can be inferred from the population
but are not present in any individual; lose causality relations; and decrease/increase the number
of subsets in the input/output condition functions. Finally, the mutation operator randomly
adds/removes an activity to/from a subset of the input/output condition functions; or randomly
redistributes the activities in these subsets.
Table 2 - Causal matrix (Medeiros 2006)
Noise is dealt with by pruning arcs of the mined model which are used less than a specific
threshold.
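The general loop just described can be sketched as follows in C#. The individuals here are plain bit strings with a toy fitness function, so that the loop structure (evaluate, keep the fittest, fill the rest of the population with crossover and mutation) stands out; in (Medeiros 2006) the individuals are causal matrices and the fitness combines completeness and preciseness, so this is only an illustration of the generic technique.

using System;
using System.Collections.Generic;
using System.Linq;

class GeneticAlgorithmSketch
{
    const int PopulationSize = 20, Length = 16, Generations = 50;
    static Random rnd = new Random(0);

    // Toy fitness: the number of bits set to true.
    static int Fitness(bool[] individual) { return individual.Count(bit => bit); }

    static void Main()
    {
        // Initial population of random individuals.
        var population = new List<bool[]>();
        for (int i = 0; i < PopulationSize; i++)
            population.Add(Enumerable.Range(0, Length).Select(b => rnd.Next(2) == 1).ToArray());

        for (int g = 0; g < Generations; g++)
        {
            // Evaluate and sort by fitness; the fittest half is copied unchanged to the next generation.
            population = population.OrderByDescending(Fitness).ToList();
            var next = population.Take(PopulationSize / 2).ToList();

            // The rest of the next generation is produced by crossover and mutation.
            while (next.Count < PopulationSize)
            {
                bool[] a = next[rnd.Next(next.Count)], b = next[rnd.Next(next.Count)];
                int cut = rnd.Next(Length);
                bool[] child = a.Take(cut).Concat(b.Skip(cut)).ToArray();   // one-point crossover
                int m = rnd.Next(Length);
                child[m] = !child[m];                                       // mutation: flip one random bit
                next.Add(child);
            }
            population = next;
        }

        Console.WriteLine("Best fitness found: " + population.Max(Fitness));
    }
}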
Van Dongen
Van Dongen (Dongen 2004, 2005) proposes creating graph models for each of the process
instances present in the log and then combining all of them in order to produce a general model
for the process.
In (Dongen 2004) the author proposes an algorithm that builds a process instance graph from an
event log. These graphs are called instance nets, which are basically process models where there
are no decisions to be made. The first step is finding the causality relations of the process. The
causality relation is inferred the same way as in (Aalst 2005): task A is the cause for task B if and
only if B follows A but A never follows B in the log. Note that this relation is still inferred at the
log level, not at the instance level. After this step, the instance ordering of the instance net is
defined the following way: B follows A on the instance net if and only if A causes B (this was
determined in the first step of the algorithm) and there is no C such that A causes C and C
causes B. Table 3 shows an event log and Figure 8 shows the instance net (modeled as a graph)
for case 1. Note that the parallelism of A and B was determined from the whole log in the first
step.
Table 3 - Process log (Dongen 2004)
Figure 8 - Instance graph for case 1 (Dongen 2004)
Van Dongen (Dongen 2005) also proposes an algorithm for building a general model for the
process from the instance models created with the algorithm explained above. This is also done
in two steps. First, a projection graph is built from each instance graph. A projection graph of an
instance graph is a graph where each task that was executed in the instance graph is
represented by a single node (repeated tasks disappear). Also, tasks which were connected in
the instance graph are still connected in this graph. Finally, start and final nodes are added. The
second and final step is the construction of the aggregation graph, which models the whole
process. This is done straightforwardly by adding the projection graphs.
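The following C# sketch illustrates, on made-up instance data, the projection-graph construction just described: repeated executions of a task collapse into a single node, edges are kept between the corresponding task labels, and artificial start and final nodes are connected to the tasks without predecessors or successors. The data and representation are illustrative assumptions, not taken from (Dongen 2005).

using System;
using System.Collections.Generic;

class ProjectionGraphSketch
{
    static void Main()
    {
        // Edges of one instance graph; each row is (from, to) between task executions.
        // Task C is executed twice in this instance ("C1" and "C2").
        string[,] instanceEdges = { { "A1", "B1" }, { "B1", "C1" }, { "C1", "C2" }, { "C2", "D1" } };

        var nodes = new HashSet<string>();
        var edges = new HashSet<string>();
        var hasPredecessor = new HashSet<string>();
        var hasSuccessor = new HashSet<string>();

        for (int i = 0; i < instanceEdges.GetLength(0); i++)
        {
            string from = Label(instanceEdges[i, 0]), to = Label(instanceEdges[i, 1]);
            nodes.Add(from);
            nodes.Add(to);
            if (from != to)                      // repeated executions of a task collapse into one node
            {
                edges.Add(from + "->" + to);
                hasSuccessor.Add(from);
                hasPredecessor.Add(to);
            }
        }

        // Add artificial start and final nodes.
        foreach (string n in nodes)
        {
            if (!hasPredecessor.Contains(n)) edges.Add("start->" + n);
            if (!hasSuccessor.Contains(n)) edges.Add(n + "->end");
        }

        foreach (string e in edges) Console.WriteLine(e);
    }

    // Strips the occurrence number, keeping only the task label (e.g. "C2" -> "C").
    static string Label(string taskExecution)
    {
        return taskExecution.Substring(0, taskExecution.Length - 1);
    }
}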
Although Dongen’s approach produces a graph with an associated counter, the author does not specify how to address noise.
Conclusion
All of these approaches have in common the fact that they use an event log produced by the
process they want to mine in order to mine it. However, some of them make special assumptions.
Van der Aalst’s approach for example assumes that the event log is complete (i.e. all possible
follows relations are present in the log) while Agrawal assumes that the processes do not have
cycles (although the author’s algorithms can mine cyclic processes by re-labeling the repeated
tasks). How the algorithms deal with noise is another differentiating factor. Cook’s RNet and
Markov based algorithms and Herbst’s algorithms are all probabilistic and consequently robust to
noise. Agrawal adds a counter to each edge of the graph models and removes edges with a low
count in order to resist noise. Ana de Medeiros also removes arcs which are used less than a
certain threshold. Greco and van Dongen do not specify how to deal with noise and van der Aalst
assumes the logs do not have it.
Since most enterprise systems are not process-aware, it is a bold assumption to think that all event logs will contain only events pertaining to a single process. Thus, if an approach were developed that did not need to make this assumption, a large number of real system logs would become available for process mining. Moreover, we also think robustness to noise is a
fundamental asset to any process mining technique. With these two factors in mind, we present
the sequence clustering algorithm and our approach to process mining in the next chapters.
CHAPTER 3: SEQUENCE CLUSTERING
Sequence clustering is a data mining technique that takes a number of sequences and groups
them in clusters so that each cluster contains similar sequences. A sequence is a series of
discrete states (Tang 2005); one example is a genomic sequence, where the states are the bases adenine, guanine, cytosine and thymine. The applicability of this technique to genomic
sequences explains its widespread use in the bioinformatics field. Sequence clustering is also a
probabilistic technique, since in the sequence models produced the transitions between
states have associated probabilities. This makes these models robust to noise, since a transition
with a low probability indicates a sequence path that is rarely followed, thus probably an
erroneous one. Furthermore, the fact that this algorithm takes sequences as input seems to make
it a natural candidate to mine sequences of actions, which in turn can be obtained from event
logs.
Microsoft SQL Server 2005 includes a tool for sequence clustering that makes use of the
Microsoft Sequence Clustering algorithm and which we decided to use in our approach.
The Microsoft Sequence Clustering algorithm
In the Microsoft Sequence Clustering algorithm, each case is assigned to each cluster with some
probability. Each cluster has an associated Markov chain. A Markov chain is described as a set of
states and the transition probabilities between them. If the chain is currently in state s_i, then it moves to state s_j at the next step with a probability denoted by p_ij, and this probability does not
depend upon which states the chain was in before the current state (Grinstead 1997). Figure 9
shows a Markov chain.
Figure 9 - Markov chain (Grinstead 1997)
A Markov chain can also be represented by a state transition matrix such as the one shown below (Table 4).
Table 4 - State transition matrix for states {rainy, nice, snow} (Grinstead 1997)
The definition presented above is actually the definition of a first-order Markov chain. In an nth-order chain, the probability of the next state depends on the previous n states.
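As a small worked example of this definition, the following C# sketch computes the probability of one particular weather sequence under a first-order Markov chain over the states {rainy, nice, snow}; the transition probabilities used here are illustrative values and are not necessarily those of Table 4.

using System;

class MarkovChainExample
{
    static void Main()
    {
        string[] states = { "rainy", "nice", "snow" };

        // p[i, j] = probability of moving from states[i] to states[j] at the next step.
        // Each row sums to 1.
        double[,] p =
        {
            { 0.50, 0.25, 0.25 },
            { 0.50, 0.00, 0.50 },
            { 0.25, 0.25, 0.50 }
        };

        // Probability of the path rainy -> nice -> snow -> snow, given that we start in "rainy".
        int[] path = { 0, 1, 2, 2 };
        double probability = 1.0;
        for (int t = 0; t + 1 < path.Length; t++)
            probability *= p[path[t], path[t + 1]];

        Console.WriteLine("P(rainy -> nice -> snow -> snow | start = rainy) = " + probability);   // 0.0625
    }
}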
The Microsoft Sequence Clustering algorithm involves the following steps (Tang 2005):
1. Initialize the model parameters at random, i.e. initialize the state transition probabilities of
the Markov chains from each cluster at random;
2. Assign each case to each cluster with some probability;
3. Recalculate the model parameters taking into consideration the probabilities of each case
belonging to each cluster, i.e. recalculate the state transition probabilities of each Markov
chain of each cluster considering the probabilities of each case belonging to that cluster.
4. Verify if the model has converged; if it has not, repeat from step 2. A simplified sketch of this loop is given below.
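The following C# sketch illustrates the four steps above on a toy data set, fitting a mixture of first-order Markov chains with an expectation-maximization style loop. It is a simplified illustration of the idea, not the actual SQL Server implementation: initial-state probabilities and cluster priors are ignored, a small pseudo-count is used to avoid zero probabilities, and the loop simply runs for a fixed number of iterations instead of testing convergence.

using System;
using System.Collections.Generic;
using System.Linq;

class SequenceClusteringSketch
{
    static void Main()
    {
        // Toy input: four sequences over three states (0, 1 and 2).
        var sequences = new List<int[]>
        {
            new[] { 0, 1, 2 }, new[] { 0, 1, 2 }, new[] { 2, 1, 0 }, new[] { 2, 1, 0 }
        };
        int numStates = 3, numClusters = 2;
        var rnd = new Random(0);

        // Step 1: initialize the transition matrix of each cluster's Markov chain at random.
        double[][,] trans = new double[numClusters][,];
        for (int c = 0; c < numClusters; c++)
        {
            trans[c] = new double[numStates, numStates];
            for (int i = 0; i < numStates; i++)
            {
                double rowSum = 0;
                for (int j = 0; j < numStates; j++) { trans[c][i, j] = rnd.NextDouble(); rowSum += trans[c][i, j]; }
                for (int j = 0; j < numStates; j++) trans[c][i, j] /= rowSum;
            }
        }

        for (int iteration = 0; iteration < 50; iteration++)
        {
            // Step 2: assign each sequence to each cluster with some probability,
            // proportional to its likelihood under the cluster's Markov chain.
            double[][] membership = sequences.Select(seq =>
            {
                double[] like = new double[numClusters];
                for (int c = 0; c < numClusters; c++)
                {
                    like[c] = 1.0;
                    for (int t = 0; t + 1 < seq.Length; t++) like[c] *= trans[c][seq[t], seq[t + 1]];
                }
                double total = like.Sum();
                return like.Select(l => l / total).ToArray();
            }).ToArray();

            // Step 3: re-estimate each chain's transition probabilities from the transition
            // counts, weighted by the probability of each sequence belonging to that cluster.
            for (int c = 0; c < numClusters; c++)
            {
                double[,] counts = new double[numStates, numStates];
                for (int i = 0; i < numStates; i++)
                    for (int j = 0; j < numStates; j++) counts[i, j] = 0.01;   // pseudo-count to avoid zeros

                for (int s = 0; s < sequences.Count; s++)
                    for (int t = 0; t + 1 < sequences[s].Length; t++)
                        counts[sequences[s][t], sequences[s][t + 1]] += membership[s][c];

                for (int i = 0; i < numStates; i++)
                {
                    double rowSum = 0;
                    for (int j = 0; j < numStates; j++) rowSum += counts[i, j];
                    for (int j = 0; j < numStates; j++) trans[c][i, j] = counts[i, j] / rowSum;
                }
            }
            // Step 4: a full implementation would test for convergence here and stop early.
        }

        for (int c = 0; c < numClusters; c++)
            Console.WriteLine("Cluster " + c + ": P(0 -> 1) = " + trans[c][0, 1] + ", P(1 -> 2) = " + trans[c][1, 2]);
    }
}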
In order to work, the algorithm needs at least two tables: a table with the non-sequence attributes (the case table) and a nested table of the first with the sequence key. There can be only one sequence nested table in each model. An example is shown in Figure 10, where the goal is to model the behavior of a TV viewer zapping between channels. This is similar to the example presented in (Tang 2005). The Viewer table contains general (non-sequence) data about the viewer. Table ZappingSequence is the sequence nested table and it contains three attributes: ViewerID, which is the foreign key to the main table; ChannelType, which is the state of the sequence; and SequenceID, which records the position of a state within the sequence.
Figure 10 - Viewer and ZappingSequence tables
For example, if viewer number 1 zapped between channels CNN, Eurosport, MTV, BBC and
SporTV in this order and viewer 2 zapped between SkyNews, VH1 and ESPN in this order the
tables would look like this:
Table 5 - Case table example

Viewer
ViewerID | Address      | TelephoneNumber | Email
1        | Happy Street | 123456789       | [email protected]
2        | Sunny Street | 987654321       | [email protected]

Table 6 - Nested table example

ZappingSequence
ViewerID | ChannelType | SequenceID
1        | News        | 1
1        | Sports      | 2
1        | Music       | 3
1        | General     | 4
1        | Sports      | 5
2        | News        | 1
2        | Music       | 2
2        | Sports      | 3
The algorithm is tunable by the following parameters: number of clusters the model contains;
minimum number of cases in each cluster; maximum number of different states an attribute can
have; and the maximum number of different states the sequence attribute can have.
Using SQL Server 2005 Analysis Services
Before using the case and nested table we have to complete a series of steps that create the
necessary structures for the algorithm to function. These involve the creation of a database, a
datasource, a datasource view, a mining model and a mining structure (Figure 11). All these
steps are executed in the Analysis Services module of SQL Server and they are common to the
application of any data mining algorithm. In fact, the choice of the data mining algorithm to use is
only made when the mining model is created. In this section we explain all this in more detail.
Figure 11 - Data Mining Object Hierarchy (Tang 2005 adapted)
The first thing to do is to create a database that will store the datasource, datasource view and
mining structure. This is a very simple step where we only have to indicate a server and the name
of the database to create in that server.
The datasource is also a very simple object consisting of a connection string to the database just
created. The datasource view on the other hand contains the schema of the case and nested
table, which is provided to it in the form of a dataset object augmented with custom properties
such as table descriptions (Tang 2005). In our example this dataset contains the schema for
tables Viewer and ZappingSequence.
“The mining structure describes the domain of the problem in terms the data mining engine
understands.” (Tang 2005) The mining structure comprises several mining structure columns (which are connected to source columns from the datasource view), their data types and content
types. For example, in our case we would indicate that Viewer is the case table and
ZappingSequence is the nested table, that the ViewerID attribute connects the two tables, that
the SequenceID attribute indicates the order of the events within the sequence and also that the
ChannelType indicates the name of the event.
Finally, the mining model creation is where we indicate which data mining algorithm is going to be
used and configure its parameters. The parameters for the Sequence Clustering algorithm are the
number of clusters, minimum number of cases in each cluster, maximum number of states a
sequence may have and maximum number of different states an attribute may have.
The following table summarizes the analysis service configuration process for an example where
the case table is called “ActionsInteractions” and the nested table is called “Nested”. For each
step we show the Visual Studio wizard screen and the code that accomplishes the same goal.
The code interacts with several data mining APIs made available by SQL Server 2005. These
include: ADOMD.NET, which implements the DataAdapter and DataReader interfaces specifically for Analysis Services, making it more efficient than ADO.NET, and which contains the mining structure and mining model classes; and AMO, which also contains the mining structure and mining model classes but which, instead of using them for executing queries and data exploration like ADOMD.NET, uses them for data mining object management. The DMX language was also
used. This language is an extension of SQL used for creation and manipulation of data mining
objects. (P. Ferreira 2007)
Table 7 - Wizard steps and respective code for analysis services configuration (P. Ferreira 2007
adapted)
Datasource Creation
RelationalDataSource ds = new RelationalDataSource();
ds.ConnectionString = ...   // connection string to the source database
Datasource View Creation
DataSourceView dsv = new DataSourceView();
DataSet dsvDataSet = new DataSet();
// Load the schema of the case table into the dataset that backs the datasource view.
string queryString = "SELECT * FROM ActionsInteractions";
SqlDataAdapter da1 = new SqlDataAdapter(queryString, conString);
da1.FillSchema(dsvDataSet, SchemaType.Mapped);
...
dsv.Schema = dsvDataSet;
Mining Structure Creation
MiningStructure miningStructure = new MiningStructure();
miningStructure.Source = new DataSourceViewBinding(dsv.ID);
// "case" is a reserved word in C#, so the key column variable is named caseColumn here.
ScalarMiningStructureColumn caseColumn = new ScalarMiningStructureColumn();
caseColumn.IsKey = true;
miningStructure.Columns.Add(caseColumn);
...
Mining Model Creation
MiningModel miningModel = miningStructure.CreateMiningModel();
miningModel.Algorithm = "Microsoft_Sequence_Clustering";
Mining Model Processing
miningModel.AlgorithmParameters.Add("CLUSTER_COUNT", 10);
miningModel.AlgorithmParameters.Add("MINIMUM_SUPPORT", 3);
miningModel.AlgorithmParameters.Add("MAXIMUM_STATES", 30);
miningModel.AlgorithmParameters.Add("MAXIMUM_SEQUENCE_STATES", 30);
Visualizing the Results
foreach (MiningContentNode node in MiningModel.Content)
{
    foreach (MiningContentNode child in node.Children)
    {
        foreach (MiningDistribution md in child.Distribution)   // distribution of each child node
            Console.WriteLine(...);
    }
}
The process mining activity begins with an event log. Unfortunately, we cannot apply the
sequence clustering algorithm directly to the log. As we already explained, we need to have a
case table and a nested table. This way, several steps must be taken in order to preprocess the
log and be able to build these two tables from it. The next chapters explain how we generated an
event log using a simulator application, all the steps that had to be taken to construct the case
and nested tables and the result obtained after finally applying the sequence clustering technique.
CHAPTER 4: DATA COLLECTION
The first step of our approach involved the development of a banking activity scenario, consisting
of a fictitious database and a small application that interacted with it, which could produce specific
event sequences, such as: creation of a checking account, creation of a savings account,
creation of a loan and payment of a loan. Since the goal of our approach was to develop
sequence models for the sequences of actions that originated a certain log, the event logs
resulting from the monitoring of the banking event sequences constituted the test data for our
approach and, consequently, were used to verify its results.
In this chapter we describe the small banking database that was created, the program module that operates over it, the types of event sequences that are produced, and the event traces that result from the execution of this scenario.
A banking database
Since the goal was to apply the Microsoft Sequence Clustering algorithm to the discovered
sequences of events and since this algorithm is included in Microsoft SQL Server 2005 we chose,
for simplicity and ease of integration, to develop an SQL Server database for the simulation
scenario. This, in turn, enabled us to take advantage of the SQL Server Profiler tool which
monitors the SQL Server Database Engine and registers data about each event.
The database consists of the following tables: accounts, which can be savings accounts or
checking accounts and belong to a specific branch of the bank; loans (also associated with branches) and loan payments; customers, associated with loans through the borrower relationship and with accounts through the depositor relationship; and employees, associated with customers through the customer-banker relationship (Figure 12).
Figure 12 - Banking database diagram
Figure 13, Figure 14 and Figure 15 show how some tables look after the insertion of Jane, John
and Michael and their respective accounts in the database.
Figure 13 - Customer table example
Figure 14 - Account table example
Figure 15 - Depositor table example
Simulation of banking operations
The event log is generated by simulating several banking operations. This simulation is
accomplished by executing SQL queries over the banking database just presented. The following
banking operations may be executed:
1. Checking account creation
2. Savings account creation
3. Loan creation
4. Loan payment
Operations 1, 2 and 4 may be executed in more than one way. For example, the checking
account creation operation has four variations. In order for these operations to be executed the
database must already be populated with some data like branches and employees.
We now describe each one of these types of operations in more detail. The description does not include auxiliary steps that are executed for simulation purposes only; for example, when a loan is created, the borrower associated with it is a randomly chosen customer from the list of all existing customers. However, this step involves a database query execution and will therefore leave a trace on the log. This issue will be discussed in the next chapter, where we explain how the log is preprocessed.
Checking Account Creation
This sequence involves the following conceptual steps:
1. Creating a new customer;
2. Creating a new checking account;
3. Associating the new customer with the new account;
4. Associating an existing bank employee with the new customer.
Table 8 shows the specific queries the application executes on the database (which are also captured in the SQL Profiler log) and their meaning.
Table 8 - Checking account creation sequence: queries and meaning

SQL Query: INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
Meaning:   Creation of customer number 85045, whose name is John and who lives on Happy Street at Sunny City.

SQL Query: INSERT INTO Account VALUES (34220,705,'Campo Grande')
Meaning:   Creation of account number 34220, with a balance of 705 and belonging to the Campo Grande branch.

SQL Query: INSERT INTO Checking_Account VALUES (34220, 207)
Meaning:   Registration of the account as a checking account. Overdraft amount of 207.

SQL Query: INSERT INTO Depositor VALUES (85045,34220)
Meaning:   Creation of the customer-account relation.

SQL Query: INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
Meaning:   Association of the new customer with employee number 6. The last attribute is the type of the relation and is not relevant.
Note that the table above shows no queries creating employee number 6 or the 'Campo Grande' branch. This is because both branches and employees are assumed to exist in the database prior to the execution of the application; the database was therefore populated with a number of employees and branches when it was created.
The customer number, name and city are randomly generated each time a sequence is executed,
as are the account number, balance and overdraft amount. The branch and employee number
are randomly chosen from the ones present in the database at the time. Although this last step
interacts with the database it is an auxiliary step only for simulation purposes and consequently is
not shown in the sequence.
In order to introduce some degree of variability in the simulation three additional versions of this
sequence were developed. These variations simply change the order in which the queries are
executed (Table 9, Table 10 and Table 11).
Table 9 - Checking account creation sequence (version 2) queries
SQL Query
INSERT INTO Account VALUES (34220,705,'Campo Grande')
INSERT INTO Checking_Account VALUES (34220, 207)
INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
INSERT INTO Depositor VALUES (85045,34220)
Table 10 - Checking account creation sequence (version 3) queries
SQL Query
INSERT INTO Account VALUES (34220,705,'Campo Grande')
INSERT INTO Checking_Account VALUES (34220, 207)
INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
INSERT INTO Depositor VALUES (85045,34220)
INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
Table 11 - Checking account creation sequence (version 4) queries
SQL Query
INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
INSERT INTO Account VALUES (34220,705,'Campo Grande')
INSERT INTO Checking_Account VALUES (34220, 207)
INSERT INTO Depositor VALUES (85045,34220)
Savings Account Creation
This sequence involves the following conceptual steps:
1. Choosing a checking account belonging to the customer;
2. Creating a new savings account;
3. Associating the customer with the new account;
4. Transferring funds from the checking account to the new savings account.
Table 12 shows the specific sequence of queries that are executed on the database.
Table 12 - Savings account creation sequence: queries and meaning

SQL Query: SELECT a.account_number, a.balance FROM Depositor AS d, Account AS a, Checking_Account AS c WHERE a.account_number = d.account_number AND c.account_number = a.account_number AND d.customer_id = 17214
Meaning:   Selection of all checking accounts belonging to customer number 17214.

SQL Query: INSERT INTO Account VALUES (74652,0,'Campo Grande')
Meaning:   Creation of account number 74652, with balance 0, in the Campo Grande branch.

SQL Query: INSERT INTO Savings_Account VALUES (74652, 1)
Meaning:   Registration of the account as a savings account. Interest rate of 1%.

SQL Query: INSERT INTO Depositor VALUES (17214,74652)
Meaning:   Association of the customer with the new account.

Stored Procedure: exec internal_account_transfer 7583,74652,189
Meaning:   Transfer of funds from the checking account to the new savings account. The amount to transfer is 189.
Note that the choice of a customer is not shown. This is done the same way as for branches and
employees, i.e., by retrieving all existing customers and randomly choosing one. As a customer is
only created when a checking account is created, every customer has at least one checking account. Therefore, after selecting a random customer, the first checking account belonging to him is chosen as the “parent account” of the savings account. The first account is chosen for simplicity reasons only. In the example above, account number 7583 is the first checking account of customer 17214 (account 74652 is the newly created savings account).
Also note that “exec internal_account_transfer” is a stored procedure that transfers a specific
amount of funds between two accounts.
As in the checking account case we created a second version of the sequence where the order of
the queries was changed (Table 13).
Table 13 - Savings account creation sequence (version 2) queries

SQL Query (or Stored Procedure)
SELECT a.account_number, a.balance FROM Depositor AS d, Account AS a, Checking_Account AS c WHERE a.account_number = d.account_number AND c.account_number = a.account_number AND d.customer_id = 17214
INSERT INTO Account VALUES (74652,0,'Campo Grande')
INSERT INTO Savings_Account VALUES (74652, 1)
exec internal_account_transfer 7583,74652,189
INSERT INTO Depositor VALUES (17214,74652)
Loan Creation
The creation of a loan is done through the execution of the following sequence of conceptual
actions:
1. Verifying the total balance of the customer;
2. Creating a loan whose amount does not exceed the customer's total balance;
3. Associating the customer with the loan.
Table 14 shows the queries that are performed on the database when a loan creation sequence
is executed.
Table 14 - Loan creation sequence: queries and meaning

SQL Query: SELECT SUM(a.balance) FROM Account AS a, Depositor AS d WHERE a.account_number = d.account_number AND d.customer_id = 29412
Meaning:   Verification of the total balance of client number 29412.

SQL Query: INSERT INTO Loan VALUES (77709,324,'Campo Grande')
Meaning:   Creation of loan number 77709, referring to an amount of 324, in branch Campo Grande.

SQL Query: INSERT INTO Borrower VALUES (29412,77709)
Meaning:   Association of the client with the loan.
The customer is randomly chosen, as was explained for the previous sequence. There are no variants of this sequence, as it represents the only possible order in which the queries can be carried out.
Loan Payment
The loan payment sequence refers to the total payment of a loan. As the loan can be paid either
with cash or by transfer there are two versions for this sequence. The differences between the
two versions are more than just the order in which the actions are executed, so we can actually
consider them two different sequences. The conceptual steps of the first version are the following:
1. Paying loan with cash;
2. Registering loan payment;
3. Deleting loan record and associated payment and borrower records.
It may seem odd to delete the payment record immediately after creating it. The reason for this is
that, despite assuming only total loan payments (i.e., all payments are liquidations), we wanted
to simulate that the banker only discards the loan records after verifying that the debt was settled.
Table 15 shows the loan payment sequence queries and their meaning.
Table 15 - Loan payment sequence: queries and meaning

SQL Query: UPDATE Loan SET amount = amount - 56 WHERE loan_number = 83209
Meaning:   Update of loan number 83209's amount. (The initial amount was 56, so the debt is settled.)

SQL Query: INSERT INTO Payment VALUES (83209,38148,'1-4-2007',56)
Meaning:   Registering of payment number 38148 pertaining to loan 83209, paid on 1-4-2007, whose amount is 56.

SQL Query: DELETE FROM Borrower WHERE loan_number = 83209
Meaning:   Deletion of borrowers pertaining to this loan.

SQL Query: DELETE FROM Payment WHERE loan_number = 83209
Meaning:   Deletion of payments pertaining to this loan.

SQL Query: DELETE FROM Loan WHERE loan_number = 83209
Meaning:   Deletion of the loan.
Analogously to the other sequences, the random choice of a loan and retrieval of the amount in
debt are auxiliary steps to the simulation and are not considered in the sequence.
The second version of this sequence involves the following steps:
1. Selecting an account belonging to the client;
2. Paying the loan by transfer;
3. Registering loan payment;
4. Deleting loan record and associated payment and borrower records.
Table 16 illustrates this sequence’s queries and their meaning.
Table 16 - Loan payment sequence (version 2): queries and meaning

SQL Query: SELECT customer_id FROM Borrower WHERE loan_number = 83209
Meaning:   Getting all the customers associated with a particular loan.

SQL Query: SELECT a.account_number, a.balance FROM Depositor AS d, Account AS a, Checking_Account AS c WHERE a.account_number = d.account_number AND c.account_number = a.account_number AND d.customer_id = 40416
Meaning:   Selection of all checking accounts belonging to customer number 40416.

Stored Procedure: exec loan_payment_by_transfer 53211, 83209,56
Meaning:   Loan payment by transfer. Source account 53211, loan number 83209 and amount 56.

SQL Query: INSERT INTO Payment VALUES (83209, 38148,'1-4-2007',56)
Meaning:   Registering of payment number 38148 pertaining to loan 83209, paid on 1-4-2007, whose amount is 56.

SQL Query: DELETE FROM Borrower WHERE loan_number = 83209
Meaning:   Deletion of borrowers pertaining to this loan.

SQL Query: DELETE FROM Payment WHERE loan_number = 83209
Meaning:   Deletion of payments pertaining to this loan.

SQL Query: DELETE FROM Loan WHERE loan_number = 83209
Meaning:   Deletion of the loan.
The first step intends to simulate that the client states he wants to pay a loan by transfer and the
banker selects all borrowers associated with that loan and confirms the client is one of them. It is
assumed that the client is always the first borrower on the list.
As before, the random selection of a customer, a loan associated with that customer and the
amount in debt are auxiliary steps. The account chosen to pay the debt is the first account of the
customer for simplicity. “exec loan_payment_by_transfer” is a stored procedure that uses a
specific amount of money from an account to pay a loan.
The simulator application
As stated above, the application that interacts with the database is able to perform four different
operations: creation of a checking account, creation of a savings account, creation of a loan and
payment of a loan.
Figure 16 shows the simulator interface. Each button executes one of the operations mentioned.
Figure 16 - Simulator application interface
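As an illustration of what happens when one of these buttons is pressed, the following C# sketch executes the checking account creation sequence of Table 8 against the banking database with freshly generated numbers. The connection string, the fixed literals ('John', 'Campo Grande', employee 6) and the helper method are illustrative assumptions; the actual simulator also randomizes the name, city, branch and employee.

using System;
using System.Data.SqlClient;

class CheckingAccountSimulation
{
    static Random rnd = new Random();

    static void Main()
    {
        string conString = "Data Source=localhost;Initial Catalog=Bank;Integrated Security=True";
        int customerId = rnd.Next(10000, 99999);
        int accountNumber = rnd.Next(10000, 99999);
        int balance = rnd.Next(0, 1000);
        int overdraft = rnd.Next(0, 500);

        using (SqlConnection con = new SqlConnection(conString))
        {
            con.Open();
            // The same five queries shown in Table 8, executed in order (version 1 of the sequence).
            Execute(con, "INSERT INTO Customer VALUES (" + customerId + ",'John','Happy Street','Sunny City')");
            Execute(con, "INSERT INTO Account VALUES (" + accountNumber + "," + balance + ",'Campo Grande')");
            Execute(con, "INSERT INTO Checking_Account VALUES (" + accountNumber + ", " + overdraft + ")");
            Execute(con, "INSERT INTO Depositor VALUES (" + customerId + "," + accountNumber + ")");
            Execute(con, "INSERT INTO Cust_Banker VALUES (" + customerId + ",6,'moyvo')");
        }
    }

    static void Execute(SqlConnection con, string sql)
    {
        using (SqlCommand cmd = new SqlCommand(sql, con))
        {
            cmd.ExecuteNonQuery();   // each execution is picked up by the SQL Server Profiler trace
        }
    }
}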
The collected event log
As explained before, in order to capture the interactions between the application and the database in a log we used the SQL Server Profiler tool, which monitors the SQL Server Database Engine and
records data about each event. The types of events recorded can be chosen manually. In our
case we were only interested in SQL query completion and stored procedure completion events.
Figure 17 illustrates the SQL Profiler event log resulting from the savings account creation
sequence (version 1).
Figure 17 – Profiler event log
Note that the first two queries do not belong to the sequence presented above. These are
auxiliary actions that exist only for simulation purposes. In this case these two queries help the
application choose the customer that creates the account and the branch where the account is
created. As they are recorded alongside every other query and because they do not belong to the
sequence we must filter them later. We will see in the next chapters how this is done.
After the log was captured we saved it into a table of our database in order to be able to
manipulate it. As stated in chapter 3, we cannot apply the sequence clustering algorithm directly
to this event log. By looking at the log we can only see a list of queries without any type of
associated case id. It is impossible to know exactly how many sequences exist, where each one starts or ends, or which events are included in it. Remember that the simulation phase is now over, and we assume that this log was generated by a real system about which we know nothing. The problem, then, is to find a way of determining what sequences of events exist in it.
After finding these sequences we must structure them in a way understandable by the sequence
clustering algorithm. The next chapter shows how to find the hidden sequences.
CHAPTER 5: LOG PREPROCESSING
After capturing the log it is necessary to process it and restructure its data so that we are able to
apply the Sequence Clustering algorithm. As explained in chapter 2, Figure 10, this algorithm
requires a case table and a nested table to function. These tables identify which sequences exist
and which events belong to each sequence. Since the log is just a long list of queries and stored procedures, we had to find a way to determine where each sequence begins and ends and which events belong to each one. An application called Sequence Builder was developed for this purpose. In this chapter we describe the steps taken by this application to fulfill it.
Filtering the log
After saving the log to a database table called Trace, which mimics the structure of Figure 17, the
first step in log processing is to take each row and parse the TextData field in order to know what
type of query it represents and what tables and attributes were manipulated by it. When this is
known the program decides whether to store the query or ignore it. The program ignores a query
if it is not a select, insert, delete, update or execute stored procedure command or if it is too
generic to be of use, like a select, delete or update command without a where clause. For
example, the query SELECT * FROM Customer would be ignored. In this way the auxiliary queries mentioned in the previous chapter are filtered out.
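As a rough illustration of this criterion, the following T-SQL sketch selects from the Trace table only the statements that would be kept. It is only a sketch: the actual Sequence Builder parses the TextData field in application code, and the column names used here (RowNumber, TextData) are assumed from the Profiler trace format.
-- Keep only insert, exec and where-qualified select, delete and update statements
-- (a simplified approximation of the filtering rule; column names are assumptions).
SELECT RowNumber, TextData
FROM Trace
WHERE TextData LIKE 'INSERT INTO %'
   OR TextData LIKE 'exec %'
   OR (TextData LIKE 'SELECT %' AND TextData LIKE '% WHERE %')
   OR (TextData LIKE 'DELETE %' AND TextData LIKE '% WHERE %')
   OR (TextData LIKE 'UPDATE %' AND TextData LIKE '% WHERE %')
ORDER BY RowNumber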
Storing queries
In order to support the preprocessing steps that will be described ahead, two new database
tables were created to store relevant queries: Query and Keys. Figure 18 shows these two tables.
Figure 18 – Tables Query and Keys
To better illustrate how the storing process works, we will describe how some of the queries shown in chapter 3 are registered. For example, INSERT INTO Customer VALUES (85045, 'John', 'Happy Street', 'Sunny City') originates the following records:
Figure 19 - Query table for INSERT Customer example
Figure 20 - Keys table for INSERT Customer example
Note that the program only stores the value for the customer_id field, which is the primary key for
table Customer. There is no need to store anything else, because the tag and the key_name,
key_value pair unequivocally identify that this query is an insertion of customer 85045 in the
database. The program knows how many primary keys each table has and their names because
it queries the database catalog before the log processing begins and stores these data.
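One possible way of obtaining this information is through the standard INFORMATION_SCHEMA views; the query below is a sketch of such a catalog lookup and is not necessarily the exact query used by the application.
-- List the primary key columns of every user table in the database.
SELECT kcu.TABLE_NAME, kcu.COLUMN_NAME
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS kcu
  ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
 AND kcu.TABLE_NAME = tc.TABLE_NAME
WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
ORDER BY kcu.TABLE_NAME, kcu.ORDINAL_POSITION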
Executing the Open Savings Account sequence, Table 12, produces the following records:
Figure 21 - Query table for Open Savings Account sequence example
Figure 22 - Keys table for the Open Savings Account sequence example
There are three things to note in this example. First, the tag is constructed by concatenating the type of query with the names of all the tables read or written by it (Figure 21, query_id 1) or, in the case of a stored procedure, by the word exec followed by the name of the procedure. Second, since some queries interact with tables that have more than one primary key, some query_ids have more than one key_name, key_value pair (query_id 4). Finally, all stored procedure parameters are stored in table Keys, and the parameter names are generated by concatenating the word spparameter with a sequential number. Although we could access the stored procedures' internal structure by querying the database, it is usually too complex and variable to parse, making it very hard to know which tables are being manipulated; therefore we assume all the parameters are important even if we do not know what they represent.
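To make the stored procedure case concrete, the statement exec loan_payment_by_transfer 53211, 83209, 56 from chapter 3 could be registered with rows such as the ones below. This is only an illustration: the query_id value, the column names of Query and Keys and the exact tag format are assumptions based on the description above and on Figure 18.
-- Hypothetical rows produced for "exec loan_payment_by_transfer 53211, 83209, 56".
INSERT INTO Query (query_id, tag) VALUES (6, 'execloan_payment_by_transfer')
INSERT INTO Keys (query_id, key_name, key_value) VALUES (6, 'spparameter1', '53211')
INSERT INTO Keys (query_id, key_name, key_value) VALUES (6, 'spparameter2', '83209')
INSERT INTO Keys (query_id, key_name, key_value) VALUES (6, 'spparameter3', '56')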
Also note that the query_id field indicates the order in which the rows were read from the log and,
consequently, the sequential order in which the queries were executed.
Discovering the sequences
Having saved every relevant query in the Query and Keys tables we still have no more than a disconnected list of queries, albeit a more organized one. It is now time to determine what sequences of queries exist in this list. As we cannot know in advance where a sequence starts or ends, we decided to first find out to which queries each query is connected.
But how do we know that two queries are connected and consequently belong to the same sequence? We simply compare each query with every other query, and if they share a key_name, key_value pair we assume they are connected. The special key_names of stored procedures are assumed to match every other key_name; this way, if at least one of the queries being compared is a stored procedure, matching key_values is enough to assume a connection. In the opening savings account example above (Figure 21 and Figure 22), query 1 is connected to query 4 (by the customer_id, 1108249055 pair) and query 5 is connected to queries 2, 3 and 4 (by the key_value 1880652078 - stored procedure special case).
The connections that are found are saved in table Follows (Figure 23).
Figure 23 – Table Follows
For example, if we determine that query 1 is connected to query 2, meaning they belong to the
same sequence, the table would look like this:
Figure 24 - Query 2 follows query 1 example
Since the query_id attribute indicates the order in which queries were executed, we know query 2
happened after query 1, therefore, when we determine queries 1 and 2 are connected we also
know query 2 follows query 1 and not the other way around.
Figure 25 shows all the connections present in the savings account creation example (Figure 21
and Figure 22).
Figure 25 - Follows table for the open savings account example
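The pairwise comparison itself can be expressed as a self-join over the Keys table. The sketch below captures the general idea; it omits the stored procedure special case (where matching key_values alone is enough) and assumes Follows has columns previous_query_id and next_query_id, which may differ from the actual implementation.
-- Connect every pair of queries that share a key_name, key_value pair.
-- Since query_ids reflect execution order, the query with the lower id is the predecessor.
INSERT INTO Follows (previous_query_id, next_query_id)
SELECT DISTINCT k1.query_id, k2.query_id
FROM Keys AS k1
JOIN Keys AS k2
  ON k1.key_name  = k2.key_name
 AND k1.key_value = k2.key_value
 AND k1.query_id  < k2.query_id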
After all possible query pairs are compared the Follows table will actually contain a graph
representation of the whole log, where states are events and edges are connections between
events. Figure 26 illustrates this idea for the “Open Checking Account” and “Open Savings
Account” sequences.
Removing long connections
It is easy to see that if a customer creates two accounts, for example, there will be connections
not only among events intuitively belonging to a sequence but also between different sequences.
This case is illustrated in Figure 27, where there should be three different sequences but there
are only two. In order to solve this problem we developed a heuristic where very long connections
are removed from table Follows, since these connections are rarely meaningful links between
events at this level. However, these links may provide insight into higher level patterns of
behavior (Ferreira 2007) and thus are stored for future use (see chapter 7). To do this, the
average connection length is calculated and all connections that exceed a certain average-based threshold are removed. In Figure 27, for example, the average link length is (18 + x)/11, where x is the length of the dashed link. We could reject the dashed link if, for example, x ≥ 2*(18 + x)/11, meaning x ≥ 4, which works very well for this example (Ferreira 2007). Experience with the
heuristic has shown us that a detailed study of how to determine the best threshold value will
have to be undertaken in the future as the best value varies greatly with the input log.
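A possible implementation of this heuristic over the Follows table is sketched below. It measures the length of a link as the difference between the query_ids it connects, uses twice the average length as the threshold (as in the example above) and moves the long links to a separate table instead of discarding them; the column names and the threshold factor are assumptions.
-- Move links that are much longer than average from Follows to LongFollows.
DECLARE @avg_length FLOAT
SELECT @avg_length = AVG(CAST(next_query_id - previous_query_id AS FLOAT)) FROM Follows

INSERT INTO LongFollows (previous_query_id, next_query_id)
SELECT previous_query_id, next_query_id
FROM Follows
WHERE next_query_id - previous_query_id >= 2 * @avg_length

DELETE FROM Follows
WHERE next_query_id - previous_query_id >= 2 * @avg_length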
Figure 26 - Graph view of two banking sequences. Links have the names of the parameters whose
values are the same in the queries they connect. (Ferreira 2007)
Figure 27 - Links between events that belong to different sequences. (Ferreira 2007)
Building the case and nested tables
Having removed all the long links from the Follows table, the application now has to find out what specific sequences exist, i.e., the application must go through all implicit sub-graphs in this table, sort the queries in chronological order (sorting by query_id has the same effect, as already discussed) and store the sequences found. In order to do this the application executes the following cycle:
1. Get the first query from the Query table. If the table is empty, terminate;
2. Find the nodes belonging to this query's sub-graph by using a recursive SQL query over table Follows (see the sketch after this list);
3. This group of nodes constitutes a sequence, and this sequence is now inserted into a case table and a nested table (Figure 28);
4. Remove all queries belonging to that sequence from the Query table to avoid repeatedly finding the same sequence.
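The text describes step 2 as a recursive SQL query; its exact form is not shown here, but the sketch below computes the same sub-graph iteratively, expanding a set of reachable query_ids through Follows links in both directions until no new queries are added (column names are assumptions, as before).
-- Collect the connected component (sub-graph) of the first remaining query.
DECLARE @start_query INT
SELECT @start_query = MIN(query_id) FROM Query

DECLARE @component TABLE (query_id INT PRIMARY KEY)
INSERT INTO @component (query_id) VALUES (@start_query)

WHILE 1 = 1
BEGIN
    INSERT INTO @component (query_id)
    SELECT f.next_query_id
    FROM Follows AS f JOIN @component AS c ON c.query_id = f.previous_query_id
    WHERE f.next_query_id NOT IN (SELECT query_id FROM @component)
    UNION
    SELECT f.previous_query_id
    FROM Follows AS f JOIN @component AS c ON c.query_id = f.next_query_id
    WHERE f.previous_query_id NOT IN (SELECT query_id FROM @component)

    IF @@ROWCOUNT = 0 BREAK   -- stop when no new queries are reached
END

-- The queries in @component, sorted by query_id, form one sequence.
SELECT query_id FROM @component ORDER BY query_id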
Figure 28 shows the diagram of the two tables that store the sequences that were found. Table
Sequence stores the sequence_ids that exist while table Action stores the events (queries) that
belong to each sequence. The position field is simply the query_id of the query represented and
the action_tag is its tag.
[Diagram: table Sequence, with primary key sequence_id; table Action, with composite primary key (sequence_id, position), where sequence_id is also a foreign key to Sequence, plus the attribute action_tag.]
Figure 28 - Tables Sequence and Action.
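A possible definition of these two tables, matching the diagram in Figure 28, is sketched below; the data types are assumptions.
-- Case table: one row per sequence found in the log.
CREATE TABLE [Sequence] (
    sequence_id INT NOT NULL PRIMARY KEY
)

-- Nested table: one row per event (query) of each sequence.
CREATE TABLE [Action] (
    sequence_id INT NOT NULL REFERENCES [Sequence] (sequence_id),
    position    INT NOT NULL,              -- the query_id of the event (global execution order)
    action_tag  VARCHAR(255) NOT NULL,     -- the event's tag
    PRIMARY KEY (sequence_id, position)
)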
We have now found all the sequences present in the log and have the data structured in a case table / nested table format, where the Sequence table is the case table and the Action table is the nested table. If you look at the example in chapter 3, you can see that these tables mimic the structure of those in Figure 10. Note that what we call position corresponds to the Zapping example's SequenceID, what we call sequence_id corresponds to the Zapping example's ViewerID and our action_tag corresponds to ChannelType. We made this change of names because we disagreed with the nomenclature in (Tang 2005) and thought the term “sequence id” was better suited to identifying a sequence than the position of an event inside a sequence.
The two figures below illustrate how the Sequence and Action tables look after processing a log
with only two sequences, a checking account creation and a savings account creation.
Figure 29 - Sequence table example
Figure 30 - Action table example
Note that the position attribute is global, meaning we can know whether an action from one sequence happened before or after an action from another sequence. As the global chronological order of the events is easily determined just by looking at the log, we decided to keep the position attribute
global. This does not influence the results of the sequence clustering algorithm since it will always
assume that each sequence begins at the event with the lowest position value in that sequence.
It is now time to apply the Sequence Clustering algorithm which will enable us to find Markov
models for the different types of sequences.
CHAPTER 6: RESULTS
We will now present the results we obtained by applying the techniques from the previous chapters over a simulator-generated log of about one thousand events comprising roughly 200 sequences. When the sequence clustering algorithm is run without specifying a number of clusters it uses a heuristic to determine the most “adequate” number of clusters. Doing this over our data resulted in the algorithm grouping nine different sequence models in eight clusters, i.e., one of the clusters contained two Markov chains, for two different types of sequences. Seeing this, we instructed the algorithm to create nine clusters and ran it again. Each cluster now had just one model (Figure 31, Figure 32 and Figure 33). As can be seen, all the different types of sequences
were identified correctly except one.
Clusters 1 and 8 represent the two ways of opening a savings account, while clusters 2, 3, 4 and
7 model the four checking account creation sequences. Cluster 6 shows the loan creation
sequence and cluster 9 the money payment version of the loan payment sequence. All these
Markov chains have transitions with 100% probability and therefore constitute deterministic
graphs. Finally, cluster 5 contains the model for the loan payment by transfer sequence, albeit
with some errors.
Figure 31 - Sequence models for clusters 1, 2, 3 and 4
Figure 32 - Cluster 5’s sequence model
Figure 33 - Sequence models for clusters 6, 7, 8 and 9
The loan payment by transfer sequence is the only sequence that was not modeled correctly; the action SelectDepositorAccountChecking_Account appears when it was not supposed to. Although an action of this type is included in the loan payment by transfer sequence, it is easy to see in Table 16 that that action is not connected to any other action of that sequence by our criterion (same key_name, key_value pair); thus the action we see in the figure belongs to some other sequence, more specifically one of the Savings Account Creation sequences, which are the only other sequences that have the SelectDepositorAccountChecking_Account action. This action is included in this sequence because it shares a key_name, key_value pair with one of its actions.
In a real usage scenario we would not have this knowledge, so it is important that this kind of noise does not compromise the understanding of the sequences of actions that were executed.
The percentages associated with transitions should be a good indicator of noise. Indeed, in this
example, the low percentage associated with the transition to this action, 28%, is sufficient to
indicate that we are in the presence of noise. Moreover, the population associated with each
cluster (number of sequences in that cluster) or state (number of times the event corresponding
to that state appears in the sequences of that state’s cluster) also indicates the representative
power of that cluster or state. This is illustrated by the color of the cluster or state; the darker the
blue the bigger the population.
Figure 34 shows the cluster diagram for this example. The links indicate cluster similarity. The
algorithm considered clusters 2, 3, 4 and 7 all very similar. As we know these clusters correspond
to the open checking account sequence variations. As these variations only change the order of
the actions and as each cluster only contains sequences corresponding to one of these variations
it is easy to understand why the algorithm found them to be similar. It also found similarities
between clusters 9 and 5 and between clusters 1 and 8, which correspond to the loan payment and savings account opening sequences, respectively. Cluster 6 is isolated, indicating the algorithm found it to be different from all the others. This cluster contains the loan creation sequences, which do
not have any variations and whose events are different from the other sequences, thus justifying
its isolated position.
Figure 34 - Cluster diagram
We conclude that it was in fact possible to find and group correctly all the different types of
sequences and variations contained in the log. Moreover, the algorithm was able to identify which
types of sequences were similar to others, thus indicating what types were variations of other
types.
In the next chapter we will present the conclusions drawn from these results and from all our work
in general. We also give a glimpse of some future work which has already been started.
CHAPTER 7: CONCLUSION AND FUTURE WORK
With careful preparation of the input data the Sequence Clustering algorithm proved to be a good
tool for modelling different types of sequences of actions and for making explicit the differences
between these types. Also, the fact that Markov chains have probabilities associated with
transitions gives us a tool to deal with noise. By ignoring low probability transitions erroneous
sequence paths or exceptional behaviour may be detected.
However, the algorithm has some shortcomings that, we hope, can be corrected in future versions. The first problem is its tendency to put different types of sequences in the same cluster. When this happens it is clear, by looking at the cluster, that we are in the presence of two different types of sequences. We think this happens because the algorithm first determines the number of clusters, according to its heuristic, and only then decides which sequences go to each cluster. If the number of clusters is smaller than the number of different types of sequences, the algorithm will put two types in the same cluster. Another related problem is that the parameters are not deterministic but merely indicative; thus, setting the number of clusters to nine will not necessarily result in the algorithm creating nine clusters. Actually, more often than not, the algorithm chooses a close number that it considers better, according to its heuristic.
Shortcomings aside, the execution of the algorithm is the simplest step of the whole approach. The biggest challenge is to identify the set of sequences of actions in a completely automatic way. All of the process mining approaches presented in chapter 1 assume logs only contain events resulting from executions of a single process. Our approach makes no such assumption, since most enterprise systems' logs will not contain events pertaining to just one process, and also because these logs will contain events whose processes and process instances are unknown. Therefore, we had to develop a way of determining the case id for each event. The technique designed for that purpose, matching key_name, key_value pairs in order to build graphs corresponding to sequences, is the most important step of the described approach, since it makes it possible to mine event logs containing events from multiple, interleaved processes.
Hierarchical sequence clustering
Having found the models for the sequences we had initially generated with the simulator, we closed a cycle. Having discovered sequences of actions, which we now designate as tasks, the path is clear for findings at the process level. Since each cluster found contains one type of sequence, or task, if we apply the sequence clustering algorithm again over these clusters (Figure 31, Figure 32 and Figure 33), i.e., treating tasks as events and trying to find sequences of tasks, we expect to get some insight into behavior at the process level. Therefore, we will now present a possible approach to this goal, based on hierarchical clustering, i.e., the successive application of sequence clustering in order to derive higher level results each time.
As mentioned in chapter 5, a heuristic was developed to avoid storing erroneous links between queries. In this way, links whose length was significantly bigger than the average link length were removed from table Follows. These long links were not deleted but instead stored in a different table, similar to the Follows table, which we called LongFollows.
The links in this table provide us with enough information to establish connections at the task, or
sequence, level. We will illustrate this idea with an example. Take Figure 29 and Figure 30 for
instance. Suppose we had found a link between the first query of sequence 1 and the last query of sequence 2, i.e., between the query at position 1 and the query at position 10, and that because this link was much longer than the others we had removed it from the Follows table and stored it in the LongFollows table. Now, suppose that after applying the sequence clustering
algorithm sequence 1 is placed in cluster 1 and sequence 2 is placed in cluster 2. By looking at
the link 1 -> 10 in table LongFollows and since we know query 1 pertains to sequence 1 and
query 10 pertains to sequence 2 we know there is a connection between sequence 1 and
sequence 2. Furthermore, by querying the mining model we know in which cluster each sequence
is and therefore we know there is a connection between cluster 1 and cluster 2. We have now a
method for determining connections between sequences, or tasks. If, in tables Sequence and Action, we now use sequences in place of actions and the corresponding clusters in place of action tags, we obtain the following:
Figure 35 - Sequence table for hierarchical clustering example
Figure 36 - Action table for hierarchical clustering example
This method constitutes a generic technique for hierarchical clustering, since it can be applied as
many times as desired in order to derive higher level results each time.
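A minimal sketch of how these task-level connections can be derived is shown below. It assumes the LongFollows columns previous_query_id and next_query_id, and a hypothetical ClusterAssignment table with columns sequence_id and cluster_id, previously filled by querying the mining model for the cluster assigned to each sequence; none of these names are taken from the actual implementation.
-- For each long link, find the sequences (and their clusters) at its two ends.
SELECT a1.sequence_id AS from_sequence, c1.cluster_id AS from_cluster,
       a2.sequence_id AS to_sequence,   c2.cluster_id AS to_cluster
FROM LongFollows AS lf
JOIN [Action] AS a1 ON a1.position = lf.previous_query_id
JOIN [Action] AS a2 ON a2.position = lf.next_query_id
JOIN ClusterAssignment AS c1 ON c1.sequence_id = a1.sequence_id
JOIN ClusterAssignment AS c2 ON c2.sequence_id = a2.sequence_id
Each returned row states that a task of from_cluster was followed by a task of to_cluster, which is the information used to fill the second-level Sequence and Action tables.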
The results obtained by applying this method to the clusters in Figure 31, Figure 32 and Figure 33 are shown in Figure 37, Figure 38 and Figure 39. In Figure 37 the strong connections from clusters 2, 4 and 7 to cluster 6 are visible, indicating that an open checking account sequence usually precedes the creation of a loan. We, of course, know this is always true, as in the simulator the creation of a loan implies the existence of a customer and, therefore, of a checking account. In Figure 38 there is a very strong connection from cluster 2 to cluster 8, indicating that after the creation of a checking account a savings account is created, and there are connections from clusters 3 and 4 to cluster 6, as well as a weaker link from cluster 7 to cluster 6, once more indicating that loan creation is preceded by checking account creation. Figure 39 also supports this last idea. Also note that the
intuitive connection between loan creation and loan payment, which corresponds to transitions
from cluster 6 to clusters 5 and 9, is not clear from these results.
Figure 37 - Second level clustering: cluster 1
Figure 38 - Second level clustering: cluster 2
Figure 39 - Second level clustering: cluster 3
All in all, these results show that some higher level conclusions may be drawn with successive applications of the algorithm, and also that more work has to be done in this respect, since some of the transitions obtained are not very obvious and do not correspond to expected behavior, while some important transitions are missing or are not strong enough. Nevertheless, this seems to be a promising approach for drawing conclusions at higher levels.
Final remarks
“The ultimate objective of Organizational Design and Engineering is to develop the body of
knowledge needed to model the organization in such a way that will allow understanding the
organization and its behaviors as well as to predict and control the outcomes of the different
organizational design and operation activities with a known degree of certainty (or uncertainty).”
(Matos 2007)
In the context of organizational engineering, our approach positions itself as a tool for modelling
hidden organizational behaviour at the task or process level. The direct analysis of event logs is
able to provide us with an idea of the types of tasks executed (which was demonstrated in
chapter 6). These tasks constitute the basic component of an organization’s business processes.
Furthermore, there is a real possibility that undertaking successive steps of analysis over these results, as suggested, will make explicit patterns of behavior at successively higher levels of abstraction. Nevertheless, there are some difficulties.
SQL queries constitute a relatively low level machine operation. If we aim at finding higher level
patterns of behavior based on higher level actions (or tasks) performed by either machines or
humans the fundamental steps of this approach could, in theory, be applied in almost the same
way, if a similar kind of event log is available. But the higher we go the more complex actions
become. At some point it will be difficult to determine what entities each action manipulates and consequently to decide whether two actions belong to the same sequence or not. Take a messaging application, for example. We could use the subject field to determine whether two messages belong to the same sequence. But people use the same subjects to start new sequences. We could use the body of the message, but it may not be enough. Numerous aspects would have to be used to determine what the essence of the action is, what unequivocally identifies it. At some point human
validation would be needed. In the context of enterprise architecture the discovery of processes in
this way would be hindered by the complexity and high level of abstraction of the informational
entities manipulated by each activity.
Still in the realm of organizational engineering it is possible to establish a parallel between
comparing SQL query parameters in order to find sequences of related events, and comparing
action resources in order to discover human activity contexts (Zacarias 2007). This means that
the same technique can be used to identify different features of an enterprise architecture.
Ultimately, techniques such as process mining can be used as a useful instrument to gain insight
into the implicit behaviour associated with a given enterprise architecture.
The experiments described in this report confirm that sequence clustering algorithms can be
valuable tools in the sequential analysis of event logs and, consequently, in the discovery of an organization's tasks, overcoming (with the help of appropriate preprocessing techniques) certain limitations of current algorithms. In order to gain knowledge at the process level, some more work
will have to be done so that we can understand what additional data is necessary to connect
tasks. There is also the possibility that human validation will be necessary.
REFERENCES
(Aalst 2003) van der Aalst, W. M. P., van Dongen, B. F., Herbst, J., Maruster, L., Schimm, G.,
and Weijters, A. J. M. M.: Workflow Mining: A Survey of Issues and Approaches. (2003)
(Aalst 2004) van der Aalst, W. M. P., Weijters, A. J. M. M., and Maruster, L.: Workflow Mining:
Discovering process models from event logs. (2004)
(Agrawal 1998) Agrawal, R., Gunopulos, D., and Leymann, F.: Mining Process Models from Workflow Logs. In: EDBT '98: Proceedings of the Sixth International Conference on Extending Database Technology. (1998)
(Cook 1995) Cook, J. E., and Wolf, A.L.: Automating Process Discovery through Event-Data
Analysis. In: ICSE ’95: Proceedings of the 17th International Conference on Software Engineering.
ACM Press, New York, NY, USA (1995) 73-82
(Dongen 2004) van Dongen, B. F., and van der Aalst, W. M. P.: Multi-Phase Process Mining:
Building Instance Graphs (2004)
(Dongen 2005) van Dongen, B. F., and van der Aalst, W. M. P.: Multi-Phase Process Mining:
Aggregating Instance Graphs into EPCs and Petri Nets. In: Proceedings of the Second
International Workshop on Applications of Petri Nets to Coordination, Workflow and Business
Process Management (PNCWB). (2005)
(Ferreira 2007) Ferreira, D., Zacarias, M., Malheiros M. and Ferreira, P: Approaching Process
Mining with Sequence Clustering: Experiments and Findings. In: BPM 2007. (2007)
(P. Ferreira 2007) Ferreira, P.: Levantamento de Processos de Negócio por Observação de
Contextos. (2007)
(Greco 2004) Greco, G., Guzzo, A., Pontieri, L., and Saccà, D.: Mining Expressive Process
Models by Clustering Workflow Traces. (2004)
(Greco 2005) Greco, G., Guzzo, A., and Pontieri, L.: Mining Hierarchies of Models: From Abstract
Views to Concrete Specifications. (2005)
(Grinstead 1997) Grinstead, C. M., and Snell, J. L.: Introduction to Probability. American
Mathematical Society. Second edition. (1997) Chapter 11
(Herbst 1998) Herbst, J.: Integrating Machine Learning and Workflow Management to Support
Acquisition and Adaptation of Workflow Models. (1998)
(Herbst 2000) Herbst, J.: A Machine Learning Approach to Workflow Management. (2000)
(Ly 2006) Ly, L.T., Rinderle, S.B., Dadam, P., and Reichert, M.U. Mining Staff Assignment Rules
from Event Based Data. In: Proc. Business Process Management Workshops, BPM 2005
International Workshops. (2006)
(Matos 2007) Matos, M.: Organizational Engineering: An Overview of Current Perspectives. (2007)
(Medeiros 2006) Medeiros, A. K. A.: Genetic Process Mining. PhD Thesis, Technische Universiteit Eindhoven. (2006)
(Molloy 1982) Molloy, M. K.: Performance Analysis Using Stochastic Petri Nets. (1982)
(SearchCIO 2007) Search CIO.
http://searchcio.techtarget.com/sDefinition/0,,sid19_gci1088467,00.html (2007)
(Tang 2005) Tang, Z., and MacLennan, J.: Data Mining with SQL Server 2005. Wiley Publishing,
Inc., Indianapolis, Indiana, USA (2005) Chapter 8
(Zacarias 2007) Zacarias, M., Pinto, H. S., and Tribolet, J.: Reverse-engineering of Individual and Inter-Personal Work Practices: A Context-based Approach. (2007)