PROCESS MINING WITH SEQUENCE CLUSTERING

Miguel Dias Garutti Malheiros

Dissertação para obtenção do Grau de Mestre em Engenharia Informática de Computadores

Júri
Presidente: Prof. Pedro Sousa
Orientador: Prof. José Tribolet
Vogais: Prof. Diogo Ferreira
Prof. Miguel Mira da Silva

Julho de 2007

Agradecimentos

Aos professores Diogo Ferreira e Marielba Zacarias pelo enorme apoio que me deram e disponibilidade absoluta. O esforço que fizeram para que este trabalho resultasse foi muito para além das suas obrigações. Ao professor José Tribolet pela orientação que me deu e pelas suas sugestões e opiniões acerca de vários aspectos fundamentais na minha vida de estudante. Aos meus pais, aos meus avós e à minha namorada pelo apoio incondicional, motivação e confiança que sempre me transmitiram. Um agradecimento especial à minha mãe pelo exemplo de força e trabalho que me deu. Ao meu colega Pedro Ferreira pela troca de ideias e apoio que me deu.

Abstract

The goal of this work was to test an approach to process mining that, from an event log containing event records originating from different tasks or processes, could identify sequences of related events in a completely automatic way. We also proposed to apply a sequence clustering algorithm to these sequences. The sequence clustering algorithm we used groups similar sequences in clusters and constructs a model, in the form of a Markov chain, for each cluster. In order to generate event logs, we developed an application that executed sequences of actions over a small database and recorded these actions in logs. After capturing the event log, it was necessary to determine what sequences were in it. We analyzed all events and established connections between them according to a specific criterion. The resulting graphs, where states are events and edges are connections between events, constitute sequences of actions when their nodes are sorted chronologically. These sequences were fed as input to the sequence clustering algorithm. All the different types of sequences were identified correctly except one, whose model was slightly different from what it was supposed to be. Moreover, the algorithm was able to correctly identify which types of sequences were similar to others, thus indicating which types were variations of other types. We concluded that, with careful preparation of the input data, the sequence clustering algorithm proved to be a good tool for modelling different types of sequences of actions and for making explicit the differences between these types.

Keywords: process mining, sequence clustering, event log, log preprocessing

Resumo

O objectivo deste trabalho era testar uma abordagem ao process mining onde, a partir de um log de eventos contendo eventos originados por diferentes tarefas ou processos, conseguíssemos identificar sequências de eventos de um modo totalmente automático. Propusemos também aplicar um algoritmo de sequence clustering a estas sequências. O algoritmo de sequence clustering que usámos agrupa sequências semelhantes em clusters e constrói um modelo, na forma de uma cadeia de Markov, para cada cluster. De modo a gerar logs de eventos desenvolvemos uma aplicação que executa sequências de acções sobre uma pequena base de dados e registámos estas acções em logs. Depois de termos um log de eventos, era necessário determinar que sequências nele existiam. Analisámos todos os eventos e criámos ligações entre eles segundo um critério por nós escolhido.
Os grafos resultantes, onde os estados são eventos e as arestas são ligações entre eventos, representam sequências quando os seus nós são ordenados cronologicamente. Estas sequências constituíram o input fornecido ao algoritmo de sequence clustering. Todos os tipos de sequências foram identificados correctamente excepto um, cujo modelo era ligeiramente diferente do que devia ser. O algoritmo foi também capaz de identificar correctamente que tipos de sequências eram mais parecidos entre si, consequentemente indicando que tipos eram variações de outros tipos. Concluímos que, com uma cuidadosa preparação dos dados de entrada, o algoritmo de sequence clustering provou ser uma boa ferramenta para modelar diferentes tipos de sequências de acções e para explicitar as diferenças entre esses tipos.

Palavras-chave: process mining, sequence clustering, log de eventos, pré-processamento de log

TABLE OF CONTENTS
AGRADECIMENTOS
ABSTRACT
RESUMO
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
CHAPTER 2: STATE OF THE ART
PROCESS MINING APPROACHES
JONATHAN E. COOK
RAKESH AGRAWAL
JOACHIM HERBST
GIANLUIGI GRECO
WIL VAN DER AALST
ANA KARLA ALVES DE MEDEIROS
VAN DONGEN
CONCLUSION
CHAPTER 3: SEQUENCE CLUSTERING
THE MICROSOFT SEQUENCE CLUSTERING ALGORITHM
USING SQL SERVER 2005 ANALYSIS SERVICES
CHAPTER 4: DATA COLLECTION
A BANKING DATABASE
SIMULATION OF BANKING OPERATIONS
CHECKING ACCOUNT CREATION
SAVINGS ACCOUNT CREATION
LOAN CREATION
LOAN PAYMENT
THE SIMULATOR APPLICATION
THE COLLECTED EVENT LOG
CHAPTER 5: LOG PREPROCESSING
FILTERING THE LOG
STORING QUERIES
DISCOVERING THE SEQUENCES
REMOVING LONG CONNECTIONS
BUILDING THE CASE AND NESTED TABLES
CHAPTER 6: RESULTS
CHAPTER 7: CONCLUSION AND FUTURE WORK
HIERARCHICAL SEQUENCE CLUSTERING
FINAL REMARKS
REFERENCES

LIST OF TABLES
Table 1 - Workflow log (Aalst 2004)
Table 2 - Causal matrix (Medeiros 2006)
Table 3 - Process log (Dongen 2004)
Table 4 - State transition matrix for states {rainy, nice, snow} (Grinstead 1997)
Table 5 - Case table example
Table 6 - Nested table example
Table 7 - Wizard steps and respective code for analysis services configuration (P. Ferreira 2007 adapted)
Table 8 - Checking account creation sequence: queries and meaning
Table 9 - Checking account creation sequence (version 2) queries
Table 10 - Checking account creation sequence (version 3) queries
Table 11 - Checking account creation sequence (version 4) queries
Table 12 - Savings account creation sequence: queries and meaning
Table 13 - Savings account creation sequence (version 2) queries
Table 14 - Loan creation sequence: queries and meaning
Table 15 - Loan payment sequence: queries and meaning
Table 16 - Loan payment sequence (version 2): queries and meaning

LIST OF FIGURES
Figure 1 - Workflow lifecycle (Aalst 2003)
Figure 2 - A Framework for Process Discovery (Cook 1995)
Figure 3 - First algorithm in action with log {ABCDE, ACDBE, ACBDE} (Agrawal 1998)
Figure 4 - Second algorithm in action with log {ABCF, ACDF, ADEF, AECF} (Agrawal 1998)
Figure 5 - The most specific HMM that covers the log L = {abcac, abcc, abcabcc, abcabcac} (Herbst 1998)
Figure 6 - The most general HMM that covers the log L (Herbst 1998)
Figure 7 - Hierarchy tree and details for its leaves (Greco 2005)
Figure 8 - Instance graph for case 1 (Dongen 2004)
Figure 9 - Markov chain (Grinstead 1997)
Figure 10 - Viewer and ZappingSequence tables
Figure 11 - Data Mining Object Hierarchy (Tang 2005 adapted)
Figure 12 - Banking database diagram
Figure 13 - Customer table example
Figure 14 - Account table example
Figure 15 - Depositor table example
Figure 16 - Simulator application interface
Figure 17 - Profiler event log
Figure 18 - Tables Query and Keys
Figure 19 - Query table for INSERT Customer example
Figure 20 - Keys table for INSERT Customer example
Figure 21 - Query table for Open Savings Account sequence example
Figure 22 - Keys table for the Open Savings Account sequence example
Figure 23 - Table Follows
Figure 24 - Query 2 follows query 1 example
Figure 25 - Follows table for the open savings account example
Figure 26 - Graph view of two banking sequences. Links have the names of the parameters whose values are the same in the queries they connect. (Ferreira 2007)
Figure 27 - Links between events that belong to different sequences. (Ferreira 2007)
Figure 28 - Tables Sequence and Action
Figure 29 - Sequence table example
Figure 30 - Action table example
Figure 31 - Sequence models for clusters 1, 2, 3 and 4
Figure 32 - Cluster 5's sequence model
Figure 33 - Sequence models for clusters 6, 7, 8 and 9
Figure 34 - Cluster diagram
Figure 35 - Sequence table for hierarchical clustering example
Figure 36 - Action table for hierarchical clustering example
Figure 37 - Second level clustering: cluster 1
Figure 38 - Second level clustering: cluster 2
Figure 39 - Second level clustering: cluster 3

List of Abbreviations
FSM: Finite State Machines
HMM: Hidden Markov Model
API: Application Programming Interface
AMO: Analysis Management Objects
ADOMD.NET: ActiveX Data Objects Multidimensional for .NET
ADO.NET: ActiveX Data Objects for .NET
DMX: Data Mining Extensions
SQL: Structured Query Language

CHAPTER 1: INTRODUCTION

Current transactional information systems (e.g. Enterprise Resource Planning, Supply Chain Management, Customer Relationship Management and Workflow Management Systems) are capable of recording in great detail the interactions users have with them and of storing enormous amounts of records, which are called logs.
The underlying implicit information that they contain and its potential use has created the necessity to develop tools which are able to mine this information and make it explicit with the help of some formal representation. Process mining is one such tool. This technique takes event logs from a business process as its input and tries to generate a model for the business process that generated the logs. A business process is a set of coordinated tasks and activities, conducted by both people and equipment, that will lead to accomplishing a specific organizational goal (SearchCIO 2007). The generated process models can be used to formalize a process which had never been designed, to make a delta analysis of the process (i.e. evaluate the differences between the designed process and the real process) or to improve an already known model. Contrary to the traditional approach of first designing the process and then applying it process mining tries to first capture the real process from real execution and then formalize it (Figure 1). Figure 1 - Workflow lifecycle (Aalst 2003) However, there are some obstacles to generating the process models. Some processes are intrinsically complex and therefore are very difficult to mine. Also, incorrect logs (noise) can cause the mined model to contain errors or be inaccurate. Moreover, some processes contain activities which are not executed in computers and cannot be automatically logged, however we will abstract from this issue. 1 In chapter 2 we present a survey of the most significant process mining approaches, including proposed algorithms and their potential use as well as their applicability. Most authors assume that an event log only contains events from a single business process, i.e., that each event has an associated case or process instance. This comes as a major disadvantage since (1) the classes of information systems that are able to generate such logs are restricted to processaware systems, and (2) it becomes impossible to apply and benefit from process mining in scenarios where the log data is not available in that form (Ferreira 2007). In this report we present an approach that makes no such assumption. Our goal was to test an approach to process mining that, from an event log containing event records originating from different tasks or processes, could identify sequences of related events in a completely automatic way. We also propose to apply a sequence clustering algorithm to these sequences. The sequence clustering algorithm we used groups similar sequences in clusters and constructs a model, in the form of a Markov chain, for each cluster. No input information about the business logic of these events is provided or necessary. As a result, the models found do not constitute perfect models of the behavior that generated the events, but illustrate approximately what types of sequences were executed and their meaning. The sequence clustering algorithm is discussed in chapter 3. In order to generate event logs, we developed an application that executed sequences of actions over a small database and recorded these actions in logs. By looking at these logs alone it is impossible to know exactly what sequences originated them since they are just a long list of events. However, as we know the types of sequences that originated the logs, we have a way of assessing the quality of our approach by comparing the models found to the original sequences we programmed, and therefore we were able to make adjustments accordingly. 
The simulator application is described in chapter 4. After capturing the event log it is necessary to determine what sequences are in it. In chapter 5 we present our approach to this problem. We analyze all events and establish connections between them according to a specific criterion. The resulting graphs, where states are events and edges are connections between events, constitute sequences of actions when their nodes are sorted chronologically. In chapter 6 we discuss and show the results obtained. General conclusions and future work are presented in chapter 7.

CHAPTER 2: STATE OF THE ART

Process Mining Approaches

In this section we present a summary of various authors' approaches to process mining. All of the proposed algorithms use an event log as input and starting point for the discovery of underlying processes. An event log (sometimes called process trace or audit trail) is a list of records resulting from the execution of some process where, in order to be "minable", each record has to contain (at least) the activity being executed, the process instance identifier and the time of execution. If the activities are totally ordered, the execution times of the activities are dispensable. In the following sections we use the terms process and workflow interchangeably. We also do not differentiate between task and activity, although most business process views consider that these two terms are on different levels of granularity. Furthermore, we consider process mining, process discovery and workflow mining to represent the same technique and, more importantly, we consider this technique to be control-flow (or behavioral) process mining and not, for example, organizational process mining, which is concerned with modeling social interactions. Thao Linh Ly (Ly 2006), for example, used organizational models in conjunction with process mining in the discovery of rules for assigning work items to agents.

Jonathan E. Cook

Cook (Cook 1995) defines process discovery as a method for deriving a formal process model from basic data collected on the process. His approach, however, aims at retrieving only the behavioral aspects of the process, ignoring others such as the relationships between artifacts produced or relationships between the agents. The goal is not to discover a complete model of the process but an approximation of the behavioral patterns in it. These patterns are modeled as finite state machines (FSM).

Figure 2 - A Framework for Process Discovery (Cook 1995)

In order to achieve this goal, a trace of process execution is collected and analyzed (Figure 2). This trace is in the form of an event stream, in other words, a sequence of atomic actions. Three methods for the analysis of this event stream are proposed and compared. The first, RNet, is a purely statistical approach to inference and uses neural networks. The main disadvantage detected is that, given a known perfect sample, the network is unable to produce an FSM just for that stream, i.e. the FSM also models behavior not present in the stream. An advantage of this method is that, being statistical, it is robust to noise. The second method, KTail, is purely algorithmic. KTail defines the current state from the future states that can occur from it. One advantage of this method is that it can be parameterized by the value k, which indicates how far in the "future" (relative to its current state) it looks in the sample stream. A disadvantage is that it is not robust to noise.
The third method, using Markov chains, was developed by Cook and uses event-sequence probability tables to build an FSM that only accepts sequences of events whose probabilities exceed some threshold. This threshold is a parameter of the method, and the method's level of robustness to noise is controlled by it. Markov is a mix between an algorithmic and a statistical technique.

Rakesh Agrawal

Agrawal's (Agrawal 1998) main goal is to construct a complete process model from logs of executions of that process. As in (Cook 1995), the focus is on mining the behavioral - control-flow - aspects of the process. Agrawal models processes as graphs where the vertices represent activities and the edges represent control conditions. This is different from the FSM that Cook used, where the activities were transitions between states. A consequence of this difference is that in Agrawal's models each activity appears only once, while in Cook's FSM they could appear multiple times. Agrawal associates each edge in the graph with a Boolean function (whose argument is the state of the process) which determines if the edge is followed; however, in (Agrawal 1998) the author is not concerned with finding these functions, only the graph itself. The author's notion of dependency is transitive: if B depends on A and C depends on B then C depends on A. Agrawal states that for a process model graph G to be conformal with a log L it has to fulfill three requirements. First, each dependency in L has to be represented in G. Second, there can be no dependencies in G between activities that are independent in L. Finally, it must be possible to follow in G every execution that is in L. Agrawal proposes two algorithms for finding a conformal directed acyclic graph given a log from an acyclic process. The first one assumes that each execution contains each activity exactly once and produces a minimal conformal graph. The algorithm starts by building a graph with a vertex for each activity and adds an edge for each pair of activities that follow each other. It then removes edges that appear in both directions and finally computes the transitive reduction of the resulting graph, keeping only the edges needed to preserve every dependency as a path.

Figure 3 - First algorithm in action with log {ABCDE, ACDBE, ACBDE} (Agrawal 1998)

The second one does not make this assumption and produces a conformal graph which may not be minimal. The differences from the first algorithm are that this one computes the strongly connected components of the resulting graph and removes the edges between the vertices in each component; it also identifies and marks, for each execution, the minimal set of edges necessary to keep the graph consistent with that execution, and removes all unmarked edges in the end.

Figure 4 - Second algorithm in action with log {ABCF, ACDF, ADEF, AECF} (Agrawal 1998)

The algorithms are able to mine processes with cycles by re-labeling repeated tasks and merging them after the graph is produced. Noise is dealt with by associating a counter with each edge and removing edges with a count below a certain threshold. Agrawal states that the algorithm was able to find the underlying processes with both synthesized and real data.
The first algorithm (Herbst 1998) is called Bayesian model merging and models processes as hidden Markov models (HMM). It first creates the most specific HMM that is consistent with the log of execution traces (Figure 5) and progressively generalizes it by merging states. This algorithm stops when it has maximized a heuristic function that expresses how well the model reflects the log. This function is based on the Bayesian posterior probability P[model = m | log = L]. This algorithm has the disadvantage of making bad merge decisions when the underlying model has loops.

Figure 5 - The most specific HMM that covers the log L = {abcac, abcc, abcabcc, abcabcac} (Herbst 1998)

The second algorithm (Herbst 1998), model splitting, was created as an answer to the problem explained above and works exactly in the opposite direction of the first one. It starts with the most general HMM that is consistent with the log (Figure 6) and tries to specialize it progressively, by splitting one state in two, until a heuristic log-likelihood function is maximized. The author states that this algorithm is faster, and his explanation for this is that appropriate models of business processes are much closer to the most general model than to the most specific one.

Figure 6 - The most general HMM that covers the log L (Herbst 1998)

Finally, the third algorithm (Herbst 2000) is very similar to the second one. The main difference is that in a concurrent process model a split operation may have unpredictable global effects. To avoid this problem, the split operation of this algorithm is done at the instance level. As HMMs are probabilistic models, Herbst's approach is robust to noise. Herbst created a prototype that implemented these algorithms and compared its results with Cook's and Agrawal's. The prototype was able to discover the same models that Agrawal discovered but it had problems with Cook's.

Gianluigi Greco

The goal of Greco's (Greco 2005) approach is to build a hierarchical process model with various views of the same process model at different levels of detail (Figure 7). The first step of this approach (which works in a top-down fashion) is to recursively apply clustering to the execution traces of some process until some limit number of clusters is reached, resulting in a hierarchy of clusters of process traces. Then, a process model is built for each cluster, using some unspecified process mining algorithm. The second step of this approach (which works in a bottom-up fashion) transforms each non-leaf model into an abstraction of its children. Common activities of children nodes are preserved, while specific activities are merged and substituted by special activities.

Figure 7 - Hierarchy tree and details for its leaves (Greco 2005)

In (Greco 2004) the author proposes an algorithm that, besides discovering the control-flow, also discovers global constraints of a process. A global constraint is a rule of the type: if activity A has been completed, activity B cannot be activated. Greco states that control-flow models can express local constraints (using the dependency relation between activities) but lack the power to express global constraints. The solution the author proposes starts by building a control-flow model for the process. Then, it discovers frequent patterns which are unexpected considering the local constraints, and thus indicate the presence of global constraints. Finally, it creates a different model for each one of these patterns.
Although the global constraints are not explicitly expressed, together the models implicitly reflect the existence of those constraints. For the simple example above there would be two models, one where it is possible to reach A but not B and another where it is possible to reach B but not A. Greco does not address the issue of noise in (Greco 2004, 2005), but since processes are modeled as graphs some modifications would have to be made.

Wil van der Aalst

Van der Aalst (Aalst 2004) proposes an algorithm, called the α-algorithm, that can extract a process model from an event log and represent it as a Petri net. Van der Aalst also describes the class of processes this algorithm is able to mine correctly. The author assumes that each event in the log represents a task, that the event's process instance identifier is known and that the events are totally ordered (Table 1).

Table 1 - Workflow log (Aalst 2004)

The author defines the following relations between tasks based on an event log. Task B follows task A in some case in the log if B is immediately after A in that case. Task A is the cause for task B if and only if B follows A but A never follows B in the log. Tasks A and B are parallel if and only if A follows B and B follows A in the log. Finally, A and B are not related if and only if A does not follow B and B does not follow A in the log. Van der Aalst assumes that the log is complete, i.e. all possible follows relations (w.r.t. the underlying process model) are present in the log. The author also assumes that the log contains no noise. From the relations above, Van der Aalst developed the α-algorithm. The author also states the types of processes which the algorithm cannot discover. The α-algorithm cannot mine processes with short loops or duplicate tasks, for example. Besides this, it is also not robust to noise, as the relations between tasks do not have an associated frequency. The algorithm was applied to two real cases, a health care process and a fine processing process. The algorithm was only able to successfully mine the second one.

Ana Karla Alves de Medeiros

Ana de Medeiros (Medeiros 2006) proposes using genetic algorithms in process mining. Genetic algorithms try to mimic the way evolution works in order to find the best solutions to some problem. More specifically, there is a population where each individual is a potential solution for the problem. In each iteration (called a generation) a fitness function is used to determine how good a solution each individual represents. After this evaluation the fittest individuals are copied to the next generation unchanged. The rest of the population suffers mutation (i.e. random change of an individual's components) or crossover (i.e. recombination of two individuals' components). The algorithm stops when some stop condition (like a limit number of generations) is reached. In the author's approach individuals are process models and the fitness function evaluates how consistent the model is w.r.t. the log. The chosen model is the fittest individual of the last generation. The internal representation chosen for an individual is a causal matrix (Table 2). The causal matrix shows the input and output condition functions of each activity so that it is possible to understand the dependency relations between them. The fitness function evaluates individuals according to their completeness (how many traces in the log the individual can parse) and preciseness (not being able to parse traces which are not present in the log).
Completeness, however, is considered more important than preciseness. The goal is to find individuals which model all the behavior present in the log while not modeling any extra behavior (i.e. not present in the log). The crossover operator allows the following changes to happen to an individual: lose/gain activities from/to subsets in its input/output condition functions; exchange causality 9 relations with other individuals; gain causality relations which can be inferred from the population but are not present in any individual; lose causality relations; and decrease/increase the number of subsets in the input/output condition functions. Finally, the mutation operator randomly adds/removes an activity to/from a subset of the input/output condition functions; or randomly redistributes the activities in these subsets. Table 2 - Causal matrix (Medeiros 2006) Noise is dealt with by pruning arcs of the mined model which are used less than a specific threshold. Van Dongen Van Dongen (Dongen 2004, 2005) proposes creating graph models for each of the process instances present in the log and then combining all of them in order to produce a general model for the process. In (Dongen 2004) the author proposes an algorithm that builds a process instance graph from an event log. These graphs are called instance nets which are basically process models where there are no decisions to be made. The first step is finding the causality relations of the process. The causality relation is inferred the same way as in (Aalst 2005): task A is the cause for task B if and only if B follows A but A never follows B in the log. Note that this relation are still inferred at the log level, not at the instance level. After this step, the instance ordering of the instance net is defined the following way: B follows A on the instance net if and only if A causes B (this was determined in the first step of the algorithm) and there is no C such that A causes C and C causes B. Table 3 shows an event log and Figure 8 shows the instance net (modeled as a graph) for case 1. Note that the parallelism of A and B was determined from the whole log in the first step. 10 Table 3 - Process log (Dongen 2004) Figure 8 - Instance graph for case 1 (Dongen 2004) Van Dongen (Dongen 2005) also proposes an algorithm for building a general model for the process from the instance models created with the algorithm explained above. This is also done in two steps. First, a projection graph is built from each instance graph. A projection graph of an instance graph is a graph where each task that was executed in the instance graph is represented by a single node (repeated tasks disappear). Also, tasks which were connected in the instance graph are still connected in this graph. Finally, start and final nodes are added. The second and final step is the construction of the aggregation graph, which models the whole process. This is done straightforwardly by adding the projection graphs. Although the Dongen’s approach produces a graph with an associated counter the author does not specify how to address noise. Conclusion All of these approaches have in common the fact that they use an event log produced by the process they want to mine in order to mine it. However, some of them make special assumptions. Van der Aalst’s approach for example assumes that the event log is complete (i.e. 
all possible follows relation are present in the log) while Agrawal assumes that the processes do not have cycles (although the author’s algorithms can mine cyclic processes by re-labeling the repeated tasks). How the algorithms deal with noise is another differentiating factor. Cook’s RNet and Markov based algorithms and Herbst’s algorithms are all probabilistic and consequently robust to noise. Agrawal adds a counter to each edge of the graph models and removes edges with a low 11 count in order to resist noise. Ana de Medeiros also removes arcs which are used less than a certain threshold. Greco and van Dongen do not specify how to deal with noise and van der Aalst assumes the logs do not have it. Since most enterprise systems are not process aware it is a bold assumption to think all event logs will only contain events pertaining to a single process. This way, if an approach was developed that did not need to make this assumption a large number of real system logs would become available for process mining. Moreover, we also think robustness to noise is a fundamental asset to any process mining technique. With these two factors in mind, we present the sequence clustering algorithm and our approach to process mining in the next chapters. 12 CHAPTER 3: SEQUENCE CLUSTERING Sequence clustering is a data mining technique that takes a number of sequences and groups them in clusters so that each cluster contains similar sequences. A sequence is a series of discrete states (Tang 2005), one example is a genomic sequence where the states are adenosine, guanine, cytosine and thymidine. The applicability of this technique to genomic sequences explains its widespread use in the bioinformatics field. Sequence clustering is also a probabilistic technique, since that in the sequence models produced the transitions between states have associated probabilities. This makes these models robust to noise, since a transition with a low probability indicates a sequence path that is rarely followed, thus probably an erroneous one. Furthermore, the fact that this algorithm takes sequences as input seems to make it a natural candidate to mine sequences of actions, which in turn can be obtained from event logs. Microsoft SQL Server 2005 includes a tool for sequence clustering that makes use of the Microsoft Sequence Clustering algorithm and which we decided to use in our approach. The Microsoft Sequence Clustering algorithm In Microsoft Sequence Clustering algorithm each case is assigned to each cluster with some probability. Each cluster has an associated Markov chain. A Markov chain is described as a set of states and the transition probabilities between them. If the chain is currently in state si, then it moves to state sj at the next step with a probability denoted by pij, and this probability does not depend upon which states the chain was in before the current state (Grinstead 1997). Figure 9 shows a Markov chain. Figure 9 - Markov chain (Grinstead 1997) A Markov chain can also be represented by a state transition matrix as the one shown below ( Table 4). 13 Table 4 - State transition matrix for states {rainy, nice, snow} (Grinstead 1997) The definition presented above is actually the definition of a first order Markov chain. In an n th order chain the future state’s probability depends on the previous n states. The Microsoft Sequence Clustering algorithm involves the following steps (Tang 2005): 1. Initialize the model parameters at random, i.e. 
initialize the state transition probabilities of the Markov chains of each cluster at random;
2. Assign each case to each cluster with some probability;
3. Recalculate the model parameters taking into consideration the probabilities of each case belonging to each cluster, i.e. recalculate the state transition probabilities of each Markov chain of each cluster considering the probabilities of each case belonging to that cluster;
4. Verify if the model has converged; if it has not, reiterate from step 2.

In order to work, the algorithm needs at least two tables: a table with the non-sequence attributes (the case table) and a nested table of the first containing the sequence key. There can be only one sequence nested table in each model. An example is shown in Figure 10; the goal is to model the behavior of a TV viewer when he is zapping between channels. This is similar to the example presented in (Tang 2005). The Viewer table contains general data - non-sequence attributes - about the viewer. Table ZappingSequence is the sequence nested table and it contains three attributes: ViewerID, which is the foreign key to the main table; ChannelType, which is the state of the sequence; and SequenceID, which saves the sequence position of a state.

Figure 10 - Viewer and ZappingSequence tables

For example, if viewer number 1 zapped between channels CNN, Eurosport, MTV, BBC and SporTV, in this order, and viewer 2 zapped between SkyNews, VH1 and ESPN, in this order, the tables would look like this:

Table 5 - Case table example (Viewer)
ViewerID  Address       TelephoneNumber  Email
1         Happy Street  123456789        [email protected]
2         Sunny Street  987654321        [email protected]

Table 6 - Nested table example (ZappingSequence)
ViewerID  ChannelType  SequenceID
1         News         1
1         Sports       2
1         Music        3
1         General      4
1         Sports       5
2         News         1
2         Music        2
2         Sports       3

The algorithm is tunable by the following parameters: the number of clusters the model contains; the minimum number of cases in each cluster; the maximum number of different states an attribute can have; and the maximum number of different states the sequence attribute can have.

Using SQL Server 2005 Analysis Services

Before using the case and nested tables we have to complete a series of steps that create the necessary structures for the algorithm to function. These involve the creation of a database, a datasource, a datasource view, a mining structure and a mining model (Figure 11). All these steps are executed in the Analysis Services module of SQL Server and they are common to the application of any data mining algorithm. In fact, the choice of the data mining algorithm to use is only made when the mining model is created. In this section we explain all this in more detail.

Figure 11 - Data Mining Object Hierarchy (Tang 2005 adapted)

The first thing to do is to create a database that will store the datasource, datasource view and mining structure. This is a very simple step where we only have to indicate a server and the name of the database to create in that server. The datasource is also a very simple object, consisting of a connection string to the database just created. The datasource view, on the other hand, contains the schema of the case and nested tables, which is provided to it in the form of a dataset object augmented with custom properties such as table descriptions (Tang 2005). In our example this dataset contains the schema for tables Viewer and ZappingSequence.
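Before moving on to the mining structure, the clustering procedure itself can be made more concrete. The sketch below shows one possible implementation of a single iteration (steps 2 and 3 above) for first-order Markov chains, with sequences encoded as arrays of integer state indices. It is only an illustration of the idea, not the Microsoft implementation; all type and member names are ours, and the re-estimation of the initial state probabilities and cluster weights is omitted.

using System.Collections.Generic;
using System.Linq;

static class SequenceClusteringSketch
{
    // Probability of a whole sequence under one cluster's first-order Markov chain.
    static double SequenceLikelihood(int[] seq, double[] initial, double[,] transitions)
    {
        double p = initial[seq[0]];
        for (int t = 1; t < seq.Length; t++)
            p *= transitions[seq[t - 1], seq[t]];
        return p;
    }

    // Step 2: assign a case (sequence) to each cluster with some probability.
    static double[] Memberships(int[] seq, double[] clusterWeights,
                                List<double[]> initial, List<double[,]> transitions)
    {
        int k = clusterWeights.Length;
        double[] post = new double[k];
        for (int c = 0; c < k; c++)
            post[c] = clusterWeights[c] * SequenceLikelihood(seq, initial[c], transitions[c]);
        double norm = post.Sum();
        for (int c = 0; c < k; c++)
            post[c] = norm > 0 ? post[c] / norm : 1.0 / k;
        return post;
    }

    // Step 3: re-estimate one cluster's transition matrix from all sequences,
    // weighting each observed transition by the sequence's membership in that cluster.
    static double[,] Reestimate(List<int[]> sequences, double[] membershipInCluster, int states)
    {
        double[,] counts = new double[states, states];
        for (int s = 0; s < sequences.Count; s++)
            for (int t = 1; t < sequences[s].Length; t++)
                counts[sequences[s][t - 1], sequences[s][t]] += membershipInCluster[s];

        double[,] probs = new double[states, states];
        for (int i = 0; i < states; i++)
        {
            double rowSum = 0;
            for (int j = 0; j < states; j++) rowSum += counts[i, j];
            for (int j = 0; j < states; j++)
                probs[i, j] = rowSum > 0 ? counts[i, j] / rowSum : 1.0 / states;
        }
        return probs;
    }
}

Step 4 of the procedure then simply checks whether these parameters changed significantly between iterations.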
“The mining structure describes the domain of the problem in terms the data mining engine understands.” (Tang 2005) The mining structure comprises several mining structure columns, which are connected to source columns from the datasource view, together with their data types and content types. For example, in our case we would indicate that Viewer is the case table and ZappingSequence is the nested table, that the ViewerID attribute connects the two tables, that the SequenceID attribute indicates the order of the events within the sequence and also that ChannelType indicates the name of the event. Finally, the mining model creation is where we indicate which data mining algorithm is going to be used and configure its parameters. The parameters for the Sequence Clustering algorithm are the number of clusters, the minimum number of cases in each cluster, the maximum number of states a sequence may have and the maximum number of different states an attribute may have.

The following table summarizes the Analysis Services configuration process for an example where the case table is called "ActionsInteractions" and the nested table is called "Nested". For each step we show the Visual Studio wizard screen and the code that accomplishes the same goal. The code interacts with several data mining APIs made available by SQL Server 2005. These include: ADOMD.NET, which implements the data adapter and data reader interfaces specifically for the Analysis Services, making it more efficient than ADO.NET, and contains the mining structure and mining model structures; and AMO, which also contains the mining structure and mining model structures but that, instead of using them for executing queries and data exploration like ADOMD.NET, uses them for data mining object management. The DMX language was also used. This language is an extension of SQL used for the creation and manipulation of data mining objects. (P. Ferreira 2007)

Table 7 - Wizard steps and respective code for analysis services configuration (P. Ferreira 2007 adapted)

Datasource Creation
RelationalDataSource ds = new RelationalDataSource();
ds.ConnectionString = ...

Datasource View Creation
DataSourceView dsv = new DataSourceView();
DataSet dsvDataSet = new DataSet();
string queryString = "SELECT * FROM ActionsInteractions";
SqlDataAdapter da1 = new SqlDataAdapter(queryString, conString);
da1.FillSchema(dsvDataSet, SchemaType.Mapped);
...
dsv.Schema = dsvDataSet;

Mining Structure Creation
MiningStructure ms = new MiningStructure();
ms.Source = new DataSourceViewBinding(dsv.ID);
ScalarMiningStructureColumn caseColumn = new ScalarMiningStructureColumn();
caseColumn.IsKey = true;
ms.Columns.Add(caseColumn);
...

Mining Model Creation
MiningModel miningModel = ms.CreateMiningModel();
miningModel.Algorithm = "Microsoft_Sequence_Clustering";

Mining Model Processing
miningModel.AlgorithmParameters.Add("CLUSTER_COUNT", 10);
miningModel.AlgorithmParameters.Add("MINIMUM_SUPPORT", 3);
miningModel.AlgorithmParameters.Add("MAXIMUM_STATES", 30);
miningModel.AlgorithmParameters.Add("MAXIMUM_SEQUENCE_STATES", 30);

Visualizing the Results
foreach (MiningContentNode node in miningModel.Content)
{
    foreach (MiningContentNode child in node.Children)
    {
        foreach (MiningDistribution md in child.Distribution)
            Console.WriteLine(...);
    }
}

The process mining activity begins with an event log. Unfortunately, we cannot apply the sequence clustering algorithm directly to the log. As we already explained, we need to have a case table and a nested table.
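As an alternative to the AMO calls in Table 7, a model of this kind can also be defined with a single DMX statement sent through ADOMD.NET. The sketch below is only illustrative: the connection string, model name and column names are ours and would have to be adapted to the actual case and nested tables built in chapter 5.

using Microsoft.AnalysisServices.AdomdClient;

class DmxModelCreationSketch
{
    static void Main()
    {
        // Illustrative connection string and object names.
        using (AdomdConnection conn =
            new AdomdConnection("Data Source=localhost;Catalog=SequenceClusteringDB"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();
            // A sequence clustering model with one case-level key and a nested
            // sequence table, roughly mirroring the mining structure of Table 7.
            cmd.CommandText =
                "CREATE MINING MODEL ActionSequences (" +
                "  CaseId LONG KEY," +
                "  Nested TABLE (" +
                "    SequenceId LONG KEY SEQUENCE," +
                "    ActionName TEXT DISCRETE PREDICT" +
                "  )" +
                ") USING Microsoft_Sequence_Clustering (CLUSTER_COUNT = 10)";
            cmd.ExecuteNonQuery();
        }
    }
}

Whichever route is used, the model still has to be fed a case table and a nested table built from the log.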
This way, several steps must be taken in order to preprocess the log and be able to build these two tables from it. The next chapters explain how we generated an event log using a simulator application, all the steps that had to be taken to construct the case and nested tables and the result obtained after finally applying the sequence clustering technique. 20 CHAPTER 4: DATA COLLECTION The first step of our approach involved the development of a banking activity scenario, consisting of a fictitious database and a small application that interacted with it, which could produce specific event sequences, such as: creation of a checking account, creation of a savings account, creation of a loan and payment of a loan. Since the goal of our approach was to develop sequence models for the sequences of actions that originated a certain log, the event logs resulting from the monitoring of the banking event sequences constituted the test data for our approach and, consequently, were used to verify its results. In this chapter we describe the small banking database that was created, the program module that operates over it and types of event sequences that are produced and also the event traces that result from the execution of this scenario. A banking database Since the goal was to apply the Microsoft Sequence Clustering algorithm to the discovered sequences of events and since this algorithm is included in Microsoft SQL Server 2005 we chose, for simplicity and easiness of integration, to develop an SQL Server database for the simulation scenario. This, in turn, enabled us to take advantage of the SQL Server Profiler tool which monitors the SQL Server Database Engine and registers data about each event. The database consists of the following tables: accounts, which can be savings accounts or checking accounts and belong to a specific branch of the bank; loans (also associated to branches) and loan payments; customers, associated to loans through the borrower relationship and to accounts through the depositor relationship; and employees associated to customer through the customer-banker relationship (Figure 12). Figure 12 - Banking database diagram 21 Figure 13, Figure 14 and Figure 15 show how some tables look after the insertion of Jane, John and Michael and their respective accounts in the database. Figure 13 - Customer table example Figure 14 - Account table example Figure 15 - Depositor table example Simulation of banking operations The event log is generated by simulating several banking operations. This simulation is accomplished by executing SQL queries over the banking database just presented. The following banking operations may be executed: 1. Checking account creation 2. Savings account creation 3. Loan creation 4. Loan payment Operations 1, 2 and 4 may be executed in more than one way. For example, the checking account creation operation has four variations. In order for these operations to be executed the database must already be populated with some data like branches and employees. We now describe in more detail each one of thess types of operations. The description does not include auxiliary steps that are executed for simulation purposes only, for example, when a loan 22 is created the borrower associated with it is a randomly chosen customer from the list of all existing customers. However, this step involves a database query execution and, therefore, will leave a trace on the log. This issue will be discussed on the next chapter when we explain how the log is preprocessed. 
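As a concrete illustration of such an auxiliary step, the sketch below shows how the simulator might pick a random existing customer before executing a sequence. The class and method names are ours, not the actual simulator code; the point is that the SELECT issued here is recorded by the Profiler just like the queries of the sequence itself, which is why it has to be filtered out later.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

static class SimulationHelpers
{
    static readonly Random Rng = new Random();

    // Auxiliary step: choose a random customer from the list of all existing
    // customers. The SELECT below leaves a trace in the event log even though
    // it is not part of any banking sequence.
    public static int PickRandomCustomer(SqlConnection conn)
    {
        List<int> ids = new List<int>();
        using (SqlCommand cmd = new SqlCommand("SELECT * FROM Customer", conn))
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                ids.Add(reader.GetInt32(reader.GetOrdinal("customer_id")));
        }
        return ids[Rng.Next(ids.Count)];
    }
}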
Checking Account Creation

This sequence involves the following conceptual steps:
1. Creating new customer;
2. Creating new checking account;
3. Associating new client and new account;
4. Associating existing bank employee to new client.

Table 8 shows the specific queries the application executes on the database, which are also captured in the SQL Profiler log, and their meaning.

Table 8 - Checking account creation sequence: queries and meaning

SQL Query: INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
Meaning: Creation of customer number 85045, whose name is John and who lives on Happy Street at Sunny City.

SQL Query: INSERT INTO Account VALUES (34220,705,'Campo Grande')
Meaning: Creation of account number 34220, with a balance of 705 and belonging to the Campo Grande branch.

SQL Query: INSERT INTO Checking_Account VALUES (34220, 207)
Meaning: Registration of the account as a checking account. Overdraft amount of 207.

SQL Query: INSERT INTO Depositor VALUES (85045,34220)
Meaning: Creation of the customer - account relation.

SQL Query: INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
Meaning: Association of the new customer with employee number 6. The last attribute is the type of the relation and is not relevant.

As you can see, there are no queries shown in the table above that create employee number 6 or branch 'Campo Grande'. This is due to the fact that both branches and employees are assumed to exist in the database prior to the execution of the application; therefore, when the database was created it was populated with a number of employees and branches. The customer number, name and city are randomly generated each time a sequence is executed, as are the account number, balance and overdraft amount. The branch and employee number are randomly chosen from the ones present in the database at the time. Although this last step interacts with the database, it is an auxiliary step only for simulation purposes and consequently is not shown in the sequence.

In order to introduce some degree of variability in the simulation, three additional versions of this sequence were developed. These variations simply change the order in which the queries are executed (Table 9, Table 10 and Table 11).

Table 9 - Checking account creation sequence (version 2) queries
INSERT INTO Account VALUES (34220,705,'Campo Grande')
INSERT INTO Checking_Account VALUES (34220, 207)
INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
INSERT INTO Depositor VALUES (85045,34220)

Table 10 - Checking account creation sequence (version 3) queries
INSERT INTO Account VALUES (34220,705,'Campo Grande')
INSERT INTO Checking_Account VALUES (34220, 207)
INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
INSERT INTO Depositor VALUES (85045,34220)
INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')

Table 11 - Checking account creation sequence (version 4) queries
INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City')
INSERT INTO Cust_Banker VALUES (85045,6,'moyvo')
INSERT INTO Account VALUES (34220,705,'Campo Grande')
INSERT INTO Checking_Account VALUES (34220, 207)
INSERT INTO Depositor VALUES (85045,34220)

Savings Account Creation

This sequence involves the following conceptual steps:
1. Choosing a checking account belonging to the customer;
2. Creating a new savings account;
3. Associating customer with the new account;
4. Transferring funds from checking account to new savings account.
Table 12 shows the specific sequence of queries that are executed on the database.

Table 12 - Savings account creation sequence: queries and meaning

SQL Query: SELECT a.account_number, a.balance FROM Depositor AS d, Account AS a, Checking_Account AS c WHERE a.account_number = d.account_number AND c.account_number = a.account_number AND d.customer_id = 17214
Meaning: Selection of all checking accounts belonging to customer number 17214.

SQL Query: INSERT INTO Account VALUES (74652,0,'Campo Grande')
Meaning: Creation of account number 74652, with balance 0, in the Campo Grande branch.

SQL Query: INSERT INTO Savings_Account VALUES (74652, 1)
Meaning: Registration of the account as a savings account. Interest rate of 1%.

SQL Query: INSERT INTO Depositor VALUES (17214,74652)
Meaning: Association of the customer with the new account.

Stored Procedure: exec internal_account_transfer 7583,74652,189
Meaning: Transfer of funds from the checking account to the new savings account. Amount to transfer is 189.

Note that the choice of a customer is not shown. This is done the same way as for branches and employees, i.e., by retrieving all existing customers and randomly choosing one. As a customer is only created when a checking account is created, every customer has at least one checking account. Therefore, after selecting a random customer, the first checking account belonging to him is chosen as the "parent account" of the savings account. The first account is chosen for simplicity reasons only. In the example above, account number 7583 is the first checking account of customer 17214 and is the source of the transfer. Also note that "exec internal_account_transfer" is a stored procedure that transfers a specific amount of funds between two accounts. As in the checking account case, we created a second version of the sequence where the order of the queries was changed (Table 13).

Table 13 - Savings account creation sequence (version 2) queries
SELECT a.account_number, a.balance FROM Depositor AS d, Account AS a, Checking_Account AS c WHERE a.account_number = d.account_number AND c.account_number = a.account_number AND d.customer_id = 17214
INSERT INTO Account VALUES (74652,0,'Campo Grande')
INSERT INTO Savings_Account VALUES (74652, 1)
exec internal_account_transfer 7583,74652,189
INSERT INTO Depositor VALUES (17214,74652)

Loan Creation

The creation of a loan is done through the execution of the following sequence of conceptual actions:
1. Verifying total balance of the customer;
2. Creating a loan not exceeding the customer's total balance;
3. Associating client with loan.

Table 14 shows the queries that are performed on the database when a loan creation sequence is executed.

Table 14 - Loan creation sequence: queries and meaning

SQL Query: SELECT SUM(a.balance) FROM Account AS a, Depositor AS d WHERE a.account_number = d.account_number AND d.customer_id = 29412
Meaning: Verification of the total balance of client number 29412.

SQL Query: INSERT INTO Loan VALUES (77709,324,'Campo Grande')
Meaning: Creation of loan number 77709, referring to an amount of 324, in branch Campo Grande.

SQL Query: INSERT INTO Borrower VALUES (29412,77709)
Meaning: Association of the client with the loan.

The customer is randomly chosen, as was explained in the last sequence. There are no variants, as this is the only possible order in which the queries can be carried out.

Loan Payment

The loan payment sequence refers to the total payment of a loan. As the loan can be paid either with cash or by transfer, there are two versions of this sequence.
The differences between the two versions are more than just the order in which the actions are executed, so we can actually consider them two different sequences. The conceptual steps of the first version are the following:
1. Paying loan with cash;
2. Registering loan payment;
3. Deleting loan record and associated payment and borrower records.

It may seem odd to delete the payment record immediately after creating it. The reason for this is that, despite assuming only total loan payments - i.e., all payments are liquidations - we wanted to simulate that the banker only discards the loan records after verifying that the debt was settled. Table 15 shows the loan payment sequence queries and their meaning.

Table 15 - Loan payment sequence: queries and meaning

SQL Query: UPDATE Loan SET amount = amount - 56 WHERE loan_number = 83209
Meaning: Update of loan number 83209's amount. (The initial amount was 56, so the debt is settled.)

SQL Query: INSERT INTO Payment VALUES (83209,38148,'1-4-2007',56)
Meaning: Registering of payment number 38148, pertaining to loan 83209, paid on 1-4-2007, whose amount is 56.

SQL Query: DELETE FROM Borrower WHERE loan_number = 83209
Meaning: Deletion of borrowers pertaining to this loan.

SQL Query: DELETE FROM Payment WHERE loan_number = 83209
Meaning: Deletion of payments pertaining to this loan.

SQL Query: DELETE FROM Loan WHERE loan_number = 83209
Meaning: Deletion of the loan.

Analogously to the other sequences, the random choice of a loan and the retrieval of the amount in debt are auxiliary steps of the simulation and are not considered part of the sequence.

The second version of this sequence involves the following steps:
1. Selecting an account belonging to the client;
2. Paying the loan by transfer;
3. Registering loan payment;
4. Deleting loan record and associated payment and borrower records.

Table 16 illustrates this sequence's queries and their meaning.

Table 16 - Loan payment sequence (version 2): queries and meaning

SQL Query: SELECT customer_id FROM Borrower WHERE loan_number = 83209
Meaning: Getting all the customers associated with a particular loan.

SQL Query: SELECT a.account_number, a.balance FROM Depositor AS d, Account AS a, Checking_Account AS c WHERE a.account_number = d.account_number AND c.account_number = a.account_number AND d.customer_id = 40416
Meaning: Selection of all checking accounts belonging to customer number 40416.

Stored Procedure: exec loan_payment_by_transfer 53211,83209,56
Meaning: Loan payment by transfer. Source account 53211, loan number 83209 and amount 56.

SQL Query: INSERT INTO Payment VALUES (83209,38148,'1-4-2007',56)
Meaning: Registering of payment number 38148, pertaining to loan 83209, paid on 1-4-2007, whose amount is 56.

SQL Query: DELETE FROM Borrower WHERE loan_number = 83209
Meaning: Deletion of borrowers pertaining to this loan.

SQL Query: DELETE FROM Payment WHERE loan_number = 83209
Meaning: Deletion of payments pertaining to this loan.

SQL Query: DELETE FROM Loan WHERE loan_number = 83209
Meaning: Deletion of the loan.

The first step intends to simulate that the client states he wants to pay a loan by transfer, and the banker selects all borrowers associated with that loan and confirms the client is one of them. It is assumed that the client is always the first borrower on the list. As before, the random selection of a customer, of a loan associated with that customer and of the amount in debt are auxiliary steps. The account chosen to pay the debt is the first account of the customer, for simplicity. "exec loan_payment_by_transfer" is a stored procedure that uses a specific amount of money from an account to pay a loan.
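To give an idea of how such a sequence is produced in practice, the sketch below shows one way the simulator could issue the loan creation sequence of Table 14. The class, variable names and random value ranges are ours; only the three queries themselves come from the sequence described above, with parameter values embedded as literals, as they appear in the captured log.

using System;
using System.Data.SqlClient;

static class LoanCreationSketch
{
    static readonly Random Rng = new Random();

    public static void CreateLoan(SqlConnection conn, int customerId, string branch)
    {
        // 1. Verify the total balance of the customer.
        object sum;
        using (SqlCommand check = new SqlCommand(
            "SELECT SUM(a.balance) FROM Account AS a, Depositor AS d " +
            "WHERE a.account_number = d.account_number AND d.customer_id = " + customerId, conn))
        {
            sum = check.ExecuteScalar();
        }
        int totalBalance = Convert.ToInt32(sum);

        // 2. Create a loan whose amount does not exceed the total balance.
        int loanNumber = Rng.Next(10000, 100000);
        int amount = Rng.Next(1, totalBalance + 1);
        new SqlCommand("INSERT INTO Loan VALUES (" + loanNumber + "," + amount + ",'" + branch + "')", conn)
            .ExecuteNonQuery();

        // 3. Associate the customer with the loan.
        new SqlCommand("INSERT INTO Borrower VALUES (" + customerId + "," + loanNumber + ")", conn)
            .ExecuteNonQuery();
    }
}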
The simulator application

As stated above, the application that interacts with the database is able to perform four different operations: creation of a checking account, creation of a savings account, creation of a loan and payment of a loan. Figure 16 shows the simulator interface. Each button executes one of the operations mentioned.

Figure 16 - Simulator application interface

The collected event log

As explained before, in order to capture the interactions between the application and the database in a log, we used the SQL Server Profiler tool, which monitors the SQL Server Database Engine and records data about each event. The types of events recorded can be chosen manually. In our case we were only interested in SQL query completion and stored procedure completion events. Figure 17 illustrates the SQL Profiler event log resulting from the savings account creation sequence (version 1).

Figure 17 - Profiler event log

Note that the first two queries do not belong to the sequence presented above. These are auxiliary actions that exist only for simulation purposes. In this case these two queries help the application choose the customer that creates the account and the branch where the account is created. As they are recorded alongside every other query and because they do not belong to the sequence, we must filter them out later. We will see in the next chapters how this is done. After the log was captured, we saved it into a table of our database in order to be able to manipulate it. As stated in chapter 3, we cannot apply the sequence clustering algorithm directly to this event log. By looking at the log we can only see a list of queries without any type of associated case id. It is impossible to know exactly how many sequences exist, where each one starts or ends, or what events are included in it. Remember that the simulation phase is now over, and we assume that this log was generated by a real system about which we know nothing. The problem, then, is to find a way of determining what sequences of events exist in it. After finding these sequences we must structure them in a way understandable by the sequence clustering algorithm. The next chapter shows how to find the hidden sequences.

CHAPTER 5: LOG PREPROCESSING

After capturing the log it is necessary to process it and restructure its data so that we are able to apply the Sequence Clustering algorithm. As explained in chapter 3 (Figure 10), this algorithm requires a case table and a nested table to function. These tables identify which sequences exist and which events belong to each sequence. Since the log is just a long list of queries and stored procedures, we were required to find a way to determine where each sequence begins and ends and which events belong to each one. An application called Sequence Builder was developed with this purpose. In this chapter we describe the steps taken by this application in order to fulfill its purpose.

Filtering the log

After saving the log to a database table called Trace, which mimics the structure of Figure 17, the first step in log processing is to take each row and parse the TextData field in order to know what type of query it represents and what tables and attributes were manipulated by it. When this is known, the program decides whether to store the query or ignore it. The program ignores a query if it is not a select, insert, delete, update or execute stored procedure command, or if it is too generic to be of use, like a select, delete or update command without a where clause.
For example, the query SELECT * FROM Customer would be ignored. This way the auxiliary queries mentioned in the last chapter are filtered out.

Storing queries

In order to support the preprocessing steps that will be described ahead, two new database tables were created to store relevant queries: Query and Keys. Figure 18 shows these two tables. Figure 18 - Tables Query and Keys 31

To better illustrate how the storing process works we will describe how some of the queries shown in chapter 4 are registered. INSERT INTO Customer VALUES (85045,'John','Happy Street','Sunny City'), for example, originates the following records: Figure 19 - Query table for INSERT Customer example Figure 20 - Keys table for INSERT Customer example

Note that the program only stores the value for the customer_id field, which is the primary key for table Customer. There is no need to store anything else, because the tag and the key_name, key_value pair unequivocally identify that this query is an insertion of customer 85045 in the database. The program knows how many primary keys each table has and their names because it queries the database catalog before the log processing begins and stores these data. Executing the Open Savings Account sequence, Table 12, produces the following records: Figure 21 - Query table for Open Savings Account sequence example Figure 22 - Keys table for the Open Savings Account sequence example 32

There are three things to note in this example. First, it is possible to see that the tag is constructed by concatenating the type of query with the names of all the tables read or written by it (Figure 21, query_id 1) or, in the case of a stored procedure, by the word exec followed by the name of the procedure. Second, since some queries interact with tables that have more than one primary key, some query_ids have more than one key_name, key_value pair (query_id 4). Finally, all of a stored procedure's parameters are stored in table Keys, and the names for the parameters are generated by concatenating the word spparameter with a sequential number. Although we could access the stored procedures' internal structure by querying the database, it is usually too complex and variable to parse, making it very hard to know which tables are being manipulated; therefore we assume all the parameters are important even if we do not know what they are. Also note that the query_id field indicates the order in which the rows were read from the log and, consequently, the sequential order in which the queries were executed.

Discovering the sequences

Having saved every relevant query in the Query and Keys tables, we still have no more than a disconnected list of queries, albeit a more organized one. It is now time to determine what sequences of queries exist in this list. As we cannot know in advance where a sequence starts or ends, we decided to first find out to which queries each query is connected. But how do we know two queries are connected and consequently belong to the same sequence? We simply compare each query with every other query and, if they share a key_name, key_value pair, we assume they are connected. Stored procedures' special key_names are assumed to match every other key_name; this way, if at least one of the queries being compared is a stored procedure, matching key_values is enough to assume a connection.
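A minimal T-SQL sketch of this pairwise comparison is shown below. It assumes that the Follows table has the columns from_query_id and to_query_id, and that stored procedure parameters are stored with key_names starting with spparameter, as described above; the Sequence Builder application may implement this step differently.

-- Sketch: connect two queries whenever they share a key_name, key_value
-- pair; if either query is a stored procedure (its key_names start with
-- 'spparameter'), matching key_values alone is enough. The column names
-- of the Follows table are assumed.
INSERT INTO Follows (from_query_id, to_query_id)
SELECT DISTINCT k1.query_id, k2.query_id
FROM Keys AS k1
JOIN Keys AS k2
  ON  k1.query_id < k2.query_id        -- the earlier query is the one being followed
  AND k1.key_value = k2.key_value
  AND (k1.key_name = k2.key_name
       OR k1.key_name LIKE 'spparameter%'
       OR k2.key_name LIKE 'spparameter%');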
In the opening savings account example above (Figure 21 and Figure 22), query 1 is connected to query 4 (by the customer_id, 1108249055 pair) and query 5 is connected to queries 2, 3 and 4 (by the key_value 1880652078 - the stored procedure special case). The connections that are found are saved in table Follows (Figure 23). Figure 23 - Table Follows 33

For example, if we determine that query 1 is connected to query 2, meaning they belong to the same sequence, the table would look like this: Figure 24 - Query 2 follows query 1 example

Since the query_id attribute indicates the order in which queries were executed, we know query 2 happened after query 1; therefore, when we determine queries 1 and 2 are connected we also know query 2 follows query 1 and not the other way around. Figure 25 shows all the connections present in the savings account creation example (Figure 21 and Figure 22). Figure 25 - Follows table for the open savings account example

After all possible query pairs are compared, the Follows table will actually contain a graph representation of the whole log, where the nodes are events and the edges are connections between events. Figure 26 illustrates this idea for the "Open Checking Account" and "Open Savings Account" sequences.

Removing long connections

It is easy to see that if a customer creates two accounts, for example, there will be connections not only among events intuitively belonging to a sequence but also between different sequences. This case is illustrated in Figure 27, where there should be three different sequences but there are only two. In order to solve this problem we developed a heuristic where very long connections are removed from table Follows, since these connections are rarely meaningful links between events at this level. However, these links may provide insight into higher-level patterns of behavior (Ferreira 2007) and thus are stored for future use (see chapter 7). To do this, the average connection length is calculated and all connections that exceed a certain average-based threshold are removed. In Figure 27, for example, the average link length is (18 + x)/11, where x is the length of the dashed link. We could reject the dashed link if, for example, x ≥ 2*(18 + x)/11, meaning x ≥ 4, which works very well for this example (Ferreira 2007). Experience with the heuristic has shown us that a detailed study of how to determine the best threshold value will have to be undertaken in the future, as the best value varies greatly with the input log. Figure 26 - Graph view of two banking sequences. Links have the names of the parameters whose values are the same in the queries they connect. (Ferreira 2007) Figure 27 - Links between events that belong to different sequences. (Ferreira 2007)

Building the case and nested tables

Having removed all the long links from the Follows table, the application now has to find out what specific sequences exist, i.e., the application must go through all the implicit sub-graphs in this table, sort the queries in chronological order (sorting by query_id has the same effect, as already discussed) and store the sequences found. In order to do this the application executes the following cycle:
1. Get the first query from the Query table. If the table is empty, terminate;
2. Find the nodes belonging to this query's sub-graph by using a recursive SQL query over table Follows (a sketch of such a query is given after this list);
3. This group of nodes constitutes a sequence, and this sequence is now inserted into a case and a nested table (Figure 28);
4. Remove all queries belonging to that sequence from the Query table, to avoid repeatedly finding the same sequence.
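The text of the recursive query used in step 2 is not shown; a minimal T-SQL sketch of one possible formulation is given below. It assumes the Follows table has the columns from_query_id and to_query_id (as in the earlier sketch) and tracks the visited nodes in a path string, so that the traversal, which must follow connections in both directions in order to cover the whole sub-graph, always terminates.

-- Sketch (assumed column names): collect all queries in the sub-graph
-- containing the starting query, following connections in both directions.
DECLARE @start_query_id INT;
SET @start_query_id = 1;  -- the first query remaining in the Query table

WITH Edges AS (
    -- treat the connections as undirected for connectivity purposes
    SELECT from_query_id AS a, to_query_id AS b FROM Follows
    UNION ALL
    SELECT to_query_id AS a, from_query_id AS b FROM Follows
),
Component AS (
    SELECT @start_query_id AS query_id,
           CAST(',' + CAST(@start_query_id AS VARCHAR(20)) + ',' AS VARCHAR(MAX)) AS visited
    UNION ALL
    SELECT e.b,
           CAST(c.visited + CAST(e.b AS VARCHAR(20)) + ',' AS VARCHAR(MAX))
    FROM Component AS c
    JOIN Edges AS e ON e.a = c.query_id
    WHERE c.visited NOT LIKE '%,' + CAST(e.b AS VARCHAR(20)) + ',%'
)
SELECT DISTINCT query_id   -- the queries that make up one sequence
FROM Component;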
Figure 28 shows the diagram of the two tables that store the sequences that were found. Table Sequence stores the sequence_ids that exist, while table Action stores the events (queries) that belong to each sequence. The position field is simply the query_id of the query represented and the action_tag is its tag. Figure 28 - Tables Sequence and Action: table Sequence has a single column, sequence_id (primary key); table Action has the columns sequence_id (primary key, foreign key to Sequence), position (also part of the primary key) and action_tag.

We have now found all the sequences present in the log and have the data structured in a case table / nested table format, where the Sequence table is the case table and the Action table is the nested table. If you look at the example in chapter 3, you can see that these tables mimic the structure of those in Figure 10. Note that what we call position corresponds to the Zapping example's SequenceID, what we call sequence_id corresponds to the Zapping example's ViewerID and our action_tag corresponds to ChannelType. We made this change of names because we disagreed with the nomenclature in (Tang 2005) and thought the term "sequence id" was better suited to identifying a sequence than the position of an event inside a sequence. The two figures below illustrate how the Sequence and Action tables look after processing a log with only two sequences, a checking account creation and a savings account creation. 36 Figure 29 - Sequence table example Figure 30 - Action table example

Note that the position attribute is global, meaning we can know whether an action from one sequence happened before or after an action from another sequence. As the global chronological order of the events is easily determined just by looking at the log, we decided to keep the position attribute global. This does not influence the results of the sequence clustering algorithm, since it will always assume that each sequence begins at the event with the lowest position value in that sequence. It is now time to apply the Sequence Clustering algorithm, which will enable us to find Markov models for the different types of sequences. 37

CHAPTER 6: RESULTS

We will now present the results obtained by applying the techniques of the previous chapters to a simulator-generated log of about one thousand events, comprising roughly 200 sequences. When the sequence clustering algorithm is run without specifying a number of clusters, it uses a heuristic to determine the most "adequate" number of clusters. Doing this over our data resulted in the algorithm grouping nine different sequence models in eight clusters, i.e., one of the clusters contained two Markov chains for two different types of sequences. Seeing this, we instructed the algorithm to create 9 clusters and ran it again. Each cluster now had just one model (Figure 31, Figure 32 and Figure 33). As can be seen, all the different types of sequences were identified correctly except one. Clusters 1 and 8 represent the two ways of opening a savings account, while clusters 2, 3, 4 and 7 model the four checking account creation sequences. Cluster 6 shows the loan creation sequence and cluster 9 the cash payment version of the loan payment sequence. All these Markov chains have transitions with 100% probability and therefore constitute deterministic graphs. Finally, cluster 5 contains the model for the loan payment by transfer sequence, albeit with some errors.
Figure 31 - Sequence models for clusters 1, 2, 3 and 4
Figure 32 - Cluster 5's sequence model
Figure 33 - Sequence models for clusters 6, 7, 8 and 9

The loan payment by transfer sequence is the only sequence that was not modeled correctly: the action SelectDepositorAccountChecking_Account appears when it was not supposed to. Although an action of this type is included in the loan payment by transfer sequence, it is easy to see in Table 16 that that action is not connected to any other action of that sequence by our criterion (a shared key_name, key_value pair); thus the action we see in the figure belongs to some other sequence, more specifically one of the savings account creation sequences, which are the only other sequences that have the SelectDepositorAccountChecking_Account action. This action is included in this sequence because it shares a key_name, key_value pair with one of its actions. In a real usage scenario we would not have this knowledge, so it is important that this kind of noise does not compromise the understanding of the sequences of actions that were executed. The percentages associated with transitions should be a good indicator of noise. Indeed, in this example, the low percentage associated with the transition to this action, 28%, is sufficient to indicate that we are in the presence of noise. Moreover, the population associated with each cluster (the number of sequences in that cluster) or state (the number of times the event corresponding to that state appears in the sequences of that state's cluster) also indicates the representative power of that cluster or state. This is illustrated by the color of the cluster or state: the darker the blue, the bigger the population.

Figure 34 shows the cluster diagram for this example. The links indicate cluster similarity. The algorithm considered clusters 2, 3, 4 and 7 all very similar. As we know, these clusters correspond to the open checking account sequence variations. As these variations only change the order of the actions, and as each cluster only contains sequences corresponding to one of these variations, it is easy to understand why the algorithm found them to be similar. It also found similarities between clusters 9 and 5 and between clusters 1 and 8, which correspond to the loan payment and savings account opening sequences, respectively. Cluster 6 is isolated, indicating the algorithm found it to be different from all the others. This cluster contains the loan creation sequences, which do not have any variations and whose events are different from those of the other sequences, thus justifying its isolated position. Figure 34 - Cluster diagram

We conclude that it was in fact possible to find and group correctly all the different types of sequences and variations contained in the log. Moreover, the algorithm was able to identify which types of sequences were similar to others, thus indicating what types were variations of other types. 40 In the next chapter we will present the conclusions drawn from these results and from all our work in general. We also give a glimpse of some future work which has already been started. 41

CHAPTER 7: CONCLUSION AND FUTURE WORK

With careful preparation of the input data, the Sequence Clustering algorithm proved to be a good tool for modelling different types of sequences of actions and for making explicit the differences between these types. Also, the fact that Markov chains have probabilities associated with transitions gives us a tool to deal with noise.
By ignoring low-probability transitions, erroneous sequence paths or exceptional behaviour may be detected. However, the algorithm has some shortcomings that, we hope, can be corrected in future versions. The first problem is its tendency to put different types of sequences in the same cluster. When this happens it is clear, by looking at the cluster, that we are in the presence of two different types of sequences. We think this happens because the algorithm first determines the number of clusters, according to its heuristic, and only then decides which sequences go to each cluster. If the number of clusters is smaller than the number of different types of sequences, the algorithm will put two types in the same cluster. Another related problem is that the parameters are not deterministic but merely indicative; this way, setting the number of clusters to nine will not necessarily result in the algorithm creating nine clusters. Actually, more often than not, the algorithm chooses a close number that it thinks is better, according to its heuristic.

Shortcomings aside, the execution of the algorithm is the simplest step of the whole approach. The biggest challenge is to identify the set of sequences of actions in a completely automatic way. All of the process mining approaches presented in chapter 2 assume logs only contain events resulting from executions of a single process. Our approach makes no such assumption, since most enterprise systems' logs will not contain events pertaining to just one process, and also because these logs will contain events whose processes and process instances are unknown. This way, we had to develop a way of determining the case id for each event. The technique designed to that effect - matching key_name, key_value pairs in order to build graphs corresponding to sequences - is the most important step of the described approach, since it contributes to the mining of event logs resulting from multiple, interleaved processes.

Hierarchical sequence clustering

Having found the models for the sequences we had initially generated with the simulator, we closed a cycle. Having discovered sequences of actions, which we now designate as tasks, the path is clear for findings at the process level. Since each cluster found contains a type of sequence, or a task, if we apply the sequence clustering algorithm again over these clusters (Figure 31, Figure 32 and Figure 33), i.e., treating tasks as events and trying to find sequences of tasks, we expect to get some insight into the behavior at the process level. Therefore, we will now present a possible approach to this goal, which is based on hierarchical clustering, i.e., the successive application of sequence clustering in order to derive higher-level results each time.

As mentioned in chapter 5, a heuristic was developed to avoid storing erroneous links between queries. This way, links whose length was significantly bigger than the average link length were removed from table Follows. These long links were not deleted but instead stored in a different table, similar to the Follows table, which we called LongFollows. The links in this table provide us with enough information to establish connections at the task, or sequence, level.
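A minimal T-SQL sketch of how these task-level connections could be derived is shown below. The column names of the LongFollows table (from_query_id, to_query_id) are assumptions, consistent with the earlier sketches; the Action table is the nested table described in chapter 5, whose position column holds the query_id of each event.

-- Sketch (assumed column names): lift the long links from the query level
-- to the sequence (task) level using the Action table, which records the
-- sequence each query belongs to.
SELECT DISTINCT a1.sequence_id AS from_sequence,
                a2.sequence_id AS to_sequence
FROM LongFollows AS lf
JOIN Action AS a1 ON a1.position = lf.from_query_id
JOIN Action AS a2 ON a2.position = lf.to_query_id
WHERE a1.sequence_id <> a2.sequence_id;

Each pair returned in this way can then be mapped to a pair of clusters by looking up, in the mining model, the cluster to which each sequence was assigned.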
We will illustrate this idea with an example. Take Figure 29 and Figure 30, for instance. Suppose we had found a link between the first query of sequence 1 and the last query of sequence 2, i.e., between the query at position 1 and the query at position 10, and that, because this link was much bigger than the others, we had removed it from the Follows table and stored it in the LongFollows table. Now, suppose that after applying the sequence clustering algorithm sequence 1 is placed in cluster 1 and sequence 2 is placed in cluster 2. By looking at the link 1 -> 10 in table LongFollows, and since we know query 1 pertains to sequence 1 and query 10 pertains to sequence 2, we know there is a connection between sequence 1 and sequence 2. Furthermore, by querying the mining model we know in which cluster each sequence is, and therefore we know there is a connection between cluster 1 and cluster 2. We now have a method for determining connections between sequences, or tasks. If we now fill tables Sequence and Action treating each first-level sequence as an action, with the cluster it was assigned to as its action tag, we obtain the following: Figure 35 - Sequence table for hierarchical clustering example Figure 36 - Action table for hierarchical clustering example

This method constitutes a generic technique for hierarchical clustering, since it can be applied as many times as desired in order to derive higher-level results each time. The results obtained by applying this method to the clusters in Figure 31, Figure 32 and Figure 33 are shown in Figure 37, Figure 38 and Figure 39. In Figure 37 the strong connections from clusters 2, 4 and 7 to cluster 6 are visible, indicating that an open checking account sequence usually precedes the creation of a loan. We, of course, know this is always true, as in the simulator the creation of a loan implies the existence of a customer and, therefore, of a checking account. In Figure 38 there is a very strong connection from cluster 2 to cluster 8, indicating that after the creation of a checking account a savings account is created, and there are connections from clusters 3 and 4 to cluster 6, as well as a weaker link from cluster 7 to 6, once more stating that loan creation is preceded by checking account creation. Figure 39 also supports this last idea. Also note that the intuitive connection between loan creation and loan payment, which corresponds to transitions from cluster 6 to clusters 5 and 9, is not clear from these results.

Figure 37 - Second level clustering: cluster 1
Figure 38 - Second level clustering: cluster 2
Figure 39 - Second level clustering: cluster 3

All in all, these results show that some higher-level conclusions may be drawn with successive applications of the algorithm, and also that some more work has to be done in this respect, since some of the transitions obtained are not very obvious and do not correspond to expected behavior, and some important transitions are missing or are not strong enough. Nevertheless, this seems to be a promising approach for drawing conclusions at higher levels.

Final remarks

"The ultimate objective of Organizational Design and Engineering is to develop the body of knowledge needed to model the organization in such a way that will allow understanding the organization and its behaviors as well as to predict and control the outcomes of the different organizational design and operation activities with a known degree of certainty (or uncertainty)." (Matos 2007)

In the context of organizational engineering, our approach positions itself as a tool for modelling hidden organizational behaviour at the task or process level.
The direct analysis of event logs is able to provide us with an idea of the types of tasks executed (as demonstrated in chapter 6). These tasks constitute the basic components of an organization's business processes. Furthermore, there is a real possibility that the undertaking of successive steps of analysis over these results, as suggested, will make explicit patterns of behavior at successively higher levels of abstraction. Nevertheless, there are some difficulties. SQL queries constitute relatively low-level machine operations. If we aim at finding higher-level patterns of behavior based on higher-level actions (or tasks) performed by either machines or humans, the fundamental steps of this approach could, in theory, be applied in almost the same way, if a similar kind of event log is available. But the higher we go, the more complex actions become. At some point it will be difficult to determine what entities each action manipulates and, consequently, to decide whether actions belong to the same sequence or not. Take a messaging application, for example. We could use the subject field to determine if two messages belong to the same sequence. But people use the same subjects to start new sequences. We could use the body of the message, but it may not be enough. Numerous aspects would have to be used to determine what the essence of the action is, what unequivocally identifies it. At some point human validation would be needed. In the context of enterprise architecture, the discovery of processes in this way would be hindered by the complexity and high level of abstraction of the informational entities manipulated by each activity.

Still in the realm of organizational engineering, it is possible to establish a parallel between comparing SQL query parameters in order to find sequences of related events, and comparing action resources in order to discover human activity contexts (Zacarias 2007). This means that the same technique can be used to identify different features of an enterprise architecture. Ultimately, techniques such as process mining can be used as a useful instrument to gain insight into the implicit behaviour associated with a given enterprise architecture. The experiments described in this report confirm that sequence clustering algorithms can be valuable tools in the sequential analysis of event logs and, consequently, in the discovery of an organization's tasks, overcoming (with the help of appropriate preprocessing techniques) certain limitations of current algorithms. In order to gain knowledge at the process level, some more work will have to be done so that we can understand what additional data is necessary to connect tasks. There is also the possibility that human validation will be necessary. 46

REFERENCES

(Aalst 2003) van der Aalst, W. M. P., van Dongen, B. F., Herbst, J., Maruster, L., Schimm, G., and Weijters, A. J. M. M.: Workflow Mining: A Survey of Issues and Approaches. (2003)
(Aalst 2004) van der Aalst, W. M. P., Weijters, A. J. M. M., and Maruster, L.: Workflow Mining: Discovering Process Models from Event Logs. (2004)
(Agrawal 1998) Agrawal, R., Gunopulos, D., and Leymann, F.: Mining Process Models from Workflow Logs. In: EDBT '98: Proceedings of the Sixth International Conference on Extending Database Technology. (1998)
(Cook 1995) Cook, J. E., and Wolf, A. L.: Automating Process Discovery through Event-Data Analysis. In: ICSE '95: Proceedings of the 17th International Conference on Software Engineering.
ACM Press, New York, NY, USA (1995) 73-82
(Dongen 2004) van Dongen, B. F., and van der Aalst, W. M. P.: Multi-Phase Process Mining: Building Instance Graphs. (2004)
(Dongen 2005) van Dongen, B. F., and van der Aalst, W. M. P.: Multi-Phase Process Mining: Aggregating Instance Graphs into EPCs and Petri Nets. In: Proceedings of the Second International Workshop on Applications of Petri Nets to Coordination, Workflow and Business Process Management (PNCWB). (2005)
(Ferreira 2007) Ferreira, D., Zacarias, M., Malheiros, M., and Ferreira, P.: Approaching Process Mining with Sequence Clustering: Experiments and Findings. In: BPM 2007. (2007)
(P. Ferreira 2007) Ferreira, P.: Levantamento de Processos de Negócio por Observação de Contextos. (2007)
(Greco 2004) Greco, G., Guzzo, A., Pontieri, L., and Saccà, D.: Mining Expressive Process Models by Clustering Workflow Traces. (2004)
(Greco 2005) Greco, G., Guzzo, A., and Pontieri, L.: Mining Hierarchies of Models: From Abstract Views to Concrete Specifications. (2005)
(Grinstead 1997) Grinstead, C. M., and Snell, J. L.: Introduction to Probability. American Mathematical Society. Second edition. (1997) Chapter 11
(Herbst 1998) Herbst, J.: Integrating Machine Learning and Workflow Management to Support Acquisition and Adaptation of Workflow Models. (1998)
(Herbst 2000) Herbst, J.: A Machine Learning Approach to Workflow Management. (2000)
(Ly 2006) Ly, L. T., Rinderle, S. B., Dadam, P., and Reichert, M. U.: Mining Staff Assignment Rules from Event Based Data. In: Proc. Business Process Management Workshops, BPM 2005 International Workshops. (2006)
(Matos 2007) Matos, M.: Organizational Engineering: An Overview of Current Perspectives. (2007)
(Medeiros 2006) Medeiros, A. K. A.: Genetic Process Mining. PhD Thesis, Technische Universiteit Eindhoven. (2006)
(Molloy 1982) Molloy, M. K.: Performance Analysis Using Stochastic Petri Nets. (1982)
(SearchCIO 2007) SearchCIO. http://searchcio.techtarget.com/sDefinition/0,,sid19_gci1088467,00.html (2007)
(Tang 2005) Tang, Z., and MacLennan, J.: Data Mining with SQL Server 2005. Wiley Publishing, Inc., Indianapolis, Indiana, USA (2005) Chapter 8
(Zacarias 2007) Zacarias, M., Pinto, H. S., and Tribolet, J.: Reverse-engineering of Individual and Inter-Personal Work Practices: A Context-based Approach. (2007)
48