Universidade Federal de Pernambuco
Centro de Informática
Pós-Graduação em Ciência da Computação

Lucas Albertins de Lima

“TEST CASE PRIORITIZATION BASED ON DATA REUSE FOR BLACK-BOX ENVIRONMENTS”

Master's Dissertation
[email protected]
www.cin.ufpe.br/~posgraduacao

This work was presented to the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco as a partial requirement for obtaining the degree of Master in Computer Science.

ADVISOR: Prof. PhD. Augusto Cézar Alves Sampaio
CO-ADVISOR: Prof. PhD. Juliano Manabu Iyoda

RECIFE, AUGUST/2009

Lima, Lucas Albertins de. Test case prioritization based on data reuse for black-box environments / Lucas Albertins de Lima. Recife: The Author, 2009. xii, 100 pages: fig., tab. Dissertation (Master's) – Universidade Federal de Pernambuco. CIn. Ciência da Computação, 2009. Includes bibliography and appendix. 1. Software engineering. I. Title.

To my parents, the most supportive and lovely persons a man could ever desire.

ACKNOWLEDGEMENTS

First of all, I would like to thank God for allowing me to accomplish this work and for guiding me throughout its realization. Next, the most important people in my life, my parents José Nivaldo and Ana Maria, for always believing that I can take one more step forward, and for always being by my side, helping me and giving me strength whenever I need it. To Pâmela, my girlfriend, for her understanding and encouragement during the development of this work. To my whole family, for always rooting for me and encouraging me for as long as I can remember. To my advisors, Augusto Sampaio and Juliano Iyoda, for all the understanding and dedication that made this work possible. To professor Alexandre Mota, for having been the godfather of this work, pointing us to the problem to be solved. To professor Flávia Barros, for the theoretical grounding needed to build the heuristics. To my colleague Dante Torres, for the help with the permutation algorithm. To my colleague and professor Eduardo Aranha, for his contribution to all the empirical studies. To Ricardo Sales, Taíza Montenegro and Thaysa Paiva, for developing the genetic algorithm technique and running the experiments with the heuristics. To André Lacerda and all the testers involved in the execution of the controlled experiment. To CNPq, for the financial support through the scholarship. To the Motorola/BTC (Brazil Test Center) program, which provided an industrial environment for the application and validation of this work. Finally, to all the colleagues of the Motorola/BTC research group for the support given during the development of this research, and to all the colleagues of the graduate program of the Centro de Informática at UFPE.

Beware of bugs in the above code; I have only proved it correct, not tried it.
—DONALD KNUTH

RESUMO

Test case prioritization is a method that aims to order tests so as to obtain gains according to specific criteria. This work proposes prioritization techniques that aim to reduce the time spent on the manual execution of test cases by decreasing the effort of data preparation; the best test sequences are those that reuse more data.
These techniques were applied in a mobile phone testing environment, in which tests are created from requirements and executed manually. Four techniques are proposed, including an exploratory approach based on the generation of test permutations and three heuristics: Greedy, Simulated Annealing and Genetic Algorithm. The permutation technique always returns the best sequences; however, it can only be applied to small test suites due to the computational complexity of the algorithm. The heuristics, in turn, can in principle be applied to arbitrary test suites, although there is no guarantee that the best sequences are always produced. A tool was developed that mechanizes the prioritization process, helping testers to register information, execute prioritizations and choose sequences from the returned results. Empirical studies were performed, comparing the permutation technique with the existing prioritization technique, in which tests are prioritized manually based on a heuristic that uses a tree structure, as well as on the tester's knowledge and intuition. The results show that the permutation technique reduced configuration time by approximately 25-30% and total time (configuration and procedures) by 13.5%. Experiments with our heuristics showed that they have similar effectiveness for small and large test suites. The proposed techniques bring significant results not only in the execution of the sequences, but also in their generation, which is automated by the tool.

Keywords: Software Testing, Test Case Prioritization, Data Reuse

ABSTRACT

Test Case Prioritization is an approach that aims to order test cases to obtain gains according to specific criteria. This work proposes test case prioritization techniques aiming to decrease the time spent in manual execution by reducing the effort of the data preparation needed for each test case; the best sequences of tests are those that reuse more data. We applied these techniques in a mobile phone testing environment where tests are designed based on requirements and executed manually. Four techniques are proposed, including one exploratory approach based on permutation generation and three heuristics: Greedy, Simulated Annealing and Genetic Algorithm. The permutation technique always yields the best sequences; however, it can only be applied to test suites of limited size due to the computational complexity of the algorithm. The heuristics can, in principle, be applied to arbitrary test suites, but there is no guarantee that the best sequences will be produced. We implemented a tool that mechanizes the prioritization process, helping testers to register information, execute the prioritization techniques and choose from the sequences yielded as results. Empirical studies were performed, comparing the permutation approach to the existing prioritization technique, in which test cases are prioritized manually based on a heuristic that uses a tree structure together with knowledge and intuition from the testers. Results show that the permutation technique reduced configuration time by approximately 25-30% and total execution time (configuration and procedure time) by 13.5%. Experiments regarding our heuristics have shown that they have similar effectiveness when applied to small and larger test suites.
Our techniques yield significant results not only in the execution of the sequences but also in their generation, which is automated by our tool.

Keywords: Software Testing, Test Case Prioritization, Data Reuse

CONTENTS

Chapter 1—Introduction
  1.1 Organization

Chapter 2—Background
  2.1 Software Testing
    2.1.1 Levels of Testing
    2.1.2 Test Design
    2.1.3 Testing Approaches
  2.2 Test Case Prioritization

Chapter 3—Related Work
  3.1 White-Box Test Case Prioritization Techniques
    3.1.1 Prioritizing Test Cases For Regression Testing
    3.1.2 Other White-Box Initiatives
  3.2 Black-Box Test Case Prioritization Techniques
    3.2.1 Test Case Prioritization for Black Box Testing
    3.2.2 The Automatic Generation of Load Test Suites and the Assessment of the Resulting Software
    3.2.3 Requirements-Based Test Case Prioritization
    3.2.4 A Test Execution Sequence Study for Performance Optimization
  3.3 Discussions

Chapter 4—Permutation Technique for Black-Box Test Case Prioritization
  4.1 The Permutation Technique
    4.1.1 Optimizations

Chapter 5—Empirical Studies
  5.1 Empirical Study 1 - Case Study (ES1-CS)
    5.1.1 Experiment Planning
      5.1.1.1 Goals
      5.1.1.2 Participants
      5.1.1.3 Experimental Material
      5.1.1.4 Tasks
      5.1.1.5 Experiment Design
    5.1.2 Results and Analysis
      5.1.2.1 Case Study
  5.2 Empirical Study 2 - Controlled Experiment (ES2-CE)
    5.2.1 Experimental Planning
      5.2.1.1 Goals
      5.2.1.2 Participants
      5.2.1.3 Experimental Material
      5.2.1.4 Tasks
      5.2.1.5 Hypotheses and variables
    5.2.2 Experiment Design
      5.2.2.1 Procedure
    5.2.3 Execution
      5.2.3.1 Preparations
      5.2.3.2 Deviations
    5.2.4 Analysis
      5.2.4.1 Total configuration time
      5.2.4.2 Total procedure time
      5.2.4.3 Total execution time
    5.2.5 Interpretation
      5.2.5.1 Evaluation of results and implications
      5.2.5.2 Threats to Validity
  5.3 Concluding Remarks

Chapter 6—Prioritization Heuristics
  6.1 Greedy
  6.2 Simulated Annealing
  6.3 Genetic Algorithm
  6.4 Case Studies
    6.4.1 Genetic Algorithm Parametrization
    6.4.2 Heuristic Techniques Comparison

Chapter 7—Mechanization
  7.1 Software Structure
    7.1.1 Architecture
      7.1.1.1 Core Module
      7.1.1.2 UI Module
  7.2 Main Features
    7.2.1 Add Test Case screen
    7.2.2 Prioritization screen

Chapter 8—Concluding Remarks
  8.1 Limitations
  8.2 Future Work

Appendix A
  A.1 Test Suites Used in the Controlled Experiment

LIST OF FIGURES

3.1 The calculation of M0.
3.2 The calculation of M1.
3.3 The weight prioritization equation [Sri04].
3.4 An example of an ordered tree using the approach [RBJ07].
4.1 An example using the reuse function.
4.2 Possible phone combinations.
4.3 An example keeping the phone state.
4.4 An example using the pruning mechanism.
4.5 Three equivalent test cases.
4.6 Reducing the test set with equivalence classes.
4.7 The transitive closure mechanism.
4.8 Example of a cluster.
4.9 A cluster as a representative test case.
5.1 An example of a Latin square used in our experiment.
5.2 Results from the ANOVA with respect to the total configuration time.
5.3 Results from the ANOVA with respect to the total procedure time.
5.4 Results from the ANOVA with respect to the total execution time.
6.1 An example of successors from a state in our simulated annealing algorithm.
7.1 The architecture of the prioritization tool.
7.2 Class diagram of the entities submodule.
7.3 The screen for data insertion.
7.4 The screen for test case insertion.
7.5 A prioritization execution with the stateful permutation.

LIST OF TABLES

3.1 Subjects used in the experiments [RUCH01].
3.2 An example using PORT [Sri04].
3.3 Test case representation and their relationship [RBJ07].
4.1 Feasibility study results.
5.1 An example of 2 lists with random orders of execution containing 4 input data generations.
5.2 Execution times of configuration data collected in ES1-CS (in seconds).
5.3 Comparison among approaches.
5.4 Dependent and independent variables of the experiment.
5.5 Results from the Latin squares with respect to the total configuration time and the total procedure time.
6.1 Best parameters for the Genetic Algorithm found in the experiments performed.
6.2 Results of experiments executed with different sets of test cases.
A.1 The test suite I.
A.2 The test suite II.
A.3 Test suites sorted according to each prioritization technique.

CHAPTER 1

INTRODUCTION

Software quality is an essential discipline of software engineering, closely involved with product quality and customer satisfaction. As quality is a measure of confidence in a product, software customers demand trustworthy reliability through the application of such a measure. In particular, there is a critical need to reduce the cost, effort, and time-to-market of software. Hence, the challenge is to meet customers' expectations through the application of quality principles in the software development process. Software quality disciplines aim to measure whether a software product is well designed (quality of design) and how well the software conforms to its requirements (quality of conformance) [Pre05]. One of the most used methods to assure quality of conformance is software testing. In this context, a test can be defined as an action for exercising the software, where the objective is to find application faults. Testing has become a normal part of modern software engineering due to its efficiency (when applied correctly) and relatively lower cost compared to other approaches (formal validation, reviews, debugging). However, its cost can reach more than 50% of the total cost of the software development process [MSBT04]. Therefore, any effort to optimize the creation and the execution of test cases could help companies to solve the problem of lack of resources that sometimes affects the progress of the testing process. One approach to the creation problem is to automatically generate test cases with a mechanized process, reducing the time spent creating tests. Regarding the optimization of test case execution, there are several approaches to a large variety of problems (some of them are presented in Chapter 2).
For example, test selection aims to choose the right set of tests from a larger suite, following some criteria previously specified, allowing testers to reach some the goal related to the selected criteria, keeping a coverage rate for instance. However, some categories of problems require that a fixed number of tests be executed to assure quality in testing, avoiding potential losses 1 INTRODUCTION 2 in the defect discovery capability. We can imagine a situation where a set of test case cannot be reduced any more through the application of other techniques, such as test selection, because some errors may not be detected or to assure that some functionality is being tested. In these cases, an alternative is to arrange a sequence of test cases in an ordered way such that the execution in this specific order will make the tester reach some goal. Some examples of goals are reaching a coverage criteria earlier, finding defects earlier, and, in our case, reducing the time spent on test execution. This approach is well known as Test Case Prioritization [EMR00] and we instantiated it using the data reuse concept. Our problem is to provide an ordered sequence of tests that can be executed as fast as possible. In order to achieve this goal, we propose four test case prioritization techniques. An exploratory strategy based on permutation generation, and three heuristic-based techniques: Greedy, Simulated Annealing and Genetic Algorithm [RN03]. They aim to generate sequences of tests that when executed the inputs needed for each test may be reused from the execution of a previous test in the sequence. As data is reused the tester (or machine) saves time spent on preparation of these data, which in some test case environments is the most critical phase. Hence, the time spent on the execution is decreased. This time saved can be used to execute more tests or to make testers devote more time to other testing activities. At the end, the main objective is to increase test team productivity, helping companies to deal with the problem of lack of resource. Although, it is important to emphasise that defect discovery capability of the tests being prioritized is not the focus of this work. We assume that the test design phase was performed correctly and the tests were appropriately chosen at the test plan phase done by a test specialist. Our focus is on improving test execution performance through test data reusability. We applied these techniques in a mobile phone testing environment at the BTC (Brazil Test Center ) program, which is a partnership between Motorola and the Informatics Center of the Federal University of Pernambuco (CIn-UFPE). The characteristics from this industrial environment fit perfectly our objectives, as test teams have daily demands on test case executions and test case prioritization can be applied to help them meet these daily testing marks. Test cases in BTC are built from a functional testing perspective (also known as INTRODUCTION 3 black-box testing), which means that the source code is not available [Bei95]. The test design is based on requirements, from where the test engineers retrieve information for creating them. They are described textually, and the tester should read the text and exercise the actions presented inside it on the mobile phone, and compare if the result of such an action corresponds to the expected behavior also described in the test case. 
Testing in industry is still largely conducted in a manual basis [AR02, Ber07], which is often justified by limitations of existing automation frameworks and an unclear ROI (Return Over Investment) on automation [Hof99, RW06]. The creation process of these test cases can be automated by using a tool developed by the BTC research team. The TaRGeT (Test and Requirements Generation Tool) automatically generates test suites from use case scenarios written in a Controlled Natural Language (CNL). The possible scenarios are specified as use cases described in CNL in a template that supports automatic processing. Then, from use cases, a test suite can be generated, where each test case represents a flow from the written use cases. Experiments revealed a great effort reduction by using TaRGeT on the BTC testing environment [NCT+ 07]. Regarding the testing execution process, a test case technique was proposed by research students from BTC aiming to improve test team productivity [RBJ07]. This technique uses an heuristic to build a sequence according to the data that can be reused keeping a tree where each path from the root to the leaves represents a sequence of test cases that has some data reusability among them. In the end the largest path is returned. However, this technique is not mechanized and the testers may perform their methodology manually. Moreover, it does not guarantee that the best sequence is produced. This work was the main motivation for our research. The main contributions of our work are four mechanized test case prioritization techniques that use the concept of data reuse to generate sequences of tests and, in particular, the permutation technique that guarantees the computation of the best sequences (there might be more than one) for test suites with a small number of tests as it uses an exploratory algorithm. We also provide empirical evidences that the permutation technique can generate better sequences than those generated by the manual approach. Our experimental results show that the permutation technique reduced approximately 25-30% of configuration time, and this represented a reduction of 13.5% of total execution time, 1.1 ORGANIZATION 4 which is the time spent in the whole test suite, both to prepare the phone for a test case and to execute the test case procedures. This gain can be even larger when our technique is applied in test suites with no prioritization. The manual technique experiments reported economy of around 5.56%, reducing an average of 4 minutes the execution time. Authors claim that the saved time could be used to execute more 1 or 2 test cases. Therefore, we assume that our prioritized sequences could allow sequences almost 20% more economical, which could allow the execution of even more test cases increasing the tester team’s productivity. In what follows, we present the organization of this work. 1.1 ORGANIZATION The remainder of this work is organized as follows: Chapter 2 introduces some basic concepts used throughout the work: Software Testing, Levels of Testing, Test Designs, Testing Approaches and Test Case Prioritization. Chapter 3 discusses the work related to our strategies. We contextualize white-box and black-box test case prioritization techniques. Chapter 4 describes the permutation technique developed in this work that uses an optimized exploratory strategy to generate the best sequences regarding reusability of data. 
Chapter 5 presents two comparative empirical studies between the permutation technique and the manual approach used by testers: a case study and a controlled experiment. Chapter 6 presents the three techniques that use artificial intelligence concepts: Greedy, Simulated Annealing and Genetic Algorithm. In addition, a comparative study among them is also presented. Chapter 7 details the tool developed to support the mechanization of the presented techniques. 1.1 ORGANIZATION 5 Chapter 8 describes the concluding remarks, discussing our contributions, the limitations of our approach, and future work. CHAPTER 2 BACKGROUND 2.1 SOFTWARE TESTING In this chapter, we present a brief introduction to software testing and test case prioritization. In the software industry, testing has become an activity to assure quality of programs by validating assumptions about the software built. As software has become more and more complex each day, clients demand confidence in the product they are investing on. So, software testing plays a very important role in the achievement of a satisfied level of trustiness among the software stakeholders. The main objective of testing is to find unpredictable situations that the software might behave with the exercise of some scenarios. These situations are described as errors, which are incorrect results from human actions. An error is manifested as a failure, a deviation of the system from the expected result, which is caused by a defect, also called fault: a flaw in the system. These definitions are described in the IEEE Standard Glossary of Software Engineering Terminology [IEE90]. The result of a test execution is a verdict informing if the software worked as predicted (passed) or whether it has some problem (failed) for the specific situation tested. As for the entity which has the capacity to judge if a test has passed or failed, the term Oracle is used, which could be a human mechanism, the person who performs the tests, or automatic, some testing tool. The Oracle’s judgement considers the inputs provided for the test, the outputs generated compared to the expected results. To predict these results the oracle must consider the status of objects involved in the test, say State, defined as the set of all values of all objects for a given set of objects at any given moment [Bei95]. Inputs are any means by which data can be supplied to a test like, for instance, stored data, generated data, keystrokes, messages. Outputs are the observable state of objects at the end of a test execution, including a changed data object, a message, a new screen, the absence of a message, a blank screen. Some authors distinguish outputs and outcomes, 6 7 2.1 SOFTWARE TESTING but for simplicity we are considering the former as a general term for both. From the definitions above, now we can describe the structure of a test case which is composed of the following parts [Jor02]: Initial State: State of how all test data should be set before the beginning of the test execution. Procedures: Inputs and actions that should be performed in the test objects during the test execution. Expected Results: Outputs that should be expected during the execution of the test procedures. Final State: Output State whose all test data are set after the test execution. 
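Purely as an illustration of this structure (the record layout, field names and Python encoding below are our own, not taken from the dissertation or its tool), a test case could be captured as a small record, with each procedure step carrying its own expected result, anticipating the mobile-phone convention described next:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str            # input/action performed on the object under test
    expected_result: str   # output that should be observed after the action

@dataclass
class TestCase:
    """Illustrative record for the four parts listed above [Jor02]."""
    test_id: str
    initial_state: set = field(default_factory=set)   # how the test data must be set beforehand
    procedures: list = field(default_factory=list)    # ordered Steps (inputs and actions)
    final_state: set = field(default_factory=set)     # state of the test data afterwards

# Hypothetical mobile-phone example.
tc = TestCase(
    test_id="TC-001",
    initial_state={"1 contact stored in the phonebook"},
    procedures=[Step("Send a message to the stored contact",
                     "The message appears in the Sent folder")],
    final_state={"1 contact stored", "1 message in the Sent folder"},
)
```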
In the context of mobile phones, where we are inserted, the Initial State phase is also called Initial Conditions, and the Procedures are described as a set of Steps composed by the action to be performed and its expected behavior (Expected Results). Initially, there are a lot of definitions and classifications that can be suited to software testing, and we know some of these definitions are not a consensus among all testing authors. However, we need to provide some information about how the testing activity can be classified to make use of these definitions along this document. Broadly, tests can be divided according to categories where each category has its subdivision. We consider the following categories: Level of Testing, Test Design and Testing Approaches. 2.1.1 Levels of Testing Commonly, the tests are divided into four basic levels, each one associated to a phase in the software development process. They are: Unit Testing, Integration Testing, System Testing and Acceptance Testing . Unit Testing: The focus is on atomic units of testing. The main goal is to take the smallest piece of software being tested, isolate it from other pieces and verify if it behaves as expected. It is also called component testing, in which each 8 2.1 SOFTWARE TESTING unit is tested alone to find any errors in its source code. This level of testing is recommended to be done by programmers since they know well how the code is and how to break it. It should guarantee that each module functions well before integration. There is a large variety of supporting tools for this kind of level, known as the xUnit family [Ham04]; Integration Testing: This level of testing should explore the interaction and consistency of successfully tested modules. For example, if components X and Y have both passed their unit tests and there is a new component Z created from the aggregation of X and Y, an integration test should be done to explore the self-consistency of the new module. System Testing: At this level of testing the entire software as a complete sys- tem should be tested. The software is tested together with the whole environment where it is inserted. It is done to explore system behaviors that cannot be perceived during unit or integration testing. Examples of some of these behaviors are: installation, configuration, throughput, security, resource utilization, data integrity, storage management and reliability. Acceptance Testing: This is a process to obtain confirmation by the client of the software under test, through trial or review, that there is a conformance between the system and the requirements previously defined. It is usually performed by the client itself or end-users. 2.1.2 Test Design Another possible classification of tests is according to the design method used to create them. There are two most common methods for doing test design: black-box testing and white-box testing. Other types of test design exist, as gray-box testing, which is a hybrid approach of the previous two. White-Box Testing: This kind of test takes advantage of the knowledge of the internal mechanism of a system or component [IEE90]. It is also named structural testing or glass-box testing, as the tester has the perception of how the software works internally, visualizing it clearly. Tests are derived from the structure of the 9 2.1 SOFTWARE TESTING tested object and the main concern is the proper selection of program or subprogram paths to exercise during the battery of tests. 
They are normally used in unit testing as, generally, the testers are the programmers who know how the code is built. Some of the benefits from the usage of this kind of test design are: easy gathering information about test coverage and control flow, focused testing as it can test atomic pieces of code, data integrity is more easily achieved by tracking the items in the code, internal boundaries are clearly visible in the code and algorithm testing (if you are using a traditional algorithm, probably there are some tests about it in the literature) [KFN99]. White-box testing is not the focus of this work. Black-Box Testing: Totally the opposite, black-box testing assumes that the system is viewed as a closed box, hence the testers should have no information about how the code behaves or how it is structured. While white-box design allows the view inside the system, and it focuses specifically on using internal knowledge of the software to guide the selection of test data, black-box design does not explicitly use knowledge of the internal structure of the software to generate tests. This design method focuses on testing functional requirements. Synonyms for black-box include behavioral, functional, opaque-box, and closedbox. Therefore, the main objective is to test the behaviors and functionalities of the system without seeing how it works inside, but, just seeing the descriptions of how it should execute its tasks. The mechanism of this kind of design is to feed the program under test with inputs retrieved from the knowledge of specifications or tester’s experience and observe the output conformance. Black-box testers do not spend time learning the source code, instead, they study the program from its requirements, which is how the customer will work with it. These test designs can be applied together with any level of testing, although, some levels are more inclined to specific designs, as generally unit tests are constructed from a white-box point of view and acceptance tests from a black-box one. But, what is almost a consensus is that when used together they increase software quality. The main reason is because they usually capture different types of failures in the system as white-box 2.1 SOFTWARE TESTING 10 works deeply into the code while black-box can have a wide view over the system. For example, despite the structural testing approaches make it easy to run certain types of tests, black-box thinking exposes errors that will elude white-box testers [KFN99]. We focus on black-box testing, since only the requirements and the executable code are available in the context of the project in which this work is inserted. 2.1.3 Testing Approaches There are lots of testing approaches to specific purposes. Some authors refer to them as other levels of testing, but we define them as a separate group from the previous one. Some examples are Stress Testing, which run tests to determine if the software can withstand an unreasonable load with insufficient resources or extreme usage, Performance Testing, which run tests to determine actual performance as compared to predicted performance, and Regression Testing,which is detailed below because of its relevance to the state of the art concerning Test Case Prioritization. Regression Testing is used to confirm that a new release of a software has not regressed, i.e., lost functionality. Thus, it is a testing process used to validate modified software and detect whether new faults have been introduced into previously tested code. 
Its process consists of rerun test cases previously passed to verify if any functional part of the code has been broken with the increments. There are different approaches where all tests or part of them are executed again. However, re-executing tests every time a modification takes place can be highly costly for systems with a high number of test cases; some approaches try to minimize the size of the test suites to be run, but in practice this is not always an available option to every software company [OTPS98]. Therefore, many investments on research have been done to minimize this problem and one of the techniques suggested is Test Case Prioritization, which is discussed in Section 2.2. In spite of testing being an effective way to discover bugs, it does not guarantee the absence of errors in software, only their presence [Dij72]. From this point of view, testing cannot prove the correctness of programs as it is infeasible to evaluate all their possibilities of execution. A simple example is a program that presents some behavior according to a formal parameter with type integer. As there is an infinite number of integers, we cannot evaluate every one, thus, we cannot prove its correctness, but we can 11 2.1 SOFTWARE TESTING imagine some specific values that could run the program in an undesirable way. For such values we can use some testing techniques to improve the quality of our testing, which would, potentially, improve the quality of our program. Some testing techniques are described below. Boundary Testing: To solve the problem of testing all possibilities from an input with infinite number of values, the tester should choose the ones with higher likelihood of causing flaws. This technique proposes a test that focuses on putting the software being tested under boundary and limit conditions. Once a limit situation is found, the tester must create a test with this limit value and with the values around it. Supposing that 10 and -10 are discovered as limit situations under the system, they should be tested together with 11, 09, -11 and -9; moreover the central value is advised to be tested as well, so, 1, 0 and -1, are also in the set of the test data. Statement, Branch and Path Coverage: For statement coverage, the number of statements in the code should be executed at least once. Similarly, in branch coverage all outcomes from conditions should be tested along the program. In path coverage, every flow of execution is traversed. This coverage technique includes branch coverage, as it ensures the test of every decision outcome, but in addition, path coverage verifies all dependencies among every branch coverage outcome, becoming a more robust technique compared to the previous ones. Mutation Testing: Some defects are purposely introduced along the code to verify the effectiveness of test suites. These faults are inserted in small modifications along the program, one at a time, creating a new version of it which is referred to as a ”mutant”. If the tests detect the mutant, it is disposed, but if it passes undetected the tests are not enough effective. This technique is also used to generate efficient test suites. Equivalence partitioning: It is a technique that chooses some tests to be exe- cuted instead of the whole test suite with a large size of elements. These selected tests must be representative among some class of tests called equivalence partition, i.e, they must have similar behavior. 
The selected subset should cover all partitions; doing so, all possible scenarios are covered and the cost spent on test execution is reduced [Pre05]. An instantiation of this technique is described as Test Set Minimization [WHLM98]. According to Wong et al., a test set minimization procedure finds a minimal subset, in terms of the number of test cases, that preserves the coverage of the original test set with respect to a certain coverage criterion. An implemented example of this technique is the tool named ATACMIN [HL92], which generates minimized test sets. It uses an enumeration algorithm to find optimal solutions, and heuristics when exact solutions cannot be obtained in reasonable time.

The techniques presented above are related to improving the testing process by improving test case generation, detecting faults, guaranteeing effectiveness, and reducing the cost of the execution effort. Test Case Prioritization can also be an alternative to address this latter issue, and it is the subject of the next section.

2.2 TEST CASE PRIORITIZATION

In the literature, the problem of ordering test cases is commonly known by the term test case prioritization (TCP). It considers that a test set can be prioritized following some criteria to achieve a goal. Unlike test set minimization and equivalence partitioning, this technique does not suggest reducing the number of elements in a test suite, but simply arranges them in some sequence that will bring gains related to the goal to be reached. As we have mentioned before, reducing the size of test sets may not be possible in some cases, due to quality requirements, coverage criteria to be kept or testing targets to be met; thus, an alternative is to use a Test Case Prioritization technique suited to the purpose being pursued.

The test case prioritization problem is strongly tied to regression testing, as seen in various works [RUCH01, JG08, KP02, QNXZ07, Sri04, ST02, ZNXQ07]. As discussed in Section 2.1.3, regression testing is a process used to validate modified software and detect whether new faults have been introduced into previously tested code; it can become a very expensive and laborious task. Nevertheless, Elbaum et al. [EMR02] show that test case prioritization can also be employed in the initial testing of software, where it is sometimes used with a different terminology [AW95]. In our approach, the prioritization technique can be applied in any phase of software testing execution; once the tests are created, they can be submitted to our technique, which will generate an ordered sequence of tests. Elbaum et al. formally defined the test case prioritization problem as follows [EMR00]:

Definition 1. Let S be a test suite, P the set of permutations of S, and f a function from P to the real numbers. The test case prioritization problem consists in finding p ∈ P such that

∀p′ ∈ P . (p ≠ p′) ⇒ (f(p) ≤ f(p′))

where P represents the set of all possible prioritizations of S, and f is a function that, applied to any such sequence p, assigns a real value to that ordering, allowing the comparison of two prioritizations. The definition thus refers to a search inside P for the best prioritization p. However, the main problem of test case prioritization is the generation of P: we have to produce all permutations of test cases, which in practice is not feasible for a large number of tests.
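To make Definition 1 concrete, the sketch below enumerates every permutation of a small suite and keeps the orderings that minimize f. The suite and the cost function (a hypothetical setup cost that drops when data produced by a previous test is reused, in the spirit of our techniques) are illustrative assumptions rather than the actual implementation described later:

```python
from itertools import permutations

def prioritize_exhaustively(suite, f):
    """Return (best value, orderings of 'suite' minimizing f), per Definition 1.

    Only feasible for small suites: the search space has len(suite)! elements.
    """
    best_value, best_orders = float("inf"), []
    for p in permutations(suite):
        value = f(p)
        if value < best_value:
            best_value, best_orders = value, [p]
        elif value == best_value:
            best_orders.append(p)          # there may be more than one best sequence
    return best_value, best_orders

# Hypothetical example: each test declares the data items it needs (inputs) and
# leaves on the phone (outputs); preparing an item costs 1 unless it is reused.
tests = {
    "T1": ({"contact"}, {"contact", "message"}),
    "T2": ({"message"}, {"picture"}),
    "T3": ({"contact", "picture"}, set()),
}

def setup_cost(order):
    available, cost = set(), 0
    for name in order:
        inputs, outputs = tests[name]
        cost += len(inputs - available)    # data not reused must be prepared
        available = set(outputs)           # the phone state after this test
    return cost

print(prioritize_exhaustively(list(tests), setup_cost))
```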
For instance, assuming that it takes 1µsec to generate a single permutation, the permutation of up to 12 tests takes no more than few minutes. However, for 17 tests, it takes 10 years. For more than 25 elements, the time required is far greater than the age of the earth [Sed77]. Despite this assumption being very old, over 30 years ago, and the computability of today being much faster, the complexity of the problem still persists, just allowing the computation of a few more elements. This computational complexity can be seen as a variation of the Traveling Salesman Problem (TSP), where the test cases are the cities, and to go from one test case to another on the prioritization we have an associated weight, which would be the cost of traveling from a city to another. In our case, the main difference is that the cost to travel from city A to city B is different from the cost of traveling from B to A. Moreover, we do not need to go back to the starting city (or the initial test case). But these differences do not change the computational complexity of the problem which is classified as NP-hard [GJ79]. Because of the complexity of the problem, all related works focus on the usage of heuristics, and we present them in the following chapter. Nevertheless, in our strategy, we still explore permutation for small test suits and some heuristics in other cases. CHAPTER 3 RELATED WORK In this chapter we describe some test case prioritization techniques that are related to our work. As the methodologies can be applied to different kinds of testing, we divided the prioritization techniques according to the level of code accessability: white-box techniques, that use source code information, and black-box techniques, with no such information at all. Each technique also has a specific purpose and applied to restricted testing environments, as ours. 3.1 WHITE-BOX TEST CASE PRIORITIZATION TECHNIQUES Despite our work is not a white-box technique most work on test case prioritization take into consideration the source code of applications to generate their orderings, including the paper that formally defined test case prioritization [EMR00]. Most of them are applied to reduce the impact of regression testing, which can be very difficult to execute as discussed in section 2.1. As our technique does not involve the white-box context, we will detail one referenced work about regression white-box testing prioritization, concentrating on definitions, and relevant concepts. 3.1.1 Prioritizing Test Cases For Regression Testing The initiative reported in [RUCH01] is one of the most referenced works in test case prioritization area. It has formally defined the problem and has discussed it in the regression testing context. The main motivation is the high cost of testing, especially in regression testing. The work considers other techniques to regression testing as test selection and test suite minimization techniques. The motivation for not using them are difficulties like the loss on the fault detection capability, which happens because they deal with a reduced set of test cases, which means coverage cannot always be maintained. 14 3.1 WHITE-BOX TEST CASE PRIORITIZATION TECHNIQUES 15 Then, test case prioritization is proposed as a new approach that does not lose information, but arrange test cases in some way such that the faults are discovered earlier. 
Furthermore, nine techniques are described, where actually the first three of them are pseudo-techniques, used only for comparison, and the other six represent heuristics that use test coverage information, produced by prior executions of test cases, to order them for subsequent execution. The techniques are: 1. No prioritization: the application of no technique 2. Random prioritization: randomly order the test cases in a test suite. 3. Optimal prioritization: This is not a practical technique when the size of the suite is very large (as discussed earlier in Chapter 2) because it might not be possible to generate all permutations for suites with large number of tests, but this technique is used to measure the success of the other practical heuristics. To avoid generating all permutations, they have developed a greedy ”optimal” prioritization algorithm, even knowing that it may not always produce the best ordering. 4. Total statement coverage prioritization: prioritize test cases in terms of the total number of statements they cover by counting the number of statements covered by each test case and then sorting the test cases in descending order of that number. 5. Additional statement coverage prioritization: Initially selects the test case which covers more statements (the same way as the total statement coverage prioritization) but the following test cases are chosen according to the number of statements covered by them that were not covered yet on the test suite, i.e., the ones that add more new statements to the coverage. 6. Total branch coverage prioritization: the same as total statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements. 7. Additional branch coverage prioritization: the same as additional statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements. 3.1 WHITE-BOX TEST CASE PRIORITIZATION TECHNIQUES 16 8. Total fault-exposing-potential (FEP) prioritization: the order is based on the ability of a test case to expose a fault. It is obtained from the probability that a fault in a statement will cause a failure for a given test case. This probability must be an approximation of the fault-exposing-potential, and it is acquired through the adoption of an approach that uses mutation analysis to produce a combined estimate of propagation-and-infection that does not incorporate independent execution probabilities. 9. Additional fault-exposing-potential (FEP) prioritization: Analogous to the extensions made to total statement and branch coverage prioritization to yield additional statement and branch coverage prioritization, total FEP prioritization is extended to create additional fault-exposing potential (FEP) prioritization. After selecting a test case t, other test cases that exercise statements exercised by t have their FEP values lowered, then, another test is selected based on the best FEP value and the process repeats until all test cases have been ordered. Furthermore, a series of experiments are described using a measure to compare effectiveness of the various test case prioritization techniques. The APFD measure (average of the percentage of faults detected) plays the role of the function f discussed before. It measures how quickly the prioritized suite detects the faults, with the range from 0 to 100. Higher APFD numbers mean faster (better) fault detection rates. 
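The APFD formula itself is not reproduced here, but in the cited literature it is usually stated as APFD = 1 − (TF1 + · · · + TFm)/(n·m) + 1/(2n), where n is the number of test cases, m the number of faults, and TFi is the position of the first test that exposes fault i. Assuming that formulation, a direct computation could look like the following sketch (the test ids and fault data are invented):

```python
def apfd(order, faults_revealed_by):
    """Average Percentage of Faults Detected for a prioritized order.

    order: list of test case ids, in execution order.
    faults_revealed_by: dict mapping test id -> set of fault ids it reveals.
    Assumes every fault is revealed by at least one test in the order.
    Returns a percentage in [0, 100]; higher means faster fault detection.
    """
    n = len(order)
    all_faults = set().union(*faults_revealed_by.values())
    m = len(all_faults)
    # TF_i: 1-based position of the first test that exposes fault i.
    first_position = {}
    for pos, test in enumerate(order, start=1):
        for fault in faults_revealed_by.get(test, set()):
            first_position.setdefault(fault, pos)
    tf_sum = sum(first_position[fault] for fault in all_faults)
    return (1 - tf_sum / (n * m) + 1 / (2 * n)) * 100

# Hypothetical data: 4 tests, 3 faults.
reveals = {"A": {"f1"}, "B": {"f2", "f3"}, "C": set(), "D": {"f1"}}
print(apfd(["B", "A", "C", "D"], reveals))   # faults found early -> high APFD
print(apfd(["C", "D", "A", "B"], reveals))   # faults found late  -> lower APFD
```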
To perform the tests with the techniques, eight C programs were used as seen in Table 3.1, with the information about lines of code, number of versions, number of mutants, the test pool size and the average test suite size. The first seven are programs from Siemens Corporate Research and are commonly known as the Siemens programs; the last one is a program developed for the European Space Agency and is referred to as space. As can be seen, the space program is the larger and more complex than the other subjects. These programs were used as experiments in other works related to this subject [EMR02, JG08, KP02, JH03]. Some intuition acquired from experiments performed with the techniques described shows, in general, that test case prioritization can substantially improve the rate of fault detection in test suites. The six heuristics proposed obtained such improvements but in the case of the program schedule, no heuristic outperformed the untreated or randomly 3.1 WHITE-BOX TEST CASE PRIORITIZATION TECHNIQUES 17 Experiment Results Table 3.1 Subjects used in the experiments [RUCH01]. prioritized test suites. In almost every case, including the space program, additional FEP prioritization outperformed prioritization techniques based on coverage. Considering the overall results on the Siemens programs, branch-coverage-based techniques almost always performed as well as or better than their corresponding statement-coverage-based techniques. Considering differences between total and additional branch and statementcoverage-based techniques, there was no clear winner overall. The authors conclude the paper explaining that, in this study, test case prioritization techniques can improve the rate of fault detection of test suites, as shown in the empirical studies, and that a benchmark was left with the APFD measure to compare test case prioritization techniques and other measures. For future work, the gap between the optimal test case prioritization and the heuristics used, even with FEP-based techniques, was not bridged, then, other heuristics could be used to reduce this gap. Finally, the test case prioritization problem has many other facets that can be faced, with other objectives and goals, and not all them were addressed in this paper. 3.1.2 Other White-Box Initiatives A series of works have been done to address problems related to white-box test design. An extension to the work previously described can be found in [EMR02], where 16 techniques are proposed (including 4 already presented in [RUCH01]); the other 12 are not related to statements or branches, but to functions. These techniques use coverage information, probabilities on fault existence or exposure, or a combination of them, to prioritize 3.2 BLACK-BOX TEST CASE PRIORITIZATION TECHNIQUES 18 the tests. Their experiments show that function-level had similar results to statementlevel prioritization techniques. Another extension was carried out later by Jeffrey and Gupta, proposing techniques based on statement coverage that takes into consideration how statements affect each other [JG08]. Their conclude that their techniques have more potential than the total statement and branch coverage prioritization techniques. Jones and Harrold presented a technique that prioritizes test cases using the modified condition/decision coverage (MC/DC) criterion, which is a form of exhaustive testing that uses branch coverage [JH03]. 
Their case study shows encouraging results to reduce the cost of regression testing for users of the MC/DC testing criterion. Srivastava and Thiagarajan propose a test case prioritization technique based on basic block coverage for large systems [ST02]. The work suggests that it is possible to effectively prioritize tests in large scale software development environments by using binary matching to determine changes at a fine granularity. 3.2 BLACK-BOX TEST CASE PRIORITIZATION TECHNIQUES In this section we describe some works about test case prioritization techniques proposed for tests with no information retrieved from the source code, but from requirements or historical execution data. 3.2.1 Test Case Prioritization for Black Box Testing The approaches discussed before gave us a good understanding about the prioritization problem, but their experiments are related to source code, more commonly known as white box testing. Regarding the approach presented in [QNXZ07], we can see a closer relation to our problem, because it deals with black box testing as we do, but it still tries to solve regression testing problems. Despite of this, that work does not use code data to build the test prioritization; it uses runtime and historical information. The purpose of this work is to test software in the situation where the source code is not available. This approach reuses test suites and assigns each test case a selection probability according to its historical performance, and then selects ordered test cases to run until testing resources exhaust. Algorithms were developed to order the tests. The main idea of these algorithms is to group all reused test cases by the fault types they 3.2 BLACK-BOX TEST CASE PRIORITIZATION TECHNIQUES 19 have revealed, and then dynamically adjust their priorities according to test results of related test cases. The technique uses a matrix R that relates each test case to all others, so once a test case revealed regression faults, related test cases can be adjusted to higher priority to acquire better rates of fault detection. Given a test suite T , T 0 a subset of T and R the relation matrix, the general process is described as follows: 1. Select T 0 from T and prioritize T 0 using available test history. 2. Build the test case relation matrix R based on available information. 3. Draw a test case from T 0 , and run it. 4. Reorder the remaining test cases using run-time information and the test case relation matrix R. 5. Repeat step 3 and 4 until testing resource is exhausted. The approach presents three algorithms: one builds the relation matrix and the other two Escalate and Deescalate test cases. The first is described based on the matrix, where each cell is a boolean value which is true if two related test cases reveal the same faulty type, and false otherwise. For simplicity, in this paper example, a test case can reveal only one faulty type. So, the matrix is traversed and filled following this rule. Based upon the relation matrix, the regression testing strategy selects one test case to run, and then escalate or de-escalate the priorities of its related test cases that have not been used. The Escalate algorithm gives higher priorities to test cases related to the previous one tested and which has failed. The Deescalate algorithm does exactly the opposite, because it gives lower priorities to test cases related to others that stayed at the last ordinal orders when one test case has ran successfully. 
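A minimal sketch of this adjustment idea, in our own simplified reading of [QNXZ07] (the data structures and function names are invented, and the real algorithms are more elaborate):

```python
def escalate(queue, relation, failed_test):
    """Move not-yet-run tests related to a failing test to the front of the queue.

    queue: remaining test ids in priority order.
    relation: dict (t1, t2) -> True if both reveal the same fault type.
    """
    related = [t for t in queue if relation.get((failed_test, t), False)]
    others = [t for t in queue if t not in related]
    return related + others          # related tests gain priority

def deescalate(queue, relation, passed_test):
    """Push tests related to a passing test towards the end of the queue."""
    related = [t for t in queue if relation.get((passed_test, t), False)]
    others = [t for t in queue if t not in related]
    return others + related          # related tests lose priority

# Example: after hypothetical test "T7" fails, tests sharing its fault type move up.
# queue = escalate(queue, relation, failed_test="T7")
```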
Further, some examples and experiments using this approach are shown in [QNXZ07]. A new metric, M0, for black-box testing is introduced, given that APFD is hard to use in a black-box environment, because finding the number of faults that each test case reveals is said not to be an easy task. Let t_i be the ith test case in the ordered test suite, let time_i be the execution time of t_i, and let f_i be a Boolean value that indicates whether t_i has failed. M0 for a test suite T is given by the first equation, seen in Figure 3.1, and a second metric, M1, is used when execution time is not relevant, as seen in Figure 3.2:

M0 = ( Σ_{i=1..n} [(2 Σ_{j=i..n} time_j − time_i) × f_i] ) / ( 2 × Σ_{i=1..n} time_i × Σ_{i=1..n} f_i ) × 100%

Figure 3.1 The calculation of M0.

M1 = ( Σ_{i=1..n} [(2n − 2i + 1) × f_i] ) / ( 2n × Σ_{i=1..n} f_i ) × 100%

Figure 3.2 The calculation of M1.

The experiments were made using Microsoft Word and PowerPoint 2003 with the service pack (SP2) and all bug fixes. The metric used was M1, and the results show that the strategies using the Escalate and Deescalate algorithms are on average 10% better than the strategy with no adjustment. The authors thus conclude that the dynamic adjusting methods are beneficial to improve the test suite's fault detection rate for black-box testing. Finally, the authors explain that, although their relation matrix is built based on fault types, it can be built using other kinds of information, such as test objectives. For future work, the authors mention that, as the algorithms are nonnumeric, they intend to research numerical algorithms to adjust the selection probabilities of test cases and perform experiments to compare the approaches.

3.2.2 The Automatic Generation of Load Test Suites and the Assessment of the Resulting Software

The main purpose of the approach reported in [AW95] is to generate ordered test case sequences for telecommunications systems. Three automatic test case generation algorithms are introduced, and the authors claim that they can be applied to any software that can be modeled by a Markov Chain [Nor98] using collected data and estimates. The approach is based on a telecommunication profile that includes information about the number of different types of calls, the average service demands and the probability distribution for each type of call, and the average external arrival rate and probability distribution for each call type. Using this information, a model based on Markov Chains is constructed, but this model can become very large given a high number of calls. For instance, to test a system with five different call types, each holding at most 500 calls, 268,318,178,226 Markov states would be necessary, requiring over 13,000,000,000 hours of testing, assuming a three-minute holding time. So, the algorithms deal with the problem of reducing the number of states, choosing the representative states that are the most likely to be executed. This is done using a constant ε that represents the minimum probability that a state has to have to be used in a test case. Three algorithms are described to solve the problem. The first generates all states that have steady-state probability greater than ε. The second algorithm generates, in order, all state transitions such that the product of the state probability and the call arrival rate is greater than ε. It uses the first algorithm to generate all states of interest and then filters them based on the restriction cited.
The third algorithm extends the second one by considering sub-paths rather than individual edges. Empirical studies were performed using five industrial telecommunications software projects ranging in size from 50,000 to more than one million lines of code. Furthermore, a reliability measure was defined to evaluate the approach. This measure could be used to track the progress of the load testing algorithms and to guide test planning. The results for these projects show the detection of serious program faults that, the authors predict, would not have been detected until field release. They also report greatly facilitated regression testing and fault localization due to the use of these algorithms. They plan to apply these notions to domains other than telecommunications systems. This paper is cited in [EMR02] as a work where test case prioritization is used not for regression testing but for initial testing.

3.2.3 Requirements-Based Test Case Prioritization

As our problem deals with tests derived from requirements, the work reported in [Sri04] is one of the most closely related to ours. Besides requirements-based testing, it also uses a black-box approach at the system level. Another difference from other approaches is that its main goal is to identify the severe faults earlier and to minimize the cost of test case prioritization. An enhancement to test case prioritization is proposed by incorporating additional knowledge from requirements engineering research. The approach proposes a new strategy called Prioritization of Requirements for Testing (PORT). It uses three prioritization factors (PFs) to order test cases: (1) customer-assigned priority on requirements, (2) requirement complexity and (3) requirements volatility. These factors are assigned values. The first, customer-assigned priority (CP), is a value from 1 to 10 assigned by customers based on how important the requirement is. The second, requirement complexity (RC), is also a value from 1 to 10, but it is assigned by the developers based on the perceived implementation difficulty of the requirement. The last, requirements volatility (RV), is the number of times a requirement has changed. The higher the factor values of a requirement, the higher the priority of the test cases related to that requirement. Another task is the assignment of weights to the PFs such that the total weight (1.0) is divided amongst them. The assignment is done by the development team, but must be in accordance with the customer needs. Given these values, a weighted prioritization (WP) is calculated for every requirement, as seen in Figure 3.3.

WP = \sum_{PF=1}^{n} (PF_{value} \times PF_{weight})

Figure 3.3 The weighted prioritization equation [Sri04].

Test cases are then ordered such that the test cases for requirements with high WP are executed before the others. An example is shown in Table 3.2, where the resulting prioritization is R1, R4, R3 and R2.

Factors             R1    R2    R3    R4    Weights
Customer Priority    7     4    10     8    0.30
Req. Complexity     10     5     4     9    0.40
Req. Volatility     10     8     5     5    0.30
WP                 9.1   5.6   6.1   7.5    1.00

Table 3.2 An example using PORT [Sri04].

Further, a case study on an industrial project comprising 152 KLOC is detailed. The preliminary results show that requirements complexity and volatility impact fault density. As future work, an analysis of the industrial data should be carried out to determine the effectiveness of customer priority.
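To make the PORT computation concrete, the following minimal sketch (illustrative names only, not the artifacts of [Sri04]) reproduces the WP values of Table 3.2:

    # Sketch of the weighted prioritization (WP) computation of PORT.
    # Factor values per requirement (CP, RC, RV) and their weights; the
    # data reproduces Table 3.2.
    weights = {"CP": 0.30, "RC": 0.40, "RV": 0.30}
    requirements = {
        "R1": {"CP": 7, "RC": 10, "RV": 10},
        "R2": {"CP": 4, "RC": 5, "RV": 8},
        "R3": {"CP": 10, "RC": 4, "RV": 5},
        "R4": {"CP": 8, "RC": 9, "RV": 5},
    }

    def wp(factors, weights):
        """WP = sum over factors of (factor value x factor weight)."""
        return sum(value * weights[name] for name, value in factors.items())

    ranking = sorted(requirements, key=lambda r: wp(requirements[r], weights),
                     reverse=True)
    # ranking == ['R1', 'R4', 'R3', 'R2'] with WP 9.1, 7.5, 6.1, 5.6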
3.2.4 A Test Execution Sequence Study for Performance Optimization

The research reported in [RBJ07] has the objective of optimizing the execution of mobile phone tests inside the BTC program. As said before in Chapter 1, the tests are written in natural language and the testers read the test cases, perform the actions described and verify whether the state of the phone corresponds to the description of the expected results in the test case. As the testers have a high demand of daily test suite executions, the time spent on the execution of these tests becomes a critical factor for the teams, so any effort to increase their efficiency could help these teams to provide a better service. Another aspect is that, usually, testers do not follow a predefined order, but their own intuition, based on practical experience, of how well they could execute a test suite. These decisions end up being random and personal. In this context, a test case structure can be described as follows:

Pre-condition (Setup) - setups needed in order to start the test case execution. This information is treated differently from the other data because it does not change during the test execution, and the setups should be executed together at the beginning of the whole ordered test sequence. Generally, this setup information is related to setting bits or basic configurations on the phone. Setups are not considered in the reuse strategy.

Input Data - data needed for the execution of the test, which can change during test execution to produce outputs. Input data are considered in the reuse.

Output Data - data resulting from the execution of the test; elements that should be in the state of the phone at the end of the test. Output data are considered in the reuse.

So, the main idea is to build an ordered sequence based on the reuse of this information, minimizing the time spent to calibrate a mobile phone for a test execution. As a test case can be represented as a set of inputs, a procedure and a set of outputs, the objective is to generate a sequence where each test case is linked to another by combining inputs and outputs. The methodology is proposed not to be mechanized but to be followed by the testers as a guide. It is divided into a sequence of steps. The first step is a survey of the configurations, input and output data of the test cases. As said before, the configurations (setups) are performed before the entire test sequence is executed. The input and output data discovered are tagged with a unique ID and organized in a table, which serves to identify them later. All test cases are then redefined following a standard format, which presents a test case as a unique ID, a set of inputs and a set of outputs using the tagged IDs from the previous step. To use the methodology, a root test case, called INIT, is created; it represents a fictitious test case giving the sequence an initial point to be ordered from. It has no inputs, and usually its outputs are the ones that require minimal effort to be produced (for example, "0 contacts in the phonebook"). Once all these steps are performed, the ordering itself can start. It is done using a tree data structure, where INIT is the root node. To define the children of a node, one looks for the test cases whose inputs intersect the outputs of the current node.
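A minimal sketch of this child-selection rule, reading the intersection requirement as the child's inputs being available among the current node's outputs (an interpretation consistent with the root example that follows; names are illustrative):

    # Sketch of the child-selection rule used to grow the ordering tree.
    # A test case is modelled as a pair of sets (inputs, outputs); 'available'
    # maps test IDs to such pairs. A test becomes a child of the current node
    # when its inputs are covered by that node's outputs.

    def children_of(node_outputs, available):
        """Return the IDs of tests whose inputs are contained in node_outputs."""
        return [tc_id for tc_id, (inputs, _) in available.items()
                if inputs <= set(node_outputs)]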
So, the children nodes of the root are the test cases whose input elements are contained in the set of output elements of the root node, and so on. There are some rules to be followed. For instance, a node cannot be repeated in its own sub-tree, and whenever it is not possible to insert new nodes, that sequence is finished. For example, Table 3.3 shows some test cases with their input and output sets; INIT is the first test case in the table and the intersections among the test cases are highlighted in bold. We can see in Figure 3.4 that TC 097, TC 255 and TC 256 are the children of the root node (their inputs intersect the outputs of INIT). Following this rule the whole tree is constructed. Each branch of the tree corresponds to a test case sequence. Once the tree is constructed, the ordered sequence is chosen by taking the deepest branch of the tree. However, there are cases in which not all test cases are in the deepest branch. A new tree with the remaining test cases is then built and the process is repeated until there are no test cases left. When there are two or more branches with the same depth, to make the algorithm deterministic, the leftmost path is chosen. An example of a tree using the elements of Table 3.3 is displayed in Figure 3.4.

Table 3.3 Test case representation and their relationship [RBJ07].

Figure 3.4 An example of an ordered tree using the approach [RBJ07].

A case study was conducted, comparing this approach to ordered sequences created by the testers based on their intuition and personal experience (what they were using up to this research). The approach showed better results in 83% of the executions, with an average gain of over 5.56%, reducing the execution time of the sequences by around 4 minutes. The authors claim that the saved time could be used to execute one or even two additional test cases. Currently, this methodology is being used by some teams inside the BTC. Although this approach has brought some gains in terms of efficiency, it does not always produce the optimal ordered sequence. A generated sequence can occasionally be the optimal one, but there is no such guarantee. In the next section, we discuss the contributions, criticisms and opinions about the related work and its relevance to our problem.

3.3 DISCUSSIONS

Each work has its importance in the test case prioritization area, but with respect to our problem, we have some more specific considerations about them. The approach of [RUCH01] is a remarkable contribution to the test case prioritization area because it gives a formal definition of the problem and contributed to establishing a research topic under the term "test prioritization". Test case prioritization schedules test cases in order to increase their ability to meet some performance goal. Although this approach gives us a background of the problem from a standard perspective, its contributions are not applicable to our problem because the approach is designed for a white-box setting and takes into consideration only regression testing. On the other hand, the work reported in [QNXZ07] is in the context of black-box testing, taking into consideration the fault types of each test case and using algorithms to determine the ordering sequence, but its focus is on regression testing problems. The approach in [AW95] presents a black-box example of test case generation and prioritization that works at the initial testing phase.
This approach requires the system to be modeled as a Markov chain. It also requires an operational profile of the problem context; the example presented uses an operational profile from the telecommunications domain. We still need to further investigate whether Markov chains are a suitable mathematical model for our application domain. The approaches reported in [Sri04] and in [RBJ07] are the most closely related work. The first describes a black-box, non-regression testing approach based on requirements. Its goal is to discover the severe faults earlier in the testing process. For that, it uses factors and weights that should be given by customers and developers to derive a value for each requirement. Based on these values the tests are ordered according to the requirements they are associated with. However, no prioritization is defined for the test cases belonging to the same requirement. The second work was the main inspiration for our research and the first attempt to optimize test case execution through prioritization inside the BTC program. This research provides some good results, as shown earlier, and its methodology has already been adopted by some test execution teams. However, there are also some drawbacks. As said before, following those steps does not always guarantee the generation of the best execution sequence. In order to obtain the best sequence we have to analyze all possible permutations. On the other hand, a feasible generation of all permutations depends on the size of the test suite; when we have a small set of test cases, it can be feasible. Our experiments showed that the perfect sequence can be generated for suites of up to 15 test cases in the worst case. In cases where generating the perfect sequence is infeasible, we can also experiment with different heuristics. However, heuristics cannot guarantee the best alternatives. Using heuristics we are susceptible to making arbitrary choices that lead us to a solution that may not always be a good one. For instance, one of the choices reported in [RBJ07] is the definition of the root node, which restricts the possibilities to starting with nodes that have minimal preparation effort. However, there might be a case in which we start from test cases with medium or high preparation efforts and still get a minimal cost at the end. Another choice is the selection of the leftmost branch when there are two sequences with the same depth. These decisions can interfere with the quality of the generated sequences. Moreover, this approach lacks a formal definition of the measure function f, which makes it difficult to compare the results with different heuristics. In a numeric form we can use mathematical concepts and return a value representing the results in a more concrete and clear way. Finally, except for [RBJ07], all approaches focus on ordering the test cases according to the chance of discovering faults, while only we and [RBJ07] use test case prioritization to optimize test case execution considering the inputs and outputs provided.

CHAPTER 4

PERMUTATION TECHNIQUE FOR BLACK-BOX TEST CASE PRIORITIZATION

Test case prioritization can be used to achieve many goals, as shown in the previous chapter, and most of the techniques developed come from well-defined heuristics. The main reason to use heuristics is the difficulty of finding the best prioritization, given that it is a problem of factorial complexity.
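To see this factorial blow-up concretely, here is a small sketch; the per-sequence evaluation time is an assumed illustrative constant, and the measured figures are the ones reported in Table 4.1 below:

    # Sketch of the factorial growth of the number of candidate sequences.
    from math import factorial

    EVAL_TIME_S = 2e-8  # assumed time to evaluate one sequence (seconds)

    for n in (10, 12, 14, 15, 16):
        sequences = factorial(n)
        hours = sequences * EVAL_TIME_S / 3600
        print(f"{n:2d} tests -> {sequences:>16,d} sequences (~{hours:,.1f} h)")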
However, contrary to other initiatives that do not consider an optimal solution, we noticed that an exhaustive strategy that yields the best results is feasible for small test suites. Whenever this strategy is not feasible, heuristics must be taken into consideration, even knowing that the optimal result may not be achieved. We discuss our proposed heuristics further in Chapter 6. A similar initiative has been made regarding Test Set Minimization, discussed earlier. That problem is also NP-complete, so the same restrictions regarding computational cost apply. The tool ATACMIN [HL92] uses an exponential algorithm to find optimal solutions, and it also implements some heuristics to be used in the cases where such solutions cannot be found in feasible time. This work was developed within the mobile phone testing environment of the BTC program. We deal with functional tests where the source code of the system under test is not available, and the tests are specified in English and executed manually. By observing the daily work activities at the BTC, we noticed that several test suites contain a small number of test cases. A feasibility study was carried out to verify whether an exhaustive technique could be applied. We ran a permutation algorithm (without any optimizations) to check the maximum number of tests we could handle with this technique. We had in mind that such an algorithm could be left running for an entire day, as the testers receive the test suite one day in advance of its execution. The results are listed in Table 4.1. We notice that a test suite with 15 test cases can be processed (in the worst case) in less than 24 hours. Clearly, generating all permutations of 16 tests or more takes more than one day. This was the main motivation to develop an approach which exhaustively looks for the best sequence, as this size fits well with the typical test suite size. Larger test suites can still be prioritized with our optimized permutation algorithm (to be explained later).

# of Elements    Time       # of Sequences
 1               < 1 ms     1
 2               < 1 ms     2
 3               < 1 ms     6
 4               < 1 ms     24
 5               < 1 ms     120
 6               < 1 ms     720
 7               < 1 ms     5040
 8               < 1 ms     40320
 9               16 ms      362880
10               78 ms      3628800
11               969 ms     39916800
12               11 s       479001600
13               2 min      6227020800
14               34 min     87178291200
15               8 h        1307674368000

Table 4.1 Feasibility study results.

Let us recall Definition 1 (Chapter 2): Let S be a test suite, P the set of permutations of S and f a function from P to the real numbers. The test case prioritization problem consists in finding p ∈ P such that

\forall p' \in P \cdot (p \neq p') \Rightarrow (f(p) \leq f(p'))

Unlike the related work, we are not concerned with the definition of appropriate heuristics for test sequence generation. Our main concern is the definition of an appropriate f which captures the correct notion of an efficient sequence. In our case, f is related to the level of reuse of the phone's data and configurations: f computes the time the tester spends preparing the phone between the execution of two tests. Before detailing the permutation strategy, we define some concepts related to a test case in the mobile phone environment:

Identifying Information - ID, suite and description.

Setups - A set of configurations that do not change during the execution of a suite. For instance, the setup of specific configuration bits on the mobile phone does not change during the execution of the entire suite.
These configurations should be executed before the whole test sequence.

Phones - Each phone can be a PUT (Phone Under Test) or an auxiliary phone.

Input/Output State - The input state is the state required of the phone before the execution of a test; it is the data that must be stored on the phone before the execution of that particular test. The output state is the state of the phone after the execution of a test, i.e., when the test is finished. Each input/output state is composed of a set of input/output data. For instance, "the phone contains 3 GIF files" and "there is 1 contact in the phonebook" are examples of input and output states.

4.1 THE PERMUTATION TECHNIQUE

We developed a technique based on the generation of all possible permutations of a test case set [LIS08]. This decision is justified by the fact that many test suites we deal with have a limited number of elements, which allows current computers to generate all possibilities and return the best results. Our function f reflects the level of reuse of data (input and output) of mobile phones. Each data item has an associated generation cost, which is the time spent to set the corresponding input state on the phone. For example, a JPEG file has an associated time to be added to and an associated time to be deleted from the phone. The cost computed by f is simply the total time spent preparing the input states from one test to the next. The output state is used to verify possible reuses from one test to the next in the sequence. When output data from a test match the input data of the next test in the sequence, there is no need to generate such input data for that test; consequently, no effort is spent and nothing is added to the sequence cost. Figure 4.1 shows a simplified example of how the calculation works. Each box represents a test case, of which we only show the input and output states. For the first element, we cannot reuse any data as there is no previous test. So, we add to the total cost of the sequence the time to generate the input state of test A; in this case, the time a tester takes to insert 1 Email. After test A is executed (the steps for executing the test are omitted from Figure 4.1), we can reuse the outputs of A in the inputs of B. Note that the execution of test A produced a second Email. So, before executing test B, the tester does not need to add the two Emails required, only the GIF file. The time spent adding a GIF is added to the total cost. Finally, from B to C, we can reuse the two Emails and one of the two GIFs needed in C, adding to the total cost only the time spent adding one more GIF.

Figure 4.1 An example using the reuse function.

The time to generate each kind of input data can come from interviewing the testers, from measuring the testers' activities during their regular daily work, or (as we did in our case) from a small experiment (see Chapter 5). Notice that such measurements need to be taken only once for each kind of data (like saving a GIF file or deleting an e-mail), as these values can be reused in other prioritizations. Another consideration is that we are not interested in the time spent during the execution of the test steps themselves. This is understandable as no reuse can be done in such tasks because the input state has already been prepared.
Therefore, we presume that the time spent in the execution of these procedures does not affect our calculations. Although we know this value can change according to other variables, such as the tester's experience, motivation and the level of difficulty, among others [AB08b], this variation is immaterial with respect to the reuse of data from one test to the next. There is still one issue missing in this calculation: what about tests with multiple phones (say, several PUTs and several auxiliary phones)? How do we reuse their input/output information? This scenario makes the calculation a little more complex, because we have to decide which phone will reuse the output state of which phone from the previous test. We must link each phone from a test A to each phone from a test B in the best way, that is, the one where more reuse takes place. Before calculating the final cost, we have to combine, say, each PUT's output state with another PUT's input state to check which combination produces the lowest cost. Figure 4.2 shows possible combinations where in 4.2(a) the output state of phone X is reused by phone Y' and, similarly, the output state of phone Y is reused by phone X', while in 4.2(b) we have the opposite situation (assuming that all phones are, say, PUTs). To perform this verification, we apply an algorithm similar to the permutation one. For instance, the phones of test A in Figure 4.2 are combined exhaustively with all phones of the same kind in test B. By traversing all possible combinations we can verify which pairs yield more reuse. As tests usually require one or two phones (three at most), the generation of these pairs does not impact the total sequence generation time. We do not promote reuse from a PUT to an auxiliary phone (or vice-versa) as, in general, auxiliary phones cannot be used as PUTs.

Figure 4.2 Possible phone combinations.

Another aspect related to the calculation of the cost is the decision to accumulate the state of the phones from one test to the next (a stateful approach). Keeping the state means that a phone used through a sequence of tests can accumulate data from the reuses of all test cases executed. An example can be seen in Figure 4.3, where the state of test A is updated with an Email and a GIF file. Note that the GIF file is not used by test B, but it is still in the phone state, so it can be reused by test C without any configuration cost added to the sequence. The GIF file is only reused by test C, as a result of the accumulated state of the phone after executing tests A and B.

Figure 4.3 An example keeping the phone state.

Alternatively, we can calculate the cost without taking into account the cumulative phone state, by looking strictly at the possible reuses between two test cases. In this case, the cost of running two particular tests is immutable regardless of the sequence (the context) they belong to. This happens because they do not suffer any interference from the execution of other tests executed previously. This variation (called stateless) allows us to calculate the reuse costs between any two tests in advance, before generating the permutations. Both stateless and stateful approaches have advantages and disadvantages. The former has better performance, as the costs between tests do not change and are calculated only once before the permutation starts; therefore, they can be stored in advance and retrieved when needed.
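A minimal sketch of this stateless pre-computation (illustrative names only; the per-item generation times are assumed to be given):

    # Sketch of the stateless pairwise cost: the preparation cost of running
    # test b right after test a, ignoring any state accumulated earlier.
    # gen_time maps a data item (e.g. "1 GIF") to its measured generation time.
    # Tests are modelled as (inputs, outputs) multisets via collections.Counter.
    from collections import Counter
    from itertools import permutations

    def pair_cost(a_outputs, b_inputs, gen_time):
        """Time to prepare b's inputs given only a's outputs are on the phone."""
        missing = Counter(b_inputs) - Counter(a_outputs)
        return sum(gen_time[item] * count for item, count in missing.items())

    def precompute_costs(tests, gen_time):
        """Cache the cost of every ordered pair of tests before permuting."""
        return {(a, b): pair_cost(tests[a][1], tests[b][0], gen_time)
                for a, b in permutations(tests, 2)}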
This cannot be applied to the stateful approach, as the state of the phones changes along the sequence and affects all the following tests. Thus, for each possible sequence, the cost has to be computed on-the-fly. However, the stateful approach undoubtedly produces a more accurate calculation of the execution time (see Section 5.1.2.1).

4.1.1 Optimizations

As a permutation algorithm is computationally very costly, some optimizations were implemented to allow the algorithm to finish its processing in less time. They are the pruning mechanism, equivalence classes, transitive closure and user-defined clusters. The first prunes sequences which will never be elected as optimal. The others try to reduce the number of test cases given as input to the permutation algorithm; the equivalence class and transitive closure optimizations do so without affecting the original optimal sequence of the suite. By reducing the size of the suite, we also reduce the time to generate the permutations.

Pruning Mechanism: The goal is to avoid spending time with sequences that will never be considered optimal. Initially, a greedy algorithm returns some test sequence. Its cost is used as the initial value of the best cost found so far, which is used to prune sequences incapable of being optimal. The time to run the greedy algorithm is insignificant compared to the whole permutation algorithm, thus it does not affect the total computing time. While the sequences are being built, the permutation algorithm also calculates their partial costs. If a partial cost exceeds the best cost found so far, the current sequence is abandoned together with any other sequence sharing the same prefix. Hence, the total computation time is decreased. An example is shown in Figure 4.4. In the first two rows, the sequences are generated entirely, i.e., up to the last element, and are compared with the best cost found. In the third row, the cost up to the second element already exceeds the best cost; thus, the sequence being built is halted together with all sequences starting with Test1 followed by Test5. The same happens in the fourth row. The fifth row is computed completely and a new best cost is found, so the best cost variable is updated to 58.

Figure 4.4 An example using the pruning mechanism.

The algorithm that generates the permutations with the pruning optimization is shown in Algorithm 4.1. It is recursive and requires five parameters: the list of test cases being prioritized (tests[]), an auxiliary list used to represent the sequence being generated (aux[]), an auxiliary list of boolean values used to control the tests already permuted (marked[]), an integer variable to control the size of the sequence being generated (depth), and an integer variable holding the cost of the current sequence being generated (totalCost). There are two global variables: bestCost maintains the best sequence cost found so far, and optimals is a list of the sequences that possess the best cost at the end of the algorithm. The three lists (tests[], aux[] and marked[]) have the same length, which is the number of tests being prioritized.
The first time the algorithm is called, the parameters have the following values: tests[] contains the test cases being prioritized; aux[] has all positions initialized with null values; marked[] has all boolean values set to false; depth and totalCost begin with 0; and bestCost is initialized by a greedy function (whose first element is chosen at random) that returns the cost of the sequence it finds. This value is used to quickly prune the sequences whose cost exceeds the cost of this greedy sequence.

Algorithm 4.1 permute()
Require: tests[], aux[], marked[], depth, totalCost
Ensure: tests.length = aux.length = marked.length
 1: for i = 0 to tests.length - 1 do
 2:   if !marked[i] then
 3:     marked[i] ⇐ true
 4:     aux[depth] ⇐ tests[i]
 5:     cost ⇐ calculateCost(aux, depth)
 6:     if ((cost + totalCost) < bestCost) AND (depth < tests.length - 1) then
 7:       permute(tests, aux, marked, depth + 1, cost + totalCost)
 8:     else if (depth = tests.length - 1) AND ((cost + totalCost) < bestCost) then
 9:       bestCost ⇐ cost + totalCost
10:       optimals.clear()
11:       optimals.addCopy(aux)
12:     else if (depth = tests.length - 1) AND ((cost + totalCost) = bestCost) then
13:       optimals.addCopy(aux)
14:     end if
15:     aux[depth] ⇐ null
16:     marked[i] ⇐ false
17:   end if
18: end for

At line 1 the algorithm iterates n times, where n is the size of the test suite. This loop is used together with the list of boolean values to guarantee that each test appears exactly once in each sequence generated. Line 2 checks whether the test at the current index has already been used; if it has not, the test is used, otherwise the next iteration happens. Line 3 marks the test at the current index as used, and line 4 adds the test to the auxiliary list that represents the sequence currently being generated. The depth parameter keeps track of how many tests have been added to the sequence being constructed; hence, it is the index where the next test should be placed. Line 5 calculates the cost of adding the test aux[depth] to the current sequence. From line 6 to 14 there are comparisons that decide the evolution of the current sequence. Line 6 verifies whether there are more tests to be added (depth < tests.length - 1) and whether the cost of the current sequence (cost + totalCost) does not exceed the best cost found so far. If the sequence exceeds the best cost found so far, the sequence generation stops, so no sequence with the same prefix as the sequence generated up to this moment is even generated. Otherwise, the sequence generation continues in line 7, performing a recursive call that updates depth to depth + 1 and totalCost to cost + totalCost. Line 8 handles the case where the sequence reaches its complete size and its cost is better than the best cost found so far. The bestCost variable is updated, the result set with all best permutations is cleared, and a copy of the sequence with the best result, held in aux[], is added to it (lines 9 to 11). When the sequence cost is equal to the best cost, instead of clearing the result set, a copy is simply added to it (lines 12 and 13). Finally, lines 15 and 16 reset the test used at position depth and the corresponding boolean flag, to allow that test to be used in further sequences.

Equivalence Class of Similar Test Cases: This optimization is used to reduce the number of elements given to the prioritization. By doing so, we reduce the computational effort.
The strategy tries to find equivalent tests: those that have equal input and output states and whose input state is equal to their own output state. Figure 4.5 illustrates an equivalence class with three test cases. All of them have as input state just one phone with 2 IM Accounts, and it is identical to their output state. Hence, they are considered to belong to the same equivalence class. In this particular scenario, when executed sequentially, these tests do not produce new outputs, as their outputs are equal to their inputs (they behave as neutral elements). Therefore, they can be represented by only one equivalent test in the permutation algorithm, thus decreasing the number of elements. When the algorithm over the reduced suite finishes, the equivalent tests can be executed in any order around and close to their representative.

Figure 4.5 Three equivalent test cases.

Figure 4.6 shows an example where 4.6(a) shows the complete suite and 4.6(b) shows that tests 8, 6 and 4 are equivalent. One of them, in this case test 4, represents the others, which are removed from the test set. The permutation is then run with only the 3 elements displayed in 4.6(c).

Figure 4.6 Reducing the test set with equivalence classes: (a) initial tests, (b) equivalent tests, (c) reduced test set.

Transitive Closure: Another strategy that tries to reduce the load on the permutation algorithm by reducing the number of elements to be prioritized. A transitive closure can be defined as follows.

Definition 2. Let G = (V, E) be a graph, where V is the set of vertices and E is the set of edges. The transitive closure of G is a graph G+ = (V, E+) such that

\forall v, w \in V \cdot (v, w) \in E^{+} \iff (v \Rightarrow w \neq \emptyset)

that is, if from an element (vertex) v we can reach an element w (denoted by v \Rightarrow w \neq \emptyset), then there is a relation (edge) between v and w in G+.

Computing transitive closures is widely used in reachability analysis; once solved, it results in a data structure that efficiently answers reachability questions like "can I get to x from y?" and partitions the elements into disjoint subsets according to the closure relation. In our approach, we use the transitive closure to discover disjoint subsets of test cases, dividing the entire test suite into smaller independent groups. Each group does not affect the others in terms of data reuse. If there is a reuse between two tests, they must belong to the same group and, by transitivity, they are related to all elements already inserted in it; otherwise, they are put in different groups. For instance, given five elements a, b, c, d and e, and the relations from a to b, from b to c, and from e to d, two sets are created: the first with a, b and c, and the second with d and e. As the disjoint sets have no possible reuse between them, there is no need to run the permutation algorithm with groups containing elements from different sets. Hence, we break the whole test set into subsets, execute the permutation in each subset separately, and return the optimal sequences from each subset. Notice that these sequences can be executed in any order, as each group does not affect the others. Figure 4.7 illustrates the execution of the transitive closure strategy. Tests 1, 4, 7 and tests 5, 6, 8 belong to different closures. They are prioritized by the permutation algorithm separately and the best ordering within each group is returned.
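This grouping can be computed with a standard union-find (disjoint-set) structure over the reuse relation; a minimal sketch follows (illustrative only, not the tool's implementation; 'reuses' is an assumed predicate):

    # Sketch of grouping test cases into independent subsets with union-find.
    # 'reuses(a, b)' is assumed to answer whether any data can be reused
    # between tests a and b (in either direction); 'tests' is a list of IDs.

    def group_by_reuse(tests, reuses):
        parent = {t: t for t in tests}

        def find(t):
            while parent[t] != t:
                parent[t] = parent[parent[t]]  # path compression
                t = parent[t]
            return t

        def union(a, b):
            parent[find(a)] = find(b)

        for i, a in enumerate(tests):
            for b in tests[i + 1:]:
                if reuses(a, b):
                    union(a, b)

        groups = {}
        for t in tests:
            groups.setdefault(find(t), []).append(t)
        return list(groups.values())  # each group is prioritized separately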
The resulting sequences can then be executed one after the other: 1, 4 and 7 from the first group can be followed by 6, 8 and 5 from the second group (or vice-versa: 6, 8 and 5 come before 1, 4 and 7).

Figure 4.7 The transitive closure mechanism.

The more groups are found inside a test set, the faster the prioritization algorithm executes, as the combinatorial problem is split into smaller ones. This allows us either to reduce the algorithm execution time or to run it with larger sets of elements, mitigating the limitations presented in Table 4.1. The main difference between the transitive closure and the equivalence class optimizations is that the elements of a group created by the former do not need to be similar; they are grouped according to the reuse relation among them. In the latter technique, the elements of a group share a distinct characteristic, namely the equality of their inputs and outputs, which allows them to be represented by only one member. Another difference is that all elements of a transitive closure group must be prioritized, while only one element per equivalence class needs to be prioritized.

User-defined Clusters: Another way to group test cases to improve the efficiency of the prioritization algorithms is to create clusters. In our context, a cluster is an aggregation of test cases, defined by the tester, that must be executed together along a sequence and that can be prioritized with the other tests as if it were a single test case. We can see an example of a cluster in Figure 4.8, where the tests ID135, ID148 and ID270 must appear together in the prioritized sequences. Clusters become very useful when there are factors besides data reuse that have a major influence on the test suite being prioritized. The tester may then choose some tests that must stay together along a sequence due to those factors.

Figure 4.8 Example of a cluster.

For example, imagine that a test engineer knows the set of test cases beforehand. He knows that a certain subset has a more complicated execution than the others and may prefer to execute these test cases together. He can group these tests in a cluster; these tests will then always be prioritized as a single test case. However, it is important to emphasize that clusters are different from the groups of a transitive closure. While the test cases that are part of a cluster are defined by the user, the groups of the transitive closure are formed by a pre-processing step performed over the test cases before the prioritization. When a cluster is created, the first step is to prioritize the test cases inside the cluster. To do this, we use the Greedy algorithm described in Chapter 6; it is fast and has acceptable results for our purposes. After this, the cluster becomes a representative test case whose input state is the input state of the first test case in the order generated by the Greedy algorithm, and whose output state is the output state accumulated from all test cases, that is, the phone's output state after the execution of the last test case. After this step, the cluster can be prioritized together with the other test cases, as seen in Figure 4.9. Nevertheless, there is no guarantee of finding an optimal solution when this feature is used, due to the use of a greedy heuristic for the tests inside the cluster.

Figure 4.9 A cluster as a representative test case.

CHAPTER 5

EMPIRICAL STUDIES

Experimentation is one of the most used tools to test ideas and validate approaches.
Generally, it relies on solid statistical tools to support inferences over the evidence produced by experiment results, giving more confidence to the assumptions. To assess whether our technique reduces the test execution time, we took advantage of such mechanisms and executed studies to validate our approach. Experiments were conducted to validate the permutation technique presented in Chapter 4. A prototype tool was implemented to mechanize the permutation (details about this tool are given in Chapter 7). All empirical studies were performed in collaboration with the Motorola BTC program, which provided us with infrastructure and human resources to execute our experiments. We worked together with a specific test team inside the BTC. This team was chosen because of its innovative approach to improving the testing process: it was already using heuristics to generate test sequences manually [RBJ07], which had already brought some gains to its performance. This team allowed us to compare their current manual technique with our proposal. Two studies were carried out. The first involves data collection of the input data generation times (the time to set each input data item on the phones) and a comparative case study using its results; we refer to it as ES1-CS (Case Study). The second is a controlled experiment used to validate our approach against the manual one with more accuracy, by using more statistical tools and by measuring the total execution time; it is referred to as ES2-CE (Controlled Experiment). Another difference between the two studies is that ES1-CS performs the calculations considering only the time spent on the preparation of the phones (the Initial State test phase), which we call configuration time. The total execution time is the sum of the configuration time and the procedure execution time. As the techniques aim to discover reuses to be made only during the Initial State phase of test execution, we assume that the major influence on the time saved by a sequence comes from this stage. ES2-CE not only makes a more accurate comparison between the techniques, but also shows how reliable the assumption made in ES1-CS is, by verifying the influence of the configuration time on the total execution time.

5.1 EMPIRICAL STUDY 1 - CASE STUDY (ES1-CS)

Although this study is simpler, it was very important because it gave us precise data to run our technique and allowed a preliminary comparison with the manual one. It retrieved the times spent generating the input data needed on the mobile phones during test case execution. This information is the input to our technique, as we calculate the best sequences according to the maximum reuse and the input data generation times. For example, a test that needs one video file as input state can be more costly than another test case that needs three e-mail messages, because generating a single video may take more time than generating three e-mails. For that reason, the consistency of these times is important for the prioritization. The acquisition of this information does not necessarily require an experiment every time a prioritization takes place; it could come from the testers' expertise or from historical execution data. It is important to realize that, even coming from other sources, this information should reflect reality as closely as possible, otherwise the prioritization result may not be precise either.
Moreover, we used these times to calculate the cost of the manual technique as well. As we have the sequence generated by this technique together with the input and output states needed for each test, we can determine the configured and reused data, and the cost associated with that sequence. Therefore, with the results from both approaches we can compare their performances.

5.1.1 Experiment Planning

This experiment is justified by the need for reliable data to be used in the prioritization techniques that are compared later. It was executed in two phases: a data collection, and the test case prioritization together with comparisons based on the collected information. The data collection consisted of generating different kinds of input data repeated times, to give us the average effort spent in each configuration. This information was used to prioritize test cases and to compare, theoretically, the two techniques: the automatic permutation and the manual prioritization.

5.1.1.1 Goals

The following goals were identified for the first study:

G1: Identify the configuration times involved in a particular test suite.

G2: Identify which technique produces a sequence of tests that can be executed in less time and with less effort, considering solely the data generation time.

5.1.1.2 Participants

For the data collection, two testers were used, each with a different level of experience. Both are from the team mentioned earlier: one is an expert tester, while the other is a trainee with limited knowledge of mobile phone testing. These two different levels were chosen to try to dilute the influence of the experience factor on the experiment results.

5.1.1.3 Experimental Material

The objects used in the experiment are a test suite, a data survey of the test suite, the mobile phones used to execute the data generation, and the tools involved in the tasks.

Test Suite. The test suites are files which contain the steps to be executed on the mobile phone and the expected behavior. They can be split into two parts: the initial condition, where all information about the input state needed for the test execution is given, and the test procedure, which has all the steps that the tester should perform along the test. The test suite used in this experiment tests a specific component of the mobile phones. It was chosen for two reasons: first, it has 12 test cases, which fits the size restrictions of the permutation technique; second, its data survey (see below) was available, as the team had already used this suite with the manual technique. We only spent some time validating it and correcting inconsistencies.

Data survey. Before executing the experiment we had to know which input data we should use. They can be, for example, videos, e-mails, sound files or image files. Both techniques, the manual and the permutation, need a data survey of the test cases to be prioritized. As the objects of our study are test cases written in English, such a task is not mechanizable in general: a person reads the test cases and identifies the input state and output state of each one. Nevertheless, this task can be automated in some cases (one example is described later in Chapter 8). As explained before, the specific test suite chosen already had its data survey done by the testers; we only checked its conformance and corrected it when necessary. The data survey showed a total of 25 input data generations needed in this experiment (see Table 5.2).
The mobile phones. The phones used by both testers had the same hardware and software configuration, thus avoiding discrepancies in the results related to the equipment used by the two participants.

Tools. Two tools were involved in this experiment. The times were gathered using the ManualTest tool [AB08a], which is used during input data generation to collect the time spent on it: the testers indicate to the tool which activity is being performed while the tool records the time spent on it. At the end of the experiment, the tool generates logs with the times of each activity. To prioritize the test suites, the tool explained in Chapter 7 was developed. It prioritizes a set of tests according to the possible reuses, using one of the implemented techniques. We used the permutation approach, which traverses all possible sequences and returns the best ones, because we wanted to verify whether automatic prioritization based on reuse is better than manual prioritization, which is based mostly on reuse and the testers' experience.

5.1.1.4 Tasks

For each tester involved in the data collection, the following tasks were performed:

1. Listen to the explanation about the activities to be performed.
2. Become familiar with the ManualTest tool.
3. Receive the list containing the input data to be prepared.
4. Check whether all the equipment needed to perform the activities is available.
5. Start the execution of the activities using ManualTest.
6. When all tasks are finished, inform the tool of the end of the experiment.
7. Recover the logs containing the time taken to generate the input data.

5.1.1.5 Experiment Design

For each tester we created randomly generated lists with different execution orders of the input data generations, where each list contains all the elements discovered in the survey (see an example with 4 elements in Table 5.1). After the execution, the average time of each generation is calculated per tester. Finally, we take the overall average of the two testers' averages; this is the average time spent in each configuration. These data are used in the calculation of the time spent on input data generation by the prioritization technique.

Data   List 1                          List 2
1      Insert 1 DRM SD jpg activated   IM login (Power On)
2      Insert 1 Video                  Alert for IM
3      Alert for IM                    Insert 1 DRM SD jpg activated
4      IM login (Power On)             Insert 1 Video

Table 5.1 An example of two lists with random execution orders containing 4 input data generations.

5.1.2 Results and Analysis

The expert tester was able to complete 5 execution sequences, while the beginner tester finished only 3. They had only 2 hours to execute as many lists as possible. Their averages were calculated to generate the final average for each data generation. These results are shown in Table 5.2. The times presented in the column Averages were used to perform a case study with the test suite involved. This case study is explained next.

5.1.2.1 Case Study

Given the average results of the input data preparation, we could run our permutation technique and compare the cost of the generated sequences with the cost of the sequence created using the manual technique. The process to calculate the costs of the sequences is the same described in Chapter 4.
For both techniques, the costs were calculated taking into consideration both the stateless and the stateful approaches.

Data  Description                     Expert    Beginner  Averages
1     Insert 1 DRM SD jpg activated   00:01:03  00:00:50  00:00:57
2     Delete 1 DRM SD jpg activated   00:00:35  00:00:30  00:00:33
3     Insert 1 IM contact             00:00:59  00:00:51  00:00:55
4     Delete 1 IM contact             00:00:32  00:00:33  00:00:33
5     Automatic Lock (On)             00:00:25  00:00:24  00:00:25
6     Insert 1 Music file             00:00:17  00:00:27  00:00:22
7     Delete 1 Music file             00:00:26  00:00:18  00:00:22
8     Insert 1 Phonebook contact      00:00:36  00:00:22  00:00:29
9     Delete 1 Phonebook contact      00:00:17  00:00:13  00:00:15
10    Alert for IM                    00:00:21  00:00:17  00:00:19
11    IM login (Power On)             00:00:15  00:00:14  00:00:14
12    Insert 1 SIM contact            00:00:21  00:00:26  00:00:24
13    Delete 1 SIM contact            00:00:16  00:00:11  00:00:14
14    Insert 1 GIF                    00:00:30  00:00:18  00:00:24
15    Delete 1 GIF                    00:00:22  00:00:24  00:00:23
16    IM login (Auto)                 00:00:17  00:00:17  00:00:17
17    Alert (On)                      00:00:19  00:00:22  00:00:21
18    Insert 1 Video                  00:00:30  00:00:26  00:00:28
19    Delete 1 Video                  00:00:19  00:00:25  00:00:22
20    Insert 1 JPG                    00:00:16  00:00:12  00:00:14
21    Delete 1 JPG                    00:00:18  00:00:18  00:00:18
22    Insert 1 DRM FWL jpg            00:01:20  00:00:33  00:00:56
23    Delete 1 DRM FWL jpg            00:00:21  00:00:26  00:00:23
24    Insert 1 DRM CD jpg             00:00:57  00:00:39  00:00:48
25    Delete 1 DRM CD jpg             00:00:33  00:00:32  00:00:32

Table 5.2 Execution times of the configuration data collected in ES1-CS (hh:mm:ss).

Generation Type        Generation Time   # best results   Configuration Time
Manual (Stateless)     Around 2 weeks    1                641
Manual (Stateful)      Around 2 weeks    1                617
Automatic (Stateless)  53 sec            96               455
Automatic (Stateful)   7 min 21 sec      45360            431

Table 5.3 Comparison among the approaches.

The comparison between the prioritization techniques is shown in Table 5.3. In the first two rows we see that the manual methodology takes around two weeks to generate the sequences. This happens because the testers devote a small period of time per day to the sequence generation, so the whole process can drag on for two weeks. The manual approach is based on the testers' experience, always returns a single sequence, and that sequence is not guaranteed to be the best one. The last two rows show the outcome of our tool. Not surprisingly, its generation time is far better than the manual generation: the automatic prioritization generates sequences very quickly compared to the manual one. However, this time may be larger when the information has not yet been entered in the prioritization tool. In those cases, the time a tester takes to register such information must be added to the generation time. This time depends on the test suite size, but for a usual test suite it should take no longer than a few hours. Therefore, even in such cases the automatic technique is still the more reasonable solution. Notice that, depending on the amount of reuse, the number of best sequences can be very large (several sequences may have the same lowest cost). The less reuse there is among the tests, the larger the set of best sequences. This happens because the number of possibilities is huge: for example, for 12 tests there are 479,001,600 sequences, so there is a high probability that more than one best sequence exists. In the case of the stateful permutation, this number is even larger.
This happens because a particular test which does not bring any reuse when placed at a specific position in the sequence may well be placed in several subsequent positions with the same cost (provided that the subsequent tests do not delete configurations). This increases the number of possible best sequences in comparison to the stateless permutation. The automatic technique, in both the stateless and stateful approaches, improved the configuration execution time by around 30% in comparison to the manual technique (i.e., the time spent executing the configurations of the sequences generated by the tool is 30% less than the time spent executing the configurations of the sequence generated manually). This is not surprising, as the tool verifies all possible sequences and therefore guarantees the best results. Also, a sequence created manually has a high probability of not being the best one, because some groups of reuses can be difficult for humans to visualize, especially considering the number of possible permutations. Hence, providing an automatic generator is doubly advantageous: it produces sequences in less time, and the generated sequences can be executed faster. However, as said previously, this study is based on the assumption that the configuration time is the main factor responsible for the reduction in test suite execution time, as the procedure execution time should be constant regardless of the amount of reuse performed. To strengthen this assumption, we realized we needed to perform another experiment, in which we ask testers to execute the test sequences generated by the two techniques, this time including both the data configuration and the test procedures. The results achieved up to this point were mainly theoretical, as the testers did not execute the tests; they executed only the input data generation. The details of this experiment are described next.

5.2 EMPIRICAL STUDY 2 - CONTROLLED EXPERIMENT (ES2-CE)

The case study showed promising results in the comparison between our technique and the testers' technique. Therefore, we ran a more elaborate empirical study which could provide more confidence regarding our premises. With this in mind, we designed a controlled experiment, described in [LAIS09], to verify whether the total time spent in a test suite execution is influenced mainly by the configuration time rather than by the procedure time. As this is a formal experiment, it is described with a higher level of detail, based on a variety of works available in the literature that provided us with guidelines on how to design, analyze and report experiments in software engineering [Pfl94, Pfl95a, Pfl95d, Pfl95b, Pfl95c, JP05, JCP08].

5.2.1 Experimental Planning

We designed and ran a controlled experiment to evaluate the benefits provided by the manual and the automatic test prioritization techniques. The plan and design are described in what follows.

5.2.1.1 Goals

The research objective of this experiment is to: compare test case prioritization techniques and analyze their impact on the total configuration, procedure and execution times, from the point of view of test managers, in the context of manual execution of black-box tests of mobile phone applications. To achieve this objective, we define the following goal for this study (following the Goal/Question/Metric method [BCR94]):

G: Identify which investigated technique provides test sequences with smaller execution times.
The assessment of this goal is performed by answering the following research questions:

Q: When using the automatic technique instead of the manual one, what is the impact on the total
a. configuration time?
b. procedure time?
c. execution time?

Our research hypothesis is that the automatic prioritization technique provides a larger reduction of the total configuration time when compared to the manual one, but that both techniques have an equal impact on the total procedure time. This is understandable, as data reuse only affects the configuration time. For example, if a test reuses a GIF file from a previous test, the time saved for not inserting that GIF affects the configuration time. By definition, the execution time may be affected to some extent as well. However, the procedure time should not be affected, as the test procedure simply assumes that the input is already available, i.e., it assumes the existence of a GIF file in the phone.

M: To answer questions Q.a to Q.c, we use the following metrics:
- ConfDiff: the difference between the mean total configuration times of the automatic and the manual techniques.
- ProcDiff: the difference between the mean total procedure times of the automatic and the manual techniques.
- ExecDiff: the difference between the mean total execution times of the automatic and the manual techniques.

5.2.1.2 Participants

A group of 8 testers was allocated to the experiment. All testers belonged to the BTC program; 4 of them were experienced testers and the remaining 4 were beginners. The allocation of the testers was decided by their manager, according to their availability.

5.2.1.3 Experimental Material

The objects used in the experiment are 2 test suites, the mobile phones used to execute the tests, the tools involved in the tasks, and documents given to the testers as part of the experiment procedure.

The Test Suites. Two test suites are used in this experiment: test suite I and test suite II. The former contains 6 test cases and the latter contains 8 test cases. These suites were chosen according to the following criteria:
- Their sizes are smaller than 16, so that the permutation technique can run in feasible time.
- The input/output data survey is available. The data survey is the description of the inputs and outputs of each test in a machine-readable format. Recall that the tests are written in plain English, so this information has to be extracted manually.
- The number of tests to be executed should not be larger than 15, in order to allow the testers to carry out the experiment within 1 working day.

See Appendix A.1 for a detailed description of test suites I and II.

The mobile phones. The phones used in this experiment can be of two types: a Phone Under Test (PUT) or an auxiliary phone. All testers used the same PUTs throughout the experiment, all with the same hardware and software. On the other hand, the auxiliary phones can be of any kind, as they are not under test and are only needed for basic functions.

Tools. The same tools used in the first study were used here. To prioritize the test suites under the automatic technique, we used the tool described in Chapter 7, and we used ManualTEST [AB08a] to accurately record the time taken by the testers to execute the configuration and procedure activities.

Documents. Two documents were given to the testers during the experiment: the reuse guideline and the participant instructions.
The reuse guideline helps the testers to perceive the data reuse that they should make along the sequence. It contains a table in which the rows are the tests in the order of execution, and the columns hold the information about the input and output states of each test. Data reuse is marked through these states. The participant instructions provide detailed explanations about each step of the experiment, together with a few recommendations. The testers read the instructions just before the execution of the tests.

5.2.1.4 Tasks

The tasks to be performed by the participants during the experiment are:

1. Listen to the explanation about the experiment.
2. Receive the training on the ManualTEST tool.
3. Use ManualTEST in order to get familiarised with it.
4. Read the participant instructions document to fully understand the steps to be followed during the experiment.
5. For each test sequence, perform the following steps using ManualTEST:
   (a) Set up the phones: the tester executes the setup activity.
   (b) Execute the test sequence: the tester starts to execute the tests using the reuse guideline as reference. For each test, the following activities should be performed:
       i. Look for the required inputs in the reuse guideline.
       ii. Execute the input activity.
       iii. Execute the procedure activity.
6. When the execution of both test sequences is finished, the logs generated by ManualTEST are collected.

5.2.1.5 Hypotheses and variables

In this section, we present the statistical hypotheses that we want to test and the variables of our experiment. We define the following null statistical hypotheses H0i and their alternative hypotheses H1i, where i is the index of the related research question (Q.i). For instance, the null hypothesis H0a says that there is no statistical difference in the total configuration time between the manual and the automatic technique:

H0a: µConfTime_A = µConfTime_M        H1a: µConfTime_A ≠ µConfTime_M
H0b: µProcTime_A = µProcTime_M        H1b: µProcTime_A ≠ µProcTime_M
H0c: µExecTime_A = µExecTime_M        H1c: µExecTime_A ≠ µExecTime_M

We also identify the dependent and independent variables of our experiment, as presented in Table 5.4. In what follows, we describe each of the independent variables.

Table 5.4 Dependent and independent variables of the experiment.
Name        | Description                                                                                   | Levels or Range
Tester      | Participant that performed the test execution.                                                | 1..8
Test suite  | Set of tests executed by the tester.                                                          | I, II
Technique   | Test case prioritization technique.                                                           | M, A
ConfTime_t  | Total configuration time of a test sequence when using the test prioritization technique t.  | [0, ∞]
ProcTime_t  | Total procedure time of a test sequence when using the test prioritization technique t.      | [0, ∞]
ExecTime_t  | Total execution time of a test sequence when using the test prioritization technique t.      | [0, ∞]

Tester: this variable represents the employees that execute the tests and their experience in performing tests of mobile phones. There are 2 groups, the experienced testers and the beginners, each with four participants.

Test suite: to keep the experiment feasible, only 2 test suites were chosen: test suite I, with 6 tests, and test suite II, with 8 tests.
Technique: this variable represents the prioritization techniques involved in the experiment, which are the treatments to be applied to the problem. The automatic permutation technique is called A, and the manual technique is called M.

Other factors related to the environment (such as computers, tools, etc.) are controlled to minimize their impact on the test execution time.

5.2.2 Experiment Design

In this experiment, we need to control the effect of the testers' experience and the effect of the size and complexity of the test suites. These two sources of variation can have a strong effect on the configuration and procedure times. Also, there may be a learning effect that should be controlled when people repeat the same tasks. We selected the Latin square design to run the experiment [BHHH78, Kue99], since it is useful when the experimenter wishes to block two different sources of variation and to reduce the learning effect. Besides that, Latin square designs allow experiments with a relatively small number of runs in comparison to a complete randomized blocking design. This is one of the main motivations for using this specific design in our experiment, as we have a limited number of participants available.

In this design, the number of rows, columns and treatments must be equal, so that a square matrix can be built. In our case, a 2x2 matrix is used (see Figure 5.1). Once the rows (testers) and columns (test suites) are fixed, the treatments (test case prioritization techniques) are assigned to them at random, respecting the property that each treatment occurs once per row and once per column. This property is related to the reduction of the learning effect, as each tester applies a different prioritization technique to each test suite. The participants are randomly selected in groups of two persons with the same experience level. As there are 4 experienced testers and 4 beginners, we have 2x2 Latin squares replicated four times.

           Suite I   Suite II
Tester X   M         A
Tester Y   A         M
Figure 5.1 An example of a Latin square used in our experiment.

During the experiment, each tester receives the two suites to be executed, each one prioritized according to a different technique. In each Latin square, the test suites and the techniques alternate between the two testers. Figure 5.1 shows an example. Given a replicate with testers X and Y, X executes suite I (ordered by the manual technique) and suite II (ordered by the automatic technique), while Y executes suite I (ordered by the automatic technique) and suite II (ordered by the manual technique). Note that we only need to randomize the first allocation of technique/tester/suite in a 2x2 Latin square, since the remaining allocations follow from it.

5.2.2.1 Procedure

This section details the procedure used to run our experiment. Each tester executes her tasks in her own working environment, which makes the experiment close to the testers' reality. The executions are scheduled according to the testers' availability. We allocate two testers per day, both belonging to the same Latin square. With this constraint, we ensure that each replicate is executed under similar environment conditions. Also, the effect of different conditions occurring on different days is accounted for in the statistical analysis through the effect of each replicate.
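To make the allocation rule concrete, the sketch below is a minimal Java illustration (not part of the original study, and not an actual experiment script): it randomizes only the first cell of each replicated 2x2 Latin square and derives the rest, as described above. The tester labels and grouping are placeholders.

    import java.util.Random;

    // Illustrative only: builds four replicated 2x2 Latin squares for the
    // tester/suite/technique allocation described in Section 5.2.2.
    public class LatinSquareAllocation {

        public static void main(String[] args) {
            String[][] replicates = {
                {"Tester 1", "Tester 2"},   // experienced pair (placeholder labels)
                {"Tester 3", "Tester 4"},   // experienced pair
                {"Tester 5", "Tester 6"},   // beginner pair
                {"Tester 7", "Tester 8"}    // beginner pair
            };
            Random random = new Random();
            for (String[] pair : replicates) {
                // Randomize only the first allocation; the rest of the square follows from it.
                boolean firstGetsManualOnSuiteI = random.nextBoolean();
                String[] rowX = firstGetsManualOnSuiteI ? new String[] {"M", "A"}
                                                        : new String[] {"A", "M"};
                // Each treatment occurs once per row and once per column.
                String[] rowY = new String[] {rowX[1], rowX[0]};
                System.out.printf("%s: Suite I=%s, Suite II=%s%n", pair[0], rowX[0], rowX[1]);
                System.out.printf("%s: Suite I=%s, Suite II=%s%n", pair[1], rowY[0], rowY[1]);
            }
        }
    }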
We inform the participants that they should execute the tests as they normally do and that their execution times will remain private and will not be used for personal evaluation. This information is important so that they are not tempted to execute the tests either faster or slower than they usually do. The testers should verify that all needed resources are available on their desks (phones, cables, software, etc.) before the execution of the test suite. They are advised to avoid distractions and unnecessary interruptions during the test execution. When the test execution begins, the tester should indicate in ManualTEST which test is running and the type of the current activity (configuration or procedure). For any interruption during the test execution, the tester should pause ManualTEST as well; when the activity restarts, the tool should be resumed. When the test execution ends, we recover from the tester's machine the logs containing the times related to the configuration and the procedure activities. These logs are used to perform the analysis described in the next section.

5.2.3 Execution

This section presents the preparation procedures for our experiment and reports the deviations that occurred during its execution.

5.2.3.1 Preparations

To build the four replicated 2x2 Latin squares, we fixed the columns with the test suites I and II. Each tester was classified by their manager as expert or beginner. We allocated the testers to the rows in such a way that each Latin square replicate contains testers with the same level of experience. Finally, we randomized the allocation of prioritization techniques in each replicate.

5.2.3.2 Deviations

Two deviations occurred during the execution of our experiment. First, one of the expert testers had to leave on holidays before executing her tasks; hence, she was replaced by another tester with the same experience level. The second deviation was a problem related to the build installed on the phones in one of the executions. Two testers executed the tests with phones whose build was different from that of the previous executions. This build caused a different behaviour of the phones during the tests of suite I. Consequently, we discarded the execution for this suite. The execution for this suite was rescheduled to the following week, which also potentially reduces the learning effect.

5.2.4 Analysis

The data collected during the experiment is presented in Table 5.5. Each pair of times presented is the configuration time / procedure time observed for a combination of test suite (I or II), tester (1 to 8) and prioritization technique (A or M). We also present the average time per tester, per test suite, per replicate, and the grand average.

Table 5.5 Results from the Latin squares with respect to the total configuration time and the total procedure time.

The statistical analysis performed in this study is presented in the next sections. In all of them, we validated the assumptions of the Analysis of Variance (ANOVA): errors are normally and independently distributed with mean zero and homogeneous variances among treatments. We also validated an additional assumption of the statistical model (see Equation 5.1) used in our Latin square design using Tukey's Test for Nonadditivity
[SC89]: the effects of the row, column and treatment terms are additive (no significant interaction effects among testers, test suites and prioritization techniques [BHHH78]).

y_ijkl = µ + Rep_i + Tester_j[Rep_i] + TestSuite_k + Technique_l + Error_ijkl    (5.1)

The variables which appear in Equation 5.1 are:

µ: The overall mean.
Rep_i: Represents the ith Latin square replicate of the experiment.
Tester_j[Rep_i]: A nested variable that represents the jth row (tester) in the ith replicate. The identification of the tester in the statistical model requires not only the row of the Latin square (1 or 2) but also the Latin square replicate (1, 2, 3 or 4).
TestSuite_k: Represents the kth test suite used in the experiment.
Technique_l: Represents the lth technique under comparison (automatic or manual).
Error_ijkl: The error term.
y_ijkl: The observation (configuration or procedure time) for the lth treatment applied in the jth row (tester) and kth column (test suite) of the ith Latin square replicate.

5.2.4.1 Total configuration time

The results of the statistical analysis with respect to the total configuration time are presented in Figure 5.2. The column Prob > F indicates the probability of being wrong when rejecting the null hypothesis that a given factor has no significant effect on the response variable (time). As we can see, the ANOVA detected a significant effect of the investigated factor Technique, with p-value = 0.0069 at α = 0.05. Therefore, the null hypothesis H0a is rejected.

Figure 5.2 Results from the ANOVA with respect to the total configuration time.

In order to identify which technique provided the larger configuration time reduction, Figure 5.2 also shows a comparative analysis of the total configuration time when using the automatic (AUTO) and the manual (MAN) techniques. The comparative analysis presents the mean time (in seconds) using each technique and statistical information about their difference. For instance, the ConfDiff metric, representing the mean difference between the automatic technique and the manual one, was −216.125 seconds (approximately 3.6 minutes), with an upper confidence limit of −84.862 seconds and a lower confidence limit of −347.388 seconds at α = 0.05.

5.2.4.2 Total procedure time

Figure 5.3 shows the results of the ANOVA with respect to the total procedure time. In this analysis, the ANOVA did not detect a significant effect for the factor Technique (p-value = 0.2150) at α = 0.05. Hence, we cannot reject the null hypothesis H0b. This result can be better visualized by observing the confidence interval for the mean difference between the automatic technique and the manual one: although the estimated mean for this difference (ProcDiff) was −218.25 seconds, its confidence interval includes the value 0 (zero).

Figure 5.3 Results from the ANOVA with respect to the total procedure time.

5.2.4.3 Total execution time

Figure 5.4 shows the results of the ANOVA with respect to the total execution time. As the ANOVA detected a significant effect for the factor Technique (p-value = 0.0465) at α = 0.05, we reject the null hypothesis H0c. Therefore, there is a difference between using the automatic and the manual technique. The mean difference between the automatic technique and the manual one was −433.125 seconds (approximately 7.2 minutes).
This metric, defined earlier as ExecDiff, indicates that the automatic technique provided sequences that are executed in less time than those of the manual one.

Figure 5.4 Results from the ANOVA with respect to the total execution time.

5.2.5 Interpretation

This section discusses the data shown in the analysis and addresses the threats to validity.

5.2.5.1 Evaluation of results and implications

The analysis of the total configuration time shows that the techniques significantly affect the time taken by the configuration activity. This result rejects the null hypothesis H0a and supports the alternative hypothesis H1a. The configuration time is strongly affected by the data reuse performed over the test suite execution. On average, the use of the automatic prioritization technique in our experiment reduced the total configuration time by 25%.

Regarding the total procedure time, the null hypothesis H0b cannot be rejected and therefore we conclude that the techniques do not affect the total procedure time. This was expected, as the time spent in the procedure activity is not affected by the data reuse present along a sequence. The procedure activity is executed under the assumption that the configuration activity (and all data reuse) is done.

Finally, the most encouraging result came from the total execution time: the total configuration time is relevant enough to affect the total execution time. We were able to reject the null hypothesis H0c with a p-value of 0.0465 at α = 0.05, indicating that the prioritization techniques affect the total execution time. This is understandable, as the data reuse happens during the configuration activity, which is part of the execution activity. On average, the automatic technique reduced the total execution time in our experiment by 13.5%. These results show that sequences generated by the tool using the automatic technique are executed in less time than those created by testers using the manual methodology. This is understandable, as some reuse scenarios captured by the software can be difficult for human beings to visualize. Another important aspect shown in this experiment regards the reliability of the data reuse concept, as the data reuse reflected in the total configuration time is representative enough of the total execution time.

5.2.5.2 Threats to Validity

In this section, we present some of the potential threats to the validity of our experiment.

Internal validity: The results we gathered from this experiment could have been affected by the following factors. 1) Defects in the prioritization and the measuring tools. This threat has been controlled by testing the code of the tools and validating their outputs before the start of the experiment. 2) As we ran each Latin square on a different day, the testers' motivation may differ depending on their workload on that day. In order to minimize this threat, we motivated the testers by showing them the importance of this study to their work process. 3) Network problems during the test executions. Some testers faced a network problem while executing their tests: the server of the mobile phone provider was taking too long to reply to the phone's requests. To control this problem, whenever a connection to the server was too slow, the testers stopped the time counting during the server's downtime.

Construct validity: We use data reuse to reduce the time spent on test execution.
As time is our metric for prioritization effectiveness, its measurement must be well controlled. In order to avoid misuse of the ManualTEST tool, two countermeasures were taken. First, before the start of the experiment, we taught the testers how to use the tool; the participants had to try it a few times until they felt confident. Second, we watched the testers during the experiment in order to avoid inappropriate use of the tool. We also needed a monitor to check whether the testers recorded the time taken on each activity correctly. The presence of a monitor could affect the behaviour of the testers during the sessions, despite the fact that they had previously been told to ignore the monitor's presence.

External validity: The generalisation of our results could be threatened by the following factors. 1) Object representativeness. We chose our test suites mostly based on size. The prioritization techniques might have a different effect on test suites of sizes different from those of suites I and II. Moreover, suites I and II might not be general enough to extrapolate the techniques to all kinds of test suites. For instance, test suites with no data reuse do not benefit from the automatic technique. 2) Few testers were available. Therefore, we chose them mostly by availability, instead of performing a probabilistic sampling. 3) Testing process representativeness. The testing process used in the company may somehow affect the benefits of each test prioritization technique. We control this threat by running the experiment using the BTC testing process, which adopts most of the good practices in software testing.

Conclusion validity: 2x2 Latin squares have few degrees of freedom for the error term, limiting the use and the precision of statistical analyses (ANOVA, etc.). We controlled this threat by replicating the Latin square four times in our experiment, which increases the degrees of freedom of the error term.

5.3 CONCLUDING REMARKS

Both empirical studies show results that are relevant to the assessment of the permutation technique as a better alternative to the manual technique. The first study provides reliable input information to our technique through the times collected from the data generation. It also presents a preliminary result, in the case study, showing that the sequences generated by the permutation technique are executed in 30% less time than those of the manual technique with respect to configuration time. The second experiment is more formal and complete than the first. It considers not only the configuration time but the total execution time, which is the sum of the configuration time and the procedure time. Two assessments are detailed in this experiment: the influence of the configuration time on the total execution time, and another comparison between the manual and permutation techniques. Both results are encouraging, as the time saved in the configuration phase proved to be relevant to the total execution time, with statistical significance, and the sequences from the permutation technique were executed in 13.5% less time than those of the manual technique.

Having shown significant evidence of the improvement obtained by the use of the permutation technique over the manual technique, we studied and developed alternatives for those cases where the size of the test suite is large. In the next chapter we detail heuristics developed to allow automatic test case prioritization based on the data reuse concept for such cases.
CHAPTER 6

PRIORITIZATION HEURISTICS

As seen in Chapter 3, the existing initiatives toward test case prioritization propose different kinds of heuristics for a variety of goals, few of them addressing black-box testing. As already mentioned, our objective is to reduce the test suite execution time. To address this problem we used a metric based on data reuse. Chapter 5 provides reliable evidence of the relevance of this metric, supported by experiments, and it shows the good results achieved by the permutation solution compared to a manual prioritization heuristic. However, permutation is infeasible for large suites. To overcome this limitation we decided to implement heuristics that help testers find good sequences of tests to be run, reusing data along the executed suite and therefore reducing its execution time.

The use of heuristics is needed when there is no practical way to find the optimal result, although heuristics do not always guarantee the discovery of the optimal solution. As the number of elements increases, the state space to be checked grows exponentially; hence, determining whether the solution found by a heuristic is the best one is intractable in those cases. However, a good heuristic with a reliable heuristic function generally brings results as close as possible to the optimal, in addition to specific advantages depending on the heuristic, such as lower memory usage or faster processing.

As we already have a cost function, detailed in Chapter 4, we used it together with some artificial intelligence concepts and produced three heuristics. The first uses a greedy algorithm, as it takes the next locally optimal solution regarding the reuse between a set of tests. The other two are based on global optimization concepts: a simulated annealing approach and a genetic algorithm solution. These three approaches are detailed next.

6.1 GREEDY

The term greedy comes from the fact that the decision about the construction of the solution is made based on the best next option: the partially built solution is extended with the next locally optimal state, and the process repeats until a complete solution is built. Such algorithms resemble depth-first search in that they follow a single path all the way to the goal, backtracking when a dead end is hit. Generally, such strategies come with some benefits: minimal search cost, often quite good performance, and quickly found solutions. Nevertheless, such approaches do not always find the optimal solution, because they just take the immediately best choices instead of analyzing long-term options that could lead to better results.

Following this idea, we developed an approach that builds a sequence of tests by iteratively finding the next optimal test to be added until the whole list is filled; in other words, we search for the next test that reuses the most data given the sequence built so far and attach it to the sequence; the process repeats as long as there are unallocated tests. This approach runs very quickly compared to the permutation algorithm, and the results produced are good enough, as reuse is obtained. However, sometimes the best solution is not found, because the test with the best immediate reuse may prevent arrangements that would yield even more reuse later on. Moreover, depending on the choice of the first element of the sequence, the result can vary drastically.
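As an illustration of this basic greedy construction (the restart-based refinement and the corresponding pseudocode are given next), the following is a minimal Java sketch; the Test type, the cost model and the reuse computation are stand-ins for illustration, not the tool's actual classes.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the basic greedy construction described above.
    public class GreedySketch {

        public static class Test {
            public final String id;
            public Test(String id) { this.id = id; }
        }

        public interface CostModel {
            // Configuration cost of appending 'next' to the sequence built so far;
            // a smaller cost means more data reuse.
            int additionalCost(List<Test> prefix, Test next);
        }

        public static List<Test> buildSequence(Test first, List<Test> tests, CostModel cost) {
            List<Test> sequence = new ArrayList<>();
            List<Test> remaining = new ArrayList<>(tests);
            sequence.add(first);
            remaining.remove(first);
            while (!remaining.isEmpty()) {
                Test best = null;
                int bestDelta = Integer.MAX_VALUE;
                for (Test candidate : remaining) {
                    int delta = cost.additionalCost(sequence, candidate);
                    if (delta < bestDelta) {
                        bestDelta = delta;
                        best = candidate;
                    }
                }
                sequence.add(best);      // local optimum: the test that reuses the most data next
                remaining.remove(best);
            }
            return sequence;
        }
    }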
To address this problem we decided to run the algorithm n times, where n is the size of the set of tests, so that each test is allocated once as the first element. Finally, the sequence with the best cost is returned. As the greedy algorithm executes very fast, this approach remains feasible and allows us to return the best greedy sequence. The major steps are detailed in Algorithm 6.1.

Algorithm 6.1 greedy()
Require: tests[]
1: bestCost ⇐ MAX-INTEGER
2: for i = 0 to tests.length − 1 do
3:   h[] ⇐ findGreedySequence(i, tests)
4:   if cost(h) < bestCost then
5:     bestCost ⇐ cost(h)
6:     bestSequences.clear()
7:     bestSequences ⇐ h[]
8:   else if cost(h) = bestCost then
9:     bestSequences ⇐ h[]
10:  end if
11: end for
12: return bestSequences

The main idea is to create n greedy sequences, where each test is allocated once as the first element. At line 1 we assign the largest integer to the variable that holds the best cost found so far. From line 2 to 11 we iterate over the n greedy sequences. Line 3 calls the method that builds the greedy sequence in which the ith test is allocated as the first element. Lines 4 to 7 check whether the cost of the ith greedy sequence is better than the best result found so far; when it is, the best cost and the list of best sequences are updated. Lines 8 and 9 check whether the cost of the sequence found equals the best cost found and, in that case, add the sequence to the list of best sequences. Line 12 returns the greedy sequences with the lowest cost.

The method findGreedySequence(i, tests), called in line 3, is described in Algorithm 6.2.

Algorithm 6.2 findGreedySequence(headIndex, tests)
Require: headIndex, tests[]
1: head ← tests[headIndex]
2: state.add(head)
3: tests[] ← tests[] − head
4: for i = 2 to tests.length do
5:   aux ⇐ findBestReuse(state, tests[])
6:   state.add(aux)
7:   tests[] ← tests[] − aux
8: end for
9: return state.sequence

From line 1 to 3, the ith element is assigned as the head of the sequence inside the state variable, which maintains the sequence and calculates the cost of adding a new test based on the state of the phones accumulated so far. Line 3 removes the head element from the remaining tests to be compared. From line 4 to 7, the greedy sequence is built with the remaining elements. The method called in line 5 performs the local optimization present in greedy algorithms: it returns the next test that can be added to the state with the lowest cost. This test is then added to the state in line 6 and removed from the remaining set of tests in line 7. Line 9 returns the greedy sequence found.

6.2 SIMULATED ANNEALING

Some problems have the property that the state description itself contains all the information needed to discover a solution. For these kinds of problems, one practical solution is to use iterative improvement algorithms. The general idea is to start with a complete configuration and iteratively perform modifications to improve its quality. The best way to understand such algorithms is to picture all the states laid out on the surface of a landscape: the height of any point corresponds to the evaluation function of the state at that point, and the main objective is to search the landscape for the highest peaks, which correspond to the optimal solutions. Two well-known algorithms of this class are Hill-Climbing and Simulated Annealing [RN03].
Hill-Climbing starts from an initial configuration, generates its successors, and updates the current configuration to the best successor. The process repeats until no successor outperforms its parent. The problem with this algorithm is that it easily finds peaks, but there is no guarantee that the peak found is the highest, or a high enough, one. The peak found is the one closest to the initial configuration; hence, the choice of this initial configuration is crucial. Some approaches can be used to improve this algorithm, such as Random-restart Hill-Climbing, in which, whenever a peak is reached, the algorithm restarts from another random location.

Simulated Annealing can be viewed as an optimized Hill-Climbing. Instead of picking the best successor, this strategy picks a random one. If its cost is better than the parent's, the current sequence is updated; otherwise, there is an associated probability of taking even a worse successor. This move is justified to escape points that would lead to false peaks (local maxima rather than optimal solutions). The term Simulated Annealing comes from the industrial process of annealing, the gradual cooling of a liquid until it freezes, and the iterative behaviour of the algorithm follows this idea. A parameter T is used to represent a temperature, and a second parameter, coolingRate, is used to decrease the value of T; when T reaches zero the algorithm halts. At higher values of T, the algorithm tends to accept bad moves.

Based on this idea, we developed a test prioritization algorithm using concepts of simulated annealing, shown in Algorithm 6.3. It requires an initial configuration in which the tests are arranged, the temperature, and the rate at which this temperature should be cooled.

Algorithm 6.3 simulatedAnnealing()
Require: current[], an initial list of tests
         T, a temperature
         coolingRate, the rate at which the temperature is decreased
1: while T > 0 do
2:   next[] ⇐ chooseSuccessor(current)
3:   ΔE ⇐ cost(current[]) − cost(next[])
4:   if ΔE > 0 then
5:     current[] ⇐ next[]
6:   else
7:     current[] ⇐ next[] only with probability e^(ΔE/T)
8:   end if
9:   T ⇐ T × (1 − coolingRate)
10: end while
11: return current

From line 1 to 10 we have a while loop that iterates until the temperature decreases to (approximately) zero. At line 2, the successors of the current node are generated. Each one is generated by exchanging the position of one test with its front neighbour (e.g., for a sequence {T1, T2, T3, T4} three successors are generated: {T2, T1, T3, T4}, {T1, T3, T2, T4} and {T1, T2, T4, T3}), as illustrated in Figure 6.1. When all successors have been created, one is randomly selected to be compared with its parent (the current sequence). Line 3 calculates the cost variation between the current sequence and the chosen successor. Lines 4 to 8 verify whether ΔE is greater than 0, which means that the successor is more economical, in which case the successor becomes the current sequence at line 5; otherwise, there is a chance that even a worse successor takes the place of its parent, with the probability shown at line 7. At the end of the while loop the temperature is cooled according to the cooling rate. Finally, when the temperature reaches (approximately) zero, the current sequence discovered so far is returned, as stated in line 11.

Figure 6.1 An example of successors of a state in our simulated annealing algorithm.
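As a complement to Algorithm 6.3, a compact Java rendering of the loop is sketched below; it mirrors the adjacent-swap successor generation and the acceptance rule described above (picking a random swap directly is equivalent to generating all successors and selecting one at random). The class and method names, the integer encoding of tests and the cost interface are illustrative, not the tool's actual code; the sequence is assumed to have at least two tests.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustrative simulated annealing loop over test sequences (indices stand in for tests).
    public class SimulatedAnnealingSketch {

        public interface SequenceCost {
            int cost(List<Integer> sequence); // total configuration cost of a sequence
        }

        public static List<Integer> anneal(List<Integer> initial, double temperature,
                                           double coolingRate, SequenceCost costFn) {
            Random random = new Random();
            List<Integer> current = new ArrayList<>(initial);
            double t = temperature;
            while (t > 0.001) { // an approximation of "until T reaches zero"
                // Successor: swap a randomly chosen test with its front neighbour.
                int i = random.nextInt(current.size() - 1);
                List<Integer> next = new ArrayList<>(current);
                next.set(i, current.get(i + 1));
                next.set(i + 1, current.get(i));

                int deltaE = costFn.cost(current) - costFn.cost(next);
                // Accept better moves always; worse moves with probability e^(ΔE/T).
                if (deltaE > 0 || random.nextDouble() < Math.exp(deltaE / t)) {
                    current = next;
                }
                t = t * (1 - coolingRate); // cool down, as in line 9 of Algorithm 6.3
            }
            return current;
        }
    }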
Despite the very simple choice of successors, this algorithm can be very effective due to its iterative nature. Another aspect is the acceptance of "bad" moves early on, to avoid local maxima and find optimal solutions. One of the great advantages of such algorithms is that they are memoryless: as only the current state is kept, very little memory is required to run them. Another version of this algorithm was developed in which the temperature parameter is replaced by the time, in seconds, that the user wants the technique to run. The cooling rate parameter is then no longer needed, because the temperature is cooled as the time runs out. Note that in the experiments performed with the heuristics, explained later in Section 6.4, we used the first approach, with the two parameters. This alternative version has not been used in this project; it was developed for use in future experiments.

6.3 GENETIC ALGORITHM

Apart from the two heuristics presented in the previous sections, another heuristic, based on a Genetic Algorithm, has been implemented in [SMP09, Sal09], where the authors also carry out a comparison among the three heuristics, as reported in the next section. Here we briefly summarise the Genetic Algorithm implementation of [SMP09, Sal09] to provide the context for that comparison.

A Genetic Algorithm [MC96] is an iterative procedure that maintains a population of individuals; these individuals are candidate solutions to the problem being solved, in our case sequences of test cases. Each iteration of the algorithm is called a generation. During each generation, the individuals of the current population are rated for their effectiveness as solutions; this step is called assessment. Based on these ratings, a new population of candidate solutions is formed using specific genetic operators. Each individual is represented by an encoding and is called a chromosome. Every chromosome consists of genes, which have specific values; a gene is represented by a test case at some position in the sequence. Basically, the algorithm consists of a number of iterations performing assessments and improvements on the sequences being generated. A simplified description is shown in Algorithm 6.4. Four parameters are required: the size of the population maintained throughout the algorithm execution, the number of iterations the algorithm executes, and the crossing and mutation rates, which represent the percentage of crossings and mutations that may be performed on each individual. Initially, a population is generated randomly in line 1 and, from line 2 up to line 7, a series of repetitions is made, representing the evolution of this initial population. Each iteration consists of an assessment, where the cost of each individual is calculated, followed by the selection of a subset representing the best individuals in line 4. In line 5 some crossings are performed with the selected individuals, generating new ones. The crossing consists of taking two individuals and exchanging some of their genes.
Algorithm 6.4 geneticAlgorithm()
Require: tests[], the set of tests
         popSize, the size of the population that will be generated
         iterations, the number of iterations
         cRate, the crossing rate
         mRate, the mutation rate
1: generatePopulation(popSize, tests[])
2: for i = 1 to iterations do
3:   assessment()
4:   selection()
5:   crossing(cRate)
6:   mutation(mRate)
7: end for
8: return bestResult

To avoid a tendency of quick convergence to local optima instead of global optima, mutations are performed in line 6. A mutation introduces a random change in the chromosome: a gene of the chromosome is randomly selected to be exchanged with another gene of the same chromosome. The Genetic Algorithm repeats these steps until all the previously determined iterations have been run. The best chromosome found so far is then returned as the result.

With these three heuristics implemented, we performed a series of case studies to assess their effectiveness. These studies are described next.

6.4 CASE STUDIES

With the use of the tool detailed in Chapter 7, some experimentation has been conducted in [SMP09] regarding the proposed heuristics. As the previous experiments, described in the last chapter, already showed the effectiveness of the permutation technique (which is automated by this tool) compared to the manual methodology applied by testers, here we summarise results that complement those obtained in [LAIS09] and that are relevant to the cases where neither the permutation nor the manual methodology can be applied, namely when a large number of tests is being prioritized. For these cases the three heuristics described above were implemented in the tool, and some empirical studies were performed to assess their efficiency in producing good enough prioritizations. The rest of this section summarises what is presented in [SMP09].

To perform these studies, three test suites were chosen. They are used to test specific components of mobile phones, and each one has a set of data configurations needed to execute the procedures of its tests. We refer to them as test suite 007, with 17 test cases, test suite 016, with 11 test cases, and test suite 158, with 72 test cases, totalling 100 tests. An initial analysis effort was carried out to retrieve the input/output information of each test and to register the associated data and test cases in the tool. With all the material inserted in the tool, we planned the experiment. The test suites were arranged in five different groups to be executed with the different techniques:

Group A: Test suite 007 (17 test cases) + test suite 016 (11 test cases) + test suite 158 (72 test cases) = 100 test cases
Group B: Test suite 007 = 17 test cases
Group C: Test suite 016 = 11 test cases
Group D: Test suite 158 = 72 test cases
Group E: Test suite 007 + test suite 016 = 28 test cases

Some of the techniques require parametrization. The Simulated Annealing technique needs two parameters, the temperature and the cooling rate. We have used the parameters suggested by Manikas and Cain [MC96]: Temperature = 1000 and Cooling Rate = 0.9. Regarding the Genetic Algorithm technique, four parameters are needed; to find the best parametrization for each group, a preliminary experiment was conducted, as described next.
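For illustration only, with the Java sketch given in Section 6.2 the Simulated Annealing parametrization above would correspond to a call like the one below; the initial sequence and the cost function here are dummy placeholders, not data from the study.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative use of the SimulatedAnnealingSketch with Temperature = 1000
    // and Cooling Rate = 0.9 (the cost function below is a dummy stand-in).
    public class AnnealingParametersExample {
        public static void main(String[] args) {
            List<Integer> initialSequence = Arrays.asList(0, 1, 2, 3, 4);
            SimulatedAnnealingSketch.SequenceCost costFn = seq -> seq.size(); // placeholder cost
            List<Integer> prioritized =
                    SimulatedAnnealingSketch.anneal(initialSequence, 1000, 0.9, costFn);
            System.out.println(prioritized);
        }
    }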
6.4.1 Genetic Algorithm Parametrization

The experiments involved the execution of the algorithm with the following ranges of parameters (after each parameter we explain why the range was chosen):

a. Population Size: between 50 and 100 chromosomes, in steps of 10. A small population leads the algorithm towards a local minimum; a large population increases the chance of finding a better solution, but can cause the algorithm to run very slowly.
b. Crossing Rate: between 0.5 and 0.8, in steps of 0.1. A low crossing rate means little reuse of existing information, while a high crossing rate can cause premature convergence, in other words, a fast homogenization of the population.
c. Mutation Rate: 0.1 or 0.2. A high mutation rate can turn the Genetic Algorithm into a random search.
d. Number of iterations: between 10 and 50, in steps of 10. Many iterations make the algorithm very slow.

Due to the randomness of the algorithm, every parameter combination was executed 3 times. The experiments were carried out on a computer with the following characteristics: Core 2 Duo 2.2 GHz processor, 2 GB of RAM, Windows XP operating system. The best results for each group are presented in Table 6.1.

Table 6.1 Best parameters for the Genetic Algorithm found in the experiments performed.
Group of test cases | Number of test cases | Population Size | Crossing Rate | Mutation Rate | Number of Iterations
Group A             | 100                  | 60              | 0.5           | 0.2           | 30
Group B             | 17                   | 50              | 0.5           | 0.2           | 10
Group C             | 11                   | 50              | 0.5           | 0.1           | 10
Group D             | 72                   | 80              | 0.5           | 0.2           | 20
Group E             | 28                   | 50              | 0.5           | 0.1           | 10

As we can see, the best parameters for the Crossing Rate and the Mutation Rate are more or less the same across groups. For small sets of test cases, a population size around 50 and 10 iterations are a good choice. However, for larger sets, such as 72 test cases, it was necessary to increase the population size and the number of iterations.

6.4.2 Heuristic Techniques Comparison

With the Genetic Algorithm parameters defined, it is possible to run an experiment with the three techniques: Greedy (no parametrization needed), Simulated Annealing (Temperature = 1000 and cooling rate = 0.9, as explained earlier) and Genetic Algorithm (with the parameters obtained from the previous runs). For each group of tests, the prioritization techniques were executed 10 times, and the best and the average results of each technique for each group were catalogued. The results are shown in Table 6.2.

Table 6.2 shows that the results found for suites of up to 72 test cases were the same. However, for the set with 100 test cases the results found by the Genetic Algorithm were the best. Although the time spent by the Genetic Algorithm is longer, it is still of the same order of magnitude as Greedy and Simulated Annealing; we can therefore conclude that, with our parameters for this problem, the Genetic Algorithm finds better results for the larger sets of test cases, while Simulated Annealing and Greedy can be used with little or no parametrization and find results identical or very close to those of the Genetic Algorithm. Therefore, all three heuristics can be rated as having similar effectiveness.

In the next chapter, we detail the tool implemented to support automatic test case prioritization, which was used to perform all experiments described in this and the previous chapters, and which can be used to help testers execute their test suites more quickly.
Table 6.2 Results of experiments executed with different sets of test cases.

                         Genetic Algorithm        Simulated Annealing      Greedy
                         Cost (s)   Time (ms)     Cost (s)   Time (ms)     Cost (s)   Time (ms)
Group A (100 tests)
  AVERAGE                16642      5232.8        16650      139           16758      2201.7
  BEST                   16580      3844          16650      62            16758      1031
Group B (17 tests)
  AVERAGE                303        184.5         303        24.8          303        37.5
  BEST                   303        125           303        15            303        31
Group C (11 tests)
  AVERAGE                288        79.7          288        26.4          288        23.6
  BEST                   288        63            288        15            288        16
Group D (72 tests)
  AVERAGE                16368.2    3884.4        16424      54.7          16491      403.2
  BEST                   16326      2500          16424      31            16488      359
Group E (28 tests)
  AVERAGE                327        220.2         327        26.6          327        59.3
  BEST                   327        203           327        15            327        47

CHAPTER 7

MECHANIZATION

To enable the use of all the test case prioritization techniques based on data reuse presented so far, we developed a tool that implements all the algorithms. Such mechanization is particularly important because, at the BTC, prioritization is performed manually. The requirement analysis was performed through meetings with the stakeholders of the project, to elucidate concepts related to the mobile phone domain and to understand aspects involved in the manual prioritization done by the testers. Through these interviews we discovered which information is needed to execute automated test case prioritizations; moreover, they allowed us to research and develop our techniques.

Among its features, the tool allows us to store input/output data, setups, resources and test cases. Moreover, the user can select a set of the catalogued tests, together with a technique, to perform the prioritization. The best results are displayed with the associated cost, i.e., the time to perform the configurations along the sequence, and the time taken to generate the sequences. Its GUI (Graphical User Interface) facilitates the choice of a sequence among those with the same cost. User-defined clusters are also supported, as explained earlier in Section 4.1.1. Next we detail the tool from a structural point of view and describe the technologies involved. Then the functionalities are described and illustrated by some snapshots.

7.1 SOFTWARE STRUCTURE

This section details the architecture of the Prioritization Tool and overviews some of the project's decisions.

7.1.1 Architecture

Figure 7.1 illustrates the architecture of the tool. It was built using Java [GJSB05], a strongly typed language that enables the development of portable applications and has a large number of libraries (APIs, Application Programming Interfaces). Our application was developed on top of two Java-based platforms: the Eclipse Rich Client Platform (RCP) [ML05], for controlling the software life-cycle, and the Prevayler library, to manage the persistence of the information handled by the tool. They are described next:

Eclipse Rich Client Platform (RCP) [ML05]: A dynamic plug-in based framework for developing client applications.
It provides a simple, powerful and flexible way to create desktop systems using the Java language and can be understood as a middleware that makes available a series of facilities, such as a flexible UI (User Interface) paradigm, scalable UIs, extensible applications, help support, context-sensitive help, portability, error handling, and a simple install and update mechanism. The Eclipse Standard Widget Toolkit (SWT) provides a graphical user interface (GUI) toolkit for Java that allows efficient and portable access to the native UI facilities of the operating system. The JFace component works on top of SWT to provide helper classes for handling common UI-related tasks, and the Workbench component organizes the UI into graphical components such as views, perspectives, action contributions and editors. Moreover, we used an API that provides advanced and custom widgets (GUI components) from the Nebula Project (see http://www.eclipse.org/nebula/), a project inside the Eclipse organization that aims to provide more complex widgets.

Prevayler: Prevayler is an open source persistence library for Java [GJ05]. It is an implementation of the Prevalent System architecture pattern, in which data is kept hot in memory, with changes journaled for system recovery. Its usage is very simple: the programmer does not need to run a database server (the Prevayler API persists all the data on disk), and no configuration is needed; the developer can persist the business objects of her application directly, without making them extend any base class or implement any interface in order to be persistent. The only restriction is to encapsulate all modifications of the business objects into instances of the Transaction interface, much like the Command pattern [GHJV94]. Whenever a transaction is initiated, the object is persisted in a journal, so that data is not lost if the system crashes. Moreover, the state of the entire business object model can be saved any time the developer wishes. The execution is very fast, as all data lives in memory, so the only disk access is the streaming of transactions to the journal. This API is highly recommended for simple applications that do not require much database control. The choice of this library was very important due to its simplicity, allowing us to focus on the development of the techniques and not on storage issues.

Figure 7.1 The Architecture of the Prioritization Tool.

In addition to the technologies described so far, we developed our tool following a basic RCP architecture with two modules. The core module contains the business logic, with all business objects and their relationships; it also has the implementation of all the prioritization techniques already explained and the submodule responsible for the persistence transactions of our data. The UI module has all the graphical implementation for the presentation of the tool, providing screens for data registration and prioritization executions. These modules communicate through a Facade [GHJV94], as illustrated in Figure 7.1. They are described next.

7.1.1.1 Core Module

This module is divided into four packages: Entities, Prioritization, Persistence and Util. Each one is described below:

Entities: This submodule has all the business objects needed to perform the prioritizations. Some of them were discovered through the interviews with the testers, while others come from architectural decisions.
Figure 7.2 presents a simplified UML class diagram [OMG03] that reflects this package. The main class is TestCase: it has a set of Setups, which are the configurations used in the execution of the whole sequence. They can be of two types: Flex Bit or General Configuration. A Flex Bit is a binary variable that has to be set on the phones to enable or disable certain features (like Instant Messaging, E-mail, Browser), while a General Configuration is some other initialization needed to use those features on the phones of a specific test suite. A FlexBit by itself has no value in our tool; only when it is associated with a TestCase (a TestFlexBit) is a value assigned to it. Each test may be associated with a set of Resources, e.g. a bluetooth headset, and each one penalizes the sequence cost with the weight of the resource, which represents the time spent initializing that item. A test case may also be associated with a set of phones, each of which may be either a PUT or an Auxiliary Phone. Each phone contains two DataGroups, one representing the input state and the other containing the output state. When executing stateful prioritization, the state accumulated through the sequence is represented by the class State, which holds the phones with the aggregated data. Finally, a TestGroup represents a set of test cases with no conflicting Flex Bits. When there are Flex Bit conflicts among tests of the same suite, the tests should be allocated to different test groups and prioritized separately, because a bit change along the test execution is very costly; that is why these configurations are executed beforehand in the sequence. A Cluster is a specialization of a test case, because it behaves like one during the prioritization, as explained in Section 4.1.1.

Prioritization: All prioritization techniques are implemented in this submodule: Stateless Permutation, Stateful Permutation, Greedy, Simulated Annealing and Genetic Algorithm. They all implement the same interface, Prioritization. Any other technique can be added to the tool by implementing this interface.

Persistence: The persistence submodule provides the transactions needed to persist the information about our business objects, and a class that simulates a repository of all our objects. This repository maintains lists of the data we want to persist and provides methods to manipulate these lists. For example, to create a transaction that persists a test case, we created the AddTestCase class, which implements an interface provided by the Prevayler API; it simply has a method that calls the repository to add a test case. Each transaction is atomic and its implementation is very simple (the other transactions follow the same pattern as AddTestCase).

Util: This submodule provides some useful classes to help with phone and test case calculations and comparisons, the permutation algorithms used in both the Stateless and the Stateful approaches, and some classes to help with debugging.

Figure 7.2 Class diagram of the entities submodule.

7.1.1.2 UI Module

This module is responsible for the presentation and the usability of the tool. It is composed of the Eclipse Workbench UI objects used to create the GUI, with components like the Action Bar filled with the Menu Actions. There are five Menu Actions for registering information (Data, Setup, Resource, Test Case and Clusters), one for prioritization, and another to close the tool.
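Before turning to how the screens are composed, the Prioritization extension point mentioned above can be made concrete with a minimal sketch. The interface shape, the TestCase stub and the hypothetical RandomRestartHillClimbing technique below are our own illustrations, not the tool's actual code or signatures.

    import java.util.Collections;
    import java.util.List;

    // Minimal stand-in for the tool's test case entity.
    class TestCase {
        final String id;
        TestCase(String id) { this.id = id; }
    }

    // Illustrative shape of the Prioritization interface: each technique receives
    // the selected tests and returns the best sequences it could find.
    interface Prioritization {
        List<List<TestCase>> prioritize(List<TestCase> tests);
    }

    // A hypothetical additional technique plugged in by implementing the interface.
    class RandomRestartHillClimbing implements Prioritization {
        @Override
        public List<List<TestCase>> prioritize(List<TestCase> tests) {
            // Search logic omitted; a real technique would explore neighbouring
            // sequences and keep those with the lowest configuration cost.
            return Collections.singletonList(tests);
        }
    }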
Each screen is composed of views arranged in a defined Perspective, and each view is formed by widget components. Most of these widgets come from the SWT library, but some screens use advanced widgets provided by the Nebula project. For example, the AddData action, when selected, opens the window displayed in Figure 7.3. It uses two views, the AddDataView (left) and the ShowDataView (right), which are composed of various kinds of widgets: labels, input fields, buttons and tables. The AddDataView allows users to insert a data item that has an identification, a description and the parent feature it belongs to. A feature is a general classification for a set of tests, data and setups; Instant Messaging, for instance, is a feature. A data item may belong to a feature that can also be used by others. Moreover, the associated costs (time) to insert and to delete a data item from a phone should be registered. When the user hits the Add Data button, the data item is persisted and its information is shown in the table displayed in the ShowDataView. In this view, data items can also be removed from the tool by selecting them and pressing the Remove Data button.

Figure 7.3 The screen for data insertion.

Note that at this point the registered data has no input/output association; it is treated as an AbstractData. When these data are associated with a test, they become input or output data. With these technologies we created a set of forms that allow the user to record information about the test suites and to perform test case prioritization executions with the techniques presented in Chapters 4 and 6.

7.2 MAIN FEATURES

To illustrate the use of the tool we present its two main screens: the Add Test Case and the Prioritization screens. The former provides functionalities to record the test cases of the test suites with the information required to prioritize them; the latter uses these registered test cases to execute a selected prioritization technique and show the results.

7.2.1 Add Test Case screen

Figure 7.4 shows the Add Test Case screen. It provides fields to be filled with the test case information, divided into blocks in the left view, and controls to remove previously registered test cases or visualize them in the right view. Regarding the left view, the first block has fields for general information about the test case: its identification, a description, the suite identification and the number of phones needed. The next block is responsible for the setup information. The user may choose setups previously registered in the Add Setup screen and associate them with the test case; they may be a General Configuration or a Flex Bit. If the chosen setup is a Flex Bit, the user should inform its value, ON or OFF. When setups are registered in the Add Setup screen no value is informed and they are recorded as a FlexBit class, but when they are associated with a test case a TestFlexBit class is instantiated, which is a Flex Bit associated with an ON/OFF value. The setups are added to the test case when the Add Setup button is pressed. On this screen, setups can also be removed by selecting them in the setup table and pressing the Remove Setup button. In the third block the user should inform the input/output state of the test case. First, the user chooses the phone (according to the number of phones informed in the first block) and marks the checkbox that informs whether it is a PUT or not.
Next, a previously registered data item must be selected and its data type, INPUT or OUTPUT, must be informed, together with the amount of data needed (data value). When the Add Data button is pressed, an InputData or OutputData class is instantiated and added to one of the data tables, according to the data type chosen. Under the Data block there is the Resource block, where the user may add or remove previously registered resources (for example, Bluetooth headsets or SIM cards); this block is not visible in Figure 7.4. Finally, the user should press the Add Test Case button to save the test, which is then shown in the table displayed in the right view. The right view allows the removal of registered test cases and the presentation of their information.

Figure 7.4 The screen for test case insertion.

7.2.2 Prioritization screen

This screen uses the test cases registered in the screen presented in the previous section and prioritizes them according to a selected technique. Figure 7.5 illustrates a prioritization execution with seven test cases using the stateful permutation technique.

Figure 7.5 A prioritization execution with the stateful permutation.

The user should mark the test cases and clusters to be prioritized in the table located in the top left corner of the screen. Next, the prioritization technique should be selected. As some test case prioritization techniques require parameters, if one of them is selected its associated parameters should be informed. In particular, the Simulated Annealing technique has two ways of parametrization: by temperature and cooling rate, or by time (in seconds). Once the technique is selected, the user should press the Prioritize button. The tool executes the selected prioritization technique with the marked test cases and clusters.

The results are shown in the table at the center of the screen. They are organized according to the Setup Groups (groups of tests separated by conflicting Flex Bits) and Groups (sets of the transitive closure discovered by the permutation algorithm). In this case, there is one setup group and two groups of the transitive closure: group 1, with the test cases 253, 258 and 262; and group 2, with the test cases 132, 360, 251 and 355. In the first group, three sequences tied with the best result, while in the second, eight were returned as the best sequences. The user should choose one sequence for the first group and another for the second; they can be executed in any order, as they are independent from the data reuse point of view. The estimated configuration time (in seconds) and the time (in milliseconds) taken by the prioritization are presented.

At the bottom of the screen there is a view where the user can build the sequence she wants to use based on the multiple tied results. The tester chooses the group and the tool presents a sequence of list boxes, each one representing one position in the sequence. Each box is filled with the tests available for that specific position. When a test A is chosen for a position X, the other positions are re-filled with the tests that appear in sequences where test A is at position X. In Figure 7.5, for group 2, once the first position holds test 251 and the second holds test 132, there are only two possibilities for the third position: tests 360 and 355.
CHAPTER 8

CONCLUDING REMARKS

This work presented a new methodology for test case prioritization that reuses the data needed along a sequence of test cases, aiming to reduce the time and effort spent on executing them and thus improve the productivity of the testing process. Several techniques were proposed and applied in a mobile phone testing environment, where the test cases follow a black-box design, are described textually and are executed manually by human testers.

We implemented an exploratory approach based on permutation to generate sequences of test cases using this concept. The yielded sequences have the optimal (lowest) cost with respect to data reuse, since the approach examines all possible sequences, except for those discarded by the optimizations described below. Nevertheless, the permutation algorithm can only be used with a limited number of test cases due to its computational complexity, as the underlying problem is NP-complete. The algorithm includes improvements that increase the number of test cases tolerated during a prioritization:

- the pruning mechanism, which stops extending a sequence as soon as the cost of its prefix already exceeds that of the best complete sequence found so far;

- equivalence classes of tests, which reduce the number of tests in the prioritization: the tests in a class all have the same input and output states, and each test's input state equals its output state, so these tests do not contribute new data along a sequence;

- grouping of test cases through the transitive closure optimization, so that groups are prioritized separately, since one group cannot affect reuse in the others. The prioritization is then applied to reduced sets of test cases, making some otherwise infeasible prioritizations possible;

- user-defined clusters, which are prioritized first and then prioritized together with the other test cases and clusters, thereby reducing the number of test cases being prioritized. This feature, however, invalidates the guarantee of an optimal solution.

To allow the application of the data reuse method to test suites with a larger number of test cases, we also proposed three heuristics based on classical artificial intelligence concepts: a greedy algorithm that tries each test case in turn at the head of the sequence and repeatedly picks the next test case with the best data reuse; a strategy based on simulated annealing, whose iterations are controlled by a decreasing temperature; and a genetic algorithm that searches for an ordered list of tests by evolving a population of sequences over several iterations. The use of these techniques does not guarantee the best result. Nevertheless, they are a viable alternative when adopting permutations is infeasible or when time becomes critical during the prioritization process; they have produced good results in reasonable periods of time for both small and large test suites.
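As an illustration of the greedy heuristic just described, the sketch below tries every test case as the head of a sequence and always appends the test with the cheapest transition from the previous one. The transition cost function abstracts the data-reuse (configuration) cost and, like the class and method names, is an assumption made for illustration rather than the dissertation's implementation.

import java.util.*;
import java.util.function.ToDoubleBiFunction;

// Greedy sketch: each test is tried at the head; the rest of the sequence is built
// by always choosing the remaining test with the lowest transition cost.
class GreedyPrioritizer {

    static List<String> prioritize(List<String> tests,
                                   ToDoubleBiFunction<String, String> cost) {
        List<String> best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (String head : tests) {                      // alternate every test at the head
            List<String> sequence = new ArrayList<>(List.of(head));
            Set<String> remaining = new LinkedHashSet<>(tests);
            remaining.remove(head);
            double total = 0.0;
            while (!remaining.isEmpty()) {               // always pick the cheapest next test
                String prev = sequence.get(sequence.size() - 1);
                String next = remaining.stream()
                        .min(Comparator.comparingDouble(t -> cost.applyAsDouble(prev, t)))
                        .orElseThrow();
                total += cost.applyAsDouble(prev, next);
                sequence.add(next);
                remaining.remove(next);
            }
            if (total < bestCost) {                      // keep the cheapest sequence found
                bestCost = total;
                best = sequence;
            }
        }
        return best == null ? List.of() : best;
    }
}

In this sketch each head requires a quadratic number of cost evaluations, so the whole construction is cubic in the number of tests; this polynomial behaviour is consistent with the observation above that the heuristics scale to suites the permutation technique cannot handle.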
To assess the efficiency of the techniques we performed a series of empirical studies. First, we carried out comparative studies between the permutation technique and the manual heuristic-based technique used by the testers [RBJ07]. The first study was a case study showing that the sequences generated by the permutation technique required approximately 30% less configuration time than those of the manual technique. As these results were only a theoretical estimation, we then performed a formal controlled experiment in which testers executed the sequences generated by both techniques. This experiment served two purposes: to check that the time saved in configuration activities was representative within the total execution time, and to compare the two techniques once more. We assessed, with statistical significance, that the time saved in configuration was indeed representative enough to reduce the total execution time, and the permutation technique proved to be 13.5% more effective than the manual technique. Since the manual technique itself had brought gains of 5.56%, we estimate that applying our permutation technique to suites that have not yet been prioritized could yield gains of almost 20% in test suite execution. This saved time could be used to execute more test cases, helping testers to meet or exceed their daily productivity targets.

Regarding the heuristics, we performed experiments with two purposes: to find a good parametrization for the genetic algorithm and to compare the results among the techniques. Five groups of test cases with different sizes were used: Group A (100 test cases), Group B (17 test cases), Group C (11 test cases), Group D (72 test cases) and Group E (28 test cases). Using the genetic algorithm parametrizations found for each amount of tests (shown in Table 6.1), the comparative study showed that for small numbers of tests the techniques produced equal results; for the larger suites (Groups A and D), although the genetic algorithm produced better results, the greedy and simulated annealing approaches returned results of the same order of cost in less time, which leads us to believe that the three techniques have very similar efficiency. A more accurate empirical validation is nonetheless required.

In addition, we developed a stand-alone tool that mechanizes all the proposed prioritization techniques and that supported all the empirical studies. The tool provides mechanisms to register data, setups and resources, which are used to register test cases, which in turn are used to run the prioritizations. When a prioritization takes place, the results are displayed in a table together with the time spent in the prioritization execution and the estimated configuration cost for the selected test suite. Test cases are displayed in groups obtained from the transitive closure optimization, and tests in equivalence classes are highlighted. Finally, the tool also includes a mechanism that helps users choose a single sequence from the results, which is very helpful when too many sequences are returned.
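To give an idea of how such a configuration-cost estimate can be obtained, the sketch below computes the cost of a sequence under a deliberately simplified model in which each data item has a single insertion cost and the data available before a test is exactly the output set of the previous one. The f function used in this work is richer (data amounts, multiple phones, deletion costs and setups), so the code, its class names and the example values should be read as an assumption-laden illustration rather than the tool's actual calculation.

import java.util.*;

// Simplified estimate: before each test, every input data item not already
// available (left behind by the previous test) must be inserted manually,
// and its insertion time is added to the total.
class ConfigurationCostEstimator {

    record TestCase(String id, Set<String> inputs, Set<String> outputs) {}

    static double estimate(List<TestCase> sequence, Map<String, Double> insertionCost) {
        double total = 0.0;
        Set<String> available = new HashSet<>();       // data left on the phone so far
        for (TestCase test : sequence) {
            for (String data : test.inputs()) {
                if (!available.contains(data)) {       // data must be prepared manually
                    total += insertionCost.getOrDefault(data, 0.0);
                }
            }
            available = new HashSet<>(test.outputs()); // state after executing the test
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical data items and costs (seconds), only to exercise the estimator.
        Map<String, Double> cost = Map.of("SMS Message", 40.0, "Phonebook contact", 25.0);
        TestCase t1 = new TestCase("T1", Set.of("SMS Message"), Set.of("SMS Message"));
        TestCase t2 = new TestCase("T2", Set.of("SMS Message", "Phonebook contact"),
                                   Set.of("Phonebook contact"));
        // T1 leaves an SMS Message behind, so T2 only needs the Phonebook contact:
        System.out.println(estimate(List.of(t1, t2), cost));  // 40.0 + 25.0 = 65.0
    }
}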
8.1 LIMITATIONS

The limitations of our work are described below.

- The only approach that guarantees finding optimal sequences of test cases, the permutation technique, can only be applied to small test suites (at most 15 test cases if no optimization is applied). We provided several optimizations in the algorithm that help the technique handle a larger number of test cases; however, it is possible that no optimization is applicable to a given set of test cases. For example, a test suite that forms a single group under the transitive closure, contains no equivalent test cases and admits no pruning will necessarily require traversing all possible sequences. We believe such cases are unlikely in the application domain we work in; still, we are aware that, in such cases, only about 15 test cases can feasibly be prioritized in less than one day.

- Another aspect of our data reuse techniques is the overhead of the data survey activity that must be performed to feed our tool and allow a mechanized prioritization. It is a laborious task, since the testers must read the test cases, identify the input/output state of each one and register it in the tool. Such a task could be automated.

- After all data has been gathered, another issue is how to assign the cost associated with each data item. We performed experiments to obtain the configuration cost of each data item used in our prioritization executions; nevertheless, it might be infeasible to perform such experiments every time new data appears. Moreover, this cost may vary from tester to tester depending on their experience and productivity. As a prioritization should be executed specifically for each tester, testers should register the time for each data item based on their execution history or on an estimate derived from their experience. When new data is used, we propose performing small experiments, or executing the test cases without prioritization to acquire knowledge about the data and record the historical data from these executions, and only then performing the prioritization.

- The fewer the possibilities of reuse among a set of test cases, the larger the number of best sequences found by the permutation technique. For example, it may be very difficult to choose one of 45 thousand best sequences returned by the tool as the result of a prioritization. To minimize this issue we developed in the tool an interface that helps the user build the preferred sequence from the returned results, as shown in Chapter 7.

- Regarding the proposed heuristics, the best sequence is not guaranteed to be found. Moreover, a more complete experiment could be performed, with more executions and with comparisons against the permutation technique whenever possible.

- The empirical studies presented in this work were carried out in a mobile phone testing environment where test cases are executed manually by testers. Therefore, the results from these studies might not be immediately generalizable to other application domains; further experiments should be executed to confirm the effectiveness of these techniques in other environments.

8.2 FUTURE WORK

The following activities are suggested topics for future work:

- The input/output state information of each test could be detailed in the test itself during its creation or generation process. Such information could be standardized, allowing text processing by our tool; in this way, test cases could be registered automatically, which would save the time spent entering this information in the tool (a possible input format is sketched after this list).

- Other heuristics could be implemented and compared to those presented in Chapter 6, creating a benchmark of prioritization techniques based on data reuse.

- Exploring the use of the transitive closure optimization together with the heuristics.

- Performing a more complex experiment with different test suites and more iterations, using the proposed heuristics and trying to find good parametrizations for them. The purpose of such an experiment would be to provide a more accurate comparison among the heuristics, also including the permutation technique in the comparison whenever possible.

- Incorporating other aspects that may affect the time spent in test case execution (test complexity, test size, tester experience) into our calculations, thereby improving our f function.
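As a hint of what such standardized annotations and their text processing could look like, the sketch below assumes a hypothetical line format modelled on the data descriptions of Appendix A ("<amount> <data> in phone <n>", prefixed with INPUT or OUTPUT) and parses it into a small record. Both the format and the class are illustrative, not an existing feature of the tool.

import java.util.regex.*;

// Hypothetical standardized input/output annotation parser; the format is an
// assumption for illustration only.
class IoStateParser {

    record IoEntry(String kind, int amount, String data, int phone) {}

    private static final Pattern LINE = Pattern.compile(
            "(INPUT|OUTPUT):\\s*(\\d+)\\s+(.+?)\\s+in phone\\s+(\\d+)");

    static IoEntry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized annotation: " + line);
        }
        return new IoEntry(m.group(1), Integer.parseInt(m.group(2)),
                           m.group(3), Integer.parseInt(m.group(4)));
    }

    public static void main(String[] args) {
        // Example written in the style of the suites in Appendix A.
        System.out.println(parse("INPUT: 1 Phonebook contact in phone 2"));
        // prints IoEntry[kind=INPUT, amount=1, data=Phonebook contact, phone=2]
    }
}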
In summary, we proposed test case prioritization techniques based on data reuse, reducing the time spent configuring test data in order to improve the productivity of testing teams. An approach using permutation, which finds the best results for small test suites, was detailed together with three other heuristics that can be used for larger test suites. We carried out empirical studies that reinforced our expectation that data reuse is relevant to reduce the execution time of a test suite and that our techniques outperform the manual technique currently used by testers.

APPENDIX A

APPENDIX

A.1 TEST SUITES USED IN THE CONTROLLED EXPERIMENT

Tables A.1 and A.2 show the test suites used in our controlled experiment. The tests are described in terms of their inputs and outputs (test procedures are omitted). The results of the manual and the automatic techniques are shown in Table A.3.

Test Inputs Outputs
1 1 Phonebook contact in phone 2 1 Phonebook contact in phone 2 0 phone02 IM contact in phone 1 1 phone02 IM contact in phone 1 1 phone01 IM contact in phone 2 1 phone01 IM contact in phone 2 1 SIM contact in phone 2 1 SIM contact in phone 2 1 IM login (Auto) in phone 2 1 IM login (Auto) in phone 2 1 IM login (Power On) in phone 1 1 IM login (Power On) in phone 1 1 Music file in phone 1 1 Music file in phone 1 1 Alert for IM in phone 1 1 Alert for IM in phone 1 1 Automatic Lock in phone 1 1 Automatic Lock in phone 1 1 phone02 IM contact in phone 1 1 phone02 IM contact in phone 1 1 phone01 IM contact in phone 2 1 phone01 IM contact in phone 2 2 1 Alert phone02 in phone 1 3 4 5 6 1 phone02 IM contact in phone 1 0 phone02 IM contact in phone 1 1 phone01 IM contact in phone 2 1 phone01 IM contact in phone 2 1 phone02 IM contact in phone 1 1 phone02 IM contact in phone 1 1 phone01 IM contact in phone 2 1 phone01 IM contact in phone 2 1 phone02 IM contact in phone 1 1 phone02 IM contact in phone 1 1 phone01 IM contact in phone 2 1 phone01 IM contact in phone 2 1 Video in phone 1 1 Video in phone 1 1 phone02 IM contact in phone 1 1 phone02 IM contact in phone 1 1 phone01 IM contact in phone 2 1 phone01 IM contact in phone 2 1 Phonebook contact in phone 1 1 Phonebook contact in phone 1

Table A.1 The test suite I.

Test Inputs Outputs
1 0 SMS Message in phone 1 0 SMS Message in phone 1 0 Info Service Msg. in phone 1 3 Info Service Msg. in phone 1 1 Service field value in phone 1 1 Service field value in phone 1 1 Active channel 0 in phone 1 1 Active channel 0 in phone 1 1 Auto cleanup none in phone 1 1 Auto cleanup none in phone 1 0 Screen saver in phone 1 0 Screen saver in phone 1 1 Ringtone to info in phone 1 1 Ringtone to info in phone 1 2 1 Phonebook contact in phone 1 1 Self contact in phone 1 1 SMS Message in phone 1 1 SMS Message in phone 2 3 1 Word completion in phone 1 1 Word completion in phone 1 1 Itap mode in phone 1 1 Itap mode in phone 1 4 5 1 Itap mode in phone 1 1 Non-DRM HR Pic. in phone 1 1 Non-DRM HR Pic. in phone 1 1 Phonebook contact in phone 1 1 Phonebook contact in phone 1 1 SMS Message in phone 1 1 SMS Message in phone 2 6 1 Phonebook contact in phone 1 1 Phonebook contact in phone 1 1 Self contact in phone 1 1 Self contact in phone 1 1 Call to self in phone 1 1 Call to self in phone 1 1 Msg. App. locked in phone 1 0 Msg. App.
locked in phone 1 2 SMS Message in phone 1 7 1 Dialed Call to self in phone 1 1 Dialed Call to self in phone 1 8 2 SMS Message in phone 1 3 SMS Message in phone 1 1 Auto cleanup set in phone 1 1 Auto cleanup set in phone 1 1 SMS Message in phone 1

Table A.2 The test suite II.

Suite   Manual             Automatic
I       3-4-5-2-6-1        1-2-6-5-4-3
II      6-8-5-1-7-4-3-2    1-2-6-8-5-7-4-3

Table A.3 Test suites sorted according to each prioritization technique.

BIBLIOGRAPHY

[AB08a] Eduardo Aranha and Paulo Borba. Manualtest: Improving collection of manual test execution data in empirical studies. In Proceedings of the 5th Empirical Software Engineering Latin-american Workshop (ESELAW), Salvador, Brazil, 2008.

[AB08b] Eduardo Aranha and Paulo Borba. Using process simulation to assess the test design effort reduction of a model-based testing approach. In ICSP, pages 282–293. Springer, 2008.

[AR02] Carina Andersson and Per Runeson. Verification and validation in industry – A qualitative survey on the state of practice. In International Symposium on Empirical Software Engineering (ISESE), pages 37–47. IEEE Computer Society, 2002.

[AW95] Alberto Avritzer and Elaine J. Weyuker. The automatic generation of load test suites and the assessment of the resulting software. IEEE Trans. Softw. Eng., 21(9):705–716, 1995.

[BCR94] V. R. Basili, G. Caldiera, and H. D. Rombach. The Goal Question Metric approach. Encyclopedia of Software Engineering, 1:528–532, 1994.

[Bei95] Boris Beizer. Black-box testing: techniques for functional testing of software and systems. John Wiley & Sons, Inc., New York, NY, USA, 1995.

[Ber07] Antonia Bertolino. Software testing research: Achievements, challenges, dreams. In Lionel C. Briand and Alexander L. Wolf, editors, International Conference on Software Engineering, ISCE 2007, Workshop on the Future of Software Engineering, FOSE, pages 85–103, 2007.

[BHHH78] George E. P. Box, William G. Hunter, Stuart J. Hunter, and William G. Hunter. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley-Interscience, June 1978.

[Dij72] Edsger W. Dijkstra. The humble programmer. Commun. ACM, 15(10):859–866, October 1972.

[EMR00] Sebastian Elbaum, Alexey G. Malishevsky, and Gregg Rothermel. Prioritizing test cases for regression testing. In Proceedings of the International Symposium on Software Testing and Analysis, volume 25, pages 102–112. ACM Press, 2000.

[EMR02] Sebastian G. Elbaum, Alexey G. Malishevsky, and Gregg Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, 2002.

[GHJV94] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley Professional Computing Series). Addison-Wesley Professional, illustrated edition, January 1994.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[GJ05] Irum Godil and Hans-Arno Jacobsen. Horizontal decomposition of Prevayler. In CASCON '05: Proceedings of the 2005 conference of the Centre for Advanced Studies on Collaborative Research, pages 83–100. IBM Press, 2005.

[GJSB05] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification, Third Edition. Addison-Wesley Longman, Amsterdam, 3rd edition, June 2005.

[Ham04] Paul Hamill. Unit test frameworks. O'Reilly, 2004.

[HL92] J. R. Horgan and S. A. London. ATAC: A data flow coverage testing tool for C.
In Proceedings of the Symposium on Assessment of Quality Software Development Tools, pages 2–10, New Orleans, LA, USA, 1992. Springer-Verlag.

[Hof99] Douglas Hoffman. Cost benefits analysis of test automation. In Software Testing Analysis & Review Conference (STAR East), Orlando, FL, USA, 1999.

[IEE90] IEEE standard glossary of software engineering terminology. Technical report, 1990.

[JCP08] A. Jedlitschka, M. Ciolkowski, and D. Pfahl. Reporting experiments in software engineering. In F. Shull, J. Singer, and D. I. K. Sjøberg, editors, Guide to Advanced Empirical Software Engineering, chapter 8. Springer-Verlag, London, 2008.

[JG08] Dennis Jeffrey and Neelam Gupta. Experiments with test case prioritization using relevant slices. J. Syst. Softw., 81(2):196–221, 2008.

[JH03] James A. Jones and Mary Jean Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29:92–101, 2003.

[Jor02] Paul C. Jorgensen. Software Testing: A Craftsman's Approach, Second Edition. CRC Press, 2002.

[JP05] A. Jedlitschka and D. Pfahl. Reporting guidelines for controlled experiments in software engineering. In International Symposium on Empirical Software Engineering (ISESE), 2005.

[KFN99] Cem Kaner, Jack L. Falk, and Hung Quoc Nguyen. Testing Computer Software, Second Edition. John Wiley & Sons, Inc., New York, NY, USA, 1999.

[KP02] Jung-Min Kim and Adam Porter. A history-based test prioritization technique for regression testing in resource constrained environments. In ICSE '02: Proceedings of the 24th International Conference on Software Engineering, pages 119–129, New York, NY, USA, 2002. ACM.

[Kue99] Robert Kuehl. Design of Experiments: Statistical Principles of Research Design and Analysis. Duxbury Press, 2nd edition, 1999.

[LAIS09] Lucas Lima, Eduardo Aranha, Juliano Iyoda, and Augusto Sampaio. Test case prioritization based on data reuse – an empirical study. In Proceedings of the Third International Symposium on Empirical Software Engineering and Measurement, Lake Buena Vista, FL, USA, 2009. IEEE Computer Society. To appear.

[LIS08] Lucas Lima, Juliano Iyoda, and Augusto Sampaio. A permutation technique for test case prioritization in a black-box environment. In 2nd Brazilian Workshop on Systematic and Automated Software Testing, Campinas-SP, Brazil, October 2008.

[MC96] T. W. Manikas and J. T. Cain. Genetic Algorithms vs. Simulated Annealing: A Comparison of Approaches for Solving the Circuit Partitioning Problem. Technical report, 1996.

[ML05] Jeff McAffer and Jean-Michel Lemieux. Eclipse Rich Client Platform: Designing, Coding, and Packaging Java(TM) Applications. Addison-Wesley Professional, 2005.

[MSBT04] Glenford J. Myers, Corey Sandler, Tom Badgett, and Todd M. Thomas. The Art of Software Testing, Second Edition. Wiley, June 2004.

[NCT+07] S. Nogueira, E. Cartaxo, D. Torres, E. Aranha, and R. Marques. Model Based Test Generation: An Industrial Experience. In Proceedings of the I Brazilian Workshop on Systematic and Automated Software Testing, 2007.

[Nor98] J. R. Norris. Markov Chains. Cambridge University Press, 1998.

[OMG03] OMG. Unified Modeling Language. Specification v1.5, Object Management Group, March 2003. http://www.omg.org/cgi-bin/doc?formal/03-03-01.

[OTPS98] Akira K. Onoma, Wei-Tek Tsai, Mustafa Poonawala, and Hiroshi Suganuma. Regression testing in an industrial environment. Commun. ACM, 41(5):81–86, 1998.

[Pfl94] Shari Lawrence Pfleeger.
Design and analysis in software engineering: the language of case studies and formal experiments. SIGSOFT Softw. Eng. Notes, 19(4):16–20, 1994.

[Pfl95a] Shari Lawrence Pfleeger. Design and analysis in software engineering: how to set up an experiment. SIGSOFT Softw. Eng. Notes, 20(1):22–26, 1995.

[Pfl95b] Shari Lawrence Pfleeger. Experimental design and analysis in software engineering: choosing an experimental design. SIGSOFT Softw. Eng. Notes, 20(3):13–15, 1995.

[Pfl95c] Shari Lawrence Pfleeger. Experimental design and analysis in software engineering, part 5: analyzing the data. SIGSOFT Softw. Eng. Notes, 20(5):14–17, 1995.

[Pfl95d] Shari Lawrence Pfleeger. Experimental design and analysis in software engineering: types of experimental design. SIGSOFT Softw. Eng. Notes, 20(2):14–16, 1995.

[Pre05] Roger Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill, 6th edition, 2005.

[QNXZ07] Bo Qu, Changhai Nie, Baowen Xu, and Xiaofang Zhang. Test case prioritization for black box testing. In COMPSAC '07: Proceedings of the 31st Annual International Computer Software and Applications Conference - Vol. 1 (COMPSAC 2007), pages 465–474, Washington, DC, USA, 2007. IEEE Computer Society.

[RBJ07] H. Rodrigues, R. Baden, and W. Júnior. A Test Execution Sequence Study for Performance Optimization. Technical report, The Federal University of Pernambuco, 2007.

[RN03] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 2003.

[RUCH01] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948, 2001.

[RW06] Rudolf Ramler and Klaus Wolfmaier. Economic perspectives in test automation: balancing automated and manual testing with opportunity cost. In Proceedings of the International Workshop on Automation of Software Test (AST), pages 85–91, New York, NY, USA, 2006. ACM.

[Sal09] Ricardo Sales. Implementing Heuristics for Test Cases Prioritization. Technical report, Universidade Federal do Rio Grande do Norte, Natal, 2009. Monograph.

[SC89] George W. Snedecor and William G. Cochran. Statistical Methods. Iowa State University Press, 1989.

[Sed77] Robert Sedgewick. Permutation generation methods. ACM Comput. Surv., 9(2):137–164, 1977.

[SMP09] Ricardo Sales, Taíza Montenegro, and Thaysa Paiva. Implementing Heuristics for Test Cases Prioritization and Integration with Target. Technical report, The Federal University of Pernambuco, 2009.

[Sri04] H. Srikanth. Requirements based test case prioritization. In Student Research Forum of the 12th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, 2004.

[ST02] Amitabh Srivastava and Jay Thiagarajan. Effectively prioritizing tests in development environment. SIGSOFT Softw. Eng. Notes, 27(4):97–106, 2002.

[WHLM98] W. Eric Wong, Joseph R. Horgan, Saul London, and Aditya P. Mathur. Effect of test set minimization on fault detection effectiveness. Softw. Pract. Exper., 28(4):347–369, 1998.

[ZNXQ07] Xiaofang Zhang, Changhai Nie, Baowen Xu, and Bo Qu. Test case prioritization based on varying testing requirement priorities and test case costs. In QSIC '07: Proceedings of the Seventh International Conference on Quality Software, pages 15–24, Washington, DC, USA, 2007. IEEE Computer Society.

This volume has been typeset in LaTeX with the UFPEThesis class (www.cin.ufpe.br/~paguso/ufpethesis).