Pós-Graduação em Ciência da Computação
“TEST CASE PRIORITIZATION BASED ON
DATA REUSE FOR BLACK-BOX
ENVIRONMENTS”
By
Lucas Albertins de Lima
M.Sc. Dissertation
Universidade Federal de Pernambuco
[email protected]
www.cin.ufpe.br/~posgraduacao
RECIFE, AUGUST 2009
Universidade Federal de Pernambuco
CENTRO DE INFORMÁTICA
PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO
Lucas Albertins de Lima
“TEST CASE PRIORITIZATION BASED ON
DATA REUSE FOR BLACK-BOX
ENVIRONMENTS”
This work was presented to the Graduate Program in Computer Science (Pós-Graduação em Ciência da Computação) of the Centro de Informática of the Universidade Federal de Pernambuco as a partial requirement for obtaining the degree of Master of Science in Computer Science.
ADVISOR: Prof. Augusto Cézar Alves Sampaio, PhD
CO-ADVISOR: Prof. Juliano Manabu Iyoda, PhD
RECIFE, AUGUST 2009
Lima, Lucas Albertins de
Test case prioritization based on data reuse for
Black-box environments / Lucas Albertins de Lima. Recife: O Autor, 2009.
xii, 100 folhas : fig., tab.
Dissertação (mestrado) – Universidade Federal de
Pernambuco. CIn. Ciência da Computação, 2009.
Inclui bibliografia e apêndice.
1. Engenharia de software. I. Título.
005.1   CDD (22. ed.)   MEI2009-114
To my parents, the most supportive and lovely persons a
man could ever desire.
ACKNOWLEDGEMENTS
First, I would like to thank God for allowing me to carry out this work and for guiding me throughout its development.
Next, the most important people in my life, my parents José Nivaldo and Ana Maria, for always believing that I can take one more step forward, and for always being by my side, helping me and giving me strength whenever I need it. To Pâmela, my girlfriend, for her understanding and encouragement during the development of this work. To my whole family, for always rooting for me and encouraging me for as long as I can remember.
To my advisors, Augusto Sampaio and Juliano Iyoda, for all the understanding and dedication that made this work possible. To professor Alexandre Mota for having been the godfather of this work, pointing us to the problem to be solved. To professor Flávia Barros for the theoretical background needed to build the heuristics.
To my colleague Dante Torres for the help with the permutation algorithm. To my colleague and professor Eduardo Aranha for his contribution to all the empirical studies. To Ricardo Sales, Taíza Montenegro and Thaysa Paiva for developing the genetic algorithm technique and running the experiments with the heuristics. To André Lacerda and all the testers involved in the execution of the controlled experiment.
To CNPq for the financial support through the scholarship. To the Motorola/BTC (Brazil Test Center) program, which provided an industrial environment for the application and validation of this work.
Finally, to all the colleagues of the Motorola/BTC research group for the support given during the development of this research, and to all the colleagues of the graduate program of the Centro de Informática at UFPE.
Beware of bugs in the above code;
I have only proved it correct, not tried it.
—DONALD KNUTH
RESUMO
Test case prioritization is a method that aims to order tests so as to obtain gains according to specific criteria. This work proposes prioritization techniques that aim to reduce the time spent on the manual execution of test cases by decreasing the effort spent on data preparation; the best test sequences are those that reuse the most data. These techniques were applied in a mobile phone testing environment, in which tests are created from requirements and executed manually.
Four techniques are proposed, including an exploratory approach based on the generation of test permutations and three heuristics: Greedy, Simulated Annealing and Genetic Algorithm. The permutation technique always returns the best sequences; however, it can only be applied to small test suites due to the computational complexity of the algorithm. The heuristics, in turn, can in principle be applied to arbitrary test suites, although there is no guarantee that the best sequences will always be produced. A tool was developed that mechanizes the prioritization process, helping testers to record information, run prioritizations and choose sequences from the returned results.
Empirical studies were carried out comparing the permutation technique with the existing prioritization technique, in which tests are prioritized manually based on a heuristic that uses a tree structure as well as the tester's knowledge and intuition. The results show that the permutation technique reduced configuration time by approximately 25-30% and total time (configuration and procedures) by 13.5%. Experiments with our heuristics showed that they have similar effectiveness for small and large test suites. The proposed techniques bring significant gains not only in the execution of the sequences but also in their generation, which is automated by the tool.
Keywords: Software Testing, Test Case Prioritization, Data Reuse
ABSTRACT
Test Case Prioritization is an approach that aims to order test cases to obtain gains according to specific criteria. This work proposes test case prioritization techniques that aim to decrease the time spent on manual execution by reducing the effort of the data preparation needed for each test case; the best sequences of tests are those that reuse the most data. We applied these techniques in a mobile phone testing environment where tests are manually executed and designed based on requirements.
Four techniques are proposed, including one exploratory approach based on permutation generation and three heuristics: Greedy, Simulated Annealing and Genetic Algorithm. The permutation technique always yields the best sequences; however, it can only be applied to test suites of limited size due to the computational complexity of the algorithm. The heuristics can, in principle, be applied to arbitrary test suites, although there is no guarantee that the best sequences will be produced. We implemented a tool that mechanizes the prioritization process, helping testers to register information, execute the prioritization techniques and choose from the sequences yielded as results.
Empirical studies were performed comparing the permutation approach to the existing prioritization technique, in which the test cases are prioritized manually based on a heuristic that uses a tree structure as well as knowledge and intuition from the testers. Results show that the permutation technique reduced configuration time by approximately 25-30% and total execution time (configuration and procedure time) by 13.5%. Experiments regarding our heuristics have shown that they have similar effectiveness when applied to small and large test suites. Our techniques yield significant gains not just in the execution of the sequences but also in their generation, which is automated by our tool.
Keywords: Software Testing, Test Case Prioritization, Data Reuse
CONTENTS

Chapter 1—Introduction
  1.1 Organization

Chapter 2—Background
  2.1 Software Testing
    2.1.1 Levels of Testing
    2.1.2 Test Design
    2.1.3 Testing Approaches
  2.2 Test Case Prioritization

Chapter 3—Related Work
  3.1 White-Box Test Case Prioritization Techniques
    3.1.1 Prioritizing Test Cases For Regression Testing
    3.1.2 Other White-Box Initiatives
  3.2 Black-Box Test Case Prioritization Techniques
    3.2.1 Test Case Prioritization for Black Box Testing
    3.2.2 The Automatic Generation of Load Test Suites and the Assessment of the Resulting Software
    3.2.3 Requirements-Based Test Case Prioritization
    3.2.4 A Test Execution Sequence Study for Performance Optimization
  3.3 Discussions

Chapter 4—Permutation Technique for Black-Box Test Case Prioritization
  4.1 The Permutation Technique
    4.1.1 Optimizations

Chapter 5—Empirical Studies
  5.1 Empirical Study 1 - Case Study (ES1-CS)
    5.1.1 Experiment Planning
      5.1.1.1 Goals
      5.1.1.2 Participants
      5.1.1.3 Experimental Material
      5.1.1.4 Tasks
      5.1.1.5 Experiment Design
    5.1.2 Results and Analysis
      5.1.2.1 Case Study
  5.2 Empirical Study 2 - Controlled Experiment (ES2-CE)
    5.2.1 Experimental Planning
      5.2.1.1 Goals
      5.2.1.2 Participants
      5.2.1.3 Experimental Material
      5.2.1.4 Tasks
      5.2.1.5 Hypotheses and variables
    5.2.2 Experiment Design
      5.2.2.1 Procedure
    5.2.3 Execution
      5.2.3.1 Preparations
      5.2.3.2 Deviations
    5.2.4 Analysis
      5.2.4.1 Total configuration time
      5.2.4.2 Total procedure time
      5.2.4.3 Total execution time
    5.2.5 Interpretation
      5.2.5.1 Evaluation of results and implications
      5.2.5.2 Threats to Validity
  5.3 Concluding Remarks

Chapter 6—Prioritization Heuristics
  6.1 Greedy
  6.2 Simulated Annealing
  6.3 Genetic Algorithm
  6.4 Case Studies
    6.4.1 Genetic Algorithm Parametrization
    6.4.2 Heuristic Techniques Comparison

Chapter 7—Mechanization
  7.1 Software Structure
    7.1.1 Architecture
      7.1.1.1 Core Module
      7.1.1.2 UI Module
  7.2 Main Features
    7.2.1 Add Test Case screen
    7.2.2 Prioritization screen

Chapter 8—Concluding Remarks
  8.1 Limitations
  8.2 Future Work

Appendix A—Appendix
  A.1 Test Suites Used in the Controlled Experiment
LIST OF FIGURES

3.1 The calculation of M0.
3.2 The calculation of M1.
3.3 The weight prioritization equation [Sri04].
3.4 An example of an ordered tree using the approach [RBJ07].
4.1 An example using the reuse function.
4.2 Possible phone combinations.
4.3 An example keeping the phone state.
4.4 An example using pruning mechanism.
4.5 Three equivalent test cases.
4.6 Reducing test set with equivalence class.
4.7 The transitive closure mechanism.
4.8 Example of cluster.
4.9 Cluster as a representative test case.
5.1 An example of a Latin square used in our experiment.
5.2 Results from the ANOVA with respect to the total configuration time.
5.3 Results from the ANOVA with respect to the total procedure time.
5.4 Results from the ANOVA with respect to the total execution time.
6.1 An example of successors from a state in our simulated annealing algorithm.
7.1 The architecture of the prioritization tool.
7.2 Class diagram of the entities submodule.
7.3 The screen for data insertion.
7.4 The screen for test case insertion.
7.5 A prioritization execution with the stateful permutation.
LIST OF TABLES

3.1 Subjects used in the experiments [RUCH01].
3.2 An example using PORT [Sri04].
3.3 Test case representation and their relationship [RBJ07].
4.1 Feasibility study results.
5.1 An example of 2 lists with random orders of execution containing 4 input data generation.
5.2 Execution times of configuration data collected in ES1-CS (in seconds).
5.3 Comparison among approaches.
5.4 Dependent and independent variables of the experiment.
5.5 Results from the Latin squares with respect to the total configuration time and the total procedure time.
6.1 Best parameters for the Genetic Algorithm found in the experiments performed.
6.2 Results of experiments executed with different sets of test cases.
A.1 The test suite I.
A.2 The test suite II.
A.3 Test suites sorted according to each prioritization technique.
CHAPTER 1
INTRODUCTION
Software quality is an essential discipline of software engineering, closely tied to product quality and customer satisfaction. As quality is a measure of confidence in a product, software customers demand demonstrable reliability through the application of such a measure. In particular, there is a critical need to reduce the cost, effort, and time-to-market of software. Hence, the challenge is to meet customers' expectations through the application of quality principles in the software development process.
Software quality disciplines aim to measure whether the software is well designed (quality of design) and how well the software conforms to its requirements (quality of conformance) [Pre05]. One of the most widely used methods to assure quality of conformance is software testing. In this context, a test can be defined as an action that exercises the software with the objective of finding application faults. Testing has become a normal part of modern software engineering due to its efficiency (when applied correctly) and relatively lower cost compared to other approaches (formal validation, reviews, debugging).
However, its cost can reach more than 50% of the total cost of the software development
process [MSBT04].
Therefore, any effort to optimize the creation and the execution of test cases could
help companies to solve their problem concerning the lack of resources that sometimes
affects the progress of the testing process. One approach to the creation problem is to
automatically generate test cases with a mechanized process, reducing the time spent
creating tests.
Regarding the optimization of test case execution, there are several approaches to a
large variety of problems (some of them are presented in Chapter 2). For example, test
selection aims to choose the right set of tests from a larger suite, following some criteria
previously specified, allowing testers to reach the goal related to the selected criteria,
keeping a coverage rate for instance. However, some categories of problems require that
a fixed number of tests be executed to assure quality in testing, avoiding potential losses
in the defect discovery capability. We can imagine a situation where a set of test cases cannot be reduced any further through the application of other techniques, such as test selection, because otherwise some errors might not be detected, or because it is necessary to assure that some functionality is being tested.
In these cases, an alternative is to arrange a sequence of test cases in an ordered way
such that the execution in this specific order will make the tester reach some goal. Some
examples of goals are reaching a coverage criterion earlier, finding defects earlier, and, in
our case, reducing the time spent on test execution.
This approach is well known as Test Case Prioritization [EMR00] and we instantiated it using the data reuse concept. Our problem is to provide an ordered sequence of
tests that can be executed as fast as possible. In order to achieve this goal, we propose
four test case prioritization techniques. An exploratory strategy based on permutation
generation, and three heuristic-based techniques: Greedy, Simulated Annealing and Genetic Algorithm [RN03]. They aim to generate sequences of tests that when executed the
inputs needed for each test may be reused from the execution of a previous test in the
sequence. As data is reused the tester (or machine) saves time spent on preparation of
these data, which in some test case environments is the most critical phase.
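To illustrate the data reuse idea informally, consider each test as the set of data items it must have prepared on the phone. The sketch below is a simplified model, not the reuse function defined in Chapter 4; the names and example data are hypothetical.

```python
# Simplified sketch of the data-reuse idea: each test is modeled as the set of
# data items it needs; consecutive tests reuse whatever is already prepared.

def setup_cost(sequence):
    """Total number of data items that must be prepared along the sequence."""
    cost, prepared = 0, set()
    for needed in sequence:
        cost += len(needed - prepared)  # only items not already on the phone
        prepared = needed               # assume the phone keeps this test's data
    return cost

# Three hypothetical tests and the data each one needs.
t1, t2, t3 = {"contact"}, {"contact", "picture"}, {"video"}

print(setup_cost([t1, t2, t3]))  # 3: 'contact' is reused between t1 and t2
print(setup_cost([t2, t3, t1]))  # 4: no reuse at all in this order
```

The same three tests demand different preparation effort depending on the order, which is exactly what the prioritization techniques exploit.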
Hence, the time spent on the execution is decreased. This time saved can be used
to execute more tests or to make testers devote more time to other testing activities.
In the end, the main objective is to increase test team productivity, helping companies deal with the problem of lack of resources. However, it is important to emphasize that the defect discovery capability of the tests being prioritized is not the focus of this work. We assume that the test design phase was performed correctly and that the tests were appropriately chosen in the test planning phase carried out by a test specialist. Our focus is on
improving test execution performance through test data reusability.
We applied these techniques in a mobile phone testing environment at the BTC (Brazil Test Center) program, which is a partnership between Motorola and the Informatics Center of the Federal University of Pernambuco (CIn-UFPE). The characteristics of this industrial environment fit our objectives perfectly, as test teams have daily demands on test case executions and test case prioritization can be applied to help them meet these daily testing targets.
Test cases in BTC are built from a functional testing perspective (also known as
black-box testing), which means that the source code is not available [Bei95]. The test
design is based on requirements, from which the test engineers retrieve the information for creating the tests. The tests are described textually: the tester reads the text, exercises the actions described in it on the mobile phone, and checks whether the result of each action corresponds to the expected behavior, also described in the test case.
Testing in industry is still largely conducted on a manual basis [AR02, Ber07], which is often justified by limitations of existing automation frameworks and an unclear ROI (Return on Investment) on automation [Hof99, RW06].
The creation process of these test cases can be automated by using a tool developed
by the BTC research team. The TaRGeT (Test and Requirements Generation Tool)
automatically generates test suites from use case scenarios written in a Controlled Natural
Language (CNL). The possible scenarios are specified as use cases described in CNL
in a template that supports automatic processing. Then, from use cases, a test suite
can be generated, where each test case represents a flow from the written use cases.
Experiments revealed a significant effort reduction when using TaRGeT in the BTC testing environment [NCT+07].
Regarding the test execution process, a test case prioritization technique was proposed by research students from BTC aiming to improve test team productivity [RBJ07]. This technique uses a heuristic to build a sequence according to the data that can be reused, keeping a tree in which each path from the root to a leaf represents a sequence of test cases with some data reusability among them. In the end, the longest path is returned. However, this technique is not mechanized, so the testers must perform the methodology manually. Moreover, it does not guarantee that the best sequence is produced. This work was the main motivation for our research.
The main contributions of our work are four mechanized test case prioritization techniques that use the concept of data reuse to generate sequences of tests and, in particular,
the permutation technique that guarantees the computation of the best sequences (there
might be more than one) for test suites with a small number of tests, as it uses an exploratory algorithm. We also provide empirical evidence that the permutation technique can generate better sequences than those generated by the manual approach. Our experimental results show that the permutation technique reduced configuration time by approximately 25-30%, which represented a reduction of 13.5% in total execution time, that is, the time spent on the whole test suite, both preparing the phone for each test case and executing the test case procedures.
This gain can be even larger when our technique is applied to test suites with no previous prioritization. The experiments with the manual technique reported savings of around 5.56%, reducing the execution time by an average of 4 minutes. The authors claim that the saved time could be used to execute 1 or 2 more test cases. Therefore, we expect that our prioritized sequences, being almost 20% more economical, could allow the execution of even more test cases, increasing the test team's productivity.
In what follows, we present the organization of this work.
1.1 ORGANIZATION
The remainder of this work is organized as follows:
• Chapter 2 introduces some basic concepts used throughout the work: Software Testing, Levels of Testing, Test Design, Testing Approaches and Test Case Prioritization.

• Chapter 3 discusses the work related to our strategies. We contextualize white-box and black-box test case prioritization techniques.

• Chapter 4 describes the permutation technique developed in this work, which uses an optimized exploratory strategy to generate the best sequences regarding reusability of data.

• Chapter 5 presents two comparative empirical studies between the permutation technique and the manual approach used by testers: a case study and a controlled experiment.

• Chapter 6 presents the three techniques that use artificial intelligence concepts: Greedy, Simulated Annealing and Genetic Algorithm. In addition, a comparative study among them is also presented.

• Chapter 7 details the tool developed to support the mechanization of the presented techniques.

• Chapter 8 describes the concluding remarks, discussing our contributions, the limitations of our approach, and future work.
CHAPTER 2
BACKGROUND
2.1 SOFTWARE TESTING
In this chapter, we present a brief introduction to software testing and test case prioritization. In the software industry, testing has become an activity to assure quality of
programs by validating assumptions about the software built. As software has become
more and more complex each day, clients demand confidence in the product they are investing in. So, software testing plays a very important role in achieving a satisfactory level of trust among the software stakeholders.
The main objective of testing is to find unexpected situations in which the software might misbehave when certain scenarios are exercised. These situations originate in errors, which are incorrect results of human actions. An error is manifested as a
failure, a deviation of the system from the expected result, which is caused by a defect,
also called fault: a flaw in the system. These definitions are described in the IEEE
Standard Glossary of Software Engineering Terminology [IEE90].
The result of a test execution is a verdict indicating whether the software worked as predicted (passed) or presented some problem (failed) for the specific situation tested. The entity that judges whether a test has passed or failed is called the Oracle; it can be a human mechanism (the person who performs the tests) or an automatic one (some testing tool). The Oracle's judgement considers the inputs provided to the test and compares the generated outputs with the expected results. To predict these results
the oracle must consider the status of objects involved in the test, say State, defined as
the set of all values of all objects for a given set of objects at any given moment [Bei95].
Inputs are any means by which data can be supplied to a test like, for instance, stored
data, generated data, keystrokes, messages. Outputs are the observable state of objects
at the end of a test execution, including a changed data object, a message, a new screen,
the absence of a message, a blank screen. Some authors distinguish outputs and outcomes,
but for simplicity we are considering the former as a general term for both.
From the definitions above, we can now describe the structure of a test case, which is composed of the following parts [Jor02]:
• Initial State: the state in which all test data should be set before the test execution begins.

• Procedures: the inputs and actions that should be performed on the test objects during the test execution.

• Expected Results: the outputs that should be expected during the execution of the test procedures.

• Final State: the output state in which all test data are left after the test execution.
In the context of mobile phones, in which this work is inserted, the Initial State is also called the Initial Conditions, and the Procedures are described as a set of Steps, each composed of the action to be performed and its expected behavior (Expected Results).
There are many definitions and classifications that can be applied to software testing, and some of them are not a consensus among testing authors. However, we need to provide some information about how the testing activity can be classified in order to use these definitions throughout this document.
Broadly, tests can be divided into categories, each with its own subdivisions. We consider the following categories: Level of Testing, Test Design and Testing Approaches.
2.1.1 Levels of Testing
Commonly, tests are divided into four basic levels, each one associated with a phase in the software development process. They are: Unit Testing, Integration Testing, System Testing and Acceptance Testing.
• Unit Testing: The focus is on atomic units of testing. The main goal is to take the smallest piece of software being tested, isolate it from other pieces and verify whether it behaves as expected. It is also called component testing, in which each unit is tested alone to find any errors in its source code. This level of testing is recommended to be done by programmers, since they know the code well and how to break it. It should guarantee that each module functions well before integration. There is a large variety of supporting tools for this level of testing, known as the xUnit family [Ham04];
• Integration Testing: This level of testing should explore the interaction and
consistency of successfully tested modules. For example, if components X and Y
have both passed their unit tests and there is a new component Z created from
the aggregation of X and Y, an integration test should be done to explore the
self-consistency of the new module.
• System Testing: At this level of testing, the entire software as a complete system should be tested. The software is tested together with the whole environment
where it is inserted. It is done to explore system behaviors that cannot be perceived
during unit or integration testing. Examples of some of these behaviors are: installation, configuration, throughput, security, resource utilization, data integrity,
storage management and reliability.
• Acceptance Testing: This is a process to obtain confirmation by the client of the
software under test, through trial or review, that there is a conformance between
the system and the requirements previously defined. It is usually performed by the
client itself or end-users.
2.1.2 Test Design
Another possible classification of tests is according to the design method used to create them. The two most common methods for test design are black-box testing and white-box testing. Other types of test design exist, such as gray-box testing, which is a hybrid approach combining the previous two.
• White-Box Testing: This kind of test takes advantage of knowledge of the internal mechanism of a system or component [IEE90]. It is also named structural testing or glass-box testing, as the tester has a clear view of how the software works internally. Tests are derived from the structure of the tested object, and the main concern is the proper selection of program or subprogram paths to exercise during the battery of tests.
They are normally used in unit testing since, generally, the testers are the programmers who know how the code is built. Some of the benefits of this kind of test design are: information about test coverage and control flow is easy to gather; testing can be focused, as atomic pieces of code can be tested; data integrity is more easily achieved by tracking the items in the code; internal boundaries are clearly visible in the code; and algorithm testing is facilitated (if a traditional algorithm is used, there are probably tests about it in the literature) [KFN99]. White-box testing is not the focus of this work.
• Black-Box Testing: In contrast, black-box testing assumes that the
system is viewed as a closed box, hence the testers should have no information about
how the code behaves or how it is structured. While white-box design allows the
view inside the system, and it focuses specifically on using internal knowledge of the
software to guide the selection of test data, black-box design does not explicitly use
knowledge of the internal structure of the software to generate tests. This design
method focuses on testing functional requirements.
Synonyms for black-box include behavioral, functional, opaque-box, and closed-box. Therefore, the main objective is to test the behaviors and functionalities of
the system without seeing how it works inside, but, just seeing the descriptions of
how it should execute its tasks.
The mechanism of this kind of design is to feed the program under test with inputs
retrieved from the knowledge of specifications or tester’s experience and observe the
output conformance. Black-box testers do not spend time learning the source code; instead, they study the program from its requirements, which is how the customer
will work with it.
These test designs can be applied together with any level of testing, although some levels are more inclined to specific designs: generally, unit tests are constructed from a white-box point of view and acceptance tests from a black-box one. What is almost a consensus is that, when used together, they increase software quality. The main reason is that they usually capture different types of failures in the system, as white-box testing works deeply into the code while black-box testing can have a wide view over the system. For example, although structural testing approaches make it easy to run certain types of tests, black-box thinking exposes errors that will elude white-box testers [KFN99].
We focus on black-box testing, since only the requirements and the executable code
are available in the context of the project in which this work is inserted.
2.1.3 Testing Approaches
There are many testing approaches for specific purposes. Some authors refer to them as additional levels of testing, but we define them as a group separate from the previous one. Some examples are Stress Testing, which runs tests to determine whether the software can withstand an unreasonable load with insufficient resources or extreme usage; Performance Testing, which runs tests to determine actual performance as compared to predicted performance; and Regression Testing, which is detailed below because of its relevance to the state of the art concerning Test Case Prioritization.
Regression Testing is used to confirm that a new release of software has not regressed, i.e., lost functionality. Thus, it is a testing process used to validate modified software and detect whether new faults have been introduced into previously tested code. The process consists of rerunning previously passed test cases to verify whether any functional part of the code has
been broken with the increments. There are different approaches where all tests or part
of them are executed again. However, re-executing tests every time a modification takes
place can be highly costly for systems with a high number of test cases; some approaches
try to minimize the size of the test suites to be run, but in practice this is not always an
available option to every software company [OTPS98]. Therefore, much research investment has been made to minimize this problem, and one of the suggested techniques is
Test Case Prioritization, which is discussed in Section 2.2.
Although testing is an effective way to discover bugs, it cannot guarantee the absence of errors in software; it can only show their presence [Dij72]. From this point of view, testing cannot prove the correctness of programs, as it is infeasible to evaluate all their possible executions. A simple example is a program whose behavior depends on a formal parameter of type integer. As there is an infinite number of integers, we cannot evaluate every one; thus, we cannot prove the program's correctness. We can, however, imagine some specific values that could drive the program to run in an undesirable way. For such
values we can use some testing techniques to improve the quality of our testing, which
would, potentially, improve the quality of our program.
Some testing techniques are described below.
• Boundary Testing: To solve the problem of testing all the possibilities of an input with an infinite number of values, the tester should choose the ones with a higher likelihood of causing flaws. This technique proposes tests that focus on putting the software under boundary and limit conditions. Once a limit situation is found, the tester must create a test with this limit value and with the values around it. Supposing that 10 and -10 are discovered as limit situations for the system, they should be tested together with 11, 9, -11 and -9; moreover, the central value is advised to be tested as well, so 1, 0 and -1 are also in the set of test data (a small sketch of this value-picking rule is shown after this list).

• Statement, Branch and Path Coverage: For statement coverage, each statement in the code should be executed at least once. Similarly, in branch coverage all outcomes of conditions should be tested along the program. In path coverage, every flow of execution is traversed. This coverage technique includes branch coverage, as it ensures the test of every decision outcome; in addition, path coverage verifies all dependencies among the branch coverage outcomes, becoming a more robust technique compared to the previous ones.

• Mutation Testing: Some defects are purposely introduced into the code to verify the effectiveness of test suites. These faults are inserted as small modifications to the program, one at a time, creating a new version of it which is referred to as a "mutant". If the tests detect the mutant, it is discarded, but if it passes undetected the tests are not effective enough. This technique is also used to generate efficient test suites.

• Equivalence Partitioning: This technique chooses some tests to be executed instead of the whole, potentially large, test suite. The selected tests must be representative of some class of tests called an equivalence partition, i.e., they must have similar behavior. The selected subset should cover all partitions; doing so, all possible scenarios are covered and the cost spent on test execution is reduced [Pre05]. An instantiation of this technique is described as Test Set Minimization [WHLM98]. According to Wong et al., a test set minimization procedure finds a minimal subset, in terms of the number of test cases, that preserves the coverage of the original test set with respect to a certain coverage criterion. An implemented example of this technique is the tool named ATACMIN [HL92], which generates minimized test sets. It uses an enumeration algorithm to find optimal solutions, and heuristics when exact solutions cannot be obtained in reasonable time.
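The sketch below illustrates the boundary-value selection rule from the Boundary Testing item above; the function name and the integer-only signature are our own simplification.

```python
# Sketch of boundary-value selection: limit values, their neighbours, and the
# central value with its neighbours, as in the 10 / -10 example above.

def boundary_values(lower, upper):
    center = (lower + upper) // 2
    values = set()
    for v in (lower, center, upper):
        values.update({v - 1, v, v + 1})
    return sorted(values)

print(boundary_values(-10, 10))
# [-11, -10, -9, -1, 0, 1, 9, 10, 11]
```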
The techniques presented above are related to improving the testing process by improving test case generation, detecting faults, guaranteeing effectiveness, and reducing the cost of the execution effort. Test Case Prioritization can also be an alternative to
address this latter issue, and it is the subject of the next section.
2.2 TEST CASE PRIORITIZATION
In the literature, the problem of ordering test cases is commonly known by the term test
case prioritization (TCP). It considers that a test set can be prioritized following some
criteria to achieve a goal. Unlike test set minimization and equivalence partitioning, this
technique does not suggest reducing the number of elements in a test suite, but rather arranges them in some sequence that will bring gains related to the goal to be reached. As we have mentioned before, reducing the size of test sets may not be possible in some cases, due to quality requirements, coverage criteria to be kept or testing targets to be met; thus, an alternative is to use a Test Case Prioritization technique suited to the purpose being pursued.
The test case prioritization problem is strongly tied to regression testing as seen in
various works [RUCH01, JG08, KP02, QNXZ07, Sri04, ST02, ZNXQ07]. As discussed
in Section 2.1.3, regression testing is a process used to validate modified software
and detect whether new faults have been introduced into previously tested code; it can
become a very expensive and laborious task. Nevertheless, Elbaum et al. [EMR02] show that test case prioritization can also be employed in the initial testing of software,
and sometimes, it is used with a different terminology [AW95]. In our approach, the
prioritization technique can be applied in any phase of the software testing execution;
once the tests are created they can be submitted to our technique which will generate an
ordered sequence of tests.
Elbaum et al. formally defined the test case prioritization problem as follows [EMR00]:
Definition 1. Let S be a test suite, P the set of permutations of S and f a function from
P to the real numbers. The test case prioritization problem consists in finding p ∈ P
such that
∀p′ ∈ P . (p ≠ p′) ⇒ (f(p) ≤ f(p′))
where P represents the set of all possible prioritizations of S, and f is a function that,
applied to any such sequence p, assigns a real value to that ordering, allowing
the comparison of two prioritizations. Then, the definition refers to a search inside P
for the best prioritization p. However, the main problem of test case prioritization is the
generation of P. We have to produce all permutations of test cases, which in practice is not feasible for a large number of tests. For instance, assuming that it takes 1 µsec to generate a single permutation, permuting up to 12 tests takes no more than a few minutes. However, for 17 tests, it takes about 10 years, and for more than 25 elements the time required is far greater than the age of the earth [Sed77]. Although this estimate is over 30 years old and today's computers are much faster, the complexity of the problem persists; faster hardware only allows the computation of a few more elements.
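As a concrete reading of Definition 1 for very small suites, a brute-force sketch is shown below. It is illustrative only: the cost function is a hypothetical stand-in, not the reuse function or the optimized permutation algorithm presented in Chapter 4.

```python
from itertools import permutations

# Each hypothetical test declares the data it needs; consecutive tests share
# whatever is already prepared, so the order changes the total setup effort.
DATA = {"t1": {"contact"}, "t2": {"contact", "picture"}, "t3": {"video"}}

def setup_effort(sequence):
    cost, prepared = 0, set()
    for t in sequence:
        cost += len(DATA[t] - prepared)
        prepared = DATA[t]
    return cost

def prioritize_exhaustively(tests, f):
    """Return a p in P minimizing f(p), as in Definition 1 (only feasible for small n)."""
    return min(permutations(tests), key=f)

best = prioritize_exhaustively(DATA, setup_effort)
print(best, setup_effort(best))  # e.g. ('t1', 't2', 't3') with cost 3
```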
This computational complexity can be seen as a variation of the Traveling Salesman
Problem (TSP), where the test cases are the cities, and going from one test case to another in the prioritization has an associated weight, which corresponds to the cost of traveling from one city to another. In our case, the main difference is that the cost of traveling from city
A to city B is different from the cost of traveling from B to A. Moreover, we do not need to
go back to the starting city (or the initial test case). But these differences do not change
the computational complexity of the problem which is classified as NP-hard [GJ79].
Because of the complexity of the problem, all related works focus on the usage of
heuristics, and we present them in the following chapter. Nevertheless, in our strategy,
we still explore permutation for small test suites and heuristics in the other cases.
CHAPTER 3
RELATED WORK
In this chapter we describe some test case prioritization techniques that are related to
our work. As the methodologies can be applied to different kinds of testing, we divided the prioritization techniques according to the level of code accessibility: white-box techniques, which use source code information, and black-box techniques, which use no such information at all. Each technique also has a specific purpose and is applied to a particular testing environment, as is ours.
3.1 WHITE-BOX TEST CASE PRIORITIZATION TECHNIQUES
Although our work is not a white-box technique, most work on test case prioritization takes into consideration the source code of applications to generate the orderings, including the paper that formally defined test case prioritization [EMR00]. Most of these techniques are applied to reduce the impact of regression testing, which can be very difficult to execute, as discussed in Section 2.1. As our technique does not involve the white-box context, we detail only one frequently referenced work on white-box regression testing prioritization, concentrating on its definitions and relevant concepts.
3.1.1 Prioritizing Test Cases For Regression Testing
The initiative reported in [RUCH01] is one of the most referenced works in the test case prioritization area. It formally defined the problem and discussed it in the regression testing context. The main motivation is the high cost of testing, especially in regression testing. The work considers other regression testing techniques, such as test selection and test suite minimization. The motivation for not using them is difficulties such as the loss of fault detection capability, which happens because they deal with a reduced set of test cases, which means coverage cannot always be maintained.
Then, test case prioritization is proposed as a new approach that does not lose information, but arranges test cases in some way such that the faults are discovered earlier. Furthermore, nine techniques are described, the first three of which are actually pseudo-techniques, used only for comparison, while the other six represent heuristics that
use test coverage information, produced by prior executions of test cases, to order them
for subsequent execution. The techniques are:
1. No prioritization: the application of no technique.
2. Random prioritization: randomly order the test cases in a test suite.
3. Optimal prioritization: This is not a practical technique when the size of the
suite is very large (as discussed earlier in Chapter 2) because it might not be possible
to generate all permutations for suites with large number of tests, but this technique
is used to measure the success of the other practical heuristics. To avoid generating
all permutations, they have developed a greedy “optimal” prioritization algorithm,
even knowing that it may not always produce the best ordering.
4. Total statement coverage prioritization: prioritize test cases in terms of the
total number of statements they cover by counting the number of statements covered
by each test case and then sorting the test cases in descending order of that number.
5. Additional statement coverage prioritization: Initially selects the test case that covers the most statements (in the same way as total statement coverage prioritization), but the following test cases are chosen according to the number of statements they cover that have not yet been covered, i.e., the ones that add the most new statements to the coverage (a sketch of the total and additional variants is shown after this list).
6. Total branch coverage prioritization: the same as total statement coverage prioritization, except that it uses test coverage measured in terms of program branches
rather than statements.
7. Additional branch coverage prioritization: the same as additional statement
coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.
8. Total fault-exposing-potential (FEP) prioritization: the order is based on
the ability of a test case to expose a fault. It is obtained from the probability that a
fault in a statement will cause a failure for a given test case. This probability must
be an approximation of the fault-exposing-potential, and it is acquired through the
adoption of an approach that uses mutation analysis to produce a combined estimate of propagation-and-infection that does not incorporate independent execution
probabilities.
9. Additional fault-exposing-potential (FEP) prioritization: Analogous to the
extensions made to total statement and branch coverage prioritization to yield additional statement and branch coverage prioritization, total FEP prioritization is
extended to create additional fault-exposing potential (FEP) prioritization. After
selecting a test case t, other test cases that exercise statements exercised by t have
their FEP values lowered, then, another test is selected based on the best FEP
value and the process repeats until all test cases have been ordered.
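The sketch below is a hedged illustration of techniques 4 and 5 above (total and additional statement coverage prioritization); the coverage data is hypothetical and tie-breaking is arbitrary.

```python
# Total vs. additional statement coverage prioritization (techniques 4 and 5).
# coverage maps each test to the set of statements it covers.

def total_coverage_order(coverage):
    """Sort tests by the total number of statements each one covers (descending)."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def additional_coverage_order(coverage):
    """Greedily pick the test that adds the most statements not yet covered."""
    remaining, covered, order = dict(coverage), set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        covered |= remaining.pop(best)
        order.append(best)
    return order

coverage = {"t1": {1, 2, 3}, "t2": {1, 2}, "t3": {4, 5}}
print(total_coverage_order(coverage))       # ['t1', 't2', 't3']
print(additional_coverage_order(coverage))  # ['t1', 't3', 't2']
```

The two orderings differ exactly where the additional variant skips a test whose statements are already covered.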
Furthermore, a series of experiments is described using a measure to compare the effectiveness of the various test case prioritization techniques. The APFD measure (the weighted average of the percentage of faults detected) plays the role of the function f discussed before. It
measures how quickly the prioritized suite detects the faults, with the range from 0 to
100. Higher APFD numbers mean faster (better) fault detection rates.
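For reference, the standard expression for APFD from the literature is reproduced below, where TF_i is the position, in the ordered suite, of the first test case that reveals fault i, n is the number of test cases and m is the number of faults (multiplying by 100 gives the 0-100 range mentioned above):

\[
APFD = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \times m} + \frac{1}{2n}
\]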
To perform the tests with the techniques, eight C programs were used as seen in Table 3.1, with the information about lines of code, number of versions, number of mutants,
the test pool size and the average test suite size.
The first seven are programs from Siemens Corporate Research and are commonly
known as the Siemens programs; the last one is a program developed for the European
Space Agency and is referred to as space. As can be seen, the space program is larger and more complex than the other subjects. These programs have also been used in experiments in other works on this topic [EMR02, JG08, KP02, JH03].
Some intuition acquired from experiments performed with the techniques described
shows, in general, that test case prioritization can substantially improve the rate of fault
detection in test suites. The six heuristics proposed obtained such improvements but in
the case of the program schedule, no heuristic outperformed the untreated or randomly prioritized test suites.

[Table 3.1: Subjects used in the experiments [RUCH01].]

In almost every case, including the space program, additional
FEP prioritization outperformed prioritization techniques based on coverage. Considering
the overall results on the Siemens programs, branch-coverage-based techniques almost
always performed as well as or better than their corresponding statement-coverage-based
techniques. Considering differences between total and additional branch- and statement-coverage-based techniques, there was no clear winner overall.
The authors conclude the paper explaining that, in this study, test case prioritization
techniques can improve the rate of fault detection of test suites, as shown in the empirical
studies, and that the APFD measure was left as a benchmark for comparing test case prioritization techniques and other measures. As future work, the gap between the optimal test case prioritization and the heuristics used, even the FEP-based techniques, was not bridged; thus, other heuristics could be investigated to reduce this gap. Finally, the test case prioritization problem has many other facets that can be addressed, with other objectives and goals, and not all of them were covered in this paper.
3.1.2 Other White-Box Initiatives
A series of works addresses problems related to white-box test design. An extension of the work previously described can be found in [EMR02], where 16 techniques are proposed (including 4 already presented in [RUCH01]); the other 12 are not related to statements or branches, but to functions. These techniques use coverage information, probabilities of fault existence or exposure, or a combination of them, to prioritize the tests. Their experiments show that function-level prioritization techniques had results similar to statement-level ones. Another extension was carried out later by Jeffrey and Gupta, who propose techniques based on statement coverage that take into consideration how statements affect each other [JG08]. They conclude that their techniques have more potential than the total statement and branch coverage prioritization techniques. Jones and Harrold presented a technique that prioritizes test cases using the modified condition/decision coverage (MC/DC) criterion, a demanding criterion that extends
branch coverage [JH03]. Their case study shows encouraging results to reduce the cost of
regression testing for users of the MC/DC testing criterion. Srivastava and Thiagarajan
propose a test case prioritization technique based on basic block coverage for large systems [ST02]. The work suggests that it is possible to effectively prioritize tests in large
scale software development environments by using binary matching to determine changes
at a fine granularity.
3.2 BLACK-BOX TEST CASE PRIORITIZATION TECHNIQUES
In this section we describe some works about test case prioritization techniques proposed
for tests with no information retrieved from the source code, but from requirements or
historical execution data.
3.2.1 Test Case Prioritization for Black Box Testing
The approaches discussed before gave us a good understanding of the prioritization problem, but their experiments rely on source code, i.e., white-box testing. The approach presented in [QNXZ07] has a closer relation to our problem, because it deals with black-box testing as we do, although it still tries to solve regression testing problems. Despite this, that work does not use code data to build the test prioritization; it uses runtime and historical information.
The purpose of this work is to test software in the situation where the source code
is not available. This approach reuses test suites and assigns each test case a selection
probability according to its historical performance, and then selects ordered test cases to run until the testing resources are exhausted. Algorithms were developed to order the tests. The main idea of these algorithms is to group all reused test cases by the fault types they have revealed, and then dynamically adjust their priorities according to the test results of
related test cases.
The technique uses a matrix R that relates each test case to all the others, so once a test case reveals regression faults, the related test cases can be adjusted to a higher priority to achieve better rates of fault detection. Given a test suite T, a subset T′ of T and the relation matrix R, the general process is described as follows:
1. Select T′ from T and prioritize T′ using the available test history.
2. Build the test case relation matrix R based on the available information.
3. Draw a test case from T′ and run it.
4. Reorder the remaining test cases using run-time information and the test case relation matrix R.
5. Repeat steps 3 and 4 until the testing resources are exhausted.
The approach presents three algorithms: one builds the relation matrix, and the other two escalate and de-escalate test cases. The first is described based on the matrix, where each cell is a boolean value which is true if the two related test cases reveal the same fault type, and false otherwise. For simplicity, in the paper's example, a test case can reveal only one fault type. So, the matrix is traversed and filled following this rule. Based upon the relation matrix, the regression testing strategy selects one test case to run, and then escalates or de-escalates the priorities of its related test cases that have not been used yet. The Escalate algorithm gives higher priorities to the test cases related to the previously executed test case when it has failed. The Deescalate algorithm does exactly the opposite: when a test case has run successfully, it gives lower priorities to its related test cases, pushing them toward the last positions.
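A hedged sketch of this dynamic adjustment idea is shown below. It is our own simplification, not the exact algorithms of [QNXZ07]; R[a][b] is true when tests a and b revealed the same fault type in past executions.

```python
# Simplified escalate / de-escalate step: after running one test case, move its
# related test cases (same historical fault type) to the front if it failed,
# or to the back if it passed.

def reorder(pending, executed, failed, R):
    related = [t for t in pending if R[executed][t]]
    unrelated = [t for t in pending if not R[executed][t]]
    return related + unrelated if failed else unrelated + related

R = {
    "t1": {"t2": True, "t3": False},
    "t2": {"t1": True, "t3": False},
    "t3": {"t1": False, "t2": False},
}
print(reorder(["t3", "t2"], "t1", failed=True, R=R))   # ['t2', 't3'] (escalated)
print(reorder(["t3", "t2"], "t1", failed=False, R=R))  # ['t3', 't2'] (de-escalated)
```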
Further, some examples and experiments using this approach are shown in [QNXZ07]. A new metric, M0, is proposed for black-box testing, given that APFD is hard to use in a black-box environment, because finding the number of faults that each test case reveals is said not to be an easy task. Let t_i be the i-th test case in the ordered test suite, let time_i be the execution time of t_i, and let f_i be a Boolean value that indicates whether t_i has failed. M0 for a test suite T is given by the first equation, seen in Figure 3.1, and the second metric, M1, is used when execution time is not relevant, as seen in Figure 3.2:
\[
M_0 = \frac{1}{2} \cdot \frac{\sum_{i=1}^{n} \left[ \left( 2\sum_{j=i}^{n} time_j - time_i \right) \times f_i \right]}{\sum_{i=1}^{n} time_i \times \sum_{i=1}^{n} f_i} \times 100\%
\]

Figure 3.1 The calculation of M0.

\[
M_1 = \frac{1}{2} \cdot \frac{\sum_{i=1}^{n} \left[ (2n - 2i + 1) \times f_i \right]}{n \times \sum_{i=1}^{n} f_i} \times 100\%
\]

Figure 3.2 The calculation of M1.
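To make the metric concrete, a small script computing M1 for an ordered suite is shown below; it is a direct transcription of the formula in Figure 3.2, with a boolean list standing for the f_i values.

```python
# M1 for an ordered test suite: failed[i] is True when the (i+1)-th test fails.

def m1(failed):
    n, m = len(failed), sum(failed)
    if m == 0:
        return 0.0
    total = sum(2 * n - 2 * (i + 1) + 1 for i, f in enumerate(failed) if f)
    return total / (2 * n * m) * 100

print(m1([True, True, False, False]))   # failing tests first -> 75.0
print(m1([False, False, True, True]))   # failing tests last  -> 25.0
```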
The experiments were made using Microsoft Word and PowerPoint 2003 with the service pack (SP2) and all bug fixes. The metric used was M1, and the results show that the strategies using the escalate and de-escalate algorithms are, on average, 10% better than the strategy with no adjustment. The authors then conclude that the dynamic adjusting methods help improve the test suite's fault detection rate for black-box testing.
Finally, the authors explain that, although their relation matrix is built based on fault types, it can be built using other kinds of information, such as test objectives. For future
work, the authors mention that, as the algorithms are nonnumeric, they intend to research numerical algorithms to adjust selection probabilities of test cases and perform
experiments to compare the approaches.
3.2.2 The Automatic Generation of Load Test Suites and the Assessment of the Resulting Software
The main purpose of the approach reported in [AW95] is to generate ordered test case sequences for telecommunications systems. Three automatic test case generation algorithms are introduced, and the authors claim that they can be applied to any software that can be modeled by a Markov Chain [Nor98] using collected data and estimates.
The approach is based on a telecommunication profile that includes information about the number of different types of calls, the average service demands and the probability distribution for each type of call, and the average external arrival rate and probability distribution for each call type. Using this information, a model based on Markov Chains is constructed, but this model can become very large given a high number of calls. For instance, to test a system with five different call types, each holding at most 500 calls, 268,318,178,226 Markov states would be necessary, requiring over 13,000,000,000 hours of testing assuming a three-minute holding time. So, the algorithms deal with this problem by reducing the number of states, choosing the representative states that are the most likely to be executed. This is done using a constant ε that represents the minimum probability that a state must have in order to be used in a test case.
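The quoted size of the state space can be checked with a short calculation. Under our reading that a state records how many calls of each of the five types are in the system, with at most 500 calls in total, the number of states is the number of non-negative integer 5-tuples whose sum does not exceed 500; this interpretation is ours and not a formula taken from [AW95].

from math import comb

call_types, max_calls = 5, 500

# Non-negative solutions of c1 + ... + c5 <= 500: C(500 + 5, 5).
states = comb(max_calls + call_types, call_types)
print(states)                 # 268318178226

# With a three-minute holding time per state, the testing effort in hours:
print(states * 3 / 60)        # roughly 1.34e10, i.e. over 13 billion hours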
Three algorithms are described to solve the problem. The first generates all states that have a steady-state probability greater than ε. The second algorithm generates, in order, all state transitions such that the product of the state probability and the call arrival rate is greater than ε. It uses the first algorithm to generate all states of interest and then filters them based on the cited restriction. The third algorithm extends the second one by considering sub-paths rather than individual edges.
Empirical studies were performed using five industrial telecommunications software projects ranging in size from 50,000 to more than one million lines of code. Furthermore, a reliability measure was defined to evaluate the approach. This measure could be used to track the progress of the load testing algorithms and to guide test planning.
The results with these projects show the detection of serious program faults that, the authors predict, would not have been detected until field release. They also report greatly facilitated regression testing and fault localization capabilities due to the use of these algorithms, and they plan to apply these notions to domains other than telecommunications systems. This paper is cited in [EMR02] as a work where test case prioritization is used not for regression testing but at initial testing.
3.2.3 Requirements-Based Test Case Prioritization
As our problem deals with tests acquired from requirements, the work reported in [Sri04] is one of the most closely related to ours. Besides requirements-based testing, it also uses a black-box approach at the system level. Another difference from other approaches is that its main goal is to identify the severe faults earlier and to minimize the cost of test case prioritization. An enhancement of test case prioritization is proposed by incorporating additional knowledge from requirements engineering research.
The approach proposes a new strategy called Prioritization of Requirements for Testing (PORT). It uses 3 prioritization factors (PFs) to order test cases: (1) customer-assigned priority on requirements, (2) requirement complexity and (3) requirements
volatility. These factors are assigned values. The first, customer-assigned priority on
requirements (CP), is a value from 1 to 10 assigned by customers based on how important the requirement is. The second, requirement complexity (RC), is also a value from
1 to 10, but it is assigned by the developer based on the perceived implementation difficulties of the requirement. The last, requirements volatility (RV), is the number of times
a requirement has changed. The higher the factor value of a requirement, the higher the
priority of test cases related to that requirement.
Another task is the assignment of weights to the PFs such that the total weight (1.0) is divided amongst all of them. The assignment is done by the development team, but must be in accordance with the customer needs. Given these values, a weighted prioritization (WP) is calculated for all requirements, as seen in Figure 3.3.
WP = \sum_{PF=1}^{n} (PF_{value} \times PF_{weight})
Figure 3.3 The weight prioritization equation [Sri04].
Test cases are then ordered such that the test cases for requirements with a high WP are executed before the others. An example is shown in Table 3.2, where the best prioritization order is R1, R4, R3 and R2.
Factors             R1    R2    R3    R4    Weights
Customer Priority    7     4    10     8    0.30
Req. Complexity     10     5     4     9    0.40
Req. Volatility     10     8     5     5    0.30
WP                  9.1   5.6   6.1   7.5   1.00
Table 3.2 An example using PORT [Sri04].
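The WP values and the resulting order in Table 3.2 can be reproduced with a few lines of Python; the dictionaries below simply restate the factor values and weights of the table.

# Factor values per requirement and the weights assigned by the development team.
factors = {
    "R1": {"CP": 7,  "RC": 10, "RV": 10},
    "R2": {"CP": 4,  "RC": 5,  "RV": 8},
    "R3": {"CP": 10, "RC": 4,  "RV": 5},
    "R4": {"CP": 8,  "RC": 9,  "RV": 5},
}
weights = {"CP": 0.30, "RC": 0.40, "RV": 0.30}

# WP = sum over the prioritization factors of (PF value x PF weight).
wp = {req: round(sum(vals[pf] * weights[pf] for pf in weights), 2)
      for req, vals in factors.items()}
order = sorted(wp, key=wp.get, reverse=True)
print(wp)      # {'R1': 9.1, 'R2': 5.6, 'R3': 6.1, 'R4': 7.5}
print(order)   # ['R1', 'R4', 'R3', 'R2']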
Further, a case study is detailed on an industrial project that comprises 152 KLOC.
The preliminary results show that requirements complexity and volatility impact fault
density. For future work, an analysis on the industrial data to determine the effectiveness
of customer priority should be made.
3.2.4 A Test Execution Sequence Study for Performance Optimization
The research reported in [RBJ07] has the objective of optimizing the execution of mobile phone tests inside the BTC program. As said before in Chapter 1,
the tests are in natural language and the testers should read the test cases, perform the
actions described and verify if the state of the phone corresponds to the description of
the expected results in the test case.
As the testers have a high demand of daily test suite executions, the time spent on the execution of these tests becomes a critical factor for the teams, so any effort to increase their efficiency could help these teams to provide a better service. Another aspect is that, usually, testers do not follow a predefined order, but rather their own feeling, based on practical experience, of how well they could execute a test suite. These decisions end up being random and personal.
In this context, a test case structure can be described as follows:
• Pre-condition (Setup) - the setups needed in order to start the test case execution. This information is treated differently from the other data because it does not change during the test execution, and the setups should be performed together at the beginning of the whole ordered test sequence. Generally, this setup information is related to setting bits or basic configurations on the phone. It is not considered in the reuse strategy.
• Input Data - data needed for the execution of the test, which can change during test execution to produce outputs. They are considered in the reuse.
• Output Data - data resulting from the execution of the test; elements that should be in the state of the phone at the end of the test. They are considered in the reuse.
So, the main idea is to build an ordered sequence based on the reuse of this information, minimizing the time spent to calibrate a mobile phone for a test execution. As a test case can be represented as a set of inputs, some procedure and a set of outputs, the objective is to generate a sequence where each test case is linked to the next one by combining inputs and outputs.
The methodology is proposed not to be mechanized but to be followed by the testers as a guide. It is divided into a sequence of steps. The first step is a survey of the configurations and the input and output data of the test cases. As said before, the configurations (setups) are done before the entire test sequence is executed. The input and output data discovered are tagged with a unique ID and organized in a table, which serves to identify them later. All test cases are then redefined following a standard form, which presents a test case as a unique ID, a set of inputs and a set of outputs using the tagged IDs from the previous step. To use the methodology, a root test case is created, named INIT, which represents a fictitious test case that gives the sequence an initial point to be ordered from. It has no inputs, and usually its outputs are the ones that require the minimal effort to be produced (for example, "0 contacts in the phonebook").
Once all these steps are performed, the ordering itself can start. It is done using a tree data structure, where INIT is the root node. To define the children of a node, one looks for the test cases whose inputs intersect the outputs of the current node. So, the children of the root are the test cases whose input elements are contained in the set of output elements of the root node, and so on. There are some rules to be followed. For instance, a node cannot be repeated in its own sub-tree, and whenever it is not possible to insert new nodes, that sequence is finished.
For example, in Table 3.3, we have some test cases with their input and output sets. INIT is the first test case in the table. The intersections among the test cases are highlighted in bold. We can see in Figure 3.4 that TC 097, TC 255 and TC 256 are the children of the root node (their inputs intersect the output of INIT).
Following this rule, the whole tree is constructed. Each branch of the tree corresponds to a test case sequence. Once the tree is constructed, the ordered sequence is chosen by taking the deepest branch of the tree. However, there are cases in which not all test cases are in the deepest branch. In such cases, a new tree with the remaining test cases is built and the process is repeated until there are no test cases left. When there are two or more branches with the same depth, the leftmost path is chosen to make the algorithm deterministic. An example of a tree using the elements of Table 3.3 is displayed in Figure 3.4.
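A rough Python sketch of this tree construction is given below. The test cases and their input/output sets are hypothetical stand-ins for the ones in Table 3.3, and the recursion follows the rules just described: a child must have its inputs contained in the outputs of its parent, no test is repeated along a branch, and the deepest (leftmost on ties) branch is taken, restarting with the remaining tests.

# Hypothetical test cases: id -> (input data set, output data set).
tests = {
    "TC_097": ({"0 contacts"}, {"1 contact"}),
    "TC_255": ({"0 contacts"}, {"1 contact", "1 GIF"}),
    "TC_301": ({"1 contact", "1 GIF"}, {"2 contacts"}),
}
INIT_OUTPUTS = {"0 contacts"}   # outputs of the fictitious INIT root (it has no inputs)

def deepest_branch(outputs, remaining):
    best = []
    for i, tid in enumerate(remaining):   # list order plays the role of "leftmost"
        inp, out = tests[tid]
        if inp <= outputs:                # child rule: inputs contained in the parent's outputs
            branch = [tid] + deepest_branch(out, remaining[:i] + remaining[i + 1:])
            if len(branch) > len(best):
                best = branch
    return best

def order_suite(all_ids):
    sequence, remaining = [], list(all_ids)
    while remaining:
        branch = deepest_branch(INIT_OUTPUTS, remaining)
        if not branch:                    # nothing can follow INIT: append the rest as they are
            sequence.extend(remaining)
            break
        sequence.extend(branch)
        remaining = [t for t in remaining if t not in branch]
    return sequence

print(order_suite(tests))   # ['TC_255', 'TC_301', 'TC_097']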
Table 3.3 Test case representation and their relationship [RBJ07].
Figure 3.4 An example of an ordered tree using the approach [RBJ07].
A case study was carried out comparing this approach with the ordered sequences created by the testers based on their feeling and personal experience (what they had been using up to this research). The approach showed better results in 83% of the executions, and the average gain was over 5.56%, reducing the execution time of the sequences by around 4 minutes. The authors claim that the saved time could be used to execute one or even two additional test cases. Currently, this methodology is being used by some teams inside the
BTC. Although this approach has brought some gains in efficiency terms, it does not
always bring the optimal ordered sequence. A generated sequence can occasionally be
the optimal one, but it is not certain.
In the next section, we discuss the contributions, criticisms and opinions about the related work and its relevance to our problem.
3.3 DISCUSSIONS
Each work has its importance in the test case prioritization area, but with respect to our problem, we have some more specific considerations about them.
The approach of [RUCH01] is a remarkable contribution to the test case prioritization area because it gives a formal definition of the problem and contributed to establishing a research topic under the term "test prioritization". Test case prioritization schedules test cases in order to increase their ability to meet some performance goal. Although this approach gives us a background of the problem from a standard perspective, its contributions are not applicable to our problem, because the approach is designed for a white-box setting and takes into consideration only regression testing.
On the other hand, the work reported in [QNXZ07] is in the context of black-box testing, taking into consideration the fault types of each test case and using algorithms to determine the ordering sequence, but its focus is on regression testing problems.
The approach in [AW95] presents a black-box example of test case generation and
prioritization that works at the initial testing phase. This approach requires the system
to be modeled as Markov Chains. It also requires an operational profile of the problem
context. The example presented uses an operational profile from a telecommunications
domain. We still need to further investigate whether Markov Chains are a suitable
mathematical model for our application domain.
The approaches reported in [Sri04] and in [RBJ07] present closely related work. The first describes a black-box, non-regression testing approach based on requirements. Its goal is to discover the severe faults earlier in the testing process. For that, it uses some factors and weights, given by customers and developers, to compute a value for each requirement. Based on these values, the tests are ordered according to the requirements they are associated with. However, no prioritization is applied to the test cases of the same requirement.
The second work was the main inspiration for our research, and the first attempt to optimize test case execution through prioritization inside the BTC program. This research provides some good results, as shown earlier, and its methodology is already adopted by some test execution teams. However, there are also some drawbacks.
As said before, following those steps does not always guarantee the generation of
the best execution sequence. In order to recover the best sequence we have to analyze
all possible permutations. On the other hand, a feasible generation of all permutations
depends on the size of the test suite. When we have a small set of test cases, it could be
feasible. Our experiments showed that the perfect sequence can be generated from suites
of up to 15 test cases in the worst case.
In cases where generating the perfect sequence is infeasible, we can also experiment with different heuristics. However, heuristics cannot guarantee the best alternatives. Using heuristics, we are susceptible to making arbitrary choices that lead us to a solution that may not always be a good one. For instance, one of the choices reported in [RBJ07] is the definition of the root node, which restricts the possible starting points to nodes that have minimal preparation effort. However, there might be cases in which we start from test cases with medium or high preparation effort and still get a minimal cost at the end. Another choice is the selection of the leftmost branch when there are two sequences with the same depth. These decisions can interfere with the quality of the sequences generated.
Moreover, this approach lacks a formal definition of the measure function f, which makes it difficult to compare the results of different heuristics. With a numeric form we can use mathematical concepts and return a value that represents the results in a more concrete and clear way.
Finally, except for [RBJ07], all approaches focus on ordering the test cases according to the chance of discovering faults, while only we and [RBJ07] use test case prioritization to optimize test case execution considering the inputs and outputs provided.
CHAPTER 4
PERMUTATION TECHNIQUE FOR BLACK-BOX TEST CASE PRIORITIZATION
Test Case Prioritization can be used to achieve many goals, as shown in the previous chapter, and most of the techniques developed come from well-defined heuristics. The main reason to use heuristics is the difficulty of finding the best prioritization, given that it is a problem of factorial complexity. However, contrary to other initiatives that do not consider an optimal solution, we noticed that an exhaustive strategy that yields the best results is feasible for small test suites. Whenever this strategy is not feasible, heuristics must be taken into consideration, even knowing that the optimal result may not be achieved. We discuss some proposed heuristics further in Chapter 6. A similar initiative has been made regarding Test Set Minimization, discussed earlier. That problem is also NP-complete, so the same restrictions regarding computational cost apply. The tool ATACMIN [HL92] uses an exponential algorithm to find the best solutions, and it also implements some heuristics to be used in the cases where such solutions cannot be found in feasible time.
We developed this work within the mobile phone testing environment of the BTC program. We deal with functional tests where the source code of the system under test is not available; the tests are specified in English and are executed manually. By observing the daily work activities at the BTC, we noticed that several test suites contain a small number of test cases. A feasibility study was carried out to verify whether an exhaustive technique could be applied. We ran a permutation algorithm (without any optimizations) to check the maximum number of tests we could handle with this technique. We had in mind that such an algorithm could be left running for an entire day, as the testers receive the test suites one day in advance of their execution.
These results are listed in Table 4.1. We notice that a test suite with 15 test cases can be processed (in the worst case) in less than 24 hours. Clearly, generating all permutations of 16 tests or more takes more than one day. This was the main motivation to develop an approach which exhaustively looks for the best sequence, as this number fits well with the general test suite size. Larger test suites can still be prioritized with our optimized permutation algorithm (to be explained later).
# of Elements    Time      # of Sequences
1                < 1 ms    1
2                < 1 ms    2
3                < 1 ms    6
4                < 1 ms    24
5                < 1 ms    120
6                < 1 ms    720
7                < 1 ms    5,040
8                < 1 ms    40,320
9                16 ms     362,880
10               78 ms     3,628,800
11               969 ms    39,916,800
12               11 s      479,001,600
13               2 min     6,227,020,800
14               34 min    87,178,291,200
15               8 h       1,307,674,368,000
Table 4.1 Feasibility study results.
Let us recall Definition 1 (Chapter 2):
Let S be a test suite, P the set of permutations of S and f a function from P to the real numbers. The test case prioritization problem consists in finding p ∈ P such that
∀p′ ∈ P. (p ≠ p′) ⇒ (f(p) ≤ f(p′))
Unlike all the related work, we are not concerned with the definition of appropriate heuristics for test sequence generation. Our main concern is the definition of an appropriate f which captures the correct notion of an efficient sequence. In our case, f is related to the level of reuse of the phone's data and configurations. Our function f computes the time the tester spends preparing the phone between the execution of two tests.
Before detailing the permutation strategy, we define some concepts related to a test case in the mobile phone environment:
Identifying Information ID, suite and description.
Setups A set of configurations that does not change during the execution of a suite. For instance, the setup of specific configuration bits on the mobile phone does not change during the execution of the entire suite. These configurations should be performed before the execution of the whole test sequence.
Phones Each phone can be a PUT (Phone Under Test) or an auxiliary phone.
Input/Output State The input state is the state of the phone required before the execution of a test, i.e., the data that must be stored on the phone beforehand. The output state is the state of the phone after the execution of a test. Each input/output state is composed of a set of input/output data. For instance, "the phone contains 3 GIF files" and "there is 1 contact in the phonebook" are examples of input and output states.
4.1 THE PERMUTATION TECHNIQUE
We developed a technique based on the generation of all possible permutations of a test case set [LIS08]. This decision is justified by the fact that many test suites we deal with have a limited number of elements, which allows current computers to generate all possibilities and return the best results.
Our function f reflects the level of reuse of data (input and output) of mobile phones. Each piece of data has an associated generation cost, which is the time spent to set it up on the phone. For example, a JPEG file has an associated time to be added to and an associated time to be deleted from the phone.
The cost computed by f is simply the total time spent preparing the input states from one test to the next. The output state is used to verify possible reuses from one test to the next in the sequence. Output data from a test that match input data of the next test in the sequence do not need to be generated for that test; consequently, no effort is spent and nothing is added to the sequence cost.
Figure 4.1 shows a simplified example of how the calculation works. Each box represents a test case, of which we only show the input and output states. For the first element, we cannot reuse any data as there is no previous test, so we add to the total cost of the sequence the time to generate the input state of test A; in this case, the time a tester takes to insert 1 Email. After test A is executed (the steps for executing the test are omitted from Figure 4.1), we can reuse the outputs of A in the inputs of B. Note that the execution of test A produced a second Email. So, before executing test B, the tester does not need to add the two Emails required, only the GIF file. The time spent on adding a GIF is added to the total cost. Finally, from B to C, we can reuse the two Emails and one of the two GIFs needed in C, adding to the total cost only the time spent on adding one more GIF.
Figure 4.1 An example using the reuse function.
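The calculation illustrated in Figure 4.1 can be written as a small Python function. Here, input and output states are represented as multisets (collections.Counter) of data items, gen_time holds per-item generation times, and only the data left by the immediately preceding test is reused; all names and times are illustrative, not taken from the tool.

from collections import Counter

gen_time = {"Email": 30, "GIF": 24}   # illustrative generation times, in seconds

def sequence_cost(tests):
    # tests: list of (input_state, output_state) Counters, in execution order.
    # Returns the total time spent preparing input states, reusing the data
    # left on the phone by the previous test (as in Figure 4.1).
    total = 0
    available = Counter()                 # what the previous test left on the phone
    for inputs, outputs in tests:
        missing = inputs - available      # items that still have to be configured
        total += sum(gen_time[item] * qty for item, qty in missing.items())
        available = outputs               # only the immediate predecessor is considered here
    return total

A = (Counter({"Email": 1}), Counter({"Email": 2}))
B = (Counter({"Email": 2, "GIF": 1}), Counter({"Email": 2, "GIF": 1}))
C = (Counter({"Email": 2, "GIF": 2}), Counter({"Email": 2, "GIF": 2}))
print(sequence_cost([A, B, C]))   # 30 (1 Email) + 24 (1 GIF) + 24 (1 more GIF) = 78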
The time to generate each kind of input data can be obtained either by interviewing the testers, by measuring the testers' activities during their regular daily work, or (as we did in our case) by carrying out a small experiment (see Chapter 5). Notice that such measurements need to be taken only once for each kind of data (like saving a GIF file or deleting an e-mail), as they can be reused in other prioritization activities. Another consideration is that we are not interested in the time spent during the execution of the test steps. This is understandable as no reuse can take place in such tasks, because the input state has already been prepared. Therefore, we assume that the time spent in the execution of these procedures does not affect our calculations. Although we know this value can change according to other variables, such as the tester's experience, motivation and the level of difficulty, among others [AB08b], this variation is immaterial with respect to the reuse of data from one test to the next.
There is still one issue missing in this calculation: what about tests with multiple phones (say, several PUTs and several auxiliary phones)? How do we reuse their input/output information? This scenario makes the calculation a little more complex, because we have to decide which phone will reuse the output state of which phone from the previous test. We must link each phone from a test A to a phone from a test B in the best way, which is the one where more reuses take place. Before calculating the final cost, we have to combine, say, each PUT's output state with another PUT's input state to check which combination produces the lowest cost. Figure 4.2 shows possible combinations: in 4.2(a) the output state of phone X is reused by phone Y′ and, similarly, the output state of phone Y is reused by phone X′, while in 4.2(b) we have the opposite situation (assuming that all phones are, say, PUTs).
To perform this verification, we apply an algorithm similar to the permutation one.
For instance, the phones of test A in Figure 4.2 are combined exhaustively with all phones
of the same kind in test B. By traversing all possible combinations we can verify which
pairs make more reuses. As tests usually require one or two phones (three at most), the
generation of these pairs does not impact the total sequence generation time. We do not
promote reuse from a PUT to an auxiliary phone (or vice-versa) as, in general, auxiliary
phones cannot be used as PUTs.
Figure 4.2 Possible phone combinations.
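A sketch of this pairing step is shown below: every assignment of the previous test's phones of a given kind to the next test's phones of the same kind is tried, and the one that reuses the most data is kept. Counting reused items is a simplification we adopt for illustration; the tool compares the resulting configuration costs.

from itertools import permutations

def best_pairing(prev_outputs, next_inputs):
    # prev_outputs / next_inputs: lists of data sets, one per phone of the same kind
    # (e.g. the PUTs of test A and of test B, assumed to be in equal number).
    best, best_reuse = None, -1
    for perm in permutations(range(len(prev_outputs))):
        reuse = sum(len(prev_outputs[perm[i]] & next_inputs[i])
                    for i in range(len(next_inputs)))
        if reuse > best_reuse:
            best, best_reuse = perm, reuse
    return best, best_reuse

outs_A = [{"1 GIF"}, {"2 Emails"}]               # two PUTs in test A
ins_B = [{"2 Emails"}, {"1 GIF", "1 Video"}]     # two PUTs in test B
print(best_pairing(outs_A, ins_B))   # ((1, 0), 2): cross-pairing reuses both data items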
Another aspect related to the calculation of the cost is the decision to accumulate the
state of the phones from one test to the next (a stateful approach). Keeping the state
means that a phone that is used through a sequence of tests can accumulate data from
the reuses of all test cases executed. An example can be seen in Figure 4.3, where the
state of test A is updated with an Email and a GIF file. Note that the GIF file is not
used by the test B, but it is still in the phone state. So, it can be reused by the test C
without any configuration cost added to the sequence. The GIF file is only reused by test
C, as a result of the accumulated state of the phone after executing tests A and B.
Figure 4.3 An example keeping the phone state.
Alternatively, we can calculate the cost without taking into account the cumulative phone state, by looking strictly at the possible reuses between two consecutive test cases. In this case, the cost of running two particular tests is immutable regardless of the sequence (the context) they belong to, because they do not suffer any interference from tests executed previously. This variation (called stateless)
allows us to calculate the costs of reuse between any two tests in advance before generating
the permutations.
Both the stateless and the stateful approaches have advantages and disadvantages. The former has better performance, as the costs between tests do not change and are calculated only once before the permutation starts; therefore, they can be stored in advance and retrieved when needed. This cannot be applied to the stateful algorithm, as the state of the phones changes along the sequence and affects all the following tests, so each possible sequence requires the cost to be computed on the fly. However, the stateful approach undoubtedly produces a more accurate calculation concerning the execution time (see Section 5.1.2.1).
4.1.1 Optimizations
As a permutation algorithm is computationally very costly, some optimizations were implemented to allow the algorithm to finish its processing in less time. They are the pruning mechanism, equivalence classes, transitive closure and user-defined clusters. The first prunes sequences which will never be elected as optimal, while the others try to reduce the number of test cases to be given as input to the permutation algorithm; the equivalence classes and the transitive closure do so without affecting the original optimal sequence of the suite. By reducing the size of the suite, we also reduce the time to generate the permutations.
Pruning Mechanism: The goal is to avoid spending time on sequences that will never be considered optimal. Initially, a greedy algorithm returns some test sequence. Its cost is used as the initial value of the best cost found so far, which is used to prune sequences incapable of being optimal. The time to run the greedy algorithm is insignificant compared to the whole permutation algorithm, so it does not affect the total computing time.
While the sequences are being built, the permutation algorithm also calculates their partial costs. If a partial cost exceeds the best one found so far, the current sequence is abandoned together with any other sequence that has the same prefix. Hence, the total computation time is decreased.
An example is shown in Figure 4.4. In the first two rows, the sequences are generated entirely, i.e., up to the last element, and are compared with the best cost found. In the third row, the cost up to the second element already exceeds the best cost. Thus, the sequence being built is abandoned together with all sequences starting with Test1 followed by Test5. The same happens in the fourth row. The fifth row is calculated completely and a new best cost is found, so the best cost variable is updated to 58.
Figure 4.4 An example using the pruning mechanism.
The algorithm that generates the permutations with the pruning mechanism is shown in Algorithm 4.1. It is recursive and takes five parameters: the list of test cases being prioritized (tests[]); an auxiliary list that represents the sequence being generated (aux[]); an auxiliary list of boolean values used to control which tests have already been placed (marked[]); an integer that controls the size of the sequence being generated (depth); and an integer that keeps the cost of the sequence generated so far (totalCost). There are also two global variables: bestCost maintains the best sequence cost found so far, and optimals is a list of the sequences that have the best cost at the end of the algorithm. The three lists used (tests[], aux[] and marked[]) have the same length, which is the number of tests being prioritized. The first time the algorithm is called, the parameters have the following values: tests[] contains the test cases being prioritized; aux[] is initialized with null values; marked[] has all boolean values set to false; depth and totalCost begin with 0; and bestCost is initialized by a greedy function (whose first element is chosen at random) with the cost of the sequence it finds. This initial bound is used to quickly prune the sequences whose costs exceed the cost of the greedy sequence.
Algorithm 4.1 permute()
Require: tests[], aux[], marked[], depth, totalCost
Ensure: tests.length = aux.length = marked.length
1:  for i = 0 to tests.length − 1 do
2:    if !marked[i] then
3:      marked[i] ⇐ true
4:      aux[depth] ⇐ tests[i]
5:      cost ⇐ calculateCost(aux, depth)
6:      if ((cost + totalCost) < bestCost) AND (depth < tests.length − 1) then
7:        permute(tests, aux, marked, depth + 1, cost + totalCost)
8:      else if (depth = tests.length − 1) AND ((cost + totalCost) < bestCost) then
9:        bestCost ⇐ cost + totalCost
10:       optimals.clear()
11:       optimals.addCopy(aux[])
12:     else if (depth = tests.length − 1) AND ((cost + totalCost) = bestCost) then
13:       optimals.addCopy(aux[])
14:     end if
15:     aux[depth] ⇐ null
16:     marked[i] ⇐ false
17:   end if
18: end for
At line 1, the algorithm iterates n times, where n is the size of the test suite. This loop, together with the list of boolean values, guarantees that each test appears exactly once in every generated sequence. Line 2 checks whether the test at the current index has already been used; if it has not, the test is used, otherwise the next iteration takes place. Line 3 marks the test at the current index as used and, at line 4, the test is added to the auxiliary list that represents the sequence being generated. The depth parameter keeps track of how many tests have been added to the sequence under construction; hence, it is the index where the next test should be placed. Line 5 calculates the cost of adding the test aux[depth] to the current sequence state (totalCost).
Lines 6 to 14 contain the comparisons that decide how the current sequence evolves. Line 6 verifies whether there are more tests to be added (depth < tests.length − 1) and whether the cost of the current sequence (cost + totalCost) does not exceed the best cost found so far. If the sequence exceeds the best cost found so far, its generation stops, so no sequence with the prefix generated up to this point is ever produced. Otherwise, the generation continues at line 7 with a recursive call that updates depth to depth + 1 and totalCost to cost + totalCost. Line 8 handles the case where the sequence reaches its complete size and its cost is better than the best cost found: the bestCost variable is updated, the result set with all the best permutations is cleared, and a copy of the sequence with the best result, held in the aux[] variable, is added to it (lines 9 to 11). When the sequence cost is equal to the best cost, a copy is added to the result set without clearing it (lines 12 and 13). Finally, lines 15 and 16 reset the test used at position depth and its boolean flag, so that the test can be used in further sequences.
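For reference, a direct Python transcription of Algorithm 4.1 is sketched below. The cost_between function is a placeholder for the stateless or stateful cost calculation described earlier, and the greedy bound is passed in as initial_best; apart from these assumptions, the control flow follows the pseudocode above.

def permute(tests, cost_between, initial_best):
    # tests: list of test identifiers; cost_between(prev, curr): preparation cost of
    # running curr right after prev (prev is None for the first test of a sequence);
    # initial_best: cost of a greedy sequence, used as the first pruning bound.
    n = len(tests)
    aux, marked = [None] * n, [False] * n
    state = {"best": initial_best, "optimals": []}

    def recurse(depth, total):
        for i in range(n):
            if marked[i]:
                continue
            marked[i] = True
            aux[depth] = tests[i]
            prev = aux[depth - 1] if depth > 0 else None
            cost = cost_between(prev, aux[depth])
            if depth < n - 1 and cost + total < state["best"]:
                recurse(depth + 1, total + cost)          # keep extending this prefix
            elif depth == n - 1 and cost + total < state["best"]:
                state["best"] = cost + total              # a new best sequence was found
                state["optimals"] = [aux.copy()]
            elif depth == n - 1 and cost + total == state["best"]:
                state["optimals"].append(aux.copy())      # another sequence with the best cost
            aux[depth] = None
            marked[i] = False

    recurse(0, 0)
    return state["best"], state["optimals"]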
Equivalence Class of Similar Test Cases: This optimization is used to reduce the number of elements given to the prioritization and, by doing so, the computational effort. The strategy tries to find equivalent tests: those that have equal input and output states and whose input state is equal to their own output state. Figure 4.5 illustrates an equivalence class with three test cases. All of them have as input state just one phone with 2 IM Accounts, and it is identical to their output state. Hence, they are considered to be in the same equivalence class. In this particular scenario, executed sequentially, these tests do not produce new outputs, as the outputs are equal to their inputs (the tests are neutral elements). Therefore, they can be represented by only one equivalent test in the permutation algorithm, thus decreasing the number of elements. When the algorithm over the reduced suite finishes, the equivalent tests can be executed in any order around, and close to, the representative.
Figure 4.5 Three equivalent test cases.
Figure 4.6 shows an example: 4.6(a) displays the complete suite, and 4.6(b) shows that tests 8, 6 and 4 are equivalent. One of them (in this case, test 4) represents the others, which are removed from the test set. The permutation is then run with only the 3 elements displayed in 4.6(c).
Figure 4.6 Reducing the test set with equivalence classes: (a) initial tests; (b) equivalent tests; (c) reduced test set.
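A possible implementation of this reduction is sketched below: tests whose input state equals their own output state are grouped by that state, and only one representative per group is kept for the permutation. The data structures are illustrative.

def reduce_by_equivalence(tests):
    # tests: dict id -> (input_state, output_state), with states as frozensets.
    # Returns (reduced_ids, groups), where groups maps each representative to the
    # equivalent tests it stands for.
    reduced, buckets, groups = [], {}, {}
    for tid, (inp, out) in tests.items():
        if inp == out:                          # neutral test: produces exactly what it consumes
            rep = buckets.setdefault(inp, tid)  # first test with this state becomes the representative
            groups.setdefault(rep, []).append(tid)
        else:
            reduced.append(tid)
    reduced.extend(buckets.values())            # one representative per equivalence class
    return reduced, groups

tests = {
    4: (frozenset({"2 IM Accounts"}), frozenset({"2 IM Accounts"})),
    6: (frozenset({"2 IM Accounts"}), frozenset({"2 IM Accounts"})),
    8: (frozenset({"2 IM Accounts"}), frozenset({"2 IM Accounts"})),
    1: (frozenset({"1 GIF"}), frozenset({"1 GIF", "1 Email"})),
}
print(reduce_by_equivalence(tests))
# ([1, 4], {4: [4, 6, 8]}): tests 6 and 8 are represented by test 4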
Transitive Closure: Another strategy that tries to reduce the load on the permutation algorithm by reducing the number of elements to be prioritized. A transitive closure can be defined as follows.
Definition 2. Let G = (V, E) be a graph, where V is the set of vertices and E is the set of edges. The transitive closure of G is a graph G+ = (V, E+) such that
∀v, w ∈ V. (v, w) ∈ E+ ⇐⇒ (v ⇒ w ≠ ∅)
that is, if from an element (vertex) v we can reach an element w (denoted by (v ⇒ w ≠ ∅)), then there is a relation (edge) between v and w in G+. The problem of finding these closures is widely used in reachability analysis; once solved, it results in a data structure that efficiently answers questions like "can I get to x from y?" and that partitions the elements into disjoint subsets according to the closure relation.
In our approach, we use the transitive closure to discover disjoint subsets of test cases, dividing the entire test suite into smaller independent groups. Each group does not affect the others in terms of data reuse. If there is a reuse between two tests, they must belong to the same group and, because of transitivity, they are related to all the elements already inserted in it; otherwise, they are put in different groups. For instance, given five elements a, b, c, d and e, if there are relations from a to b, from b to c, and from e to d, two sets are created: the first with a, b and c, and the second with d and e.
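The groups can be computed with a standard union-find (disjoint-set) structure, as sketched below; the five-element example from the text is reproduced. This is only one possible way of obtaining the closure groups.

def closure_groups(elements, relations):
    # elements: iterable of test ids; relations: pairs (u, v) meaning "there is a
    # possible reuse between u and v". Returns the disjoint groups.
    parent = {e: e for e in elements}

    def find(x):                        # find the representative, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in relations:              # union the two sides of every reuse relation
        parent[find(u)] = find(v)

    groups = {}
    for e in elements:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())

print(closure_groups("abcde", [("a", "b"), ("b", "c"), ("e", "d")]))
# [['a', 'b', 'c'], ['d', 'e']]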
As the disjoint sets have no possible reuse between them, there is no need to run the permutation algorithm with groups containing elements from different sets. Hence, we break the whole test set into subsets, execute the permutation on each subset separately, and return the optimal sequence of each subset. Notice that these sequences can be executed in any order, as each group does not affect the others. Figure 4.7 illustrates the execution of the transitive closure strategy. Tests 1, 4, 7 and 5, 6, 8 belong to different closures. They are prioritized by the permutation algorithm separately and the best ordering within each group is returned. They can then be executed sequentially: 1, 4 and 7 from the first set can be followed by 6, 8 and 5 from the second set (or vice-versa: 6, 8 and 5 come before 1, 4 and 7).
Figure 4.7 The transitive closure mechanism.
The more groups are found inside a test set, the faster the prioritization algorithm executes, as the exponential complexity is split into smaller problems. This allows us either to reduce the algorithm's execution time or to run it with larger sets of elements, minimizing the limitations presented in Table 4.1.
The main difference between the transitive closure and the equivalence class optimizations is that the elements of a group created by the former do not need to be similar: they are grouped according to the reuse relation among them. In the latter technique, the elements of a group have a distinct characteristic, namely the equality of inputs and outputs among themselves, which allows them to be represented by only one member. Another difference is that all the elements in a transitive closure group must be prioritized, while only one element of an equivalence class needs to be.
User-defined Clusters: Another way to group test cases to improve the efficiency of the prioritization algorithms is to create clusters. In our context, a cluster is an aggregation of test cases, defined by the tester, that must be executed together along a sequence and that can be prioritized with the other tests as if it were a single test case. We can see an example of a cluster in Figure 4.8, where the tests ID135, ID148 and ID270 must appear together in the prioritized sequences. Clusters become very useful when there are factors besides data reuse that have a major influence on the test suite being prioritized. In such cases, the tester may choose some tests that must stay together along a sequence due to those factors.
Figure 4.8 Example of a cluster.
For example, imagine that a test engineer knows the set of test cases beforehand. He knows that a certain subset has a more complicated execution than the others and prefers to execute these test cases together, so he groups them in a cluster. These tests will then always be prioritized as a single test case. However, it is important to emphasize that clusters are different from the groups of the transitive closure: while the test cases that are part of a cluster are defined by the user, the groups of the transitive closure are formed by a pre-processing step performed on the test cases before the prioritization.
When a cluster is created, the first step is to prioritize the test cases inside it. To do this, we use the Greedy Algorithm described in Chapter 6, which is fast and produces acceptable results for our approach. After this, the cluster becomes a representative test case whose input state is the input state of the first test case in the order generated by the greedy algorithm, and whose output state is the output state accumulated from all the test cases, i.e., the phone's output state after the execution of the last test case. After this step, the cluster can be prioritized together with the other test cases, as seen in Figure 4.9.
Nevertheless, there is no guarantee of finding an optimal solution when this feature is used, due to the greedy heuristic applied to the tests inside the cluster.
Figure 4.9 Cluster as a representative test case.
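One way of collapsing a cluster into a single representative test case, following the description above, is sketched here. The greedy ordering is abstracted as a parameter, and the accumulated output state is approximated by the union of the outputs (ignoring deletions), which is a simplification of ours.

def cluster_as_test(cluster, greedy_order):
    # cluster: dict id -> (input_state, output_state) with states as sets;
    # greedy_order: callable returning the cluster ids ordered by the greedy heuristic
    # of Chapter 6. Returns (ordered_ids, representative_input, representative_output).
    order = greedy_order(cluster)
    first_input = cluster[order[0]][0]     # input state of the first test in the greedy order
    accumulated = set()
    for tid in order:
        accumulated |= cluster[tid][1]     # the phone keeps accumulating configured data
    return order, first_input, accumulated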
CHAPTER 5
EMPIRICAL STUDIES
Experimentation is one of the most used tools to test ideas and validate approaches. Generally, it takes advantage of solid statistical tools to support inferences over the evidence found in experiment results, giving more confidence to the assumptions. To assess whether our technique reduces the test execution time, we took advantage of such mechanisms and executed some studies to validate our approach.
Some experiments were conducted to validate the permutation technique presented in Chapter 4. A software prototype was implemented to mechanize the permutation (details about this tool are explained in Chapter 7). All empirical studies were performed in collaboration with the Motorola BTC program, which provided us with the infrastructure and human resources to execute our experiments.
We worked together with a specific test team inside the BTC. This team was chosen because of its innovative approach for improving its testing process: they were already using heuristics to generate test sequences manually [RBJ07], which had already brought some gains to their performance. This team allowed us to compare their current manual technique with our proposal. Two studies were carried out. The first involves the collection of input data generation times (the period needed to set each input data item on the phones) and a comparative case study using its results; we refer to it as ES1-CS (Case Study). The second is a controlled experiment used to validate our approach in comparison to the manual one with more accuracy, by using more statistical tools and by measuring the total execution time; it is referred to as ES2-CE (Controlled Experiment).
Another difference between the two studies is that ES1-CS does the calculations considering only the time spent in the preparation of the phones (the Initial State test phase), which we call configuration time. The total execution time is the sum of the configuration time and the procedure execution time. As the techniques aim to discover reuses to be made only during the Initial State phase of test execution, we assume that the major influence on the time saved by a sequence comes from this stage. ES2-CE not only makes a more accurate comparison between the techniques, but also shows how reliable the assumption made in ES1-CS is, by verifying the influence of the configuration time on the total execution time.
5.1 EMPIRICAL STUDY 1 - CASE STUDY (ES1-CS)
Although this study is simpler, it was very important because it gave us precise data to run our technique and allowed a preliminary comparison with the manual one. It retrieved the times spent generating the input data needed on the mobile phones during test case execution. This information is the input to our technique, as we calculate the best sequences according to the maximum reuse and the input data generation times. For example, a test that needs one video file as input state can be more costly than another test case that needs three e-mail messages, because a single video may take more time to generate than three e-mails. For that reason, the consistency of these times is important during prioritization. The acquisition of this information may not require an experiment every time a prioritization takes place: it could come from the testers' expertise or from historical execution data. It is important to realize that, even when coming from other sources, this information should reflect reality as closely as possible; otherwise the prioritization result may not be precise either.
Moreover, we used these times to calculate the cost of the manual technique as well.
As we have the sequence generated by this technique together with the input and output
states needed for each test, we can discover the configured and reused data, and the cost
associated with such sequence. Therefore, with the results from both approaches we can
compare their performances.
5.1.1 Experiment Planning
This experiment is justified by the need for reliable data to be used in the prioritization techniques, which are compared later. It was executed in two phases: a data collection and the test case prioritization, together with comparisons based on the collected information. The data collection consisted of generating different kinds of input data several times, to give us the average effort spent on each configuration. Such information was used to prioritize test cases and to compare, theoretically, the two techniques: the automatic permutation and the manual prioritization.
5.1.1.1 Goals The following goals were identified in the first study:
G1: Identify the configuration times used in a particular test suite.
G2: Identify which technique produces a sequence of tests that can be executed in less
time and effort considering solely the data generation time.
5.1.1.2 Participants For the data collection, two testers were used, each with a different level of experience. Both are from the team mentioned earlier: one is an expert tester while the other is a trainee with restricted knowledge of testing mobile phones. These two different levels were chosen to try to dilute the influence of the experience factor on the experiment results.
5.1.1.3 Experimental Material The objects used in the experiment are a test suite,
a data survey of the test suite, the mobile phones used to execute the data generations
and the tools involved in the tasks.
Test Suite. The test suites are files which contain the steps to be executed on the mobile phone and the expected behavior. They can be split into two parts: the initial condition, which states all the information about the input state needed for the test execution, and the test procedure, which contains all the steps the tester should perform during the test.
The test suite used for this experiment tests a specific component of the mobile phones. It was chosen for two reasons: first, it has 12 test cases, which fits well within the permutation size restrictions; second, its data survey (see below) was available, as the team had already used this suite with the manual technique. We only spent some time validating it and correcting inconsistencies.
Data survey. Before executing the experiment, we had to know which input data we should use: for example, videos, e-mails, sound files or image files. Both techniques, the manual and the permutation, need a data survey of the test cases to be prioritized. As the objects of our study are test cases written in English, such a task is not mechanizable. Hence, a person reads the test cases and identifies the input and output state of each one. This task can, however, be automated in some cases (one example is described later in Chapter 8). As explained before, the specific test suite chosen already had its data survey done by the testers; we only checked its conformance and corrected it when necessary. The data survey showed a total of 25 input data generations needed in this experiment (see Table 5.2).
The mobile phones. The phones used by both testers had the same hardware and software configuration, thus avoiding discrepancies in the results related to the equipment used by the two participants.
Tools. Two tools were involved in this experiment. The times were gathered using the ManualTest tool [AB08a], which is used during the input data generation to collect the time spent on it: the testers indicate to the tool which activity is being performed while the tool records the time spent on it. At the end of the experiment, the tool generates logs with the times of each activity.
To prioritize the test suites, the tool explained in Chapter 7 was developed. It prioritizes a set of tests according to the possible reuses, using one of the implemented techniques. We used the permutation approach, which traverses all possible sequences and returns the best ones, because we wanted to verify whether the automatic prioritization based on reuse is better than the manual prioritization, which is mostly based on reuse and on the testers' experience.
5.1.1.4 Tasks For each tester involved in the data collection, the following tasks were performed:
1. Listen to the explanation about activities to be performed.
2. Familiarize themselves with the ManualTest tool.
3. Receive the list containing the input data to be prepared.
4. Check if all the equipment needed to perform the activities is available.
5. Start the execution of the activities using the ManualTest.
6. When all the tasks are finished, inform the tool of the end of the experiment.
7. Recover the logs containing the time taken to generate the input data.
Data  List 1                          List 2
1     Insert 1 DRM SD jpg activated   IM login (Power On)
2     Insert 1 Video                  Alert for IM
3     Alert for IM                    Insert 1 DRM SD jpg activated
4     IM login (Power On)             Insert 1 Video
Table 5.1 An example of 2 lists with random orders of execution, each containing 4 input data generations.
5.1.1.5 Experiment Design For each tester we created randomly generated lists with different orders of execution of the input data generations, where each list has all the elements discovered in the survey (see an example with 4 elements in Table 5.1). After the execution, the average time of each data generation is calculated per tester. Finally, we take the overall average of the two testers' averages; this is the average time spent on each configuration. These data are used to calculate the time spent on input data generation by the prioritization technique.
5.1.2 Results and Analysis
The expert tester was able to complete 5 execution sequences, while the beginner tester finished only 3. They had only 2 hours to execute as many lists as possible. Their averages were calculated to generate the final average for each data generation. These results are shown in Table 5.2. The times presented in the Average column were used to perform a case study on the test suite involved, which is explained next.
Data  Description                     Expert    Beginner  Average
1     Insert 1 DRM SD jpg activated   00:01:03  00:00:50  00:00:57
2     Delete 1 DRM SD jpg activated   00:00:35  00:00:30  00:00:33
3     Insert 1 IM contact             00:00:59  00:00:51  00:00:55
4     Delete 1 IM contact             00:00:32  00:00:33  00:00:33
5     Automatic Lock (On)             00:00:25  00:00:24  00:00:25
6     Insert 1 Music file             00:00:17  00:00:27  00:00:22
7     Delete 1 Music file             00:00:26  00:00:18  00:00:22
8     Insert 1 Phonebook contact      00:00:36  00:00:22  00:00:29
9     Delete 1 Phonebook contact      00:00:17  00:00:13  00:00:15
10    Alert for IM                    00:00:21  00:00:17  00:00:19
11    IM login (Power On)             00:00:15  00:00:14  00:00:14
12    Insert 1 SIM contact            00:00:21  00:00:26  00:00:24
13    Delete 1 SIM contact            00:00:16  00:00:11  00:00:14
14    Insert 1 GIF                    00:00:30  00:00:18  00:00:24
15    Delete 1 GIF                    00:00:22  00:00:24  00:00:23
16    IM login (Auto)                 00:00:17  00:00:17  00:00:17
17    Alert (On)                      00:00:19  00:00:22  00:00:21
18    Insert 1 Video                  00:00:30  00:00:26  00:00:28
19    Delete 1 Video                  00:00:19  00:00:25  00:00:22
20    Insert 1 JPG                    00:00:16  00:00:12  00:00:14
21    Delete 1 JPG                    00:00:18  00:00:18  00:00:18
22    Insert 1 DRM FWL jpg            00:01:20  00:00:33  00:00:56
23    Delete 1 DRM FWL jpg            00:00:21  00:00:26  00:00:23
24    Insert 1 DRM CD jpg             00:00:57  00:00:39  00:00:48
25    Delete 1 DRM CD jpg             00:00:33  00:00:32  00:00:32
Table 5.2 Execution times of the configuration data collected in ES1-CS (hh:mm:ss).
5.1.2.1 Case Study Given the average results of the input data preparation, we could run our permutation technique and compare the cost of the generated sequences with the cost of the sequence created using the manual technique. The process to calculate the costs of the sequences is the same as described in Chapter 4. For both techniques, the costs were calculated
taking into consideration the stateless and stateful approaches.
Generation Type        Generation Time  # best results  Configuration Time
Manual (Stateless)     Around 2 weeks   1               641
Manual (Stateful)      Around 2 weeks   1               617
Automatic (Stateless)  53 sec           96              455
Automatic (Stateful)   7 min 21 sec     45,360          431
Table 5.3 Comparison among approaches.
The comparison between the prioritization techniques is shown in Table 5.3. In the
first two rows we see that the manual methodology takes around two weeks to generate
the sequences. This happens because the testers devote a small period of time per day
to work on the sequence generation. So, the whole process can drag on for two weeks.
The manual approach is based on the testers’ experience, which always returns a single
sequence, and which is not guaranteed to be the best one.
The last two rows show the outcome of our tool. Not surprisingly, its generation time is far better than that of the manual generation: the automatic prioritization generates sequences very quickly. However, this time may be larger when the information has not yet been inserted in the prioritization tool; in those cases, the generation time must be increased by the time a tester takes to register such information. This time depends on the test suite size, but for a usual test suite it should take no longer than a few hours. Therefore, even in such cases, the automatic technique is still the more reasonable solution.
Notice that depending on the amount of reuses, the number of best sequences can
be very large (several sequences might have the same lowest cost). The fewer the reuses
among the tests, the larger the set of best sequences. This happens because the number
of possibilities is huge. For example, for 12 tests, there are 479,001,600 sequences, so
there is a high probability that more than one best sequence exists. In the case of the stateful permutation, this number is even larger, because a particular test which does not bring any reuse at a specific position in the sequence may well be placed in several subsequent positions with the same cost (provided that the subsequent tests do not delete configurations). This increases the number of possible best sequences in comparison to the stateless permutation.
The automatic technique, in both the stateless and the stateful approaches, improved the configuration execution time by around 30% in comparison to the manual technique (i.e., the time spent executing the configurations of the sequences generated by the tool is 30% less than the time spent executing the configurations of the sequence generated manually). This is not surprising, as the tool verifies all possible sequences and therefore guarantees the best results. Also, a sequence created manually has a high probability of not being the best one, because some groups of reuses can be difficult for humans to visualize, especially considering the amount of possible permutations. Hence, providing an automatic generator is doubly advantageous: it produces sequences in less time and the generated sequences can be executed faster.
However, as said previously, this study is based on the precondition that the configuration time is the main factor responsible for the reduction in test suite execution time, as the procedure execution time must be constant regardless of the amount of reuse performed. To strengthen this assumption, we realized we needed to perform another experiment, in which we asked testers to execute the test sequences generated by the two techniques, this time including both the data configuration and the test procedures. The results achieved up to this point were mainly theoretical, as the testers did not execute the tests; they executed only the input data generation. The details of this experiment are described next.
5.2 EMPIRICAL STUDY 2 - CONTROLLED EXPERIMENT (ES2-CE)
The case study showed promising results in the comparison between our technique and the testers' technique. Therefore, we ran a more elaborate empirical study which could provide more confidence regarding our premises. From this point of view, we designed a controlled experiment, described in [LAIS09], to verify whether the total time spent in a test suite execution is influenced mainly by the configuration time rather than by the procedure time.
As this is a formal experiment, it is described with a higher level of detail, based on a variety of works available in the literature which provided us with guidelines on how to design, analyze and report experiments in software engineering [Pfl94, Pfl95a, Pfl95d, Pfl95b, Pfl95c, JP05, JCP08].
5.2.1 Experimental Planning
We designed and ran a controlled experiment to evaluate the benefits provided by the
manual and the automatic test prioritization techniques. The plan and design are described in what follows.
5.2.1.1 Goals
The research objective of this experiment is to:
Compare test case prioritization techniques and analyse their impact on the total configuration, procedure and execution times from the point of view of test
managers in the context of manual execution of black-box tests of mobile phone
applications.
To achieve this objective, we define the following goal for this study (following the
Goal/Question/Metric method [BCR94]):
G: Identify which investigated technique provides test sequences with smaller execution
times.
The assessment of this goal is performed by answering the following research questions:
Q: When using the automatic technique instead of the manual one, what is the impact
on the total
a. configuration time?
b. procedure time?
c. execution time?
Our research hypothesis is that the automatic prioritization technique provides a
larger reduction of the total configuration time when compared to the manual one, but
both techniques have equal impact on the total procedure time.
This is understandable as data reuse only affects the configuration time. For example,
if a test reuses a GIF file from a previous test, the time saved for not inserting that GIF
affects the configuration time. By definition, the execution time may be affected to some
extent as well. However, the procedure time should not be affected as the test procedure
simply assumes that the input is already available, i.e. it assumes the existence of a GIF
file in the phone.
M: To answer questions Q.a to Q.c, we use the following metrics:
– ConfDiff: the difference between the mean total configuration time of the
automatic and the manual techniques
– ProcDiff: the difference between the mean total procedure time of the automatic and the manual techniques
– ExecDiff: the difference between the mean total execution time of the automatic and the manual techniques
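In other words, each metric is a difference of mean totals over the observed sequences of each technique; for instance, assuming one observation per tester and technique:

    ConfDiff = mean(ConfTime_A) − mean(ConfTime_M)

ProcDiff and ExecDiff are defined analogously over ProcTime and ExecTime.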
5.2.1.2 Participants
A group of 8 testers was allocated to the experiment. All testers belonged to the BTC
program; 4 of them were experienced testers and the remaining 4 were beginners. The
allocation of the testers was decided by their manager, according to their availability.
5.2.1.3 Experimental Material
The objects used in the experiment are 2 test suites, the mobile phones used to execute
the tests, the tools involved in the tasks, and the documents given to the testers as part
of the experiment procedure.
The Test Suites. For this experiment 2 test suites are used: test suite I and test
suite II. The former contains 6 test cases, and the latter contains 8 test cases. These
suites were chosen according to the following criteria:
• Their sizes are less than 16, so that the permutation technique could run in feasible time.
• The input/output data survey is available. The data survey is the description of the
inputs and outputs of each test in a machine-readable format. Recall that the tests
are written in plain English, so this information has to be extracted manually.
• The number of tests to be executed should not be larger than 15, in order to allow
the testers to carry out the experiment within 1 working day.
See Appendix A.1 for a detailed description of the test suites I and II.
The mobile phones. Phones used in this experiment can be of two types: a Phone
Under Test (PUT) or an auxiliary phone. All testers used the same PUTs throughout
the experiment, all with the same hardware and software. On the other hand, auxiliary
phones can be of any kind, as they are not under test and are only needed for basic
functions.
Tools. The same tools used in the first study were used here. To prioritize the test
suites under the automatic technique we used the tool described in Chapter 7, and we
used ManualTEST [AB08a] to accurately record the time taken by the testers to execute
the configuration and the procedure activities.
Documents. Two documents were given to the testers during the experiment: the
reuse guideline and the participant instructions. The reuse guideline helps the testers
to perceive the data reuse that they should perform along the sequence. It contains a
table where the rows are the tests in the order of execution, and the columns have the
information about the input and output states of each test. Data reuse is marked through
these states. The participant instructions provide detailed explanations about each step
of the experiment together with a few recommendations. The testers read the instructions
just before the execution of the tests.
5.2.1.4 Tasks
The tasks to be performed by the participants during the experiment are:
1. Listen to the explanation about the experiment.
2. Receive the training on the ManualTEST tool.
3. Use ManualTEST in order to get familiarised with it.
4. Read the participant instructions document to fully understand the steps to be
followed during the experiment.
5. For each test sequence, the testers should perform the following steps using the
ManualTEST:
(a) Setup the phones: the tester executes the setup activity.
(b) Execute the test sequences: the tester starts to execute the tests using the
reuse guideline as reference. For each test, the following activities should be
performed:
i. Look for the inputs required in the reuse guideline.
ii. Execute the input activity.
iii. Execute the procedure activity.
6. When the execution of both test sequences is finished, the logs generated by
ManualTEST are collected.
5.2.1.5 Hypotheses and variables
In this section, we present the statistical hypotheses that we want to test and the variables
of our experiment. We define the following null statistical hypotheses H0i and their
alternative hypotheses H1i, where i is the index of the related research question (Q.i).
For instance, the null hypothesis H0a says that there is no statistical difference in the
total configuration time when using either the manual or the automatic technique:
H0a: μConfTime_A = μConfTime_M
H1a: μConfTime_A ≠ μConfTime_M
H0b: μProcTime_A = μProcTime_M
H1b: μProcTime_A ≠ μProcTime_M
H0c: μExecTime_A = μExecTime_M
H1c: μExecTime_A ≠ μExecTime_M
We also identify the dependent and independent variables of our experiment, as presented in Table 5.4. In what follows, we describe each of the independent variables.
Name         Description                                          Levels or Range
Tester       Participant that performed the test execution.       1..8
Test suite   Set of tests executed by the tester.                 I, II
Technique    Test case prioritization technique.                  M, A
ConfTime_t   Total configuration time of a test sequence when     [0, ∞]
             using the test prioritization technique t.
ProcTime_t   Total procedure time of a test sequence when using   [0, ∞]
             the test prioritization technique t.
ExecTime_t   Total execution time of a test sequence when using   [0, ∞]
             the test prioritization technique t.

Table 5.4 Dependent and independent variables of the experiment.
Tester: this variable represents the employees that execute the tests and their experience
in performing tests of mobile phones. There are 2 groups, the experienced testers and
the beginners, each with four participants.
Test suite: to keep the experiment feasible, only 2 test suites were chosen: test suite I
with 6 tests, and test suite II with 8 tests.
Technique: this variable represents the prioritization techniques involved in the experiment,
which are the treatments to be applied to the problem. The automatic permutation
technique is called A, and the manual technique is called M.
Other factors related to the environment (like computers, tools, etc.) are controlled
to minimize their impact on the test execution time.
5.2.2 Experiment Design
In this experiment, we need to control the effect of the testers’ experience and the effect
of the size and complexity of the test suites. The effect of these two sources of variation
can be high on the configuration and procedure times. Also, there may exist a learning
effect that should be controlled when people repeat the same tasks.
We selected the Latin square design to run the experiment [BHHH78, Kue99], since it is
useful when the experimenter wants to block two different sources of variation and to
reduce the learning effect. Besides that, Latin square designs allow experiments with a
relatively small number of runs in comparison to a complete randomized blocking design.
This is one of the main motivations for using this specific design in our experiment,
as we have a limited number of participants available.
In this design, the number of rows, columns, and treatments must be equal to allow
the construction of a square matrix. In our case, a 2x2 matrix is used (see Figure 5.1).
Once the rows (testers) and columns (test suites) are fixed, the treatments (test case
prioritization techniques) are assigned to them at random, respecting the property that
each treatment occurs exactly once per row and once per column. This property is related
to the reduction of the learning effect, as each tester applies different prioritization
techniques for different test suites.
The participants are randomly selected in groups of two persons with the same experience
level. As there are 4 experienced testers and 4 beginners, we have 2x2 Latin squares
replicated four times.

             Suite I    Suite II
  Tester X   M          A
  Tester Y   A          M

Figure 5.1 An example of a Latin square used in our experiment.

During the experiment, each tester receives the two suites
to be executed, each one prioritized according to a different technique. In each latin
square, the test suites and the techniques alternate between the two testers. Figure 5.1
shows an example. Given a replicate with testers X and Y, X executes the suite I (ordered by the manual technique) and the suite II (ordered by the automatic technique).
Y executes the suite I (ordered by the automatic technique) and the suite II (ordered
by the manual technique). Note that we need to randomize only the first allocation of
technique/tester/suite in a 2x2 latin square, since the remaining allocations follow from
it.
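To make the allocation concrete, the sketch below (in Java, the language of the tool; the tester names and the randomization routine are ours, purely illustrative) builds the four replicated 2x2 squares by randomizing only the first cell of each square:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    // Illustrative sketch only: allocates prioritization techniques to testers and
    // suites in four replicated 2x2 Latin squares, randomizing just the first cell.
    public class LatinSquareAllocation {
        enum Technique { M, A }

        public static void main(String[] args) {
            // two testers per replicate, grouped by experience level (names are placeholders)
            List<String> testers = Arrays.asList("E1", "E2", "E3", "E4", "B1", "B2", "B3", "B4");
            Random random = new Random();
            for (int rep = 0; rep < 4; rep++) {
                String testerX = testers.get(2 * rep);
                String testerY = testers.get(2 * rep + 1);
                Technique first = random.nextBoolean() ? Technique.M : Technique.A;
                Technique other = (first == Technique.M) ? Technique.A : Technique.M;
                // Latin square property: each technique once per row and once per column
                System.out.printf("Replicate %d:%n", rep + 1);
                System.out.printf("  %s: suite I -> %s, suite II -> %s%n", testerX, first, other);
                System.out.printf("  %s: suite I -> %s, suite II -> %s%n", testerY, other, first);
            }
        }
    }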
5.2.2.1 Procedure
This section details the procedure used to run our experiment.
Each tester executes her tasks in her own working environment. This makes the experiment close to the testers’ reality. The executions are scheduled according to the testers’
availability. We allocate two testers per day, both belonging to the same Latin square.
With this constraint, we ensure that each replicate is executed in similar environment
conditions. Also, the effect of different conditions occurred in different days are considered
in the statistical analysis through the effect of each replicate.
We inform the participants that they should execute the tests as they normally do and
that their execution times will remain private, not being used for personal evaluation.
This information is important so that they do not feel pressured to execute the tests
either faster or slower than they usually do.
The testers should verify that all needed resources are available on their desks (phones,
cables, software, etc.) before the execution of the test suite. They are advised to avoid
distractions and unnecessary interruptions during the test execution. When the test
execution begins, the tester should indicate in ManualTEST which test is running
and the type of the current activity (configuration or procedure). For any interruption
during the test execution, the tester should pause ManualTEST as well. When the
activity restarts, the tool should be resumed.
When the test execution ends, we recover from the tester’s machine the logs containing
the times related to the configuration and the procedure activities. These logs are used
to perform the analysis described in the next section.
5.2.3 Execution
This section presents the preparation procedures for our experiment and the deviations
that occurred during its execution.
5.2.3.1 Preparations
To build the four 2x2 Latin square replicates, we fixed the
columns with the test suites I and II. Each tester was classified by their manager as
expert or beginner. We allocated the testers to the rows in such a way that a Latin square
replicate contains testers with the same level of experience. Finally, we randomized the
allocation of prioritization techniques in each replicate.
5.2.3.2 Deviations
Two deviations occurred during the execution of our experiment.
First, one of the expert testers had to go on holiday before executing her tasks. Hence,
she was replaced by another tester with the same experience level.
The second deviation was a problem related to the build installed on the phones
in one of the executions. Two testers executed the tests with phones whose build was
different from the previous executions. This build caused a different behaviour of the
phones during the tests of the suite I. Consequently, we discarded the execution for this
suite. Another execution for this particular component was rescheduled to the following
week, potentially reducing the learning effect.
5.2.4 Analysis
The data collected during the experiment is presented in Table 5.5. Each pair of values
is the configuration time / procedure time observed for a combination of test suite (I or II),
tester (1 to 8) and prioritization technique (A or M). We also present the average time
per tester, per test suite, per replicate and the grand average.
The statistical analysis performed in this study is presented in the next sections. In
all of them, we validated the assumptions of the Analysis of Variance (ANOVA): errors
are normally and independently distributed with mean zero and homogeneous variances
among treatments. We also validated an additional assumption of the statistical model
(see Equation 5.1) used in our Latin square design using Tukey's Test for Nonadditivity
[SC89]: the effects of the row, column and treatment terms are additive (no significant
interaction among testers, test suites and prioritization techniques [BHHH78]).

Table 5.5 Results from the Latin squares with respect to the total configuration time and the
total procedure time.
y_ijkl = μ + Rep_i + Tester_j[Rep_i] + TestSuite_k + Technique_l + Error_ijkl    (5.1)
The variables which appear in Equation 5.1 are:
• μ: The overall mean.
• Rep_i: Represents the ith Latin square replicate of the experiment.
• Tester_j[Rep_i]: A nested variable that represents the jth row (tester) in the ith
replicate. The identification of the tester in the statistical model requires not only
the identification of the row of the Latin square (1 or 2), but also the determination
of the Latin square replicate (1, 2, 3 or 4).
• TestSuite_k: Represents the kth test suite used in the experiment.
• Technique_l: Represents the lth technique under comparison (automatic and manual).
• Error_ijkl: The error term.
• y_ijkl: The observation (configuration or procedure time) for the lth treatment that
is applied in the jth row (tester) and kth column (test suite) of the ith Latin square
replicate.
5.2.4.1 Total configuration time
The results of the statistical analysis with respect to the total configuration time are
presented in Figure 5.2. The column Prob > F indicates the probability of being wrong
when rejecting the null hypothesis that a given factor has no significant effect on the
response variable (time). As we can see, the ANOVA detected a significant effect of the
investigated factor Technique, with p-value = 0.0069 at α = 0.05. Therefore, the null
hypothesis H0a is rejected.
Figure 5.2 Results from the ANOVA with respect to the total configuration time.
In order to identify which technique provided the larger configuration time reduction,
Figure 5.2 also shows a comparative analysis of the total configuration time when using
the automatic (AUTO) and the manual (MAN) techniques. The comparative analysis
presents the mean time (in seconds) of each technique and statistical information about
their difference. For instance, the ConfDiff metric, the mean difference between the
automatic technique and the manual one, was −216.125 seconds (approximately 3.6
minutes), with an upper confidence limit of −84.862 seconds and a lower confidence
limit of −347.388 seconds at α = 0.05.
5.2.4.2 Total procedure time
Figure 5.3 shows the results of the ANOVA with
respect to the total procedure time. In this analysis, the ANOVA did not detect a
significant effect for factor Technique (p−value = 0.2150) at α = 0.05. Hence, we cannot
reject the null hypothesis H0b . This result can be better visualized by observing the
confidence interval for the mean difference between the automatic technique and the
manual one. Although the estimated mean for this difference (ProcDiff) was −218.25
seconds, its confidence interval includes the value 0 (zero).
Figure 5.3 Results from the ANOVA with respect to the total procedure time.
5.2.4.3 Total execution time
Figure 5.4 shows the results of the ANOVA with
respect to the total execution time. As the ANOVA detected a significant effect for
the factor Technique (p−value = 0.0465) with α = 0.05, we reject the null hypothesis
H0c. Therefore, there is a difference between using the automatic and the manual technique.
The mean difference between the automatic technique and the manual one was −433.125
seconds (approximately 7.2 minutes). This metric, defined earlier as ExecDiff, indicates
that the automatic technique provided sequences executed in less time than the manual
one.
5.2.5 Interpretation
This section discusses the data shown in the analysis and addresses the threats to validity.
5.2.5.1 Evaluation of results and implications
The analysis of the total configuration time shows that the techniques significantly affect
the time taken by the configuration activity. This result rejects the null hypothesis H0a
and supports the alternative
hypothesis H1a . The configuration time is strongly affected by the data reuse performed
Figure 5.4 Results from the ANOVA with respect to the total execution time.
over the test suite execution. On average, the use of the automatic prioritization technique
in our experiment reduced the total configuration time by 25%.
Regarding the total procedure time, the null hypothesis H0b cannot be rejected and,
therefore, we found no evidence that the techniques affect the total procedure time. This
was expected, as the time spent on the procedure activity is not affected by the data
reuse present along a sequence. The procedure activity is executed under the assumption
that the configuration activity (and all data reuse) has already been done.
Finally, the most encouraging result came from the total execution time. The total
configuration time is relevant enough to affect the total execution time. We were able to
reject the null hypothesis H0c with a p-value of 0.0465 at α = 0.05, indicating that the
prioritization techniques affect the total execution time. This is understandable as the
data reuse happens during the configuration activity, which is part of the execution
activity. On average, the automatic technique reduced the total execution time by 13.5%
in our experiment.
These results confirm that sequences generated by the tool with the automatic technique
are executed in less time than those created by testers with the manual methodology.
This is understandable, as some reuse opportunities captured by the software are difficult
for humans to visualize. Another important aspect shown by this experiment concerns
the reliability of the data reuse concept: the data reuse reflected in the total configuration
time is representative enough of the total execution time.
5.2.5.2 Threats to Validity
In this section, we present some of the potential threats to the validity of our experiment.
Internal validity: The results gathered from this experiment could have been affected by
the following factors: 1) Defects in the prioritization and the measuring tools. This threat
has been controlled by testing the code of the tools and validating their outputs before
the start of the experiment. 2) As we ran each Latin square on a different day, the testers'
motivation may differ depending on their workload that day. In order to minimize this
threat, we motivated the testers by showing them the importance of this study to their
work process. 3) Network problems during the test executions. Some testers faced a
network problem while executing their tests: the server of the mobile phone provider
was taking too long to reply to the phones' requests. To control this problem, whenever
a connection to the server was too slow, the testers stopped the time counting during
the server's downtime.
Construct validity: We use data reuse to reduce the time spent on test execution. As
time is our metric for prioritization effectiveness, its measurement must be well controlled.
In order to avoid a misuse of the ManualTEST tool, two countermeasures were taken.
First, before the start of the experiment, we taught the testers how to use the tool. The
participants had to try it a few times until they felt confident. Second, we watched the
testers during the experiment in order to avoid an inappropriate use of the tool. We also
needed a monitor to check whether the testers recorded the time taken on each activity
correctly. The presence of a monitor could affect the behaviour of the testers during the
sessions, despite the fact that they were previously told to ignore the presence of the
monitor.
External validity: The generalisation of our results could be threatened by the following
factors. 1) Object representativeness. We chose our test suites mostly based on size. The
prioritization techniques might have a different effect on test suites of sizes different from
those of suites I and II. Moreover, suites I and II might not be general enough to represent
all kinds of test suites. For instance, test suites with no data reuse do not benefit from
the automatic technique. 2) Few testers were available. Therefore, we chose them mostly
by availability instead of performing a probabilistic sampling. 3) Testing process
representativeness. The testing process used in the company may somehow affect the
benefits of each test prioritization technique. We control this threat
by running the experiment using the BTC testing process, which uses most of the good
practices in software testing.
Conclusion validity: 2 x 2 Latin squares have few degrees of freedom for the error term,
limiting the use and the precision of statistical analyses (ANOVA, etc.). We controlled
this threat by replicating four times the Latin square in our experiment, which increases
the degrees of freedom of the error term.
5.3 CONCLUDING REMARKS
Both empirical studies show relevant results for the assessment of the permutation technique
as a better alternative to the manual technique. The first study provides reliable input
information for our technique through the times collected during the data generation. It
also presents a preliminary result, in the case study, showing that the sequences generated
by the permutation technique are executed in 30% less time than those generated by the
manual technique with respect to the configuration time. The second experiment is more
formal and complete than the first: it considers not only the configuration time, but the
total execution time, which is the sum of the configuration time and the procedure time.
Two assessments are detailed in this experiment: the influence of the configuration
time on the total execution time, and another comparison between the manual and
permutation techniques. Both results are encouraging, as the time saved in the configuration
phase proved to be relevant to the total execution time with statistical significance, and
the sequences from the permutation technique were executed in 13.5% less time than
those from the manual technique.
As we have already shown significant evidence of the improvement obtained by using
the permutation technique instead of the manual technique, we studied and developed
alternatives for cases where the test suite is large. In the next chapter we detail heuristics
developed to allow automatic test case prioritization based on the data reuse concept for
such cases.
CHAPTER 6
PRIORITIZATION HEURISTICS
As seen in Chapter 3, all initiatives toward test case prioritization propose different
kinds of heuristics for a variety of goals, few of them addressing black-box testing. As
already mentioned, our objective is to reduce the test suite execution time. To solve this
problem we used a metric based on data reuse. Chapter 5 provides reliable evidence of
the relevance of this metric, supported by experiments, and shows the good results
achieved with a permutation solution compared to a manual prioritization heuristic.
However, permutation is infeasible for large suites.
To solve this problem we decided to implement some heuristics to help testers find good
sequences of tests to be run, reusing data along the executed suite and therefore reducing
its execution time. Heuristics are needed when there is no practical way to find the
optimal result, although their use does not always guarantee the discovery of the optimal
solution. As the number of elements increases, the state space to be checked becomes
exponentially larger; hence, determining whether the solution found through a heuristic
is the best one is intractable in those cases. However, a good heuristic with a reliable
heuristic function generally brings results as close as possible to the optimal, apart from
some specific advantages depending on the heuristic, such as lower memory usage or
faster processing.
As we already have a cost function, detailed in Chapter 4, we used it together with
some artificial intelligence concepts and produced three heuristics. The first uses a greedy
algorithm, as it takes the locally optimal next choice regarding the reuse among a set of
tests. The others use concepts from global optimization: a simulated annealing approach
and a genetic algorithm solution. These three approaches are detailed next.
6.1 GREEDY
The term greedy comes from the fact that the decision over the construction of the
solution is made based on the best next option: the partially built solution is extended
with the next locally optimal state and the process repeats until a complete solution is
built. Such algorithms resemble depth-first search in that they follow a single path all the
way to the goal, backtracking when a dead end is hit. Generally, the use of such strategies
comes with some benefits: minimal search cost, often quite good performance, and quickly
found solutions. Nevertheless, such approaches do not always find the optimal solution
because they just take the immediately best choice instead of analyzing long-term options
that could lead to better results.
According to this idea, we developed an approach that builds a sequence of tests by
iteratively finding the next optimal test to be added until the whole list is filled; in
other words, we search for the next test that reuses the most data given the sequence built
so far and attach it to the sequence; the process repeats as long as there are tests not yet
allocated. This approach runs very quickly compared to the permutation algorithm, and
the results produced are good enough, as reuse is obtained. However, sometimes the best
solution may not be found, because a test that is not the locally best choice could give
rise to even more reuse later on.
Moreover, depending on the choice of the first element of the sequence, the result
can vary drastically. To solve this problem we decided to run this algorithm n times,
where n is the size of the set of tests, so that each element is allocated as the first element
exactly once. Finally, the sequence with the best cost is returned. As the greedy algorithm
executes very fast, this approach remains feasible and allows us to return the best greedy
sequence.
The major steps are detailed in Algorithm 6.1. The main idea is to create n greedy
sequences where each test is allocated once as the first element. At line 1 we assign the
highest integer value to a variable that represents the best cost found so far. From line 2
to 11 we have an iteration over the n greedy sequences. Line 3 is a method call that
retrieves the greedy sequence where the ith test is allocated as the first element. Lines 4
to 7 check whether the cost of the ith greedy sequence is better than the best result found
so far; when it is, the best cost and the best sequences list are updated.
Algorithm 6.1 greedy()
Require: tests[]
 1: bestCost ⇐ MAX-INTEGER
 2: for i = 0 to tests.length − 1 do
 3:   h[] ⇐ findGreedySequence(i, tests)
 4:   if cost(h) < bestCost then
 5:     bestCost ⇐ cost(h)
 6:     bestSequences.clear()
 7:     bestSequences ⇐ h[]
 8:   else if cost(h) = bestCost then
 9:     bestSequences ⇐ h[]
10:   end if
11: end for
12: return bestSequences
Lines 8 and 9 check whether the cost of the sequence found is equal to the best cost found
so far and, in such cases, add the sequence to the best sequences list. Line 12 yields the
greedy sequences with the lowest cost. The method findGreedySequence(i, tests), called
in line 3, is described in Algorithm 6.2. From line 1 to 3, we assign the ith element as the
head of the sequence inside the state variable, which maintains the sequence and calculates
the cost of adding a new test based on the state of the phones added so far. Line 3
separates the head element from the rest of the tests that will be compared. From line 4
to 8, the greedy sequence is built with the remaining elements. The method called in
line 5 is responsible for the local optimization step present in greedy algorithms: it returns
the next test that can be added to the state with the lowest cost. This test is then added
to the state in line 6 and removed from the remaining set of tests in line 7. Line 9 returns
the greedy sequence found.
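To make the description concrete, a minimal Java sketch of this greedy strategy is shown below. The cost model is deliberately simplified: a pairwise matrix cost[i][j] stands for the configuration cost of running test j right after test i, playing the role of the state-based cost function of Chapter 4 and of findBestReuse. It is an illustration of the strategy, not the tool's implementation (which also keeps all tied best sequences):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative greedy prioritization over a simplified pairwise cost matrix.
    public class GreedySketch {

        // tries each test once as head and keeps the cheapest greedy sequence found
        static List<Integer> greedy(double[][] cost) {
            List<Integer> best = null;
            double bestCost = Double.MAX_VALUE;
            for (int head = 0; head < cost.length; head++) {
                List<Integer> sequence = findGreedySequence(head, cost);
                double c = totalCost(sequence, cost);
                if (c < bestCost) {
                    bestCost = c;
                    best = sequence;
                }
            }
            return best;
        }

        // builds one sequence: repeatedly appends the not-yet-used test with the lowest cost
        static List<Integer> findGreedySequence(int head, double[][] cost) {
            int n = cost.length;
            boolean[] used = new boolean[n];
            List<Integer> sequence = new ArrayList<>();
            sequence.add(head);
            used[head] = true;
            int last = head;
            for (int step = 1; step < n; step++) {
                int next = -1;
                for (int t = 0; t < n; t++) {
                    if (!used[t] && (next == -1 || cost[last][t] < cost[last][next])) {
                        next = t;   // locally optimal choice with respect to the last test added
                    }
                }
                sequence.add(next);
                used[next] = true;
                last = next;
            }
            return sequence;
        }

        static double totalCost(List<Integer> sequence, double[][] cost) {
            double total = 0;
            for (int i = 1; i < sequence.size(); i++) {
                total += cost[sequence.get(i - 1)][sequence.get(i)];
            }
            return total;
        }
    }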
6.2 SIMULATED ANNEALING
Some problems have the property that the state description itself contains all the information needed to discover a solution. For these kinds of problems one practical solution
is to use iterative improvement algorithms. The general idea is to start with a complete
configuration and iteratively perform modifications to improve its quality. The best way
Algorithm 6.2 findGreedySequence(headIndex, tests)
Require: headIndex, tests[]
 1: head ⇐ tests[headIndex]
 2: state.add(head)
 3: tests[] ⇐ tests[] − head
 4: for i = 2 to tests.length do
 5:   aux ⇐ findBestReuse(state, tests[])
 6:   state.add(aux)
 7:   tests[] ⇐ tests[] − aux
 8: end for
 9: return state.sequence
to understand such algorithms is to picture all the states laid out on the surface of a
landscape. The height of any point corresponds to the evaluation function of the state
at that point. The main objective is to search the landscape for the highest peaks, which
correspond to the optimal solutions.
Two well-known algorithms of this class are Hill-Climbing and Simulated Annealing
[RN03]. Hill-Climbing starts from an initial configuration, then generates its successors
and updates the current configuration to the best successor. The process repeats until no
successor can outperform its parent. The problem with this algorithm is that it easily
finds peaks, but there is no guarantee that the peak found is the highest or a high enough
one. The peak found depends directly on the initial configuration; hence, the choice of
this initial configuration is crucial. Some approaches can be used to improve this algorithm,
such as Random-restart Hill-Climbing, where, when a peak is reached, the algorithm
restarts from another random location.
Simulated Annealing can be viewed as an optimized Hill-Climbing. Instead of picking the
best successor, this strategy picks a random one. If its cost is better than the parent's,
the current sequence is updated; otherwise, there is an associated probability of taking
even a worse successor. This move is justified to avoid points that would lead to false
peaks (local maxima, not an optimal solution). The term Simulated Annealing comes
from the industrial process of annealing, which means the gradual cooling of a liquid
until it freezes. The iterative characteristic of the algorithm uses this idea.
Algorithm 6.3 simulatedAnnealing()
Require: current[], an initial list of tests
  T, a temperature
  coolingRate, the rate at which the temperature is decreased
 1: while T > 0 do
 2:   next[] ⇐ chooseSuccessor(current)
 3:   ΔE ⇐ cost(current[]) − cost(next[])
 4:   if ΔE > 0 then
 5:     current[] ⇐ next[]
 6:   else
 7:     current[] ⇐ next[] only with probability e^(ΔE/T)
 8:   end if
 9:   T ⇐ T × (1 − coolingRate)
10: end while
11: return current
A parameter T is used to represent the temperature; a second parameter, coolingRate, is
used to decrease the value of T until it reaches zero and the algorithm halts. At higher
values of T, the algorithm tends to accept bad moves. Based on this idea, we developed a
test prioritization algorithm using concepts of simulated annealing, shown in Algorithm 6.3.
It requires an initial configuration in which the tests are arranged, the temperature, and
the rate at which this temperature should be cooled. Thus, from line 1 to 10 we have
a while loop that iterates until the temperature approaches zero. At line 2, the successors
of the current node are generated. Basically, each one is generated by exchanging the
position of one test with its next neighbour (e.g., for a sequence {T1, T2, T3, T4} three
successors are generated: {T2, T1, T3, T4}, {T1, T3, T2, T4} and {T1, T2, T4, T3}), as
illustrated in Figure 6.1.
Figure 6.1 An example of successors from a state in our simulated annealing algorithm.

When all successors are created, one is randomly selected to be compared with its parent
(the current sequence). Line 3 calculates the cost variation between the current sequence
and its chosen successor. From line 4 to 8 it is verified whether ΔE is greater than 0,
which means that the successor is more economical, in which case the successor becomes
the current sequence at line 5; otherwise, there is a chance that even a worse successor
takes the place of the parent, with the probability shown at line 7. At the end of the
while loop the temperature is cooled according to the cooling rate. Finally, when the
temperature approaches zero, the current sequence discovered so far is returned, as stated
in line 11.
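The core loop can be sketched in Java as follows, again over the simplified pairwise cost matrix used in the greedy sketch (an illustration of Algorithm 6.3, not the tool's code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Illustrative simulated annealing over a simplified pairwise cost matrix.
    public class SimulatedAnnealingSketch {

        static List<Integer> anneal(List<Integer> initial, double[][] cost,
                                    double temperature, double coolingRate) {
            Random random = new Random();
            List<Integer> current = new ArrayList<>(initial);
            while (temperature > 1e-6) {                        // loop until T approaches zero
                // random successor: swap one test with its next neighbour
                int i = random.nextInt(current.size() - 1);
                List<Integer> next = new ArrayList<>(current);
                Collections.swap(next, i, i + 1);
                double deltaE = totalCost(current, cost) - totalCost(next, cost);
                // accept improvements; accept worse moves with probability e^(deltaE/T)
                if (deltaE > 0 || random.nextDouble() < Math.exp(deltaE / temperature)) {
                    current = next;
                }
                temperature *= (1 - coolingRate);               // cool down
            }
            return current;
        }

        static double totalCost(List<Integer> sequence, double[][] cost) {
            double total = 0;
            for (int i = 1; i < sequence.size(); i++) {
                total += cost[sequence.get(i - 1)][sequence.get(i)];
            }
            return total;
        }
    }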
Despite the choice of successors being very simple, this algorithm can be very effective
due to its iterative nature. Another aspect is the acceptance of "bad" moves early on, to
avoid local maxima and find optimal solutions. One of the great advantages of such
algorithms is that they are memoryless: as only the current state is kept, very little
memory is required to run them.
Another version of this algorithm was developed in which the temperature parameter is
replaced by the time in seconds the user wants to spend executing the technique. The
cooling rate parameter is no longer needed, because the temperature is cooled as the
time runs out. Note that in the experiments performed with the heuristics, which are
explained later in Section 6.4, we used the first approach with the two parameters. Although
this alternative version has not been used in this project, it was developed for use in future
experiments.
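Under the assumption that the temperature simply tracks the remaining fraction of the time budget (our reading of the description above, not necessarily the exact implementation), the variant could be driven as follows:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch of the time-bounded variant: the cooling rate parameter
    // disappears and the temperature follows the remaining time budget.
    public class TimeBoundedAnnealingSketch {

        static List<Integer> anneal(List<Integer> initial, double[][] cost, long budgetMillis) {
            long deadline = System.currentTimeMillis() + budgetMillis;
            double initialTemperature = 1000;                    // assumed starting value
            Random random = new Random();
            List<Integer> current = new ArrayList<>(initial);
            while (System.currentTimeMillis() < deadline) {
                double remaining = (deadline - System.currentTimeMillis()) / (double) budgetMillis;
                double temperature = Math.max(initialTemperature * remaining, 1e-9);
                int i = random.nextInt(current.size() - 1);
                List<Integer> next = new ArrayList<>(current);
                Collections.swap(next, i, i + 1);
                double deltaE = totalCost(current, cost) - totalCost(next, cost);
                if (deltaE > 0 || random.nextDouble() < Math.exp(deltaE / temperature)) {
                    current = next;
                }
            }
            return current;
        }

        static double totalCost(List<Integer> sequence, double[][] cost) {
            double total = 0;
            for (int i = 1; i < sequence.size(); i++) {
                total += cost[sequence.get(i - 1)][sequence.get(i)];
            }
            return total;
        }
    }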
6.3 GENETIC ALGORITHM
Apart from the two heuristics presented in the previous sections, another heuristic, based
on a Genetic Algorithm, has been implemented in [SMP09, Sal09], where the authors
also carry out a comparison among the three heuristics. Here we briefly summarise the
implementation of the Genetic Algorithm of [SMP09, Sal09] to create the context for the
comparison presented in the next section.
A Genetic Algorithm [MC96] is an iterative procedure that maintains a population
of individuals; these individuals are candidate solutions to the problem being solved, in
our case a sequence of test cases. Each iteration of the algorithm is called a generation.
During each generation, the individuals of the current population are rated for their
effectiveness as solutions. This step is called assessment. Based on these ratings, a
new population of candidate solutions is formed using specific genetic operators. Each
individual is represented by a codification and is called a chromosome. Every chromosome
consists of genes which have specific values; a gene is represented by a test case at some
position in the sequence.
Basically, the algorithm consists of a number of iterations performing assessments
and improvements on the sequences being generated. A simplified description is shown in
Algorithm 6.4.
Four parameters are required: the size of the population being maintained through the
algorithm execution, the number of iterations the algorithm executes, and the crossing
and mutation rates, which represent the percentage of crossings and mutations that may
be performed on each individual.
Initially, a population is generated randomly in line 1, and from line 2 up to line 7 a
series of repetitions is made, representing the evolution of this initial population. Each
iteration consists of an assessment, where the cost of each individual is calculated, and
then a subset representing the best individuals is selected in line 4. In line 5 some crossings
are performed with the selected individuals, generating new ones. The crossing consists
of taking two individuals and exchanging some of their genes. To avoid a trend of quick
Algorithm 6.4 geneticAlgorithm()
Require: tests[], the set of tests
  popSize, the size of the population that will be generated
  iterations, number of iterations
  cRate, crossing rate
  mRate, mutation rate
 1: generatePopulation(popSize, tests[])
 2: for i = 1 to iterations do
 3:   assessment()
 4:   selection()
 5:   crossing(cRate)
 6:   mutation(mRate)
 7: end for
 8: return bestResult
convergence to local optima instead of global optima, mutations are performed in line 6.
A mutation introduces a random change in the chromosome: a gene of the chromosome
is randomly selected to be exchanged with another gene of the same chromosome.
The Genetic Algorithm repeats these steps until the predetermined number of iterations
has been run. The best chromosome found so far is then returned as the best result.
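For concreteness only, the sketch below shows one possible shape of such an algorithm for test sequences. It is our illustration, not the implementation of [SMP09, Sal09]: selection keeps the cheaper half of the population, mutation swaps two genes as described above, and crossing is left out because, for permutations, it requires an order-preserving operator whose details are not reproduced here.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    // Illustrative genetic-algorithm skeleton: individuals are permutations of test
    // indices and fitness is the total cost of the sequence (lower is better).
    public class GeneticAlgorithmSketch {

        static List<Integer> evolve(double[][] cost, int popSize, int iterations, double mutationRate) {
            Random random = new Random();
            List<List<Integer>> population = generatePopulation(popSize, cost.length, random);
            for (int it = 0; it < iterations; it++) {
                // assessment + selection: keep the cheaper half of the population
                population.sort(Comparator.comparingDouble(ind -> totalCost(ind, cost)));
                population = new ArrayList<>(population.subList(0, popSize / 2));
                // refill the population with (possibly mutated) copies of the survivors;
                // an order-preserving crossover step would also be applied here
                while (population.size() < popSize) {
                    List<Integer> child = new ArrayList<>(population.get(random.nextInt(population.size())));
                    if (random.nextDouble() < mutationRate) {
                        Collections.swap(child, random.nextInt(child.size()), random.nextInt(child.size()));
                    }
                    population.add(child);
                }
            }
            population.sort(Comparator.comparingDouble(ind -> totalCost(ind, cost)));
            return population.get(0);                            // best chromosome found
        }

        static List<List<Integer>> generatePopulation(int popSize, int tests, Random random) {
            List<List<Integer>> population = new ArrayList<>();
            for (int i = 0; i < popSize; i++) {
                List<Integer> individual = new ArrayList<>();
                for (int t = 0; t < tests; t++) individual.add(t);
                Collections.shuffle(individual, random);
                population.add(individual);
            }
            return population;
        }

        static double totalCost(List<Integer> sequence, double[][] cost) {
            double total = 0;
            for (int i = 1; i < sequence.size(); i++) {
                total += cost[sequence.get(i - 1)][sequence.get(i)];
            }
            return total;
        }
    }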
With these three heuristics implemented we performed a series of case studies to assess
their effectiveness. These studies are described next.
6.4 CASE STUDIES
With the use of the tool detailed in Chapter 7, some experimentation has been conducted
in [SMP09] regarding the heuristics proposed. The previous chapter already showed the
effectiveness of the permutation technique, which is automated by this tool, compared to
the manual methodology applied by testers. Here we summarise results that complement
the ones obtained in [LAIS09] and that are relevant to those cases where neither the
permutation nor the manual methodology can be applied, namely when a high number of
tests is being prioritized. For these cases the three heuristics described previously were
implemented in the tool and some
empirical studies were performed to assess their efficiency in producing good enough
prioritizations. The text in the rest of this section is a summary of what is presented
in [SMP09].
To perform these studies three test suites were chosen. They are used to test specific
components of mobile phones and each one has a set of data configurations needed to
execute the procedures of its tests. We refer to them as test suite 007 with 17 test cases,
test suite 016 with 11 test cases and test suite 158 with 72 test cases, totalling 100 tests.
An initial analysis effort was made to retrieve the input/output information of each test,
and to register the associated data and test cases in the tool. With all the material
inserted in the tool we planned the experiment.
The test suites have been arranged in five different groups to be executed with the
different techniques. They are:
• Group A: Test suite 007 (17 test cases) + test suite 016 (11 test cases) + test suite
158 (72 test cases) = 100 test cases
• Group B: Test suite 007 = 17 test cases
• Group C: Test suite 016 = 11 test cases
• Group D: Test suite 158 = 72 test cases
• Group E: Test suite 007 + Test suite 016 = 28 test cases
Some of the techniques require parametrization. The Simulated Annealing technique
needs two parameters, the temperature and the cooling rate. We have used the parameters
suggested by Manikas and Cain [MC96]: Temperature = 1000 and Cooling Rate = 0.9.
Regarding the Genetic Algorithm technique, four parameters are needed. To find the
best parametrization for each group, a preliminary experiment was conducted.
6.4.1 Genetic Algorithm Parametrization
The experiments involved the execution of the algorithm with the following range of
parameters (after each parameter we explain why the range was chosen):
Group of      Number of     Population   Crossing   Mutation   Number of
test cases    test cases    Size         Rate       Rate       Iterations
Group A       100           60           0.5        0.2        30
Group B       17            50           0.5        0.2        10
Group C       11            50           0.5        0.1        10
Group D       72            80           0.5        0.2        20
Group E       28            50           0.5        0.1        10

Table 6.1 Best parameters for the Genetic Algorithm found in the experiments performed.
a. Population Size: between 50 and 100 chromosomes, in steps of 10;
   • A small population leads the algorithm in the direction of a local minimum.
   Moreover, a large population increases the chance of finding a better solution;
   however, this can cause the algorithm to run very slowly.
b. Crossing Rate: between 0.5 and 0.8, in steps of 0.1;
   • A low crossing rate means little reuse of existing information. However, a high
   crossing rate can cause a premature convergence, in other words, a fast
   homogenization of the population.
c. Mutation Rate: 0.1 or 0.2;
   • A high mutation rate can transform the Genetic Algorithm into a random search.
d. Number of iterations: between 10 and 50, in steps of 10 (the full sweep over these
   ranges is sketched after this list);
   • Many iterations make the algorithm very slow.
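As an illustration of the size of this search, the sweep over the ranges above could be enumerated as follows (a sketch only; the organization of the actual experiment in [SMP09] may differ):

    // Enumerates the parameter combinations explored (3 runs per combination).
    public class ParameterSweepSketch {
        public static void main(String[] args) {
            int runs = 0;
            for (int populationSize = 50; populationSize <= 100; populationSize += 10) {
                for (double crossingRate = 0.5; crossingRate <= 0.8 + 1e-9; crossingRate += 0.1) {
                    for (double mutationRate : new double[] {0.1, 0.2}) {
                        for (int iterations = 10; iterations <= 50; iterations += 10) {
                            for (int run = 1; run <= 3; run++) {
                                // here the Genetic Algorithm would be executed with
                                // (populationSize, crossingRate, mutationRate, iterations)
                                // and the best cost of the run would be recorded
                                runs++;
                            }
                        }
                    }
                }
            }
            System.out.println("Total runs: " + runs);   // 6 x 4 x 2 x 5 x 3 = 720
        }
    }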
Due to the randomness of the algorithm, every parameter combination was executed 3
times. The experiments were carried out on a computer with the following characteristics:
Core 2 Duo 2.2 GHz processor, 2 GB of RAM, Windows XP operating system.
The best results for each group are presented in Table 6.1.
As we can see, the best parameters for the Crossing Rate and the Mutation Rate are
more or less the same for all groups. For small sets of test cases, a population size around
50 and 10 iterations are a good choice. However, for larger sets, such as 72 test cases, it
was necessary to increase the population size and the number of iterations.
6.4.2 Heuristic Techniques Comparison
Once the Genetic Algorithm parameters are defined, it is possible to run an experiment
with the three techniques: Greedy (no parametrization needed), Simulated Annealing
(Temperature = 1000 and cooling rate = 0.9, as explained earlier) and the Genetic
Algorithm (with the parameters obtained from the previous runs). For each group of
tests, the prioritization techniques were executed 10 times, and the best and the average
results of each technique for each group were catalogued. The results are shown in
Table 6.2.
Table 6.2 shows that the results found for the suites with up to 72 test cases were the
same or very close for all techniques. However, for the set with 100 test cases, the results
found by the Genetic Algorithm were the best. Although the time spent by the Genetic
Algorithm is longer, it is in the same order of magnitude as Greedy and Simulated
Annealing. We can therefore conclude that, with our parameters for this problem, the
Genetic Algorithm finds better results for the larger sets of test cases, but Simulated
Annealing and Greedy can be used with little or no parametrization and find results
identical or very close to those of the Genetic Algorithm. Therefore, all three heuristics
can be rated as similarly effective.
In the next chapter, we detail the tool implemented to support automatic test case
prioritization, which was used to perform all experiments described in this and the
previous chapters, and which can be used to help testers execute their test suites more
quickly.
Group A (100 Tests)
            Genetic Algorithms        Simulated Annealing       Greedy
            Cost (s)   Time (ms)      Cost (s)   Time (ms)      Cost (s)   Time (ms)
AVERAGE     16642      5232.8         16650      139            16758      2201.7
BEST        16580      3844           16650      62             16758      1031

Group B (17 Tests)
            Genetic Algorithms        Simulated Annealing       Greedy
            Cost (s)   Time (ms)      Cost (s)   Time (ms)      Cost (s)   Time (ms)
AVERAGE     303        184.5          303        24.8           303        37.5
BEST        303        125            303        15             303        31

Group C (11 Tests)
            Genetic Algorithms        Simulated Annealing       Greedy
            Cost (s)   Time (ms)      Cost (s)   Time (ms)      Cost (s)   Time (ms)
AVERAGE     288        79.7           288        26.4           288        23.6
BEST        288        63             288        15             288        16

Group D (72 Tests)
            Genetic Algorithms        Simulated Annealing       Greedy
            Cost (s)   Time (ms)      Cost (s)   Time (ms)      Cost (s)   Time (ms)
AVERAGE     16368.2    3884.4         16424      54.7           16491      403.2
BEST        16326      2500           16424      31             16488      359

Group E (28 Tests)
            Genetic Algorithms        Simulated Annealing       Greedy
            Cost (s)   Time (ms)      Cost (s)   Time (ms)      Cost (s)   Time (ms)
AVERAGE     327        220.2          327        26.6           327        59.3
BEST        327        203            327        15             327        47

Table 6.2 Results of experiments executed with different sets of test cases.
CHAPTER 7
MECHANIZATION
To allow the use of all the test case prioritization techniques based on data reuse presented
so far, we developed a tool that implements all the algorithms. Such mechanization was
very important due to the fact that prioritization at the BTC is performed manually. The
requirement analysis was performed through meetings with the stakeholders of the project,
to elucidate some concepts related to the mobile phone domain and to understand aspects
involved in the manual prioritization done by testers. Through these interviews we
discovered what information we needed to execute automated test case prioritization;
moreover, they permitted us to research and develop our techniques.
Among its features the tool allows us to store input/output data, setups, resources,
and test cases. Moreover, the user can select a set of the catalogued tests together with a
technique to perform the prioritization. The best results are displayed with the associated
cost, i.e., the time to perform the configurations along the sequence, and the time taken
to generate the sequences. Its GUI (Graphical User Interface) facilitates the choice of
sequences among those with the same cost. User-defined clusters are also supported as
explained earlier in Section 4.1.1.
Next we detail the tool from a structural point of view and the technologies involved.
Then the functionalities are described and illustrated with some screenshots.
7.1 SOFTWARE STRUCTURE
This section details the architecture of the Prioritization Tool, and overviews some of the
project’s decisions.
7.1.1 Architecture
Figure 7.1 illustrates the architecture of the tool. It was built using Java [GJSB05],
a strongly typed language that supports the development of portable applications and
has a large number of libraries, named APIs (Application Programming Interfaces). Our
application was developed on top of two Java-based platforms: the Eclipse Rich Client
Platform (RCP) [ML05], for controlling the software life-cycle, and the Prevayler library,
to manage the persistence of the information handled by the tool. They are described
next:
• Eclipse Rich Client Platform (RCP) [ML05]: a dynamic plug-in based framework
for developing client applications. It provides a simple, powerful and flexible
way to create desktop systems using the Java language and can be understood
as a middleware that makes available a series of facilities such as a flexible UI (User
Interface) paradigm, scalable UIs, extensible applications, help support, context-sensitive
help, portability, error handling, and a simple install and update mechanism. The
Eclipse Standard Widget Toolkit (SWT) provides a graphical user interface (GUI)
toolkit for Java that allows efficient and portable access to the native UI facilities of
the OS. The JFace component works on top of SWT to provide helper classes for
handling common UI-related tasks, and the Workbench component organizes the UI
in graphical components such as views, perspectives, action contributions and editors.
Moreover, we used an API that provides advanced and custom widgets (GUI
components) from the Nebula Project (see http://www.eclipse.org/nebula/), a project
inside the Eclipse organization that aims to provide more complex widgets.
• Prevayler: Prevayler is an open source persistence library for Java [GJ05]. It is
an implementation of the Prevalent System architecture pattern, in which data is
kept hot in memory with changes journaled for system recovery. Its usage is very
simple: the programmer does not need to run a database server (the Prevayler API
persists all the data on disk), and no configuration is needed; the developer can
persist the business objects of her application directly, without making them extend
any base class or implement any interface in order to be persistent. The only
restriction is to encapsulate all modifications of the business objects into instances
of the Transaction interface, much like the Command pattern [GHJV94]. Whenever
a transaction is initiated, the object is persisted in a journal so that data is not lost
if the system crashes. Moreover, the state
of the entire business object model can be saved anytime the developer wishes. The
execution is very fast, as all data lives in memory, so the only disk access is streaming
transactions to the journal. This API is strongly recommended for simple applications
that do not require much database control. The choice of this library was very important
due to its simplicity, permitting us to focus on the development of the techniques and
not on storage issues. A minimal sketch of a transaction in this style is given right after
this list.
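The sketch below illustrates this pattern. It assumes the Prevayler API of that period, where a Transaction declares an executeOn(Object, Date) method and a Prevayler instance is obtained from PrevaylerFactory; the Repository and TestCase classes are simplified stand-ins for the tool's actual classes:

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import org.prevayler.Prevayler;
    import org.prevayler.PrevaylerFactory;
    import org.prevayler.Transaction;

    // Simplified stand-in for the tool's repository of business objects.
    class Repository implements Serializable {
        final List<TestCase> testCases = new ArrayList<TestCase>();
    }

    // Simplified stand-in for the tool's TestCase entity.
    class TestCase implements Serializable {
        final String id;
        TestCase(String id) { this.id = id; }
    }

    // A transaction in the style of AddTestCase: it only asks the repository to add
    // a test case; Prevayler journals the transaction before executing it.
    class AddTestCase implements Transaction {
        private final TestCase testCase;
        AddTestCase(TestCase testCase) { this.testCase = testCase; }
        public void executeOn(Object prevalentSystem, Date executionTime) {
            ((Repository) prevalentSystem).testCases.add(testCase);
        }
    }

    class PersistenceExample {
        public static void main(String[] args) throws Exception {
            // data is journaled under the given directory; no database server is needed
            Prevayler prevayler = PrevaylerFactory.createPrevayler(new Repository(), "prioritization-data");
            prevayler.execute(new AddTestCase(new TestCase("TC-001")));
        }
    }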
Figure 7.1 The Architecture of the Prioritization Tool.
In addition to the technologies described so far, we developed our tool following a basic
RCP architecture with two modules. The core module contains the business logic, with
all business objects and their relationships; it also contains the implementation of all the
prioritization techniques already explained and the submodule responsible for the
persistence transactions of our data. The UI module has all the graphical implementation
for the presentation of the tool, providing screens for data registration and prioritization
execution. These modules communicate through a Facade [GHJV94], as illustrated in
Figure 7.1. They are described next.
7.1.1.1 Core Module
This module is divided into four packages: Entities, Prioritization, Persistence and Util.
Each one is described below:
• Entities: This submodule has all the business objects needed to perform the
prioritizations. Some of them were discovered through the interviews with testers,
while others come from architectural decisions. Figure 7.2 presents a simplified
UML class diagram [OMG03] that reflects this package. The main class is TestCase:
it has a set of Setups, which are the configurations used in the execution of the
whole sequence. They can be of two types: Flex Bit or General Configuration. A
Flex Bit is a binary variable that has to be set on the phones to enable or disable
certain features (like Instant Messaging, E-mail, Browser), while a General
Configuration may be some other initialization needed to use those features in the
phones of a specific test suite. A FlexBit itself has no value in our tool; only when it
is associated with a TestCase (as a TestFlexBit) is a value assigned to it. Each test
may be associated with a set of Resources, e.g. a bluetooth headset, and each one
penalizes the sequence cost with the weight of the resource, which represents the
time spent initializing such an item.
A test case may also be associated with a set of phones, each of which may be either
a PUT or an Auxiliary Phone. Each phone contains two DataGroups, one representing
the input state and another containing the output state. When executing the stateful
prioritization, the state accumulated through the sequence is represented by the
class State, which holds the phones with the aggregated data. Finally, a TestGroup
represents a set of test cases with no conflicting Flex Bits. When there are Flex Bit
conflicts among tests from the same suite, they should be allocated to different test
groups and prioritized separately, because a bit change along the test execution is
very costly; that is why they are executed beforehand in the sequence. A cluster is
a specialization of a test case because it behaves like one during prioritization, as
explained in Section 4.1.1.
• Prioritization: All the prioritization techniques are implemented in this submodule:
Stateless Permutation, Stateful Permutation, Greedy, Simulated Annealing and
Genetic Algorithm. They all implement the same Prioritization interface, and any
other technique can be added to the tool by implementing this interface (a sketch
of its possible shape is given after this list).
• Persistence: The persistence submodule provides the transactions needed to persist
the information about our business objects and a class that simulates a repository
of all our objects. This repository maintains lists of the data we want to persist and
provides methods to manipulate these lists. For example, to create a transaction
that persists a test case, we created the AddTestCase class, which follows an interface
provided by the Prevayler API: it simply has a method that calls the repository to
add a test case. Each transaction is atomic and its implementation is very simple
(we use the same approach adopted for the AddTestCase class).
• Util: This submodule provides some useful classes to help with phone and test case
calculations and comparisons, the permutation algorithms used in both the Stateless
and Stateful approaches, and some classes to help with debugging.
Figure 7.2 Class diagram of the entities submodule.
7.1.1.2 UI Module
This module is responsible for the presentation and the usability of the tool. It is composed
of the Eclipse Workbench UI objects used to create the GUI, with components like the
Action Bar filled with the Menu Actions. There are five Menu Actions for registering
information (Data, Setup, Resource, Test Case and Clusters), one for prioritization and
another to close the tool. Each screen is composed of views arranged in a defined
Perspective, and each view is formed by widget components. Most of these widgets are
from the SWT library, but some screens use advanced widgets provided by the Nebula
project.
For example, the AddData action, when selected, opens the window displayed in
Figure 7.3. It uses two views: the AddDataView (left) and the ShowDataView (right).
They are composed of various kinds of widgets: labels, input fields, buttons and tables.
The AddDataView allows users to insert data that has an identification, a description
and the parent feature it belongs to. A feature is a general classification for a group of
tests, data and setups; for instance, Instant Messaging is an example of a feature. A data
item may belong to a feature and be used by others. Moreover, the associated costs (time)
to insert and delete the data on a phone should be registered. When the user hits the
Add Data button, the data is persisted and its information is shown in the table displayed
in the ShowDataView. In this view, data can also be removed from the tool by selecting
it and pressing the Remove Data button.
Figure 7.3 The screen for data insertion.
Recall that, at this point, the registered data has no input/output association: it is treated as an AbstractData. When these data are associated with a test, they become input or output data. With these technologies we created a set of forms that allow the user to record information about the test suites and to perform test case prioritization executions with the techniques presented in Chapters 4 and 6.
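The distinction can be sketched as follows; the class and field names below are assumptions based on the description above, not the tool's actual declarations.

    // A data item as registered in the Add Data screen: no input/output role yet.
    class AbstractData {
        String id;
        String description;
        String feature;       // e.g. "Instant Messaging"
        double insertionCost; // time to insert this data on a phone
        double deletionCost;  // time to delete this data from a phone
    }

    // Created only when a registered data item is associated with a test case.
    class InputData {
        AbstractData data;
        int amount;           // quantity of this data required by the test
    }

    class OutputData {
        AbstractData data;
        int amount;           // quantity of this data left on the phone by the test
    }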
7.2 MAIN FEATURES
To illustrate the use of the tool we present its two main screens: the Add Test Case and the Prioritization screens. The former provides the functionality to record the test cases of the test suites, with the information required to prioritize them; the latter uses these registered test cases to execute a selected prioritization technique and show the results.
7.2.1 Add Test Case screen
Figure 7.4 shows the Add Test Case screen. It provides fields to be filled with the test case information, divided into blocks in the left view, and controls to remove previously registered test cases or visualize them in the right view.
Regarding the left view, the first block has fields for general information about the test case: its identification, a description, the suite identification and the quantity of phones needed. The next block is responsible for the setup information. The user may choose setups previously registered in the Add Setup screen and associate them with the test case. They may be a General Configuration or a Flex Bit. If the chosen setup is a Flex Bit, the user should inform its value: ON or OFF. When Flex Bits are registered in the Add Setup screen no value is informed and they are recorded as instances of the FlexBit class; when they are associated with a test case, a TestFlexBit is instantiated, which is a Flex Bit associated with an ON/OFF value. The setups are added to the test case when the Add Setup button is pressed. On this screen, setups can be removed by selecting them in the setup table and pressing the Remove Setup button.
In the third block the user should inform the input/output state of the test case. First, the user chooses the phone (according to the number of phones informed in the first block) and marks the checkbox that indicates whether it is a PUT or not. Next, a previously registered data item must be selected and its data type, INPUT or OUTPUT, must be informed
together with the amount of data needed (data value). When the Add Data button is pressed, an InputData or OutputData class is instantiated and added to one of the data tables according to the data type chosen.
Figure 7.4 The screen for test case insertion.
Below the Data block there is the Resource block, where the user may add or remove previously registered resources (for example, Bluetooth headsets and SIM cards); this block is not visible in Figure 7.4. Finally, the user should press the Add Test Case button to save the test, which is then shown in the table displayed in the right view. The right view allows the removal of registered test cases and the presentation of their information.
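The information collected across these blocks could be aggregated roughly as in the sketch below. In particular, it shows the FlexBit/TestFlexBit distinction described above; all names and types are illustrative assumptions rather than the tool's actual code.

    import java.util.ArrayList;
    import java.util.List;

    // A Flex Bit as registered in the Add Setup screen: no ON/OFF value yet.
    class FlexBit {
        String id;
    }

    // A Flex Bit associated with an ON/OFF value when attached to a test case.
    class TestFlexBit {
        FlexBit flexBit;
        boolean on;
    }

    // Rough aggregation of what the Add Test Case screen collects.
    class TestCaseForm {
        // First block: general information.
        String id;
        String description;
        String suiteId;
        int phonesNeeded;

        // Second block: setups (general configurations and Flex Bits with values).
        List<TestFlexBit> flexBits = new ArrayList<TestFlexBit>();
        List<String> generalConfigurations = new ArrayList<String>();

        // Third block: input/output data per phone, and which phone is the PUT.
        int putPhone;                                      // index of the phone marked as PUT
        List<String> inputs = new ArrayList<String>();     // e.g. "1 Music file in phone 1"
        List<String> outputs = new ArrayList<String>();

        // Resource block: e.g. Bluetooth headsets, SIM cards.
        List<String> resources = new ArrayList<String>();
    }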
7.2.2 Prioritization screen
This screen uses the test cases registered in the screen presented in the previous section and prioritizes them according to a selected technique. Figure 7.5 illustrates a prioritization execution with seven test cases using the stateful permutation technique.
Figure 7.5 A prioritization execution with the stateful permutation.
The user should mark the test cases and clusters to be prioritized in the table located in the top left corner of the screen. Next, the prioritization technique should be selected. As some prioritization techniques require parameters, when one of them is selected its associated parameters should be informed; in particular, the Simulated Annealing technique has two forms of parameterization: by temperature and cooling rate, or by time (in seconds). Once the technique is selected, the user should press the Prioritize button. The tool then executes the selected prioritization technique with the marked test cases and clusters.
The results are shown in the table in the center of the screen. They are organized according to the Setup Groups (groups of tests separated by conflicting Flex Bits) and Groups (sets of the transitive closure discovered by the permutation algorithm). In this case, there is one setup group and two groups of the transitive closure (group 1, with test cases 253, 258 and 262; and group 2, with test cases 132, 360, 251 and 355). In the first group, three sequences tied for the best result, while in the second, eight were returned as the best sequences. The user should choose one sequence for the first group and another for the second. They can be executed in any order, as the groups are independent from the data reuse point of view.
The estimated configuration time (in seconds) and the time (in milliseconds) taken by the prioritization are presented. At the bottom of the screen there is a view where the user can build the sequence she wants to use based on the multiple tied results. The tester chooses the group and the tool presents a sequence of list boxes, each one representing one position in the sequence. Each box is filled with the tests available for that specific position. When a test A is chosen for a position X, the other positions are re-filled with the tests that appear in sequences where test A is at position X. In Figure 7.5, for group 2, given that the first position holds test 251 and the second holds test 132, there are only two possibilities at the third position: tests 360 and 355. This facility comes in handy when the list of best results becomes very large, as the user can visualize the tests that may still be executed given that some of them are already allocated to certain positions.
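The behaviour of this view can be understood as a simple filter over the tied best sequences: every choice the user fixes keeps only the sequences that agree with it, and the surviving sequences determine which tests may still appear at each open position. The sketch below illustrates this idea only; it is not the tool's implementation, and all names are assumptions.

    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative filter over the tied best sequences of one group.
    class SequenceBuilder {
        private final List<List<String>> bestSequences;                           // each inner list is one tied sequence of test ids
        private final Map<Integer, String> fixed = new HashMap<Integer, String>(); // position -> chosen test

        SequenceBuilder(List<List<String>> bestSequences) {
            this.bestSequences = bestSequences;
        }

        // The user picks a test for a given position (a list box selection).
        void fix(int position, String testId) {
            fixed.put(position, testId);
        }

        // Tests that can still appear at 'position', given every fixed choice so far.
        Set<String> optionsAt(int position) {
            Set<String> options = new LinkedHashSet<String>();
            for (List<String> sequence : bestSequences) {
                boolean compatible = true;
                for (Map.Entry<Integer, String> choice : fixed.entrySet()) {
                    if (!sequence.get(choice.getKey()).equals(choice.getValue())) {
                        compatible = false;
                        break;
                    }
                }
                if (compatible) {
                    options.add(sequence.get(position));
                }
            }
            return options;
        }
    }

For group 2 of Figure 7.5, for instance, after fixing test 251 at the first position and test 132 at the second, optionsAt would return only tests 360 and 355 for the third position.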
CHAPTER 8
CONCLUDING REMARKS
This work presented a new methodology for test case prioritization that reuses the data needed along a sequence of test cases, aiming to reduce the time and effort spent on executing them and thus improve the productivity of the testing process. Several techniques were proposed and applied in a mobile phone testing environment, where the test cases follow a black-box design, are described textually and are executed manually by human testers. We implemented an exploratory approach based on permutation to generate sequences of test cases using this concept. The yielded sequences have the optimal cost with respect to data reuse, as the approach examines all possible sequences, apart from some optimizations described below. Nevertheless, the permutation algorithm can only be used with a limited number of test cases due to its computational complexity, since the underlying problem is NP-complete. The algorithm includes improvements that increase the number of test cases tolerated during a prioritization, namely:
• the pruning mechanism, which avoids generating sequences whose prefix is already more costly than the best sequence found so far (a sketch of this idea is given after this list);
• considering equivalence classes of tests, which reduce the number of tests in the prioritization: tests in the same class all have the same input and output states, and their input state is equal to their output state, so they do not contribute new data along a sequence;
• grouping test cases using the transitive closure optimization and prioritizing the groups separately, as one group cannot affect the reuse in the others; the prioritization is then applied to reduced sets of test cases, making some otherwise infeasible prioritizations possible;
• allowing user-defined clusters, which are prioritized first and then prioritized together with other test cases and clusters, therefore reducing the number of test cases being prioritized; however, this feature invalidates the guarantee of an optimal solution.
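For illustration only, the prefix-pruning idea can be sketched as follows: while permutations are built recursively, a branch is abandoned as soon as the accumulated configuration cost of its prefix already exceeds the cost of the best complete sequence found so far. The cost function below is a placeholder, and the sketch keeps a single best sequence (the tool keeps all tied ones); it is not the actual implementation.

    import java.util.ArrayList;
    import java.util.List;

    class PrunedPermutation {
        private double bestCost = Double.POSITIVE_INFINITY;
        private List<String> bestSequence = new ArrayList<String>();

        // Placeholder for the configuration cost of appending 'test' to 'prefix';
        // in the real technique this comes from the data reuse model (function f).
        private double stepCost(List<String> prefix, String test) {
            return 1.0; // assumption: constant cost, for illustration only
        }

        void search(List<String> prefix, double prefixCost, List<String> remaining) {
            if (prefixCost > bestCost) {
                return; // prune: this prefix is already more costly than the best sequence found
            }
            if (remaining.isEmpty()) {
                if (prefixCost < bestCost) {
                    bestCost = prefixCost;
                    bestSequence = new ArrayList<String>(prefix);
                }
                return;
            }
            for (String test : new ArrayList<String>(remaining)) {
                double cost = stepCost(prefix, test);
                prefix.add(test);
                remaining.remove(test);
                search(prefix, prefixCost + cost, remaining);
                remaining.add(test);
                prefix.remove(prefix.size() - 1);
            }
        }
    }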
To allow the application of the data reuse method to test suites with a larger number of test cases, we also proposed three heuristics based on classical artificial intelligence concepts: a greedy algorithm that always takes the test case offering the best reuse, trying each test case as the head of the sequence; a strategy based on simulated annealing, whose iterations are controlled by a decreasing temperature, to generate a sequence; and a genetic algorithm that also searches for an ordered list of tests iteratively, through the evolution of a population of sequences. The use of such techniques does not guarantee the best result. Nevertheless, they are a viable alternative when it is infeasible to adopt permutations or when the time factor becomes critical during the prioritization process; they have produced good results in reasonable periods of time for both small and large test suites.
To assess the efficiency of the techniques we performed a series of experiments. First, we carried out comparative studies involving the permutation technique and a manual, heuristic-based technique used by testers [RBJ07]. The first experiment was a case study which showed that the permutation technique generated sequences 30% more economical in configuration time than the manual technique. As these results were only a theoretical estimation, we performed a formal controlled experiment in which testers executed sequences generated by both techniques. This experiment served two purposes: to confirm that the time saved in configuration activities was representative of the total execution time, and to perform another comparison between the two techniques. We verified with statistical significance that the time saved in configuration was representative enough to reduce the total execution time. The permutation technique was shown to be 13.5% more effective than the manual technique. As the manual technique had already brought gains of 5.56%, we estimate that applying our permutation technique to suites that have not yet been prioritized could yield gains of almost 20% in test suite executions. The time saved could be used to execute more test cases, which would help testers to meet or exceed their daily productivity targets.
Regarding the heuristics, we performed experiments with two purposes: to find a good parameterization for the genetic algorithm and to compare the results of the techniques. Five groups of test cases with different sizes were used: Group A (100 test cases), Group B (17 test cases), Group C (11 test cases), Group D (72 test cases) and Group E (28 test cases). Using the parameterizations found for the genetic algorithm according to the number of tests, shown in Table 6.1, the comparative study showed that for small numbers of tests the techniques produced equal results, and that for the larger test suites (Groups A and D), although the genetic algorithm produced better results, the greedy and simulated annealing approaches returned results of the same order of cost in less time, which leads us to suppose that these three techniques have very similar efficiency. However, a more accurate empirical validation is required.
In addition, we developed a stand-alone tool that mechanizes all the proposed prioritization techniques and that supported us throughout the empirical studies. The tool provides mechanisms to register data, setups and resources, which are used to register test cases, which in turn are used to perform the prioritization executions. Moreover, when a prioritization takes place, the results are displayed in a table together with the time spent in the prioritization execution and the estimated configuration cost for the selected test suite. Test cases are displayed in groups from the transitive closure optimization, and tests in equivalence classes are highlighted. Finally, the tool also includes a mechanism that helps users choose a unique sequence based on the results, which is very helpful when too many sequences are returned.
8.1 LIMITATIONS
The limitations of our work are described below:
• The only approach that guarantees the discovery of optimal sequences of test cases, the permutation technique, can only be applied to small test suites (at most 15 test cases if no optimization is applied). We provided several optimizations in the algorithm that help the technique handle a larger number of test cases. However, it is possible that no optimization is applicable to some set of test cases. For example, a test suite that forms a single transitive closure group, has no equivalent test cases, and allows no pruning will necessarily require traversing all possible sequences. We believe such cases are unlikely in the application domain we are working in; still, we are aware that in such cases only about 15 test cases can feasibly be prioritized in less than one day.
• Another limitation of our data reuse techniques is the overhead of the data survey activity that must be performed to feed our tool and allow a mechanized prioritization. It is a laborious task, as testers must read the test cases, identify the input/output state of each one, and register them in the tool. Such a task could be automated.
• After all data have been gathered, another issue is how to assign the cost associated with each data item. We performed experiments to measure the cost of configuring each data item used in our prioritization executions; nevertheless, it might be infeasible to run experiments every time new data appears. Another problem is that this cost may vary from tester to tester depending on their experience and productivity. Since a prioritization should be executed specifically for each tester, testers should register the time for each data item based on their execution history or on an estimate derived from their experience. If new data is used, we propose running small experiments, or executing the test cases without prioritization, to acquire knowledge about the data, recording the historical data from these executions, and only then performing the prioritization.
• The fewer the possibilities of reuse among a set of test cases, the larger the number of best sequences found by the permutation technique. For example, it may be very difficult to choose one of 45 thousand best sequences returned by the tool as the result of a prioritization. To minimize this issue we developed an interface in the tool that helps the user build the preferred sequence based on the returned results, as shown in Chapter 7.
• Regarding the proposed heuristics, the best sequence is not guaranteed to be found. Moreover, a more thorough experiment could be performed, with more executions and with comparisons against the permutation technique whenever possible.
• The empirical studies presented in this work were performed in a mobile phone testing environment where test cases are executed manually by testers. Therefore, their results may not be immediately generalizable to other application domains. Further experiments should be carried out to confirm the effectiveness of these techniques in other environments.
8.2 FUTURE WORK
The following activities are suggested topics for future work:
• The input/output state information of each test could be detailed in the test itself during its creation or generation process. Such information could be standardized, allowing text processing by our tool. In this way, test cases could be registered automatically, which would save the time spent entering this information in the tool.
• Other heuristics could be implemented and compared to those presented in Chapter 6, creating a benchmark of prioritization techniques based on data reuse.
• Exploring the use of the transitive closure optimization together with the heuristics.
• Performing a more thorough experiment with different test suites and more iterations of the proposed heuristics, trying to find good parameterizations for them. The purpose of such an experiment would be to provide a more accurate comparison among the heuristics, also using the permutation technique in the comparison when possible.
• Incorporating into our calculations other aspects that may affect the time spent in test case execution (test complexity, test size, tester experience), thereby improving our f function.
In summary, we proposed test case prioritization techniques based on data reuse, reducing the time spent configuring test cases in order to improve the productivity of testing teams. An approach based on permutation, which finds the best results for small test suites, was detailed together with three other heuristics that can be applied to larger test suites. We carried out empirical studies that reinforced our expectation that data reuse is relevant to reducing test suite execution time and that our techniques outperform the manual technique currently used by testers.
APPENDIX A
APPENDIX
A.1 TEST SUITES USED IN THE CONTROLLED EXPERIMENT
Tables A.1 and A.2 show the test suites used in our controlled experiment. The tests are
described in terms of their inputs and outputs (test procedures are omitted). The results
of the manual and the automatic techniques are shown in Table A.3.
Test
Inputs
Outputs
1
1 Phonebook contact in phone 2
1 Phonebook contact in phone 2
0 phone02 IM contact in phone 1
1 phone02 IM contact in phone 1
1 phone01 IM contact in phone 2
1 phone01 IM contact in phone 2
1 SIM contact in phone 2
1 SIM contact in phone 2
1 IM login (Auto) in phone 2
1 IM login (Auto) in phone 2
1 IM login (Power On) in phone 1
1 IM login (Power On) in phone 1
1 Music file in phone 1
1 Music file in phone 1
1 Alert for IM in phone 1
1 Alert for IM in phone 1
1 Automatic Lock in phone 1
1 Automatic Lock in phone 1
1 phone02 IM contact in phone 1
1 phone02 IM contact in phone 1
1 phone01 IM contact in phone 2
1 phone01 IM contact in phone 2
2
1 Alert phone02 in phone 1
3
4
5
6
1 phone02 IM contact in phone 1
0 phone02 IM contact in phone 1
1 phone01 IM contact in phone 2
1 phone01 IM contact in phone 2
1 phone02 IM contact in phone 1
1 phone02 IM contact in phone 1
1 phone01 IM contact in phone 2
1 phone01 IM contact in phone 2
1 phone02 IM contact in phone 1
1 phone02 IM contact in phone 1
1 phone01 IM contact in phone 2
1 phone01 IM contact in phone 2
1 Video in phone 1
1 Video in phone 1
1 phone02 IM contact in phone 1
1 phone02 IM contact in phone 1
1 phone01 IM contact in phone 2
1 phone01 IM contact in phone 2
1 Phonebook contact in phone 1
1 Phonebook contact in phone 1
Table A.1 The test suite I.
Test
Inputs
Outputs
1
0 SMS Message in phone 1
0 SMS Message in phone 1
0 Info Service Msg. in phone 1
3 Info Service Msg. in phone 1
1 Service field value in phone 1
1 Service field value in phone 1
1 Active channel 0 in phone 1
1 Active channel 0 in phone 1
1 Auto cleanup none in phone 1
1 Auto cleanup none in phone 1
0 Screen saver in phone 1
0 Screen saver in phone 1
1 Ringtone to info in phone 1
1 Ringtone to info in phone 1
2
1 Phonebook contact in phone 1
1 Self contact in phone 1
1 SMS Message in phone 1
1 SMS Message in phone 2
3
1 Word completion in phone 1
1 Word completion in phone 1
1 Itap mode in phone 1
1 Itap mode in phone 1
4
5
1 Itap mode in phone 1
1 Non-DRM HR Pic. in phone 1
1 Non-DRM HR Pic. in phone 1
1 Phonebook contact in phone 1
1 Phonebook contact in phone 1
1 SMS Message in phone 1
1 SMS Message in phone 2
6
1 Phonebook contact in phone 1
1 Phonebook contact in phone 1
1 Self contact in phone 1
1 Self contact in phone 1
1 Call to self in phone 1
1 Call to self in phone 1
1 Msg. App. locked in phone 1
0 Msg. App. locked in phone 1
2 SMS Message in phone 1
7
1 Dialed Call to self in phone 1
1 Dialed Call to self in phone 1
8
2 SMS Message in phone 1
3 SMS Message in phone 1
1 Auto cleanup set in phone 1
1 Auto cleanup set in phone 1
1 SMS Message in phone 1
Table A.2 The test suite II.
Suite   Manual             Automatic
I       3-4-5-2-6-1        1-2-6-5-4-3
II      6-8-5-1-7-4-3-2    1-2-6-8-5-7-4-3
Table A.3 Test suites sorted according to each prioritization technique.
BIBLIOGRAPHY
[AB08a]
Eduardo Aranha and Paulo Borba. Manualtest: Improving collection of
manual test execution data in empirical studies. In Proceedings of the 5th
Empirical Software Engineering Latin-american Workshop (ESELAW), Salvador, Brazil, 2008.
[AB08b]
Eduardo Aranha and Paulo Borba. Using process simulation to assess the
test design effort reduction of a model-based testing approach. In ICSP,
pages 282–293. Springer, 2008.
[AR02]
Carina Andersson and Per Runeson. Verification and validation in industry – a qualitative survey on the state of practice. In International Symposium on Empirical Software Engineering (ISESE), pages 37–47. IEEE Computer Society, 2002.
[AW95]
Alberto Avritzer and Elaine J. Weyuker. The automatic generation of load
test suites and the assessment of the resulting software. IEEE Trans. Softw.
Eng., 21(9):705–716, 1995.
[BCR94]
V.R. Basili, G. Caldiera, and H.D. Rombach. The Goal Question Metric
Approach. Encyclopedia of Software Engineering, 1:528–532, 1994.
[Bei95]
Boris Beizer. Black-box testing: techniques for functional testing of software
and systems. John Wiley & Sons, Inc., New York, NY, USA, 1995.
[Ber07]
Antonia Bertolino. Software testing research: Achievements, challenges,
dreams. In Lionel C. Briand and Alexander L. Wolf, editors, International
Conference on Software Engineering, ICSE 2007, Workshop on the Future
of Software Engineering, FOSE, pages 85–103, 2007.
[BHHH78] George E. P. Box, William G. Hunter, and J. Stuart Hunter. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley-Interscience, June 1978.
[Dij72]
Edsger W. Dijkstra. The humble programmer. Commun. ACM, 15(10):859–
866, October 1972.
[EMR00]
Sebastian Elbaum, Alexey G. Malishevsky, and Gregg Rothermel. Prioritizing test cases for regression testing. In Proceedings of the International Symposium on Software Testing and Analysis, volume 25, pages 102–112. ACM Press, 2000.
[EMR02]
Sebastian G. Elbaum, Alexey G. Malishevsky, and Gregg Rothermel. Test
case prioritization: A family of empirical studies. Software Engineering,
28(2):159–182, 2002.
[GHJV94] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley
Professional Computing Series). Addison-Wesley Professional, illustrated edition, January 1994.
[GJ79]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to
the Theory of NP-Completeness. W. H. Freeman, 1979.
[GJ05]
Irum Godil and Hans-Arno Jacobsen. Horizontal decomposition of prevayler.
In CASCON ’05: Proceedings of the 2005 conference of the Centre for Advanced Studies on Collaborative research, pages 83–100. IBM Press, 2005.
[GJSB05]
James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language
Specification, Third Edition. Addison-Wesley Longman, Amsterdam, 3rd edition, June 2005.
[Ham04]
Paul Hamill. Unit test frameworks. O’Reilly, 2004.
[HL92]
J. R. Horgan and S. A. London. ATAC: A data flow coverage testing tool
for C. In Proceedings of the Symposium on Assessment of Quality Software Development Tools, pages 2–10, New Orleans, LA, USA, 1992. Springer-Verlag.
[Hof99]
Douglas Hoffman. Cost benefits analysis of test automation. In Software
Testing Analysis & Review Conference (STAR East), Orlando, FL, USA,
1999.
[IEE90]
IEEE standard glossary of software engineering terminology. Technical report, 1990.
[JCP08]
A. Jedlitschka, M. Ciolkowski, and D. Pfahl. Reporting experiments in software engineering. In F. Shull, J. Singer, and D.I.K. Sjøberg, editors, Guide to
Advanced Empirical Software Engineering, chapter 8. Springer-Verlag, London, 2008.
[JG08]
Dennis Jeffrey and Neelam Gupta. Experiments with test case prioritization
using relevant slices. J. Syst. Softw., 81(2):196–221, 2008.
[JH03]
James A. Jones and Mary Jean Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on
Software Engineering, 29:92–101, 2003.
[Jor02]
Paul C. Jorgensen. Software Testing: A Craftsman's Approach, Second Edition. CRC Press, 2002.
[JP05]
A. Jedlitschka and D. Pfahl. Reporting guidelines for controlled experiments
in software engineering. In Proceedings of the International Symposium on Empirical Software Engineering (ISESE), 2005.
[KFN99]
Cem Kaner, Jack L. Falk, and Hung Quoc Nguyen. Testing Computer Software, Second Edition. John Wiley & Sons, Inc., New York, NY, USA, 1999.
[KP02]
Jung-Min Kim and Adam Porter. A history-based test prioritization technique for regression testing in resource constrained environments. In ICSE
’02: Proceedings of the 24th International Conference on Software Engineering, pages 119–129, New York, NY, USA, 2002. ACM.
[Kue99]
Robert Kuehl. Design of Experiments: Statistical Principles of Research
Design and Analysis. Duxbury Press, 2nd edition, 1999.
[LAIS09]
Lucas Lima, Eduardo Aranha, Juliano Iyoda, and Augusto Sampaio. Test
case prioritization based on data reuse - an empirical study. In Proceedings of
the Third International Symposium on Empirical Software Engineering and
Measurement, Lake Buena Vista, FL, USA, 2009. IEEE Computer Society.
To appear.
[LIS08]
Lucas Lima, Juliano Iyoda, and Augusto Sampaio. A permutation technique
for test case prioritization in a black-box environment. In 2nd Brazilian
Workshop on Systematic and Automated Software Testing, Campinas-SP,
Brazil, October 2008.
[MC96]
T. W. Manikas and J. T. Cain. Genetic Algorithms vs. Simulated Annealing:
A Comparison of Approaches for Solving the Circuit Partitioning Problem.
Technical report, 1996.
[ML05]
Jeff McAffer and Jean-Michel Lemieux. Eclipse Rich Client Platform: Designing, Coding, and Packaging Java(TM) Applications. Addison-Wesley
Professional, 2005.
[MSBT04] Glenford J. Myers, Corey Sandler, Tom Badgett, and Todd M. Thomas. The
Art of Software Testing, Second Edition. Wiley, June 2004.
[NCT+07] S. Nogueira, E. Cartaxo, D. Torres, E. Aranha, and R. Marques. Model
Based Test Generation: An Industrial Experience. In Proceedings of the I
Brazilian Workshop on Systematic and Automated Software Testing, 2007.
[Nor98]
J. R. Norris. Markov Chains. Cambridge University Press, 1998.
[OMG03]
OMG. Unified modeling language. Specification v1.5, Object Management
Group, March 2003. http://www.omg.org/cgi-bin/doc?formal/03-03-01.
[OTPS98]
Akira K. Onoma, Wei-Tek Tsai, Mustafa Poonawala, and Hiroshi Suganuma.
Regression testing in an industrial environment. Commun. ACM, 41(5):81–
86, 1998.
[Pfl94]
Shari Lawrence Pfleeger. Design and analysis in software engineering: the
language of case studies and formal experiments. SIGSOFT Softw. Eng.
Notes, 19(4):16–20, 1994.
[Pfl95a]
Shari Lawrence Pfleeger. Design and analysis in software engineering: how
to set up an experiment. SIGSOFT Softw. Eng. Notes, 20(1):22–26, 1995.
[Pfl95b]
Shari Lawrence Pfleeger. Experimental design and analysis in software engineering: choosing an experimental design. SIGSOFT Softw. Eng. Notes,
20(3):13–15, 1995.
[Pfl95c]
Shari Lawrence Pfleeger. Experimental design and analysis in software engineering, part 5: analyzing the data. SIGSOFT Softw. Eng. Notes, 20(5):14–
17, 1995.
[Pfl95d]
Shari Lawrence Pfleeger. Experimental design and analysis in software engineering: types of experimental design. SIGSOFT Softw. Eng. Notes, 20(2):14–16, 1995.
[Pre05]
Roger Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill, sixth edition, 2005.
[QNXZ07] Bo Qu, Changhai Nie, Baowen Xu, and Xiaofang Zhang. Test case prioritization for black box testing. In COMPSAC ’07: Proceedings of the 31st
Annual International Computer Software and Applications Conference - Vol.
1- (COMPSAC 2007), pages 465–474, Washington, DC, USA, 2007. IEEE
Computer Society.
[RBJ07]
H. Rodrigues, R. Baden, and W. Júnior. A Test Execution Sequence Study
for Performance Optimization. Technical report, The Federal University of
Pernambuco, 2007.
[RN03]
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 2003.
[RUCH01] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. Prioritizing test cases for regression testing. Software Engineering,
27(10):929–948, 2001.
[RW06]
Rudolf Ramler and Klaus Wolfmaier. Economic perspectives in test automation: balancing automated and manual testing with opportunity cost. In
Proceedings of the International Workshop on Automation of Software Test
(AST), pages 85–91, New York, NY, USA, 2006. ACM.
[Sal09]
Ricardo Sales. Implementing Heuristics for Test Cases Prioritization. Technical report, Universidade Federal do Rio Grande do Norte, Natal, 2009.
Monograph.
[SC89]
George W. Snedecor and William G. Cochran. Statistical Methods. Iowa
State University Press, 1989.
[Sed77]
Robert Sedgewick. Permutation generation methods. ACM Comput. Surv.,
9(2):137–164, 1977.
[SMP09]
Ricardo Sales, Taíza Montenegro, and Thaysa Paiva. Implementing Heuristics for Test Cases Prioritization and Integration with Target. Technical
report, The Federal University of Pernambuco, 2009.
[Sri04]
H. Srikanth. Requirements Based Test Case Prioritization. In Student Research Forum in 12th ACM SIGSOFT Intl Symposium. ACM SIGSOFT, on
the Foundations of Software Engg, 2004.
[ST02]
Amitabh Srivastava and Jay Thiagarajan. Effectively prioritizing tests in
development environment. SIGSOFT Softw. Eng. Notes, 27(4):97–106, 2002.
[WHLM98] W. Eric Wong, Joseph R. Horgan, Saul London, and Aditya P. Mathur.
Effect of test set minimization on fault detection effectiveness. Softw. Pract.
Exper., 28(4):347–369, 1998.
[ZNXQ07] Xiaofang Zhang, Changhai Nie, Baowen Xu, and Bo Qu. Test case prioritization based on varying testing requirement priorities and test case costs.
In QSIC ’07: Proceedings of the Seventh International Conference on Quality Software, pages 15–24, Washington, DC, USA, 2007. IEEE Computer
Society.
This volume has been typeset in LaTeX with the UFPEThesis class (www.cin.ufpe.br/∼paguso/ufpethesis).