Controlled Experiment Design with
Latin Square by Example
Elder Cirilo
Pontifícia Universidade Católica do Rio de Janeiro
Laboratório de Engenharia de Software
Content based on the presentation by Eduardo Aranha, 1º Encontro de
Engenharia de Software Experimental 2011 - Natal
Objectives
• Present and motivate the use of controlled experiments in
software engineering based on the Latin square technique.
• Illustrate experiment designs that used the Latin square as a
technique for controlling factors.
• Illustrate the statistical analysis process for controlled
experiments based on the Latin square.
Agenda
1. Illustrative example
2. Controlled experiments
3. Statistical experiment designs
– Completely randomized design
– Randomized complete block design
– Latin square design
4. Examples of experiments based on the Latin square
5. Statistical analysis of Latin square designs using the R tool
Illustrative Example
• How can different Model-based Testing (MBT) technologies
be evaluated?
[Figure: three MBT technologies, labeled A, B and C]
Experiment Context
• Choose the best MBT technology: A, B or C
– Efficiency
– Effectiveness
– Ease of use
– Ease of maintenance
• Software industry
– Varied professionals
– Different projects
– Limited resources
Requires scientifically grounded results
Illustrative Experiment Design
• Evaluate MBT by using it in three ongoing projects
– Scope: test design, aiming to minimize costs.
[Figure: requirements, developer and approach (A, B or C) for each project]
Possible Results
[Figure: one chart each for Project 1, Project 2 and Project 3, with an axis from 10 to 40 h]
• Was the MBT technology really responsible for the results?
• If the study is run on new projects, will the result be the same?
How to Obtain More Relevant Results?
• Control the elements present in the environment where the
experiment runs.
– Eliminate, reduce or dilute the effect of these elements
• We can control:
– development environment
– experience of the participants
– project complexity
Design 1 – Controlling the Environment
• Control
– A single project
– A single developer
• Use of all techniques
• Order in which the techniques are used
A (1st), B (2nd), C (3rd)
Considerations About the Proposal
• Fixing the project and the developer eliminates some
undesirable effects
– Project complexity
– Expertise and experience of different developers
– ...
• Is there a learning effect on the developer's side?
• Training can mitigate this kind of effect, but is there any
other threat associated with the experiment design?
Design 2 – Controlling Learning
• Control
– Same developer profile
– Personal interest determines the technique
– Training in the chosen approach
[Figure: one developer per technique: A, B, C]
Considerations About the Proposal
• The effect of the requirements remains eliminated
• The developer effect is back
– However, it was reduced by fixing the participant profile
• Could the choice of who uses each technique be biased?
Design 3 – Avoiding Bias
• Control
– Same developer profile
– A random draw determines the technique
– Training in the drawn technique
[Figure: techniques assigned by draw: B, A, C]
Considerations About the Proposal
• Low probability that the draw produces a biased result.
• Is a single observation per approach enough?
Design 4 – Increasing Observations
• Larger number of developers with the same profile
• Random draw of the technique to be used, followed by
training
[Figure: techniques assigned by draw: B, A, C, B, A, C]
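The random draw with more than one observation per technique can be sketched in R, the tool used later in this material. This is only an illustrative sketch: the six developer names and the seed are hypothetical, and two developers per technique are assumed.

```r
# Hypothetical pool of six developers with the same profile
developers <- paste0("dev", 1:6)
techniques <- c("A", "B", "C")

set.seed(2011)  # fixed seed only to make the draw reproducible

# Two copies of each technique, shuffled and handed out one per developer
assignment <- data.frame(
  developer = developers,
  technique = sample(rep(techniques, each = 2))
)

print(assignment)

# Every technique ends up with exactly two observations
counts <- table(assignment$technique)
print(counts)
```

Shuffling a balanced vector of labels, instead of drawing a technique independently for each developer, keeps the number of observations per technique equal.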
Controlled Experiment
• A procedure that deliberately changes the variables of a
process/system
• Observes the changes in the output and identifies their causes
• Collects evidence against the formulated hypothesis
[Figure: a process/system with inputs, controlled variables x1, x2, …, xn, uncontrolled (and possibly unknown) variables z1, z2, …, zn, and an output]
Fundamental Principles of Controlled Experiments
• Local control
– Eliminate, reduce, dilute or isolate the effect of noise factors
– Fix certain levels for variables that are not under investigation
• E.g.: experience of the participants, project complexity
• Replication
– Applying a treatment to more than one experimental unit
– Dilutes the effect of the variability among similar people,
projects and artifacts
– Reduces the chance of obtaining results by accident
• Randomization
– With replicates, dilutes the effect of differences in motivation
and experience
– Eliminates possible researcher bias and reduces the learning
effect
Cause-Effect Analysis
• Possible only when the principles of replication and
randomization are applied
• Observational studies or quasi-experiments
– Lack randomization and/or local control
How to Analyze the Data
• How to assess situations where visual analysis does not
clearly show whether there was a significant improvement?
• Would the conclusions change with more observations?
Statistical Analysis
Randomized Complete Block Design
• Applied when:
– There is a factor that is not under investigation but has a
significant influence on the output variable
– It is not possible or desirable to fix a single level for this
factor
• Block
– Homogeneous group of experimental units
– Randomization is done within the blocks
Example of Blocks in SE
• Level of development experience
[Figure: 18 participants split into three blocks of 6 developers each: Low, Medium, High]
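Randomization within blocks can be sketched as follows. The sketch assumes the 18 participants and three experience blocks of the example, plus three techniques A, B and C as in the illustrative example; the participant identifiers and the seed are hypothetical.

```r
# Hypothetical participants: 18 developers, 6 per experience block
participants <- data.frame(
  id    = sprintf("p%02d", 1:18),
  block = rep(c("Low", "Medium", "High"), each = 6),
  stringsAsFactors = FALSE
)
techniques <- c("A", "B", "C")

set.seed(42)  # fixed seed only to make the draw reproducible

# Randomize separately inside each block, so every block gets each technique twice
participants$technique <- NA_character_
for (b in unique(participants$block)) {
  rows <- which(participants$block == b)
  participants$technique[rows] <- sample(rep(techniques, length.out = length(rows)))
}

# Block x technique counts: a complete block design has no empty cell
balance <- table(participants$block, participants$technique)
print(balance)
```

Because the draw never crosses block boundaries, differences between experience levels cannot be confused with differences between techniques.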
To What Extent Can We Generalize the Results?
• The result of the experiment is limited to a single type of
project
– Simple or complex
• Size and complexity of projects in practice
Latin Squares
• Applied when:
– There are two noise factors with a significant influence on the
output variable
• Block
– Combination of levels of the two noise factors (row, column)
• The number of participants grows significantly in this
scenario.
Crossing Experience and Size
Rows: level of experience. Columns: project size.
B  A  C
C  B  A
A  C  B
Replicates are obtained by changing developers and/or projects
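One common way to obtain such a square is to start from a cyclic arrangement and shuffle rows, columns and treatment labels. The sketch below assumes that approach; it is not necessarily how the square on the slide was produced. The function names, the level names (Low/Medium/High, Small/Medium/Large) and the seed are hypothetical.

```r
# Build a random p x p Latin square from the cyclic square
latin_square <- function(treatments) {
  p <- length(treatments)
  # Cyclic square: cell (i, j) holds index ((i + j - 2) mod p) + 1
  idx <- outer(seq_len(p), seq_len(p), function(i, j) (i + j - 2) %% p + 1)
  # Shuffling whole rows and whole columns keeps the Latin property
  idx <- idx[sample(p), sample(p), drop = FALSE]
  # Shuffle which treatment each index stands for
  matrix(sample(treatments)[idx], nrow = p)
}

# TRUE when the square uses p symbols, each once per row and once per column
is_latin <- function(sq) {
  p <- nrow(sq)
  once <- function(v) length(unique(v)) == p
  ncol(sq) == p &&
    length(unique(as.vector(sq))) == p &&
    all(apply(sq, 1, once)) &&
    all(apply(sq, 2, once))
}

set.seed(7)  # fixed seed only to make the draw reproducible
sq <- latin_square(c("A", "B", "C"))
dimnames(sq) <- list(experience = c("Low", "Medium", "High"),
                     size = c("Small", "Medium", "Large"))
print(sq)
print(is_latin(sq))

# The square shown on the slide passes the same check
slide_square <- rbind(c("B", "A", "C"), c("C", "B", "A"), c("A", "C", "B"))
print(is_latin(slide_square))
```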
Limitations of the Latin Square
• Requires the same number of treatments, rows and columns
• Some squares need more than 2 replicates
– E.g.: a square of size 2
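The need for replicates in small squares follows from counting degrees of freedom. The derivation below assumes the usual additive model for a single, unreplicated p x p square (row, column and treatment effects, no interactions); the symbols are chosen here for illustration.

```latex
% Additive model assumed for one p x p Latin square
y_{ijk} = \mu + \rho_i + \gamma_j + \tau_k + \varepsilon_{ijk},
\qquad i, j, k = 1, \dots, p

% Degrees of freedom of one unreplicated square
\begin{aligned}
\mathrm{df}_{\text{total}} &= p^2 - 1 \\
\mathrm{df}_{\text{rows}} = \mathrm{df}_{\text{columns}} = \mathrm{df}_{\text{treatments}} &= p - 1 \\
\mathrm{df}_{\text{error}} &= (p^2 - 1) - 3(p - 1) = (p - 1)(p - 2)
\end{aligned}

% Small squares
p = 2:\ \mathrm{df}_{\text{error}} = 0
\qquad
p = 3:\ \mathrm{df}_{\text{error}} = 2
```

With p = 2 a single square leaves no degrees of freedom to estimate the error, so the square has to be replicated before any test can be run.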
Latin Square by Example
Example 1
Cirilo, E. et al.
Empirical Evaluation
• Our main goal is to investigate whether different techniques
for product line implementation influence the correct
comprehension of the configuration knowledge.
• As in related efforts, two dimensions were evaluated in
the empirical evaluation:
– Correctness
– Time
Research Questions
• We distinguish the following research questions.
– RQ1: Does the availability of domain-specific models increase the
correct comprehension of the configuration knowledge?
– RQ2: Does the availability of domain-specific models reduce the time
that is needed to correctly comprehend the configuration knowledge?
– RQ3: Do individual differences in the expertise of product
line engineers affect the correct comprehension of the
configuration knowledge?
– RQ4: Which types of configuration knowledge comprehension tasks
benefit most from domain-specific techniques and from other
code-oriented techniques?
Hypotheses
• Associated with the first two research questions are two null
hypotheses:
– H10: The correct comprehension of the configuration knowledge does
not depend on the different specification techniques.
– H20: The time to correctly comprehend the configuration knowledge
does not depend on the different specification techniques.
• The alternative hypotheses are the following:
– H11: The correct comprehension of the configuration knowledge
depends on the different specification techniques.
– H21: The time to correctly comprehend the configuration knowledge
depends on the different specification techniques.
Empirical Evaluation
• Correct Answers and Time Analysis
– The correspondence between the participants' number of correct
answers and the tools/product lines
– The influence of each approach on the time that each
participant spent answering the questionnaire.
• Expertise Analysis
– The influence of the participants' expertise on the number of
correct answers
First Evaluation
• The study involved six postgraduate students answering three
questionnaires, one for each product line, following the Latin
Square Design
• Example questions:
– Which abstraction(s)/code asset(s) is (are) related to the feature X?
– How many abstraction(s)/code asset(s) is (are) mapped to the feature Y?
Participants   E-Shop   OLIS   Buyer
P1 and P4      G+       PV     C
P2 and P5      C        G+     PV
P3 and P6      PV       C      G+
(G+ = GenArch+, PV = pure::variants, C = CIDE)
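The assignment table can be expanded into one row per participant and product line, which makes the balance of the design easy to inspect. This is a sketch built only from the table above; the variable names are chosen here for illustration.

```r
# Assignment from the table: rows are participant groups, columns are product lines
square <- matrix(
  c("G+", "PV", "C",
    "C",  "G+", "PV",
    "PV", "C",  "G+"),
  nrow = 3, byrow = TRUE,
  dimnames = list(group = c("P1,P4", "P2,P5", "P3,P6"),
                  product_line = c("E-Shop", "OLIS", "Buyer"))
)

# Long format: one row per (participant, product line) pair
design <- expand.grid(participant = paste0("P", 1:6),
                      product_line = colnames(square),
                      stringsAsFactors = FALSE)

# P1 and P4 use row 1, P2 and P5 row 2, P3 and P6 row 3
group <- (as.integer(sub("P", "", design$participant)) - 1) %% 3 + 1
column <- match(design$product_line, colnames(square))
design$technique <- square[cbind(group, column)]

# Each technique is used twice on every product line ...
per_line <- table(design$product_line, design$technique)
print(per_line)

# ... and exactly once by every participant
per_participant <- table(design$participant, design$technique)
print(per_participant)
```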
Product Lines vs. Correct Answers
Buyer - highest number of correct answers
Lowest number of features and no diversity of frameworks
[Figure: number of correct answers, with bars labeled G+, PV and C]
Product Lines vs. Correct Answers
OLIS - intermediate number of correct answers
Well-modularized features
[Figure: number of correct answers, with bars labeled G+, PV and C]
Product Lines vs. Correct Answers
E-Shop - lowest number of correct answers
Features not well modularized
[Figure: number of correct answers, with bars labeled G+, PV and C]
Techniques vs. Correct Answers
CIDE - lowest number of correct answers
in the E-Shop product line
[Figure: number of correct answers, with bars labeled G+, PV and C]
Techniques vs. Correct Answers
pure::variants – more correct answers than CIDE
for the E-Shop product line
[Figure: number of correct answers, with bars labeled G+, PV and C]
Techniques vs. Correct Answers
CIDE – more correct answers than pure::variants
for the OLIS product line
[Figure: number of correct answers, with bars labeled G+, PV and C]
Techniques vs. Correct Answers
[Figure: number of correct answers, with bars labeled G+, PV and C and annotated with ✓, ± and ✗ marks]
Techniques vs. Time
• Time required to answer the questionnaire.
          G+        PV        C
e-Shop    1:35:47   1:43:29   1:33:45
OLIS      1:27:51   1:45:42   1:31:09
Buyer     0:43:05   1:17:42   1:14:42
• Average time to correctly answer a question.
G+        PV        C
0:02:57   0:04:39   0:03:10
Participants' Expertise
             P1     P2     P3     P4     P5     P6
Spring       4      4      4      2      1      3
Struts       2      2      1      4      2      2
Spring MVC   4      4      4      5      3      5
Hibernate    1      1      1      1      4      1
iBatis       2      2      2      3      1      3
Spring-DM    1      1      1      1      1      1
Jadex        2      1      1      3      1      1
e-Shop       1.5    1.25   1      2.25   2      1
OLIS         3      3      2.75   3.5    1.75   3.25
Buyer        2      1      1      3      1      1
Expertise Results
[Figure: expertise chart, with data points labeled G+, PV and C]
Expertise Analysis
[Figure: expertise chart, with data points labeled G+, PV and C]
Statistical Results – Answers
• Kruskal-Wallis
Kruskal-Wallis chi-squared   df   p-value
17.6812                      14   0.2217
05/11/2015 Nome do Autor © LES/PUC-Rio
Statistical Results – Time
• ANOVA
       DF   Sum Sq   Mean Sq   F Value   Pr(>F)
Tool   2    45259    22629.4   2.6310    0.1324
Second Evaluation
• Our study involved fifteen postgraduate students answering three
questionnaires, one for each product line, following the Latin
Square Design
• Questions were divided into four different comprehensibility
tasks:
– Identifying all files in which the source code of a feature occurs
– Identifying all features that occur in a certain file
– Identifying all framework-concept instances that implement
a certain feature
– Investigating dependencies between framework-concept
instances
Statistical Results – Answers
[Figure: chart annotated with GenArch+, ~26.28% and ~34.53%]
• ANOVA
        DF   Sum Sq   F Value   Pr(>F)
Tools   2    24.808   18.3421   1.075e-05
Statistical Results – Answers
[Figure: chart annotated with GenArch+, ~26.28% and ~34.53%]
• Tukey
      diff       lwr          upr          p adj
G-C   1.526667   0.7805027    2.2728306    0.0000777
P-C   -0.48000   -1.2261640   0.2661640    0.2641892
P-G   -2.00666   -2.7528306   -1.2605027   0.0000013
Statistical Results – Time
[Figure: chart annotated with GenArch+, ~19.41% and ~62.65%]
• Kruskal-Wallis
Kruskal-Wallis chi-squared   df   p-value
4.0495                       2    0.1320
Individual Task Performance - Answers
[Figure: "Average of Correct Answers per Task", y axis from 0 to 0.9, tasks T1 to T4, series G+, PV and C, annotated with 50% and 87%]
Individual Task Performance - Time
[Figure: "Average of Time per Task", y axis from 0 to 1600, tasks T1 to T4, series G+, PV and C, annotated with 5x]
Latin Square by Example
Example 2
Ribeiro, M. et al.
Empirical Evaluation
• Goal: to better understand the use of Emergent Interfaces in
maintenance activities on preprocessor-based software product
lines.
• Hypotheses
– H1: With and without Emergent Interfaces, developers spend
on average the same time to complete a maintenance task
involving feature dependencies.
– H2: With and without Emergent Interfaces, developers commit
on average the same number of errors when performing a
maintenance task involving feature dependencies.
Design
• The study involved 24 undergraduate and postgraduate students,
each working on both product lines (one with and one without
Emergent Interfaces), following the Latin Square Design
Participants   Best Lap     Mobile Media
P1 … n         EI           Without EI
P2 … n         Without EI   EI
Collecting the metrics
• Eclipse plug-in that consists of two buttons
– Play/Pause
– Finish
Maintenance Tasks
• Implementation of a new requirement
– In Best Lap, the game score should be allowed to be negative
as well as positive.
– In Mobile Media, subjects should replace the current web
image server with another one that can provide more and
different image formats.
• Fixing an unused variable
Experiment Execution
• To make subjects aware of preprocessors, VSoC, Feature
Dependencies, Emergent Interfaces, and Emergo, a one-hour
training session was provided before running the experiment.
• One toy example, JCalc, was used to explain these concepts
and to exemplify the New requirement and Unused variable tasks
• The authors ran the experiment with 10 MSc/PhD students
at the Federal University of Pernambuco, Brazil (Round 2) and
replicated it with 14 undergraduate students
at the Federal University of Alagoas, Brazil (Round 3).
Data Interpretation
• New requirement task – Round 2
Data Interpretation
• New requirement task – Round 3
Data Interpretation
• Unused Variable – Time Penalty
Data Interpretation
• Unused Variable – Round 2
Data Interpretation
• Unused Variable – Round 3
Conclusions
• Question 1: Do Emergent Interfaces reduce effort during
maintenance tasks involving feature code dependencies in
preprocessor-based systems?
– We conclude that Emergent Interfaces reduce the time spent to
accomplish the New-requirement task. Without them, subjects
are 3 and 3.1 times slower.
– For the Unused-variable task, the time difference with and
without Emergent Interfaces is smaller than for the
New-requirement task. On average, subjects are 1.5 and 1.68
times slower without Emergent Interfaces.
Conclusions
• Question 2: Do Emergent Interfaces reduce the number of
errors during maintenance tasks involving feature code
dependencies in preprocessor-based systems?
– The results show that, with Emergent Interfaces, subjects
are more likely to be aware of feature dependencies. Hence, the
probability of changing the impacted features increases, and
they do not press the Finish button rashly.
– Subjects without Emergent Interfaces committed 84% and
81% of the errors.
– Subjects without Emergent Interfaces tend to write more
incorrect feature expressions than those with Emergent
Interfaces: 75% and 78%
Statistical Analysis of Latin Square Designs
The R Tool
• Free tool for statistical analysis
• Based on the R language
– It is used by typing commands in a console
– Commands operate on data or on the results of functions
– Data can be arranged in a vector, matrix or data frame.
• http://cran.r-project.org/
The R Tool
Representing the Latin Square
Replica, Estudante, EstudoDeCaso, Tecnica, Resposta
1, 1, bY, G, 6.60
2, 4, bY, G, 8.00
1, 1, oL, C, 1.80
2, 4, oL, C, 2.10
1, 1, eS, P, 4.10
2, 4, eS, P, 4.00
1, 2, bY, P, 4.50
2, 5, bY, P, 4.60
1, 2, oL, G, 6.10
2, 5, oL, G, 6.15
1, 2, eS, C, 4.65
2, 5, eS, C, 5.15
1, 3, bY, C, 8.00
2, 6, bY, C, 6.00
1, 3, oL, P, 5.10
2, 6, oL, P, 4.90
1, 3, eS, G, 6.90
2, 6, eS, G, 7.20
…
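Before running any test it helps to confirm that the file really encodes a Latin square. The sketch below embeds only the 18 rows listed above (further rows of the original file are elided and not reproduced) and checks each replica separately; read.csv with strip.white is an assumption about how to parse the comma-separated layout.

```r
# The 18 rows listed above, embedded as text so the example is self-contained
txt <- "Replica, Estudante, EstudoDeCaso, Tecnica, Resposta
1, 1, bY, G, 6.60
2, 4, bY, G, 8.00
1, 1, oL, C, 1.80
2, 4, oL, C, 2.10
1, 1, eS, P, 4.10
2, 4, eS, P, 4.00
1, 2, bY, P, 4.50
2, 5, bY, P, 4.60
1, 2, oL, G, 6.10
2, 5, oL, G, 6.15
1, 2, eS, C, 4.65
2, 5, eS, C, 5.15
1, 3, bY, C, 8.00
2, 6, bY, C, 6.00
1, 3, oL, P, 5.10
2, 6, oL, P, 4.90
1, 3, eS, G, 6.90
2, 6, eS, G, 7.20"

# strip.white removes the blank that follows each comma
d <- read.csv(text = txt, strip.white = TRUE)
str(d)

# Within a replica, each technique must appear once per student and once per case study
is_latin_replica <- sapply(split(d, d$Replica), function(rows) {
  all(table(rows$Estudante, rows$Tecnica) == 1) &&
    all(table(rows$EstudoDeCaso, rows$Tecnica) == 1)
})
print(is_latin_replica)
```

Each replica uses its own three students (1 to 3 and 4 to 6), which is why the check is done per replica rather than on the whole table.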
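Commands
• Loading the data
# The file is comma-separated: read it as CSV and strip the blanks after each comma
data.ql <- read.csv("dados-resposta.txt", header = TRUE, strip.white = TRUE)
• Defining the elements of the Latin square
# Design columns become factors; Resposta stays numeric
data.ql$Replica <- factor(data.ql$Replica)
data.ql$Estudante <- factor(data.ql$Estudante)
data.ql$EstudoDeCaso <- factor(data.ql$EstudoDeCaso)
data.ql$Tecnica <- factor(data.ql$Tecnica)
# Make the columns visible by name to the commands that follow
attach(data.ql)
• Plotting the results
plot(Resposta ~ Tecnica, data = data.ql, col = "gray",
xlab = "SPL Tool", ylab = "Answers")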
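Command – Analysis of Variance
# Sources of variation: replica, student within replica, case study and technique
anova.ql <- aov(Resposta ~ Replica + Estudante:Replica
+ EstudoDeCaso + Tecnica, data = data.ql)
summary(anova.ql)
# Non-parametric alternative: kruskal.test() compares the groups of a single factor
kw <- kruskal.test(Resposta ~ Tecnica, data = data.ql)
• Check whether the sample is normally distributed and whether the
groups have the same variance.
– Normal distribution: Shapiro-Wilk
shapiro.test(Resposta)
• If p-value > 0.05, normality is not rejected
– Same variance: Levene
# levene.test() is taken from the lawstat package (an assumption; the original call named no package)
library(lawstat)
levene.test(Resposta, Tecnica)
• If p-value > 0.05, equal variances are not rejected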
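Command – Multiple Comparisons
• ANOVA
– Tukey method
fmTukey <- TukeyHSD(anova.ql, "Tecnica")
fmTukey
• Kruskal
– Nemenyi-Damico-Wolfe-Dunn method
# oneway_test() comes from the coin package; Resposta is the response column of data.ql.
# This plain call is only the starting point: the method also needs rank scores and a max-type statistic.
library(coin)
oneway_test(Resposta ~ Tecnica, data = data.ql)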
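The ANOVA and Tukey steps can be tried end to end on the 18 rows listed in the data slide. This is a sketch using only base R; since the original file has more rows (elided in the slide), the numbers it prints are not the ones reported in the evaluation slides.

```r
# The 18 rows listed in the data slide, embedded so the example is self-contained
txt <- "Replica, Estudante, EstudoDeCaso, Tecnica, Resposta
1, 1, bY, G, 6.60
2, 4, bY, G, 8.00
1, 1, oL, C, 1.80
2, 4, oL, C, 2.10
1, 1, eS, P, 4.10
2, 4, eS, P, 4.00
1, 2, bY, P, 4.50
2, 5, bY, P, 4.60
1, 2, oL, G, 6.10
2, 5, oL, G, 6.15
1, 2, eS, C, 4.65
2, 5, eS, C, 5.15
1, 3, bY, C, 8.00
2, 6, bY, C, 6.00
1, 3, oL, P, 5.10
2, 6, oL, P, 4.90
1, 3, eS, G, 6.90
2, 6, eS, G, 7.20"
d <- read.csv(text = txt, strip.white = TRUE)

# All design columns are categorical; only Resposta is numeric
design_cols <- c("Replica", "Estudante", "EstudoDeCaso", "Tecnica")
d[design_cols] <- lapply(d[design_cols], factor)

# Same model as in the slides: students are nested in replicas
fit <- aov(Resposta ~ Replica + Estudante:Replica + EstudoDeCaso + Tecnica, data = d)
print(summary(fit))

# Pairwise comparison of techniques (rows G-C, P-C, P-G)
tukey <- TukeyHSD(fit, "Tecnica")
print(tukey)

# Rank-based check on the technique factor alone
kw <- kruskal.test(Resposta ~ Tecnica, data = d)
print(kw)
```

Passing data = d keeps every variable tied to one data frame, so the example does not depend on attach().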