The microarray data analysis
Ana Deckmann
Carla Judice
Jorge Lepikson
Jorge Mondego
Leandra Scarpari
Marcelo Falsarella Carazzolle
Michelle Servais
Tais Herig
Summary
- Statistics background
- Introduction to microarray
- Pre-processing microarray data
- Statistical analysis
- D-maps
Statistics background
Error model
- measurement = truth + error
- error = bias + variance
Bias describes a systematic tendency of the measurement. Example: the dyes Cy3 and Cy5 do not have the same efficiency. Remedy: normalization.
Variance is random error and is often normally distributed. Examples: instrument imperfection and biological variation. Remedy: experimental replicates (technical and biological) and statistics.
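The error model can be illustrated with a small simulation. This is only a sketch: the function name `simulate_measurements` and all numbers (truth, bias, noise level) are hypothetical values chosen for illustration, and the random part of the error is assumed to be normally distributed.

```python
import random
import statistics


def simulate_measurements(truth, bias, noise_sd, n_replicates, seed=0):
    """Draw replicates following measurement = truth + bias + random noise."""
    rng = random.Random(seed)
    return [truth + bias + rng.gauss(0.0, noise_sd) for _ in range(n_replicates)]


# Hypothetical values, not taken from any real experiment.
replicates = simulate_measurements(truth=10.0, bias=0.5, noise_sd=1.0,
                                   n_replicates=1000)
mean_value = statistics.mean(replicates)

# Averaging replicates shrinks the random part of the error, but the
# systematic part remains: the mean stays close to truth + bias = 10.5,
# not to the truth itself.
print(round(mean_value, 2))
```

This is why replicates alone are not enough: the bias has to be removed by a separate step such as normalization.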
Introduction to microarray
- Three different microarray technologies:
  - Spotted cDNA microarrays (500 to 2500 bp)
  - Spotted oligonucleotide microarrays (30 to 70 bp)
  - Affymetrix chips (25 bp)
- Can be used for:
  - Differential gene expression studies, gene co-regulation studies, gene function identification studies, time-course studies, dose-response studies, clinical diagnosis, …
Two color architecture
Codelink architecture (one color)
Probes: 30-mers, 90% within 550 bases of the 3' end
Targets: 10 µg of biotinylated cRNA
Scanning
Figure: excitation with the red and green lasers, emission, and overlay of the two images. Labels: higher frequency, more energy; lower frequency, less energy.
Ludwig scanner
Ludwig flags:
(0) Intensity <= background
(1) Irregular spots
(3) Spot ok
(4) Saturated
Scarpari, Leandra, 2006, doctoral thesis
Codelink scanner
Codelink flags :
(L) near background
(C) contaminated
(S) saturated
(M) masked
(G) good
LGE scanner
Defined intensity:
- Int Cy3 = Area Cy3 * (median(Int Cy3) - median(Bkgd Cy3))
- Int Cy5 = Area Cy5 * (median(Int Cy5) - median(Bkgd Cy5))
LGE defined flags:
(0) Spot ok
(1) Saturated spot
(2) Int/Back <= 1.05
(3) Area <= 110 or 50 (9x9 or 11x11)
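A minimal sketch of the intensity definition and the LGE flags follows. The function names, the order in which the flags are checked, and the inputs `saturated` and `min_area` are assumptions made for illustration; the slides do not say how saturation is detected or which area limit (110 or 50) belongs to which layout, so both are left to the caller.

```python
def integrated_intensity(area, target_median, bkgd_median):
    """Integrated intensity: area * (median target - median background)."""
    return area * (target_median - bkgd_median)


def lge_flag(area, target_median, bkgd_median, saturated, min_area):
    """Return an LGE-style flag (0 to 3) for one spot in one channel.

    Assumed check order: saturation, then signal/background, then area.
    """
    if saturated:
        return 1  # saturated spot
    # Same test as Int/Back <= 1.05, written without a division so that a
    # zero background does not raise an error.
    if target_median <= 1.05 * bkgd_median:
        return 2
    if area <= min_area:
        return 3
    return 0  # spot ok


# Hypothetical spot: 120 pixels, median signal 3000, median background 200.
print(integrated_intensity(120, 3000, 200))                          # 336000
print(lge_flag(120, 3000, 200, saturated=False, min_area=110))        # 0
print(lge_flag(120, 205, 200, saturated=False, min_area=110))         # 2
```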
(Target median - Bkgd median) * Area = integrated intensity
Figure labels: pixels in, pixels out; pixels in > pixels out.
Example spots (r = Cy5/Cy3):
Cy3 = 3329280; Cy5 = 2251624; r = 0.67; fold = -1.49
Cy3 = 222824; Cy5 = 15488; r = 0.069; fold = -14.5; flag = 0
Cy3 = 481536; Cy5 = 676000; r = fold = 1.40; flag = 0
Cy3 = 293664; Cy5 = 485368; r = 1.65; flag = 0
Cy3 = 6400; Cy5 = -3584; NA (signal:noise <= 1); flag = 2
Cy3 = 8767720; Cy5 = 1349296; r = 0.15; fold = -6.7; flag = 1
Pre-processing microarray data
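- Bioconductor repository (http://www.bioconductor.org/)
- Log intensities
Figure: R vs G scatter plot with the line R = G. Most genes have low expression levels. What happens here?
Figure: log2(R) vs log2(G) scatter plot with the line log2(R) = log2(G).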
M vs A plot
Transformed data {(M, A)i}:
M = log2(R) - log2(G) (minus)
A = ½·[log2(R) + log2(G)] (add)
Figure labels: up-regulated genes, down-regulated genes.
Non-differentially expressed genes now lie along the horizontal line M = 0, because M = 0 means log2(R) - log2(G) = 0, i.e. R = G.
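The M and A definitions translate directly into code. The sketch below uses only the Python standard library; the function name `ma_transform` and the input values are hypothetical.

```python
import math


def ma_transform(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)          # log ratio ("minus")
    a = 0.5 * (math.log2(red) + math.log2(green))  # mean log intensity ("add")
    return m, a


# A spot with R = G falls on the horizontal line M = 0.
print(ma_transform(8.0, 8.0))  # (0.0, 3.0)

# A spot four times brighter in red than in green has M = 2.
print(ma_transform(4.0, 1.0))  # (2.0, 1.0)
```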
Density plot
log2R = red channel signal
log2G = green channel signal
Print-tip box plot
Figure: box plots for print-tip groups 1 to 16.
Normalization within slides
Expectation: most genes are non-differentially expressed, i.e. most of the data points should lie around M = 0.
- Median normalization: sets the median of the log intensity ratios to zero (median value = 0)
- Lowess normalization: global lowess normalization
- Print-tip normalization: print-tip group lowess normalization
- Scaled print-tip: scaled print-tip group lowess normalization
  X*ij = (Xij - median(GRIDj)) / sd(GRIDj)
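Two of these steps are simple enough to sketch with the standard library: median normalization and the grid scaling formula X*ij = (Xij - median(GRIDj)) / sd(GRIDj). The lowess fit itself is not implemented here. The function names and the data are hypothetical, and using the sample standard deviation for sd is an assumption.

```python
import statistics


def median_normalize(m_values):
    """Shift the log ratios so that their median becomes zero."""
    center = statistics.median(m_values)
    return [m - center for m in m_values]


def scale_by_grid(m_by_grid):
    """Center and scale each print-tip grid: (x - median) / sd."""
    scaled = {}
    for grid, values in m_by_grid.items():
        center = statistics.median(values)
        spread = statistics.stdev(values)  # sample standard deviation (assumed)
        scaled[grid] = [(x - center) / spread for x in values]
    return scaled


# Hypothetical log ratios with a dye bias that shifts them away from zero.
normalized = median_normalize([0.5, 0.7, 0.4, 0.9, 0.6])
print(statistics.median(normalized))  # 0.0

# Two hypothetical grids with different spreads end up on the same scale.
print(scale_by_grid({1: [1.0, 2.0, 3.0], 2: [10.0, 20.0, 30.0]}))
```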
Normalization across slides
- Quantile
  Figure: QQ plot against the mean across 8 slides.
- Lowess (applied to one-color microarrays)
  Transformed data {(M, A)i}:
  M = log2(Int1) - log2(Int2); A = ½·[log2(Int1) + log2(Int2)]
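Quantile normalization can be sketched as follows, assuming the reference distribution is the mean across slides at each rank, as the QQ plot label suggests. The function name `quantile_normalize` and the data are hypothetical, and ties are broken by position, which a production implementation would handle more carefully.

```python
def quantile_normalize(slides):
    """Give every slide the same intensity distribution.

    slides: list of equal-length lists, one list of intensities per slide.
    """
    n_spots = len(slides[0])
    sorted_slides = [sorted(slide) for slide in slides]
    # Reference distribution: mean across slides at each rank.
    reference = [
        sum(slide[rank] for slide in sorted_slides) / len(slides)
        for rank in range(n_spots)
    ]
    normalized = []
    for slide in slides:
        # Indices of the spots, ordered from lowest to highest intensity.
        order = sorted(range(n_spots), key=lambda i: slide[i])
        out = [0.0] * n_spots
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical slides with three spots each.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step each slide contains exactly the same set of values, so only the ranking of the spots within a slide differs between slides.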
Statistical analysis
- t-test
The t statistic down-weights the mean when the standard deviation is large, and vice versa:
T = mean(x) / SE(x)
where SE(x) = std.dev(x) / sqrt(N) (standard error of the mean)
The blue gene has a lower T value than the red gene.
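The effect can be checked with a short sketch. It uses the usual definition of the standard error, the sample standard deviation divided by the square root of N. The function name `t_statistic` and the two sets of replicate log ratios are hypothetical.

```python
import math
import statistics


def t_statistic(values):
    """One-sample t statistic: mean divided by the standard error of the mean."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return statistics.mean(values) / standard_error


# Two hypothetical genes with the same mean log ratio (1.0) but
# different variability across four replicates.
consistent_gene = [1.0, 1.1, 0.9, 1.0]
noisy_gene = [1.0, 2.0, 0.0, 1.0]

# The noisy gene gets the lower T value although the means are equal.
print(round(t_statistic(consistent_gene), 2))
print(round(t_statistic(noisy_gene), 2))
```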
Top table and volcano plot
p.value     F.change    GENE
1.01E-07    -1.5        interleukin-18 binding protein
3.94E-06    -1.3234     Matrix metalloproteinase 3
0.000734    -1.93895    leukocyte integrin alpha chain
7.25E-05    1.960643    azurocidin 1 preproprotein
1.38E-09    2.317313    Macrophage-stimulating protein
6.82E-05    2.34858     alpha1-antichymotrypsin
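As a sketch, the rows of the top table can be ranked by p-value and turned into volcano plot coordinates. The axes used here (signed log2 fold change against -log10 of the p-value) are the common convention and an assumption about how the plot in this deck was drawn. The pairing of values into rows follows the order in which they are listed, and `volcano_point` is a hypothetical helper name.

```python
import math

# Rows from the top table: (gene, signed fold change, p-value).
top_table = [
    ("interleukin-18 binding protein", -1.5, 1.01e-07),
    ("Matrix metalloproteinase 3", -1.3234, 3.94e-06),
    ("leukocyte integrin alpha chain", -1.93895, 0.000734),
    ("azurocidin 1 preproprotein", 1.960643, 7.25e-05),
    ("Macrophage-stimulating protein", 2.317313, 1.38e-09),
    ("alpha1-antichymotrypsin", 2.34858, 6.82e-05),
]


def volcano_point(fold_change, p_value):
    """Return (x, y) = (signed log2 fold change, -log10 p-value)."""
    # A signed fold of -f corresponds to a ratio of 1/f, i.e. log2 = -log2(f).
    x = math.copysign(math.log2(abs(fold_change)), fold_change)
    y = -math.log10(p_value)
    return x, y


# Most significant genes first.
ranked = sorted(top_table, key=lambda row: row[2])
for gene, fold, p in ranked:
    x, y = volcano_point(fold, p)
    print(f"{gene}: x={x:.2f}, y={y:.2f}")
```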
Fold change =
  ratio, if ratio >= 1
  -1/ratio, if ratio < 1
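The piecewise definition is short enough to write out as a sketch. The function name `fold_change` is hypothetical; the ratios below are the r values shown with the example spots of the LGE scanner earlier in the deck.

```python
def fold_change(ratio):
    """Signed fold change: ratio if ratio >= 1, otherwise -1/ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio if ratio >= 1 else -1.0 / ratio


print(round(fold_change(0.67), 2))   # -1.49
print(round(fold_change(1.40), 2))   # 1.4
print(round(fold_change(0.069), 1))  # -14.5
print(round(fold_change(0.15), 1))   # -6.7
```

With this convention a two-fold increase (2) and a two-fold decrease (-2) have the same magnitude, and no value falls between -1 and 1.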
Cluster data analysis
Program goal
● Automate the data analysis
● Different formats
  - GeneTAC (LGE)
  - ScanArray (Ludwig)
  - CodeLink
  - NimbleGen (future)
Program features
● Allows the creation of different projects
● Organized in stages
● Portuguese and English
● Languages: cgi, R (statistical analysis)
● Database: MySQL
Program structure
Project definition
LGE and Ludwig
Slide configuration
CodeLink
Slide file submission
Data selection
Normalization
Statistical analyses
Program structure: project definition
● Create / select a project
● Define the pattern
  - Number of functional plates
Program structure: slide files
● File submission
● Group definition
● Channel definition
Program structure: data selection
● Exclusion of unwanted spots
  - Different ways of displaying the data
  - Different filters
  - Images
Program structure: normalization
● Different methods
● Options
● Visualization
Program structure: statistical analyses
● Fold change
● P-value
Plots: slide
Grid
(Source: Leandra Scarpari)
Plots: M vs A plot
M = log2(R/G)
A = ½ log2(RG)
(Source: Leandra Scarpari)
Plots: M vs A plot
(Source: Ana Deckmann)
Plots: density
(Source: Leandra Scarpari)
Plots: volcano plot
Fold change: comparison scale between the ratios
(the larger the absolute value, the more differentially expressed)
P-value: reproducibility of the data
(the smaller the value, the more reproducible the data)
(Source: Leandra Scarpari, Ana Deckmann)
Plots: clustering
Pattern search
(Source: Ana Deckmann)
End
Box plot
Comparison of normalization methods for Codelink Bioarray data
Differences between pairs of arrays in the technical replicates:
(1) Array 1 vs array 4
(2) Array 4 vs array 5
BMC Bioinformatics 2005, 6:309
- Within-slide normalization
Print-tip normalization
Figure: box plots before normalization (no norm) and after normalization (print tip, scaled print tip).
Nucleic Acids Research, 2002, vol. 30, no. 4