Character String Processing
Ivan G. Costa Filho
[email protected]
Centro de Informática
Universidade Federal de Pernambuco
Biologia In Silico - Centro de Informática - UFPE
Topics
• Biological character strings
• Basic problems
– pairwise/multiple alignment
– motif finding
– modeling of protein families
• Methods
– dynamic programming algorithms
– hidden Markov models
– probabilistic methods
Course
• Lectures – March/April
– introduction of basic concepts/methods
– practical classes
• Seminars - April/May
– presentation of course topics
• individual – graduate students
• pairs – undergraduate students
• Project - May to June
– analysis of real data (from the papers discussed) in groups
Evaluation
• 40% - seminar presentations
– evaluation by classmates, and attendance
• 20% - exercise lists
• 40% - group project
– individual grade - each group is responsible for describing the participation of its members
Bibliography
• R Durbin, Sean R Eddy, A Krogh, G Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.
• Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
• See the course page for the literature specific to each lecture …
– www.cin.ufpe.br/~igcf
Molecular Biology
Understanding life at the cellular level
• How genetic information is inherited
• How genetic information influences cellular processes
• How genes work together to carry out a cellular function
Genetic Information - DNA
• DNA (deoxyribonucleic acid)
– chain of nucleotides
– 4 types: A, C, G, T
– forms a double strand through complementarity.
• A=T and C=G (A pairs with T, C pairs with G)
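The complementarity rule can be sketched in a few lines of code. This is only an illustration: the function name `reverse_complement` and the example sequence are assumptions, not part of the course material.

```python
# Watson-Crick pairing: A <-> T and C <-> G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna: str) -> str:
    """Return the complementary strand, written in the usual 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCTGACC"))  # GGTCAGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check for the pairing table.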
Central Dogma - Transcription
• Transcription
– DNA to RNA
• RNA (ribonucleic acid)
– single-stranded
– 4 types: A, C, G, U
– unstable molecules
– transports information from the nucleus to the cytoplasm
Central Dogma - Transcription
• Transcription – copies the base sequence of the DNA into RNA (with U instead of T).
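A minimal sketch of transcription as a string operation, assuming the input is the coding (sense) strand, so the mRNA has the same base sequence with U in place of T. The function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same bases, with U instead of T."""
    return coding_strand.upper().replace("T", "U")

print(transcribe("ATGGTGCTGTCTGCC"))  # AUGGUGCUGUCUGCC
```

In the cell the RNA is synthesized from the template strand, so a model that starts from the template would take the reverse complement first.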
Central Dogma - Translation
• Translation
– RNA -> Proteins
– carried out by the ribosome
– genetic code
• Proteins
– chain of amino acids
– 20 different types
– acquires a three-dimensional structure
– functional entities of the cell
Translation - Genetic Code
• Each codon (a combination of 3 bases) encodes one of the 20 amino acids or a stop signal.
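The codon lookup can be sketched as a dictionary from triplets to one-letter amino acid codes. The sketch below assumes the standard genetic code, written compactly in U, C, A, G order with `*` marking stop codons; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGUGCUGUCUGCCUAA"))  # MVLSA
```

The sketch reads a single frame and ignores start-codon search; a real gene finder would consider all reading frames.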
Central Dogma
• Dogma: flow of information DNA → mRNA → Protein
• Gene: segment of DNA encoding a protein.
• Transcript: segment of RNA transcribed from a gene.
• One gene corresponds to one protein and one cellular function.
Control of Gene Expression
• How is gene expression controlled?
• Certain proteins, called transcription factors, bind to the DNA and are responsible for initiating transcription.
Control of Gene Regulation
Bioinformatics
• Manage molecular biological data
– Store in databases, organise, formalise, describe...
• Compare molecular biological data
• Find patterns in molecular biological data
– phylogenies
– correlations (sequence / structure / expression / function /
disease)
Goals:
• characterise biological patterns & processes
• predict biological properties
– low level data ⇒ high level properties
(e.g., sequence ⇒ function)
Bioinformatics: neighbour disciplines
• Computational biology
– Broader concept: includes computational ecology,
physiology, neurology etc...
• -omics:
– Genomics
– Transcriptomics
– Proteomics
• Systems biology
– Putting it all together...
– Building models, identify control & regulation
Molecular biology data...
• DNA sequences
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA
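The two records above are in FASTA format: a header line starting with `>` followed by sequence lines. A minimal reader might look like the sketch below; the function name `parse_fasta` and the shortened example records are assumptions for illustration.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a dict mapping each header (without '>') to its sequence."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            chunks[name] = []
        elif name is not None:
            # Sequence lines are concatenated until the next header.
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}

example = ">alpha-D\nATGCTGACC\nGACTCT\n>alpha-A\nATGGTGCTG\n"
print(parse_fasta(example))
```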
Molecular biology data...
• Amino acid sequences
• Protein structure:
– X-ray crystallography
– NMR
Cell biology & proteomics data...
• Subcellular localization
Prediction Methods
• Homology / Alignment
• Simple pattern (“word”) recognition
• Statistical methods
– Weight matrices: calculate amino acid probabilities
– Other examples: Regression, variance analysis, clustering
• Machine learning
– Like statistical methods, but parameters are estimated by
iterative training rather than direct calculation
– Examples: Neural Networks (NN), Hidden Markov Models
(HMM), Support Vector Machines (SVM)
• Combinations
Similarity between sequences
If two sequences look similar, the explanation may be:
• Homology (common descent)
• Convergent evolution (common function → common selective pressure)
• Chance!
Sequences are related
• Darwin: all organisms are related through descent with modification
• => Sequences are related through descent with modification
• => Similar molecules have similar functions in different organisms
Phylogenetic tree based on ribosomal RNA: three domains of life
Sequences are related II
Phylogenetic tree of globin-type proteins found in humans
Why compare sequences?
Protein 1: binds oxygen
Sequence similarity
Protein 2: binds oxygen ?
• Determination of evolutionary relationships
• Prediction of protein function and structure (database searches).
Biological Databases
• Vast amounts of biological and sequence data are freely available through online databases
• Use computational algorithms to efficiently store large amounts of biological data
Examples
• NCBI GenBank
http://ncbi.nih.gov
Huge collection of databases, the most prominent being the nucleotide sequence database
• Protein Data Bank
http://www.pdb.org
Database of protein tertiary structures
• SWISSPROT
http://www.expasy.org/sprot/
Database of annotated protein sequences
• PROSITE
http://kr.expasy.org/prosite
Database of protein active site motifs
Sequence Alignment
BLAST
• A computational tool that allows us to
compare query sequences with entries in
current biological databases.
• A great tool for predicting the function of an unknown sequence based on alignment similarities to known genes.
BLAST
Some Early Roles of Bioinformatics
• Sequence comparison
• Searches in sequence databases
Biological Sequence Comparison
• Needleman-Wunsch, 1970
– Dynamic programming algorithm to align sequences
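A minimal sketch of global alignment by dynamic programming in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)  # 2
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths.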
Searching for Localization Signals
Protein sorting in eukaryotes
• Proteins belong in different organelles of the cell – and some even have
their function outside the cell
• In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"
Protein sorting: secretory pathway / ER
Secretory proteins have a signal peptide
Initially, they are transported across the ER membrane
Signal peptides
A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region.
Signal peptides differ between proteins, and can be hard to recognize.
Simple pattern (“word”) recognition
Example:
PROSITE entry PS00014, ER_TARGET:
Endoplasmic reticulum targeting sequence (“KDEL signal”).
Pattern: [KRHQSA]-[DENQ]-E-L
NB: only yes/no answers!
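A pattern like this maps directly onto a regular expression, which makes the yes/no nature of the answer concrete. The converter below is a simplified sketch that handles only the common PROSITE syntax elements; the function names and the test sequences are made up for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern to a Python regex (common syntax only)."""
    regex = pattern.rstrip(".")                        # patterns may end with a period
    regex = regex.replace("-", "")                     # '-' only separates positions
    regex = regex.replace("x", ".")                    # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # {AB} = anything but A or B
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # N- / C-terminal anchors
    return regex

def has_match(pattern: str, protein: str) -> bool:
    """Yes/no answer: does the protein contain the pattern anywhere?"""
    return re.search(prosite_to_regex(pattern), protein) is not None

print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L"))  # [KRHQSA][DENQ]EL
print(has_match("[KRHQSA]-[DENQ]-E-L", "MKWVTFISLLKDEL"))  # True
```

Because the pattern as written on the slide has no terminal anchor, this sketch accepts a match anywhere in the sequence.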
Statistical Methods
• Estimate probabilities for nucleotides / amino acids
• Information content in sequences; logos; Position-Weight Matrices.
• Quantitative answers.
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
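As a sketch of how a position-weight matrix gives a quantitative answer, the ten peptides above can be turned into per-position log-odds scores. The uniform background (1/20 per amino acid), the pseudocount of 1, and all function names are assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_weight_matrix(peptides, pseudocount=1.0):
    """Return one dict per position mapping amino acid -> log2 odds vs. uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(len(peptides[0])):
        counts = Counter(peptide[position] for peptide in peptides)
        matrix.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def score(matrix, peptide):
    """Sum the per-position log-odds weights of a peptide."""
    return sum(column[aa] for column, aa in zip(matrix, peptide))

pwm = position_weight_matrix(PEPTIDES)
print(round(score(pwm, "ALAKAAAAM"), 2))
print(round(score(pwm, "WWWWWWWWW"), 2))
```

Unlike the yes/no pattern match, every peptide receives a score, so candidates can be ranked and a threshold chosen afterwards.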
Motif Finding
Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
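When the motif is implanted without mutations, it can be located by looking for substrings that occur in every sequence. The brute-force sketch below uses short toy sequences rather than the slide data; the function name `shared_kmers` is hypothetical.

```python
def shared_kmers(sequences, k):
    """Return the set of length-k substrings that occur exactly in every sequence."""
    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    common = kmers(sequences[0])
    for seq in sequences[1:]:
        common &= kmers(seq)  # keep only k-mers also present in this sequence
    return common

# Toy example: "acgt" is the only 4-mer present in all three sequences.
print(shared_kmers(["ttacgtaa", "ggacgtcc", "acgtacgt"], 4))  # {'acgt'}
```

This exact-match approach stops working as soon as the implanted copies carry mutations.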
Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
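The implanting experiment can be reproduced with a short script. This is a sketch under assumptions: the function names, the random seed, and the background sequence are illustrative, and mutated positions are written in lower case to mirror the slide.

```python
import random

def mutate(motif, num_mutations, rng):
    """Return a copy of motif with num_mutations distinct positions substituted."""
    letters = list(motif)
    for pos in rng.sample(range(len(motif)), num_mutations):
        # Pick a different base and write it in lower case to mark the mutation.
        letters[pos] = rng.choice([b for b in "acgt" if b != letters[pos].lower()])
    return "".join(letters)

def implant(sequence, motif, num_mutations, rng):
    """Overwrite a random window of sequence with a mutated copy of motif."""
    start = rng.randrange(len(sequence) - len(motif) + 1)
    mutated = mutate(motif, num_mutations, rng)
    return sequence[:start] + mutated + sequence[start + len(motif):]

rng = random.Random(0)  # fixed seed so the run is repeatable
background = "".join(rng.choice("acgt") for _ in range(60))
print(implant(background, "AAAAAAAAGGGGGGG", 4, rng))
```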
Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
Why is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
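The difficulty can be quantified with the Hamming distance: each implanted instance differs from the motif AAAAAAAAGGGGGGG in 4 positions, but two instances can differ from each other in up to 8. A small sketch for the two instances compared above (the function name `hamming` is illustrative):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

motif = "AAAAAAAAGGGGGGG"
first = "AgAAgAAAGGttGGG"
second = "cAAtAAAAcGGcGGG"

print(hamming(motif, first))   # 4
print(hamming(motif, second))  # 4
print(hamming(first, second))  # 7
```

With only 8 of 15 positions agreeing between these two instances, pairwise comparison of substrings gives a weak signal against the random background.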
Next Lecture
• Read chapter 1 of Durbin
• Introduction to dynamic programming algorithms (10/08)
Acknowledgments
• Some slides taken from
– Biological Sequence Analysis course, CBS, Technical University of Denmark
– Neil Jones, University of California at San Diego
– Alexander Schliep (Max Planck Institute for Molecular Genetics, Berlin)