String Processing
Ivan G. Costa Filho
Centro de Informática, Universidade Federal de Pernambuco
Biologia In Silico - Centro de Informática - UFPE

Topics
• Biological sequences (character strings)
• Basic problems
  – pairwise/multiple alignment
  – motif search
  – modeling of protein families
• Methods
  – dynamic programming algorithms
  – hidden Markov models
  – probabilistic methods

Course
• Lectures - March/April
  – introduction of basic concepts/methods
  – practical classes
• Seminars - April/May
  – presentation of course topics
    • individual: graduate students
    • pairs: undergraduate students
• Project - May to June
  – group analysis of real data (from the papers discussed)

Assessment
• 40% - seminar presentations
  – peer evaluation and attendance
• 20% - exercise lists
• 40% - group project
  – individual grade: each group is responsible for describing each member's participation

Bibliography
• R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
• N. Jones, P. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
• See the course page for the specific literature for each lecture: www.cin.ufpe.br/~igcf

Molecular Biology

Understanding life at the cellular level
• How genetic information is inherited
• How genetic information influences cellular processes
• How genes work together to carry out a cellular function

Genetic Information - DNA
• DNA (deoxyribonucleic acid)
  – chain of nucleotides
  – 4 types: A, C, G, T
  – forms a double strand through base complementarity.
    • A pairs with T and C pairs with G

Central Dogma - Transcription
• Transcription
  – DNA to RNA
• RNA (ribonucleic acid)
  – single-stranded
  – 4 types: A, C, G, U
  – unstable molecules
  – carries information from the nucleus to the cytoplasm

Central Dogma - Transcription
• Transcription
  – copies the base sequence of the DNA into RNA (with U instead of T).

Central Dogma - Translation
• Translation
  – RNA -> proteins
  – carried out by the ribosome
  – genetic code
• Proteins
  – chain of amino acids
  – 20 different types
  – folds into a three-dimensional structure
  – functional entities of the cell

Translation - Genetic Code
• Each codon (a combination of 3 bases) encodes one of the 20 amino acids.

Central Dogma
• Dogma: flow of information DNA -> mRNA -> protein
• Gene: segment of DNA coding for a protein.
• Transcript: segment of RNA transcribed from a gene.
• One gene corresponds to one protein and one cellular function.

Control of Gene Expression
• How is gene expression controlled?
• Certain proteins, the transcription factors, bind to DNA and are responsible for initiating transcription.

Control of Gene Regulation

Bioinformatics
• Manage molecular biological data
  – Store in databases, organise, formalise, describe...
• Compare molecular biological data
• Find patterns in molecular biological data
  – phylogenies
  – correlations (sequence / structure / expression / function / disease)
Goals:
• characterise biological patterns & processes
• predict biological properties
  – low level data ⇒ high level properties (e.g., sequence ⇒ function)

Bioinformatics: neighbour disciplines
• Computational biology
  – Broader concept: includes computational ecology, physiology, neurology etc...
• -omics:
  – Genomics
  – Transcriptomics
  – Proteomics
• Systems biology
  – Putting it all together...
  – Building models, identifying control & regulation

Molecular biology data...
• DNA sequences
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA

Molecular biology data...
• Amino acid sequences
• Protein structure:
  – X-ray crystallography
  – NMR

Cell biology & proteomics data...
• Subcellular localization

Prediction Methods
• Homology / Alignment
• Simple pattern (“word”) recognition
• Statistical methods
  – Weight matrices: calculate amino acid probabilities
  – Other examples: Regression, variance analysis, clustering
• Machine learning
  – Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
  – Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
• Combinations

Similarity between sequences
If two sequences look similar, the explanation may be:
• Homology (common descent)
• Convergent evolution (common function → common selective pressure)
• Chance!

Sequences are related
• Darwin: all organisms are related through descent with modification
• => Sequences are related through descent with modification
• => Similar molecules have similar functions in different organisms
[Figure: phylogenetic tree based on ribosomal RNA: three domains of life]

Sequences are related II
[Figure: phylogenetic tree of globin-type proteins found in humans]

Why compare sequences?
Protein 1: binds oxygen
  (sequence similarity)
Protein 2: binds oxygen?
• Determination of evolutionary relationships
• Prediction of protein function and structure (database searches).
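
The notion of "sequence similarity" used above can be made concrete with a very small sketch. It assumes the two sequences are already aligned and have the same length (no gaps), which real comparisons do not assume; alignment algorithms and substitution scores are needed in practice. The function name `percent_identity` and the toy sequences are illustrative and do not come from the slides.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree.

    Assumption: the sequences are already aligned and contain no gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length (gaps are not handled)")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Compare case-insensitively, position by position.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Toy sequences, not taken from the slides.
    print(f"{percent_identity('GATTACA', 'GACTATA'):.1f}% identical")
```

A high identity does not by itself separate homology from convergent evolution or chance; it only quantifies how similar the two strings look.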
Biological Databases
• Vast amounts of biological and sequence data are freely available through online databases
• Use computational algorithms to efficiently store large amounts of biological data
Examples
• NCBI GenBank http://ncbi.nih.gov
  Huge collection of databases, the most prominent being the nucleotide sequence database
• Protein Data Bank http://www.pdb.org
  Database of protein tertiary structures
• SWISSPROT http://www.expasy.org/sprot/
  Database of annotated protein sequences
• PROSITE http://kr.expasy.org/prosite
  Database of protein active site motifs

Sequence Alignment

BLAST
• A computational tool that allows us to compare query sequences with entries in current biological databases.
• A great tool for predicting functions of an unknown sequence based on alignment similarities to known genes.

Some Early Roles of Bioinformatics
• Sequence comparison
• Searches in sequence databases

Biological Sequence Comparison
• Needleman-Wunsch, 1970
  – Dynamic programming algorithm to align sequences

Searching for Localization Signals

Protein sorting in eukaryotes
• Proteins belong in different organelles of the cell
  – and some even have their function outside the cell
• Günter Blobel was awarded the 1999 Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Protein sorting: secretory pathway / ER
Secretory proteins have a signal peptide
Initially, they are transported across the ER membrane

Signal peptides
A signal peptide is
an N-terminal part of the amino acid chain, containing a hydrophobic region.
Signal peptides differ between proteins, and can be hard to recognize.

Simple pattern (“word”) recognition
Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence (“KDEL signal”).
Pattern: [KRHQSA]-[DENQ]-E-L
NB: only yes/no answers!

Statistical Methods
• Estimate probabilities for nucleotides / amino acids
• Information content in sequences; logos; Position-Weight Matrices.
• Quantitative answers.
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Motif Search

Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

Why is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

Two implanted instances compared position by position (| = match, . = mismatch):
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG

Next Lecture
• Read chapter 1 of Durbin
• Introduction to dynamic programming algorithms (10/08)

Acknowledgments
• Some slides taken from
  – Biological Sequence Analysis course, CBS, Technical University of Denmark
  – Neil Jones, University of California at San Diego
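
Worked example for the (15,4) motif slide: a minimal sketch, assuming that "mutations" are counted as mismatching positions (Hamming distance) and that letter case is ignored. The helper name `hamming` is illustrative; the pattern and the two instances are the ones shown in the slides. Each instance differs from the implanted pattern in 4 positions, but the two instances differ from each other in 7 positions, so comparing instances pairwise gives a much weaker signal than comparing each one to the unknown pattern.

```python
def hamming(s: str, t: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    # Case is ignored: the slides use lower case only to highlight mutations.
    return sum(a != b for a, b in zip(s.upper(), t.upper()))


PATTERN = "AAAAAAAAGGGGGGG"     # implanted 15-mer from the slides
INSTANCE_1 = "AgAAgAAAGGttGGG"  # instance implanted in the first sequence
INSTANCE_2 = "cAAtAAAAcGGcGGG"  # instance implanted in the second sequence

if __name__ == "__main__":
    print("pattern vs instance 1:", hamming(PATTERN, INSTANCE_1))
    print("pattern vs instance 2:", hamming(PATTERN, INSTANCE_2))
    print("instance 1 vs instance 2:", hamming(INSTANCE_1, INSTANCE_2))
```

In general, two instances with d mutations each can differ in up to 2d positions (here up to 8), which is why the pattern is hard to recover from pairwise similarity alone.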