AUTOMATIC PHONETIC ANNOTATION OF AN ORTHOGRAPHICALLY TRANSCRIBED SPEECH CORPUS

Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira
IST, Instituto Superior Técnico
INESC, Instituto de Engenharia de Sistemas e Computadores

Summary
• Motivation
• System Architecture
  – Module 1: Grapheme-to-phone converter (G2P)
  – Module 2: Alternative transcriptions generator (ATG)
  – Module 3: Acoustic signal processor
  – Module 4: Phonetic decoder and aligner
• Training and Test Corpora
• Results
  – Transcription and alignment (development phase)
  – Test corpus annotation (evaluation phase)
• Conclusions and Future Work

Motivation
• Manual annotation is a time-consuming, repetitive task (over 60 x real time)
• Large corpora to process
• No expert intervention required:
  – no widely adopted standard procedures exist
  – error-prone
  – inconsistencies among human annotators

System Architecture
[Block diagram: the orthographically transcribed speech corpus feeds both the Grapheme-to-Phone Converter (supported by a lexicon and rules) and the Acoustic Signal Processor; the G2P output passes through the Alternative Transcriptions Generator into the Phonetic Decoder/Aligner, which produces the phonetically annotated speech corpus.]

- Module 1 - Grapheme-to-Phone Converter
Reuses modules of the Portuguese TTS system (DIXI)
• Text normalisation
  – Special symbols, numerals, abbreviations and acronyms
• Broad phonetic transcription
  – Careful pronunciation of each word
  – Set of 200 rules
  – Small exceptions dictionary (364 entries)
  – SAMPA phonetic alphabet

- Module 2 - Alternative Transcriptions Generator
Transforms phone sequences into lattices
• Based on optional rules:
  – accounting for:
    » sandhi
    » vowel reduction
  – specified using finite-state grammars and simple transduction operators, e.g. A (B C) D

Examples:

Type                               Text          Broad P.T.        Alternative P.T.
Sandhi with vowel quality change   de uma        [d@ um6]          [djum6]
                                   mesmo assim   [m"eZmu 6s"i~]    [m"eZmw6s"I~]
Sandhi with vowel reduction        de uma        [d@ um6]          [dum6]
                                   mesmo assim   [m"eZmu 6s"i~]    [m"eZm6s"i~]
Vowel reduction                    semana        [s@m"6n6]         [sm"6n6]
                                   oito          ["ojtu]           ["ojt]
Alternative pronunciations         restaurante   [R@Stawr"6~t]     [R@StOr"6~t]
                                   viagens       [vj"aZ6~j~S]      [vj"aZe~S]

Example (rule application):
Phrase: "vou para a praia."
Canonical P.T.: [v"o p6r6 6 pr"aj6]
Narrow P.T. (most frequent): [v"o pr"a pr"aj6] = sandhi + vowel reduction

Rules:
DEF_RULE 6a, ( (6 NULL) (sil NULL) (6 a) )
DEF_RULE pra, ( p ("6 NULL) r 6 )

[Lattice figure: the optional rules yield parallel phone paths for "para a", e.g. the full path p "6 r 6 sil ... alongside the reduced path p r a ...]

(An illustrative rule-expansion sketch follows the module descriptions below.)

- Module 3 - Acoustic Signal Processor
Extracts acoustic features from the speech signal
• Sampling: 16 kHz, 16 bits
• Parameterisation: MFCC (Mel-Frequency Cepstral Coefficients)
  – Decoding: 14 coefficients, energy, 1st- and 2nd-order differences, 25 ms Hamming windows, updated every 10 ms
  – Alignment: 14 coefficients, energy, 1st- and 2nd-order differences, 16 ms Hamming windows, updated every 5 ms

- Module 4 - Phonetic Decoder and Aligner
Selects the phonetic transcription closest to the utterance
• Viterbi algorithm
• 2 x 60 HMM models
  – Architecture:
    » left-to-right
    » 3 states
    » 3 mixtures
NOTE: modules 3 and 4 use the Hidden Markov Model Toolkit (HTK, Entropic Research Labs); an example front-end configuration is sketched below.
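NOTE: the Python sketch below is not the ATG implementation; it is a minimal illustration of how optional rules such as DEF_RULE 6a and DEF_RULE pra expand a broad phone sequence into a set of alternative pronunciations (in the real system these paths are unioned into a lattice). The phone labels are simplified (stress marks dropped from "para") and the RULES structure is an assumption made for this sketch.

    # Each entry maps a phone subsequence to its optional replacement.
    RULES = [
        (("6", "sil", "6"), ("a",)),             # DEF_RULE 6a: drop [6], drop [sil], [6] -> [a]
        (("p", "6", "r", "6"), ("p", "r", "6")),  # DEF_RULE pra: optional [6] deletion in "para"
    ]

    def alternatives(phones):
        """Every phone sequence reachable by optionally applying the rules
        at any position (closure over the rule set)."""
        seen = {tuple(phones)}
        frontier = [tuple(phones)]
        while frontier:
            seq = frontier.pop()
            for pattern, repl in RULES:
                n = len(pattern)
                for i in range(len(seq) - n + 1):
                    if seq[i:i + n] == pattern:
                        new = seq[:i] + repl + seq[i + n:]
                        if new not in seen:
                            seen.add(new)
                            frontier.append(new)
        return sorted(seen, key=len, reverse=True)

    # "vou para a praia." -- canonical [v"o p6r6 6 pr"aj6], simplified labels
    for alt in alternatives('v"o p 6 r 6 sil 6 p r "a j 6'.split()):
        print(" ".join(alt))

The four printed sequences range from the canonical path to the fully reduced [v"o pra pr"aj6], matching the sandhi + vowel reduction derivation shown above.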
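NOTE: since modules 3 and 4 use HTK, a front end like the decoding one above could be specified with a configuration along these lines (HTK expresses WINDOWSIZE and TARGETRATE in 100 ns units; the alignment front end would use 160000.0 and 50000.0). This is an illustrative sketch, not the authors' actual configuration file; decoding and forced alignment would then be run with HTK's HVite tool.

    SOURCEKIND = WAVEFORM
    SOURCERATE = 625.0        # 16 kHz sampling (1/16000 s in 100 ns units)
    TARGETKIND = MFCC_E_D_A   # MFCCs + energy + 1st- and 2nd-order differences
    NUMCEPS    = 14
    USEHAMMING = T
    WINDOWSIZE = 250000.0     # 25 ms Hamming window
    TARGETRATE = 100000.0     # 10 ms frame shift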
Training and Test Corpora
• Subset of the EUROM 1 multilingual corpus
  – European Portuguese
  – Collected in an anechoic room, 16 kHz, 16 bits
  – 5 male + 5 female speakers (few-talker set)
  – Prompt texts:
    » Passages:
      • paragraphs of 5 related sentences
      • free translations of the English version of EUROM 1
      • adapted from books and newspaper text
    » Filler sentences:
      • 50 sentences grouped in blocks of 5 sentences each
      • built to increase the number of different diphones in the corpus
  – Manually annotated

Training and Test Corpora (cont.)

Speaker   Passages                 Filler sentences
 1        O0-O4  O5-O9  P0-P4      F5-F9
 2        O0-O4  O5-O9  P0-P4      F0-F4
 3        P5-P9  Q0-Q4  Q5-Q9      F5-F9
 4        P0-P4  P5-P9  Q0-Q4      F5-F9
 5        O5-O9  P0-P4  P5-P9      F0-F4
 6        P5-P9  Q0-Q4  Q5-Q9      F5-F9
 7        O0-O4  O5-O9  P0-P4      F0-F4
 8        Q0-Q4  Q5-Q9  R0-R4      F0-F4
 9        R5-R9  O0-O4  O5-O9      F5-F9
10        Q5-Q9  R0-R4  R5-R9      F5-F9

Passages O0-O9, P0-P9: free translations of the English EUROM 1 texts
Passages Q0-Q9, R0-R9: books and newspaper text
Filler sentences: F0-F9
(The original slide also brackets the speakers into the training corpus, test corpus 1 and test corpus 2.)

Transcription and alignment results
• Transcription metric:
  – Precision = ((correct - inserted) / total) x 100%
• Alignment metrics:
  – % of cases in which the absolute boundary error is < 10 ms
  – absolute error below which 90% of the cases fall (90th percentile)
(An illustrative scoring sketch follows the results slides below.)

Models                Precision   < 10 ms   90th percentile
HMM (transcription)   52.8%       66.9%     20 ms
HMM (alignment)       43%         78.9%     18 ms

Annotation strategies and Results

Strategy     Transcription       Alignment
Strategy 1   HMM (alignment)     HMM (alignment)
Strategy 2   HMM (recognition)   HMM (recognition)
Strategy 3   HMM (recognition)   HMM (alignment)

Models       Precision   < 10 ms   90th percentile
Strategy 1   85.3%       77.4%     20 ms
Strategy 2   85.8%       44%       29 ms
Strategy 3   85.8%       78%       19 ms

NOTE: alignment evaluated only where the decoded sequence matched the manual sequence

Annotation results - Transcription

Rules                                            Precision (Test 1)   Precision (Test 2)
Canonical                                        74%                  76.9%
Sandhi                                           77.1%                79.4%
Vowel reduction and alternative pronunciations   85.1%                84.5%

• Comments
  – Better precision achieved for canonical transcriptions on Test 2
  – Highest global precision achieved on Test 1
  – Successive application of the rules leads to better precision

Annotation results - Alignment

                                                 Test 1               Test 2
Rules                                            < 10 ms   90th pct   < 10 ms   90th pct
Canonical                                        74.68%    24 ms      75.18%    25 ms
Sandhi                                           75.04%    23 ms      75.41%    24 ms
Vowel reduction and alternative pronunciations   78.76%    19 ms      77.27%    22 ms

• Comments
  – Better alignment obtained with the best decoder
  – Some problematic transitions: vowels, nasal vowels and liquids
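NOTE: a minimal Python sketch of the two evaluation measures used above, not the authors' scoring code; function and variable names are assumptions made for this illustration.

    def transcription_precision(correct, inserted, total):
        """Precision = ((correct - inserted) / total) x 100%."""
        return 100.0 * (correct - inserted) / total

    def alignment_stats(boundary_errors_ms):
        """Share of absolute boundary errors under 10 ms, and the error
        value below which 90% of the cases fall (90th percentile)."""
        errors = sorted(abs(e) for e in boundary_errors_ms)
        under_10ms = 100.0 * sum(e < 10.0 for e in errors) / len(errors)
        p90 = errors[int(0.9 * (len(errors) - 1))]
        return under_10ms, p90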
Conclusions
• Better annotation results with:
  – alternative transcriptions (compared to canonical)
  – different models for alignment and for recognition
• About 84% precision in transcription, and 22 ms maximum alignment error for 90% of the cases

Future Work
• Automatic rule inference
  – 1st phase: comparison and selection of rules
  – 2nd phase: validation or phonetic-linguistic interpretation
• Annotation of other speech corpora to build better acoustic models
• Assignment of probabilistic information to the alternative pronunciations generated by rule

TOPIC ANNOTATION IN BROADCAST NEWS

Rui Amaral, Isabel Trancoso
IST, Instituto Superior Técnico
INESC, Instituto de Engenharia de Sistemas e Computadores

Preliminary work
• System Architecture
  – Two-stage unsupervised clustering algorithm
    » nearest-neighbour search method
    » Kullback-Leibler distance measure
  – Topic language models
    » smoothed unigram statistics
  – Topic decoder
    » based on Hidden Markov Models (HMMs)
NOTE: topic models created with the CMU-Cambridge Statistical Language Modelling Toolkit; an illustrative sketch of the clustering statistics follows at the end.

System Architecture
[Block diagram: in the training phase, the topic-labelled newspaper text corpus is clustered (clusters C1 ... Ci ... Ck with topics T1 ... Ti ... Tk), passed through selection & filtering, and fed to topic model generation (TM1 ... TMi ... TMk), yielding the topic HMM; in the decoding phase, texts from the topic-unlabelled newspaper corpus undergo topic segmentation and labelling, producing topic-annotated texts.]

Training and Test Corpora
• Subset of the BD_PUBLICO newspaper text corpus
  – 20,000 stories
  – 6-month period (September 1995 - February 1996)
  – topic-annotated
  – story size between 100 and 2,000 words
  – normalised text
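NOTE: a minimal Python sketch of the statistics behind the clustering stage: smoothed unigram topic models and a symmetrised Kullback-Leibler distance for the nearest-neighbour search. The real topic models were built with the CMU-Cambridge SLM toolkit; the Laplace smoothing choice and all function names here are assumptions made for this illustration.

    import math
    from collections import Counter

    def unigram_model(tokens, vocab, alpha=1.0):
        """Laplace-smoothed unigram probabilities over a fixed vocabulary."""
        counts = Counter(tokens)
        total = len(tokens) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def kl(p, q):
        # Assumes p and q were built over the same vocabulary,
        # so every probability is non-zero after smoothing.
        return sum(p[w] * math.log(p[w] / q[w]) for w in p)

    def kl_distance(p, q):
        """Symmetrised Kullback-Leibler divergence between unigram models."""
        return kl(p, q) + kl(q, p)

    def nearest_cluster(story_model, cluster_models):
        """Nearest-neighbour search: index of the closest cluster model."""
        return min(range(len(cluster_models)),
                   key=lambda i: kl_distance(story_model, cluster_models[i]))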