AUTOMATIC PHONETIC ANNOTATION
OF AN ORTHOGRAPHICALLY TRANSCRIBED
SPEECH CORPUS
Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira
IST, Instituto Superior Técnico
INESC, Instituto de Engenharia de Sistemas e Computadores
Summary
• Motivation
• System Architecture
– Module 1: Grapheme-to-phone converter (G2P)
– Module 2: Alternative transcriptions generator (ATG)
– Module 3: Acoustic signal processor
– Module 4: Phonetic decoder and aligner
• Training and Test Corpora
• Results
– Transcription and alignment (Development phase)
– Test corpus annotation (Evaluation phase)
• Conclusions and Future Work
Motivation
• Time-consuming, repetitive task (over 60 x real time)
• Large corpora processing
• No expert intervention
– Non-existence of widely adopted standard procedures
– Error prone
– Inconsistencies among human annotators
System Architecture
[Block diagram]
Orthographically transcribed speech corpus
  → Grapheme-to-Phone Converter (uses Lexicon and Rules)
  → Alternative Transcriptions Generator
  → Phonetic Decoder/Aligner (fed by the Acoustic Signal Processor)
  → Phonetically annotated speech corpus
- Module 1 -
Grapheme-to-Phone Converter
Modules of the Portuguese TTS system (DIXI)
• Text normalisation
– Special symbols, numerals, abbreviations and acronyms
• Broad Phonetic Transcription
– Careful pronunciation of each word
– Set of 200 rules
– Small exceptions dictionary (364 entries)
– SAMPA phonetic alphabet
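The dictionary-first, rule-based design of the converter can be sketched as below. The exception entry and the rewrite rules shown are illustrative assumptions, not the actual DIXI rule set (which has 200 rules and 364 exception entries):

```python
import re

# Hypothetical exception dictionary in SAMPA (the real system has 364 entries).
EXCEPTIONS = {'muito': 'm"u~j~tu'}

# A few illustrative grapheme-to-phone rewrite rules for European Portuguese,
# applied in order; the real system uses 200 context-sensitive rules.
RULES = [
    (r'ch', 'S'),   # "ch" -> [S]
    (r'nh', 'J'),   # "nh" -> [J]
    (r'lh', 'L'),   # "lh" -> [L]
    (r'o$', 'u'),   # word-final unstressed "o" -> [u]
    (r'e$', '@'),   # word-final unstressed "e" -> [@]
]

def g2p(word):
    """Broad phonetic transcription: exception lookup first, then rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phones = word.lower()
    for pattern, replacement in RULES:
        phones = re.sub(pattern, replacement, phones)
    return phones
```

For instance, `g2p('chave')` yields `'Sav@'` under these toy rules.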
- Module 2 -
Alternative Transcriptions Generator
Transformation of phone sequences into lattices
• Based on optional rules:
– Which account for:
» Sandhi
» Vowel reduction
– Specified using finite-state-grammars and simple transduction operators
A (B → C) D
Examples:

Type                             Text          Broad P.T.        Alternative P.T.
sandhi with vowel                de uma        [d@ um6]          [djum6]
quality change                   mesmo assim   [m"eZmu 6s"i~]    [m"eZmw6s"I~]
sandhi with vowel reduction      de uma        [d@ um6]          [dum6]
                                 mesmo assim   [m"eZmu 6s"i~]    [m"eZm6s"i~]
vowel reduction                  semana        [s@m"6n6]         [sm"6n6]
                                 oito          ["ojtu]           ["ojt]
alternative pronunciations       restaurante   [R@Stawr"6~t]     [R@StOr"6~t]
                                 viagens       [vj"aZ6~j~S]      [vj"aZe~S]
Example (rule application):

Phrase: "vou para a praia."
Canonical P.T.: [v"o p6r6 6 pr"aj6]
Narrow P.T. (most frequent): [v"o pr"a pr"aj6] = sandhi + vowel reduction

Rules:
DEF_RULE 6a,  ( (6 → NULL) (sil → NULL) (6 → a) )
DEF_RULE pra, ( p ("6 → NULL) r 6 )

[Lattice: alternative paths through ... p ("6 | NULL) r 6 (sil | NULL) (6 | a) p r ... generated by the optional rules]
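A flat view of what the optional rules produce can be sketched as follows: each rule applies optionally, and the set of reachable phone strings corresponds to the paths of the lattice. The space-separated phone encoding and the two substring rewrites (loosely mirroring DEF_RULE pra and DEF_RULE 6a above) are assumptions for illustration:

```python
def expand(transcription, rules):
    """All alternative transcriptions reachable by optionally applying
    each substring rewrite rule (a flattened view of the lattice)."""
    results = {transcription}
    changed = True
    while changed:
        changed = False
        for old, new in rules:
            for t in list(results):
                if old in t:
                    alt = t.replace(old, new)
                    if alt not in results:
                        results.add(alt)
                        changed = True
    return results

# Optional rules for "vou para a praia", encoded as substring rewrites:
# "pra": delete the [6] between [p] and [r];
# "6a":  [6] sil [6] -> [a] (sandhi across the word boundary).
RULES = [('p 6 r 6', 'p r 6'), ('6 sil 6', 'a')]

CANONICAL = 'v"o p 6 r 6 sil 6 p r "a j 6'
```

Here `expand(CANONICAL, RULES)` contains both the canonical string and the reduced form `'v"o p r a p r "a j 6'`, each such string being one path through the lattice.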
- Module 3 -
Acoustic Signal Processor
Extraction of acoustical signal characteristics
• Sampling: 16 kHz, 16 bits
• Parameterisation: MFCC (Mel-Frequency Cepstral Coefficients)
– Decoding: 14 coefficients, energy, 1st and 2nd order differences, 25 ms
Hamming windows, updated every 10 ms.
– Alignment: 14 coefficients, energy, 1st and 2nd order differences, 16 ms
Hamming windows, updated every 5 ms.
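The windowing arithmetic above can be sketched as a minimal framing front end (the full MFCC computation with mel filterbanks and cepstral differences is omitted):

```python
import numpy as np

def frame_signal(x, sr=16000, win_ms=25, hop_ms=10):
    """Cut a waveform into overlapping Hamming-windowed frames
    (the decoding settings: 25 ms windows, updated every 10 ms)."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)

# One second of 16 kHz audio gives 1 + (16000 - 400) // 160 = 98 frames.
```

The alignment settings (16 ms windows every 5 ms) would simply change `win_ms` and `hop_ms`.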
- Module 4 -
Phonetic Decoder and Aligner
Selection of the phonetic transcription which
is closest to the utterance
• Viterbi algorithm
• 2 x 60 HMM models (one set trained for decoding, one for alignment)
– Architecture
» left-to-right
» 3-state
» 3-mixture
NOTE: modules 3 and 4 use Hidden Markov Model Toolkit (Entropic Research Labs)
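The selection step relies on the Viterbi algorithm. A minimal discrete-output version can be sketched as below; the real system uses continuous-density 3-state, 3-mixture models via HTK, so this is only the core recursion:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely state sequence of a discrete-output HMM.
    log_pi[j]: initial, log_A[i, j]: transition, log_B[j, o]: emission."""
    T, N = len(obs), len(log_pi)
    delta = np.full((T, N), -np.inf)      # best log-score ending in each state
    back = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a sticky two-state model whose emissions favour the current state, an observation run `[0, 0, 1, 1]` decodes to the matching state path.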
Training and Test Corpora
• Subset of the EUROM 1 multilingual corpus
– European Portuguese
– Collected in an anechoic room, 16 kHz, 16 bits.
– 5 male + 5 female speakers (few talkers)
– Prompt texts
» Passages:
• Paragraphs of 5 related sentences
• Free translations of the English version of EUROM 1
• Adapted from books and newspaper text
» Filler sentences:
• 50 sentences grouped in blocks of 5 sentences each
• Built to increase the number of different diphones in the corpus
– Manually annotated.
Training and Test Corpora (cont.)
Speaker   Passages                 Phrases   Corpus
1         O0-O4  O5-O9  P0-P4      F5-F9     Training Corpus
2         O0-O4  O5-O9  P0-P4      F0-F4     Test Corpus 1
3         P5-P9  Q0-Q4  Q5-Q9      F5-F9     Test Corpus 2
4         P0-P4  P5-P9  Q0-Q4      F5-F9
5         O5-O9  P0-P4  P5-P9      F0-F4
6         P5-P9  Q0-Q4  Q5-Q9      F5-F9
7         O0-O4  O5-O9  P0-P4      F0-F4
8         Q0-Q4  Q5-Q9  R0-R4      F0-F4
9         R5-R9  O0-O4  O5-O9      F5-F9
10        Q5-Q9  R0-R4  R5-R9      F5-F9

Passages: O0-O9, P0-P9 - English translations; Q0-Q9, R0-R9 - books and newspaper text.
Filler sentences: F0-F9
Transcription and Alignment Results
• Transcription:
– Precision = ((correct - inserted)/Total) x 100%
• Alignment:
– % of cases in which the absolute error is < 10 ms
– average absolute error including 90 % of cases
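The two figures of merit can be computed as below; the non-interpolated percentile is an assumption about how the 90% bound was taken:

```python
def transcription_precision(correct, inserted, total):
    """Precision = ((correct - inserted) / total) x 100%, as defined above."""
    return 100.0 * (correct - inserted) / total

def alignment_stats(errors_ms):
    """Share of boundary errors under 10 ms, and the smallest error
    bound that covers 90% of the cases."""
    errs = sorted(abs(e) for e in errors_ms)
    within_10ms = 100.0 * sum(e < 10 for e in errs) / len(errs)
    bound_90 = errs[int(0.9 * len(errs)) - 1]   # non-interpolated percentile
    return within_10ms, bound_90
```

For example, 60 correct phones with 7 insertions out of 100 gives a precision of 53%.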
                      Transcription   Alignment
Models                Precision       < 10 ms   Percentile 90%
HMM (transcription)   52.8 %          66.9 %    20 ms
HMM (alignment)       43.0 %          78.9 %    18 ms
Annotation strategies and Results
             Transcription     Alignment
Strategy 1   HMM alignment     HMM alignment
Strategy 2   HMM recognition   HMM recognition
Strategy 3   HMM recognition   HMM alignment

             Transcription   Alignment
Models       Precision       < 10 ms   Percentile 90%
Strategy 1   85.3 %          77.4 %    20 ms
Strategy 2   85.8 %          44.0 %    29 ms
Strategy 3   85.8 %          78.0 %    19 ms
NOTE: Alignment evaluated only in places where the decoded sequence matched the manual sequence
Annotation Results - Transcription Precision

Rules                                            Test 1   Test 2
Canonical                                        74.0 %   76.9 %
Sandhi                                           77.1 %   79.4 %
Vowel reduction and alternative pronunciations   85.1 %   84.5 %

• Comments
– Better precision achieved for the canonical transcriptions of Test 2
– Highest overall precision achieved in Test 1
– Successive application of the rules leads to better precision
Annotation Results - Alignment

                                                 Test 1             Test 2
Rules                                            < 10 ms   90 %     < 10 ms   90 %
Canonical                                        74.68 %   24 ms    75.18 %   25 ms
Sandhi                                           75.04 %   23 ms    75.41 %   24 ms
Vowel reduction and alternative pronunciations   78.76 %   19 ms    77.27 %   22 ms

• Comments
– Better alignment obtained with the best decoder
– Some problematic transitions: vowels, nasal vowels and liquids.
Conclusions
• Better annotation results with:
– alternative transcriptions (compared to the canonical ones)
– different models for alignment and recognition
• About 84 % precision in transcription and 22 ms maximum
alignment error for 90 % of the cases
Future Work
• Automatic rule inference
– 1st Phase: comparison and selection of rules
– 2nd Phase: validation or phonetic-linguistic interpretation
• Annotation of other speech corpora to build better
acoustic models
• Assignment of probabilistic information to the
alternative pronunciations generated by rule
TOPIC ANNOTATION IN BROADCAST NEWS
Rui Amaral, Isabel Trancoso
IST, Instituto Superior Técnico
INESC, Instituto de Engenharia de Sistemas e Computadores
Preliminary work
• System Architecture
– Two-stage unsupervised clustering algorithm
» nearest-neighbour search method
» Kullback-Leibler distance measure
– Topic language models
» smoothed unigram statistics
– Topic Decoder
» based on Hidden Markov Models (HMM)
NOTE: topic models created with CMU Cambridge Statistical Language Modelling Toolkit
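The distance used in the clustering stage can be sketched as follows: each text gets an add-alpha smoothed unigram distribution, and pairs are compared with a symmetrised Kullback-Leibler divergence (the symmetrisation and the smoothing constant are assumptions for illustration):

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps q(w) > 0."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def kl_distance(p, q):
    """Symmetrised KL, usable as a distance in nearest-neighbour search."""
    return kl(p, q) + kl(q, p)
```

Nearest-neighbour clustering would then repeatedly merge the pair of clusters with the smallest `kl_distance` between their unigram models.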
System Architecture
[Block diagram]
TRAINING PHASE
  Process 1: newspaper text corpus (topic labelled)
    → Clustering (clusters C1 ... Ci ... Ck)
    → Selection & Filtering (topics T1 ... Ti ... Tk)
    → Topic Model Generation (topic models TM1 ... TMi ... TMk, topic HMM)
  Process 2: newspaper text corpus (topic unlabelled)
    → Topic Segmentation and Labelling
DECODING PHASE
  Texts → Topic Segmentation and Labelling → topic-annotated texts
Training and Test Corpora
• Subset of the BD_PUBLICO newspaper text corpus
– 20,000 stories
– 6-month period (September 1995 - February 1996)
– topic annotated
– story size between 100 and 2000 words
– normalised text