UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Posterior Regularization Framework:
Learning Tractable Models with Intractable Constraints
João de Almeida Varelas Graça
(MSc in Information Systems and Computer Engineering)
Submitted in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
Computer Science and Engineering
Advisor:
Professor Maria Luísa Torres Ribeiro Marques da Silva Coheur
Co-Advisor: Professor Fernando Carlos das Neves Pereira
Co-Advisor: Professor Ben Taskar
Committee
President:
Jury:
President of the IST scientific committee
Professor Maria Luísa Torres Ribeiro Marques da Silva Coheur
Professor Fernando Carlos das Neves Pereira
Professor Ben Taskar
Professor Mário Alexandre Teles de Figueiredo
Professor Adrià de Gispert
Professor Vasco Miguel Gomes Nunes Manquinho
July 16, 2009
2
Abstract
Unsupervised Learning of probabilistic structured models presents a fundamental tradeoff between richness of captured constraints and correlations versus efficiency and tractability
of inference. In this thesis, we propose a new learning framework called Posterior Regularization (PR) that incorporates side-information into unsupervised estimation in the form of constraints on the model’s posteriors. The underlying model remains unchanged, but the learning
method changes. During learning, our method is similar to the EM algorithm, but we solve a
problem similar to Maximum Entropy inside the E-Step to enforce the constraints. We apply
the PR framework to two different large scale tasks: Statistical Word Alignments and Unsupervised Part of Speech Induction. In the former, we incorporate two constraints: bijectivity
and symmetry. Training using these constraints produces a significant boost in performance as
measured by both precision and recall against manually annotated alignments for six language
pairs. In the latter we enforce sparsity on the word tag distribution which is overestimated using the default training method. Experiments on six languages achieve dramatic improvements
over state-of-the-art results.
4
Resumo
A aprendizagem não supervisionada de modelos probabilísticos estruturados apresenta
um compromisso fundamental entre a expressividade das restrições e correlações capturadas,
e a eficiência e exequibilidade de inferência. Nesta tese apresentamos uma nova framework
de aprendizagem intitulada framework de Regularização à Posteriori, que incorpora informação adicional durante a aprendizagem não supervisionada na forma de restrições sobre a
distribuição à posteriori do modelo. O modelo probabilístico mantém-se inalterado, mas o algoritmo de aprendizagem não. Durante a aprendizagem o nosso método é semelhante ao algoritmo EM, mas resolvemos um problema semelhante ao problema de máxima entropia dentro
do E-Step, de forma a satisfazer as restrições. Aplicamos a nova framework a duas tarefas de
grande escala: O problema de Alinhamento Palavra a Palavra Estatístico e a Indução de Etiquetas Sintácticas. Na primeiro tarefa incorporamos duas restrições: bijectividade e simetria.
Os modelos treinados com estas restrições produzem um aumento significativo nos resultados
avaliados em termos de precisão e cobertura em seis pares de línguas distintos. Na segunda
tarefa impomos dispersão na distribuição de etiquetas dada uma palavra, que é sobrestimado
usando o método de treino padrão. Experiências em seis línguas diferentes apresentam resultados significativamente melhores do que o estado da arte.
6
Keywords & Palavras Chave
Keywords
• Posterior Regularization Framework
• Unsupervised Learning
• Prior Knowledge
• Statistical Word Alignments
• Part of Speech Induction
• Natural Language Processing
Palavras Chave
• Framework de Regularização à Posteriori
• Aprendizagem Não Supervisionada
• Conhecimento à Priori
• Alinhamentos Palavra a Palavra Estatístico
• Indução de Etiquetas Sintácticas
• Processamento de Língua Natural
8
Acknowledgments
I would like to express my gratitude to everyone that helped me during the development
of this dissertation, provided me with their support, and endured my constant stress and bad
temper. Without them this work would not have been possible. My thanks extend to the
INESC-ID Spoken Language Systems lab team, where I developed a part of my thesis, and to
the University of Pennsylvania where I was an invited student for most of the time during the
execution of this work. My thanks to my colleagues Partha Talukdar, Alex Kulesza and Jennifer
Gillenwater who I had the pleasure to share the work room and discuss a lot of ideas. My
special thanks to Kuzman Ganchev whith whom I had the pleasure to work closely during this
work, for all his patience and friendship. I would like to thank my Supervisors, Luísa Coheur,
Fernando Pereira and Ben Taskar for all their guidance over these years, constant advices and
corrections, and never ending patience towards my doubts and requests. To all my friend that
have always been there for me, even when things seemed to be going wrong, thank you for
your words of comfort and motivation. And last, but certainly not least, I thank my family, for
their unconditional support, not only throughout this project, but also for my entire life.
Lisboa, April 18, 2011
João Graça
10
Contents
1
2
Introduction
1
1.1
Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1.2
Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
Background
5
2.1
Modeling - Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . .
6
2.1.1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6
2.1.2
HMM Parameters
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
2.2
Learning - Expectation Maximization Algorithm . . . . . . . . . . . . . . . . . . .
13
2.3
Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
2.4
Statistical Word Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
2.4.1
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
2.4.2
Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
2.4.3
Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
2.4.4
Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
2.4.5
Baseline Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
2.4.6
Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
Part of Speech Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
2.5.1
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
2.5.2
Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
2.5.3
Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
2.5.4
Baseline problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
2.5.5
Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
2.5
3
Posterior Regularization Framework
29
3.1
Posterior Regularization Objective . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.1.1
30
Posterior Regularization Objective . . . . . . . . . . . . . . . . . . . . . . .
i
3.1.2
Computing The PR Objective . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.1.3
Projection Before Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.2
Posterior Regularization Optimization Algorithm . . . . . . . . . . . . . . . . . .
33
3.3
Statistical Word Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
34
3.3.1
Bijectivity Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
34
3.3.2
Symmetry Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
PoS Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
37
3.4.1
Parameter Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
3.4.2
Posterior Sparsity Constraint . . . . . . . . . . . . . . . . . . . . . . . . . .
39
Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
3.4
3.5
4
Statistical Word Alignment
45
4.1
Word Alignment Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
4.1.1
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
4.1.2
Dual Projection Optimization . . . . . . . . . . . . . . . . . . . . . . . . . .
46
4.1.3
Projection Before Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . .
47
4.1.4
Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
4.1.5
Rare vs. Common Words . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50
4.1.6
Directional Alignments Combination . . . . . . . . . . . . . . . . . . . . .
50
4.1.7
Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
4.2.1
Syntax Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
4.2.2
Phrase-based machine translation . . . . . . . . . . . . . . . . . . . . . . .
54
4.2
5
Part of Speech Induction
57
5.1
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
57
5.2
Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
5.3
Maximum Entropy Observation Model Results . . . . . . . . . . . . . . . . . . . .
58
5.4
Tag Sparsity Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
5.5
Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64
5.6
Using The Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
ii
6
Conclusion and Future Work
67
6.1
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
6.2
Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
68
A Derivations
71
A.1 Convergence Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
A.2 Modified E-step Dual Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
A.3 Modified Penalty E-step Dual Derivation . . . . . . . . . . . . . . . . . . . . . . .
72
B Bibliography
75
iii
iv
List of Figures
2.1
HMM running example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
2.2
HMM Trellis representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
2.3
Manual annotate word alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
2.4
Example of a predicted word alignment . . . . . . . . . . . . . . . . . . . . . . . .
20
2.5
Precision Recall curve example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
2.6
State posterior distributions for different models for an English and French sentence pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
2.7
PoS mapping example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
2.8
Differences of sparsity per word on the true tag distribution versus the HMM
and HMM-ME model trained using EM . . . . . . . . . . . . . . . . . . . . . . . .
27
3.1
PR objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.2
Modified EM algorithm for optimizing the PR objective . . . . . . . . . . . . . . .
34
3.3
State posterior distributions for different models an English and French sentence
pair. EM training vs PR with bijectivity constraint . . . . . . . . . . . . . . . . . .
35
State posterior distributions for different models an English and French sentence
pair. EM training vs PR with symmetry constraint . . . . . . . . . . . . . . . . . .
37
3.5
Sparsity measure illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
3.6
Differences of sparsity per word on the true tag distribution versus the HMM
and HMM-ME model trained using EM and PR with sparsity constraint . . . . .
42
Effects of the convergence criteria for optimization of the PR objective for Word
Alignments with symmetry constraints . . . . . . . . . . . . . . . . . . . . . . . .
46
Effects of slack values for optimization of the PR objective for Word Alignments
with symmetry constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
47
Precision Recall comparing the use of constraints during training of just at decoding time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
4.4
Precision Recall comparison for different training methods of word alignments .
49
4.5
Word alignment methods overall comparison . . . . . . . . . . . . . . . . . . . . .
49
4.6
Learning curves for different method for word alignments . . . . . . . . . . . . .
50
3.4
4.1
4.2
4.3
v
4.7
Word Alignments method comparison broken into rare and common words . . .
51
4.8
Word alignment methods comparison after directional symmetrization heuristics
51
4.9
Word Alignments error analysis for different methods . . . . . . . . . . . . . . . .
53
4.10 Edge conservation for cross lingual grammar induction . . . . . . . . . . . . . . .
54
4.11 Machine Translation phrase extraction from word alignments example . . . . . .
55
5.1
PoS methods comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
5.2
PoS Maximum entropy features effect comparison . . . . . . . . . . . . . . . . . .
59
5.3
PoS effect of reducing the number of model parameters . . . . . . . . . . . . . . .
60
5.4
PoS number of model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . .
60
5.5
PoS Maximum Entropy regularizer strength comparison . . . . . . . . . . . . . .
61
5.6
PoS Average absolute �1 /�∞ difference for each word type between each training
method and the supervised initialization. . . . . . . . . . . . . . . . . . . . . . . .
62
5.7
PoS L1LMax vs Accuracy for different methods
. . . . . . . . . . . . . . . . . . .
62
5.8
PoS Token distribution per cluster . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
5.9
PoS comparison of different number of word types constrained to be sparse . . .
63
5.10 PoS different sparsity strength comparison . . . . . . . . . . . . . . . . . . . . . .
64
5.11 PoS error analysis for English . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64
5.12 PoS error analysis for Portuguese . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
5.13 PoS Supervised learning error by using induced clusters as features . . . . . . . .
65
5.14 PoS Supervised learning error reduction by using induced clusters as features . .
66
vi
List of Tables
2.1
General notation used in this document . . . . . . . . . . . . . . . . . . . . . . . .
5
2.2
HMM notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
2.3
HMM probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
2.4
HMM multinomial parametrization . . . . . . . . . . . . . . . . . . . . . . . . . .
11
2.5
Word Alignment corpus statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20
2.6
PoS corpora statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
4.1
Machine translation results for different word alignments . . . . . . . . . . . . . .
56
vii
viii
List of Algorithms
1
EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2
Computing KL(Qx̄ � pθ (z̄ | x̄)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3
PR optimization via modified EM . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ix
34
x
1
Introduction
Unsupervised learning of linguistic structures is a complex task. The process by which
linguistic structures are generated is not always clear and even when it is, it is normally too
complex to be formally expressed. Nevertheless, unsupervised learning has been applied to
a wide range of natural language processing tasks, such as: Part of Speech Induction (PoSI)
[Schütze, 1995; Merialdo, 1994; Clark, 2003], Dependency Grammar Induction [Klein and Manning, 2004; Smith and Eisner, 2006], Constituency Grammar Induction [Klein and Manning,
2004], Statistical Word Alignments (SWA) [Brown et al., 1993] and Anaphora Resolution [Charniak and Elsner, 2009], just to name a few.
Different motivations have pushed research in this area. From both a linguistic and cognitive point of view, unsupervised learning is useful as a tool to study language acquisition. From
a more pragmatic perspective, unsupervised learning is required since annotated corpora is a
scarce resource for several reasons. First, the construction of annotated corpus is costly and
time consuming, resulting in an actual shortage of annotated corpora for several tasks. Second,
even when these corpora exists, they are usually not generic, and when the context or domain
changes their usefulness decreases dramatically. Third, when creating a corpus, several assumptions are embedded in the annotation process that will propagate to the results of any
tool trained in it. These assumptions are not necessarily the most useful for a specific task. For
instance, the decision of which Part of Speech (PoS) tags to use when annotating a corpus may
seem useful at the time, but hinder the corpus utility for other tasks. This situation occurs for
instance in the Penn TreeBank [Marcus et al., 1993] that contains a special PoS tag for the word
“To” that is hard to justify, since it is generally a preposition. Although a model trained in a
supervised manner can learn this easily from a few examples, the same is not true for unsupervised learning since learning is based on co-occurrences of words or combinations of words
in the text. As another example, in the Portuguese Treebank [Afonso et al., 2002] the text was
segmented in a way, in which various words were merged into a single unit with the same tag
(compound nouns, adverbial expressions). Although this might have eliminated many ambiguities during the annotation process, it is inadequate to several multilingual applications, such
as grammar transfer, when the other corpus does not follow the same segmentation guidelines.
Also, the segmentation used is by no means trivial and easy to replicate, making these corpus
unfitted to train a system with the goal of annotating a corpus with a different segmentation.
Furthermore, even when annotated corpora is available the annotations induced by an unsupervised system can be useful as extra data. For instance, induced PoS tags have been used
to boost the performance of supervised models in several tasks, such as PoS tagging either to
account for rare words or for domain adaptation [Huang and Yates, 2009; Biemann et al., 2007],
name entity recognition, word sense disambiguation [Biemann et al., 2007], estimate the fertility of words in word alignment models [Brown et al., 1993], or to replace actual words in
unsupervised grammar induction [Headden III et al., 2008].
Unsupervised learning can be split into three different tasks: 1) Modeling - desig a model
2
CHAPTER 1. INTRODUCTION
that explains the linguistic phenomena we are trying to capture; 2) Learning - given a model
and a corpus, learn the parameters that maximize some objective; 3) Decoding - given the
model and parameters, pick the best hidden structure from the posterior distribution (the distribution over all possible hidden structures conditioned on the observed data).
A common approach in modeling is to chose a family of probabilistic parametric models
where some variables are hidden and represent the linguistic structure that is trying to be modeled. This choice entails a fundamental trade-off between model complexity and tractability of
inference. On the one hand, we desire a model capable of faithfully reproducing the various
steps of the generative process underlying the linguistic phenomena. On the other hand, the
model has to be computationally tractable to be of any use. This second requirement implies
that we frequently end up with a model that follows an oversimplified generative story. For
instance in machine translation, early models [Brown et al., 1993] assumed that the translation
of a particular sentence breaks down as the translation of each individual word in a linear order
from left to right.
Training consists in finding the parameters that maximize some objective, normally the
likelihood of the data. A common solution to this optimization problem is to use the Expectation Maximization (EM) algorithm [Dempster et al., 1977]. The quality of the parameters
corresponds to the quality of the hidden structure chosen by the model during decoding, given
the parametrization. When using maximum likelihood estimation, there is no way to specify any bias or prior knowledge about the desired hidden structures. The lack of guidance
together with the oversimplification of the model causes the learned parameters to be suboptimal which may lead to undesired results. Furthermore, even simpler models normally
have too many parameters. The lack of guidance also leads the model to overfit and try to
explain undesired correlations in the data.
To mitigate this problem one can go back to the modeling phase and design a new model
that introduces constraints, disallowing certain configurations of the hidden variables, or a
model that follows a more complex generative story. For instance, Brown et al. [1993] start
by defining two simple and tractable models IBM M1 and IBM M2 for word based machine
translation, which follow an oversimplified generative story on how alignments are produced.
These models produce reasonable results but have systematic errors. Then, they proceed by
defining a series of more complex models, IBM M3-M6, that model refined versions of the generative story, and produce better alignments. However, these improvements come at a price:
the more complex models are intractable and require heuristics and approximate inference
prone to search errors. Note that often the simplest models have the same expressive power
than more complex ones, and can represent the same set of hidden structures (this is the case for
the different IBM models). Nevertheless, their simplifying assumptions increase the difficulty
of the learning algorithm to find the correct parameters without any guidance.
A different option to solve the lack of guidance is to keep the model and change the training procedure. A common approach is to move the learning procedure from a maximum likelihood estimation to a maximum a posterior estimation, where a prior distribution over the
model parameters (that encodes preferences about different parameter settings) is added, thus
guiding the learning procedure. Another approach is to move from the maximum a posterior
estimation to a complete Bayesian paradigm of learning. In this paradigm, instead of picking the best parameter setting, we directly pick the best hidden state structure by integrating
over all possible parameter distributions. This approach has shown positive results for several
tasks, such as Grammar Induction [Cohen and Smith, 2009; Headden III et al., 2009], and PoSI
1.1. THESIS CONTRIBUTIONS
3
[Johnson, 2007; Goldwater and Griffiths, 2007]. Despite its recent success, this is not an ideal
solution, since we are normally interested in expressing preferences over the hidden structures
and not over the parameters. Moreover, these preferences are normally hard to express as preferences on the parameters, since the posterior distributions depend on the parameters through
complex interactions.
In this thesis we propose a different approach to guide the model during learning. We
present a new learning framework called Posterior Regularization (PR) that incorporates prior
knowledge directly over the hidden variables during learning. The PR framework defines a
rich language for specifying prior knowledge about the desired hidden structure. The prior
knowledge is expressed as inequalities on the expected value under the posterior distribution
of some user-defined constraint features (not necessarily the same features used by the model).
Constraining the posterior distribution allows a more direct way to achieve the desired behaviour than adding priors to the parameters as we show on the experimental sections. Furthermore, constraining the expected value of the features instead of adding them to the model
allows us to express features that would otherwise make the model intractable. With PR, the
underlying model remains unchanged, but during learning it is forced to find parameters that
satisfy the desired constraints. We also present a learning algorithm that resembles the EM
algorithm to find the parameters that maximize the PR objective.
We show the benefits of the new learning framework on two different case studies: SWA
and PoSI.
For the SWA task we use the Hidden Markov Model (HMM) for word alignments [Vogel
et al., 1996] and achieve significant improvements in performance using the new training procedure. We use PR to enforce two different pieces of prior knowledge as constraints: (i) bijectivity: “one word in one language should not translate to many words in the other language”; and
(ii) symmetry: “the translation of words should be the same when translated from language A
to language B and from language B to language A”. We show for six different language pairs
and different testing scenarios that training using PR with both of these constraints always
outperforms training using EM, normally by a large margin. Moreover, by using this training
procedure the simple HMM model generally outperforms the more complex IBM M4 model.
Also, the alignments produced by the new training procedure are better suited for transferring
syntactic dependency parse annotations, and also to extract minimal translation units leading
to better end to end machine translation systems, specially on small data scenarios.
In the task of PoSI we use PR to constrain the ambiguity of word-tag associations via a
sparsity-inducing norm on the posteriors. We use a first-order HMM as our model to compare
the different training conditions: EM training without modifications to encourage sparsity, a
Bayesian approach with a sparsity-inducing prior as used by Johnson [2007] using variational
Bayes EM, and PR with the posterior sparsity-inducing norm constraint. Our experiments
on six languages (Bulgarian, Danish, English, Portuguese, Spanish, Turkish) shows dramatic
improvements over state-of-the-art results: 11% average increase in accuracy. Furthermore,
using the learned clusters in a semi-supervised setting cuts the amount of needed labeled data
to achieve a given error level by over a half.
1.1
Thesis Contributions
The main contributions of this thesis are:
4
CHAPTER 1. INTRODUCTION
• a new learning framework, that exports a rich language to express prior knowledge about
the hidden variables via posterior regularization;
• an efficient and simple algorithm for parameter estimation for the proposed learning
framework;
• a state of the art system for SWA with a detailed analyses of the results;
• a state of the art system for fully unsupervised PoSI with a detailed analyses of the results.
This thesis also contributes with two important resources for the community:
• manually annotated word alignments among six European languages freely available at
https://www.l2f.inesc-id.pt/wiki/index.php/Word_Alignments;
• The PR toolkit, an extensive software package that contains the code to reproduce all the
experiments in this thesis as well as several other applications. The toolkit is available
under GPL at http://code.google.com/p/pr-toolkit/.
1.2
Document Outline
This document is organized as follows: in the next chapter we lay down the background for
the remainder of the document. We then describe the probabilistic model (HMM), the baseline
learning algorithm (EM), and different ways of finding the best hidden structure given a model
and a set of parameters. We proceed by introducing the two case studies: SWA and PoSI.
Chapter 3 introduces the PR Framework. It describes the PR objective and how to calculate it efficiently. Then we present a learning algorithm for this objective similar to the EM
algorithm, and the application of PR to the two case studies. We conclude the chapter with a
description of work related with PR.
Chapter 4 presents a detailed evaluation and analysis of the results for the SWA task and
Chapter 5 presents a detailed evaluation and analysis of the results for the PoSI task. Finally,
Chapter 6 concludes and discusses directions for future work.
2
Background
In this chapter we introduce the mathematical framework used throughout this document,
and the baseline learning framework to which we shall compare our results. We describe two
case studies and their problems when the baseline learning framework is used.
The problem setting is the following: let X = x̄1 . . . x̄D be a training set of independent
and identically-distributed random variables. In this work x̄d (for notation simplicity we will
drop the superscript d when considering an isolated example) corresponds to a sentence (or
a pair of sentences from parallel corpora, but for simplicity we will use the single sentence
case to introduce the general notation) in natural language and decomposes as a sequence
x̄ = x1 . . . xN , of observations of length N . Each xn is a discrete random variable, in our case
a word, taking a value v from a finite vocabulary V. Each x̄ has an unknown hidden structure
z̄ that we want to predict. In this work the structures are sequences z̄ = z1 . . . zN of the same
length N as the observations. Each hidden state zn is a discrete random variable and can take
a value h from a discrete vocabulary H.
Notation
X
D
x̄ = x1 . . . xN
N
xi
V
|V|
vi
z̄ = z1 . . . zN
zi
H
|H|
hi
training set
number of training examples
observation sequence
size of the observation sequence
observation at position i in the sequence
observation values set
number of distinct observation types
particular observation, i ∈ |V|
hidden state sequence
hidden state at position i in the sequence
hidden states value set
number of distinct hidden value types
particular hidden value, i ∈ |H|
Table 2.1: General notation used in this document
This work focuses on the problem of unsupervised learning of linguistic structure. We can
break down this problem into 3 stages:
• Modeling - This step consists of choosing a model that explains the generative process
by which language and the corresponding linguistic structure arise. We will work with
families of parametric generative probabilistic models, pθ (x̄, z̄), where θ are the model
6
CHAPTER 2. BACKGROUND
parameters, and each setting of θ defines a particular model. For both case studies we
will use an HMM model that we describe in Section 2.1.
• Learning - Given a model family and a training corpus X , the goal of the learning step
is to find the parameters that maximize some objective function, normally the likelihood
of the observed data. This objective is maximized using the Expectation Maximization
(EM) algorithm, the most common approach for training models with hidden variables.
In Section 2.2 we describe the EM algorithm.
• Decoding - Given a model and a setting of parameters θ, find the “best” structure for a
particular observation sequence x̄. Decoding will be addressed in Section 2.3.
To make the problem setting more concrete we present a brief overview of the two particular applications used in this work: SWA and PoSI.
In SWA we are given a training set X , where each training example x̄ is a pair of sentences:
¯ = xs1 . . . xsM of length
a target sentence x̄t = xt1 . . . xtN of length N and a source sentence xs
M . Each xtn is a word vt from a target language vocabulary Vt and each xsm is a word vs from
a source vocabulary Vs. The hidden structure z̄ represents a mapping between target words
and source words, indicating that the two words are a translation of each other. Section 2.4
describes this application in detail, the baseline model used and its major problems.
In PoSI we are given a training set X where each training example is a sentence x̄ =
x1 . . . xN of length N . Each xn is a word v from a vocabulary V. We are given a set of clusters H, and the task is to assign each word in x̄ to a given cluster hv. Section 2.5 describes the
task in detail, the baseline model and its problems.
2.1
2.1.1
Modeling - Hidden Markov Models
Overview
The Hidden Markov Model (HMM) is one of the most common sequence probabilistic
models, and has been applied to a wide variety of tasks. More generally, an HMM is a particular
instance of a chain directed probabilistic graphical model, or a Bayesian network. In a Bayesian
network, every random variable is represented as a node in a graph, and the edges in the graph
are directed and represent probabilistic dependencies between the random variables. For an
HMM, the random variables are divided into two sets, the observed variables, in our case
x, and the hidden variables, in our case z. In the HMM terminology, the observed variables
are called observations, and the hidden variables are called states (an excellent description
of the HMM model is given in Rabiner [1989] ). To ground the definitions and simplify the
explanation we will use a small example displayed graphically in Figure 2.1 (the notation is
summarized in Table 2.2).
A first order HMM model has the following independence assumptions over the join distribution pθ (x̄, z̄):
• Independence of previous states - the probability of being in a given state hl at position
zi only depends on the state hm of the previous position zi−1 , that is pθ (zi = hl | zi−1 =
hm , zi−2 . . . z1 ) = pθ (zi = hl | zi−1 = hm ), defining a first order Markov chain. The order
2.1. MODELING - HIDDEN MARKOV MODELS
7
HMM Example
x̄
N =4
i, j
V = {we, know, the, way}
p, q
xi = vq , xj = vp
H = {a, b, c}
l, m
zi = h l , zj = hm
observed sentence “we know the way”
observation length
positions in the sentence: i, j ∈ {1 . . . N }
observation vocabulary
indexes into the vocabulary p, q ∈ |V|
observation at position xi (xj ) has value vq (vp )
hidden values vocabulary
indexes into the hidden values vocabulary
state at position zi (zj ) has hidden value hl (hm )
Table 2.2: HMM notation for the running example.
a
a
b
c
we
know
the
way
Figure 2.1: HMM structure, for the simple running example.
of the Markov chain depends on the number of previous positions taken into account.
The remainder of the exposition can be easily extend to higher order HMM, giving the
model more expressiveness, but making inference harder.
• Homogeneous transition - the probability of making a transition from state hl to state
hm is independent of the particular position in the sentence, for all i, j ∈ N, pθ (zi = hl |
zi−1 = hm ) = pθ (zj = hl | zj−1 = hm ), so pθ (zi = hl | zi−1 = hm ) = pθ (hl | hm ).
• Observation independence - the probability of observing vq at position i is fully determined by the state at that position, that is pθ (xi = vq | zi = hl ), and this probability is
independent of the particular position, that is pθ (xi = vq | zi = hl ) = pθ (vq | hl ).
These conditional independence assumptions are crucial to allow efficient inference as will be
described. There is also a particular starting probability, the probability of starting at state
hl , which is modeled independently. The three probability distributions that define the HMM
model are summarized in Table 2.3. For each one of them we will use a short notation to
simplify the exposition. The parametrization of these distributions will be discussed in more
detail in Subsection 2.1.2. The joint distribution can be expressed as:
pθ (x̄, z̄) = πz1 bz1 (x1 )
N
�
i=2
azi−1 ,zi bzi (xi ),
(2.1)
8
CHAPTER 2. BACKGROUND
HMM distributions
Name
probability distribution
initial probability
pθ (z1 = hl )
transition probability pθ (zi = hl | zi−1 = hm )
observation probability pθ (xi = vq | zi = hl )
short notation
πl
am,l
bl (xi )
Table 2.3: HMM probability distributions.
a
a
a
a
b
b
b
b
c
c
c
c
Figure 2.2: Trellis representation of the HMM in Figure 2.1, for the observation sequence “we
know the way”, where each hidden variable can take the values a, b, c.
which for the example from Figure 2.1 is:
pθ (x̄, z̄) = πa ba (”we”)aa,a ba (”know”)aa,b bb (”the”)ab,c bc (”way”).
(2.2)
Besides the probability of a given observation sequence and hidden state sequence, we will be
interested in computing the total probability of an observed sequence, that is, “the likelihood”
of the sequence, which corresponds to summing the probability of all possible hidden state
sequences.
�
Likelihood : pθ (x̄) =
pθ (x̄, z̄).
(2.3)
z̄
The number of possible hidden state sequences is exponential in the length of the sentence
(|H|N ), which makes summing over all of them hard. Yet, we must be able to compute this sum
(sum over z̄) to compute the above likelihood formula. For the particular example in Figure 2.1
we explicitly represent all possible paths in Figure 2.2. Each column represents a position, and
each row represents a possible state. By following the arrows from position 1 to position 4 one
can transverse all possible 34 = 81 paths. However, there is a well know dynamic programming
algorithm, the Forward Backward algorithm (FB), that allows the computation to be performed
in linear time, by making use of the independence assumptions.
The FB algorithm relies on the independence of previous states assumption, which is illustrated in the trellis view by only having arrows between consecutive states. The FB algorithm
defines two auxiliary probabilities, the forward probability and the backward probability.
Forward Probability :
αi (hl ) = pθ (zi = hl , x1 . . . xi )
(2.4)
2.1. MODELING - HIDDEN MARKOV MODELS
9
The forward probability represents the probability that in position i we are in state zi = hl
and that we have observed x̄ up to that position. The forward probability is defined by the
following recurrence:
α1 (hl ) = πl bl (x1 )
�
αi (hl ) = 
am,l αi−1 (hm ) bl (xi )
(2.5)
(2.6)
hm ∈H
Considering our example, the probability of being in state hl at position 1 and having observed
the word “we” is just the initial probability of starting at state hl times the probability of observing the word “we” in state hl . When we advance to position 2, the probability of being in
state hl and observing the word “know” is the probability of observing the word “know” given
that we are in state hl times all possible ways we could have reached that state at position 2. We
could have reached hl at position 2 from any state at position 1, so it corresponds to the sum
over all possible states for position 1, times the probability of transitioning from each of these
to hl at position 2. At position 4, the last position of the sequence, the probability of being in
state hi and observed the last word “way” accounts for all possible paths up to the last position.
So, the likelihood of a given sequence can be computed by summing the forward probability
at the last position for every possible state.
�
�
pθ (x̄) =
pθ (x̄, z̄) =
αN (h).
(2.7)
z̄
h∈H
Although the forward probability is enough to calculate the likelihood of a given sequence,
we will also need the backward probability to calculate other quantities of interest. The backward probability, represents the probability of observing x̄ from position i + 1 up to N , given
that at position i we are at state zi = hl :
Backward Probability :
βi (hl ) = pθ (xi+1 ...xN |zi = hl ).
(2.8)
The backward recurrence is given by:
βN (hl ) = 1
�
βi (hl ) =
al,m bm (xi+1 )βi+1 (hm ).
(2.9)
(2.10)
hm ∈H
The backward probability is similar to the forward probability, but operates in the inverse direction. At position N there are no more observations, and the backward probability is set to
1. At position i the probability of having observed the future and being in state hl , is given
by the sum for all possible states of the probability of having transitioned from position i in
state hl to position i + 1 with state hm and observed xi+1 at time i + 1 and the future given by
βi+1 (zi+1 = hm ). With the FB probability one can compute the likelihood of a given sentence
using any position in the sentence.
�
�
pθ (x̄) =
pθ (x̄, z̄) = ∀i
αi (h)βi (h).
(2.11)
z̄
h∈H
Note that for time N, βN (h) = 1 and we get back to equation 2.7. Although redundant, this
10
CHAPTER 2. BACKGROUND
fact is useful when implementing an HMM as a sanity check that the computations are being
performed correctly, since one can compare the likelihood at each position that should be the
same. The FB algorithm may fail for long sequences since the nested multiplication of numbers
smaller than 1 may easily become smaller than the machine precision. To avoid that problem,
Rabiner [1989] presents a scaled version of the FB algorithm that avoids this problem.
In order to complete the three stages of unsupervised learning described in the beginning
of this chapter (modeling, learning, decoding), we will need to calculate a few other probabilities. These are: a) the sequence posterior distribution, that is, the probability of a particular
hidden state sequence given that we have observed a particular sentence; b) the state posterior
distribution, corresponding to the probability of being in a given state in a certain position
given the observed sentence; c) the transition posterior distribution, which is the probability
of making a particular transition, from position i to i + 1 given the observed sentence. In Section 2.2 it will become clear why these quantities are necessary. All these quantities are easily
computed after running the FB algorithm:
Sequence Posterior :
State Posterior :
Transition Posterior :
pθ (z̄ | x̄) =
pθ (x̄, z̄)
;
pθ (x̄)
γi (hl ) = pθ (zi = hl | x̄) =
(2.12)
αi (hl )βi (hl )
;
pθ (x̄)
ξi (hl , hm ) = pθ (zi = hl , zi+1 = hm | x̄) =
(2.13)
αi (hl )al,m bm (xi+1 )βi+1 (hm )
.
pθ (x̄)
(2.14)
Since we are working in an unsupervised setting, we never get to observe the hidden state
sequence. Instead, given a training set X = {x̄1 . . . x̄D }, we will need to collect sufficient statistics, or expected counts that represent the expected number of times that each hidden variable
is expected to be used with the current parameters setting. This sufficient statistics will then be
used during learning as fake observations of the hidden variables. The sufficient statistics are:
a) initial counts, which we denote as ic(hl ) and counts the expected number of times we pick
state hl as the initial state; b) transition counts, denoted as tc(hl , hm ) and counts the expected
number of times a transition from state hl to state hm was performed; c) state counts denoted
as sc(vq , hm ), that counts the number of times in state hm and seeing observation vq . Using the
posteriors described above, these quantities can be computed by the following formulas:
Initial Counts :
ic(hl ) =
D
�
(2.15)
γ1 (hl );
d=1
Transition Counts :
tc(hl , hm ) =
State Counts :
sc(vq , hm ) =
D N
−1
�
�
ξi (hl , hm );
d=1 i=1
D
N
�
�
γi (hm ).
(2.16)
(2.17)
d=1 i=1,xi =vq
These formulas correspond to summing for all instances in the training corpus D, the posterior
probability (the cumulative probability over all possible hidden states sequences) of observing
a given event.
2.1. MODELING - HIDDEN MARKOV MODELS
11
short notation probability distribution |parameters|
constraint
�
πj
pθ (z1 = hj )
|H| - 1
π = 1;
� h∈H j
2
al,m
pθ (zi = hl | zi−1 = hm )
(|H| − 1)
a = 1;
� hl ∈H m,l
|H|
bq (l)
pθ (xi = vq | zi = hl )
(|V| − 1)
vq ∈V bq (l) = 1.
Table 2.4: Multinomial parametrization of the HMM distributions.
2.1.2
HMM Parameters
So far we have not committed to any form for the probability distributions, πl , am,l and
bl (xi ). In both applications addressed in this thesis, both the observations and the hidden
variables are discrete. The most common approach is to model each of these probability distributions as multinomial distributions, summarized in Table 2.4. Multinomial distributions are
attractive for several reasons: first of all they are easy to implement; secondly the estimation
of the maximum likelihood parameters has a simple closed form. Using the sufficient statistics
from the previous section, the parameter estimates are:
π̂l = �
âl,m = �
b̂l (vq = o) = �
ic(hl )
hm ∈H ic(hm )
tc(hl , hm )
hm ∈H tc(hl , hm )
sc(vq , hl )
vp ∈V sc(vp , hl )
(2.18)
(2.19)
(2.20)
Computing the new parameters only involves normalizing the expected counts to turn them
into proper probabilities. A major drawback of using these distributions is that each observation is treated atomically and it does not allow the use of features, such as morphology and
orthography of words. Also, given that each word is treated atomically there is no way to
reduce the number of parameters in the model, while at the same time model all observations.
In this thesis, in particular in the PoSI application, we will replace the multinomial observation probability by a maximum entropy distribution,
exp(f (vq , hl ) · θ)
pθ (vq | hl ) = �
,
vp exp(f (vp , hl ) · θ)
(2.21)
where f (vq , hl ) is a feature function defined over observation and hidden state values, and θ
are the parameters. If we only use an identity feature per each observation and hidden state
pair then both models can represent the same distributions. However, the advantage of this
parametrization is that it enables us to define different features for each observation and to
share some of those features between different observations, for instance word suffixes and
orthography features (capitalization, numeric values). Moreover, since we can share features
between observations, we can reduce the total number of parameters by discarding word identity features and relying on the shared features. We will denote the HMM model with maximum entropy observation distribution by HMM+ME to distinguish it from the HMM with
multinomial distribution.
12
CHAPTER 2. BACKGROUND
The drawback of using the HMM+ME model is that there is no closed form solution to
estimate the maximum likelihood parameters. Instead we need to solve the following unconstrained optimization problem:
��
θ̂ = arg min
pθ (x̄, z̄)
(2.22)
θ
= arg min
θ
= arg min
θ
x̄∈X
z̄
� �
vq ∈V hl ∈H
� �
vq ∈V hl ∈H
(2.23)
sc(vq , hl ) · log(pθ (vq |hl ))
sc(vq , hl )(θ · f (vq , hl ) − log(
�
vp ∈V
exp(θ · f (vp , hl )))),
(2.24)
where we are finding the parameters that maximize the joint likelihood of both x̄ and z̄ for all
the corpus. Note that pθ (vq | hl ) defines a probability distribution over word values and hidden
state values and does not depend on the particular context where they occur. So the sum over
all the corpus amounts to iterating over all pairs of word values and hidden state values and
counting how many times they appear in the corpus. Furthermore, since we are focusing on the
unsupervised learning setting we do not have the values for the hidden variables, so instead
we will use the expected counts described earlier. This can be seen as if we had observed a
given pair word value and hidden state value sc(vq , hl ) times.
When training a maximum entropy model it is usual to add a prior on the parameters to
avoid overfitting. This is more important in the maximum entropy model than in the multinomial model, since the parameters on the maximum entropy model are unconstrained, as
opposed to the multinomial parameters that have to be positive and sum to one (since they
correspond to a probability distribution). We add a Gaussian prior to the parameters which
corresponds to adding a squared Euclidean norm penalty on the parameters multiplied by
the inverse of the Gaussian variance, σ1 ||θ||22 that discourages the parameters from having big
values. The estimation procedure becomes:
θ̂ = arg min
θ
∂
and its gradient is ∂θ
=
�
vq ∈V
�
��
x̄∈X
z̄
pθ (x̄, z̄) +
1
||θ||22
σ
(2.25)
�
�
�
sc(v
,
h
)
f
(v
,
h
)
−
p
(x
=
v
|z
=
h
)
·
f
(v
,
h
)
q l
q l
p
p l +
l
hl ∈H
vp ∈V θ
To solve this optimization problem we use a gradient based method with direction, given
by the Limited Memory Quasi Newton method and we pick the step size using the strong
Wolfe’s rule [Nocedal and Wright, 1999].
2
σ θ.
Using this parametrization we still have a regular HMM, which is different from the Maximum Entropy Hidden Markov (MEMM) model, described in McCallum et al. [2000], since the
HMM+ME is modeling the probability of a word given a hidden state, which is the opposite of
what is done in MEMMs.
Note, that we still have
� a proper probability distribution of observation values given hidden state values, that is vq ∈V pθ (vq | hl ) = 1, we just have a different set of parameters θ. This
is the biggest difference between this model and a Markov random field (MRF). An MRF is an
undirected graphical model and the joint probability of an observation x̄ and a state sequence
z̄ is given by:
�N
ψ(hi , hi−1 )ψ(hi , vi )
pθ (x̄, z̄) = i=1
,
(2.26)
Z
2.2. LEARNING - EXPECTATION MAXIMIZATION ALGORITHM
13
where Z is the normalization constant and ψ(hi , hi−1 ) and ψ(hi , vi ) are non-negative (transition
and observation) potentials, and as opposed to the HMM, the observation and transition potentials need not to be normalized, since the MRF only needs to be normalized globally. Even
using the same features, the two models differ in the degrees of freedom, since the HMM+ME
needs to be locally normalized.
2.2
Learning - Expectation Maximization Algorithm
Given the parametric model pθ (x̄, z̄) and a training corpus X of D sentences x̄1 . . . x̄D ,
training seeks model parameters θ that minimize the negative log-likelihood of the corpus:
�
� log pθ (x̄)] = E[−
� log
Negative Log Likelihood : L(θ) = E[−
pθ (x̄, z̄)],
(2.27)
z̄
� (x̄)] =
where E[f
corpus.
1
D
�D
i=1 f (x̄
i)
denotes the empirical average of a function f over the training
Because of the hidden variables z̄, the likelihood term contains a sum over all possible hidden structures inside of a logarithm, which makes this quantity hard to compute. The most
common minimization algorithm to fit the model parameters in the presence of hidden variables is the Expectation Maximization (EM) algorithm [Dempster et al., 1977]. The EM procedure can be thought of intuitively in the following way: If we observe the hidden variables’
values for all sentences in the corpus, then we could easily compute the maximum likelihood
value of the parameters as described in Subsection 2.1.2. On the other hand, if we had the model
parameters we could compute the posterior distributions and collect the sufficient statistics.
The EM procedure starts with an initial guess for the parameters θ0 at time t = 0. The algorithm iterates for T iterations until it converges to a local minima of the negative log likelihood,
and each iteration is divided into two steps. In the first step - “E Step” (Expectation) - given the
current parameter values θt and the observed variables it computes the posteriors for the hidden variables pθ (z̄ | x̄). In the case of the HMM this requires only to run the FB algorithm. The
second step - “M step” (Maximization) - uses pθ (z̄ | x̄) to “softly fill in” the values of the hidden variables z̄, and collects the sufficient statistics, initial counts (Eq: 2.15), transition counts
(Eq: 2.16) and state counts (Eq: 2.17) and uses those counts to estimate maximum likelihood
parameters θt+1 as described in Subsection 2.1.2. The EM algorithm is guaranteed to converge
to a local minimum of L(θ) under mild conditions [Neal and Hinton, 1998]. Note that we are
not committing to the best assignment of the hidden variables, but summing the occurrences of
each parameter weighed by the posterior probability of all possible assignments. This modular
split into two intuitive and straightforward steps accounts for the vast popularity of EM (see
Algorithm 1).
More formally, EM minimizes L(θ) via block-coordinate descent on an upper bound F (q, θ)
14
1
2
3
4
5
CHAPTER 2. BACKGROUND
for t = 1..T do
for each training sentence x̄ do
E-Step: q t+1 (z̄ | x̄) = pθt (z̄|x̄) ;
end
�
�
�
�
M-Step: θt+1 = arg max E
q t+1 (z̄ | x̄) log pθ (x̄, z̄)
θ
6
z̄
end
Algorithm 1: EM algorithm.
using an auxiliary distribution over the latent variables q(z̄ | x̄) [Neal and Hinton, 1998]:
�
�
�
� − log
L(θ) = E
pθ (x̄, z̄)
(2.28)
�
z̄
�
�
�
�
p
(x̄,
z̄)
p
(x̄,
z̄)
θ
θ
� − log
� −
= E
q(z̄ | x̄) ∗
≤E
q(z̄ | x̄) log
q(z̄
|
x̄)
q(z̄
| x̄)
z̄
z̄
�
�
�
q(z̄ | x̄)
�
= E
q(z̄ | x̄) log
= F (q, θ),
pθ (x̄, z̄)
z̄
�
(2.29)
(2.30)
where we have multiplied and divided the pθ (x̄, z̄) by the same quantity q(z̄ | x̄), and the lower
bound comes from applying Jensen Inequality (Equation 2.29). F (q, θ) is normally referred to
as the energy function, which comes from the physics field and refers to the energy of a given
system that we want to minimize.
�
�
�
q(z̄
|
x̄)
�
EM Upper Bound : L(θ) ≤ F (q, θ) = E
q(z̄ | x̄) log
.
(2.31)
pθ (x̄, z̄)
z̄
The alternating E and M steps at iteration t + 1 can be seen as minimizing the energy function
first with respect to q(z̄ | x̄) and then with respect to θ:
E:
q t+1 (z̄ | x̄) = arg min F (q, θt ) = arg min KL(q(z̄ | x̄) || pθt (z̄ | x̄)) = pθt (z̄ | x̄); (2.32)
q(z̄|x̄)
M:
q(z̄|x̄)
�
θt+1 = arg min F (q t+1 , θ) = arg max E
θ
θ
�
�
z̄
�
q t+1 (z̄ | x̄) log pθ (x̄, z̄) ;
(2.33)
q(·)
where KL(q||p) = Eq [log p(·)
] is the Kullback-Leibler divergence. The KL term in the E-Step
results from dropping all terms from the energy function that are constant for a set θ, in this
case the likelihood of the observation sequence pθ (x̄):
2.3. DECODING
�
z̄
q(z̄ | x̄) log
15
�
q(z̄ | x̄) �
=
q(z̄ | x̄) log q(z̄ | x̄) −
q(z̄ | x̄) log pθ (x̄, z̄)
pθ (x̄, z̄)
z̄
z̄
�
�
=
q(z̄ | x̄) log q(z̄ | x̄) −
q(z̄ | x̄) log pθ (x̄)pθ (z̄ | x̄)
z̄
=
�
z̄
(2.35)
z̄
q(z̄ | x̄) log
q(z̄ | x̄)
− log pθ (x̄)
pθ (z̄ | x̄)
= KL(q(z̄ | x̄)||pθ (z̄ | x̄)) − log pθ (x̄).
2.3
(2.34)
(2.36)
(2.37)
Decoding
Given the learned parameters and an observation x̄ we want to find the best hidden state
sequence z̄∗ . There are several ways to define what we mean by the best z̄∗ , for instance, the
best assignment to each hidden variable zi and the best assignment to the sequence z̄ as a
whole. The first way, normally called Posterior decoding consists in picking the highest state
posterior for each position in the sequence:
z̄∗ = arg max γi (zi ).
z1 ...zN
(2.38)
This method does not guarantee that the sequence z is a valid sequence of the model. For
instance there might be a transition probability between two of the best node posteriors with
probability zero.
The second approach called Viterbi decoding consists in picking the best global hidden
state sequence z̄.
z̄∗ = arg max pθ (x̄, z̄).
(2.39)
z̄
This method is very similar to the forward procedure of the FB algorithm, making use of the
same trellis structure to efficiently represent and use all the exponential number of sequences.
In fact the only difference from the forward algorithm is in the recursion 2.6 where instead of
summing over all possible hidden state assignments, we take their maximum.
2.4
Statistical Word Alignments
The seminal work of Brown et al. [1993] introduced a series of probabilistic models (IBM
M1-5) for statistical machine translation (MT) and the concept of “word-by-word” alignment,
the correspondence mapping between words in source and target languages. Although no
longer competitive as end-to-end translation models, IBM models, as well as the HMM of Vogel
et al. [1996], are still widely used for word alignment. Word alignments are used primarily
for extracting minimal translation units for machine translation, such as phrases in phrasebased translation systems [Koehn et al., 2003] and rules in syntax-based machine translation
[Galley et al., 2004; Chiang et al., 2005], as well as for MT systems combination [Matusov et al.,
2006]. Furthermore, their importance has grown far beyond machine translation: for instance,
for transferring annotations, between languages by projecting PoS taggers, NP chunkers and
16
CHAPTER 2. BACKGROUND
parsers [Yarowsky and Ngai, 2001; Rogati et al., 2003; Hwa et al., 2005; Ganchev et al., 2009],
discovery of paraphrases [Bannard and Callison-burch, 2005; Callison-Burch, 2007, 2008], and
joint unsupervised PoS and grammar induction across languages [Snyder and Barzilay, 2008;
Snyder et al., 2009].
Word alignment for a parallel sentence pair represents the correspondence between words
in a source language and their translations in a target language. Due to the inherent ambiguity
in word alignments when performing manual annotations. it is common to distinguish two
kinds of alignments [Och and Ney, 2003]: sure alignments, for unambiguous translations, and
possible alignments, for ambiguous translations. These annotations are required for evaluating
the alignments produced by different models.
0
0
1
2
3
4
5
6
7
8
9
1
2
3
•
•
•
•
•
•
4
5
6
7
8
9
•
•
•
•
•
•
•
•
•
i
disagree
with
the
argument
advanced
by
the
minister
.
je ne partage pas le avis de le ministre .
Figure 2.3: Human-annotated alignment between the English sentence i disagree with the argument advanced by the minister. and the French sentence je ne partage pas le avis de le ministre. Sure
alignments are squares with borders; possible alignments are squares without borders.
Figure 2.3 shows a gold alignment example between an English and a French sentence.
The following notation will be used during this thesis: Sure alignments are represented in the
figure as squares with borders, possible alignments are represented as squares without borders.
Rows contain the source words and columns contain the target words. This figure shows the
difficulty of the alignment task, where 3 different blocks of possible alignments exist with no
direct one to one correspondence (e.g. ne partage pas, disagree with).
There are many reasons why a simple word-to-word (1 to 1) correspondence is not possible for every sentence pair: for instance, auxiliary verbs used in one language but not the
other (e.g. English He walked and French Il est allé), articles required in one language but optional in the other (e.g. English Cars use gas and Portuguese Os carros usam gasolina), cases
where the content is expressed using multiple words in one language and a single word in
the other language (e.g. agglutination such as English weapons of mass destruction and German
Massenvernichtungswaffen), and expressions translated indirectly.
Going back to the problem setting of the beginning of the chapter, given a parallel corpus
2.4. STATISTICAL WORD ALIGNMENTS
17
X = {x̄1 . . . x̄D } each x̄ consists of a target sentence x̄t = xt1 . . . xtN of length N and a source
¯ = xs1 . . . xsM of length M , the goal of statistical machine translation (SMT) is given
sentence xs
¯ to find the most likely translation x̄t in the target language moda source language sentence xs,
¯ It’s not feasible to model pθ (x̄t | xs)
¯ directly so different decompositions
eled by pθ (x̄t | xs).
have been proposed: e.g. word based [Brown et al., 1993], phrase based [Koehn et al., 2003],
syntax based [Galley et al., 2004].
The first models proposed were word based translation models [Brown et al., 1993]. These
models translate a source sentence into a target sentence by translating each individual source
word at a time. In this thesis we will work with the simplest IBM model, IBM M1 and with the
HMM model [Vogel et al., 1996]. These models have the following generative story:
¯ of length M , choose the target sentence length N with proba1. Given a source sentence xs
bility p(N |M );
2. For each target sentence position n, choose a position on the source sentence m = zn
¯
based on the previous history, with probability p(zn = m | zn−1 = l, xs);
3. For each target position n, choose a target word identity based on the source word at
¯
position m with probability p(xtn = v | xsm , xs).
Since the hidden state values correspond to the positions in the source sentence we represent
their values hl , just by the integer representing the position l to simplify the notation. To account for words that appear in the target sentence and that have no correspondence in the
source sentence, an extra source word, denoted “null word”, is added to the source sentence.
Step two of the generative story introduces the hidden variable z̄ = z1 . . . zN of the same length
as the target sentence, where for each position n has as value the position of the aligned word
in the source sentence zn = m, where m ∈ 1 . . . M . Since we are not dealing with machine
translation, but with word alignment we always get to observe both the target sentence and
the source sentence, so step 1 of the generative story is ignored and we can ignore the conditioning on the source sentence during the alignment process. With this generative story the
alignment process is a sequence model as described in Section 2.1. However, this generative
story is a gross oversimplification of the true alignment process. For instance, it only allows
(1-n) alignments, meaning that each source word can be aligned to several target words, by
picking the same m at step 2, but each target word can only be aligned to a source word. This
is far from real, for instance Figure 2.3 has 3 examples of (n-n) alignments. Moreover, these
models are not symmetric, the alignments produced depend on which language is picked to be
the source.
2.4.1
Models
In this thesis we focus on the hidden Markov model (HMM) for word alignment proposed
by Vogel et al. [1996]. Following the description of subsection 2.1 the probability of an align¯ can be expressed as:
ment z̄ and target sentence x̄t given a source sentence xs
¯ = pθ (x̄, z̄) = pθ (z1 )
pθ (x̄t, z̄ | xs)
N
�
i=1
pθ (zi | zi−1 )pθ (xti | zi ).
(2.40)
18
CHAPTER 2. BACKGROUND
Since both sentences are observed and we are only trying to model the possible alignments h,
¯ always has the same value (is a constant) and does not influence the pθ (x̄t, z̄ |
the variable xs
¯ So we can simplify the expression and drop the dependence on xs.
¯
xs).
There are several important standard details on the parametrization of the model: The distortion probability pθ (zi = m | zi−1 = l) depends only on the distance (m − l) between the
source positions the states represent. Only distances in the range ±5 are modeled explicitly,
with larger distances assigned equal probabilities. This simplification is done to reduce the
number of parameters of the model, and because we are interested in adding the bias to the
model that consecutive words in one language tend to align to consecutive words in the other
language, but we are not really interested in the absolute positions of the words. To incorporate null-links, we add an observation probability given null: pθ (xi | zm = hnull ). Null links
also need to maintain position information so that we can still measure the distance between
aligned words. To implement this, we create position-specific null hidden states for each source
position, and set pθ (l|nullm ) = 0, this is we can only move to the null word on the same position
as the previous word, and pθ (nulll |nullm ) = 0 null states can only transition to the same null
state or non null states.
In this work we will also use the simpler model IBM M1 as a way to initialize the parameters of the HMM model. IBM M1 assumes that the translation of words is independent of their
particular position in the sentence. The joint probability for an alignment and a target sentence
given a source sentence for IBM M1 is given by:
¯ = pθ (x̄, z̄) =
pθ (x̄t, z̄ | xs)
N
�
i=1
pθ (xti | zi ),
(2.41)
For a full description of the different IBM models we refer the reader to the original IBM
paper [Brown et al., 1993]. For a good tutorial on Word Alignments and EM applied to word
alignments we refer the reader to Kevin Knight excellent tutorial, “A Statistical MT Tutorial
Workbook” available from the authors website1 .
2.4.2
Decoding
Besides the decoding options described in Section 2.3, another possibility that often works
better for word alignments is to use Minimum Bayes-Risk (MBR) decoding [Kumar and Byrne,
2002; Liang et al., 2006; Graça et al., 2007]. Using this decoding we include an alignment link
n − m if its state posterior probability is above some threshold. This allows the accumulation of probability from several low-scoring alignments that agree on one alignment link. The
threshold is tuned in some small amount of labeled data, in our case the development set to
minimize some loss. Kumar and Byrne [2002] study different loss functions that incorporate
linguistic knowledge, and show significant improvements. Note that this could potentially result in an alignment having zero probability under the model, since many-to-many alignments
can be produced. MBR decoding has several advantages. First, independently of the particular choice of the loss function, by picking a specific threshold one can trade-off precision and
recall of the predicted word alignments. In fact, in this work when comparing different alignment models we do not commit to any loss function but instead compare precision vs recall
1
http://www.isi.edu/ knight/#pubs
2.4. STATISTICAL WORD ALIGNMENTS
19
curves, by generating alignments for different thresholds (0..1). Second, with this method one
can ignore the null word probabilities which are normally poorly estimated.
2.4.3
Corpora
We use six different language pairs during the development of this thesis. The corpora
are the Hansard corpus [Och and Ney, 2000] of English/French Canadian Parliamentary proceedings (En-Fr), and from the Europarl corpus [Koehn, 2002] we use the following language
pairs English/Portuguese (En-Pt), Portuguese/Spanish (Pt-Es), Portuguese/French (Pt-Fr) and
Spanish/French (Es-Fr). In order to evaluate the quality of generated word alignments one
needs manually annotated word alignments. For the Hansard corpus we use the annotations
defined by Och and Ney [2000] and for the English/Spanish we use the annotations from EPPS
[Lambert et al., 2005] (En-Es). During this thesis we also developed annotations for six different
language pairs of the Europarl corpus (all combinations between Portuguese, English, French
and Spanish) [Graça et al., 2008], which are freely available2 . For all these languages we used
the first 100 sentences of the Europarl common test set defined in Koehn et al. [2003], which
are taken from Q4/2000 portion of the data (2000-10 to 2000- 12). The common test set can be
download from Europarl archives 3 .
Different annotation guidelines and annotation processes can lead to different annotations.
This can make a substantial difference in the ability to evaluate word alignments. For instance
the models described previously are directional, thus unable to produce phrase alignment (nm alignments). As we will see different guidelines tend to generate more or less phrase alignments. Another important difference is the use of sure vs possible alignments. This makes
a significant difference, as the word alignment evaluation metrics described in the following
subsection only penalize the absence of sure alignments.
Table 2.5 shows the different challenges that each corpus presents. For example, En-Es has
longer sentences, which raise the ambiguity level of the alignment task, and smaller percentage of one-to-one alignments. Coping with non-bijective alignments is difficult because of the
added uncertainty about word fertility. The range of the percentage of sure alignments varies
between 79% for Es-Fr to 21% for En-Fr. Although a direct correspondence between words is
often not possible, the great majority of sure alignments are in fact one-to-one (86% - 98%). This
characteristic will be explored in this work.
2.4.4
Evaluation Metrics
There are several ways to evaluate the quality of word alignments. The extrinsic metric that
really matters is the quality of the end system for which the aligments were produced. However we still need intrinsic measures to optimize some parameters (for instance the threshold
when using posterior decoding) and compare different models. These intrinsic measures are
based on the manually built sets of sure aligned points S and possible aligned points P . These
sets are compared against the set of alignments A produced by a word alignment model.
Figure 2.4 shows an example of a predicted word alignment together with the human annotation. The set S corresponds to the squares with borders, the set P corresponds to all squares
2
3
https://www.l2f.inesc-id.pt/resources/translation/
http://www.statmt.org/europarl/archives.html
20
CHAPTER 2. BACKGROUND
Corpus Sentence Pairs Ave Length Max Length % Sure
En/Fr
447
16/17
30/30
21%
En/Es
400
29/31
90/99
67%
En/Pt
60
11/11
20/20
54%
Pt/Es
60
11/11
20/20
69%
Pt/Fr
60
11/12
20/20
77%
Es/Fr
60
11/12
20/20
79%
% 1-1
98%
86%
91%
92%
88%
87%
Table 2.5: Word Alignment test corpora statistics: The last column shows the percentage of 1
to 1 alignments (for S links only).
0
1
2
3
4
5
0
1
♣
♣
③
♣
♣
♣
♣
2
♣
♣
③ ♣
♣
♣
♣
cependant ,
3
4
♣
♣
5
♣
♣
♣
♣
6
♣
♣
7
♣
♣
but
then
③
♣
♣
♣
♣
♣ mr.
♣
③
③
③
♣
♣ baldwin
♣
♣
♣
♣
③
♣ said
♣
♣
♣
♣
♣
③:
m. baldwin avait ensuite dèclarè :
Figure 2.4: Example of a predicted word alignment. Red circles mean predicted points
and the set A corresponds to red circles.
We start by defining precision and recall with some modification to handle sure and possible
alignments:
|A ∩ P |
;
|A|
|A ∩ S|
Recall =
.
|S|
P recision =
(2.42)
(2.43)
A first thing to note is that recall only penalizes the absence of sure alignments. In this particular
example this alignment has a precision of 62.5% and a recall of 75%. Some common metrics
exist that combine precision and recall in different ways. The most common one is alignment
error rate (AER) [Och and Ney, 2003]
AER(S, P, A) = 1 −
|A ∩ S| + |A ∩ P |
.
|A| + |S|
(2.44)
In this particular example, the AER is 33%. Fraser and Marcu [2007b] criticize AER for having
low correlation with the usual Machine Translation metric (BLEU score Papineni et al. [2002]),
and suggest using a weighted F-measure:
F (α) =
|A ∩ S|
.
α|S| + (1 − α)|A|
(2.45)
2.4. STATISTICAL WORD ALIGNMENTS
21
where 0.1 ≤ α ≤ 0.4, which they show to correlate better with BLEU [Papineni et al., 2002].
One disadvantage of this metric is that the appropriate α is corpus-specific.
Other intrinsic metrics, such as the kind of AER calculated by Ayan and Dorr [2006] based
on phrases extracted from the alignments, also has poor correlation with translation performance. Note that all these metrics were evaluated against the BLEU metric used for machine
translation, and aim at achieving high correlation with its score. However, no studies were realized about the correlation of these metrics for other applications, such as transfer of annotations
between languages.
During this thesis we will show results mostly using precision and recall, and will use
the above metrics only when strictly necessary. One way to compare different models is for
instance to look at the precision recall curve. This is done by generating alignments using
posterior decoding for thresholds varying from 0-1 and plotting the different precision and
recall values.
Precision/Recall curves
1
0.9
recall
0.8
0.7
0.6
0.5
Example 1
Example 2
Example 3
0.4
0.4
[h]
0.5
0.6
0.7
0.8
0.9
1
precision
Figure 2.5: An example of a Precision Recall curve for 3 different models
Figure 2.5 shows an example of such a curve for different models. These curves contain
more information than using posterior decoding with a given threshold, no matter what metric
we are using (this would be just a single point in the curve). From this curve we can see that
Example 3 performs significantly worst than Example 1 and 2, and that Example 1 is better
than 2 when high precision levels are required. Another way to use these curves is to measure
the area under each curve, which gives us a single number to compare models. We use the
area under the graph to measure the performance of models during training. We stop training
when the area under the graph starts decreasing.
2.4.5
Baseline Problems
On the positive side, the HMM model is simple, with complexity of inference O(N × M 2 ).
However there are several problems with the model that arise from its directionality.
• Non-bijective: Multiple target words can be linked to a single source word, which is
rarely desirable. For instance, the model produces non-bijective links 22% of the time for
En-Fr instead of 2% that the gold corpus has.
22
CHAPTER 2. BACKGROUND
0
0 ♣
1 ♣
2 ♣
3 ♣
4 ♣
5 ①
6 ♣
1
2
3
4
5
6
7
8
♣
♣
♣
♣
①
♣
♣
♣
♣
♣
①
♣
♣
♣
♣
✈
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
✇
♣
♣
♣
♣
♣
③
♣
♣
♣
et que à
♣
♣
♣
♣
♣
♣
9
♣
0
0 ♣
♣ 1 ♣
♣ 2 ♣
♣ 3 ♣
♣ 4 ♣
♣ 5 ♣
③6 ♣
le lieu de provoquer la rupture ,
1
2
3
4
5
6
7
8
9
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
③
�
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
✉
③
③
♣
♣
o
♣ cau
♣
a
♣ inte
♣ sch
③
♣
♣
et que à
♣
③ ♣
♣
♣
♣
le lieu de provoquer la rupture ,
Figure 2.6: State posterior distributions for different models for an English and French sentence
pair. Left: EN→FR model. Right: FR→ EN model.
• Asymmetric: By switching the (arbitrary) choice of which language is source and which
is target, the HMM produces very different results. For example, intersecting the sets of
alignments produced by the two possible choices for source preserves less than half of
their union for both En-Fr and En-Pt.
Figure 2.6 shows an example of the state posterior distribution for the alignment between
an English and French sentence using the HMM model. Each entry in the matrix corresponds
to the probability of having the corresponding two words being a translation of each other.
Note that the state posterior distribution is a proper distribution so for a given source word the
sum over all target words has to sum to 1. The left figure shows the alignment in the English
to French direction where the rows are source words and columns are target words, while the
right figure shows the state posteriors on the opposite direction. The first observation is that the
posteriors are concentrated around particular source words (rare words occurring less than 5
times in the corpus) in both directions, instead of being spread across different words. This is a
well known problem when training using EM called the “garbage collector effect” [Brown et al.,
1993]. A rare word in the source language aligns to many words in the target language that we
would ideally like to see unaligned, or aligned to other words in the sentence. The reason
why this happens is because the generative model has to distribute translation probability for
each source word among different candidate target words. If one translation is much more
common than another, but the rare translation is used in the sentence, the model might have
a very low translation probability for the correct alignment. On the other hand, since the rare
source word occurs only in a few sentences it needs to spread its probability mass over fewer
competing translations. In this case, choosing to align the rare word to all of these words leads
to higher likelihood than correctly aligning them or aligning them to the special null word,
since it increases the likelihood of this sentence without lowering the likelihood of many other
sentences.
2.4.6
♣
Related Work
Together with IBM M1 and IBM M2, Brown et al. [1993] developed a series of more complex
models IBM M3 to IBM M6. These models differ from the previous ones in two particular
ways. First the length of the target sentence is defined based on the identity of the words in
inst
2.5. PART OF SPEECH INDUCTION
23
the source sentence. For each source word there is an associated probability of generating a
particular number of target words. These provides a better model than to just pick the sentence
size based on the source sentence length. Furthermore, it allows to directly model how many
words should be generate by each source word, which better explains the translation process.
Nevertheless, these models are still 1 to n models. Secondly, there is a extra reordering step on
the target words at the end of the translation process. Moreover, these models are allowed to
generate spurious words, that correspond to the “null word” in the simpler models, but in this
case we are allowed to have more than one “null word”, which has been shown to improve
the quality of the resulting alignments [Moore, 2004]. IBM M3 and IBM M4 are deficient which
means that these models assigns probability mass to alignments that do not form a proper
target sentence. IBM M5 and IBM M6 solve this problem, however Brown et al. [1993] shows
that the added complexity of removing deficiency does not provide a significant improvements,
so IBM M4 is the default choice. These models do not decompose in a tractable way. For
instance for a target sentence of length 9 there are 362.880 possible reordering options (9!).
Which means that in practice besides being very slow to train, one needs to resort to several
heuristic approximations when summing over all possible alignments.
Since these models were developed several researches have work on improving them and
on creating new models. The HMM model due to its simplicity has been the target of many of
those improvements. Toutanova et al. [2002] proposes several extensions to the HMM model:
condition the translation probabilities with part of speech information; condition the transition
probabilities on the source word part of speech tags; try to capture fertility by modeling how
many words should be generated for each source word (this is performed by modeling if the
HMM should generate more words from the previous word or move to another word using
a probability over a boolean variable “stay”); better modeling the “null” alignments by making it a mixture model between the normal probability of translating a word to “null” and the
same probability conditioned on the following source word. These improvements lead to an
increase in performance when compared against the HMM model but the experiments were
only carried for small corpus, so there is no guarantee that they will hold for bigger corpus.
Deng and Byrne [2006] extend the HMM model to a Semi-HMM, where the state duration is
explicitly model. They use the state duration to model the fertility of each word. Again the
results are competitive with IBM M4 but not clearly better. Lopez and Resnik [2005] improve
the HMM by adding a distortion model based on the tree distance between words which offered a slightly gain over the surface distortion model. More recently, Fraser and Marcu [2007a]
proposes a new generative model capable of modeling non consecutive n-m alignments which
slightly outperforms IBM M4.
2.5
Part of Speech Induction
In this section we present the PoSI task. PoS tags are pre-requisite for many text applications. The task of PoS tagging where one is given a labeled training set of words and respective
tags is a well studied task with several methods achieving high prediction quality. On the other
hand the task of PoSI where one does not have access to a labeled corpus is a much harder task
with a huge space for improvement. PoSI has been used to refer to two very different tasks.
In one case, which we will refer to as weakly supervised, we are given a raw text and a dictionary of which tags each word can have, and the goal is to disambiguate, for a particular
word occurrence what tag should it have. This was the path started by the influential work
24
CHAPTER 2. BACKGROUND
of Merialdo [1994] and there has been a large amount of recent follow-up work. On the other
case, which we call fully unsupervised, we are given only the raw text along with sentence
boundaries and a predefined number of cluster we can use. In this setting, which is the one
we address in this work, the problem can be seen as a clustering problem. We want to cluster
words that behave grammatically in the same way on the same cluster. This line of work typically focus on distributional features, words with the same grammatical function tend to occur
in the same neighbourhood of different words, and morphological features, similar grammatical words tend to have common morphological features (suffixes, capitalization) or a mixture
or both [Schütze, 1995; Clark, 2003].
The fully unsupervised setting presents several extra difficulties: Firstly, one has to decide
on how many clusters to use. A problem is that due to the zipfian distribution true PoS tags, for
instance there are many more nouns and verbs than there are prepositions, statistical models
will prefer to use several clusters to explaining nouns in detriment of explaining classes with
few word occurrences as possessives. Another problem is related with the word tag ambiguity,
each word can have different PoS Tags according to the context, for instance the word run can
be either a verb or a noun. The first approaches proposed to solve the PoSI task dealt with
this problem by assuming that each word could only have one tag, in fact most approaches
modeled the tag of a word type independent of their position in the sentence using distributional clustering techniques [Schütze, 1995]. A more realistic option is to model the distribution
of tags for each word, for instance by using an HMM model whose hidden states values are
possible clusters and observations are the words, we follow this later approach. However, this
approach adds additional problems: training an HMM model to maximize the likelihood of
the data tends to assign a uniform distribution to the distribution of words given tags [Johnson, 2007], meaning that each word has some probability of being associated with all hidden
states, while in general each word can only be assigned to a small subset of the existing tags,
normally 1. In fact this problem was the main reason why people tend to use the dictionary
informed setting, bypassing this ambiguity problem. Several works fall between the two extremes of fully unsupervised and weakly unsupervised, where one has access to an incomplete
dictionary. Different approaches have applied this setting and present different ways to generalize the information contained in the given dictionary [Banko and Moore, 2004; Wang and
Schuurmans, 2005; Smith and Eisner, 2005; Ravi and Knight, 2009].
Formally, in the problem setting of fully unsupervised PoSI we are given a training set
X = x̄1 . . . x̄D of D training examples, where each example x̄ = x1 . . . xN is a sentence of N
words, whose values v are taken from a vocabulary V of possible word types. We are also given
the set of clusters H that we are allowed to use. The hidden structure z̄ = z1 . . . zN corresponds
to a sequence of cluster assignments for each individual word, such that zn = hl with hl ∈ H.
2.5.1
Models
The baseline model is a first order hidden Markov model (HMM) as described in Section
2.1. Where the observations x are the words in a sentence, and the hidden states z are cluster
assignments, from a pre-determined set of possible clusters H. For this application we explicit
model the final state distribution, since this provides a small consistent improvement. We
append each sentence, x̄ with a fake token representing the last word at position N + 1, and
augment the hidden state space with an extra value hv|H|+1 . The transition probabilities are
zero for going in or out of this new state, with the exception of the last sentence position, where
one can move into it.
2.5. PART OF SPEECH INDUCTION
En
Pt
Bg
Es
Dk
Tr
Sent.
49208
9359
14187
3512
5620
5512
25
Types
49206
29489
34928
16858
19400
18805
LUnk
49.28%
37.83%
39.26%
37.73%
26.30%
33.83%
Tokens
1173766
212545
214930
95028
65057
100238
Tags
17
22
12
47
30
25
Table 2.6: PoS corpora statistics. The third column shows the percentage of word types after
lower-casing and eliminating word types occurring only once.
2.5.2
Corpora
We explore POS induction for six different languages, using corpora with manual annotation only to evaluate the performance of various models. The characteristics of the corpora
are summarized in Table 2.6: the Wall Street Journal portion of the Penn treebank [Marcus
et al., 1993] using the 17 tags set defined by Smith and Eisner [2005] (En); the Bosque subset of
the Portuguese Floresta Sinta(c)tica Treebank [Afonso et al., 2002] (Pt); the Bulgarian BulTreeBank [Simov et al., 2002] (Bg) with the 12 coarse tags; the Spanish corpus from the Cast3LB
treebank [Civit and Martí, 2004] (Es); the Turkish METU-Sabanci treebank [Oflazer et al., 2003]
(Tr); and the Danish Dependency Treebank (DDT) [Kromann et al., 2003] (Dk).
2.5.3
Evaluation Metrics
To compare the performance of the different models one needs to evaluate the quality of
the induced clusters. To that end, we use two different kinds of evaluation. The most relevant
one is how useful are the induced clusters for different tasks that require PoS tags. In Section
5.6 we perform such evaluation. The task is semi-supervised PoS tagging where we have a
supervised tagger that only has access to a small amount of labeled data. We show that by
adding the clusters induced by the unsupervised model as additional features to the supervised
model, we boost the performance of the supervised model. The second evaluation approach
is to compare the induced clusters to the true tags of the corpus. To compare against the true
tags and compute accuracy we need to map the hidden states to the true PoS tags. The two
most common approaches in the literature are 1-Many mapping and 1-1 mapping [Haghighi
and Klein, 2006]. In the 1-Many mapping each hidden state is mapped to the tag with which
it co-occurs the most. This means that several hidden states can be mapped to the same tag,
and some tags might not be used at all. The 1-1 mapping greedily assigns each hidden state
to a single tag. In the case there are the same number of tags and hidden states this will give
a 1-1 correspondence. A major drawback of the latter mapping is that it fails to express all
the information of the hidden states. Typically, unsupervised models prefer to explain very
frequent tags with several hidden states, and combine some very rare tags. For example the Pt
corpus has 3 tags that occur only once in the corpus. Grouping these together but sub-dividing
nouns still provides a lot of information about the true tag assignments but this would not be
captured by the 1-1 mapping. Nevertheless, the 1-Many mapping also has drawbacks, since
it can only distinguish clusters based on their highest frequency tag. So having a cluster split
almost evenly between nouns and adjectives, or having a cluster with the same number of
nouns, but a mixture of words with different tags is judged the same. We also evaluate resulting
tags using mutual information (MI) between tags and hidden states. To compute this, imagine
26
CHAPTER 2. BACKGROUND
State 0
State1
State2
Sum
Noun
1463
100
987
2550
Verb
12
700
353
Adv
8
35
40
Sum
1483
835
1380
1-1:
59.6%
1-Many:
85.1%
State 0
State1
State2
Sum
Noun
2063
100
387
2550
1065
Verb
462
500
103
1065
83
Adv
8
35
40
83
Sum
2533
635
530
MI
0.29
1-1:
70.3%
1-Many:
79.8%
MI
0.14
Figure 2.7: PoS mapping example. Each entry in each matrix indicates the number of tokens
of a given hidden state (column) which belongs to a given true tag (row). Sums indicate the
number of tokens belonging to each category. The last row shows the accuracy according to
the different metrics.
randomly selecting a token in the corpus: the PoS tag of the token is one random variable, and
the hidden state is another random variable. MI is the mutual information between these two
random variables.
Figure 2.7 shows two examples of the clusters assign by an unsupervised model (columns)
and the corresponding true tags (rows). Each entry contains the number of tokens for a given
hidden state/tag pair. The last row contains the accuracy for each mapping. In the first case (left
example) we can see how the 1-1 maps state 2 to Adv although that cluster is modeling mostly
nouns and verbs, because it has already mapped nouns and verbs to other clusters. In the
second case we see how the 1-1 can be deceiving. It assign a higher score to that clustering than
to the first cluster, although the clusters are not well separated as in the first case. There is no
clear cluster for nouns, since a lot of verbs are assign to state 1 as well. But since this mapping
only uses the maximum for each tag and these are the same the performance is bigger. On the
other hand both 1-Many and MI prefer the first case. Although 1-Many mapping correlates
well with MI it still fails to capture all the information about the clusters as described before.
On the other hand its easier to analyse the results in term of 1-Many since the scale is more
familiar. We will use both of these metrics in the remainder of this thesis.
2.5.4
Baseline problems
Using the HMM model for this task has several disadvantages. First, it treats words atomically, without regard to orthography and morphology which is critical to generalization in
many languages [Clark, 2003]. Second, the model has a very large number of parameters: the
number of tags times the number of word types, which presents an extremely rich model space
capable of fitting irrelevant correlations in the data. We address these problems by switching to
the Maximum Entropy parametrization of the observation probability described in Subsection
2.1.2, and introduce several classes of features:
Word Identity (Words): Lower cased token, for words that occur at least 10 times;
2.5. PART OF SPEECH INDUCTION
10
10
Supervised
HMM
HMM+ME
8
4
2
Supervised
HMM
HMM+ME
8
L1L∞
6
L1L∞
27
6
4
2
0
0
0
1000 2000 3000 4000 5000 6000 7000 8000
rank of word by L1L∞
0
200 400 600 800 1000 1200 1400 1600 1800
rank of word by L1L∞
Figure 2.8: Differences of sparsity per word on the true tag distribution versus the HMM and
HMM-ME model trained using EM.
Suffixes (Suff.): suffixes of length 2-5 if they occur in at least 20 word forms;
Short Word (Size): word length ≤ 3;
Regular Expressions (Ortho.): for intial-capitalization, mix-cased, punctuation, contains-hyphen,
-underscore and -digit4 ;
Frequent (Counts): set iff the word is one of the top 10% most frequent words.
However, for both models HMM and HMM-ME the parameter that maximize the likelihood of the observed data allow a high level of ambiguity for the tag given word distribution:
common words tend to be associated with every tag with some probability, depending on the
context. By contrast, a natural property of POS categories across many languages and annotation standards is that each word only has a small number of allowed tags. To see this problem,
we show the sparsity differences between the true tags distributions and the clusters obtain
by the both HMM and HMM-ME models trained using the EM procedure. Figure 2.8 shows
the distribution of tag ambiguity across words for two corpora, where we see that both models
grossly overestimate the tag ambiguity of almost all words, compared with true data (Supervised).
2.5.5
Related Work
There is a long history of unsupervised part of speech induction using distributional features [Schütze, 1995], which is beyond the scope of this thesis. To the best of our knowledge,
Merialdo [1994] were the first to use a sequence model (a second order HMM) for POS induction. Merialdo [1994] use a dictionary containing the set of possible tags for each word in
order to provide some constraint to the model, effectively restricting somewhat its capacity.
Unfortunately the availability of a large dictionary is somewhat unrealistic. Much of the following work, tries to reduce the required dictionary size [Toutanova and Johnson, 2007; Smith
and Eisner, 2005; Goldwater and Griffiths, 2007; Haghighi and Klein, 2006]. Recently, Ravi and
Knight [2009] restrict the capacity of a first-order HMM tagger further by first inducing a list of
possible transitions and disallowing the model from using any others.
In a fully unsupervised setting, several works [Johnson, 2007; Gao and Johnson, 2008] attempt to restrict the tag ambiguity by using a Dirichlet prior to favor a sparse model and hence
4
Capitalization is only used for words that are sentence-initial less than 10% of the time.
28
CHAPTER 2. BACKGROUND
indirectly reduce tag ambiguity. Clark [2003] tries to capture morphological information in POS
induction using an HMM with observation probabilities also modeled as an HMM over characters. They find that modeling morphological information can produce better PoS clusters.
3
Posterior Regularization Framework
During the modeling stage a critical design decision concerns choosing between model
complexity and model expressiveness. Although one wishes to use a model that can explain
as closely as possible the particular application’s generative process, one has to restrict oneself
to models that are computationally tractable. This leads to models that are generally an oversimplification of the inherent process, as seen in the previous chapter. These models, due to
their simplifying assumptions, fail to learn the parameters settings that produce the desired
predictions.
One solution to this problem is to add more complexity to the model to better reflect the
underlying process. For example in the SWA application, this is the approach taken by IBM
models 4+ [Brown et al., 1993; Och and Ney, 2003], and more recently by the LEAF model
[Fraser and Marcu, 2007a]. Unfortunately, these changes make the models probabilistically
deficient and intractable, requiring approximations and heuristic learning and inference prone
to search errors.
Another solution to this problem is to move to a more Bayesian setting and express preferences about the parameters by adding prior distributions. This approach has shown positive
results for several tasks, namely grammar induction [Cohen and Smith, 2009; Headden III et al.,
2009]. However, preferences about the hidden variables is normally not easy to express directly
in terms of the parameters, since they depend on complex interactions of the parameters. In
fact for the PoSI task, the same intuition works well when express over the output variables
as will be shown in this work, but when expressed over the parameters fails to achieve good
results.
Instead, in this thesis we propose a new learning framework called Posterior Regularization (PR) [Graça et al., 2007] that incorporates prior knowledge directly on the output variables
during learning. The prior knowledge is expressed as inequalities on the expected value under
the posterior distribution of user-defined constraint features (not necessarily the same features
used by the model). Since in most applications we are interested in the hidden variables, constraining the posteriors allows a more direct way to achieve the desired behaviour. We show
evidence of this fact for the PoSI application. On the other hand, constraining the expected
value of the features instead of adding them to the model allows us to express features that
would otherwise make the model intractable. For example enforcing that each hidden state
value of an HMM model should be used at most once per sentence would break the Markov
property and make the model intractable. In contrast, we will show how to enforce that each
hidden state is used at most once in expectation by using PR. The underlying model remains
unchanged, but the learning method changes. During learning, our method is similar to the
EM algorithm with the addition of solving a problem similar to maximum entropy inside the
E-Step in order to satisfy the specified constraints.
In this thesis we focus the presentation of the PR framework on unsupervised learning of
30
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
generative models, in particular on the HMM model. However, the PR framework is more general and has been applied to semi-supervised settings, with both generative and discriminative
models [Ganchev et al., 2008], and to different probabilistic models such as tree models for the
task of grammar induction [Ganchev et al., 2009]. For a broader discussion of PR and different
applications see [Ganchev et al., 2009].
The next section presents the Posterior Regularization framework, then Section 3.2 presents
a learning algorithm for the PR objective, very similar to the EM algorithm described in the previous chapter. In Section 3.3 we will use PR to train models for the SWA application using the
following constraints: (i) bijectivity: “one word should not translate to many words”; and (ii)
symmetry: “directional alignments should agree”. In Section 3.4 we will use PR as the training procedure for the PoSI application, and propose a constraint that solves the high word/tag
ambiguity problem. Finally, Section 3.5 presents work related with the PR framework.
3.1
Posterior Regularization Objective
Prior knowledge in PR is specified in terms of expectations under the posterior distribution
of user-defined constraint features φ(x̄, z̄). Under certain assumptions, such as linearity of the
constraints and features that decompose the same way as the underlying model (which is the
case for all constraints used in this work), then all this new learning procedure requires is the
ability to do inference in the underlying model, in the case of HMM to run the FB algorithm.
The key advantage of PR is that the base model remains unchanged, but, during learning, it
is driven to obey the constraints by setting appropriate parameters θ. Moreover, empirical
results show that enforcing constraints in expectation, results in predictions that also satisfy
the constraints.
More formally, prior information in PR is specified with sets Qx̄ of allowed distributions
over the hidden variables z̄, which satisfy inequality constraints on some user-defined feature
expectations, with violations bounded by � ≥ 0:
Qx̄ = {q(z̄ | x̄) : ∃ξ, Eq(z̄|x̄) [φ(x̄, z̄)] − bx̄ ≤ ξ; ||ξ||22 ≤ �}
(3.1)
Where, Qx̄ which we assume non-empty, and convex, denotes the set of valid distributions,
where some feature (φ(x̄, z̄)) expectations under the posterior distribution are bounded by bx̄
and � ≥ 0 is an allowable violation slack. Setting � = 0 enforces inequality constraints strictly,
requiring Eq(z̄|x̄) [φ(x̄, z̄)] ≤ bx̄ to hold for all q(z̄ | x̄) in Qx̄ .
Constrained Posterior Set :
3.1.1
Posterior Regularization Objective
We can see the PR objective as a trade-off between two distance measures - KL-divergences
- as pictured on Figure 3.1. On the left side we have the set of all distributions over the observations x̄, per example all distributions over all possible English sentences. In the picture, δ(x̄)
represents the Dirac delta function for sentences in the training corpus (it has the value one
if the sentence appears in the corpus and zero otherwise). On the left side we wish to find
the parameters θ that minimize the KL-divergence from our model, pθ (x̄) , to the empirical
3.1. POSTERIOR REGULARIZATION OBJECTIVE
p(x)
n
31
p(z|x)
Θ
δ(x )
K L(δ (x n
)||pθ (x))
pθ (x)
|p θ (
Qx |
K L(
θ
n
n ))
z|x
pθ (z | xn )
N
Negative Log-Likelihood
Qxn
N
Posterior Regularization
�
�
Figure 3.1: PR objective E[KL(δ(x̄)
� pθ (x̄))] + E[KL(Q
x̄ � pθ (z̄ | x̄))].
distribution δ(x̄):
KL(δ(x̄)||pθ (x̄)) =
�
δ(x̄) log
x̄
=
�
x̄
= −
δ(x̄)
pθ (x̄)
δ(x̄) log δ(x̄) −
�
x̄∈X
(3.2)
�
δ(x̄) log pθ (x̄)
(3.3)
x̄
log pθ (x̄) = L(θ).
(3.4)
The first term is always zero (note that δ(x̄) is either zero or one and both make the first term
zero), and the second term is only non-zero for sentences in the training corpus, which is the
same as the negative log likelihood of the observed data. The right side represents all possible
distributions over outputs z̄. Here, we are trying to find the parameters that minimize the
distance between the posterior distribution, pθ (z̄ | x̄), and the set Qx̄ of all distributions that
satisfy the desired constraints. Putting it into words, in PR, the negative log-likelihood of a
model is penalized with the KL-divergence between the desired distribution space Qx̄ and the
model posteriors, KL(Qx̄ � pθ (z̄ | x̄)) = min KL(q(z̄ | x̄) � pθ (z̄ | x̄)). The regularized
objective is:
q(z̄|x̄)∈Qx̄
Posterior Regularized Likelihood :
�
L(θ) + E[KL(Q
x̄ � pθ (z̄ | x̄))].
(3.5)
The objective trades-off likelihood and distance to the desired posterior subspace (modulo getting stuck in local maxima) and provides an effective method of controlling the posteriors.
Note that the second term is data dependent, that is if a set of parameters already satisfies the
constraints, then the PR objective is the same as minimizing the negative log likelihood. On the
other hand, there might be cases that the pθ (z̄ | x̄) distribution will never reach the set Qx̄ because the model cannot express the constraints, or because the required parameters to achieve
Qx̄ make the negative likelihood term very high. Even when the model has enough parameters to represent the constraints, the required parameter setting can be such that the negative
likelihood term becomes very high, obfuscating the effect of the regularization.
32
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
3.1.2
Computing The PR Objective
Computing the PR objective involves solving the optimization problem:
KL(Qx̄ � pθ (z̄ | x̄)) =
Primal Projection
min
q(z̄|x̄)∈Qx̄
KL(q(z̄ | x̄) � pθ (z̄ | x̄)).
(3.6)
directly minimizing this objective is hard since there is an exponential number of hidden sequences z̄, for instance |N ||H| for the HMM model. However, the problem becomes easy to
solve in its dual formulation (see Appendix A.2 for derivation):
Dual Projection
arg max
λ≥0
−b�
x̄ λ − log Z(λ) − � ||λ||2 ,
(3.7)
�
where Z(λ) = z̄ pθ (z̄ | x̄) exp(−λ · φ(x̄, z̄)) is the normalization constant and the primal solution is q(z̄ | x̄) = pθ (z̄ | x̄) exp{−λ� φ(x̄, z̄)}/Z(λ). There is one dual variable per expectation
λi
constraint, and the dual gradient at λ �= 0 is ∇(λ) = Eq(z̄|x̄) [φ(x̄, z̄)] − � ||λ||
− bx̄ . Note that this
2
primal-dual relationship is very similar to the one between maximum likelihood and maximum
entropy. If bx̄ correspond to empirical expectations and pθ (z̄ | x̄) is uniform, then Equation 3.7
would be a log-likelihood. As with maximum entropy, gradient computation involves computing an expectation under q(z̄ | x̄), which can be performed efficiently if the features φ(x̄, z̄)
factor in the same way as the model pθ (x̄, z̄), and the constraints are linear. The posterior over
z̄ represented by a graphical model such as HMM can be written as a product of cliques C,
which in the HMM correspond to the observation probabilities and transition probabilities.
N
Factored Posterior :
1 �
pθ (z̄ | x̄) =
pθt (zi | zi−1 )pθo (xi | zi ).
pθ (x̄)
(3.8)
i=1
If the features φ(x̄, z̄) are factorized along the same cliques (which is the case for all our constraints):
N
�
Factored Features :
φ(x̄, z̄) =
φ(xi , zi )φ(zi , zi−1 ).
(3.9)
i=1
Then q(z̄ | x̄) has the same form as pθ (z̄ | x̄),
q(z̄ | x̄) =
=
1
�
pθ (z̄ | x̄) exp(−λ φ(x̄,z̄))
Zλ
N
1 �
�
�
pθt (zi | zi−1 ) exp(−λ φ(zi ,zi−1 )) pθo (xi | zi ) exp(−λ φ(xi ,zi ))
Zλ
(3.10)
(3.11)
i=1
where Zλ is the new normalization constant that ensures that q(z̄ | x̄) is a proper probability
distribution. Hence the projection step uses the same inference algorithm (the FB algorithm for
HMMs) to compute the gradient, only modifying the local factors (observation and transition
probabilities) using the current setting of λ. Note that the new factors are no longer proper
probability distributions.
We optimize the dual objective using a gradient based method shown in Algorithm 2. Here
η is an optimization precision, α is a step size. Here, β∇(λ) represents an ascent direction
chosen as follows: For inequality constraints, it is the projected gradient [Bertsekas, 1999]; for
3.2. POSTERIOR REGULARIZATION OPTIMIZATION ALGORITHM
1
2
3
4
5
6
7
8
33
λi ← 0;
while ||∇(λi )|| > η do
�
pθt (zi | zi−1 )� = pθt (zi | zi−1 ) exp(−λi φ(zi ,zi−1 )) ;
�
pθo (xi | zi )� = pθo (xi | zi ) exp(−λi φ(xi ,zi )) ;
q(z̄ | x̄) ← forwardBackward(pθo (xi | zi )� , pθt (zi | zi−1 )� );
λi+1 =← λi + αβ∇(λi );
i + +;
end
Algorithm 2: Computing KL(Qx̄ � pθ (z̄ | x̄)) =
min
q(z̄|x̄)∈Qx̄
KL(q(z̄ | x̄) � pθ (z̄ | x̄))
equality constraints with slack, we use conjugate gradient [Nocedal and Wright, 1999], noting
that when λ = 0, the objective is not differentiable. In practice this only happens at the start of
optimization and we use a sub-gradient for the first direction.
Computing the projection requires an algorithm for inference in the original model, and
uses that inference as a subroutine. For HMMs we need to make several calls to the FB algorithm in order to choose λ. Setting the optimization precision η more loosely, allows the
optimization to terminate more quickly but at a less accurate value, which corresponds to a
bigger distance between the pθ (z̄ | x̄) and the set Qx̄ . We found that aggressive optimization
significantly improves results and consequently we choose relatively small η so that tighter
values do not significantly improve performance.
3.1.3
Projection Before Decoding
The optimal parameters determined on the train set might not satisfy the constraints in the
test data. However, we can perform the projection to compute min KL(q(z̄ | x̄) � pθ (z̄ | x̄))
q(z̄|x̄)∈Qx̄
and use it at test time instead of pθ (z̄ | x̄). When performing Posterior decoding we can just
use q(z̄ | x̄) and pick the highest probability node marginals according to that distribution.
�
When using Viterbi decoding one uses the changed factors pθt (zi | zi−1 ) exp(−λ φ(zi ,zi−1 )) and
�
pθo (xi | zi ) exp(−λ φ(xi ,zi )) with the λ found during projection. We will evaluate the benefits of
projecting at decode time for each application/constraint individually on Chapters 4 and 5.
3.2
Posterior Regularization Optimization Algorithm
We can optimize the PR objective using a procedure very similar to the expectation maximization (EM) algorithm. Recall from Eq. 2.32 that in the E-step, q(z̄ | x̄) is set to the posterior
over hidden variables given the current θ. To converge to the PR objective, we must modify
the E-step so that q(z̄ | x̄) is a projection of the posteriors onto the constraint set Qx̄ for each
example x̄.
E� :
arg min
q(z̄|x̄),ξ
KL(q(z̄ | x̄) � pθt (z̄ | x̄))
s.t. Eq(z̄|x̄) [φ(x̄, z̄)] − bx̄ ≤ ξ; ||ξ||22 ≤ �.
(3.12)
34
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
pθ (z̄ | x̄)
θ
min KL
E� − Step :
max F (q(z̄), θ)
M − Step :
max F (q(z̄), θ)
q(z̄)
q(z̄)∈Qx̄
θ
q(z̄)
Qx̄
�
Figure 3.2: Modified EM algorithm for optimizing the PR objective L(θ) − E[KL(Q
x̄ � pθ (z̄ |
x̄))].
The new posteriors q(z̄ | x̄) are used to compute sufficient statistics for this instance and hence
to update the model’s parameters in the M-step (Eq. 2.33), which remains unchanged. This
scheme is illustrated in Figure 3.2 and in Algorithm 3. The only implementation difference is
that we must now perform the KL projection before collecting sufficient statistics. Appendix
A.1 shows the proof that this procedure converges to a local minima of the PR objective.
1
2
3
for t = 1..T do
for each training sentence x̄ do
E’-Step: q t+1 (z̄) = arg min KL(q(z̄ | x̄)||pθt (z̄ | x̄))
q(z̄|x̄)∈Qx̄
4
5
6
end
�
�
� � q t+1 (z̄) log pθ (x̄, z̄)
M-Step: θt+1 = arg maxθ E
z̄
end
Algorithm 3: PR optimization via modified EM. E’-Step is computed using Algorithm 2.
3.3
Statistical Word Alignments
In this section we propose two constraints for SWAṄote that since in word alignments the
observation x̄ corresponds to a pair of sentences, the target sentence x̄t = xt1 . . . xtN , and the
¯ = xs1 . . . xsM , when defining features we will explicit write the feature
source sentence xs
function as a function of both sentences and the alignment to make the exposition clearer.
3.3.1
Bijectivity Constraints
We observed in Table 2.5 that most alignments are 1 to 1 and we would like to introduce
this prior information into the model. Unfortunately including such a constraint in the model,
which corresponds to use each hidden state value only once per sequence, directly breaks the
Markov property in a fairly fundamental way, since for each position we need to keep track of
all states that were used before. Introducing the constraints in expectation using the PR framework is easy and tractable. We encode them as the constraint E[φ(x̄, z̄)] ≤ 1 where we have one
3.3. STATISTICAL WORD ALIGNMENTS
35
0
1
♣
2
3
1
♣
2
5
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
♣
③
�
♣
♣
♣
6
♣
♣
♣
♣
♣
♣
♣
9
③
♣
♣
♣
♣
♣
♣
8
♣
♣
♣
♣
♣
♣
♣
7
♣
♣
♣
♣
♣
♣
③ ♣
6
♣
♣
♣
♣
♣
♣
♣
4
♣
♣
♣
♣
♣
♣
♣
3
♣
♣
♣
♣
③
♣
♣
0
♣
♣
♣
♣
✇
♣
♣
9
♣
♣
♣
♣
③
♣
♣
8
♣
♣
♣
♣
♣
♣
♣
7
♣
♣
♣
♣
③
♣
♣
6
♣
♣
♣
✈
♣
♣
♣
5
♣
♣
♣
♣
①
♣
♣
4
♣
♣
♣
♣
①
♣
♣
7
♣
♣
✉
③
③
♣
8
♣
o
♣ cau
♣
a
♣ inte
♣ sch
③
9
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
✈
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
①
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
♣
♣
②
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
✇
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
o
♣ cau
♣
a
♣ inte
♣ sch
③
0 ♣
1 ♣
2 ♣
3 ♣
4 ♣
5 ①
6 ♣
0 ♣
1 ♣
2 ♣
3 ♣
4 ♣
5 ♣
6 ♣
♣
♣
0 1 2
et que à
♣
③ ♣
♣
♣
♣
0 ♣
♣ 1 ♣
♣ 2 ♣
♣ 3 ♣
♣ 4 ♣
♣ 5 ♣
③6 ♣
♣
0 ♣
♣ 1 ♣
♣ 2 ♣
♣ 3 ♣
♣ 4 ♣
♣ 5 ♣
①6 ♣
3 4 5
6
7
8
9
le lieu de provoquer la rupture ,
♣
♣
0 1 2
et que à
♣
③ ♣
♣
♣
feature φ(x̄t, xsm , z̄) for each position m in the source sentence that counts how many times it
is aligned to a target word in the alignment z̄:
Bijective Features :
φ(x̄t, xsm , z̄) =
♣
3 4 5
6
7
8
9
le lieu de provoquer la rupture ,
Figure 3.3: State posterior distributions for different models an English and French sentence
pair. Left: EN→FR model. Right: FR→ EN model. Top: Regular HMM. Bottom: After applying the bijectivity constraint. Circle size indicates probability value. Circle color in the bottom
rows indicates differences in posterior from the top row. Green - higher probability, red - lower
probability.
N
�
♣
♣
1(zn = m).
(3.13)
n=1
For example for the sentence in Figure 3.3 in the top left alignment the state posteriors over
the source word schism clearly sum to more than 1. The effects of applying these constraints
to the posteriors are shown in the second row. Enforcing the one to (at most) one constraint
clearly solves the garbage collector effect. Moreover, when distributing the probability mass to
the other words, most of the probability mass goes into the right positions (as measured by gold
alignments). By enforcing the constraint at training and decoding time the percentage of 1-1
alignments increases from 78% on Hansards English to French to 97.3% (manual alignments
have 98.1%), and in the En-Pt corpus, the constraint improves from 84.7% to 95.8% (manual
alignments have 90.8%).
inst
inst
36
3.3.2
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
Symmetry Constraints
The directional nature of the generative models used to recover word alignments conflicts with their interpretation as translations. In practice, we see that the choice of which
language is source versus target matters and changes the mistakes made by the model (see
Figure 3.4). The standard approach is to train two models independently and then combine
their predictions Och and Ney [2003] (more details in Subsection 4.1.6). However, we show
that it is much better to train two directional models concurrently, coupling their posterior
distributions over alignments to approximately agree. Let the directional models be defined
→
−
←
−̄
−
−
¯ (target-source) and ←
¯ z | x̄t) (source-target). Define z̄ to range
as: →
p θ (x̄t, z̄ | xs))
p θ (xs,
→
−
←
−̄
over the union of all possible directional alignments z̄ ∪ z . We define a mixture model
→
−
←
−̄
→
−
−
−
−
1→
1←
¯ = 2 p θ (x̄t, z̄ | xs)
¯ + 2 p θ (xs,
¯ z | x̄t) where ←
¯ z̄ | x̄t) = 0 and vicepθ (x̄t, z̄ | xs)
p θ (xs,
versa (i.e., the alignment of one directional model has probability zero according to the other
model). We then define the following feature for each target-source position pair n, m:
Symmetry Features :
→
−
→
−
+1 z̄ ∈ z̄ and z n = m
←
−̄
−
φnm (xtn , xsm , z̄) = −1 z̄ ∈ z and ←
zm =n .
0
otherwise
If the feature φnm (xtn , xsm , z̄) has an expected value of zero, then both models predict the
n, m link with equal probability. We therefore impose the constraint Eq [φnm (xtn , xsm , z̄)] = 0
(possibly with some small slack). Note that satisfying this implies satisfying the bijectivity constraint presented above, since both models are normalized according to a particular direction.
To compute expectations of these features under the model q(z̄ | x̄) we only need to be able to
compute them under each directional HMM. To see this, we have by the definition of q(z̄ | x̄)
and pθ (z̄ | x̄),
−
−̄
Z→
Z←
→
−
−
z̄
z
→
−
←
−̄
→
−
−
q (z̄) −
+←
q (z̄) ←
→
−
¯ +←
¯ x̄t) exp{−λ� φnm (xtn , xsm , z̄)}
p θ ( z̄ | x̄t, xs)
p θ ( z | xs,
¯
¯ x̄t)
p θ (x̄t|xs)
p θ (xs|
q(z̄ | x̄) =
=
,
2
Zλ
2Zλ
(3.14)
where we have defined:
1
→
−
q (z̄) =
→
−
→
−
¯ exp{−λ� φnm (xtn , xsm , z̄)}
p θ (x̄t, z̄ | xs)
→
Z−
q (z̄)
�
→
−
→
−
→
¯ exp{−λ� φnm (xtn , xsm , z̄)}
with Z−
=
p θ (x̄t, z̄ | xs)
q (z̄)
(3.15)
(3.16)
−
→
z̄
←
−
q (z̄) =
1
−
Z←
q (z̄)
−
with Z←
q (z̄) =
←
−̄
←
−
¯ z | x̄t) exp{−λ� φnm (xtn , xsm , z̄)}
p θ (xs,
�
←
−̄
z
←
−̄
←
−
¯ z | x̄t) exp{−λ� φnm (xtn , xsm , z̄)}
p θ (xs,
(3.17)
(3.18)
All these quantities can be computed separately in each model using forward-backward and
−
Z→
−
Z←
q (z̄)
q (z̄)
furthermore, Zλ = 12 ( −
+ ←
). The effect of this constraint is illustrated at the
→
−
¯
¯ x̄t)
p θ (x̄t|xs)
p θ (xs|
bottom of Figure 3.4. The projected link posteriors are equal for the two models, and in most
cases the probability mass was moved to the correct alignment links. The exception is the
word pair internal/le. In this case, the model chose to incorrectly have a high posterior for the
3.4. POS INDUCTION
37
0
1
♣
2
3
1
♣
2
5
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
♣
③
�
♣
♣
♣
6
♣
♣
♣
♣
♣
♣
♣
9
③
♣
♣
♣
♣
♣
♣
8
♣
♣
♣
♣
♣
♣
♣
7
♣
♣
♣
♣
♣
♣
③ ♣
6
♣
♣
♣
♣
♣
♣
♣
4
♣
♣
♣
♣
♣
♣
♣
3
♣
♣
♣
♣
③
♣
♣
0
♣
♣
♣
♣
✇
♣
♣
9
♣
♣
♣
♣
③
♣
♣
8
♣
♣
♣
♣
♣
♣
♣
7
♣
♣
♣
♣
③
♣
♣
6
♣
♣
♣
✈
♣
♣
♣
5
♣
♣
♣
♣
①
♣
♣
4
♣
♣
♣
♣
①
♣
♣
7
♣
♣
✉
③
③
♣
8
♣
o
♣ cau
♣
a
♣ inte
♣ sch
③
9
♣
♣
�
♣
♣
♣
♣
♣
�
♣
♣
♣
♣
♣
♣
①
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
✇
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
�
♣
♣
♣
♣
♣
�
♣
♣
♣
♣
♣
♣
①
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
✇
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
o
♣ cau
♣
a
♣ inte
♣ sch
③
0 ♣
1 ♣
2 ♣
3 ♣
4 ♣
5 ①
6 ♣
0 ♣
1 ♣
2 ♣
3 ♣
4 ♣
5 ♣
6 ♣
♣
♣
et que à
♣
③ ♣
♣
♣
♣
0 ♣
♣ 1 ♣
♣ 2 ♣
♣ 3 ♣
♣ 4 ♣
♣ 5 ♣
③6 ♣
♣
0 ♣
♣ 1 ♣
♣ 2 ♣
♣ 3 ♣
♣ 4 ♣
♣ 5 ♣
③6 ♣
le lieu de provoquer la rupture ,
♣
♣
et que à
♣
③ ♣
♣
♣
♣
le lieu de provoquer la rupture ,
Figure 3.4: Posterior distributions for different models for an English and French sentence pair.
Left: EN→FR model. Right: FR→ EN model. Top: Regular HMM posteriors. Bottom: After
applying symmetry constraints. Circle size indicates probability value. Circle color in the
bottom rows indicates differences in posterior from the top row. Green - higher probability,
red - lower probability.
alignment link rather than generating internal from null in one direction and le from null in the
other.
Figure 3.4 shows the expected value of each feature as the difference between the posterior
probability for each directional models for the same word pair. The last row shows both directional posteriors after imposing the symmetry constraint. Note that the projected posteriors
are equal in the two models. Also, one can see that in most cases the probability mass was
moved into the right place with the exception of the word pair internal/le; this is because the
word internal does not appear on the French side, but the model still has to spread around the
probably mass to that word. In this case, the model decided to accumulate it into the word
le instead of moving it to the null word. By enforcing this constraint at training and decoding
time we can improve the symmetry of the directional alignments (as measured by intersection
over union) on the Hansards corpus from 48% to 89.9% and on the En-Pt corpus from 48% to
94.2%.
3.4
♣
♣
PoS Induction
As seen in the previous chapter, one of the biggest problems in the PoSI task is the high
ambiguity on the word tag distribution. Each word can be assigned to most tags, while we
know that each word should be assigned to a very small number of tags, in most cases only one.
In this section we use PR and propose a word-tag sparsity constraint that favours posteriors
inst
inst
38
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
2
Prior Probability
No Prior
Sparse prior
y=x
y = exppsi(x)
y = x − 0.5
1.5
1
0.5
Model parameter
0
0
0.5
1
1.5
2
(a) Dirichlet distribution with hyper- (b) Digamma Function which acts like using
parameter α < 1. The product of this negative Laplacian smoothing of 0.5.
Dirichlet distribution will produce a distribution picked around the values 0 and 1,
encouraging sparsity.
distributions that assign to each word a few number of possible part of speech tags. Before
we describe our constraint the next subsection describes a different approach to solve the same
problem by moving into the Bayesian setting and using priors on the parameters to encourage
sparsity. This is an example on how imposing constraints over the parameters does not work
so well as imposing the constraints directly over the posteriors. Then in Subsection 3.4.2 we
describe our constraint and how it solves the high word tag ambiguity problem.
3.4.1
Parameter Sparsity
Recent advances in inference methods for Bayesian estimation have been applied to unsupervised PoS tagging [Gao and Johnson, 2008; Johnson, 2007; Goldwater and Griffiths, 2007].
In the Bayesian setting, preference for sparsity is expressed as a prior distribution p(θ, α) over
model parameters. Since the underlying model (HMM) is a product of multinomials, Dirichlet
distributions are a natural choice for priors since they are conjugate to multinomials (the product of a Dirichlet with a multinomial is a Dirichlet). This conjugacy simplifies the mathematical
and computational aspects of the model. The complete description of the model is:
θt ∼ Dir (αt )
pθt (zi |zt−1 ) ∼ Multi (θt )
θo ∼ Dir (αo )
pθo (xi |zi ) ∼ Multi (θo )
Here, αt controls sparsity over the state transition matrix and αo controls the sparsity of
the state observation probabilities. Johnson [2007] notes that αt does not influence the model
that much. In contrast, as αo approaches zero, it encourages the model to have highly skewed
pθo (xi |zi ) distributions, that is, each tag is encouraged to generate a few words with high probability, and the rest with very low probability (see Figure 3.5(a) for an example of a Dirichlet
distribution with αo ≤ 1). This is not exactly the constraint we would like to enforce: there
are some PoS tags that generate many different words with relatively high probability (for example, nouns and verbs), while each word is associated with a small number of tags. This
difference is one possible explanation for the relatively worse performance of this prior compared to our method.
3.4. POS INDUCTION
39
Johnson [2007] describes two approaches to learn the model parameters: a componentwise Gibbs sampling scheme (GS) and a Variational Bayes (VB) approximation using a mean
field. We will use the VB approach in our comparison since it only requires a small change
to the M-Step of the EM algorithm. It is out of the scope of this document to fully describe
VB approach, and we refer the readers to the original paper. The authors assume a mean field
assumption, and since the likelihood and the prior are conjugates (multinomial and Dirichlet),
there is an EM-like procedure that finds local-optimal model parameters. In the particular case
of the HMM this procedure only requires a minor change in the M-Step. In the EM algorithm
the M-Step consists in normalizing the expected counts collected during the E-Step. VB in this
setup changes the M-Step by passing each of the individual counts as well as the normal MStep normalizer through a squashing function (exponential of the Digamma function, see figure
3.5(b)). This as the effect of subtracting 0.5 to the expected counts, ensuring that the counts
don’t become negative, which as the effect of the Rich get Richer: words tag pairs with low
counts will be pushed to have zero probability. Note that this approach can not be applied to
the maximum entropy observation model since there, the model parameters having value zero
(sparsity), do not enforce posterior sparsity. In fact if all the parameters have probability zero
we get the uniform distribution, since we will have for each possible word-tag pair exp0·φ(x̄,z̄) .
3.4.2
Posterior Sparsity Constraint
To measure sparsity, we need to define an appropriate measure of PoS tag ambiguity that
we will attempt to control. For hard assignments this measure is more intuitive so we describe
it for hard assignments and then extend it to distributions over assignments. Fix a word-type,
such as “stock”. Intuitively, we would like all occurrences of “stock” to be tagged with a small
subset of all possible tags. For a hard assignment of labels to the entire corpus, we could
count how many different tags are used for occurrences of the word “stock.” If instead of a
single tagging of the corpus, we have a distribution q(z̄ | x̄) over assignments (resulting from
the E-step), we can generalize this measure as follows. Instead of asking was a particular tag
ever used for the word “stock”, we would ask what is the maximum probability with which a
particular tag was used for the word “stock”. We then add the maximums for each particular
occurrence of the word “stock”. This is the quantity that we wish to maximize. We call this
measure �1 /�∞ .
While in the case of the SWA application the constraints were defined for each sentence
individually, the constraint specified before is defined for the entire corpus. We wish to count
for the entire corpus how many different hidden state values (tags) were used per word type.
Seeing the training corpus X as a sequence of sentences x̄1 . . . x̄D appended together, where in
turn each sentence is a sequence of words x, we can reefer to a given word occurrence by its
type v ∈ V, and its occurrence number in the whole corpus. For instance, if the full corpus was
the following two sentences He went with the car to garage. The car was running out of oil., one can
refer to the second occurrence of the word car, in two ways. The first way is the one we have
been using so far, we refer to the second sentence in the corpus as x̄2 and then to the second
word in that sentences as x2 . A second way that we will use when defining this constraint is
to refer to the word type, vv = car, where v ∈ V, and to use the number of times it occurred in
the all of the corpus until now, in this case two i = 2.
More formally, we define a feature φ(v, h, i) that has the value 1 whenever the ith occurrence of the word v has part of speech tag h. For every word v ∈ V, we would like there to be
only a few PoS tags h such that there are occurrences i where v has nonzero probability. This
40
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
can be achieved if it “costs” a lot to allow an occurrence of a word to take a tag, but once that
happens, it should be “free” for other occurrences of the word to receive that same tag. More
precisely, we would like the sum (�1 norm) over tags h and word types v of the maxima (�∞
norm) of the state posterior of taking tag h over all occurrences of v to be small. The constraint
can be described by the following formula:
�1 /�∞ :
arg min
cvh
��
v∈V h∈H
(3.19)
cvh s. t. Eq(z̄|x̄) [φ(v, h, i)] ≤ cvh
where the cvh that minimizes that expression is the maximum value of the state posterior
for all occurrences of a word type for a given tag.
Moreover, we will use a slightly modified version of the PR framework described in Section 3 so that instead of hard constraints on q(z̄ | x̄), it allows the constraints to be relaxed
at a cost specified by a penalty. This relaxation can allow combining multiple constraints
without having to explicitly ensure that the constraint set remains non-empty. Additionally,
it will be useful in dealing with the �1 /�∞ constraints we need. If those were incorporated as
hard constraints, the dual objective would become non-differentiable, making the optimization (somewhat) more complicated. Using soft constraints, the non-differentiable portion of
the dual objective turns into simplex constraints on the dual variables, allowing us to use an
efficient projected gradient method.
Formally, the E-step of our approach is expressed by the objective:
Primal Penalty Projection
min
q(z̄|x̄),cvh
KL(q(z̄ | x̄) � pθ (z̄ | x̄))+σ
�
vh
cvh s. t. Eq [φ(v, h, i)] ≤ cvh
(3.20)
where σ is the strength of the regularization. Note that setting σ = 0 we are back to normal EM
where q is the model posterior distribution. As σ → ∞, the constraints force each occurrence
of a word type to have the same posterior distribution, effectively reducing the model to a
0th-order Markov chain in the E step.
The dual of this objective has a very simple form, closely related to the original dual formulation 3.7. (see Appendix A.3 for derivation):
�
�
�
�
Dual Penalty Projection
max − log
pθ (z̄ | x̄) exp(−λ · φ(x̄, z̄))
s. t.
λvhi ≤ σ
λ≥0
z̄
i
(3.21)
where z̄ ranges over assignments to the hidden tag variables for all of the occurrences in the
training data, φ(x̄, z̄) is the vector of φ(v, h, i) feature values for assignment z̄, λ is the vector
of dual parameters λvhi , and the primal parameters are q(z̄ | x̄) ∝ pθ (z̄ | x̄) exp (−λ · φ(x̄, z̄)).
This can be computed by projected gradient, as described by Bertsekas [1999].
Figure 3.5 illustrates how the �1 /�∞ norm operates on a toy example. For simplicity suppose we are only regularizing one word and our model pθ (z̄ | x̄) is just a product distribution
over 15 instances of the word. The left panel in Figure 3.5 shows the posteriors under pθ (z̄ | x̄).
We would like to concentrate the posteriors on a small subset of rows. The center panel of the
figure shows the λ values determined by Equation 3.7, and the right panel shows the projected
distribution q(z̄ | x̄), which concentrates most of the posterior on the bottom row. Note that
3.5. RELATED WORK
41
pθ
DT
JJ
VB
NN
instance
qti ∝ pθ e−λti
λ
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
DT
JJ
VB
NN
instance
10
9
8
7
6
5
4
3
2
1
0
1
DT
0.8
JJ
0.6
VB
0.4
0.2
NN
instance
0
Figure 3.5: An illustration of �1 /�∞ norm. Left panel: initial posterior distributions (columns)
for 15 instances of a word. Middle panel: optimal regularization parameters λ, each row sums
to σ = 20 the regularization strength. Right panel: Projected posterior distributions q, which
concentrates the probability mass for all instances on the NN tag, reducing the �1 /�∞ norm
from just under 4 to a little over 1.
we are not requiring the posteriors to be sparse, which would be equivalent to preferring that
the distribution is peaked; rather, we want a word to concentrate its tag posterior on a few tags
across all instances of the word. Indeed, most of the instances (columns) become less peaked
than in the original posterior to allow posterior mass to be redistributed away from the outlier
tags. Since they are more numerous than the outliers, they moved less. This also justifies only
regularizing relatively frequent events in our model.
We will call the models training using PR and the �1 /�∞ constraint as HMM+Sp for the
HMM model and HMM+ME+Sp for the HMM+ME model. We refer to the HMM using the
priors to encourage sparsity has HMM+VB.
Figure 3.6 shows the distribution of tag ambiguity across words for two corpora. As we see
from Figure 3.6, when we train using the EM procedure (HMM), both models grossly overestimate the tag ambiguity of almost all words, compared with true data (Supervised), however
when we used both the �1 /�∞ constraint as well as the HMM+VB model we achieve tag ambiguity closer to the true. However, in Chapter 5 will show that both models achieve a better
�1 /�∞ distribution in very different ways. While the model trained with PR actually produces
clusters of higher quality the same does not happen when using the HMM+VB model.
3.5
Related Work
The idea of introducing constraints over a model to better guide the learning process has
been used before. In the context of word alignment, Deng and Byrne [2006] use a state-duration
HMM in order to model word-to-phrase translations. The fertility of each source word is
implicitly encoded in the durations of the HMM states. Without any restrictions, likelihood
prefers to always use longer phrases and the authors try to control this behavior by multiplying
every transition probability by a constant η > 1. This encourages more transitions and hence
shorter phrases. For the task of unsupervised dependency parsing, Smith and Eisner [2006] add
a constraint of the form “the average length of dependencies should be X”, to capture the locality of syntax ( at least half of the dependencies are between adjacent words), using a scheme
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
10
Supervised
HMM
HMM+VB
HMM+ME
HMM+Sp
HMM+ME+Sp
L1L∞
8
6
4
2
10
Supervised
HMM
HMM-VB
HMM+ME
HMM+Sp
HMM+ME+Sp
8
L1L∞
42
6
4
2
0
0
0
1000 2000 3000 4000 5000 6000 7000 8000
rank of word by L1L∞
0
200 400 600 800 1000 1200 1400 1600 1800
rank of word by L1L∞
Figure 3.6: Our ambiguity measure (�1 /�∞ ) for each word type in two corpora measured for an
HMM initialized with supervised data, using EM training (HMM, HMM+ME), using sparsity
inducing parameter prior HMM+VB and using �1 /�∞ (HMM+Sp, HMM+ME+Sp). Left:En
corpus, Right:Pt corpus.
they call structural annealing. They modify
� the model’s distribution over trees pθ (z̄ | x̄) by a
(δ e∈z̄ length(e))
�
penalty term as: p θ (z̄ | x̄)θ ∝ pθ (z̄ | x̄)e
. Where length(e) is the surface length of
edge e. The factor δ changes from a high value to a lower one so that the preference for short
edges (hence a smaller sum) is stronger at the start of training.
These two approaches also have goal of controlling unsupervised learning, and the form of
the modified distributions is reminiscent of the form that the projected posteriors take. However, the approaches differ substantially from PR. Smith and Eisner [2006] makes a statement
of the form “scale the total length of edges”, which depending on the value of δ will prefer to
have more shorter/longer edges. Such statements are not data dependent; depending on the
value of δ, for instance if δ ≤ 0 even if the data are such that the model uses edges that are too
short on average we will still push them to be lower. By contrast the statements we can make
in PR are of the form “there should be more short edges than long edges.” Such a statement is
data dependent in the sense that if the model satisfies the constraints then we do not need to
change it; if it is far from satisfying it we might need to make very dramatic changes.
PR is closely related to the work of Mann and McCallum [2007, 2008], who concurrently
developed the idea of using penalties based on posterior expectations of features to guide semisupervised learning. They call their method generalized expectation (GE) constraints or alternatively expectation regularization. In the original GE framework, the posteriors of the model
on unlabeled data are regularized directly. They train a discriminative model, using conditional
likelihood on labeled data and “expectation regularization” penalty term on the unlabeled data:
2
�
arg max L(θ) − λE[||E
pθ (z̄|x̄) [φ(x̄, z̄) − bx̄ ||2 ].
θ
(3.22)
Notice that there is no intermediate distribution q(z̄ | x̄). For some kinds of constraints this
objective is difficult to optimize in θ and in order to improve efficiency Bellare et al. [2009] propose interpreting the PR framework as an approximation to the GE objective in Equation 3.22.
They compare the two frameworks on several datasets and find that performance is similar.
Liang et al. [2009] casts the problem of incorporating partial information about latent variables
into a Bayesian framework using “measurements,” and after several approximation steps, they
arrive at the objective we optimize.
The idea of jointly training two directional models has been explored by Liang et al. [2006],
3.5. RELATED WORK
43
although under very different formalization. They define a joint objective:
�
�
�
→
−
←
−̄
−
−
→
−
−
� log →
¯ + log ←
¯ | x̄t) + log
¯ ←
¯ x̄t) .
max E
p (x̄t | xs)
p (xs
p ( z̄ | x̄t, xs)
p ( z | xs,
θ1 ,θ2
θ
θ
θ
θ
z̄
→
−
←
−̄
−
−
¯ ←
¯ x̄t) ranges over all one-to-one
However, the product distribution →
p θ ( z̄ | x̄t, xs)
p θ ( z | xs,
alignments and computing it is #P-complete [Liang et
al.,
2006].
They approximate this distri� −
− (h ). It is not clear what
bution as a product of state posteriors: q(z̄ | x̄) = i,j →
γ n (hm )←
γ
m n
objective the approximate procedure actually optimizes. While the approximation leads to
good alignment error rate, the authors note that they were not able to observe improved MT
performance as a result.
44
CHAPTER 3. POSTERIOR REGULARIZATION FRAMEWORK
4
Statistical Word Alignment
In this chapter we present an in depth discussion of the quality of the word alignments
when training using PR with the two proposed constraints. We begin with an evaluation of
the word alignments measured against manual annotated corpora. We then proceed to evaluate the alignments on two different tasks: dependency grammar transfer and phrase-based
translation.
4.1
Word Alignment Evaluation
We begin with a comparison of word alignment quality evaluated against manually annotated alignments as measured by precision and recall. We use the six parallel corpora with gold
annotations described in the beginning of Section 2.4.3.
4.1.1
Experimental Setup
Following common practice for all experiments given the training set, we discarded all
training data sentence pairs where one of the sentences contained more than 40 words, since
this makes training faster without significantly hurting performance. We added the unlabeled
development and test data sets to the pool of training sentences, to avoid having unknown
words. We initialized IBM M1 translation table with uniform probabilities over word pairs
that occur together in the same sentence and trained IBM M1 for 5 iterations. All HMM alignment models were initialized with the translation table from IBM M1 and uniform distortion
probabilities. We run each training procedure until the area under the precision recall curve
measured on a development corpus stops increasing (see Figure 4.4 for an example of such a
curve). Using the precision recall curve gives a broader sense of the model’s performance than
using a single point (by tuning a threshold for a particular metric). In most cases this meant
4 iterations for normal EM training and 2 iterations using PR. We suspect that the constraints
make the space easier to search. For the models trained with PR we describe the criteria used
for the projection Algorithm 2 in detail in the next subsection.
The intent of this experimental section is to evaluate the gains from using constraints during learning, hence the main comparison is between HMM trained with normal EM against
trained with PR. We also report results for IBM M4, since it is often used as the default word
alignment model, and can be used as a reference. However, we would like to note that IBM
M4 is a more complex model, able to capture more structure albeit at the cost of intractable
inference. Since our approach is orthogonal to the base model used, the constraints described
here could be applied in principle to IBM M4 if exact inference was efficient, hopefully yielding
similar improvements. We used a standard implementation of IBM M4 [Och and Ney, 2003]
and since changing the existing code is not trivial, we could not use the same stopping criterion
46
CHAPTER 4. STATISTICAL WORD ALIGNMENT
Figure 4.1: HMM with symmetry constraints for different values of τ in the Hansards corpus
using 100k sentences. Left: Area under the precision recall curve. Right number of calls to the
FB algorithm during projection for the 4 iterations.
to avoid overfitting and we are not able to produce precision recall curves. We trained IBM
M4 using the default configuration of the MOSES training script.1 This performs 5 iterations of
IBM M1, 5 iterations of HMM and 5 iterations of IBM M4. The following experiments all use
MBR decoding since this decoding method always outperforms Viterbi decoding.2
The competing models through this section will be the HMM model trained with EM denoted as HMM the HMM model trained using PR and the bijectivity constraint, denoted as
B-HMM and using the symmetry constraints denoted as S-HMM and finally IBM M4.
4.1.2
Dual Projection Optimization
When training using PR there are several parameters related with the projection of the
posteriors pθ (z̄ | x̄) into the constraint set Qx̄ that need to be set ( see Algorithm 2). The
convergence criterion for the projection algorithm was the squared normalized l2 norm of the
||∇(λ )||2
i 2
gradient (gradient norm divided by number of constraints #φ(x̄,z̄)
) being smaller than η. For
bijective constraints we set η to 0.005 and used zero slack. For symmetry constraints η and
slack were set to 0.001. We chose η aggressively and lower values did not significantly increase
performance. Less aggressive settings cause degradation of performance.
Figure 4.1 shows the difference in performance as we vary η when training with PR using
the symmetry constraints. As η decreases the performance of the model gets much better, ranging from values close to using regular EM training for big values of η to an huge improvement
when η = 0.001. However, these improvements come at a computational cost. If we measure
optimization efficiency by the number of calls performed to the FB algorithm (required for gradient computations) we see that as we decrease η we exponential require more calls to the FB
algorithm. An interesting remark is that when we optimize to high precision besides getting
better results we also require less EM iterations (2 achieves the maximum performance), which
indicates that the constraints provide a lot of guidance to the model since satisfying them more
closely, gets the model to a local minimum much faster. There are however some numerical
problems with using small values of η, as some sentences failed to converge more often in
which case we use the original counts. The reasons for failing to converge are due to exceeding
1
2
http://www.statmt.org/moses/?n=FactoredTraining.HomePage
IBM M4 uses Viterbi decoding since Giza++ does not support MBR decoding.
4.1. WORD ALIGNMENT EVALUATION
47
Figure 4.2: HMM with symmetry constraints for different allowable slack values (�) using 100k
sentences of the Hansards corpus. Left: Area under the precision recall curve. Right number
of calls to the FB algorithm used during projection for the 4 iterations.
the maximum number of iterations either on projecting or looking for the best step size, or if
the objective exceeds the numeric range for numbers (since we are taking the exponential of
the parameters to calculate q(z̄ | x̄)).
We also evaluate the effect of the value for the slack variable � used to define the set Qx̄ of
acceptable constraints (Eq: 3.1). The slack is kind of complementary in an inverse way to η, as
by making η bigger, we are making the primal projection objective KL(Qx � pθ (z̄ | x̄)) bigger
as well. On the other hand by making � bigger we are making the set Q easier to attain and the
dual optimization problem easier since we are subtracting � times the norm of the parameter
from the gradient. The biggest difference between the two, although mostly a technicality, is
that, when close to the boundary of Q if we have no slack we pay the distance as measure by
KL, while in the second case when we reach inside of the L2 of radius � we are paying the
distance in terms of L2 norm.
Figure 4.2 shows the performance of HMM using PR with the symmetry constraints when
using the best convergence value η = 0.001 for different values of slack �. The figure on the
left shows the area under the curve, we note that the difference is not significant for different
slack values. On the right we show the number of required call to the Forward Backward
algorithm per EM iteration. We note that better optimization methods, such as L-BFGS, or
using a warm start for the parameters at each EM iteration (parameters from the previous
iteration), or training the models online, would potentially decrease the running time of our
method, but this is left as future work.
4.1.3
Projection Before Decoding
Here we analyse the difference of projecting the posteriors at decode time vs not projecting,
and the effect of training with PR against training with EM and projecting the posteriors at
decode time.
Figure 4.3 shows the precision vs recall curves of both models when training using standard EM versus PR with both constraints, and the results of additionally applying the constraints at decode time. A first observation is that training with PR significantly boosts the
performance of each model. Moreover using the projection at decode time always increases
performance. Comparing both constraints, it seems that bijective constraint is more useful at
training time. Note that using this constraint at decode time with regular training yields worse
48
CHAPTER 4. STATISTICAL WORD ALIGNMENT
M1 Hansards EN-FR S-HMM Constraint
1
0.9
0.8
0.8
recall
1
0.9
0.7
B-HMM BP
0.6 B-HMM NP
HMM BP
0.5
HMM NP
0.4
0.4
0.5
0.7
S-HMM SP
HMM SP
0.5 S-HMM NP
HMM NP
0.4
0.4
0.5
0.6
0.6
0.7
0.8
0.9
1
0.6
0.7
0.8
0.9
precision
precision
HMM Hansards EN-FR Bijective Constraint
HMM Hansards EN-FR Symmetric Constraint
1
0.95
0.9
0.85
0.8
0.75 B-HMM BP
0.7 B-HMM NP
HMM BP
0.65
HMM NP
0.6
0.6 0.65 0.7
recall
recall
recall
M1 Hansards EN-FR Bijective Constraint
0.75
0.8
0.85
precision
0.9
0.95
1
1
0.95
0.9
0.85
0.8
0.75 S-HMM SP
HMM SP
0.7
S-HMM NP
0.65
HMM NP
0.6
0.6 0.65 0.7
0.75
0.8
0.85
0.9
0.95
1
1
precision
Figure 4.3: Precision recall curves for IBM M1 and HMM using standard EM training (HMM)
versus PR with bijective constraints (B-HMM) and symmetry constraints (S-HMM) and different decoding types: decoding without any projection (NP), doing bijective projection before
decoding (BP), and doing symmetry projection before decoding (SP), using 100k sentences of
the Hansards corpus.
results than just training with the same constraint using PR. On the other hand, the symmetry
constraint is stronger at decode time. The rest of the results reported are all using projection at
decode time.
4.1.4
Overall Results
In this subsection we present results on alignment quality. Figure 4.4 shows precision recall
curves for the different models on the En-Fr corpus using English as the source language (left),
and on the En-Pt corpus using Portuguese as the source. Precision recall curves are obtained by
varying the posterior threshold from 0 to 1 and then plotting the different precision and recall
values obtained.
We observe several trends from Figure 4.4. First, both types of constraints improve over
the HMM in terms of both precision and recall (their precision recall curve is always above).
Second, S-HMM performs slightly better than B-HMM. IBM M4 is comparable with both constraints (after symmetrization). The results for all language pairs are in Figure 4.5. For ease
of comparison, we choose a decoding threshold for HMM models to achieve the recall of the
corresponding IBM M4 and report precision. Our methods always improve over the HMM by
10% to 15%, and improve over IBM M4 9 out of 12 times. Comparing the constraints with each
other we see that S-HMM performs better than B-HMM in 10 out of 12 cases. Since S-HMM
4.1. WORD ALIGNMENT EVALUATION
49
Precision/Recall curves
Precision/Recall curves
1
1
0.95
0.95
0.85
0.8
0.75
0.7
0.65
recall
0.9
0.85
recall
0.9
S-HMM
B-HMM
IBM M4 F
IBM M4 Int
IBM M4 GDF
HMM
0.6
0.6
0.65
0.8
0.75
0.7
0.65
0.7
0.75
0.8
0.85
0.9
0.95
S-HMM
IBM M4
IBM M4 Int
IBM M4 GDF
B-HMM
HMM
0.6
0.6
1
0.65
0.7
0.75
precision
0.8
0.85
0.9
0.95
1
precision
Figure 4.4: Precision recall curves for different models using 1000k sentences. Precision on the
horizontal axis. Left: Hansards EN-FR direction. Right: EN-PT PT-EN direction.
70.5
70.5
91.6
93.4
90.8
90.1
91.8
82.4
84.9
84.4
84.6
74.9
76.3
78.3
79.8
80.9
82.5
84.6
86.5
88.0
88.9
88.4
89.1
84.0
85.3
87.2
87.8
87.2
84.6
84.3
75.7
73.0
70
71.3
74.4
77.6
75
82.7
84.5
86.3
87.9
82.4
85.0
85.0
80
83.9
85.0
86.2
85
88.4
90
94.6
95
67.5
65
60
En-Pt
Pt-En
Pt-Fr
Fr-Pt
En-Es
Es-En
Es-Fr
Fr-Es
Pt-Es
Es-Pt
En-Fr
Fr-En
Languages
HMM
IBM M4
B-HMM
S-HMM
Figure 4.5: Word alignment precision when the threshold is chosen to achieve IBM M4 recall
with a difference of +/- 0.005. The average relative increase in precision (against the HMM
model) is 10% for IBM M4, 11% for B-HMM and 14% for S-HMM.
50
CHAPTER 4. STATISTICAL WORD ALIGNMENT
90
90
80
70
S-HMM
B-HMM
IBM M4
HMM
60
50
1000
10000
100000
precision
100
precision
100
80
70
S-HMM
B-HMM
IBM M4
HMM
60
1e+06
size
50
1000
10000
100000
1e+06
size
Figure 4.6: Word alignment precision as a function of training data size (number of sentence
pairs). Posterior decoding threshold chosen to achieve IBM M4 recall in the Hansards corpus.
Right: En-Fr direction. Left: Fr-En direction.
indirectly enforces bijectivity and models sequential correlations on both sides, this is perhaps
not surprising.
Figure 4.6 shows performance as a function of training data size. As before, we decode
to achieve the recall of IBM M4. For small training corpora adding the constraints provides
larger improvements (20-30)% but we still achieve significant gains even with a million parallel
sentences (15%). Greater improvements for small data sizes indicate that our approach can be
especially effective for resource-poor language pairs.
4.1.5
Rare vs. Common Words
One of the main benefits of using the constraints described is an alleviation of the garbage
collector effect [Brown et al., 1993]. Figure 4.7 breaks down the performance improvements
by common versus rare words. As before, we use posterior decoding, tuning the threshold to
match IBM M4 recall. For common words, this tuning maintains recall very close for all models
so we do not show this in the figure. In the top left panel of Figure 4.7, we see that precision
of common words follows the pattern we saw for the overall corpus: symmetry and bijective
outperform both IBM M4 and the baseline HMM, with symmetry slightly better than bijective.
The results for common words vary more slowly as we increase the quantity of training data
than they did for the full corpus. In the top right panel of Figure 4.7 we show the precision for
rare words. For the baseline HMM as well as for IBM M4, this is very low precisely because of
the garbage collector problem: rare words become erroneously aligned to untranslated words,
leading to low precision. In fact the constrained models achieve absolute precision improvements of up to 50% over the baseline. By removing these erroneous alignments the translation
table becomes more accurate allowing higher recall on the full corpus. In the bottom panel of
Figure 4.7, we observe a slightly diminished recall for rare words. This slight drop in recall is
due to moving the mass corresponding to rare words to null.
4.1.6
Directional Alignments Combination
As discussed earlier, the word alignment models are asymmetric, while most applications
require a single alignment for each sentence pair. Typically this is achieved by a symmetriza-
4.1. WORD ALIGNMENT EVALUATION
51
100
90
80
70
80
70
S-HMM
B-HMM
IBM M4
HMM
60
50
1000
10000
60
precision
precision
90
50
40
S-HMM
B-HMM
20 IBM M4
HMM
10
1000
30
100000
1e+06
10000
size
100000
1e+06
size
100
recall
90
80
70
HMM
IBM M4
B-HMM
S-HMM
60
50
1000
10000
100000
1e+06
size
1
1
0.95
0.95
0.9
0.9
0.85
0.85
recall
recall
Figure 4.7: Precision and recall as a function of training data size for Hansards corpus split
by common and rare words. Top Left: Common Precision, Top Right: Rare Precision. Bottom:
Rare Recall.
0.8
0.8
0.75
0.75
0.7
0.7
S-HMM
B-HMM
IBM M4 Int
0.65 IBM M4 GDF
HMM
0.6
0.6
0.65
0.7
0.75
0.8
precision
0.85
0.9
0.95
1
S-HMM
B-HMM
HMM
0.65
IBM M4 Int
IBM M4 GDF
0.6
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
precision
Figure 4.8: Precision recall curves for the different models after soft union symmetrization.
Precision is on the horizontal axis. Left EN-FR, Right PT-ES
52
CHAPTER 4. STATISTICAL WORD ALIGNMENT
tion heuristic that takes two directional alignments and produces a single alignment. For MT
the most commonly used heuristic is called grow diagonal final Axelrod et al. [2005]. This
starts with the intersection of the sets of aligned points and adds points around the diagonal
that are in the union of the two sets of aligned points. The alignment produced has high recall
relative to the intersection and only slightly lower recall than the union. In syntax transfer the
intersection heuristic is normally used, since one wants to have high precision links to transfer
knowledge between languages. One pitfall of these symmetrization heuristics is that they can
obfuscate the link between the original alignment and the ones used for a specific task, making
errors more difficult to analyze. Since they are heuristics tuned for a particular phrase-based
translation system, it is not clear when they will help and when they will hinder system performance. In this work we followed a more principled approach that uses the knowledge about
the posterior distributions of each directional model. We include a point in the final alignment
if the average of the posteriors under the two models for that point is above a threshold. This
heuristic is called soft union [DeNero and Klein, 2007]. Figure 4.8 shows the precision recall
curves after symmetrization for the En-Fr corpus. The posterior regularization trained models still performed better, but the differences get smaller after doing the symmetrization. This
should not be very surprising, since the soft union symmetrization can be viewed as an approximation of our symmetry constraint applied only at decode time. Applying the symmetrization
to the model with symmetry constraints does not affect performance.
4.1.7
Error Analysis
In this subsection we discuss some scenarios in which the constraints make the alignments
better, and some scenarios where they fail. We have already discussed the garbage collector
effect and how both models address it. Both of the constraints also bias the model to have at
most probability one in any row or column of the posterior matrix, encouraging 1 to 1 alignments. Obviously whenever alignments are systematically not 1 to 1 , this can lead to errors.
Figure 4.9, shows the state posterior distributions for an English/French sentence pair using
the same notation as in Figure 2.6. In the top panel of Figure 4.9, we see the left baseline model
(EN→FR), where the English word met is incorrectly being aligned to séance est ouverte. This
makes it impossible to recover the correct alignment house/séance. Both constraints correct this
problem. On the other hand, by enforcing a 1 to 1 mapping the correct alignment met / est
ouverte is lost. Going back to the first row (regular HMM) this alignment is correct in one direction and absent in the other (due to the n to 1 model restriction) but we can recover that
information using the symmetrization heuristics, since the point is present at least in one direction with high probability mass. This is not the case for the constraint-based models that
reduce the mass of that alignment in both directions. Going back to the right panel of Figure
4.8, we can see that for low values of precision the HMM model actually achieves better recall
than the constraint-based methods. There are two possible solutions to alleviate this type of
problem, both with their caveats. One solution is to model the fertility of each word in a way
similar to IBM M4, or more generally to model alignments of multiple words. This can lead to
significant computational burden, and is not guaranteed to improve results. A more complicated model may require approximations that destroy its performance gain, or require larger
corpora to estimate its parameters. Another option is to perform some linguistically motivated
pre-processing of the language pair to conjoin words. This of course has the disadvantage that
it needs to be specific to a language pair in order to include information such as “English simple
past is written using a single word, so join together French passé composé.” An additional problem
4.2. APPLICATIONS
0
1
2
3
4
5
0
1
2
3
4
5
0
1
2
3
4
5
53
0
②
♣
♣
♣
1
2
0
③
♣
♣
♣
♣
♣
1
2
3
4
4
♣
♣
♣
♣
③
♣
6
3
♣
♣
♣
③
♣
♣
5
2
♣
♣
③
♣
♣
♣
♣
③
1
♣
♣
♣
♣
♣
♣
③
♣
0
♣
③
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
6
♣
♣
♣
♣
♣
③
5
♣
♣
♣
♣
③
♣
4
♣
♣
♣
♣
①
♣
3
♣
♣
♣
♣
③
♣
5
6
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
③
③
♣
♣
♣
♣
③
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
③
la séance est ouverte à
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
2 heures
7
0
1
0 ③ ♣
③
1 ♣
♣
2 ♣
♣
3 ♣
♣ 4 ♣
♣
①5 ♣
♣
7
0
1
♣ 0 ③
♣
♣ 1 ♣
③
♣ 2 ♣
♣
♣ 3 ♣
♣
♣ 4 ♣
♣
♣ 5 ♣
♣
7
0
1
♣ 0 ③
♣
♣ 1 ♣
③
♣ 2 ♣
♣
♣ 3 ♣
♣
♣ 4 ♣
♣
♣ 5 ♣
♣
.
la séance
♣
♣
♣
♣
2
♣
♣
♣
♣
3
4
2
♣
♣
♣
♣
♣
♣
3
4
5
6
5
♣
♣
♣
♣
♣
③
7
4
♣
♣
♣
♣
③
♣
♣
2
♣ p.m.
3
♣
♣
♣
③
♣
♣
♣
③
2
♣
♣
③
♣
♣
♣
③
♣
♣
♣
♣
♣
7
♣
♣
♣
♣
♣
♣
6
♣
♣
♣
♣
♣
③
5
♣
♣
♣
♣
③
♣
6
7
♣
♣
♣
♣
♣
♣
③
♣
♣
③
♣
2
♣ p.m.
♣
♣
♣
♣
♣
♣
③
♣
♣
♣
♣
③
est ouverte à
♣
♣
♣
♣
♣
♣
♣
♣
♣ the
♣ house
♣ met
♣
at
♣ the
♣ house
♣ met
♣
at
♣
2
♣ p.m.
♣ the
♣ house
♣ met
♣
at
2 heures .
Figure 4.9: State posterior distributions for different models for an English and French sentence
pair. Left: EN→FR direction. Right: FR→ EN direction. Top: EM trained HMM posteriors.
Middle: After applying bijective constraint. Bottom: After applying symmetry constraint.
Sure alignments are squares with borders; possible alignments are squares without borders.
Circle size indicates probability value. Circle color in the middle and bottom rows indicates
differences in posterior from the top row. Green - higher probability, red - lower probability.
with joining words to alleviate inter-language divergences is that it can increase data sparsity.
4.2
Applications
In this section we evaluate the alignments resulting from using the proposed constraints in
two different tasks: Syntax transfer where alignments are used to indicate which labels should
match in the different languages and statistical machine translation where alignments are used
to restrict the number of possible minimal translation units.
4.2.1
Syntax Transfer
Here we compare the different alignment models based on how well they can be used
for transfer of linguistic resources across languages. In particular, Ganchev et al. [2009] use a
word aligned corpus and a parser for a resource rich language (source language) in order to
create a parser for a resource poor language (target language). Consider the parse tree of the
source language as a set of dependency edges. For each such edge, this work checks whether
both end points are aligned to words in the target language. If both endpoints are aligned,
then the edge is transferred. Obviously, not all edges transferred in this way result in correct
parse edges. Parser errors in the source language can result in incorrect parses. Additionally,
54
CHAPTER 4. STATISTICAL WORD ALIGNMENT
80
60
78
76
55
74
72
50
70
68 S-HMM
B-HMM
66 HMM
3
45
3.2
3.4
3.6
3.8
4
4.2
4.4
S-HMM
B-HMM
HMM
12
14
16
18
20
22
Figure 4.10: Edge conservation for cross lingual grammar induction. Left: En→Bg subtitle corpus; Right: En→Es parliamentary proceedings. Vertical axis: percentage of transferred edges
that are correct. Horizontal axis: average number of predicted edges per sentence.
linguistic phenomena in the two languages might cause correct parses and alignments to result
in incorrect transferred edges. For example the verb “bike” in English might be translated to
French as “aller à vélo” where the semantic content is translated as the word “vélo” but we
cannot expect this French noun to behave similarly to the English verb. In order to address this
concern the resulting alignments are filtered by PoS tag.
In order to evaluate the alignments without dependence on a particular parser, we computed the fraction of correctly transferred edges as a function of the average number of edges
transferred. In general as we trade-off precision vs recall of alignments, we can increase the
accuracy of the transferred edges by transferring a smaller number of edges. Figure 4.10 shows
our results for transferring from English to Bulgarian (En→Bg) and from English to Spanish
(En→Es). En→Bg results are based on a corpus of movie subtitles, and are consequently shorter
sentences, while En→Es results are based on a corpus of parliamentary proceedings. We use
the “two-rules” system for Bulgarian, using sentences of any length that have at least one transferred edge. For Spanish we use the “three-rules” system. See Ganchev et al. [2009] for details
on the corpora and transfer rules. The alignments were produced using the softUnion heuristic. Figure 4.10 shows the percentage of transferred edges that are correct, as a function of the
average number of edges per sentence that were transferred for the two languages (for different values of the posterior threshold). The Bulgarian corpus has shorter sentences on average,
resulting in a smaller number of edges transferred per sentence and in a higher accuracy relative to Spanish. We see in Figure 4.10 that for both domains, the models trained using posterior
regularization perform better than the baseline model trained using EM.
4.2.2
Phrase-based machine translation
In this subsection we attempt to investigate whether the new alignments produce improvements in a end to end phrase based machine translation system. In particular we fix a state of
the art machine translation system, the open source Moses [Hoang et al., 2007] toolkit3 , and
measure its performance when we vary the supplied word alignments.
The exact relationship between word alignment quality and machine translation quality is
not straightforward. Despite recent work showing that better word alignments can result in
3
www.statmt.org/moses/
4.2. APPLICATIONS
Foreign Source Points
0 1
0 1 2
x
a
0-0
✇
✇
0
0
a
xy
ab
0-0 1-1
✇
b 1
xyz
a b c 0-0 1-1 2-2 1
•
✇
y
b
1-1
2
c 2
yz
bc
1-1 2-2
x( y z
x y
z
c
2-2
55
Foreign Source Points
x
a
0-0
xy
a
0-0
2
x
a
b
0-0
a
x
y
a
b
0-0
b
xyz
a b c 0-0 2-2
✇c
z
c
2-2
z
yz
c
2-2
z
bc
2-2
yz
bc
2-2
Figure 4.11: Results from the phrase extraction procedure for different word alignments. In
the right panel the word alignment model has missed the y-b alignment which will result in
different phrases being extracted.
better MT [Ganchev et al., 2008], there is evidence that the two are only indirectly connected
[Lopez and Resnik, 2006; Vilar et al., 2006]. Most statistical MT systems use word alignments to
extract phrase pairs which are then used along with corresponding probabilities in the translation pipeline. Lopez and Resnik [2006] identify three ways in which word alignments can
affect a phrase based translation system: which phrase pairs are extracted; the probabilities
associated to them; and the value of the lexical weight feature.
The phrase pairs extracted are closely tied to the alignments, but in perhaps an unintuitive
way. The phrase extraction code extracts all phrase pairs that include at least one aligned point
but do not contradict an alignment by including an aligned word in one language without its
translation in the other language. So on the one hand if there are too many incorrect alignment
points forming a cluster, the correct phrases cannot be extracted without adding the spurious
words, leading to missing words/phrases from the phrase table. On the other hand, unaligned
words act as wild cards that can be aligned to every neighboring word, thus increasing the size
of the phrase table (in the Hansards corpus the phrase table size increases by a factor of 5 when
using a decoding of threshold of 0.6 instead of 0.2). Another undesirable effect of unaligned
words is that whey will only appear (in the phrase table) in the context of the surrounding
words. In the case that the word was incorrectly misaligned and has a correct word translation,
it will not be available at decoding time on different contexts.
Figure 4.11 shows the phrases extracted from two word alignments for the same sentence
pair. The alignments differ only in the alignment point y-b. By incorrectly excluding this point,
we remove the phrase y-b, and hence would not be able to translate y as b except in the context
we have seen it. We will also increase the total number of phrases, since any unaligned points
can be included in adjacent phrases.
The spurious phrase pairs, will change both the phrase probability as well as the lexical
weight feature. Lopez and Resnik [2006] conclude that the factor with most impact was the
degradation of the translation probabilities due to noisy phrase pairs. In a different study Ayan
and Dorr [2006] note that the lexical weight feature of common phrase based MT systems do
not correctly take account for null aligned words. They show that accounting for these correctly
can improve performance.
For all experiments, the experimental setup is as follows: we lowercase the corpora, and
56
CHAPTER 4. STATISTICAL WORD ALIGNMENT
train language models from all available data. The reasoning behind this is that even if bilingual texts might be scarce in some domain, monolingual text should be relatively abundant.
For each competing alignment model we train Moses and use MERT optimization to tune its
parameters on a development set. Moses is trained starting from a symmetrized alignment
set, phrases up to 7 words are extracted, and the msd-bidirectional-fe distance based distortion
model is used. The competing alignment models are: GIZA++ implementation of IBM M4,
where the directional models are symmetrized using the grow diagonal and final heuristic (M4
GDF); our implementation of the baseline HMM model using the softUnion heuristic (HMM
SU); our implementation of HMM trained with PR using both constraints and the softUnion
heuristic (B-HMM SU, S-HMM SU). When using the softUnion heuristic we tested three values
of the threshold (0.2,0.4,0.6) which try to capture the choice between recall oriented alignments
and precision oriented alignments, and pick the best according to the performance in the development set. Table 4.1 summarizes the results for the different corpora and alignments.
M4 GDF
HMM SU
B-HMM SU
S-HMM SU
Fr → En
35.7
35.9
36.0
35.5
En → Fr
31.2
28.9
31.5
31.2
Es → En
32.4
32.3
32.6
31.9
En → Es
31.6
31.6
31.7
32.5
Pt → En
31.4
30.9
31.0
31.4
En → Pt
28.9
31.6
32.2
32.3
Table 4.1: BLEU scores for all language pairs. The best threshold was picked according to the
development set after the last MERT iteration. Bold represents the best value overall.
A first observation is that the constraint based models always outperform the HMM model
trained with EM and outperform M4 GDF in all but one experiment, with a difference ranging
from (0.2 to 0.9). However both constrains help in different cases. While the Bijectivity constrains, seem to help more for the Hansards corpus and for one direction of the EPPS corpus, in
the other cases the symmetry constrains outperform all other models. It is also interesting that
the baseline HMM with softUnion outperforms the more complex M4 baseline in half the cases.
Even if the constrained alignments consistently lead to better BLEU scores these differences are
small and not easy to justify.
There are several influencing factors, related with the high entropy on the typical MT cascade: there was a huge variance in the number of iterations MERT required to converge ranging
from just 2 iterations to 28 iterations. Also the best value on the development set did not always coincide with the best value on the test set. Moreover, against common knowledge in MT
community which indicates that, bigger phrase tables are better, we note that in 14 out of 18
times the threshold picked was 0.4 (middle size phrase tables) and the other 4 times 0.2 was
picked (smaller phrase tables). For instance for the symmetry constraints HMM model on the
En-Pt corpus the phrase table sizes ranges from 0.8 GM for 0.2, 1.7G for 0.4 and 2.7G for 0.6.
This is probably related with the fact that the number of aligned points with bigger thresholds
is small leading to a lot of erroneous phrases constituted mostly by unaligned points, which
will lead to a poor estimation of phrase probabilities, as noted also in Lopez and Resnik [2006].
5
Part of Speech Induction
This Chapter compares the different models used for the PoSI task: the HMM-ME model
and the HMM model using standard EM and PR with sparsity constraint and the HMM model
using a discounting Dirichlet prior.
5.1
Experimental Setup
We test our models on the six languages described in subsection 2.5.2. The competing models are the HMM with a multinomial observation model (HMM), the HMM with a maximum
entropy observation model trained using EM (HMM+ME) and trained using PR with the sparsity penalty (HMM+Sp, HMM+ME+Sp), and a multinomial HMM with a Dirichlet prior on the
parameters (HMM+VB) trained using variational Bayes.
Following standard practice for the HMM models with a multinomial observation model,
we lowercase the corpora and replace words with only one occurrence by a special unknown
token. This improves the multinomial HMM results by decreasing the number of parameters
and eliminating rare words, which are mostly nouns and normally end up together with other
nouns. Since the max-entropy observation models have access to features, these pre-processing
steps are not necessary and we do not perform them. We randomly initialized all models from
the same posteriors obtained by running the E-Step of the HMM model with a set of random
parameters. This means that the initialization was identical for all models (for a fixed random seed). For the models using the maximum entropy observation model, we regularize the
likelihood with a Gaussian prior with variance 10. For EM and Variational Bayes training we
trained the model for 200 iterations, since we found that the models converge around iteration
100. For HMM+VB model we use a transition and observation prior of 0.001 corresponding
to the best values reported in Johnson [2007]. For PR training, we initialize the models with
the parameter obtained by training with EM for 30 iterations (mostly for speed) and then we
run an additional 170 iterations of PR. In all experiments, unless explicitly mentioned, for PR
training, we regularize only words that occur at least 10 times, with σ = 32, following the setup
of Graça et al. [2009]. We will see that this setting is not optimal for every language, and model,
but we prefer to use it instead of doing supervised model selection, which does not make sense
in an unsupervised scenario. We obtain hard assignments using Posterior decoding since this
showed small but consistent improvements over Viterbi decoding. In these experiments, all
models have a number of hidden states equal to the number of tags in the corpus.
58
CHAPTER 5. PART OF SPEECH INDUCTION
Accuracy with 1-many mapping
Mutual Information
1.38
1.16
1.54
1.44
1.68
2.11
2.10
2.29
1.68
1.56
1.65
1
2.23
2.22
2.19
2.44
1.84
1.76
1.83
1.93
1.98
1.81
1.5
0.92
35
2
1.25
1.50
1.54
1.69
1.19
47.7
50.8
54.0
61.6
57.8
63.6
67.3
68.2
71.3
58.6
40
2.78
2.66
2.93
75.7
47.1
45
2.5
68.1
53.9
50
60.5
55
63.9
65.5
68.1
61.2
60
58.1
65
64.1
70
66.9
67.6
70.7
71.6
68.1
75
72.4
3
72.5
70.8
76.5
80
0.5
En
HMM
Pt
HMM+ME
Bg
HMM+Sp
Es
HMM+ME+Sp
Dk
Tr
HMM+VB
En
HMM
Pt
HMM+ME
Bg
HMM+Sp
Es
Dk
HMM+ME+Sp
Tr
HMM+VB
Figure 5.1: Average accuracy and mutual information for 10 different runs (random seeds
identical across models) for 200 iterations of training. Left: Accuracy after performing the 1Many mapping. Right: Mutual information between gold tags distribution and hidden state
distribution. The error bars extend one standard deviation.
5.2
Overall Results
Figure 5.1 compares performance of the models across the six languages using both the 1Many and MI metrics described in Subsection 2.5.3. Figure 5.1 shows that for both metrics, our
model (HMM+ME) always performs better than the baseline (HMM). Also, training with PR
using the sparsity constraint always improves over HMM. Strikingly, the improvements of the
two modifications are additive: HMM+ME+Sp always performs better than both HMM+ME
and also HMM+Sp. Finally HMM+VB performs better than HMM for English, and comparably
or worse for the other languages.
5.3
Maximum Entropy Observation Model Results
As shown in the previous section HMM+ME always outperforms the HMM model for all
corpora. In this section we break down the performance of the HMM+ME model in terms of
different feature sets, the number of parameters used, and the value for the Gaussian prior
over the parameters. Conventional wisdom for discriminative learning is that adding more
features almost always improves performance. While this is often true in a setting where there
is a large labeled corpus, the situation is very different for an unsupervised problem such as
ours. In the supervised setting, the model has guidance that it can use to discard some of the
features as irrelevant to the task. By contrast, there is no way for an unsupervised model to
learn that it should not use some of the features to explain the data. For example, the features
described in Section 2.5.1 (Words, Suff., Size, Ortho. and Counts) could be used for a named
entity recognizer, or a noun-phrase chunker. However, without labeled data we have to rely
that our selection of features is adequate to the task at hand. Moreover, increasing the number
of features will increase the capacity of the model, which can lead to adversarial results, such
as leading the model to explain irrelevant correlations of the data.
Figure 5.2 breaks down the performance of the model by different feature sets. A first
observation is that not all features sets are useful. In fact, only the word identity features with
cutoff and the suffixes features are useful across all languages, while the other features seem to
be hurting or at least not helping the model at all. The fact that the morphological features are
5.3. MAXIMUM ENTROPY OBSERVATION MODEL RESULTS
59
61.60
61.86
61.85
60.17
57.26
61.80
72.52
58.6
63.13
67.31
66.96
66.82
64.63
63.38
66.80
67.78
72.40
72.98
72.56
71.35
54.0
50
60.5
63.90
63.12
63.95
58.1
55
59.17
57.07
60
64.2
65
66.9
67.60
68.01
68.59
67.36
67.14
68.44
70
72.50
71.83
72.71
68.20
66.90
70.83
75
45
En
Pt
Bg
Es
Dk
Tr
Languages
HMM
All
No-Size
No-Count
No-Suff
No-Cutoff
No-Orto
Figure 5.2: Accuracy (using 1-Many mapping) of the HMM+ME model for the different corpora using different feature sets. All is using all features described in Subsection 2.5.1; NoSize: removes the feature for short words; No-Count removes the feature for frequent words;
No-orto: removes the regular expression features; No-Suff: removes the suffixes features; Nocutoff: removes the cutoff to include the word identity, meaning that all word identities are
included. All values correspond to an average over 10 different runs. The error bars extend
one standard deviation.
not useful is unexpected and seems to contradict prior work [Clark, 2003]. We leave a better
understating of the features interactions for future work.
One of the advantages of the HMM+ME model is that it allows us to reduce the number
of parameters of the model, hence giving the model less chance to overfit. Figure 5.4 shows
the accuracy of the model as we vary the number of word identities used as parameters. We
can observe that even when the cutoff is set to 1, the HMM+ME still improves over HMM
although they have parameters for the same word identities (recall that in the HMM words
that occurred only once were replaced by the token “unknown”), which shows that the other
features are actually helping the model. As we increase the word identity cutoff value the
performance of the model varies according to the language. For some languages the less words
we have the better the performance is (En , Pt), while for other languages the performance of
the model starts to degrade after a certain point (Es ,Dk). A possible explanation is that the
value of the cutoff is based on the absolute number of occurrences of a word type, which does
not scale with the corpus size. In fact, the corpus where higher cutoffs hurt performance are the
ones with smaller corpus size. Figure 5.4 shows the number of features used by each model as
we increase the cutoff. As expected, the number of features of the HMM+ME model with cutoff
of 1 is similar to the HMM but the number of features decreases drastically as we increase the
cutoff, specially for small corpus, relying more on the remaining shared features. This might
be particularly useful for domain adaptation where we want to rely as less as possible on the
particular word identity since these are bound to change or have a different meaning in the
new domain.
There is a direct correlation between the number of features used and the Gaussian prior
used on the HMM+ME model. Since as we increase the regularizer strength (inverse of the
Gaussian prior), the model pays more for giving high weights to features, it will prefer to
use common features such as suffixes to explain the data and avoid using the word identity
features. Figure 5.5 shows the accuracy of the HMM+ME as we change the regularizer value
60
CHAPTER 5. PART OF SPEECH INDUCTION
75
65.6
58.6
55
68.1
67.7
66.3
66.6
72.7
72.8
72.8
72.0
71.2
72.2
71.9
72.6
60.5
60
64.2
66.9
66.9
66.9
68.2
67.5
67.6
69.0
65
70.4
70
50
En
Pt
Es
Dk
Languages
HMM
1
5
10
15
20
Figure 5.3: Accuracy (using 1-Many mapping) of the HMM+ME model for the different corpora using different cutoffs on the word identity feature. All values correspond to an average
over 5 different runs. The error bars extend one standard deviation.
30000
25000
20000
15000
10000
5000
0
En
Pt
Es
Dk
Languages
HMM
1
5
10
15
20
Figure 5.4: Number of features used by the HMM+ME as we increase the word identity cutoff
value.
5.4. TAG SPARSITY RESULTS
61
67.2
67.7
67.0
57.4
60.5
66.9
71.3
73.2
72.2
70.3
64.2
54.0
50
60.1
60
66.9
61.1
67.1
69.9
67.5
68.6
70
72.3
72.8
72.0
80
40
32.5
38.9
30
20
En
Pt
Es
Dk
Languages
HMM
0.01
0.1
1
10
100
Figure 5.5: Accuracy (using 1-Many mapping) of the HMM+ME model for the different corpora using different strengths for the regularizer. All values correspond to an average over 5
different runs. The error bars extend one standard deviation.
using the default feature set. Setting the strength very high (0.01) always hurts the model. This
is natural since this will force the model to give small weights to most of the features relying
mostly on suffix features, which are not discriminative between some word classes. As we
decrease the regularizer strength, the model performance increases significantly. We note that
the model is not very sensitive to smaller regularization strengths (1,10,100).
5.4
Tag Sparsity Results
In this section we analyse both sparse inducing approaches (HMM+Sp, HMM+VB). Going
back to Figure 2.8 we see that both approaches are achieving an average �1 /�∞ close to the true
distribution. However, the results on Figure 5.1 are quite different. In fact while HMM+Sp always improves performance, HMM+VB more often hurts than it helps. Figure 2.8 only shows
the global distribution of �1 /�∞ for each method, but shows no information about the difference
on sparsity for individual words. This is shown in Figure 5.6, which displays the average sum
of absolute differences in sparsity between each model and the true tags. In all languages both
sparsity inducing models have a smaller average sparsity difference per word with HMM+Sp
having the slightly smaller difference. It is also interesting to see that the HMM+ME model
also improves over this measure, although it is not directly dealing with sparsity. So far these
measures did not illustrate the significant differences in performance between HMM+Sp and
HMM+VB. Figure 5.7 shows scatter plots of the performance of each method against the �1 /�∞
value they achieve. HMM+Sp always achieves a better performance according to 1-Many mapping, while HMM+VB achieves a better performance on 1 to 1 mapping. However, the good
behaviour of HMM+VB under the 1 to 1 mapping is not due to the quality of the induced
clusters, but a consequence of the defects of the 1 to 1 mapping described in Subsection 2.5.3.
Figure 5.8 shows the number of tokens assigned to each cluster at decoding time, in frequency rank order. While both HMM and HMM+Sp exhibit a fast decrease in the size of the
clusters, HMM+VB more closely matches the power law-like distribution achieved by the gold
labels. This difference explains the improvement on the 1-1 mapping, where HMM+VB is assigning larger size clusters to the most frequent tags. However, HMM+VB achieves this power
62
CHAPTER 5. PART OF SPEECH INDUCTION
4.96
5
4
3.55
3
0.31
0.26
1.48
0.31
0.18
1.69
0.18
0.37
0.29
2.19
1
0.37
0.65
2.65
2
0
En
Pt
Es
Languages
HMM
HMM+VB
HMM+ME
HMM+Sp
HMM+ME+Sp
Figure 5.6: Average absolute �1 /�∞ difference for each word type between each training
method and the supervised initialization.
70
69
68
67
66
65
64
63
62
61
60
Sparse 32
Sparse 10
Sparse 100
70
Sparse 32
Sparse 10
69
-1
VEM 10
EM
68
VEM 10-3
0
1
2
3
1
2
3
VEM 10-3
67
4
52
-1
VEM 10
51
50
49
VEM 10-3
48
47
46
45
Sparse 100
44
Sparse 10
43
42 Sparse 32
41
EM
40
0
Sparse 100
VEM 10-1
4
0
1
2
EM
3
4
53 VEM 10-1
52
51
50 VEM 10-3
49
48
Spar
s
Sparse e 10
32
Sparse 100
47
EM
46
0
1
2
3
4
Figure 5.7: L1LMax vs Accuracy. Left: 1-Many mapping (Pt corpus En corpus). Right 1-1
mapping. Top: Pt corpus. Bottom: En corpus.
5.4. TAG SPARSITY RESULTS
4
3
2
63
4
EM
VEM 10-3
VEM 10-1
Sparse 32
True
EM
VEM 10-3
VEM 10-1
Sparse 32
True
3
2
1
1
0
0
5
10
5
10
15
20
Figure 5.8: Token distribution per cluster. left: Bg corpus, middle: Pt corpus, right: En corpus.
vertical axis: number of tokens in tens of thousands.
75
68.05
60.5
67.64
71.16
68.72
70.90
70.86
71.22
71.07
69.34
71.48
68.86
64.1
60
66.9
65
70.35
70
55
50
En
Pt
Es
Languages
HMM
5
10
15
20
Figure 5.9: Accuracy (using 1-Many mapping) of the HMM+Sp model for the different corpora
using different cutoffs for the number of words required for a word type to be regularized. All
values correspond to an average over 5 different runs. The error bars extend one standard
deviation.
law distribution at the expense of the mutual information with the gold labels as we see in
Figure 5.1. From all methods, HMM+VB has the lowest mutual information, while HMM+Sp
has the highest. In section 5.6 we will show further evidence that mutual information is in fact
the best measure to evaluate the quality of the induced clusters and that it correlates with the
gains of using the induced clusters as features for a supervised part of speech tagger.
As with the HMM+ME model there is no principled way on how to choose the minimum
number of word occurrences for the cutoff when choosing which words to constrain. On the
one hand, it does not make sense to constrain word types with very few occurrences, since
we do not have enough counts to decide which cluster should the probability mass be pushed
to. On the other hand, if we only constraint words that occur a lot of times, we leave most
of the words types, and possible the more ambiguous unconstrained. Figure 5.9 shows the
performance of the method for different cutoffs. We see that in all languages constraining more
words seems to always help. The difference is more significant for Es since it has a smaller
corpus and hence lesser word types with a high number of occurrences. Figure 5.10 shows that
the model is not very sensible to different choices of σ.
64
CHAPTER 5. PART OF SPEECH INDUCTION
75
69.73
71.07
67.47
60.5
69.41
71.42
71.16
64.1
60
70.35
68.78
71.35
71.22
66.9
70.19
65
71.07
70
55
50
En
Pt
Es
Languages
HMM
10
32
64
100
PREP
DET
N
ADJ
RPUNC
POS
V
INPUNC
CONJ
TO
EPUNC
Others
PREP
DET
N
ADJ
RPUNC
POS
CONJ
TO
CONJ
ENDPUNC
V
INPUNC
V
V
ADJ
V
INPUNC
RPUNC
ADJ
ADJ
N
N
N
N
N
DET
PREP
ENDPUNC
TO
CONJ
INPUNC
V
V
POS
N
ADJ
N
N
N
N
N
N
DET
PREP
Figure 5.10: Accuracy (using 1-Many mapping) of the HMM+Sp model for the different σ
values. All values correspond to an average over 5 different runs. The error bars extend one
standard deviation.
EPUNC
Others
Figure 5.11: Induced tags by the HMM model (Left), and by the HMM+ME+Sp model (Right)
on the En corpus. Each column represents a hidden state, and is labeled by its 1-Many mapping.
Unused true tags are grouped into the cluster named “Others”.
5.5
Error Analysis
Figures 5.11 shows the distribution of true tags and clusters for both the HMM model (left)
and the HMM+ME+Sp model (right) on the English corpus. Each bar represents a cluster,
labeled by the tag assigned to it after performing the 1-Many mapping. The colors represent
the number of words with the corresponding true tag. To avoid clutter true tags that were
never used to label a cluster are grouped into “Others”.
We see in Figure 5.11 that both models split common tags such as “nouns” into several
hidden states. This splitting accounts for much of the errors in both models. By using 5 states
for nouns instead of 7, HMM+ME+Sp is able to use more states for adjectives. Another improvement comes from a better grouping of prepositions. For example “to” is grouped with
punctuations in the left, while in the right it is correctly mapped to prepositions. Although this
should be the correct behaviour, this actually hurts, since the tagset actually has a true tag “To”
we get all occurrences of the word “to” as incorrectly assigned, which consists in a lost of 2.2%
points absolute accuracy. Potentially, this is not a very serious error since “to” is most of the
time a preposition. By contrast, HMM has a state mapped to the tag “TO” but the word “to”
5.6. USING THE CLUSTERS
65
25000
25000
20000
20000
15000
15000
10000
10000
5000
5000
0
0
art
n
adj
prop
v-fin
prp
punc
num
adv
v-pcp
v-inf
conj-c
pron-pers
sumOthers
art
n
adj
prop
v-fin
prp
punc
num
adv
v-pcp
v-inf
conj-c
pron-pers
sumOthers
Figure 5.12: Induced tags by the HMM model (Left), and by the HMM+ME+Sp model (Right)
on the Pt corpus. Each column represents a hidden state, and is labeled by its 1-Many mapping.
Unused true tags are grouped into the cluster named “Others”.
40
25
None
HMM
MaxEnt
HMM L1L∞
MaxEnt L1L∞
55
50
45
Error
30
Error
60
None
HMM
MaxEnt
HMM L1L∞
MaxEnt L1L∞
35
20
40
35
15
30
10
25
5
20
10
100
Labeled Sentences
500
10
100
Labeled Sentences
500
Figure 5.13: Tagging error rate as a function of labeled data size for semi-supervised model
using the induced tags as features in the Es corpus (left) and for the Tr corpus (right).
comprises only one fifth of that state. The most common error made by HMM+ME+Sp is to
include the word “The” with the second noun induced tag in Figure 5.11 (Right). This induced
tag contains mostly capitalized nouns and pronouns, which often precede nouns of other induced tags. Potentially, the capitalization feature might be a culprit for the word “The”.
Portuguese morphology is much richer than English which explains the better performance of the feature based models. Figure 5.12 shows the induced clusters for Pt language.
The HMM+ME+Sp model improves over HMM for all tags except of adjectives. Both models have trouble distinguishing nouns from adjectives. The reduced accuracy on adjectives for
HMM+ME+Sp is explained by the mapping of a cluster to adjectives in the HMM model and
to nouns in the HMM+ME+Sp model. Removing the noun/adjective distinction, as suggested
by Zhao and Marcus [2009], would increase performance of both models by about 6%. One
qualitative difference we observed was that the HMM+ME+Sp model used a single induced
cluster for proper nouns rather than spreading them across different clusters.
66
CHAPTER 5. PART OF SPEECH INDUCTION
En
Pt
Bg
Es
Dk
Tr
14
12
Error
10
8
6
4
2
0
10
100
Labeled Sentences
500
Figure 5.14: Error reduction by using the HMM+ME+Sp model produced clusters as features
as opposed to the clusters produced by the HMM model, as a function of the labeled data size.
5.6
Using The Clusters
As a further comparison of the models trained using the different methods, we use them
to generate features for a semi-supervised PoS tagger. The basic supervised model has the
same features as the HMM+ME model (2.5.1), except that we use all word identities and suffixes regardless of frequency. We train the semi-supervised model using averaged perceptron
for a number of iterations chosen as follows: split the training set into 20% for development
and 80% for training and pick the number of iterations ν to optimize accuracy on the development set. Finally, train on the full training set using ν iterations and report results on a 500
sentence test-set. We augment the supervised features with the state identity for the current token, based on the automatically generated models. For each unsupervised training procedure
(HMM, HMM+Sp, HMM+ME, HMM+ME+Sp), we train 10 models using different random
initializations to get 10 state identities for each token. We then add these cluster identities as
features to the supervised model. Figure 5.13 shows the average accuracy of the supervised
model for the Es and Tr languages as we vary the type of unsupervised features. The average
is taken over 10 random samples from the training set at each training set size. We can see
from Figure 5.13 that using semi-supervised features from any of the models except HMM improves performance even if we have 500 labeled sentences. In particular, to get an error rate
comparable to supervised with 500 labeled sentences, we need less than 200 labeled sentences
for most languages. Figure 5.14 shows the absolute error reduction by using the HMM+ME+Sp
model features as opposed to the HMM model features on all languages. The error reduction
is specially dramatic for low data conditions, but its still significant after the supervised model
observes 500 training examples across all languages.
6
Conclusion and Future Work
6.1
Conclusions
In this thesis we developed a novel learning framework, the Posterior Regularization Framework, for incorporating domain knowledge into the learning algorithm while at the same time
keeping the base model simple and tractable. The framework defines a rich language to specify prior knowledge about the desired hidden structure. The prior knowledge is expressed as
inequalities on the expected value under the posterior distribution of some user defined constraint features. In these thesis we showed the expressiveness of the framework by expressing
three different types of constraints: 1) restricting the degree of the hidden state values; 2) making two different models agree on the value of the state posteriors during training; 3) enforcing
sparsity over groups of hidden variables. Moreover, we showed a concrete example on how
enforcing the constraints directly on the posterior distribution is better than applying the same
intuition as parameter priors.
For the SWA task we applied the following two constraints: (i) bijectivity: “one word in
one language should not translate to many words in the other language”; and (ii) symmetry:
“the translation of words should be the same when translated from language A to language B
and from language B to A”. Using these constraints we showed consistent and significant improvements in 6 different languages pairs even when compared to a more complex model such
as IBM M4. Besides solving the known “garbage collector” effect, we showed that the posterior distributions obtained modeled better the desired alignments. Given the intuition that
both constrains are biasing the models for 1-1 alignments, we also showed some systematic
mistakes that the constraints introduce and suggest future work on how to solve them. In addition, we tested the alignments produced by the PR framework within two different tasks that
rely on word alignments, syntax transfer and statistical phrase based MT, where the improved
alignments lead to an increase in performance. In the case of syntax transfer, we showed that
the number of edges of a dependency tree that can be transferred from one language to the
other increases due to the decrease of incorrect alignment points. On phrase based MT, the
improvements are harder to explain, but in fact over three different language pairs we got improvements using the new alignments. We noted however that the current heuristics for phrase
extraction and phrase scoring do not take into account the quality (or probability) of the base
alignments, and aim at achieving as many phrases as possible (leading to huge phrase tables),
which obfuscates the relation between word alignments and MT quality.
We also applied the new learning framework to the task of fully unsupervised PoSI in
six different languages. We addressed the problems of the common approach of using an
HMM trained using EM and investigate two efficient modifications to the model that effectively control the model complexity. First, we re-parametrize the standard HMM by using a
maximum entropy observation model with orthographic and morphological features and reduce the number of parameters, and show that this simple change produces significant gains
68
CHAPTER 6. CONCLUSION AND FUTURE WORK
in all the languages. Second, we use the PR framework to effectively constrain the allowable
ambiguity between words and tags, which achieves significant additive gains in all six languages. The combination of these two approaches achieved remarkable improvements in all
languages, ranging from 4.7% to 15.4% absolute error reduction. Moreover, we show that the
output of these unsupervised systems can be very useful when training a supervised learning
system using a small labeled corpus, reducing the amount of labeled data needed by roughly a
half. However, we also showed some unsolved challenges with this approach. Firstly, it is not
clear what features help on unsupervised learning and how they interact. It is surprising that
morphological features do not help on this particular task since they are particular useful on
supervised systems. Secondly, the propose model for this task had too many parameters that
influence its outcome. Finding a way to pick good parameters in the unsupervised setting is
still an open research question.
6.2
Future Work
Beyond the two constraints discussed for the SWA task, many others are worth investigating, specially after a better understanding of which systematic errors are still being performed
by the current models. One such example are linguistic informed constraints, such as:
• Enforce small distortion as measured by dependency tree distance, as tried by Lopez and
Resnik [2005] and DeNero and Klein [2007];
• Enforce that aligned words should belong to compatible PoS tags. Conditioning on PoS
tags has showed good results in Toutanova et al. [2002];
• Enforce alignments to respect syntactic boundaries;
• Enforce similar words to align to each other with probability one. This would avoid the
example in Figure 2.4 where the proper name “baldwin” is acting as a garbage collector
although the words are identical in both languages.
A general question regarding these constraints, which is related with the PR framework in
general is what kind of prior knowledge is better expressed as a constraint in expectation or
directly on the parameters. Regarding the word similarity constraint we could enforce this
directly on the parameters by initializing the probability p(”baldwin”|”baldwin”) to one, thus
discarding all possible alternatives. Also regarding the tree distortion model we could have the
distortion probability be a convex combination between tree distortion and absolute position
distortion. A different kind of constraint would be to train word alignments models on several
languages at the same time and enforce pairwise agreement between languages. The approach
of using several languages at the same time as been used before and proved successfully (for
instance Kumar et al. [2007]).
To simulate fertility, we could replace the bijectivity constraint by a fertility constraint
where the probability mass of each word had to sum not to one, but to a constant depending on the word identity or the word class. In fact, some small experiments where we applied a
constraint that the fertility of each word should be equal to the one it had on the development
corpus, produced a systematic improvement in performance.
6.2. FUTURE WORK
69
While for SWA both constraints were parameter free and linguistic justifiable, leaving only
the projection optimization parameters as free parameters, the task of PoSI has a wide range of
parameters that need to be tuned. A first direction to explore is what features to use, against
common wisdom in supervised learning, where the more features the better, in this application
we saw that features that we thought would be useful were actually harmful to the model.
Furthermore, there are interactions between the features and the prior used during the optimization, which are not completely clear. Finally, while using the soft penalty version of PR
has the advantage of alleviating the requirement of a non empty set where to project the posteriors, it also introduces an extra strength parameter that needs to be tuned. Furthermore, the
useful values for that parameter do not have a clear meaning, such as the average sparsity we
wish to achieve. Finally, we wish to analyse the usefulness of the induced clusters into different
tasks.
70
CHAPTER 6. CONCLUSION AND FUTURE WORK
A
Derivations
A.1
Convergence Proof
We show that minimizing constrained F (q, θ) results in minimizing the PR Objective.
Theorem: F (q, θ) and PR
Local minima of F (q, θ) s.t. q(z̄ | x̄) ∈ Qx̄ are local minima of
�
L(θ) + E[KL(Q
x̄ � pθ (z̄ | x̄))].
� � q(z̄ | x̄) log pθ (z̄ | x̄)] from F (q, θ), we get:
Proof: By adding and subtracting E[
z̄
�
F (q, θ) = E
�
�
�
= −E
z̄
�
�
�
�
�
�
�
�
q(z̄ | x̄)
p
(x̄,
z̄)
q(z̄
|
x̄)
θ
�
�
q(z̄ | x̄) log
= −E
q(z̄ | x̄) log
+E
q(z̄ | x̄) log
pθ (x̄, z̄)
p
(z̄
|
x̄)
pθ (z̄ | x̄)
θ
z̄
z̄
�
z̄
(A.1)
�
�
�
q(z̄ | x̄) log pθ (x̄) + E[KL(q(z̄
| x̄)||pθ (z̄ | x̄)] = L(θ) + E[KL(q(z̄
| x̄) || pθ (z̄ | x̄)].
(A.2)
Since the first term does not depend on q(z̄ | x̄), the second term is minimized by q(z̄ | x̄) =
arg minq(z̄|x̄))∈Qx̄ KL(q(z̄ | x̄) || pθ (z̄ | x̄)) at local minima.
A.2
Modified E-step Dual Derivation
The modified E-step involves a projection step that minimizes the Kullback-Leibler divergence:
E� :
arg min
q(z̄|x̄)
KL(q(z̄ | x̄) � pθ (z̄ | x̄))
s.t. Eq(z̄|x̄) [φ(x̄, z̄)] − bx̄ ≤ ξ; ||ξ||22 ≤ �.
(A.3)
Assuming the set Qx̄ = {q(z̄ | x̄) : Eq(z̄|x̄) [φ(x̄, z̄)] − bx̄ ≤ ξ; ||ξ||22 ≤ �} is non-empty, the
corresponding Lagrangian is
max
λ,α,γ
min
q(z̄|x̄),ξ
L(q(z̄ | x̄), ξ, λ, α, γ)
(A.4)
72
CHAPTER A. DERIVATIONS
, where
L(q(z̄ | x̄), ξ, λ, α, γ) =
+
� q [φ(x̄, z̄) − bx̄ − ξ)
KL(q(z̄ | x̄) � pθ (z̄ | x̄)) + λ� (E
�
α(||ξ||22 − �2 ) + γ(
q(z̄ | x̄) − 1)
(A.5)
log(q(z̄ | x̄)) + 1 − log(pθ (z̄ | x̄)) + λ� φ(x̄, z̄) + γ = 0
(A.7)
(A.6)
z̄
∂L(q(z̄ | x̄), ξ, λ, α, γ)
=
∂q(z̄ | x̄)
q(z̄ | x̄) =
=⇒
∂L(q(z̄ | x̄), ξ, λ, α, γ)
=
∂ξi
pθ (z̄ | x̄) exp(−λ� φ(x̄, z̄))
e exp(γ)
λi
2αξi − λi = 0 =⇒ ξi =
2α
(A.8)
(A.9)
Plugging q(z̄ | x̄) and ξ in L(q(z̄ | x̄), ξ, λ, α, γ) and taking the derivative with respect to γ.
�
�
� pθ (z̄ | x̄) exp(−λ� φ(x̄, z̄))
∂L(λ, α, γ)
q(z̄|x̄) pθ (z̄ | x̄) exp(−λ φ(x̄, z̄))
=
−1 = 0 =⇒ γ = log(
)
∂γ
e exp(γ)
e
q(z̄|x̄)
(A.10)
From where we can simplify q(z̄ | x̄) =
where Zλ = q(z̄|x̄) pθ (z̄ | x̄) exp(−λ� φ(x̄, z̄))
ensures that q(z̄ | x̄) is properly normalized. Plugging γ into L(λ, α, γ) and taking the derivative with respect to α, we get:
pθ (z̄|x̄) exp(−λ� φ(x̄,z̄))
Z
�
λ2
λ2
+
− α�2
2α 4α
||λ||2
λ2
λ2
−
− �2 = 0 =⇒ α =
2α2 4α2
2�
− log(Z) − b�
x̄ λ −
L(λ, α) =
∂L(λ, α)
=
∂α
(A.11)
(A.12)
Replacing back into L(λ, α) we get the dual objective:
Dual E� :
arg max
λ≥0
A.3
−b�
x̄ λ − log(Z) − ||λ||2 �
(A.13)
Modified Penalty E-step Dual Derivation
The penalty version of PR has to solve the following projection problem on the E’-Step:
Primal Penalty Projection
min
q(z̄|x̄),cvh
KL(q(z̄ | x̄) � pθ (z̄ | x̄)) + σ
�
cvh
vh
s. t. Eq [φ(v, h, i)] ≤ cvh
0 ≤ cvh
(A.14)
A.3. MODIFIED PENALTY E-STEP DUAL DERIVATION
73
The Lagrangian becomes:
L(q(z̄ | x̄), cvh , αvh , λvhi ) = KL(q(z̄ | x̄)||pθ (z̄ | x̄))+σ
�
cvh +
�
wti
vh
λvhi (Eq(z̄|x̄) [φ(v, h, i)]−cvh )−αvh ·cvh
(A.15)
where we are maximizing with respect to λvhi ≥ 0 and αvh ≥ 0. Taking the derivative with
respect to q(z̄ | x̄) we have:
∂L(q(z̄ | x̄), cvh , αvh , λvhi )
= log q(z̄ | x̄) + 1 − log pθ (z̄ | x̄) − φ(v, h, i) · λvhi
∂q(z̄ | x̄)
(A.16)
Setting this to zero and ensuring q(z̄ | x̄) normalizes we get:
q(z̄ | x̄) =
pθ (z̄ | x̄) exp(−φ(v, h, i) · λvhi )
Zλ
(A.17)
Taking the derivative with respect to cvh we have:
�
∂L(q, cvh , αvh , λvhi )
=σ−
λvhi − αvh
∂cvh
(A.18)
i
�
setting this to zero �
gives us αvh = σ − i λvhi . Knowing that αvh ≥ 0 we will have to introduce
the constraint σ ≥ i λvhi . Substituting into the KL term we have: yields:
KL(q(z̄ | x̄)||pθ (z̄ | x̄)) =
� pθ (z̄ | x̄) exp(−φ(v, h, i) · λvhi )
Zλ
z
log
pθ (z̄ | x̄) exp(−φ(v, h, i) · λvhi )
Zλ p(z)
= − log(Zλ ) − Eq(z̄|x̄) [λvhi · φ(v, h, i)]
The second part of this will cancel with
�
L(q(z̄ | x̄), c, αvh , λvhi ) = − log(Zλ ) + σ
= − log(Zλ ) + σ
= − log(Zλ )
(A.19)
wti λvhi Eq(z̄|x̄) [φ(v, h, i)]
�
cvh +
wti
vh
�
vh
�
cvh −
�
wti
leaving us with:
λvhi (−cvh ) − αvh · cvh
λvhi cvh −
�
vh
cvh (σ −
�
λvhi )
(A.20)
i
So our objective becomes very simple:
max − log(Zλ ) s. t.
λvhi ≥0
�
i
λvhi ≤ σ
(A.21)
74
CHAPTER A. DERIVATIONS
B
Bibliography
Afonso, S., E. Bick, R. Haber, and D. Santos (2002). Floresta Sinta(c)tica: a treebank for Portuguese. In In Proc. LREC, pp. 1698–1703.
Axelrod, A., R. B. Mayne, C. Callison-burch, M. Osborne, and D. Talbot (2005). Edinburgh
system description for the 2005 iwslt speech translation evaluation. In In Proc. International
Workshop on Spoken Language Translation (IWSLT.
Ayan, N. F. and B. J. Dorr (2006). Going beyond aer: An extensive analysis of word alignments
and their impact on mt. In Proc. ACL.
Banko, M. and R. Moore (2004). Part of speech tagging in context. In Proc. COLING.
Bannard, C. and C. Callison-burch (2005). Paraphrasing with bilingual parallel corpora. In
ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics,
Morristown, NJ, USA, pp. 597–604. Association for Computational Linguistics.
Bellare, K., G. Druck, and A. McCallum (2009). Alternating projections for learning with expectation constraints. In Proceedings of the Proceedings of the Twenty-Fifth Conference Annual
Conference on Uncertainty in Artificial Intelligence, Corvallis, Oregon, pp. 43–50. AUAI Press.
Bertsekas, D. P. (1999). Nonlinear Programming: 2nd Edition. Athena scientific.
Biemann, C., C. Giuliano, and A. Gliozzo (2007). Unsupervised part-of-speech tagging supporting supervised methods. In Proceedings of RANLP.
Brown, P. F., S. A. D. Pietra, V. J. D. Pietra, M. J. Goldsmith, J. Hajic, R. L. Mercer, and S. Mohanty (1993). But dictionaries are data too. In HLT ’93: Proceedings of the workshop on Human
Language Technology, Morristown, NJ, USA, pp. 202–205. Association for Computational Linguistics.
Brown, P. F., S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer (1993). The mathematics of statistical
machine translation: Parameter estimation. Computational linguistics 19(2), 263–311.
Callison-Burch, C. (2007). Paraphrasing and Translation. Ph. D. thesis, University of Edinburgh.
Callison-Burch, C. (2008). Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. In Proc. EMNLP.
Charniak, E. and M. Elsner (2009). EM works for pronoun anaphora resolution. In Proceedings
of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp.
148–156. Association for Computational Linguistics.
76
CHAPTER B. BIBLIOGRAPHY
Chiang, D., A. Lopez, N. Madnani, C. Monz, P. Resnik, and M. Subotin (2005, October). The hiero machine translation system: extensions, evaluation, and analysis. In Proceedings of Human
Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 779–786. Association for Computational
Linguistics.
Civit, M. and M. Martí (2004). Building cast3lb: A spanish treebank. Research on Language &
Computation.
Clark, A. (2003). Combining distributional and morphological information for part of speech
induction. In Proc. EACL.
Cohen, S. and N. Smith (2009). The shared logistic normal distribution for grammar induction.
In Proc. NAACL.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete
data via the em algorithm. Royal Statistical Society, Ser. B 39(1), 1–38.
DeNero, J. and D. Klein (2007, June). Tailoring word alignments to syntactic machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics,
Prague, Czech Republic, pp. 17–24. Association for Computational Linguistics.
Deng, Y. and W. Byrne (2006). Hmm word and phrase alignment for statistical machine translation. In Proc. IEEE.
Fraser, A. and D. Marcu (2007a, June). Getting the structure right for word alignment: Leaf. In
In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLP-CoNLL), pp. 51–60.
Fraser, A. and D. Marcu (2007b). Measuring word alignment quality for statistical machine
translation. Computational Linguistics 33(3), 293–303.
Galley, M., M. Hopkins, K. Knight, and D. Marcu (2004, May 2 - May 7). What’s in a translation
rule? In D. M. Susan Dumais and S. Roukos (Eds.), HLT-NAACL 2004: Main Proceedings,
Boston, Massachusetts, USA, pp. 273–280. Association for Computational Linguistics.
Ganchev, K., J. Gillenwater, and B. Taskar (2009). Dependency grammar induction via bitext
projection constraints. In ACL-IJCNLP ’09: Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of
the AFNLP: Volume 1, Morristown, NJ, USA, pp. 369–377. Association for Computational
Linguistics.
Ganchev, K., J. Graca, J. Gillenwater, and B. Taskar (2009). Posterior regularization for structured latent variable models. Technical Report MS-CIS-09-16, University of Pennsylvania
Department of Computer and Information Science.
Ganchev, K., J. V. Graça, J. Blitzer, and B. Taskar (2008). Multi-view learning over structured
and non-identical outputs. In Proc. UAI.
Ganchev, K., J. V. Graça, and B. Taskar (2008, June). Better alignments = better translations?
In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 986–993. Association for Computational
Linguistics.
77
Gao, J. and M. Johnson (2008, October). A comparison of Bayesian estimators for unsupervised
Hidden Markov Model POS taggers. In In Proc. EMNLP, Honolulu, Hawaii, pp. 344–352.
ACL.
Goldwater, S. and T. Griffiths (2007). A fully bayesian approach to unsupervised part-of-speech
tagging. In In Proc. ACL, Volume 45, pp. 744.
Graça, J., K. Ganchev, F. Pereira, and B. Taskar (2009). Parameter vs. posterior sparisty in latent
variable models. In Proc. NIPS.
Graça, J. V., K. Ganchev, and B. Taskar (2007, December). Expectation maximization and posterior constraints. In J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Advances in Neural
Information Processing Systems 20, Cambridge, MA, pp. 569–576. MIT Press.
Graça, J. V., J. P. Pardal, L. Coheur, and D. Caseiro (2008, may). Building a golden collection
of parallel multi-language word alignment. In Proceedings of the Sixth International Language
Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA).
Haghighi, A. and D. Klein (2006). Prototype-driven learning for sequence models. In Proc.
HTL-NAACL. ACL.
Headden III, W. P., M. Johnson, and D. McClosky (2009, June). Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language
Technologies: The 2009 Annual Conference of the North American Chapter of the Association for
Computational Linguistics, Boulder, Colorado, pp. 101–109. Association for Computational
Linguistics.
Headden III, W. P., D. McClosky, and E. Charniak (2008). Evaluating unsupervised part-ofspeech tagging for grammar induction. In Proceedings of the 22nd International Conference on
Computational Linguistics-Volume 1, pp. 329–336. Association for Computational Linguistics.
Hoang, H., A. Birch, C. Callison-burch, R. Zens, R. Aachen, A. Constantin, M. Federico,
N. Bertoldi, C. Dyer, B. Cowan, W. Shen, C. Moran, and O. Bojar (2007, June). Moses: Open
source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting
of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and
Poster Sessions, Prague, Czech Republic, pp. 177–180. Association for Computational Linguistics.
Huang, F. and A. Yates (2009). Distributional representations for handling sparsity in supervised sequence-labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of
the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP:
Volume 1-Volume 1, pp. 495–503. Association for Computational Linguistics.
Hwa, R., P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak (2005). Bootstrapping parsers via
syntactic projection across parallel texts. Natural Language Engineering 11, 11–311.
Johnson, M. (2007). Why doesn’t EM find good HMM POS-taggers. In In Proc. EMNLP-CoNLL.
Klein, D. and C. Manning (2004). Corpus-based induction of syntactic structure: Models of
dependency and constituency. In Proc. ACL.
Koehn, P. (2002). Europarl: A multilingual corpus for evaluation of machine translation.
78
CHAPTER B. BIBLIOGRAPHY
Koehn, P., F. J. Och, and D. Marcu (2003). Statistical phrase-based translation. In Proceedings of
the 2003 Conference of the North American Chapter of the Association for Computational Linguistics
on Human Language Technology (NAACL), Morristown, NJ, USA, pp. 48–54. Association for
Computational Linguistics.
Kromann, M., L. Mikkelsen, and S. Lynge (2003). Danish Dependency Treebank. Proceedings of
TLT2003. Växjö University Press, Sweden.
Kumar, S. and W. Byrne (2002). Minimum Bayes-Risk word alignments of bilingual texts. In
Proceedings of the ACL-02 conference on Empirical methods in natural language processing, pp.
140–147.
Kumar, S., F. J. Och, and W. Macherey (2007). Improving word alignment with bridge languages. In Proc. EMNLP, pp. 42–50.
Lambert, P., A. D. Gispert, R. Banchs, and J. B. M. no (2005). Guidelines for word alignment
evaluation and manual alignment. Language Resources and Evaluation 39(4), 267–285.
Liang, P., M. I. Jordan, and D. Klein (2009). Learning from measurements in exponential families. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning,
New York, NY, USA, pp. 641–648. ACM.
Liang, P., B. Taskar, and D. Klein (2006, June). Alignment by agreement. In Proceedings of the
Human Language Technology Conference of the NAACL, Main Conference, New York City, USA,
pp. 104–111. Association for Computational Linguistics.
Lopez, A. and P. Resnik (2005). Improved HMM alignment models for languages with scarce
resources. In Proc. of the ACL.
Lopez, A. and P. Resnik (2006). Word-based alignment, phrase-based translation: What´s the
link? In Proceedings of the 7th conference of the association for machine translation in the Americas
(AMTA): visions for the future of machine translation, Boston, MA, pp. 90–99.
Mann, G. and A. McCallum (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th international conference on Machine learning,
pp. 600. ACM.
Mann, G. S. and A. McCallum (2008, June). Generalized expectation criteria for semisupervised learning of conditional random fields. In Proceedings of ACL-08: HLT, Columbus,
Ohio, pp. 870–878. Association for Computational Linguistics.
Marcus, M., M. Marcinkiewicz, and B. Santorini (1993). Building a large annotated corpus of
English: The Penn Treebank. Computational linguistics 19(2), 313–330.
Matusov, E., N. Ueffing, and H. Ney (2006). Computing consensus translation from multiple
machine translation systems using enhanced hypotheses alignment. In Cambridge University
Engineering Department, pp. 33–40.
McCallum, A., D. Freitag, and F. Pereira (2000). Maximum entropy markov models for information extraction and segmentation. In Proc. ICML. Morgan Kaufmann.
Merialdo, B. (1994). Tagging English text with a probabilistic model. Computational linguistics 20(2), 155–171.
79
Moore, R. C. (2004). Improving ibm word-alignment model 1. In Proc. ACL, pp. 518.
Neal, R. M. and G. E. Hinton (1998). A new view of the EM algorithm that justifies incremental,
sparse and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 355–368.
Kluwer.
Nocedal, J. and S. J. Wright (1999). Numerical optimization. Springer.
Och, F. J. and H. Ney (2000). Improved statistical alignment models. In ACL ’00: Proceedings
of the 38th Annual Meeting on Association for Computational Linguistics, Morristown, NJ, USA,
pp. 440–447. Association for Computational Linguistics.
Och, F. J. and H. Ney (2003). A systematic comparison of various statistical alignment models.
Computational Linguistics 29(1), 19–51.
Oflazer, K., B. Say, D. Hakkani-Tür, and G. Tür (2003). Building a Turkish treebank. Treebanks:
Building and Using Parsed Corpora.
Papineni, K., S. Roukos, T. Ward, and W. Zhu (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. ACL.
Rabiner, L. (1989). A tutorial on hidden markov models and selected applications in speech
recognition. In Proc. IEEE 77(2), 257–286.
Ravi, S. and K. Knight (2009). Minimized models for unsupervised part-of-speech tagging. In
In Proc. ACL.
Rogati, M., S. McCarley, and Y. Yang (2003). Unsupervised learning of arabic stemming using
a parallel corpus. In Proc. ACL.
Schütze, H. (1995). Distributional part-of-speech tagging. In Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics, pp. 141–148. Morgan
Kaufmann Publishers Inc.
Simov, K., P. Osenova, M. Slavcheva, S. Kolkovska, E. Balabanova, D. Doikoff, K. Ivanova,
A. Simov, E. Simov, and M. Kouylekov (2002). Building a Linguistically Interpreted Corpus
of Bulgarian: the BulTreeBank. In Proc. LREC.
Smith, N. and J. Eisner (2005). Contrastive estimation: Training log-linear models on unlabeled
data. In Proc. ACL. ACL.
Smith, N. A. and J. Eisner (2006). Annealing structural bias in multilingual weighted grammar
induction. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Morristown, NJ,
USA, pp. 569–576. Association for Computational Linguistics.
Snyder, B. and R. Barzilay (2008, June). Unsupervised multilingual learning for morphological
segmentation. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 737–745. Association for
Computational Linguistics.
Snyder, B., T. Naseem, J. Eisenstein, and R. Barzilay (2009). Adding more languages improves
unsupervised multilingual part-of-speech tagging: a bayesian non-parametric approach. In
Proc. NAACL.
80
CHAPTER B. BIBLIOGRAPHY
Toutanova, K., H. Ilhan, and C. Manning (2002). Extensions to HMM-based statistical word
alignment models. In Proc. EMNLP.
Toutanova, K. and M. Johnson (2007). A Bayesian LDA-based model for semi-supervised partof-speech tagging. In Proc. NIPS 20.
Vilar, D., M. Popović, and H. Ney (2006). Aer: Do we need to "improve" our alignments? In
Proc. IWSLT.
Vogel, S., H. Ney, and C. Tillmann (1996). Hmm-based word alignment in statistical translation.
In Proceedings of the 16th conference on Computational linguistics, Morristown, NJ, USA, pp. 836–
841. Association for Computational Linguistics.
Wang, Q. and D. Schuurmans (2005). Improved estimation for unsupervised part-of-speech
tagging. In Proc. IEEE NLP-KE.
Yarowsky, D. and G. Ngai (2001). Inducing multilingual pos taggers and np bracketers via
robust projection across aligned corpora. In North American Chapter Of The Association For
Computational Linguistics, pp. 1–8. Association for Computational Linguistics Morristown,
NJ, USA.
Zhao, Q. and M. Marcus (2009). A simple unsupervised learner for POS disambiguation rules
given only a minimal lexicon. In Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing.