UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

Posterior Regularization Framework: Learning Tractable Models with Intractable Constraints

João de Almeida Varelas Graça
(MSc in Information Systems and Computer Engineering)

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science and Engineering

Advisor: Professor Maria Luísa Torres Ribeiro Marques da Silva Coheur
Co-Advisor: Professor Fernando Carlos das Neves Pereira
Co-Advisor: Professor Ben Taskar

Jury:
President: President of the IST scientific committee
Members:
Professor Maria Luísa Torres Ribeiro Marques da Silva Coheur
Professor Fernando Carlos das Neves Pereira
Professor Ben Taskar
Professor Mário Alexandre Teles de Figueiredo
Professor Adrià de Gispert
Professor Vasco Miguel Gomes Nunes Manquinho

July 16, 2009

Abstract

Unsupervised learning of probabilistic structured models presents a fundamental tradeoff between the richness of the captured constraints and correlations and the efficiency and tractability of inference. In this thesis, we propose a new learning framework called Posterior Regularization (PR) that incorporates side information into unsupervised estimation in the form of constraints on the model's posteriors. The underlying model remains unchanged, but the learning method changes. During learning, our method is similar to the EM algorithm, but inside the E-step we solve a problem similar to maximum entropy estimation in order to enforce the constraints. We apply the PR framework to two different large-scale tasks: Statistical Word Alignments and Unsupervised Part of Speech Induction. In the former, we incorporate two constraints: bijectivity and symmetry. Training with these constraints produces a significant boost in performance, as measured by both precision and recall against manually annotated alignments for six language pairs. In the latter, we enforce sparsity on the word-tag distribution, whose ambiguity is overestimated by the default training method. Experiments on six languages achieve dramatic improvements over state-of-the-art results.

Resumo

A aprendizagem não supervisionada de modelos probabilísticos estruturados apresenta um compromisso fundamental entre a expressividade das restrições e correlações capturadas, e a eficiência e exequibilidade da inferência. Nesta tese apresentamos uma nova framework de aprendizagem, intitulada framework de Regularização à Posteriori, que incorpora informação adicional durante a aprendizagem não supervisionada na forma de restrições sobre a distribuição à posteriori do modelo. O modelo probabilístico mantém-se inalterado, mas o algoritmo de aprendizagem não. Durante a aprendizagem o nosso método é semelhante ao algoritmo EM, mas resolvemos um problema semelhante ao problema de máxima entropia dentro do E-Step, de forma a satisfazer as restrições. Aplicamos a nova framework a duas tarefas de grande escala: o problema de Alinhamento Palavra a Palavra Estatístico e a Indução de Etiquetas Sintácticas. Na primeira tarefa incorporamos duas restrições: bijectividade e simetria. Os modelos treinados com estas restrições produzem um aumento significativo nos resultados, avaliados em termos de precisão e cobertura em seis pares de línguas distintos. Na segunda tarefa impomos dispersão na distribuição de etiquetas dada uma palavra, cuja ambiguidade é sobrestimada usando o método de treino padrão. Experiências em seis línguas diferentes apresentam resultados significativamente melhores do que o estado da arte.
Keywords & Palavras Chave

Keywords
• Posterior Regularization Framework
• Unsupervised Learning
• Prior Knowledge
• Statistical Word Alignments
• Part of Speech Induction
• Natural Language Processing

Palavras Chave
• Framework de Regularização à Posteriori
• Aprendizagem Não Supervisionada
• Conhecimento à Priori
• Alinhamentos Palavra a Palavra Estatísticos
• Indução de Etiquetas Sintácticas
• Processamento de Língua Natural

Acknowledgments

I would like to express my gratitude to everyone who helped me during the development of this dissertation, provided me with their support, and endured my constant stress and bad temper. Without them this work would not have been possible. My thanks extend to the INESC-ID Spoken Language Systems lab team, where I developed part of my thesis, and to the University of Pennsylvania, where I was an invited student for most of the time during the execution of this work. My thanks to my colleagues Partha Talukdar, Alex Kulesza and Jennifer Gillenwater, with whom I had the pleasure of sharing an office and discussing many ideas. My special thanks to Kuzman Ganchev, with whom I had the pleasure of working closely during this work, for all his patience and friendship. I would like to thank my supervisors, Luísa Coheur, Fernando Pereira and Ben Taskar, for all their guidance over these years, their constant advice and corrections, and their never-ending patience towards my doubts and requests. To all my friends who have always been there for me, even when things seemed to be going wrong, thank you for your words of comfort and motivation. And last, but certainly not least, I thank my family for their unconditional support, not only throughout this project, but also for my entire life.

Lisboa, April 18, 2011
João Graça

Contents

1 Introduction
  1.1 Thesis Contributions
  1.2 Document Outline
2 Background
  2.1 Modeling - Hidden Markov Models
    2.1.1 Overview
    2.1.2 HMM Parameters
  2.2 Learning - Expectation Maximization Algorithm
  2.3 Decoding
  2.4 Statistical Word Alignments
    2.4.1 Models
    2.4.2 Decoding
    2.4.3 Corpora
    2.4.4 Evaluation Metrics
    2.4.5 Baseline Problems
    2.4.6 Related Work
  2.5 Part of Speech Induction
    2.5.1 Models
    2.5.2 Corpora
    2.5.3 Evaluation Metrics
    2.5.4 Baseline Problems
    2.5.5 Related Work
3 Posterior Regularization Framework
  3.1 Posterior Regularization Objective
    3.1.1 Posterior Regularization Objective
    3.1.2 Computing The PR Objective
    3.1.3 Projection Before Decoding
  3.2 Posterior Regularization Optimization Algorithm
  3.3 Statistical Word Alignments
    3.3.1 Bijectivity Constraints
    3.3.2 Symmetry Constraints
  3.4 PoS Induction
    3.4.1 Parameter Sparsity
    3.4.2 Posterior Sparsity Constraint
  3.5 Related Work
4 Statistical Word Alignment
  4.1 Word Alignment Evaluation
    4.1.1 Experimental Setup
    4.1.2 Dual Projection Optimization
    4.1.3 Projection Before Decoding
    4.1.4 Overall Results
    4.1.5 Rare vs. Common Words
    4.1.6 Directional Alignments Combination
    4.1.7 Error Analysis
  4.2 Applications
    4.2.1 Syntax Transfer
    4.2.2 Phrase-based Machine Translation
5 Part of Speech Induction
  5.1 Experimental Setup
  5.2 Overall Results
  5.3 Maximum Entropy Observation Model Results
  5.4 Tag Sparsity Results
  5.5 Error Analysis
  5.6 Using The Clusters
6 Conclusion and Future Work
  6.1 Conclusions
  6.2 Future Work
A Derivations
  A.1 Convergence Proof
  A.2 Modified E-step Dual Derivation
  A.3 Modified Penalty E-step Dual Derivation
B Bibliography

List of Figures

2.1 HMM running example
2.2 HMM trellis representation
2.3 Manually annotated word alignment
2.4 Example of a predicted word alignment
2.5 Precision Recall curve example
2.6 State posterior distributions for different models for an English and French sentence pair
2.7 PoS mapping example
2.8 Differences of sparsity per word on the true tag distribution versus the HMM and HMM-ME models trained using EM
3.1 PR objective
3.2 Modified EM algorithm for optimizing the PR objective
3.3 State posterior distributions for different models for an English and French sentence pair: EM training vs. PR with bijectivity constraints
3.4 State posterior distributions for different models for an English and French sentence pair: EM training vs. PR with symmetry constraints
3.5 Sparsity measure illustration
3.6 Differences of sparsity per word on the true tag distribution versus the HMM and HMM-ME models trained using EM and PR with sparsity constraints
4.1 Effects of the convergence criteria for optimization of the PR objective for word alignments with symmetry constraints
4.2 Effects of slack values for optimization of the PR objective for word alignments with symmetry constraints
4.3 Precision Recall comparing the use of constraints during training or just at decoding time
4.4 Precision Recall comparison for different training methods for word alignments
4.5 Word alignment methods overall comparison
4.6 Learning curves for different word alignment methods
4.7 Word alignment method comparison broken into rare and common words
4.8 Word alignment method comparison after directional symmetrization heuristics
4.9 Word alignment error analysis for different methods
4.10 Edge conservation for cross-lingual grammar induction
4.11 Machine translation phrase extraction from word alignments example
5.1 PoS methods comparison
5.2 PoS maximum entropy features effect comparison
5.3 PoS effect of reducing the number of model parameters
5.4 PoS number of model parameters
5.5 PoS maximum entropy regularizer strength comparison
5.6 PoS average absolute ℓ1/ℓ∞ difference for each word type between each training method and the supervised initialization
5.7 PoS L1LMax vs. accuracy for different methods
5.8 PoS token distribution per cluster
5.9 PoS comparison of different numbers of word types constrained to be sparse
5.10 PoS different sparsity strength comparison
5.11 PoS error analysis for English
5.12 PoS error analysis for Portuguese
5.13 PoS supervised learning error by using induced clusters as features
5.14 PoS supervised learning error reduction by using induced clusters as features

List of Tables

2.1 General notation used in this document
2.2 HMM notation
2.3 HMM probability distributions
2.4 HMM multinomial parametrization
2.5 Word alignment corpus statistics
2.6 PoS corpora statistics
4.1 Machine translation results for different word alignments

List of Algorithms

1 EM algorithm
2 Computing KL(Qx̄ ‖ pθ(z̄ | x̄))
3 PR optimization via modified EM

1 Introduction

Unsupervised learning of linguistic structures is a complex task. The process by which linguistic structures are generated is not always clear, and even when it is, it is normally too complex to be formally expressed. Nevertheless, unsupervised learning has been applied to a wide range of natural language processing tasks, such as Part of Speech Induction (PoSI) [Schütze, 1995; Merialdo, 1994; Clark, 2003], Dependency Grammar Induction [Klein and Manning, 2004; Smith and Eisner, 2006], Constituency Grammar Induction [Klein and Manning, 2004], Statistical Word Alignments (SWA) [Brown et al., 1993] and Anaphora Resolution [Charniak and Elsner, 2009], just to name a few.

Different motivations have pushed research in this area. From both a linguistic and a cognitive point of view, unsupervised learning is useful as a tool to study language acquisition. From a more pragmatic perspective, unsupervised learning is required because annotated corpora are a scarce resource, for several reasons. First, the construction of an annotated corpus is costly and time consuming, resulting in an actual shortage of annotated corpora for several tasks. Second, even when such corpora exist, they are usually not generic, and when the context or domain changes their usefulness decreases dramatically. Third, when creating a corpus, several assumptions are embedded in the annotation process that will propagate to the results of any tool trained on it. These assumptions are not necessarily the most useful for a specific task. For instance, the decision of which Part of Speech (PoS) tags to use when annotating a corpus may seem useful at the time, but hinder the corpus's utility for other tasks.
This situation occurs, for instance, in the Penn TreeBank [Marcus et al., 1993], which contains a special PoS tag for the word "to" that is hard to justify, since it is generally a preposition. Although a model trained in a supervised manner can learn this easily from a few examples, the same is not true for unsupervised learning, since learning is based on co-occurrences of words or combinations of words in the text. As another example, in the Portuguese Treebank [Afonso et al., 2002] the text was segmented in a way that merges various words into a single unit with the same tag (compound nouns, adverbial expressions). Although this might have eliminated many ambiguities during the annotation process, it is inadequate for several multilingual applications, such as grammar transfer, when the other corpus does not follow the same segmentation guidelines. Moreover, the segmentation used is by no means trivial and easy to replicate, making this corpus unfit for training a system whose goal is to annotate a corpus with a different segmentation.

Furthermore, even when annotated corpora are available, the annotations induced by an unsupervised system can be useful as extra data. For instance, induced PoS tags have been used to boost the performance of supervised models in several tasks: in PoS tagging, either to account for rare words or for domain adaptation [Huang and Yates, 2009; Biemann et al., 2007]; in named entity recognition and word sense disambiguation [Biemann et al., 2007]; to estimate the fertility of words in word alignment models [Brown et al., 1993]; or to replace actual words in unsupervised grammar induction [Headden III et al., 2008].

Unsupervised learning can be split into three different tasks: 1) Modeling - design a model that explains the linguistic phenomena we are trying to capture; 2) Learning - given a model and a corpus, learn the parameters that maximize some objective; 3) Decoding - given the model and parameters, pick the best hidden structure from the posterior distribution (the distribution over all possible hidden structures conditioned on the observed data).

A common approach in modeling is to choose a family of probabilistic parametric models where some variables are hidden and represent the linguistic structure being modeled. This choice entails a fundamental trade-off between model complexity and tractability of inference. On the one hand, we desire a model capable of faithfully reproducing the various steps of the generative process underlying the linguistic phenomena. On the other hand, the model has to be computationally tractable to be of any use. This second requirement implies that we frequently end up with a model that follows an oversimplified generative story. For instance, in machine translation, early models [Brown et al., 1993] assumed that the translation of a particular sentence breaks down as the translation of each individual word in a linear order from left to right.

Training consists of finding the parameters that maximize some objective, normally the likelihood of the data. A common solution to this optimization problem is the Expectation Maximization (EM) algorithm [Dempster et al., 1977]. The quality of the parameters corresponds to the quality of the hidden structure chosen by the model during decoding, given the parametrization. When using maximum likelihood estimation, there is no way to specify any bias or prior knowledge about the desired hidden structures.
This lack of guidance, together with the oversimplification of the model, causes the learned parameters to be suboptimal, which may lead to undesired results. Furthermore, even simple models normally have too many parameters. The lack of guidance also leads the model to overfit and try to explain undesired correlations in the data.

To mitigate this problem one can go back to the modeling phase and design a new model that introduces constraints, disallowing certain configurations of the hidden variables, or a model that follows a more complex generative story. For instance, Brown et al. [1993] start by defining two simple and tractable models, IBM M1 and IBM M2, for word-based machine translation, which follow an oversimplified generative story of how alignments are produced. These models produce reasonable results but make systematic errors. The authors then proceed by defining a series of more complex models, IBM M3-M6, that model refined versions of the generative story and produce better alignments. However, these improvements come at a price: the more complex models are intractable and require heuristics and approximate inference prone to search errors. Note that often the simplest models have the same expressive power as more complex ones, and can represent the same set of hidden structures (this is the case for the different IBM models). Nevertheless, their simplifying assumptions make it harder for the learning algorithm to find the correct parameters without any guidance.

A different way to address the lack of guidance is to keep the model and change the training procedure. A common approach is to move from maximum likelihood estimation to maximum a posteriori estimation, where a prior distribution over the model parameters (which encodes preferences about different parameter settings) is added, thus guiding the learning procedure. Another approach is to move from maximum a posteriori estimation to a fully Bayesian paradigm of learning. In this paradigm, instead of picking the best parameter setting, we directly pick the best hidden state structure by integrating over all possible parameter distributions. This approach has shown positive results for several tasks, such as Grammar Induction [Cohen and Smith, 2009; Headden III et al., 2009] and PoSI [Johnson, 2007; Goldwater and Griffiths, 2007]. Despite its recent success, this is not an ideal solution, since we are normally interested in expressing preferences over the hidden structures and not over the parameters. Moreover, these preferences are normally hard to express as preferences on the parameters, since the posterior distributions depend on the parameters through complex interactions.

In this thesis we propose a different approach to guide the model during learning. We present a new learning framework called Posterior Regularization (PR) that incorporates prior knowledge directly over the hidden variables during learning. The PR framework defines a rich language for specifying prior knowledge about the desired hidden structure. The prior knowledge is expressed as inequalities on the expected value, under the posterior distribution, of some user-defined constraint features (not necessarily the same features used by the model). Constraining the posterior distribution is a more direct way to achieve the desired behaviour than adding priors to the parameters, as we show in the experimental sections.
Furthermore, constraining the expected value of the features, instead of adding them to the model, allows us to express features that would otherwise make the model intractable. With PR, the underlying model remains unchanged, but during learning it is forced to find parameters that satisfy the desired constraints. We also present a learning algorithm, resembling the EM algorithm, that finds the parameters that maximize the PR objective.

We show the benefits of the new learning framework on two different case studies: SWA and PoSI. For the SWA task we use the Hidden Markov Model (HMM) for word alignments [Vogel et al., 1996] and achieve significant improvements in performance with the new training procedure. We use PR to enforce two different pieces of prior knowledge as constraints: (i) bijectivity: "one word in one language should not translate to many words in the other language"; and (ii) symmetry: "the translation of words should be the same when translating from language A to language B and from language B to language A". We show, for six different language pairs and different testing scenarios, that training using PR with both of these constraints always outperforms training using EM, normally by a large margin. Moreover, with this training procedure the simple HMM model generally outperforms the more complex IBM M4 model. The alignments produced by the new training procedure are also better suited for transferring syntactic dependency parse annotations and for extracting minimal translation units, leading to better end-to-end machine translation systems, especially in small-data scenarios.

In the task of PoSI we use PR to constrain the ambiguity of word-tag associations via a sparsity-inducing norm on the posteriors. We use a first-order HMM as our model to compare the different training conditions: EM training without modifications to encourage sparsity; a Bayesian approach with a sparsity-inducing prior, as used by Johnson [2007], trained with variational Bayes EM; and PR with the posterior sparsity-inducing norm constraint. Our experiments on six languages (Bulgarian, Danish, English, Portuguese, Spanish, Turkish) show dramatic improvements over state-of-the-art results: an 11% average increase in accuracy. Furthermore, using the learned clusters in a semi-supervised setting cuts the amount of labeled data needed to achieve a given error level by more than half.

1.1 Thesis Contributions

The main contributions of this thesis are:

• a new learning framework that provides a rich language to express prior knowledge about the hidden variables via posterior regularization;
• an efficient and simple algorithm for parameter estimation in the proposed learning framework;
• a state-of-the-art system for SWA, with a detailed analysis of the results;
• a state-of-the-art system for fully unsupervised PoSI, with a detailed analysis of the results.

This thesis also contributes two important resources to the community:

• manually annotated word alignments among six European languages, freely available at https://www.l2f.inesc-id.pt/wiki/index.php/Word_Alignments;
• the PR toolkit, an extensive software package that contains the code to reproduce all the experiments in this thesis as well as several other applications. The toolkit is available under the GPL at http://code.google.com/p/pr-toolkit/.

1.2 Document Outline

This document is organized as follows: in the next chapter we lay down the background for the remainder of the document.
We then describe the probabilistic model (HMM), the baseline learning algorithm (EM), and different ways of finding the best hidden structure given a model and a set of parameters. We proceed by introducing the two case studies: SWA and PoSI. Chapter 3 introduces the PR framework. It describes the PR objective and how to calculate it efficiently. Then we present a learning algorithm for this objective, similar to the EM algorithm, and the application of PR to the two case studies. We conclude the chapter with a description of work related to PR. Chapter 4 presents a detailed evaluation and analysis of the results for the SWA task, and Chapter 5 presents a detailed evaluation and analysis of the results for the PoSI task. Finally, Chapter 6 concludes and discusses directions for future work.

2 Background

In this chapter we introduce the mathematical framework used throughout this document, and the baseline learning framework to which we shall compare our results. We describe two case studies and their problems when the baseline learning framework is used.

The problem setting is the following: let X = x̄1 . . . x̄D be a training set of independent and identically distributed random variables. In this work, x̄d (for notational simplicity we will drop the superscript d when considering an isolated example) corresponds to a sentence (or a pair of sentences from parallel corpora, but for simplicity we will use the single-sentence case to introduce the general notation) in natural language, and decomposes as a sequence x̄ = x1 . . . xN of observations of length N. Each xn is a discrete random variable, in our case a word, taking a value v from a finite vocabulary V. Each x̄ has an unknown hidden structure z̄ that we want to predict. In this work the structures are sequences z̄ = z1 . . . zN of the same length N as the observations. Each hidden state zn is a discrete random variable and can take a value h from a discrete vocabulary H.

X                  training set
D                  number of training examples
x̄ = x1 . . . xN    observation sequence
N                  size of the observation sequence
xi                 observation at position i in the sequence
V                  observation values set
|V|                number of distinct observation types
vi                 particular observation, i ∈ |V|
z̄ = z1 . . . zN    hidden state sequence
zi                 hidden state at position i in the sequence
H                  hidden states value set
|H|                number of distinct hidden value types
hi                 particular hidden value, i ∈ |H|

Table 2.1: General notation used in this document.

This work focuses on the problem of unsupervised learning of linguistic structure. We can break down this problem into three stages:

• Modeling - This step consists of choosing a model that explains the generative process by which language and the corresponding linguistic structure arise. We will work with families of parametric generative probabilistic models, pθ(x̄, z̄), where θ are the model parameters and each setting of θ defines a particular model. For both case studies we will use an HMM model, which we describe in Section 2.1.

• Learning - Given a model family and a training corpus X, the goal of the learning step is to find the parameters that maximize some objective function, normally the likelihood of the observed data. This objective is maximized using the Expectation Maximization (EM) algorithm, the most common approach for training models with hidden variables. In Section 2.2 we describe the EM algorithm.

• Decoding - Given a model and a setting of parameters θ, find the "best" structure for a particular observation sequence x̄.
Decoding will be addressed in Section 2.3.

To make the problem setting more concrete, we present a brief overview of the two particular applications used in this work: SWA and PoSI.

In SWA we are given a training set X, where each training example x̄ is a pair of sentences: a target sentence x̄t = xt1 . . . xtN of length N and a source sentence x̄s = xs1 . . . xsM of length M. Each xtn is a word vt from a target language vocabulary Vt and each xsm is a word vs from a source vocabulary Vs. The hidden structure z̄ represents a mapping between target words and source words, indicating that the two words are a translation of each other. Section 2.4 describes this application in detail, the baseline model used, and its major problems.

In PoSI we are given a training set X where each training example is a sentence x̄ = x1 . . . xN of length N. Each xn is a word v from a vocabulary V. We are given a set of clusters H, and the task is to assign each word in x̄ to a cluster h ∈ H. Section 2.5 describes the task in detail, the baseline model, and its problems.

2.1 Modeling - Hidden Markov Models

2.1.1 Overview

The Hidden Markov Model (HMM) is one of the most common probabilistic sequence models, and has been applied to a wide variety of tasks. More generally, an HMM is a particular instance of a chain directed probabilistic graphical model, or a Bayesian network. In a Bayesian network, every random variable is represented as a node in a graph, and the edges in the graph are directed and represent probabilistic dependencies between the random variables. For an HMM, the random variables are divided into two sets: the observed variables, in our case x̄, and the hidden variables, in our case z̄. In HMM terminology, the observed variables are called observations and the hidden variables are called states (an excellent description of the HMM is given in Rabiner [1989]). To ground the definitions and simplify the explanation we will use a small example, displayed graphically in Figure 2.1 (the notation is summarized in Table 2.2).

x̄                         observed sentence "we know the way"
N = 4                     observation length
i, j                      positions in the sentence: i, j ∈ {1 . . . N}
V = {we, know, the, way}  observation vocabulary
p, q                      indexes into the vocabulary: p, q ∈ |V|
xi = vq, xj = vp          observation at position i (j) has value vq (vp)
H = {a, b, c}             hidden values vocabulary
l, m                      indexes into the hidden values vocabulary
zi = hl, zj = hm          state at position i (j) has hidden value hl (hm)

Table 2.2: HMM notation for the running example.

Figure 2.1: HMM structure for the simple running example.

A first-order HMM has the following independence assumptions over the joint distribution pθ(x̄, z̄):

• Independence of previous states - the probability of being in a given state hl at position zi only depends on the state hm of the previous position zi−1, that is, pθ(zi = hl | zi−1 = hm, zi−2 . . . z1) = pθ(zi = hl | zi−1 = hm), defining a first-order Markov chain. The order of the Markov chain depends on the number of previous positions taken into account. The remainder of the exposition can easily be extended to higher-order HMMs, giving the model more expressiveness, but making inference harder.
• Homogeneous transitions - the probability of making a transition from state hl to state hm is independent of the particular position in the sentence: for all i, j ∈ N, pθ(zi = hl | zi−1 = hm) = pθ(zj = hl | zj−1 = hm), so pθ(zi = hl | zi−1 = hm) = pθ(hl | hm).

• Observation independence - the probability of observing vq at position i is fully determined by the state at that position, that is, pθ(xi = vq | zi = hl), and this probability is independent of the particular position, that is, pθ(xi = vq | zi = hl) = pθ(vq | hl).

These conditional independence assumptions are crucial to allow efficient inference, as will be described. There is also a particular starting probability, the probability of starting at state hl, which is modeled independently. The three probability distributions that define the HMM are summarized in Table 2.3. For each of them we will use a short notation to simplify the exposition. The parametrization of these distributions will be discussed in more detail in Subsection 2.1.2.

Name                     probability distribution     short notation
initial probability      pθ(z1 = hl)                  πl
transition probability   pθ(zi = hl | zi−1 = hm)      am,l
observation probability  pθ(xi = vq | zi = hl)        bl(xi)

Table 2.3: HMM probability distributions.

The joint distribution can be expressed as:

  pθ(x̄, z̄) = π_z1 b_z1(x1) ∏_{i=2}^{N} a_{z_{i−1},z_i} b_{z_i}(xi),    (2.1)

which for the example from Figure 2.1 is:

  pθ(x̄, z̄) = πa ba("we") aa,a ba("know") aa,b bb("the") ab,c bc("way").    (2.2)

Besides the probability of a given observation sequence and hidden state sequence, we will be interested in computing the total probability of an observed sequence, that is, "the likelihood" of the sequence, which corresponds to summing the probability over all possible hidden state sequences:

  Likelihood: pθ(x̄) = Σ_z̄ pθ(x̄, z̄).    (2.3)

The number of possible hidden state sequences is exponential in the length of the sentence (|H|^N), which makes summing over all of them hard. Yet, we must be able to compute this sum (the sum over z̄) to compute the likelihood formula above. For the particular example in Figure 2.1, all possible paths are represented explicitly in Figure 2.2. Each column represents a position and each row represents a possible state. By following the arrows from position 1 to position 4 one can traverse all 3^4 = 81 possible paths.

Figure 2.2: Trellis representation of the HMM in Figure 2.1, for the observation sequence "we know the way", where each hidden variable can take the values a, b, c.

However, there is a well-known dynamic programming algorithm, the Forward Backward (FB) algorithm, that allows the computation to be performed in linear time by making use of the independence assumptions. The FB algorithm relies on the independence of previous states assumption, which is illustrated in the trellis view by only having arrows between consecutive states. The FB algorithm defines two auxiliary probabilities, the forward probability and the backward probability.

  Forward Probability: αi(hl) = pθ(zi = hl, x1 . . . xi)    (2.4)

The forward probability represents the probability that at position i we are in state zi = hl and have observed x̄ up to that position.
The forward probability is defined by the following recurrence:

  α1(hl) = πl bl(x1)    (2.5)
  αi(hl) = [ Σ_{hm∈H} am,l αi−1(hm) ] bl(xi)    (2.6)

Considering our example, the probability of being in state hl at position 1 and having observed the word "we" is just the initial probability of starting at state hl times the probability of observing the word "we" in state hl. When we advance to position 2, the probability of being in state hl and observing the word "know" is the probability of observing the word "know" given that we are in state hl, times the sum over all the possible ways we could have reached that state at position 2: we could have reached hl at position 2 from any state at position 1, so it corresponds to the sum over all possible states at position 1, each times the probability of transitioning from that state to hl at position 2. At position 4, the last position of the sequence, the probability of being in state hl and having observed the last word "way" accounts for all possible paths up to the last position. So, the likelihood of a given sequence can be computed by summing the forward probability at the last position over every possible state:

  pθ(x̄) = Σ_z̄ pθ(x̄, z̄) = Σ_{h∈H} αN(h).    (2.7)

Although the forward probability is enough to calculate the likelihood of a given sequence, we will also need the backward probability to calculate other quantities of interest. The backward probability represents the probability of observing x̄ from position i + 1 up to N, given that at position i we are in state zi = hl:

  Backward Probability: βi(hl) = pθ(xi+1 . . . xN | zi = hl).    (2.8)

The backward recurrence is given by:

  βN(hl) = 1    (2.9)
  βi(hl) = Σ_{hm∈H} al,m bm(xi+1) βi+1(hm).    (2.10)

The backward probability is similar to the forward probability, but operates in the inverse direction. At position N there are no more observations, and the backward probability is set to 1. At position i, the probability of being in state hl and observing the future is given by the sum, over all possible states hm, of the probability of transitioning from state hl at position i to state hm at position i + 1, observing xi+1 at position i + 1, and observing the rest of the future, given by βi+1(hm).

With the forward and backward probabilities one can compute the likelihood of a given sentence using any position in the sentence:

  pθ(x̄) = Σ_z̄ pθ(x̄, z̄) = Σ_{h∈H} αi(h) βi(h), for any position i.    (2.11)

Note that for position N, βN(h) = 1 and we recover Equation 2.7. Although redundant, this fact is useful when implementing an HMM, as a sanity check that the computations are being performed correctly, since the likelihood computed at each position should be the same.

The FB algorithm may fail for long sequences, since the nested multiplication of numbers smaller than 1 can easily fall below machine precision. To avoid this problem, Rabiner [1989] presents a scaled version of the FB algorithm.

In order to complete the three stages of unsupervised learning described at the beginning of this chapter (modeling, learning, decoding), we will need to calculate a few other probabilities.
These are: a) the sequence posterior distribution, that is, the probability of a particular hidden state sequence given that we have observed a particular sentence; b) the state posterior distribution, corresponding to the probability of being in a given state at a certain position given the observed sentence; c) the transition posterior distribution, which is the probability of making a particular transition, from position i to i + 1, given the observed sentence. In Section 2.2 it will become clear why these quantities are necessary. All these quantities are easily computed after running the FB algorithm:

  Sequence Posterior: pθ(z̄ | x̄) = pθ(x̄, z̄) / pθ(x̄);    (2.12)
  State Posterior: γi(hl) = pθ(zi = hl | x̄) = αi(hl) βi(hl) / pθ(x̄);    (2.13)
  Transition Posterior: ξi(hl, hm) = pθ(zi = hl, zi+1 = hm | x̄) = αi(hl) al,m bm(xi+1) βi+1(hm) / pθ(x̄).    (2.14)

Since we are working in an unsupervised setting, we never get to observe the hidden state sequences. Instead, given a training set X = {x̄1 . . . x̄D}, we will need to collect sufficient statistics, or expected counts, that represent the number of times each hidden-variable event is expected to occur under the current parameter setting. These sufficient statistics will then be used during learning as fake observations of the hidden variables. The sufficient statistics are: a) initial counts, denoted ic(hl), which count the expected number of times state hl is picked as the initial state; b) transition counts, denoted tc(hl, hm), which count the expected number of times a transition from state hl to state hm is performed; c) state counts, denoted sc(vq, hm), which count the expected number of times state hm emits observation vq. Using the posteriors described above, these quantities can be computed by the following formulas:

  Initial Counts: ic(hl) = Σ_{d=1}^{D} γ1(hl);    (2.15)
  Transition Counts: tc(hl, hm) = Σ_{d=1}^{D} Σ_{i=1}^{N−1} ξi(hl, hm);    (2.16)
  State Counts: sc(vq, hm) = Σ_{d=1}^{D} Σ_{i=1, xi=vq}^{N} γi(hm).    (2.17)

These formulas correspond to summing, over all D instances in the training corpus, the posterior probability (the cumulative probability over all possible hidden state sequences) of observing a given event.

short notation  probability distribution   number of parameters  constraint
πj              pθ(z1 = hj)                |H| − 1               Σ_{hj∈H} πj = 1
am,l            pθ(zi = hl | zi−1 = hm)    (|H| − 1)|H|          Σ_{hl∈H} am,l = 1
bl(vq)          pθ(xi = vq | zi = hl)      (|V| − 1)|H|          Σ_{vq∈V} bl(vq) = 1

Table 2.4: Multinomial parametrization of the HMM distributions.
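To make the FB computations concrete, the following sketch implements the recurrences and posteriors above for a single sentence. It is an illustration we add here, not part of the thesis toolkit: the array names (pi, A, B) are our own, and it omits the scaling of Rabiner [1989], so it is only reliable for short sequences.

```python
import numpy as np

def forward_backward(pi, A, B, x):
    """Unscaled forward-backward for one observation sequence.

    pi : (K,)   initial probabilities, pi[l]   = p(z_1 = h_l)
    A  : (K, K) transition matrix,     A[m, l] = p(z_i = h_l | z_{i-1} = h_m)
    B  : (K, V) observation matrix,    B[l, q] = p(x_i = v_q | z_i = h_l)
    x  : (N,)   word indices into the vocabulary
    """
    N, K = len(x), len(pi)
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))          # beta_N(h) = 1 (Eq. 2.9)

    alpha[0] = pi * B[:, x[0]]      # Eq. 2.5
    for i in range(1, N):           # Eq. 2.6
        alpha[i] = (alpha[i - 1] @ A) * B[:, x[i]]
    for i in range(N - 2, -1, -1):  # Eq. 2.10
        beta[i] = A @ (B[:, x[i + 1]] * beta[i + 1])

    likelihood = alpha[-1].sum()    # Eq. 2.7

    # State posteriors gamma_i(h_l) (Eq. 2.13) and
    # transition posteriors xi_i(h_l, h_m) (Eq. 2.14).
    gamma = alpha * beta / likelihood
    xi = np.zeros((N - 1, K, K))
    for i in range(N - 1):
        xi[i] = (alpha[i][:, None] * A
                 * (B[:, x[i + 1]] * beta[i + 1])[None, :]) / likelihood
    return gamma, xi, likelihood
```

As the sanity check suggested by Equation 2.11, (alpha[i] * beta[i]).sum() should equal the returned likelihood at every position i. Summing gamma and xi over the corpus gives exactly the counts of Equations 2.15-2.17.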
A major drawback of using these distributions is that each observation is treated atomically and it does not allow the use of features, such as morphology and orthography of words. Also, given that each word is treated atomically there is no way to reduce the number of parameters in the model, while at the same time model all observations. In this thesis, in particular in the PoSI application, we will replace the multinomial observation probability by a maximum entropy distribution, exp(f (vq , hl ) · θ) pθ (vq | hl ) = � , vp exp(f (vp , hl ) · θ) (2.21) where f (vq , hl ) is a feature function defined over observation and hidden state values, and θ are the parameters. If we only use an identity feature per each observation and hidden state pair then both models can represent the same distributions. However, the advantage of this parametrization is that it enables us to define different features for each observation and to share some of those features between different observations, for instance word suffixes and orthography features (capitalization, numeric values). Moreover, since we can share features between observations, we can reduce the total number of parameters by discarding word identity features and relying on the shared features. We will denote the HMM model with maximum entropy observation distribution by HMM+ME to distinguish it from the HMM with multinomial distribution. 12 CHAPTER 2. BACKGROUND The drawback of using the HMM+ME model is that there is no closed form solution to estimate the maximum likelihood parameters. Instead we need to solve the following unconstrained optimization problem: �� θ̂ = arg min pθ (x̄, z̄) (2.22) θ = arg min θ = arg min θ x̄∈X z̄ � � vq ∈V hl ∈H � � vq ∈V hl ∈H (2.23) sc(vq , hl ) · log(pθ (vq |hl )) sc(vq , hl )(θ · f (vq , hl ) − log( � vp ∈V exp(θ · f (vp , hl )))), (2.24) where we are finding the parameters that maximize the joint likelihood of both x̄ and z̄ for all the corpus. Note that pθ (vq | hl ) defines a probability distribution over word values and hidden state values and does not depend on the particular context where they occur. So the sum over all the corpus amounts to iterating over all pairs of word values and hidden state values and counting how many times they appear in the corpus. Furthermore, since we are focusing on the unsupervised learning setting we do not have the values for the hidden variables, so instead we will use the expected counts described earlier. This can be seen as if we had observed a given pair word value and hidden state value sc(vq , hl ) times. When training a maximum entropy model it is usual to add a prior on the parameters to avoid overfitting. This is more important in the maximum entropy model than in the multinomial model, since the parameters on the maximum entropy model are unconstrained, as opposed to the multinomial parameters that have to be positive and sum to one (since they correspond to a probability distribution). We add a Gaussian prior to the parameters which corresponds to adding a squared Euclidean norm penalty on the parameters multiplied by the inverse of the Gaussian variance, σ1 ||θ||22 that discourages the parameters from having big values. 
The estimation procedure becomes:

  θ̂ = arg max_θ Σ_{vq∈V} Σ_{hl∈H} sc(vq, hl) · log pθ(vq | hl) − (1/σ)||θ||₂²,    (2.25)

and its gradient is

  ∂/∂θ = Σ_{vq∈V} Σ_{hl∈H} sc(vq, hl) · ( f(vq, hl) − Σ_{vp∈V} pθ(xi = vp | zi = hl) · f(vp, hl) ) − (2/σ)θ.

To solve this optimization problem we use a gradient-based method, with the direction given by the limited-memory quasi-Newton method and the step size picked using the strong Wolfe rule [Nocedal and Wright, 1999].

Using this parametrization we still have a regular HMM, which is different from the Maximum Entropy Markov Model (MEMM) described in McCallum et al. [2000], since the HMM+ME models the probability of a word given a hidden state, which is the opposite of what is done in MEMMs. Note that we still have a proper probability distribution over observation values given hidden state values, that is, Σ_{vq∈V} pθ(vq | hl) = 1; we just have a different set of parameters θ. This is also the biggest difference between this model and a Markov random field (MRF). An MRF is an undirected graphical model, and the joint probability of an observation x̄ and a state sequence z̄ is given by:

  pθ(x̄, z̄) = ( ∏_{i=1}^{N} ψ(hi, hi−1) ψ(hi, vi) ) / Z,    (2.26)

where Z is the normalization constant and ψ(hi, hi−1) and ψ(hi, vi) are non-negative (transition and observation) potentials. As opposed to the HMM, the observation and transition potentials need not be normalized, since the MRF only needs to be normalized globally. Even using the same features, the two models differ in their degrees of freedom, since the HMM+ME needs to be locally normalized.

2.2 Learning - Expectation Maximization Algorithm

Given the parametric model pθ(x̄, z̄) and a training corpus X of D sentences x̄1 . . . x̄D, training seeks model parameters θ that minimize the negative log-likelihood of the corpus:

  Negative Log Likelihood: L(θ) = Ê[− log pθ(x̄)] = Ê[− log Σ_z̄ pθ(x̄, z̄)],    (2.27)

where Ê[f(x̄)] = (1/D) Σ_{d=1}^{D} f(x̄d) denotes the empirical average of a function f over the training corpus.

Because of the hidden variables z̄, the likelihood term contains a sum over all possible hidden structures inside a logarithm, which makes this quantity hard to compute. The most common minimization algorithm for fitting the model parameters in the presence of hidden variables is the Expectation Maximization (EM) algorithm [Dempster et al., 1977].

The EM procedure can be understood intuitively in the following way: if we observed the hidden variables' values for all sentences in the corpus, then we could easily compute the maximum likelihood value of the parameters, as described in Subsection 2.1.2. On the other hand, if we had the model parameters, we could compute the posterior distributions and collect the sufficient statistics. The EM procedure starts with an initial guess for the parameters θ0 at time t = 0. The algorithm iterates for T iterations, until it converges to a local minimum of the negative log-likelihood, and each iteration is divided into two steps. The first step - the "E step" (Expectation) - computes the posteriors over the hidden variables, pθt(z̄ | x̄), given the current parameter values θt and the observed variables. In the case of the HMM this only requires running the FB algorithm.
The second step - the "M step" (Maximization) - uses pθ(z̄ | x̄) to "softly fill in" the values of the hidden variables z̄, collects the sufficient statistics, initial counts (Eq. 2.15), transition counts (Eq. 2.16) and state counts (Eq. 2.17), and uses those counts to estimate the maximum likelihood parameters θt+1 as described in Subsection 2.1.2.

The EM algorithm is guaranteed to converge to a local minimum of L(θ) under mild conditions [Neal and Hinton, 1998]. Note that we are not committing to the best assignment of the hidden variables, but summing the occurrences of each parameter weighted by the posterior probability of all possible assignments. This modular split into two intuitive and straightforward steps accounts for the vast popularity of EM (see Algorithm 1).

for t = 1..T do
    for each training sentence x̄ do
        E-Step: q^{t+1}(z̄ | x̄) = pθt(z̄ | x̄)
    end
    M-Step: θ^{t+1} = arg max_θ Ê[ Σ_z̄ q^{t+1}(z̄ | x̄) log pθ(x̄, z̄) ]
end

Algorithm 1: EM algorithm.

More formally, EM minimizes L(θ) via block-coordinate descent on an upper bound F(q, θ), using an auxiliary distribution over the latent variables, q(z̄ | x̄) [Neal and Hinton, 1998]:

  L(θ) = Ê[− log Σ_z̄ pθ(x̄, z̄)]    (2.28)
       = Ê[− log Σ_z̄ q(z̄ | x̄) · pθ(x̄, z̄)/q(z̄ | x̄)] ≤ Ê[− Σ_z̄ q(z̄ | x̄) log (pθ(x̄, z̄)/q(z̄ | x̄))]    (2.29)
       = Ê[Σ_z̄ q(z̄ | x̄) log (q(z̄ | x̄)/pθ(x̄, z̄))] = F(q, θ),    (2.30)

where we have multiplied and divided pθ(x̄, z̄) by the same quantity q(z̄ | x̄), and the upper bound comes from applying Jensen's inequality (Equation 2.29). F(q, θ) is normally referred to as the energy function, a name that comes from physics, where it refers to the energy of a given system that we want to minimize.

  EM Upper Bound: L(θ) ≤ F(q, θ) = Ê[Σ_z̄ q(z̄ | x̄) log (q(z̄ | x̄)/pθ(x̄, z̄))].    (2.31)

The alternating E and M steps at iteration t + 1 can be seen as minimizing the energy function first with respect to q(z̄ | x̄) and then with respect to θ:

  E: q^{t+1}(z̄ | x̄) = arg min_{q(z̄|x̄)} F(q, θt) = arg min_{q(z̄|x̄)} KL(q(z̄ | x̄) || pθt(z̄ | x̄)) = pθt(z̄ | x̄);    (2.32)
  M: θ^{t+1} = arg min_θ F(q^{t+1}, θ) = arg max_θ Ê[ Σ_z̄ q^{t+1}(z̄ | x̄) log pθ(x̄, z̄) ];    (2.33)

where KL(q || p) = Eq[log (q(·)/p(·))] is the Kullback-Leibler divergence. The KL term in the E-step results from dropping all terms of the energy function that are constant for a fixed θ, in this case the likelihood of the observation sequence pθ(x̄):

  Σ_z̄ q(z̄ | x̄) log (q(z̄ | x̄)/pθ(x̄, z̄)) = Σ_z̄ q(z̄ | x̄) log q(z̄ | x̄) − Σ_z̄ q(z̄ | x̄) log pθ(x̄, z̄)    (2.34)
  = Σ_z̄ q(z̄ | x̄) log q(z̄ | x̄) − Σ_z̄ q(z̄ | x̄) log (pθ(x̄) pθ(z̄ | x̄))    (2.35)
  = Σ_z̄ q(z̄ | x̄) log (q(z̄ | x̄)/pθ(z̄ | x̄)) − log pθ(x̄)    (2.36)
  = KL(q(z̄ | x̄) || pθ(z̄ | x̄)) − log pθ(x̄).    (2.37)

2.3 Decoding

Given the learned parameters and an observation x̄, we want to find the best hidden state sequence z̄*. There are several ways to define what we mean by the best z̄*: for instance, the best assignment to each hidden variable zi individually, or the best assignment to the sequence z̄ as a whole. The first way, normally called posterior decoding, consists of picking the highest state posterior for each position in the sequence:

  zi* = arg max_{zi} γi(zi), for i = 1 . . . N.    (2.38)

This method does not guarantee that z̄* is a valid sequence under the model: for instance, the transition between two of the locally best states may have probability zero.
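As an illustration only (reusing the gamma array from the forward-backward sketch in Section 2.1, a name we introduced there), posterior decoding is a single argmax per position:

```python
import numpy as np

def posterior_decode(gamma):
    """Posterior decoding (Eq. 2.38): pick the state with the highest
    posterior at each position independently.

    gamma : (N, K) array with gamma[i, l] = p(z_i = h_l | x)
    Returns an array of N state indices; as noted above, the resulting
    sequence may use a transition with probability zero under the model.
    """
    return np.argmax(gamma, axis=1)
```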
The second approach, called Viterbi decoding, consists of picking the best global hidden state sequence z̄:

  z̄* = arg max_z̄ pθ(x̄, z̄).    (2.39)

This method is very similar to the forward procedure of the FB algorithm, making use of the same trellis structure to efficiently represent and search the exponential number of sequences. In fact, the only difference from the forward algorithm is in the recursion 2.6, where instead of summing over all possible hidden state assignments, we take their maximum.

2.4 Statistical Word Alignments

The seminal work of Brown et al. [1993] introduced a series of probabilistic models (IBM M1-5) for statistical machine translation (MT) and the concept of "word-by-word" alignment, the correspondence mapping between words in source and target languages. Although no longer competitive as end-to-end translation models, the IBM models, as well as the HMM of Vogel et al. [1996], are still widely used for word alignment. Word alignments are used primarily for extracting minimal translation units for machine translation, such as phrases in phrase-based translation systems [Koehn et al., 2003] and rules in syntax-based machine translation [Galley et al., 2004; Chiang et al., 2005], as well as for MT system combination [Matusov et al., 2006]. Furthermore, their importance has grown far beyond machine translation: they are used, for instance, for transferring annotations between languages by projecting PoS taggers, NP chunkers and parsers [Yarowsky and Ngai, 2001; Rogati et al., 2003; Hwa et al., 2005; Ganchev et al., 2009], for the discovery of paraphrases [Bannard and Callison-burch, 2005; Callison-Burch, 2007, 2008], and for joint unsupervised PoS and grammar induction across languages [Snyder and Barzilay, 2008; Snyder et al., 2009].

A word alignment for a parallel sentence pair represents the correspondence between words in a source language and their translations in a target language. Due to the inherent ambiguity of word alignments, when performing manual annotation it is common to distinguish two kinds of alignments [Och and Ney, 2003]: sure alignments, for unambiguous translations, and possible alignments, for ambiguous translations. These annotations are required for evaluating the alignments produced by different models.

Figure 2.3: Human-annotated alignment between the English sentence "i disagree with the argument advanced by the minister ." and the French sentence "je ne partage pas le avis de le ministre .". Sure alignments are squares with borders; possible alignments are squares without borders.

Figure 2.3 shows a gold alignment example between an English and a French sentence. The following notation will be used throughout this thesis: sure alignments are represented in the figure as squares with borders, and possible alignments as squares without borders. Rows contain the source words and columns contain the target words. The figure shows the difficulty of the alignment task: three different blocks of possible alignments exist with no direct one-to-one correspondence (e.g. ne partage pas, disagree with).

There are many reasons why a simple word-to-word (1 to 1) correspondence is not possible for every sentence pair: for instance, auxiliary verbs used in one language but not the other (e.g.
2.4 Statistical Word Alignments

The seminal work of Brown et al. [1993] introduced a series of probabilistic models (IBM M1-5) for statistical machine translation (MT) and the concept of "word-by-word" alignment, the correspondence mapping between words in source and target languages. Although no longer competitive as end-to-end translation models, the IBM models, as well as the HMM of Vogel et al. [1996], are still widely used for word alignment. Word alignments are used primarily for extracting minimal translation units for machine translation, such as phrases in phrase-based translation systems [Koehn et al., 2003] and rules in syntax-based machine translation [Galley et al., 2004; Chiang et al., 2005], as well as for MT system combination [Matusov et al., 2006]. Furthermore, their importance has grown far beyond machine translation: for instance, to transferring annotations between languages by projecting PoS taggers, NP chunkers and parsers [Yarowsky and Ngai, 2001; Rogati et al., 2003; Hwa et al., 2005; Ganchev et al., 2009], to the discovery of paraphrases [Bannard and Callison-Burch, 2005; Callison-Burch, 2007, 2008], and to joint unsupervised PoS and grammar induction across languages [Snyder and Barzilay, 2008; Snyder et al., 2009].

Word alignment for a parallel sentence pair represents the correspondence between words in a source language and their translations in a target language. Due to the inherent ambiguity in word alignments when performing manual annotation, it is common to distinguish two kinds of alignments [Och and Ney, 2003]: sure alignments, for unambiguous translations, and possible alignments, for ambiguous translations. These annotations are required for evaluating the alignments produced by different models.

Figure 2.3: Human-annotated alignment between the English sentence i disagree with the argument advanced by the minister . and the French sentence je ne partage pas le avis de le ministre . Sure alignments are squares with borders; possible alignments are squares without borders.

Figure 2.3 shows a gold alignment example between an English and a French sentence. The following notation will be used throughout this thesis: sure alignments are represented as squares with borders, possible alignments as squares without borders; rows contain the source words and columns contain the target words. The figure illustrates the difficulty of the alignment task: three different blocks of possible alignments exist with no direct one-to-one correspondence (e.g. ne partage pas / disagree with). There are many reasons why a simple word-to-word (1 to 1) correspondence is not possible for every sentence pair: for instance, auxiliary verbs used in one language but not the other (e.g. English He walked and French Il est allé), articles required in one language but optional in the other (e.g. English Cars use gas and Portuguese Os carros usam gasolina), cases where content expressed with multiple words in one language corresponds to a single word in the other (e.g. agglutination, as in English weapons of mass destruction and German Massenvernichtungswaffen), and expressions translated indirectly.

Going back to the problem setting of the beginning of the chapter, we are given a parallel corpus X = {x̄1 . . . x̄D}, where each x̄ consists of a target sentence x̄t = xt1 . . . xtN of length N and a source sentence x̄s = xs1 . . . xsM of length M. The goal of statistical machine translation (SMT) is, given a source language sentence x̄s, to find the most likely translation x̄t in the target language, modeled by pθ(x̄t | x̄s). It is not feasible to model pθ(x̄t | x̄s) directly, so different decompositions have been proposed: e.g. word based [Brown et al., 1993], phrase based [Koehn et al., 2003], and syntax based [Galley et al., 2004].

The first models proposed were word based translation models [Brown et al., 1993]. These models translate a source sentence into a target sentence by translating one individual source word at a time. In this thesis we will work with the simplest IBM model, IBM M1, and with the HMM model [Vogel et al., 1996]. These models have the following generative story:

1. Given a source sentence x̄s of length M, choose the target sentence length N with probability p(N | M);
2. For each target sentence position n, choose a position in the source sentence m = zn based on the previous history, with probability p(zn = m | zn−1 = l, x̄s);
3. For each target position n, choose a target word identity based on the source word at position m, with probability p(xtn = v | xsm, x̄s).

Since the hidden state values correspond to positions in the source sentence, we represent their values hl simply by the integer l denoting the position, to simplify notation. To account for words that appear in the target sentence with no correspondence in the source sentence, an extra source word, denoted the "null word", is added to the source sentence. Step two of the generative story introduces the hidden variable z̄ = z1 . . . zN, of the same length as the target sentence, where each position n has as value the position of the aligned word in the source sentence, zn = m with m ∈ 1 . . . M. Since we are dealing not with machine translation but with word alignment, we always observe both the target and the source sentence, so step 1 of the generative story is ignored, and we can drop the conditioning on the source sentence during the alignment process. With this generative story, the alignment process is a sequence model as described in Section 2.1. However, the generative story is a gross oversimplification of the true alignment process. For instance, it only allows (1-n) alignments: each source word can be aligned to several target words, by picking the same m at step 2, but each target word can only be aligned to one source word. This is far from reality; for instance, Figure 2.3 has 3 examples of (n-n) alignments. Moreover, these models are not symmetric: the alignments produced depend on which language is picked to be the source.
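To make the three-step generative story concrete, here is a toy sketch that samples a target sentence and its alignment; the particular distributions (uniform length, a small jump kernel, a hand-written translation table) are illustrative assumptions, not the model's actual parametrization:

import random

def sample_translation(source, sample_length, sample_jump, sample_word):
    """Sample (target sentence, alignment) following the generative story:
    step 1 picks the target length, step 2 picks an aligned source
    position given the previous one, step 3 emits a target word from
    the aligned source word."""
    M = len(source)
    N = sample_length(M)                         # step 1
    target, z, prev = [], [], 0
    for n in range(N):
        m = sample_jump(prev, M)                 # step 2: z_n given z_{n-1}
        target.append(sample_word(source[m]))    # step 3: xt_n given xs_m
        z.append(m)
        prev = m
    return target, z

random.seed(0)
table = {'the': ['le', 'la'], 'red': ['rouge'], 'car': ['voiture']}
tgt, align = sample_translation(
    ['the', 'red', 'car'],
    sample_length=lambda M: M,  # keep the same length, for simplicity
    sample_jump=lambda prev, M: min(M - 1, max(0, prev + random.choice([-1, 0, 1]))),
    sample_word=lambda w: random.choice(table[w]))
print(tgt, align)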
2.4.1 Models

In this thesis we focus on the hidden Markov model (HMM) for word alignment proposed by Vogel et al. [1996]. Following the description of Subsection 2.1, the probability of an alignment z̄ and target sentence x̄t given a source sentence x̄s can be expressed as:

    pθ(x̄t, z̄ | x̄s) = pθ(x̄, z̄) = pθ(z1) Π_{i=1}^{N} pθ(zi | zi−1) pθ(xti | zi).      (2.40)

Since both sentences are observed and we are only trying to model the possible alignments z̄, the variable x̄s always has the same value (it is a constant) and does not influence pθ(x̄t, z̄ | x̄s), so we can simplify the expression and drop the dependence on x̄s.

There are several important standard details in the parametrization of the model. The distortion probability pθ(zi = m | zi−1 = l) depends only on the distance (m − l) between the source positions the states represent. Only distances in the range ±5 are modeled explicitly, with larger distances assigned equal probabilities. This simplification reduces the number of parameters of the model; moreover, the bias we are interested in adding is that consecutive words in one language tend to align to consecutive words in the other language, and we are not really interested in the absolute positions of the words.

To incorporate null links, we add an observation probability given null, pθ(xi | zi = null). Null links also need to maintain position information, so that we can still measure the distance between aligned words. To implement this, we create position-specific null hidden states, one for each source position: we can only move to the null word at the same position as the previous word (pθ(null_m | l) = 0 for m ≠ l), and null states can only transition to the same null state or to non-null states (pθ(null_l | null_m) = 0 for l ≠ m).

In this work we also use the simpler IBM M1 model as a way to initialize the parameters of the HMM model. IBM M1 assumes that the translation of a word is independent of its particular position in the sentence. The joint probability of an alignment and a target sentence given a source sentence under IBM M1 is:

    pθ(x̄t, z̄ | x̄s) = pθ(x̄, z̄) = Π_{i=1}^{N} pθ(xti | zi).                           (2.41)

For a full description of the different IBM models we refer the reader to the original IBM paper [Brown et al., 1993]. For a good tutorial on word alignments and EM applied to word alignments we refer the reader to Kevin Knight's excellent "A Statistical MT Tutorial Workbook", available from the author's website (http://www.isi.edu/~knight/#pubs).

2.4.2 Decoding

Besides the decoding options described in Section 2.3, another possibility that often works better for word alignments is Minimum Bayes-Risk (MBR) decoding [Kumar and Byrne, 2002; Liang et al., 2006; Graça et al., 2007]. Under this decoding we include an alignment link n−m if its state posterior probability is above some threshold. This allows the accumulation of probability from several low-scoring alignments that agree on one alignment link. The threshold is tuned on a small amount of labeled data, in our case the development set, to minimize some loss; Kumar and Byrne [2002] study different loss functions that incorporate linguistic knowledge, and show significant improvements. Note that this decoding could potentially produce an alignment with zero probability under the model, since many-to-many alignments can be generated. MBR decoding has several advantages. First, independently of the particular choice of loss function, by picking a specific threshold one can trade off precision and recall of the predicted word alignments.
In fact, in this work, when comparing different alignment models we do not commit to any loss function, but instead compare precision vs recall curves, generated by producing alignments for thresholds varying from 0 to 1. Second, with this method one can ignore the null word probabilities, which are normally poorly estimated.

2.4.3 Corpora

We use six different language pairs during the development of this thesis. The corpora are the Hansard corpus [Och and Ney, 2000] of English/French Canadian parliamentary proceedings (En-Fr) and, from the Europarl corpus [Koehn, 2002], the language pairs English/Portuguese (En-Pt), Portuguese/Spanish (Pt-Es), Portuguese/French (Pt-Fr) and Spanish/French (Es-Fr). In order to evaluate the quality of generated word alignments, one needs manually annotated word alignments. For the Hansard corpus we use the annotations of Och and Ney [2000], and for English/Spanish we use the annotations from EPPS [Lambert et al., 2005] (En-Es). During this thesis we also developed annotations for six different language pairs of the Europarl corpus (all combinations of Portuguese, English, French and Spanish) [Graça et al., 2008], which are freely available (https://www.l2f.inesc-id.pt/resources/translation/). For all these languages we used the first 100 sentences of the Europarl common test set defined in Koehn et al. [2003], taken from the Q4/2000 portion of the data (2000-10 to 2000-12). The common test set can be downloaded from the Europarl archives (http://www.statmt.org/europarl/archives.html).

Different annotation guidelines and annotation processes can lead to different annotations, and this can make a substantial difference in the ability to evaluate word alignments. For instance, the models described previously are directional, and thus unable to produce phrase alignments (n-m alignments); as we will see, different guidelines tend to generate more or fewer phrase alignments. Another important difference is the use of sure vs possible alignments. This makes a significant difference, as the word alignment evaluation metrics described in the following subsection only penalize the absence of sure alignments. Table 2.5 shows the different challenges that each corpus presents. For example, En-Es has longer sentences, which raises the ambiguity level of the alignment task, and a smaller percentage of one-to-one alignments; coping with non-bijective alignments is difficult because of the added uncertainty about word fertility. The percentage of sure alignments ranges from 79% for Es-Fr down to 21% for En-Fr. Although a direct correspondence between words is often not possible, the great majority of sure alignments are in fact one-to-one (86%-98%). This characteristic will be explored in this work.

2.4.4 Evaluation Metrics

There are several ways to evaluate the quality of word alignments. The extrinsic metric that really matters is the quality of the end system for which the alignments were produced. However, we still need intrinsic measures to optimize some parameters (for instance, the threshold when using posterior decoding) and to compare different models. These intrinsic measures are based on the manually built sets of sure aligned points S and possible aligned points P. These sets are compared against the set of alignments A produced by a word alignment model. Figure 2.4 shows an example of a predicted word alignment together with the human annotation.
Table 2.5: Word alignment test corpora statistics. The last column shows the percentage of 1-to-1 alignments (for S links only).

    Corpus | Sentence Pairs | Ave Length | Max Length | % Sure | % 1-1
    En/Fr  | 447            | 16/17      | 30/30      | 21%    | 98%
    En/Es  | 400            | 29/31      | 90/99      | 67%    | 86%
    En/Pt  | 60             | 11/11      | 20/20      | 54%    | 91%
    Pt/Es  | 60             | 11/11      | 20/20      | 69%    | 92%
    Pt/Fr  | 60             | 11/12      | 20/20      | 77%    | 88%
    Es/Fr  | 60             | 11/12      | 20/20      | 79%    | 87%

Figure 2.4: Example of a predicted word alignment (English but then mr. baldwin said : and French cependant , m. baldwin avait ensuite déclaré :). Red circles mark predicted alignment points.

The set S corresponds to the squares with borders and the set A to the red circles. We start by defining precision and recall, with some modification to handle sure and possible alignments:

    Precision = |A ∩ P| / |A|;                                                        (2.42)
    Recall = |A ∩ S| / |S|.                                                           (2.43)

A first thing to note is that recall only penalizes the absence of sure alignments. In this particular example, the alignment has a precision of 62.5% and a recall of 75%. Several common metrics combine precision and recall in different ways. The most common one is the alignment error rate (AER) [Och and Ney, 2003]:

    AER(S, P, A) = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|).                             (2.44)

In this particular example, the AER is 33%. Fraser and Marcu [2007b] criticize AER for having low correlation with the usual machine translation metric, the BLEU score [Papineni et al., 2002], and suggest using a weighted F-measure instead:

    F(α) = |A ∩ S| / (α|S| + (1 − α)|A|),                                             (2.45)

with 0.1 ≤ α ≤ 0.4, which they show to correlate better with BLEU. One disadvantage of this metric is that the appropriate α is corpus-specific. Other intrinsic metrics, such as the variant of AER calculated by Ayan and Dorr [2006] based on phrases extracted from the alignments, also have poor correlation with translation performance. Note that all these metrics were evaluated against the BLEU metric used for machine translation, and aim at achieving high correlation with its score; no studies exist on the correlation of these metrics with other applications, such as the transfer of annotations between languages. In this thesis we will show results mostly using precision and recall, and will use the above metrics only when strictly necessary.
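These definitions translate directly into code; the following is a minimal sketch with alignments represented as sets of (source, target) index pairs, and the toy sets chosen to reproduce the worked example above (precision 62.5%, recall 75%, AER 33%):

def precision(A, P):
    """Eq. 2.42: predicted links are correct if they are at least possible."""
    return len(A & P) / len(A)

def recall(A, S):
    """Eq. 2.43: only missing sure links are penalized."""
    return len(A & S) / len(S)

def aer(A, S, P):
    """Eq. 2.44: alignment error rate [Och and Ney, 2003]."""
    return 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))

def f_alpha(A, S, alpha):
    """Eq. 2.45: weighted F-measure [Fraser and Marcu, 2007b]."""
    return len(A & S) / (alpha * len(S) + (1 - alpha) * len(A))

# Hypothetical sets with |A| = 8, |A ∩ P| = 5, |S| = 4, |A ∩ S| = 3
# (S ⊆ P by convention: every sure link is also possible).
S = {(0, 0), (1, 1), (2, 2), (3, 3)}
P = S | {(4, 4), (5, 5)}
A = {(0, 0), (1, 1), (2, 2), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8)}
print(precision(A, P), recall(A, S), aer(A, S, P))  # 0.625 0.75 0.333...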
One way to compare different models is to look at their precision-recall curves, obtained by generating alignments using posterior decoding with thresholds varying from 0 to 1 and plotting the resulting precision and recall values. Figure 2.5 shows an example of such curves for three models. These curves contain more information than posterior decoding with a single threshold under any metric (which would be a single point on the curve). From the curves we can see that Example 3 performs significantly worse than Examples 1 and 2, and that Example 1 is better than Example 2 when high precision is required. Another way to use these curves is to measure the area under each curve, which gives a single number with which to compare models.

Figure 2.5: An example of precision-recall curves for 3 different models.

We use the area under the curve to measure the performance of models during training, and stop training when the area under the curve starts decreasing.

2.4.5 Baseline Problems

On the positive side, the HMM model is simple, with inference complexity O(N × M²). However, there are several problems with the model that arise from its directionality.

• Non-bijective: multiple target words can be linked to a single source word, which is rarely desirable. For instance, the model produces non-bijective links 22% of the time for En-Fr, against the 2% found in the gold corpus.

• Asymmetric: by switching the (arbitrary) choice of which language is source and which is target, the HMM produces very different results. For example, intersecting the sets of alignments produced by the two possible choices of source preserves less than half of their union, for both En-Fr and En-Pt.

Figure 2.6: State posterior distributions of the two directional HMM models for an English and French sentence pair. Left: EN→FR model. Right: FR→EN model.

Figure 2.6 shows an example of the state posterior distributions for the alignment between an English and a French sentence using the HMM model. Each entry in the matrix corresponds to the probability of the corresponding two words being translations of each other. Note that the state posterior distribution is a proper distribution, so for a given source word the probabilities over all target words sum to 1. The left figure shows the alignment in the English-to-French direction, where rows are source words and columns are target words, while the right figure shows the state posteriors in the opposite direction. The first observation is that, in both directions, the posteriors are concentrated around particular source words (rare words, occurring fewer than 5 times in the corpus), instead of being spread across different words. This is a well known problem when training with EM, called the "garbage collector effect" [Brown et al., 1993]: a rare word in the source language aligns to many words in the target language that we would ideally like to see unaligned, or aligned to other words in the sentence. This happens because the generative model has to distribute the translation probability of each source word among the candidate target words. If one translation is much more common than another, but the rare translation is used in the sentence, the model might have a very low translation probability for the correct alignment. The rare source word, on the other hand, occurs in only a few sentences, so it needs to spread its probability mass over fewer competing translations. In this case, aligning the rare word to all of these target words leads to a higher likelihood than aligning them correctly or aligning them to the special null word, since it increases the likelihood of this sentence without lowering the likelihood of many other sentences.

2.4.6 Related Work

Together with IBM M1 and IBM M2, Brown et al. [1993] developed a series of more complex models, IBM M3 to IBM M6.
These models differ from the previous ones in two particular ways. First, the length of the target sentence is defined based on the identity of the words in the source sentence: each source word has an associated probability of generating a particular number of target words. This is a better model than picking the target sentence length based only on the source sentence length; furthermore, it directly models how many words should be generated by each source word, which better explains the translation process. Nevertheless, these models are still 1-to-n models. Second, there is an extra reordering step over the target words at the end of the translation process. Moreover, these models are allowed to generate spurious words, corresponding to the "null word" of the simpler models, but here more than one "null word" is allowed, which has been shown to improve the quality of the resulting alignments [Moore, 2004]. IBM M3 and IBM M4 are deficient, which means that these models assign probability mass to alignments that do not form a proper target sentence. IBM M5 and IBM M6 solve this problem; however, Brown et al. [1993] show that the added complexity of removing deficiency does not provide significant improvements, so IBM M4 is the default choice. These models do not decompose in a tractable way: for instance, for a target sentence of length 9 there are 362,880 (9!) possible reorderings, which means that, besides being very slow to train, in practice one needs to resort to several heuristic approximations when summing over all possible alignments.

Since these models were developed, several researchers have worked on improving them and on creating new models. The HMM model, due to its simplicity, has been the target of many of those improvements. Toutanova et al. [2002] propose several extensions to the HMM model: conditioning the translation probabilities on part of speech information; conditioning the transition probabilities on the source word part of speech tags; capturing fertility by modeling how many words should be generated for each source word (the HMM decides whether to generate more words from the previous word or to move to another word, via a probability over a boolean variable "stay"); and better modeling of "null" alignments, by making them a mixture between the normal probability of translating a word to "null" and the same probability conditioned on the following source word. These extensions improve performance over the HMM model, but the experiments were only carried out on small corpora, so there is no guarantee that the gains hold for bigger corpora. Deng and Byrne [2006] extend the HMM to a semi-HMM in which state duration is explicitly modeled and used to capture the fertility of each word; again, the results are competitive with IBM M4 but not clearly better. Lopez and Resnik [2005] improve the HMM by adding a distortion model based on the tree distance between words, which offers a slight gain over the surface distortion model. More recently, Fraser and Marcu [2007a] propose a new generative model capable of handling non-consecutive n-m alignments, which slightly outperforms IBM M4.

2.5 Part of Speech Induction

In this section we present the PoSI task. PoS tags are a pre-requisite for many text applications.
The task of PoS tagging, where one is given a labeled training set of words and their respective tags, is well studied, with several methods achieving high prediction quality. On the other hand, the task of PoSI, where one does not have access to a labeled corpus, is much harder, with huge room for improvement. PoSI has been used to refer to two very different tasks. In the first, which we will refer to as weakly supervised, we are given raw text and a dictionary of the tags each word can have, and the goal is to disambiguate which tag a particular word occurrence should have. This is the path started by the influential work of Merialdo [1994], and there has been a large amount of recent follow-up work. In the second, which we call fully unsupervised, we are given only the raw text, along with sentence boundaries and a predefined number of clusters we can use. In this setting, which is the one we address in this work, the problem can be seen as a clustering problem: we want to put words that behave grammatically in the same way into the same cluster. This line of work typically focuses on distributional features (words with the same grammatical function tend to occur in the same neighbourhoods), on morphological features (grammatically similar words tend to share morphological features such as suffixes or capitalization), or on a mixture of both [Schütze, 1995; Clark, 2003].

The fully unsupervised setting presents several extra difficulties. First, one has to decide how many clusters to use. One problem is the Zipfian distribution of the true PoS tags: for instance, there are many more nouns and verbs than prepositions, so statistical models will prefer to use several clusters to explain nouns, to the detriment of explaining classes with few word occurrences, such as possessives. Another problem is word-tag ambiguity: each word can have different PoS tags according to its context; for instance, the word run can be either a verb or a noun. The first approaches proposed for the PoSI task dealt with this problem by assuming that each word can have only one tag; in fact, most approaches modeled the tag of a word type independently of its position in the sentence, using distributional clustering techniques [Schütze, 1995]. A more realistic option is to model the distribution of tags for each word, for instance with an HMM whose hidden state values are the possible clusters and whose observations are the words; we follow this latter approach. It brings additional problems, however: training an HMM to maximize the likelihood of the data leads to a highly ambiguous tag-given-word distribution [Johnson, 2007], meaning that each word has some probability of being associated with every hidden state, while in general each word can be assigned only a small subset of the existing tags, normally one. In fact, this problem was the main reason why people tended to use the dictionary-informed setting, bypassing the ambiguity problem.

Several works fall between the two extremes of fully unsupervised and weakly supervised, assuming access to an incomplete dictionary. Different approaches have used this setting, presenting different ways to generalize the information contained in the given dictionary [Banko and Moore, 2004; Wang and Schuurmans, 2005; Smith and Eisner, 2005; Ravi and Knight, 2009].
Formally, in the fully unsupervised PoSI setting we are given a training set X = x̄1 . . . x̄D of D training examples, where each example x̄ = x1 . . . xN is a sentence of N words whose values v are taken from a vocabulary V of possible word types. We are also given the set of clusters H that we are allowed to use. The hidden structure z̄ = z1 . . . zN corresponds to a sequence of cluster assignments for the individual words, such that zn = hl with hl ∈ H.

2.5.1 Models

The baseline model is a first order hidden Markov model (HMM) as described in Section 2.1, where the observations x are the words of a sentence and the hidden states z are cluster assignments from the pre-determined set of possible clusters H. For this application we explicitly model the final state distribution, since this provides a small but consistent improvement: we append to each sentence x̄ a fake token representing the last word, at position N + 1, and augment the hidden state space with an extra value h|H|+1. The transition probabilities into and out of this new state are zero, except at the last sentence position, from which one can move into it.

Table 2.6: PoS corpora statistics. The third column shows the percentage of word types after lower-casing and eliminating word types occurring only once.

    Corpus | Sent. | Types | LUnk   | Tokens  | Tags
    En     | 49208 | 49206 | 49.28% | 1173766 | 17
    Pt     | 9359  | 29489 | 37.83% | 212545  | 22
    Bg     | 14187 | 34928 | 39.26% | 214930  | 12
    Es     | 3512  | 16858 | 37.73% | 95028   | 47
    Dk     | 5620  | 19400 | 26.30% | 65057   | 30
    Tr     | 5512  | 18805 | 33.83% | 100238  | 25

2.5.2 Corpora

We explore PoS induction for six different languages, using corpora whose manual annotation serves only to evaluate the performance of the various models. The characteristics of the corpora are summarized in Table 2.6: the Wall Street Journal portion of the Penn treebank [Marcus et al., 1993] with the 17-tag set defined by Smith and Eisner [2005] (En); the Bosque subset of the Portuguese Floresta Sintá(c)tica treebank [Afonso et al., 2002] (Pt); the Bulgarian BulTreeBank [Simov et al., 2002] (Bg) with the 12 coarse tags; the Spanish corpus from the Cast3LB treebank [Civit and Martí, 2004] (Es); the Turkish METU-Sabanci treebank [Oflazer et al., 2003] (Tr); and the Danish Dependency Treebank (DDT) [Kromann et al., 2003] (Dk).

2.5.3 Evaluation Metrics

To compare the performance of the different models one needs to evaluate the quality of the induced clusters. To that end, we use two different kinds of evaluation. The most relevant one is how useful the induced clusters are for tasks that require PoS tags; in Section 5.6 we perform such an evaluation, on semi-supervised PoS tagging, where a supervised tagger has access to only a small amount of labeled data, and we show that adding the clusters induced by the unsupervised model as additional features boosts the performance of the supervised model. The second kind of evaluation compares the induced clusters to the true tags of the corpus. To compute accuracy against the true tags, we need to map the hidden states to the true PoS tags. The two most common approaches in the literature are the 1-Many mapping and the 1-1 mapping [Haghighi and Klein, 2006]. In the 1-Many mapping, each hidden state is mapped to the tag with which it co-occurs most often; several hidden states can thus be mapped to the same tag, and some tags might not be used at all. The 1-1 mapping greedily assigns each hidden state to a single distinct tag.
When there are as many hidden states as tags, this yields a 1-1 correspondence. A major drawback of the 1-1 mapping is that it fails to express all the information in the hidden states. Typically, unsupervised models prefer to explain very frequent tags with several hidden states, and to merge some very rare tags. For example, the Pt corpus has 3 tags that occur only once in the corpus; grouping these together while sub-dividing nouns still provides a lot of information about the true tag assignments, but this is not captured by the 1-1 mapping. Nevertheless, the 1-Many mapping also has drawbacks, since it can only distinguish clusters by their highest-frequency tag: a cluster split almost evenly between nouns and adjectives is judged the same as a cluster with the same number of nouns but a mixture of words with different tags. We therefore also evaluate the induced tags using the mutual information (MI) between tags and hidden states. To compute it, imagine randomly selecting a token in the corpus: the PoS tag of the token is one random variable, and the hidden state is another; MI is the mutual information between these two random variables.

Figure 2.7: PoS mapping example. Each entry in each matrix indicates the number of tokens of a given hidden state (column) belonging to a given true tag (row). The Sum row and column indicate the number of tokens in each category, and the last row shows the accuracy according to the different metrics.

           State0  State1  State2  Sum             State0  State1  State2  Sum
    Noun   1463    100     987     2550     Noun   2063    100     387     2550
    Verb   12      700     353     1065     Verb   462     500     103     1065
    Adv    8       35      40      83       Adv    8       35      40      83
    Sum    1483    835     1380             Sum    2533    635     530
    1-1: 59.6%  1-Many: 85.1%  MI: 0.29     1-1: 70.3%  1-Many: 79.8%  MI: 0.14

Figure 2.7 shows two examples of the clusters assigned by an unsupervised model (columns) against the corresponding true tags (rows). Each entry contains the number of tokens for a given hidden state/tag pair, and the last row contains the accuracy under each mapping. In the first case (left example), the 1-1 mapping maps State 2 to Adv, although that cluster mostly models nouns and verbs, because nouns and verbs have already been mapped to other clusters. The second case shows how the 1-1 mapping can be deceiving: it assigns a higher score to this clustering than to the first one, although the clusters are not as well separated; there is no clear cluster for nouns, since many verbs are assigned to State 1 as well. But because this mapping only uses the maximum for each tag, and these maxima are the same, the measured performance is higher. Both 1-Many and MI, on the other hand, prefer the first case. Although the 1-Many mapping correlates well with MI, it still fails to capture all the information about the clusters, as described before; on the other hand, results are easier to analyse in terms of 1-Many, since its scale is more familiar. We will use both of these metrics in the remainder of this thesis.
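A minimal sketch of the 1-Many accuracy and the MI computation, operating on a count matrix like those in Figure 2.7 (rows are true tags, columns are hidden states); run on the left matrix of the figure, it reproduces the reported numbers:

import numpy as np

def one_to_many_accuracy(C):
    """1-Many mapping: send each hidden state (column) to the true tag
    (row) with which it co-occurs most, then score the mapped tokens.
    C[t, h] counts tokens with true tag t and hidden state h."""
    return C.max(axis=0).sum() / C.sum()

def mutual_information(C):
    """MI (in nats) between the true tag and the hidden state of a
    randomly drawn token."""
    p = C / C.sum()                    # joint distribution p(tag, state)
    pt = p.sum(axis=1, keepdims=True)  # tag marginal
    ph = p.sum(axis=0, keepdims=True)  # state marginal
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pt @ ph)[nz])).sum())

# Left matrix of Figure 2.7 (rows: Noun, Verb, Adv; columns: states 0-2).
C = np.array([[1463, 100, 987],
              [12, 700, 353],
              [8, 35, 40]])
print(one_to_many_accuracy(C))  # ~0.851, the "1-Many: 85.1%" in the figure
print(mutual_information(C))    # ~0.29, the "MI 0.29" in the figure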
2.5.4 Baseline Problems

Using the HMM model for this task has several disadvantages. First, it treats words atomically, without regard to orthography and morphology, which are critical to generalization in many languages [Clark, 2003]. Second, the model has a very large number of parameters (the number of tags times the number of word types), which presents an extremely rich model space capable of fitting irrelevant correlations in the data. We address these problems by switching to the maximum entropy parametrization of the observation probability described in Subsection 2.1.2, and introduce several classes of features:

Word Identity (Words): the lower-cased token, for words that occur at least 10 times;
Suffixes (Suff.): suffixes of length 2-5, if they occur in at least 20 word forms;
Short Word (Size): word length ≤ 3;
Regular Expressions (Ortho.): initial capitalization, mixed case, punctuation, contains-hyphen, contains-underscore and contains-digit (capitalization is only used for words that are sentence-initial less than 10% of the time);
Frequent (Counts): set iff the word is one of the top 10% most frequent words.

Figure 2.8: Differences in sparsity per word between the true tag distribution and the HMM and HMM-ME models trained using EM (words ranked by ℓ1/ℓ∞).

However, for both the HMM and HMM-ME models, the parameters that maximize the likelihood of the observed data allow a high level of ambiguity in the tag-given-word distribution: common words tend to be associated with every tag with some probability, depending on the context. By contrast, a natural property of PoS categories across many languages and annotation standards is that each word has only a small number of allowed tags. To illustrate the problem, we show the sparsity differences between the true tag distributions and the clusters obtained by the HMM and HMM-ME models trained with EM. Figure 2.8 shows the distribution of tag ambiguity across words for two corpora: both models grossly overestimate the tag ambiguity of almost all words, compared with the true data (Supervised).

2.5.5 Related Work

There is a long history of unsupervised part of speech induction using distributional features [Schütze, 1995], which is beyond the scope of this thesis. To the best of our knowledge, Merialdo [1994] was the first to use a sequence model (a second order HMM) for PoS induction, using a dictionary containing the set of possible tags for each word in order to constrain the model, effectively restricting its capacity somewhat. Unfortunately, the availability of a large dictionary is somewhat unrealistic, and much of the following work tries to reduce the required dictionary size [Toutanova and Johnson, 2007; Smith and Eisner, 2005; Goldwater and Griffiths, 2007; Haghighi and Klein, 2006]. More recently, Ravi and Knight [2009] restrict the capacity of a first-order HMM tagger further by first inducing a list of possible transitions and disallowing the model from using any others. In the fully unsupervised setting, several works [Johnson, 2007; Gao and Johnson, 2008] attempt to restrict tag ambiguity by using a Dirichlet prior to favor a sparse model and hence indirectly reduce tag ambiguity. Clark [2003] tries to capture morphological information in PoS induction using an HMM whose observation probabilities are themselves modeled by an HMM over characters, and finds that modeling morphological information can produce better PoS clusters.

3 Posterior Regularization Framework

During the modeling stage, a critical design decision concerns the trade-off between model complexity and model expressiveness.
Although one wishes to use a model that explains the particular application's generative process as closely as possible, one has to restrict oneself to models that are computationally tractable. This leads to models that are generally an oversimplification of the underlying process, as seen in the previous chapter, and that, because of their simplifying assumptions, fail to learn parameter settings that produce the desired predictions. One solution to this problem is to add more complexity to the model so that it better reflects the underlying process. In the SWA application, for example, this is the approach taken by IBM models 4+ [Brown et al., 1993; Och and Ney, 2003] and, more recently, by the LEAF model [Fraser and Marcu, 2007a]. Unfortunately, these changes make the models probabilistically deficient and intractable, requiring approximations and heuristic learning and inference prone to search errors. Another solution is to move to a more Bayesian setting and express preferences about the parameters by adding prior distributions. This approach has shown positive results for several tasks, namely grammar induction [Cohen and Smith, 2009; Headden III et al., 2009]. However, preferences about the hidden variables are normally not easy to express directly in terms of the parameters, since they depend on complex interactions among the parameters. In fact, for the PoSI task, the same intuition works well when expressed over the output variables, as will be shown in this work, but fails to achieve good results when expressed over the parameters.

Instead, in this thesis we propose a new learning framework called Posterior Regularization (PR) [Graça et al., 2007] that incorporates prior knowledge directly on the output variables during learning. The prior knowledge is expressed as inequalities on the expected values, under the posterior distribution, of user-defined constraint features (not necessarily the same features used by the model). Since in most applications we are interested in the hidden variables, constraining the posteriors is a more direct way to achieve the desired behaviour; we show evidence of this for the PoSI application. On the other hand, constraining the expected values of features, instead of adding the features to the model, allows us to express features that would otherwise make the model intractable. For example, enforcing that each hidden state value of an HMM be used at most once per sentence would break the Markov property and make the model intractable; in contrast, we will show how to enforce with PR that each hidden state is used at most once in expectation. The underlying model remains unchanged, but the learning method changes: during learning, our method is similar to the EM algorithm, with the addition of solving a problem similar to maximum entropy inside the E-Step in order to satisfy the specified constraints.

In this thesis we focus the presentation of the PR framework on unsupervised learning of generative models, in particular the HMM. However, the PR framework is more general: it has been applied to semi-supervised settings, with both generative and discriminative models [Ganchev et al., 2008], and to different probabilistic models, such as tree models for the task of grammar induction [Ganchev et al., 2009]. For a broader discussion of PR and its different applications see [Ganchev et al., 2009].
The next section presents the Posterior Regularization framework; Section 3.2 then presents a learning algorithm for the PR objective, very similar to the EM algorithm described in the previous chapter. In Section 3.3 we use PR to train models for the SWA application with the following constraints: (i) bijectivity, "one word should not translate to many words"; and (ii) symmetry, "directional alignments should agree". In Section 3.4 we use PR as the training procedure for the PoSI application, and propose a constraint that addresses the high word/tag ambiguity problem. Finally, Section 3.5 presents work related to the PR framework.

3.1 Posterior Regularization Objective

Prior knowledge in PR is specified in terms of expectations, under the posterior distribution, of user-defined constraint features φ(x̄, z̄). Under certain assumptions, namely linearity of the constraints and features that decompose the same way as the underlying model (which is the case for all constraints used in this work), all this new learning procedure requires is the ability to do inference in the underlying model; in the case of the HMM, to run the FB algorithm. The key advantage of PR is that the base model remains unchanged but, during learning, is driven to obey the constraints by setting appropriate parameters θ. Moreover, empirical results show that enforcing the constraints in expectation results in predictions that also satisfy the constraints.

More formally, prior information in PR is specified through sets Qx̄ of allowed distributions over the hidden variables z̄, which satisfy inequality constraints on some user-defined feature expectations, with violations bounded by ε ≥ 0:

    Constrained Posterior Set:  Qx̄ = {q(z̄ | x̄) : ∃ξ, E_{q(z̄|x̄)}[φ(x̄, z̄)] − bx̄ ≤ ξ; ||ξ||₂² ≤ ε}.   (3.1)

Here Qx̄, which we assume non-empty and convex, denotes the set of valid distributions, those whose expectations of the features φ(x̄, z̄) under the posterior distribution are bounded by bx̄; ε ≥ 0 is an allowable violation slack. Setting ε = 0 enforces the inequality constraints strictly, requiring E_{q(z̄|x̄)}[φ(x̄, z̄)] ≤ bx̄ to hold for all q(z̄ | x̄) in Qx̄.

3.1.1 Posterior Regularization Objective

We can see the PR objective as a trade-off between two distance measures (two KL-divergences), as pictured in Figure 3.1. On the left side we have the set of all distributions over the observations x̄, for example all distributions over all possible English sentences. In the figure, δ(x̄) represents the Dirac delta function for sentences in the training corpus (it has value one if the sentence appears in the corpus and zero otherwise). On the left side we wish to find the parameters θ that minimize the KL-divergence from our model pθ(x̄) to the empirical distribution δ(x̄):

Figure 3.1: PR objective E[KL(δ(x̄) || pθ(x̄))] + E[KL(Qx̄ || pθ(z̄ | x̄))]. Left: the negative log-likelihood term. Right: the posterior regularization term.

    KL(δ(x̄) || pθ(x̄)) = Σ_x̄ δ(x̄) log (δ(x̄) / pθ(x̄))                                 (3.2)
        = Σ_x̄ δ(x̄) log δ(x̄) − Σ_x̄ δ(x̄) log pθ(x̄)                                    (3.3)
        = − Σ_{x̄∈X} log pθ(x̄) = L(θ).                                                (3.4)

The first term is always zero (δ(x̄) is either zero or one, and both values make the first term zero), and the second term is non-zero only for sentences in the training corpus; the result is the negative log-likelihood of the observed data. The right side of the figure represents all possible distributions over the outputs z̄.
Here we are trying to find the parameters that minimize the distance between the posterior distribution pθ(z̄ | x̄) and the set Qx̄ of all distributions that satisfy the desired constraints. Putting it into words: in PR, the negative log-likelihood of the model is penalized with the KL-divergence between the desired distribution space Qx̄ and the model posteriors, KL(Qx̄ || pθ(z̄ | x̄)) = min_{q(z̄|x̄)∈Qx̄} KL(q(z̄ | x̄) || pθ(z̄ | x̄)). The regularized objective is:

    Posterior Regularized Likelihood:  L(θ) + E[KL(Qx̄ || pθ(z̄ | x̄))].               (3.5)

The objective trades off likelihood against distance to the desired posterior subspace (modulo getting stuck in local optima) and provides an effective method of controlling the posteriors. Note that the second term is data dependent: if a set of parameters already satisfies the constraints, the PR objective reduces to minimizing the negative log-likelihood. On the other hand, there are cases where pθ(z̄ | x̄) will never reach the set Qx̄, either because the model cannot express the constraints or because the parameters required to achieve Qx̄ make the negative likelihood term very high. Even when the model has enough parameters to represent the constraints, the required parameter setting can be such that the negative likelihood term becomes very high, obfuscating the effect of the regularization.

3.1.2 Computing the PR Objective

Computing the PR objective involves solving the optimization problem:

    Primal Projection:  KL(Qx̄ || pθ(z̄ | x̄)) = min_{q(z̄|x̄)∈Qx̄} KL(q(z̄ | x̄) || pθ(z̄ | x̄)).   (3.6)

Minimizing this objective directly is hard, since there is an exponential number of hidden sequences z̄ (for instance |H|^N for the HMM). However, the problem becomes easy to solve in its dual formulation (see Appendix A.2 for the derivation):

    Dual Projection:  arg max_{λ≥0} −bx̄⊤λ − log Z(λ) − ε||λ||₂,                      (3.7)

where Z(λ) = Σ_z̄ pθ(z̄ | x̄) exp(−λ⊤φ(x̄, z̄)) is the normalization constant, and the primal solution is q(z̄ | x̄) = pθ(z̄ | x̄) exp{−λ⊤φ(x̄, z̄)} / Z(λ). There is one dual variable λi per expectation constraint, and the dual gradient at λ ≠ 0 is ∇(λ) = E_{q(z̄|x̄)}[φ(x̄, z̄)] − bx̄ − ε λ/||λ||₂. Note that this primal-dual relationship is very similar to the one between maximum likelihood and maximum entropy: if bx̄ corresponded to empirical expectations and pθ(z̄ | x̄) were uniform, Equation 3.7 would be a log-likelihood. As with maximum entropy, computing the gradient involves computing an expectation under q(z̄ | x̄), which can be performed efficiently if the features φ(x̄, z̄) factor in the same way as the model pθ(x̄, z̄) and the constraints are linear. The posterior over z̄ represented by a graphical model such as the HMM can be written as a product over cliques C, which in the HMM correspond to the observation and transition probabilities:

    Factored Posterior:  pθ(z̄ | x̄) = (1 / pθ(x̄)) Π_{i=1}^{N} pθt(zi | zi−1) pθo(xi | zi).   (3.8)

If the features φ(x̄, z̄) factorize along the same cliques (which is the case for all our constraints),

    Factored Features:  φ(x̄, z̄) = Σ_{i=1}^{N} φ(xi, zi) + φ(zi, zi−1),                      (3.9)

then q(z̄ | x̄) has the same form as pθ(z̄ | x̄):

    q(z̄ | x̄) = (1/Zλ) pθ(z̄ | x̄) exp(−λ⊤φ(x̄, z̄))                                            (3.10)
             = (1/Zλ) Π_{i=1}^{N} pθt(zi | zi−1) exp(−λ⊤φ(zi, zi−1)) pθo(xi | zi) exp(−λ⊤φ(xi, zi)),   (3.11)

where Zλ is the new normalization constant that ensures that q(z̄ | x̄) is a proper probability distribution.
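To illustrate the primal-dual relationship, here is a minimal sketch of the dual gradient ascent of Equation 3.7 for the simplest possible case, a single categorical posterior with one inequality constraint and ε = 0; in the HMM, the same update is applied to the clique potentials and the expectation is computed by forward-backward, as described next:

import numpy as np

def project_posterior(p, phi, b, step=0.5, iters=500):
    """Project a categorical posterior p(z) onto Q = {q : E_q[phi] <= b}
    by maximizing the dual objective -b*lam - log Z(lam) (Eq. 3.7 with
    no slack), where q(z) ∝ p(z) exp(-lam * phi(z))."""
    lam = 0.0
    for _ in range(iters):
        q = p * np.exp(-lam * phi)
        q /= q.sum()                        # primal solution for this lam
        grad = q @ phi - b                  # dual gradient: E_q[phi] - b
        lam = max(0.0, lam + step * grad)   # projected ascent keeps lam >= 0
    return q

# Toy example: the posterior puts 0.7 on state 0, but the constraint
# bounds the expectation of a feature firing on state 0 by 0.5.
p = np.array([0.7, 0.2, 0.1])
phi = np.array([1.0, 0.0, 0.0])
q = project_posterior(p, phi, b=0.5)
print(q, float(q @ phi))  # E_q[phi] is driven down to 0.5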
Hence the projection step uses the same inference algorithm (the FB algorithm for HMMs) to compute the gradient, only modifying the local factors (observation and transition probabilities) using the current setting of λ; note that the new factors are no longer proper probability distributions. We optimize the dual objective using the gradient-based method shown in Algorithm 2, where η is an optimization precision and α is a step size.

Algorithm 2: Computing KL(Qx̄ || pθ(z̄ | x̄)) = min_{q(z̄|x̄)∈Qx̄} KL(q(z̄ | x̄) || pθ(z̄ | x̄)).
λ0 ← 0
while ||∇(λi)|| > η do
    pθt(zi | zi−1)′ ← pθt(zi | zi−1) exp(−λi⊤φ(zi, zi−1))
    pθo(xi | zi)′ ← pθo(xi | zi) exp(−λi⊤φ(xi, zi))
    q(z̄ | x̄) ← forwardBackward(pθo(xi | zi)′, pθt(zi | zi−1)′)
    λi+1 ← λi + α β∇(λi)
    i ← i + 1
end

Here β∇(λ) represents an ascent direction, chosen as follows. For inequality constraints, it is the projected gradient [Bertsekas, 1999]; for equality constraints with slack, we use the conjugate gradient [Nocedal and Wright, 1999], noting that at λ = 0 the objective is not differentiable. In practice this only happens at the start of the optimization, and we use a sub-gradient for the first direction. Computing the projection requires an algorithm for inference in the original model, and uses that inference as a subroutine; for HMMs we need to make several calls to the FB algorithm in order to choose λ. Setting the optimization precision η more loosely allows the optimization to terminate more quickly but at a less accurate value, corresponding to a larger distance between pθ(z̄ | x̄) and the set Qx̄. We found that aggressive optimization significantly improves results, and consequently we choose η small enough that tighter values do not significantly improve performance.

3.1.3 Projection Before Decoding

The optimal parameters determined on the training set might not satisfy the constraints on the test data. However, we can also perform the projection min_{q(z̄|x̄)∈Qx̄} KL(q(z̄ | x̄) || pθ(z̄ | x̄)) at test time, and use the projected distribution instead of pθ(z̄ | x̄). When performing posterior decoding, we simply use q(z̄ | x̄) and pick the highest-probability node marginals under that distribution. When using Viterbi decoding, we use the modified factors pθt(zi | zi−1) exp(−λ⊤φ(zi, zi−1)) and pθo(xi | zi) exp(−λ⊤φ(xi, zi)), with the λ found during projection. We evaluate the benefits of projecting at decode time for each application/constraint individually in Chapters 4 and 5.

3.2 Posterior Regularization Optimization Algorithm

We can optimize the PR objective using a procedure very similar to the expectation maximization (EM) algorithm. Recall from Eq. 2.32 that in the E-step, q(z̄ | x̄) is set to the posterior over the hidden variables given the current θ. To converge to the PR objective, we must modify the E-step so that q(z̄ | x̄) is the projection of the posteriors onto the constraint set Qx̄, for each example x̄:

    E′:  arg min_{q(z̄|x̄),ξ} KL(q(z̄ | x̄) || pθt(z̄ | x̄))  s.t.  E_{q(z̄|x̄)}[φ(x̄, z̄)] − bx̄ ≤ ξ; ||ξ||₂² ≤ ε.   (3.12)

Figure 3.2: Modified EM algorithm for optimizing the PR objective L(θ) + E[KL(Qx̄ || pθ(z̄ | x̄))]. The E′-Step optimizes F(q(z̄), θ) over q(z̄) ∈ Qx̄; the M-Step optimizes F(q(z̄), θ) over θ.

The new posteriors q(z̄ | x̄) are used to compute the sufficient statistics for each instance, and hence to update the model's parameters in the M-step (Eq. 2.33), which remains unchanged.
This scheme is illustrated in Figure 3.2 and in Algorithm 3. The only implementation difference with respect to EM is that we must now perform the KL projection before collecting the sufficient statistics. Appendix A.1 proves that this procedure converges to a local minimum of the PR objective.

Algorithm 3: PR optimization via modified EM. The E′-Step is computed using Algorithm 2.
for t = 1..T do
    for each training sentence x̄ do
        E′-Step: q^{t+1}(z̄) = arg min_{q(z̄|x̄)∈Qx̄} KL(q(z̄ | x̄) || pθt(z̄ | x̄))
    end
    M-Step: θ^{t+1} = arg max_θ Σ_z̄ q^{t+1}(z̄) log pθ(x̄, z̄)
end

3.3 Statistical Word Alignments

In this section we propose two constraints for SWA. Note that, since in word alignments the observation x̄ corresponds to a pair of sentences, the target sentence x̄t = xt1 . . . xtN and the source sentence x̄s = xs1 . . . xsM, when defining features we will explicitly write the feature function as a function of both sentences and the alignment, to make the exposition clearer.

3.3.1 Bijectivity Constraints

We observed in Table 2.5 that most alignments are 1-to-1, and we would like to introduce this prior information into the model. Unfortunately, including such a constraint in the model, which corresponds to using each hidden state value at most once per sequence, directly breaks the Markov property in a fairly fundamental way, since at each position we would need to keep track of all the states used before. Introducing the constraint in expectation using the PR framework, in contrast, is easy and tractable. We encode it as the constraint E[φ(x̄, z̄)] ≤ 1, where we have one feature φ(x̄t, xsm, z̄) for each position m in the source sentence, counting how many times that position is aligned to a target word in the alignment z̄:

    Bijective Features:  φ(x̄t, xsm, z̄) = Σ_{n=1}^{N} 1(zn = m).                      (3.13)

Figure 3.3: State posterior distributions for an English and French sentence pair. Left: EN→FR model. Right: FR→EN model. Top: regular HMM. Bottom: after applying the bijectivity constraint. Circle size indicates probability value; circle color in the bottom row indicates the difference in posterior from the top row (green: higher probability, red: lower probability).
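Because the feature decomposes over positions, its expectation under q is read directly off the posterior alignment matrix; a minimal sketch (the matrix values are hypothetical):

import numpy as np

# gamma[n, m] = q(z_n = m | x): posterior that target position n aligns
# to source position m, as produced by forward-backward.
gamma = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.6, 0.2]])

# Bijective feature expectations (Eq. 3.13): for each source position m,
# E_q[phi_m] = sum over target positions n of q(z_n = m | x).
expectations = gamma.sum(axis=0)
print(expectations)         # [0.4 2.1 0.5]
print(expectations <= 1.0)  # source position 1 violates E_q[phi] <= 1,
                            # exactly the garbage-collector pattern above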
For example, for the sentence pair in Figure 3.3, in the top-left alignment the state posteriors over the source word schism clearly sum to more than 1. The effect of applying the constraint to the posteriors is shown in the second row: enforcing the one-to-(at most)-one constraint clearly solves the garbage collector effect. Moreover, when the probability mass is redistributed to the other words, most of it goes to the right positions (as measured by the gold alignments). By enforcing the constraint at training and decoding time, the percentage of 1-1 alignments increases from 78% to 97.3% on the Hansards English-French corpus (manual alignments have 98.1%), and from 84.7% to 95.8% on the En-Pt corpus (manual alignments have 90.8%).

3.3.2 Symmetry Constraints

The directional nature of the generative models used to recover word alignments conflicts with their interpretation as translations. In practice, we see that the choice of which language is source and which is target matters, and changes the mistakes the model makes (see Figure 3.4). The standard approach is to train two models independently and then combine their predictions [Och and Ney, 2003] (more details in Subsection 4.1.6). However, we show that it is much better to train the two directional models concurrently, coupling their posterior distributions over alignments so that they approximately agree. Let the directional models be →pθ(x̄t, →z̄ | x̄s) (target given source) and ←pθ(x̄s, ←z̄ | x̄t) (source given target). Define z̄ to range over the union →z̄ ∪ ←z̄ of all possible directional alignments, and define the mixture model pθ(x̄t, z̄ | x̄s) = ½ →pθ(x̄t, z̄ | x̄s) + ½ ←pθ(x̄s, z̄ | x̄t), where ←pθ(x̄s, →z̄ | x̄t) = 0 and vice-versa (i.e., an alignment of one directional model has probability zero according to the other model). We then define the following feature for each target-source position pair n, m:

    Symmetry Features:  φnm(xtn, xsm, z̄) = +1 if z̄ ∈ →z̄ and →zn = m;  −1 if z̄ ∈ ←z̄ and ←zm = n;  0 otherwise.

If the feature φnm(xtn, xsm, z̄) has an expected value of zero, then the two models predict the n-m link with equal probability. We therefore impose the constraint E_q[φnm(xtn, xsm, z̄)] = 0 (possibly with some small slack). Note that satisfying this constraint implies satisfying the bijectivity constraint presented above, since both models are normalized with respect to a particular direction. To compute the expectations of these features under q(z̄ | x̄), we only need to be able to compute them under each directional HMM. To see this, by the definition of q(z̄ | x̄) and pθ(z̄ | x̄),

    q(z̄ | x̄) = [→pθ(z̄ | x̄t, x̄s) + ←pθ(z̄ | x̄s, x̄t)] exp{−λ⊤φ(x̄t, x̄s, z̄)} / (2 Zλ)
             = [ (Z→q / →pθ(x̄t | x̄s)) →q(z̄) + (Z←q / ←pθ(x̄s | x̄t)) ←q(z̄) ] / (2 Zλ),   (3.14)

where we have defined:

    →q(z̄) = (1/Z→q) →pθ(x̄t, z̄ | x̄s) exp{−λ⊤φ(x̄t, x̄s, z̄)},                           (3.15)
    with Z→q = Σ_{→z̄} →pθ(x̄t, z̄ | x̄s) exp{−λ⊤φ(x̄t, x̄s, z̄)};                          (3.16)
    ←q(z̄) = (1/Z←q) ←pθ(x̄s, z̄ | x̄t) exp{−λ⊤φ(x̄t, x̄s, z̄)},                           (3.17)
    with Z←q = Σ_{←z̄} ←pθ(x̄s, z̄ | x̄t) exp{−λ⊤φ(x̄t, x̄s, z̄)}.                          (3.18)

All these quantities can be computed separately in each model using forward-backward, and furthermore Zλ = ½ (Z→q / →pθ(x̄t | x̄s) + Z←q / ←pθ(x̄s | x̄t)). The effect of this constraint is illustrated at the bottom of Figure 3.4: the projected link posteriors are equal for the two models, and in most cases the probability mass has moved to the correct alignment links. The exception is the word pair internal/le, where the model chose to incorrectly assign a high posterior to the alignment link rather than generating internal from null in one direction and le from null in the other.
Figure 3.4: Posterior distributions for an English and French sentence pair. Left: EN→FR model. Right: FR→EN model. Top: regular HMM posteriors. Bottom: after applying the symmetry constraint. Circle size indicates probability value; circle color in the bottom row indicates the difference in posterior from the top row (green: higher probability, red: lower probability).

Figure 3.4 shows the expected value of each feature as the difference between the posterior probabilities of the two directional models for the same word pair. The last row shows both directional posteriors after imposing the symmetry constraint. Note that the projected posteriors are equal in the two models. One can also see that in most cases the probability mass has moved to the right place, with the exception of the word pair internal/le: the word internal does not appear on the French side, but the model still has to spread its probability mass somewhere, and it chose to accumulate it on the word le instead of moving it to the null word. By enforcing this constraint at training and decoding time, we improve the symmetry of the directional alignments (as measured by intersection over union) from 48% to 89.9% on the Hansards corpus, and from 48% to 94.2% on the En-Pt corpus.

3.4 PoS Induction

As seen in the previous chapter, one of the biggest problems in the PoSI task is the high ambiguity of the word-tag distribution: each word can be assigned to most tags, while we know that each word should be assigned a very small number of tags, in most cases only one. In this section we use PR and propose a word-tag sparsity constraint that favours posterior distributions assigning to each word a small number of possible part of speech tags. Before describing our constraint, the next subsection presents a different approach to the same problem, which moves to the Bayesian setting and uses priors on the parameters to encourage sparsity. It is an example of how imposing constraints over the parameters does not work as well as imposing the constraints directly over the posteriors.
Then, in Subsection 3.4.2, we describe our constraint and how it solves the high word-tag ambiguity problem.

3.4.1 Parameter Sparsity

Recent advances in inference methods for Bayesian estimation have been applied to unsupervised PoS tagging [Gao and Johnson, 2008; Johnson, 2007; Goldwater and Griffiths, 2007]. In the Bayesian setting, a preference for sparsity is expressed as a prior distribution $p(\theta, \alpha)$ over model parameters. Since the underlying model (HMM) is a product of multinomials, Dirichlet distributions are a natural choice of prior, since they are conjugate to multinomials (the product of a Dirichlet with a multinomial is a Dirichlet). This conjugacy simplifies the mathematical and computational aspects of the model. The complete description of the model is:

$\theta_t \sim \mathrm{Dir}(\alpha_t)$, $\quad p_{\theta_t}(z_i \mid z_{i-1}) \sim \mathrm{Multi}(\theta_t)$,
$\theta_o \sim \mathrm{Dir}(\alpha_o)$, $\quad p_{\theta_o}(x_i \mid z_i) \sim \mathrm{Multi}(\theta_o)$.

Here, $\alpha_t$ controls the sparsity of the state transition matrix and $\alpha_o$ controls the sparsity of the state observation probabilities. Johnson [2007] notes that $\alpha_t$ does not influence the model much. In contrast, as $\alpha_o$ approaches zero it encourages the model to have highly skewed $p_{\theta_o}(x_i \mid z_i)$ distributions; that is, each tag is encouraged to generate a few words with high probability and the rest with very low probability (see Figure 3.5(a) for an example of a Dirichlet distribution with $\alpha_o \le 1$). This is not exactly the constraint we would like to enforce: there are some PoS tags that generate many different words with relatively high probability (for example, nouns and verbs), while each word is associated with a small number of tags. This difference is one possible explanation for the relatively worse performance of this prior compared to our method.

Johnson [2007] describes two approaches to learn the model parameters: a component-wise Gibbs sampling scheme (GS) and a Variational Bayes (VB) approximation using a mean field. We use the VB approach in our comparison, since it only requires a small change to the M-Step of the EM algorithm. It is beyond the scope of this document to fully describe the VB approach, and we refer the reader to the original paper. The authors make a mean-field assumption, and since the likelihood and the prior are conjugate (multinomial and Dirichlet), there is an EM-like procedure that finds locally optimal model parameters. In the particular case of the HMM, this procedure only requires a minor change to the M-Step. In the EM algorithm, the M-Step consists of normalizing the expected counts collected during the E-Step. VB in this setup changes the M-Step by passing each of the individual counts, as well as the normal M-Step normalizer, through a squashing function (the exponential of the digamma function, see Figure 3.5(b)). This has the effect of subtracting roughly 0.5 from the expected counts while ensuring that the counts do not become negative, which produces a rich-get-richer effect: word-tag pairs with low counts will be pushed towards zero probability. Note that this approach cannot be applied to the maximum entropy observation model, since there zero-valued parameters (parameter sparsity) do not enforce posterior sparsity. In fact, if all the parameters are zero we get the uniform distribution, since every possible word-tag pair receives weight $\exp(0 \cdot \phi(\bar{x}, \bar{z})) = 1$.
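For concreteness, here is a minimal sketch of the VB M-Step just described, for one multinomial block of the HMM; the function name and array layout are illustrative.

    import numpy as np
    from scipy.special import digamma

    # Minimal sketch of the VB M-Step for one multinomial block of the HMM
    # (after Johnson, 2007). Instead of directly normalizing expected counts,
    # each count and the normalizer are passed through exp(digamma(.)), which
    # for counts well above zero behaves like subtracting roughly 0.5.
    def vb_m_step(expected_counts, alpha):
        # expected_counts: (num_tags, vocab_size) counts from the E-Step
        # alpha: Dirichlet hyperparameter (alpha < 1 encourages sparsity)
        vocab_size = expected_counts.shape[1]
        numer = np.exp(digamma(expected_counts + alpha))
        denom = np.exp(digamma(expected_counts.sum(axis=1, keepdims=True)
                               + vocab_size * alpha))
        return numer / denom  # low-count entries are squashed towards zero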
3.4.2 Posterior Sparsity Constraint

To measure sparsity, we need to define an appropriate measure of PoS tag ambiguity that we will attempt to control. This measure is more intuitive for hard assignments, so we first describe it for hard assignments and then extend it to distributions over assignments. Fix a word type, such as "stock". Intuitively, we would like all occurrences of "stock" to be tagged with a small subset of all possible tags. For a hard assignment of labels to the entire corpus, we could count how many different tags are used for occurrences of the word "stock". If instead of a single tagging of the corpus we have a distribution $q(\bar{z} \mid \bar{x})$ over assignments (resulting from the E-step), we can generalize this measure as follows. Instead of asking whether a particular tag was ever used for the word "stock", we ask what is the maximum probability with which a particular tag was used for the word "stock". We then add these maxima over all tags; this is the quantity we wish to minimize. We call this measure $\ell_1/\ell_\infty$.

While in the case of the SWA application the constraints were defined for each sentence individually, the constraint specified here is defined over the entire corpus: we wish to count, for the entire corpus, how many different hidden state values (tags) were used per word type. Seeing the training corpus X as a sequence of sentences $\bar{x}^1 \ldots \bar{x}^D$ appended together, where each sentence is in turn a sequence of words x, we can refer to a given word occurrence by its type $v \in V$ and its occurrence number in the whole corpus. For instance, if the full corpus were the two sentences He went with the car to the garage. The car was running out of oil., one could refer to the second occurrence of the word car in two ways. The first way is the one we have been using so far: we refer to the second sentence in the corpus as $\bar{x}^2$ and then to the second word in that sentence as $x_2$. The second way, which we will use when defining this constraint, is to refer to the word type $v = $ car, where $v \in V$, and to the number of times it has occurred in the corpus so far, in this case $i = 2$. More formally, we define a feature $\phi(v, h, i)$ that has value 1 whenever the $i$th occurrence of the word $v$ has part-of-speech tag $h$. For every word $v \in V$, we would like there to be only a few PoS tags $h$ such that some occurrence $i$ of $v$ takes tag $h$ with nonzero probability. This can be achieved if it "costs" a lot to allow an occurrence of a word to take a tag, but once that happens, it should be "free" for other occurrences of the word to receive that same tag. More precisely, we would like the sum ($\ell_1$ norm) over tags $h$ and word types $v$ of the maxima ($\ell_\infty$ norm) of the state posterior of taking tag $h$ over all occurrences of $v$ to be small. The constraint can be described by the following formula:

$\ell_1/\ell_\infty$: $\quad \min_{c_{vh}} \sum_{v \in V} \sum_{h \in H} c_{vh} \quad$ s.t. $\quad E_{q(\bar{z} \mid \bar{x})}[\phi(v, h, i)] \le c_{vh}$, (3.19)

where the $c_{vh}$ that minimizes this expression is the maximum value of the state posterior over all occurrences of word type $v$ with tag $h$. A small computational sketch of the measure follows.
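The measure itself is simple to compute from state posteriors; the data layout below is illustrative.

    # Minimal sketch of the l1/linf ambiguity measure. `posteriors` maps each
    # word type v to an array of shape (num_occurrences_of_v, num_tags)
    # holding the state posteriors q(tag | occurrence) from the E-Step.
    def l1_linf(posteriors):
        total = 0.0
        for gamma in posteriors.values():
            # linf: max over occurrences for each tag; l1: sum over tags
            total += gamma.max(axis=0).sum()
        return total

For a hard assignment the inner max is 0 or 1, so the measure reduces to counting how many distinct tags each word type uses.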
Moreover, we will use a slightly modified version of the PR framework described earlier in this chapter: instead of hard constraints on $q(\bar{z} \mid \bar{x})$, it allows the constraints to be relaxed at a cost specified by a penalty. This relaxation allows combining multiple constraints without having to explicitly ensure that the constraint set remains non-empty, and it is also useful in dealing with the $\ell_1/\ell_\infty$ constraints we need. If those were incorporated as hard constraints, the dual objective would become non-differentiable, making the optimization (somewhat) more complicated. Using soft constraints, the non-differentiable portion of the dual objective turns into simplex constraints on the dual variables, allowing us to use an efficient projected gradient method. Formally, the E-step of our approach is expressed by the objective:

Primal Penalty Projection: $\quad \min_{q(\bar{z} \mid \bar{x}),\, c_{vh}} KL(q(\bar{z} \mid \bar{x}) \,\|\, p_\theta(\bar{z} \mid \bar{x})) + \sigma \sum_{vh} c_{vh} \quad$ s.t. $\quad E_q[\phi(v, h, i)] \le c_{vh}$, (3.20)

where $\sigma$ is the strength of the regularization. Note that setting $\sigma = 0$ takes us back to normal EM, where $q$ is the model posterior distribution. As $\sigma \to \infty$, the constraints force each occurrence of a word type to have the same posterior distribution, effectively reducing the model to a 0th-order Markov chain in the E-step. The dual of this objective has a very simple form, closely related to the original dual formulation in Equation 3.7 (see Appendix A.3 for the derivation):

Dual Penalty Projection: $\quad \max_{\lambda \ge 0} -\log \sum_{\bar{z}} p_\theta(\bar{z} \mid \bar{x}) \exp(-\lambda \cdot \phi(\bar{x}, \bar{z})) \quad$ s.t. $\quad \sum_i \lambda_{vhi} \le \sigma$, (3.21)

where $\bar{z}$ ranges over assignments to the hidden tag variables for all of the occurrences in the training data, $\phi(\bar{x}, \bar{z})$ is the vector of $\phi(v, h, i)$ feature values for assignment $\bar{z}$, $\lambda$ is the vector of dual parameters $\lambda_{vhi}$, and the primal parameters are $q(\bar{z} \mid \bar{x}) \propto p_\theta(\bar{z} \mid \bar{x}) \exp(-\lambda \cdot \phi(\bar{x}, \bar{z}))$. This can be computed by projected gradient, as described by Bertsekas [1999].

Figure 3.5 illustrates how the $\ell_1/\ell_\infty$ norm operates on a toy example. For simplicity, suppose we are only regularizing one word and our model $p_\theta(\bar{z} \mid \bar{x})$ is just a product distribution over 15 instances of the word. The left panel of Figure 3.5 shows the posteriors under $p_\theta(\bar{z} \mid \bar{x})$. We would like to concentrate the posteriors on a small subset of rows. The center panel of the figure shows the $\lambda$ values determined by Equation 3.21, and the right panel shows the projected distribution $q(\bar{z} \mid \bar{x})$, which concentrates most of the posterior on the bottom row.

Figure 3.5: An illustration of the $\ell_1/\ell_\infty$ norm. Left panel: initial posterior distributions (columns) for 15 instances of a word. Middle panel: optimal regularization parameters $\lambda$; each row sums to the regularization strength $\sigma = 20$. Right panel: projected posterior distributions $q$, which concentrate the probability mass of all instances on the NN tag, reducing the $\ell_1/\ell_\infty$ norm from just under 4 to a little over 1.

Note that we are not requiring the posteriors to be sparse, which would be equivalent to preferring that each distribution be peaked; rather, we want a word to concentrate its tag posterior on a few tags across all instances of the word. Indeed, most of the instances (columns) become less peaked than in the original posterior, to allow posterior mass to be redistributed away from the outlier tags. Since these instances are more numerous than the outliers, they move less. This also justifies regularizing only relatively frequent events in our model. We will refer to the models trained using PR with the $\ell_1/\ell_\infty$ constraint as HMM+Sp (for the HMM) and HMM+ME+Sp (for the HMM+ME model), and to the HMM using priors to encourage sparsity as HMM+VB.
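Returning to the toy example, the following is a small sketch of the dual penalty projection for that case, where $p_\theta$ factors over instances. The projection routine and step size are illustrative choices, not the thesis implementation (which uses the projected gradient method of Bertsekas [1999]).

    import numpy as np

    # Euclidean projection of v onto {w >= 0, sum(w) <= cap}
    # (simplex projection in the style of Duchi et al.).
    def project_capped_simplex(v, cap):
        w = np.clip(v, 0.0, None)
        if w.sum() <= cap:
            return w
        u = np.sort(w)[::-1]
        css = np.cumsum(u) - cap
        rho = np.nonzero(u - css / (np.arange(len(u)) + 1.0) > 0)[0][-1]
        theta = css[rho] / (rho + 1.0)
        return np.clip(w - theta, 0.0, None)

    # Toy dual penalty projection for a single word type (cf. Figure 3.5):
    # p has shape (num_instances, num_tags), each row the posterior over
    # tags for one occurrence. For each tag h, the dual variables
    # lambda_{., h} must be nonnegative and sum to at most sigma.
    def dual_projection(p, sigma, rate=1.0, iters=200):
        lam = np.zeros_like(p)
        for _ in range(iters):
            q = p * np.exp(-lam)
            q /= q.sum(axis=1, keepdims=True)  # one distribution per instance
            lam += rate * q                     # dual gradient is E_q[phi] = q
            for h in range(p.shape[1]):         # enforce the per-tag cap
                lam[:, h] = project_capped_simplex(lam[:, h], sigma)
        q = p * np.exp(-lam)
        return q / q.sum(axis=1, keepdims=True)

The projection concentrates the budget $\sigma$ of each tag's dual variables on the instances where that tag's posterior is high, which is what pushes mass away from outlier tags in the right panel of Figure 3.5.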
Figure 3.6 shows the distribution of tag ambiguity across words for two corpora. As we see from Figure 3.6, when we train using the EM procedure (HMM), both models grossly overestimate the tag ambiguity of almost all words compared with the true data (Supervised); however, when we use either the $\ell_1/\ell_\infty$ constraint or the HMM+VB model, we achieve tag ambiguity closer to the truth. In Chapter 5 we will show that the two models achieve a better $\ell_1/\ell_\infty$ distribution in very different ways: while the model trained with PR actually produces clusters of higher quality, the same does not happen with the HMM+VB model.

3.5 Related Work

The idea of introducing constraints over a model to better guide the learning process has been used before. In the context of word alignment, Deng and Byrne [2006] use a state-duration HMM in order to model word-to-phrase translations. The fertility of each source word is implicitly encoded in the durations of the HMM states. Without any restrictions, likelihood prefers to always use longer phrases, and the authors try to control this behavior by multiplying every transition probability by a constant $\eta > 1$. This encourages more transitions and hence shorter phrases. For the task of unsupervised dependency parsing, Smith and Eisner [2006] add a constraint of the form "the average length of dependencies should be X" to capture the locality of syntax (at least half of the dependencies are between adjacent words), using a scheme they call structural annealing.

Figure 3.6: Our ambiguity measure ($\ell_1/\ell_\infty$) for each word type in two corpora, measured for an HMM initialized with supervised data, using EM training (HMM, HMM+ME), using the sparsity-inducing parameter prior (HMM+VB), and using $\ell_1/\ell_\infty$ (HMM+Sp, HMM+ME+Sp). Left: En corpus. Right: Pt corpus.

They modify the model's distribution over trees $p_\theta(\bar{z} \mid \bar{x})$ by a penalty term: $p'_\theta(\bar{z} \mid \bar{x}) \propto p_\theta(\bar{z} \mid \bar{x})\, e^{\delta \sum_{e \in \bar{z}} \mathrm{length}(e)}$, where $\mathrm{length}(e)$ is the surface length of edge $e$. The factor $\delta$ is annealed during training so that the preference for short edges (hence a smaller sum) is stronger at the start of training. These two approaches also have the goal of controlling unsupervised learning, and the form of the modified distributions is reminiscent of the form that the projected posteriors take. However, the approaches differ substantially from PR. Smith and Eisner [2006] make a statement of the form "scale the total length of edges", which, depending on the value of $\delta$, will prefer more shorter or longer edges. Such statements are not data dependent: for instance, if $\delta \le 0$, even if the data are such that the model already uses edges that are too short on average, we will still push them to be shorter. By contrast, the statements we can make in PR are of the form "there should be more short edges than long edges". Such a statement is data dependent, in the sense that if the model satisfies the constraints then we do not need to change it; if it is far from satisfying them we might need to make very dramatic changes. PR is closely related to the work of Mann and McCallum [2007, 2008], who concurrently developed the idea of using penalties based on posterior expectations of features to guide semi-supervised learning.
They call their method generalized expectation (GE) constraints, or alternatively expectation regularization. In the original GE framework, the posteriors of the model on unlabeled data are regularized directly. They train a discriminative model using conditional likelihood on labeled data and an "expectation regularization" penalty term on the unlabeled data:

$\arg\max_\theta \; \mathcal{L}(\theta) - \lambda\, E\big[ \| E_{p_\theta(\bar{z} \mid \bar{x})}[\phi(\bar{x}, \bar{z})] - b_{\bar{x}} \|_2^2 \big]$. (3.22)

Notice that there is no intermediate distribution $q(\bar{z} \mid \bar{x})$. For some kinds of constraints this objective is difficult to optimize in $\theta$, and in order to improve efficiency, Bellare et al. [2009] propose interpreting the PR framework as an approximation to the GE objective in Equation 3.22. They compare the two frameworks on several datasets and find that performance is similar. Liang et al. [2009] cast the problem of incorporating partial information about latent variables into a Bayesian framework using "measurements", and after several approximation steps they arrive at the objective we optimize.

The idea of jointly training two directional models has been explored by Liang et al. [2006], although under a very different formalization. They define a joint objective:

$\max_{\theta_1, \theta_2} E\big[ \log \overrightarrow{p}_\theta(\bar{x}^t \mid \bar{x}^s) + \log \overleftarrow{p}_\theta(\bar{x}^s \mid \bar{x}^t) + \log \sum_{\bar{z}} \overrightarrow{p}_\theta(\bar{z} \mid \bar{x}^t, \bar{x}^s)\, \overleftarrow{p}_\theta(\bar{z} \mid \bar{x}^s, \bar{x}^t) \big]$.

However, the product distribution $\overrightarrow{p}_\theta(\bar{z} \mid \bar{x}^t, \bar{x}^s)\, \overleftarrow{p}_\theta(\bar{z} \mid \bar{x}^s, \bar{x}^t)$ ranges over all one-to-one alignments, and computing it is #P-complete [Liang et al., 2006]. They approximate this distribution as a product of the state posteriors of the two directional models, $q(\bar{z} \mid \bar{x}) = \prod_{n,m} \overrightarrow{\gamma}_n(m)\, \overleftarrow{\gamma}_m(n)$. It is not clear what objective the approximate procedure actually optimizes. While the approximation leads to good alignment error rate, the authors note that they were not able to observe improved MT performance as a result.

4 Statistical Word Alignment

In this chapter we present an in-depth discussion of the quality of the word alignments obtained when training with PR and the two proposed constraints. We begin with an evaluation of the word alignments measured against manually annotated corpora. We then proceed to evaluate the alignments on two different tasks: dependency grammar transfer and phrase-based translation.

4.1 Word Alignment Evaluation

We begin with a comparison of word alignment quality evaluated against manually annotated alignments, as measured by precision and recall. We use the six parallel corpora with gold annotations described at the beginning of Section 2.4.3.

4.1.1 Experimental Setup

Following common practice, for all experiments we discarded all training sentence pairs where one of the sentences contained more than 40 words, since this makes training faster without significantly hurting performance. We added the unlabeled development and test data sets to the pool of training sentences, to avoid having unknown words. We initialized the IBM M1 translation table with uniform probabilities over word pairs that occur together in the same sentence and trained IBM M1 for 5 iterations. All HMM alignment models were initialized with the translation table from IBM M1 and uniform distortion probabilities. We ran each training procedure until the area under the precision-recall curve measured on a development corpus stopped increasing (see Figure 4.4 for an example of such a curve).
Using the precision-recall curve gives a broader sense of the model's performance than using a single point (obtained by tuning a threshold for a particular metric). In most cases this meant 4 iterations for normal EM training and 2 iterations using PR. We suspect that the constraints make the space easier to search. For the models trained with PR, we describe the criteria used for the projection (Algorithm 2) in detail in the next subsection.

The intent of this experimental section is to evaluate the gains from using constraints during learning; hence the main comparison is between the HMM trained with normal EM and the HMM trained with PR. We also report results for IBM M4, since it is often used as the default word alignment model and can serve as a reference. However, we would like to note that IBM M4 is a more complex model, able to capture more structure, albeit at the cost of intractable inference. Since our approach is orthogonal to the base model used, the constraints described here could in principle be applied to IBM M4 if exact inference were efficient, hopefully yielding similar improvements. We used a standard implementation of IBM M4 [Och and Ney, 2003], and since changing the existing code is not trivial, we could not use the same stopping criterion to avoid overfitting and are not able to produce precision-recall curves. We trained IBM M4 using the default configuration of the MOSES training script,[1] which performs 5 iterations of IBM M1, 5 iterations of HMM and 5 iterations of IBM M4. The following experiments all use MBR decoding, since this decoding method always outperforms Viterbi decoding.[2] The competing models throughout this section are the HMM trained with EM, denoted HMM; the HMM trained using PR with the bijectivity constraint, denoted B-HMM; the HMM trained using PR with the symmetry constraint, denoted S-HMM; and finally IBM M4.

[1] http://www.statmt.org/moses/?n=FactoredTraining.HomePage
[2] IBM M4 uses Viterbi decoding since Giza++ does not support MBR decoding.

Figure 4.1: HMM with symmetry constraints for different values of η in the Hansards corpus using 100k sentences. Left: area under the precision-recall curve. Right: number of calls to the FB algorithm during projection over the 4 iterations.

4.1.2 Dual Projection Optimization

When training using PR, there are several parameters related to the projection of the posteriors $p_\theta(\bar{z} \mid \bar{x})$ onto the constraint set $Q_{\bar{x}}$ that need to be set (see Algorithm 2). The convergence criterion for the projection algorithm was the squared $\ell_2$ norm of the gradient divided by the number of constraints, $\|\nabla(\lambda_i)\|_2^2 / \#\phi(\bar{x}, \bar{z})$, falling below $\eta$. For bijectivity constraints we set $\eta$ to 0.005 and used zero slack. For symmetry constraints, $\eta$ and the slack were set to 0.001. We chose $\eta$ aggressively: lower values did not significantly increase performance, while less aggressive settings degraded it. Figure 4.1 shows the difference in performance as we vary $\eta$ when training with PR using the symmetry constraints. As $\eta$ decreases, the performance of the model gets much better, ranging from values close to regular EM training for large values of $\eta$ to a huge improvement when $\eta = 0.001$. However, these improvements come at a computational cost: if we measure optimization efficiency by the number of calls made to the FB algorithm (required for gradient computations), we see that as we decrease $\eta$ we require exponentially more calls.
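For reference, the stopping rule above amounts to the following check (a sketch; the names are illustrative):

    import numpy as np

    # Sketch of the projection stopping rule: the squared l2 norm of the
    # dual gradient, normalized by the number of constraints, must fall
    # below eta.
    def projection_converged(grad, eta):
        return np.dot(grad, grad) / grad.size < eta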
An interesting remark is that when we optimize to high precision, besides getting better results we also require fewer EM iterations (2 iterations achieve the maximum performance), which indicates that the constraints provide a lot of guidance to the model: satisfying them more closely gets the model to a local minimum much faster. There are, however, some numerical problems with using small values of $\eta$: some sentences fail to converge more often, in which case we use the original counts. The failures to converge are due to exceeding the maximum number of iterations, either when projecting or when looking for the best step size, or to the objective exceeding the numeric range of floating-point numbers (since we take the exponential of the parameters to calculate $q(\bar{z} \mid \bar{x})$).

Figure 4.2: HMM with symmetry constraints for different allowable slack values (ε) using 100k sentences of the Hansards corpus. Left: area under the precision-recall curve. Right: number of calls to the FB algorithm used during projection over the 4 iterations.

We also evaluate the effect of the value of the slack variable ε used to define the set $Q_{\bar{x}}$ of acceptable constraints (Eq. 3.1). The slack is complementary, in an inverse way, to $\eta$: by making $\eta$ bigger, we make the primal projection objective $KL(Q_{\bar{x}} \,\|\, p_\theta(\bar{z} \mid \bar{x}))$ bigger as well, while by making ε bigger we make the set $Q$ easier to attain and the dual optimization problem easier, since we subtract ε times the norm of the parameters from the gradient. The biggest difference between the two, although mostly a technicality, is that when close to the boundary of $Q$, with no slack we pay the distance as measured by KL, whereas once we are inside the $\ell_2$ ball of radius ε we pay the distance in terms of the $\ell_2$ norm. Figure 4.2 shows the performance of the HMM using PR with the symmetry constraints, using the best convergence value $\eta = 0.001$, for different values of the slack ε. The left plot shows the area under the curve; we note that the difference is not significant across slack values. On the right we show the number of required calls to the forward-backward algorithm per EM iteration. We note that better optimization methods, such as L-BFGS, a warm start of the parameters at each EM iteration (using the parameters from the previous iteration), or training the models online, would potentially decrease the running time of our method, but this is left as future work.

4.1.3 Projection Before Decoding

Here we analyse the difference between projecting the posteriors at decode time and not projecting, and the effect of training with PR versus training with EM and projecting the posteriors only at decode time. Figure 4.3 shows the precision-recall curves of both models when training using standard EM versus PR with both constraints, and the results of additionally applying the constraints at decode time. A first observation is that training with PR significantly boosts the performance of each model. Moreover, using the projection at decode time always increases performance. Comparing the two constraints, it seems that the bijectivity constraint is more useful at training time. Note that using this constraint at decode time with regular training yields worse results than just training with the same constraint using PR.
Figure 4.3: Precision-recall curves for IBM M1 and the HMM using standard EM training (HMM) versus PR with bijectivity constraints (B-HMM) and symmetry constraints (S-HMM), and different decoding types: decoding without any projection (NP), bijective projection before decoding (BP), and symmetry projection before decoding (SP), using 100k sentences of the Hansards corpus.

On the other hand, the symmetry constraint is stronger at decode time. The rest of the results reported all use projection at decode time.

4.1.4 Overall Results

In this subsection we present results on alignment quality. Figure 4.4 shows precision-recall curves for the different models on the En-Fr corpus using English as the source language (left), and on the En-Pt corpus using Portuguese as the source (right). Precision-recall curves are obtained by varying the posterior threshold from 0 to 1 and plotting the resulting precision and recall values (a small sketch of this sweep follows the figure captions below). We observe several trends in Figure 4.4. First, both types of constraints improve over the HMM in terms of both precision and recall (their precision-recall curve is always above). Second, S-HMM performs slightly better than B-HMM. IBM M4 is comparable with both constraints (after symmetrization). The results for all language pairs are in Figure 4.5. For ease of comparison, we choose a decoding threshold for the HMM models that achieves the recall of the corresponding IBM M4 and report precision. Our methods always improve over the HMM, by 10% to 15%, and improve over IBM M4 in 9 out of 12 cases. Comparing the constraints with each other, we see that S-HMM performs better than B-HMM in 10 out of 12 cases.

Figure 4.4: Precision-recall curves for different models using 1000k sentences. Precision on the horizontal axis. Left: Hansards, EN-FR direction. Right: EN-PT, PT-EN direction.

Figure 4.5: Word alignment precision when the threshold is chosen to achieve IBM M4 recall with a difference of +/- 0.005. The average relative increase in precision (against the HMM model) is 10% for IBM M4, 11% for B-HMM and 14% for S-HMM.
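As a concrete reading of how these curves are produced, here is a minimal sketch of the threshold sweep, following the usual convention of measuring precision against sure-or-possible links and recall against sure links; the data layout is illustrative.

    import numpy as np

    # Minimal sketch of the precision-recall sweep: threshold the link
    # posteriors and score the predicted link set against gold sure (S)
    # and possible (P) alignment sets. `posteriors` maps
    # (sentence_id, n, m) -> link posterior probability.
    def pr_curve(posteriors, sure, possible, steps=100):
        points = []
        for t in np.linspace(0.0, 1.0, steps):
            predicted = {k for k, p in posteriors.items() if p >= t}
            if not predicted:
                continue
            precision = len(predicted & (sure | possible)) / len(predicted)
            recall = len(predicted & sure) / len(sure)
            points.append((precision, recall))
        return points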
Since S-HMM indirectly enforces bijectivity and models sequential correlations on both sides, this is perhaps not surprising.

Figure 4.6: Word alignment precision as a function of training data size (number of sentence pairs). Posterior decoding threshold chosen to achieve IBM M4 recall in the Hansards corpus. Left: Fr-En direction. Right: En-Fr direction.

Figure 4.6 shows performance as a function of training data size. As before, we decode to achieve the recall of IBM M4. For small training corpora, adding the constraints provides larger improvements (20-30%), but we still achieve significant gains even with a million parallel sentences (15%). The greater improvements for small data sizes indicate that our approach can be especially effective for resource-poor language pairs.

4.1.5 Rare vs. Common Words

One of the main benefits of using the described constraints is an alleviation of the garbage collector effect [Brown et al., 1993]. Figure 4.7 breaks down the performance improvements by common versus rare words. As before, we use posterior decoding, tuning the threshold to match IBM M4 recall. For common words, this tuning keeps recall very close for all models, so we do not show it in the figure. In the top left panel of Figure 4.7, we see that the precision of common words follows the pattern we saw for the overall corpus: symmetry and bijectivity outperform both IBM M4 and the baseline HMM, with symmetry slightly better than bijectivity. The results for common words vary more slowly as we increase the quantity of training data than they did for the full corpus. In the top right panel of Figure 4.7 we show the precision for rare words. For the baseline HMM, as well as for IBM M4, this is very low precisely because of the garbage collector problem: rare words become erroneously aligned to untranslated words, leading to low precision. In fact, the constrained models achieve absolute precision improvements of up to 50% over the baseline. By removing these erroneous alignments, the translation table becomes more accurate, allowing higher recall on the full corpus. In the bottom panel of Figure 4.7, we observe a slightly diminished recall for rare words. This slight drop in recall is due to moving the mass corresponding to rare words to null.

4.1.6 Directional Alignments Combination

As discussed earlier, the word alignment models are asymmetric, while most applications require a single alignment for each sentence pair.

Figure 4.7: Precision and recall as a function of training data size for the Hansards corpus, split by common and rare words. Top left: common precision. Top right: rare precision. Bottom: rare recall.

Figure 4.8: Precision-recall curves for the different models after soft union symmetrization. Precision is on the horizontal axis. Left: EN-FR. Right: PT-ES.
Typically this is achieved by a symmetrization heuristic that takes two directional alignments and produces a single alignment. For MT, the most commonly used heuristic is called grow diagonal final [Axelrod et al., 2005]. This starts with the intersection of the two sets of aligned points and adds points around the diagonal that are in the union of the two sets. The alignment produced has high recall relative to the intersection and only slightly lower recall than the union. In syntax transfer, the intersection heuristic is normally used, since one wants high-precision links to transfer knowledge between languages. One pitfall of these symmetrization heuristics is that they can obfuscate the link between the original alignments and the ones used for a specific task, making errors more difficult to analyze. Since they are heuristics tuned for a particular phrase-based translation system, it is not clear when they will help and when they will hinder system performance. In this work we followed a more principled approach that uses the knowledge about the posterior distributions of each directional model: we include a point in the final alignment if the average of the posteriors under the two models for that point is above a threshold. This heuristic is called soft union [DeNero and Klein, 2007] (a small sketch follows at the end of this subsection). Figure 4.8 shows the precision-recall curves after symmetrization for the En-Fr corpus. The models trained with posterior regularization still perform better, but the differences get smaller after the symmetrization. This should not be very surprising, since the soft union symmetrization can be viewed as an approximation of our symmetry constraint applied only at decode time. Applying the symmetrization to the model with symmetry constraints does not affect performance.
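A minimal sketch of the soft union heuristic (the array layout is illustrative):

    import numpy as np

    # Minimal sketch of soft union [DeNero and Klein, 2007]: average the
    # two directional link posteriors and keep points above a threshold.
    # post_st[n, m] and post_ts[m, n] are the directional link posteriors
    # for target position n and source position m.
    def soft_union(post_st, post_ts, threshold=0.4):
        avg = (post_st + post_ts.T) / 2.0
        return {(n, m) for n in range(avg.shape[0])
                       for m in range(avg.shape[1]) if avg[n, m] >= threshold}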
4.1.7 Error Analysis

In this subsection we discuss some scenarios in which the constraints make the alignments better, and some scenarios in which they fail. We have already discussed the garbage collector effect and how both models address it. Both constraints also bias the model towards having at most probability one in any row or column of the posterior matrix, encouraging 1-to-1 alignments. Obviously, whenever alignments are systematically not 1-to-1, this can lead to errors. Figure 4.9 shows the state posterior distributions for an English/French sentence pair, using the same notation as in Figure 2.6. In the top panel of Figure 4.9, we see in the left baseline model (EN→FR) that the English word met is incorrectly aligned to séance est ouverte. This makes it impossible to recover the correct alignment house/séance. Both constraints correct this problem. On the other hand, by enforcing a 1-to-1 mapping, the correct alignment met/est ouverte is lost. Going back to the first row (regular HMM), this alignment is correct in one direction and absent in the other (due to the n-to-1 restriction of the model), but we can recover that information using the symmetrization heuristics, since the point is present in at least one direction with high probability mass. This is not the case for the constraint-based models, which reduce the mass of that alignment in both directions. Going back to the right panel of Figure 4.8, we can see that for low values of precision the HMM model actually achieves better recall than the constraint-based methods.

There are two possible solutions to alleviate this type of problem, both with their caveats. One solution is to model the fertility of each word, in a way similar to IBM M4, or more generally to model alignments of multiple words. This can lead to a significant computational burden and is not guaranteed to improve results: a more complicated model may require approximations that destroy its performance gain, or require larger corpora to estimate its parameters. Another option is to perform some linguistically motivated pre-processing of the language pair to conjoin words. This, of course, has the disadvantage that it needs to be specific to a language pair in order to include information such as "English simple past is written using a single word, so join together the French passé composé". An additional problem with joining words to alleviate inter-language divergences is that it can increase data sparsity.

Figure 4.9: State posterior distributions for different models for an English and French sentence pair. Left: EN→FR direction. Right: FR→EN direction. Top: EM-trained HMM posteriors. Middle: after applying the bijectivity constraint. Bottom: after applying the symmetry constraint. Sure alignments are squares with borders; possible alignments are squares without borders. Circle size indicates probability value. Circle color in the middle and bottom rows indicates differences in posterior from the top row: green, higher probability; red, lower probability.

4.2 Applications

In this section we evaluate the alignments resulting from the proposed constraints on two different tasks: syntax transfer, where alignments indicate which labels should match across languages, and statistical machine translation, where alignments restrict the set of possible minimal translation units.

4.2.1 Syntax Transfer

Here we compare the different alignment models based on how well they can be used for the transfer of linguistic resources across languages. In particular, Ganchev et al. [2009] use a word-aligned corpus and a parser for a resource-rich language (source language) in order to create a parser for a resource-poor language (target language). Consider the parse tree of the source language as a set of dependency edges. For each such edge, this work checks whether both endpoints are aligned to words in the target language; if they are, the edge is transferred (a small sketch follows).
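A minimal sketch of this transfer rule, under the simplifying assumption of a one-to-one alignment map (the names are illustrative; the actual system also filters edges by PoS tag, as discussed below):

    # Minimal sketch of the dependency edge transfer rule: project an edge
    # only when both of its endpoints are aligned. `alignment` maps source
    # positions to target positions (assumed one-to-one for simplicity).
    def transfer_edges(source_edges, alignment):
        transferred = []
        for head, modifier in source_edges:
            if head in alignment and modifier in alignment:
                transferred.append((alignment[head], alignment[modifier]))
        return transferred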
Obviously, not all edges transferred in this way result in correct parse edges. Parser errors in the source language can result in incorrect parses. Additionally, linguistic phenomena in the two languages might cause correct parses and alignments to result in incorrect transferred edges. For example, the verb "bike" in English might be translated to French as "aller à vélo", where the semantic content is carried by the word "vélo", but we cannot expect this French noun to behave like the English verb. In order to address this concern, the resulting alignments are filtered by PoS tag. In order to evaluate the alignments without dependence on a particular parser, we computed the fraction of correctly transferred edges as a function of the average number of edges transferred. In general, as we trade off precision versus recall of the alignments, we can increase the accuracy of the transferred edges by transferring a smaller number of edges.

Figure 4.10: Edge conservation for cross-lingual grammar induction. Left: En→Bg subtitle corpus. Right: En→Es parliamentary proceedings. Vertical axis: percentage of transferred edges that are correct. Horizontal axis: average number of predicted edges per sentence.

Figure 4.10 shows our results for transferring from English to Bulgarian (En→Bg) and from English to Spanish (En→Es). The En→Bg results are based on a corpus of movie subtitles, and consequently on shorter sentences, while the En→Es results are based on a corpus of parliamentary proceedings. We use the "two-rules" system for Bulgarian, using sentences of any length that have at least one transferred edge. For Spanish we use the "three-rules" system. See Ganchev et al. [2009] for details on the corpora and transfer rules. The alignments were produced using the soft union heuristic. Figure 4.10 shows the percentage of transferred edges that are correct as a function of the average number of edges transferred per sentence for the two languages (for different values of the posterior threshold). The Bulgarian corpus has shorter sentences on average, resulting in fewer edges transferred per sentence and in a higher accuracy relative to Spanish. We see in Figure 4.10 that, for both domains, the models trained using posterior regularization perform better than the baseline model trained using EM.

4.2.2 Phrase-Based Machine Translation

In this subsection we investigate whether the new alignments produce improvements in an end-to-end phrase-based machine translation system. In particular, we fix a state-of-the-art machine translation system, the open-source Moses toolkit [Hoang et al., 2007],[3] and measure its performance as we vary the supplied word alignments. The exact relationship between word alignment quality and machine translation quality is not straightforward. Despite recent work showing that better word alignments can result in better MT [Ganchev et al., 2008], there is evidence that the two are only indirectly connected [Lopez and Resnik, 2006; Vilar et al., 2006].

[3] www.statmt.org/moses/
Most statistical MT systems use word alignments to extract phrase pairs which are then used along with corresponding probabilities in the translation pipeline. Lopez and Resnik [2006] identify three ways in which word alignments can affect a phrase based translation system: which phrase pairs are extracted; the probabilities associated to them; and the value of the lexical weight feature. The phrase pairs extracted are closely tied to the alignments, but in perhaps an unintuitive way. The phrase extraction code extracts all phrase pairs that include at least one aligned point but do not contradict an alignment by including an aligned word in one language without its translation in the other language. So on the one hand if there are too many incorrect alignment points forming a cluster, the correct phrases cannot be extracted without adding the spurious words, leading to missing words/phrases from the phrase table. On the other hand, unaligned words act as wild cards that can be aligned to every neighboring word, thus increasing the size of the phrase table (in the Hansards corpus the phrase table size increases by a factor of 5 when using a decoding of threshold of 0.6 instead of 0.2). Another undesirable effect of unaligned words is that whey will only appear (in the phrase table) in the context of the surrounding words. In the case that the word was incorrectly misaligned and has a correct word translation, it will not be available at decoding time on different contexts. Figure 4.11 shows the phrases extracted from two word alignments for the same sentence pair. The alignments differ only in the alignment point y-b. By incorrectly excluding this point, we remove the phrase y-b, and hence would not be able to translate y as b except in the context we have seen it. We will also increase the total number of phrases, since any unaligned points can be included in adjacent phrases. The spurious phrase pairs, will change both the phrase probability as well as the lexical weight feature. Lopez and Resnik [2006] conclude that the factor with most impact was the degradation of the translation probabilities due to noisy phrase pairs. In a different study Ayan and Dorr [2006] note that the lexical weight feature of common phrase based MT systems do not correctly take account for null aligned words. They show that accounting for these correctly can improve performance. For all experiments, the experimental setup is as follows: we lowercase the corpora, and 56 CHAPTER 4. STATISTICAL WORD ALIGNMENT train language models from all available data. The reasoning behind this is that even if bilingual texts might be scarce in some domain, monolingual text should be relatively abundant. For each competing alignment model we train Moses and use MERT optimization to tune its parameters on a development set. Moses is trained starting from a symmetrized alignment set, phrases up to 7 words are extracted, and the msd-bidirectional-fe distance based distortion model is used. The competing alignment models are: GIZA++ implementation of IBM M4, where the directional models are symmetrized using the grow diagonal and final heuristic (M4 GDF); our implementation of the baseline HMM model using the softUnion heuristic (HMM SU); our implementation of HMM trained with PR using both constraints and the softUnion heuristic (B-HMM SU, S-HMM SU). 
For all experiments, the experimental setup is as follows: we lowercase the corpora and train language models on all available data. The reasoning behind this is that even if bilingual text is scarce in some domain, monolingual text should be relatively abundant. For each competing alignment model we train Moses and use MERT optimization to tune its parameters on a development set. Moses is trained starting from a symmetrized alignment set, phrases of up to 7 words are extracted, and the msd-bidirectional-fe distance-based distortion model is used. The competing alignment models are: the GIZA++ implementation of IBM M4, where the directional models are symmetrized using the grow diagonal final heuristic (M4 GDF); our implementation of the baseline HMM model using the soft union heuristic (HMM SU); and our implementation of the HMM trained with PR using each of the two constraints and the soft union heuristic (B-HMM SU, S-HMM SU). When using the soft union heuristic, we tested three values of the threshold (0.2, 0.4, 0.6), which capture the choice between recall-oriented and precision-oriented alignments, and picked the best according to performance on the development set. Table 4.1 summarizes the results for the different corpora and alignments.

            M4 GDF   HMM SU   B-HMM SU   S-HMM SU
  Fr → En    35.7     35.9      36.0       35.5
  En → Fr    31.2     28.9      31.5       31.2
  Es → En    32.4     32.3      32.6       31.9
  En → Es    31.6     31.6      31.7       32.5
  Pt → En    31.4     30.9      31.0       31.4
  En → Pt    28.9     31.6      32.2       32.3

Table 4.1: BLEU scores for all language pairs. The best threshold was picked according to the development set after the last MERT iteration. Bold represents the best value overall.

A first observation is that the constraint-based models always outperform the HMM trained with EM, and outperform M4 GDF in all but one experiment, with differences ranging from 0.2 to 0.9 BLEU. However, the two constraints help in different cases: while the bijectivity constraint seems to help more for the Hansards corpus and for one direction of the EPPS corpus, in the other cases the symmetry constraint outperforms all other models. It is also interesting that the baseline HMM with soft union outperforms the more complex M4 baseline in half the cases. Even if the constrained alignments consistently lead to better BLEU scores, these differences are small and not easy to attribute. There are several influencing factors, related to the high variance of the typical MT pipeline: the number of iterations MERT required to converge varied hugely, from just 2 to 28, and the best value on the development set did not always coincide with the best value on the test set. Moreover, against common knowledge in the MT community, which holds that bigger phrase tables are better, we note that in 14 out of 18 cases the threshold picked was 0.4 (middle-sized phrase tables) and in the other 4 cases 0.2 was picked (smaller phrase tables). For instance, for the HMM model with symmetry constraints on the En-Pt corpus, the phrase table size ranges from 0.8 GB for a threshold of 0.2, to 1.7 GB for 0.4 and 2.7 GB for 0.6. This is probably related to the fact that the number of aligned points with bigger thresholds is small, leading to many erroneous phrases consisting mostly of unaligned points, which in turn leads to a poor estimation of the phrase probabilities, as also noted by Lopez and Resnik [2006].

5 Part of Speech Induction

This chapter compares the different models used for the PoSI task: the HMM and HMM+ME models trained using standard EM and using PR with the sparsity constraint, and the HMM model using a discounting Dirichlet prior.

5.1 Experimental Setup

We test our models on the six languages described in Subsection 2.5.2. The competing models are the HMM with a multinomial observation model (HMM), the HMM with a maximum entropy observation model trained using EM (HMM+ME), both of these trained using PR with the sparsity penalty (HMM+Sp, HMM+ME+Sp), and a multinomial HMM with a Dirichlet prior on the parameters (HMM+VB) trained using variational Bayes. Following standard practice, for the HMM models with a multinomial observation model we lowercase the corpora and replace words with only one occurrence by a special unknown token. This improves the multinomial HMM results by decreasing the number of parameters and eliminating rare words, which are mostly nouns and normally end up clustered with other nouns.
Since the maximum entropy observation models have access to features, these pre-processing steps are not necessary, and we do not perform them. We randomly initialized all models from the same posteriors, obtained by running the E-Step of the HMM model with a set of random parameters. This means that the initialization was identical for all models (for a fixed random seed). For the models using the maximum entropy observation model, we regularize the likelihood with a Gaussian prior with variance 10. For EM and Variational Bayes training we trained the models for 200 iterations, since we found that the models converge around iteration 100. For the HMM+VB model we use a transition and observation prior of 0.001, corresponding to the best values reported by Johnson [2007]. For PR training, we initialize the models with the parameters obtained by training with EM for 30 iterations (mostly for speed) and then run an additional 170 iterations of PR. In all experiments, unless explicitly mentioned, for PR training we regularize only words that occur at least 10 times, with σ = 32, following the setup of Graça et al. [2009]. We will see that this setting is not optimal for every language and model, but we prefer it to performing supervised model selection, which does not make sense in an unsupervised scenario. We obtain hard assignments using posterior decoding, since this showed small but consistent improvements over Viterbi decoding. In these experiments, all models have a number of hidden states equal to the number of tags in the corpus.

5.2 Overall Results

Figure 5.1 compares the performance of the models across the six languages using both the 1-Many and MI metrics described in Subsection 2.5.3. Figure 5.1 shows that for both metrics our model (HMM+ME) always performs better than the baseline (HMM). Also, training with PR using the sparsity constraint always improves over the HMM. Strikingly, the improvements of the two modifications are additive: HMM+ME+Sp always performs better than both HMM+ME and HMM+Sp. Finally, HMM+VB performs better than the HMM for English, and comparably or worse for the other languages.

Figure 5.1: Average accuracy and mutual information for 10 different runs (random seeds identical across models) over 200 iterations of training. Left: accuracy after performing the 1-Many mapping. Right: mutual information between the gold tag distribution and the hidden state distribution. The error bars extend one standard deviation.

5.3 Maximum Entropy Observation Model Results

As shown in the previous section, HMM+ME always outperforms the HMM model on all corpora. In this section we break down the performance of the HMM+ME model in terms of different feature sets, the number of parameters used, and the value of the Gaussian prior over the parameters. Conventional wisdom for discriminative learning is that adding more features almost always improves performance.
While this is often true in a setting with a large labeled corpus, the situation is very different for an unsupervised problem such as ours. In the supervised setting, the model has guidance that it can use to discard some of the features as irrelevant to the task. By contrast, there is no way for an unsupervised model to learn that it should not use some of the features to explain the data. For example, the features described in Section 2.5.1 (Words, Suff., Size, Ortho. and Counts) could be used for a named entity recognizer or a noun-phrase chunker; without labeled data, we have to trust that our selection of features is adequate to the task at hand. Moreover, increasing the number of features increases the capacity of the model, which can lead to adverse results, such as leading the model to explain irrelevant correlations in the data.

Figure 5.2 breaks down the performance of the model by feature set. A first observation is that not all feature sets are useful. In fact, only the word identity features with cutoff and the suffix features are useful across all languages, while the other features seem to hurt, or at least not help, the model. The fact that the morphological features are not useful is unexpected and seems to contradict prior work [Clark, 2003]. We leave a better understanding of the feature interactions for future work.

Figure 5.2: Accuracy (using the 1-Many mapping) of the HMM+ME model on the different corpora using different feature sets. All: all features described in Subsection 2.5.1. No-Size: removes the feature for short words. No-Count: removes the feature for frequent words. No-Orto: removes the regular expression features. No-Suff: removes the suffix features. No-Cutoff: removes the cutoff on the word identity features, meaning that all word identities are included. All values correspond to an average over 10 different runs. The error bars extend one standard deviation.

One of the advantages of the HMM+ME model is that it allows us to reduce the number of parameters of the model, giving it less chance to overfit. Figure 5.3 shows the accuracy of the model as we vary the number of word identities used as parameters. We observe that even when the cutoff is set to 1, HMM+ME still improves over the HMM, although both have parameters for the same word identities (recall that in the HMM, words that occurred only once were replaced by the token "unknown"); this shows that the other features are actually helping the model. As we increase the word identity cutoff value, the performance of the model varies according to the language. For some languages, the fewer words we keep the better the performance (En, Pt), while for other languages the performance starts to degrade after a certain point (Es, Dk). A possible explanation is that the value of the cutoff is based on the absolute number of occurrences of a word type, which does not scale with the corpus size; in fact, the corpora where higher cutoffs hurt performance are the smaller ones.
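Since this section leans on the form of the observation model, here is a minimal sketch of a log-linear (maximum entropy) emission distribution of the kind used by HMM+ME; the array layout and names are illustrative, not the thesis implementation.

    import numpy as np

    # Minimal sketch of a maximum entropy observation model: p(word | tag)
    # is a log-linear distribution over the vocabulary, normalized per tag.
    # features: (vocab_size, num_features) word feature matrix (identity,
    # suffixes, orthography, ...); weights: (num_tags, num_features).
    def me_observation_probs(features, weights):
        scores = weights @ features.T                 # (num_tags, vocab_size)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(scores)
        return probs / probs.sum(axis=1, keepdims=True)

Note that if all weights are zero, this returns the uniform distribution over the vocabulary, matching the remark in Subsection 3.4.1 that parameter sparsity does not imply posterior sparsity for this model.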
Figure 5.4 shows the number of features used by each model as we increase the cutoff. As expected, the number of features of the HMM+ME model with a cutoff of 1 is similar to the HMM, but the number of features decreases drastically as we increase the cutoff, especially for small corpora, relying more on the remaining shared features. This might be particularly useful for domain adaptation, where we want to rely as little as possible on particular word identities, since these are bound to change or have a different meaning in the new domain.

Figure 5.3: Accuracy (using the 1-Many mapping) of the HMM+ME model on the different corpora using different cutoffs on the word identity features. All values correspond to an average over 5 different runs. The error bars extend one standard deviation.

Figure 5.4: Number of features used by the HMM+ME as we increase the word identity cutoff value.

There is a direct correlation between the number of features used and the Gaussian prior of the HMM+ME model. As we increase the regularizer strength (the inverse of the Gaussian prior variance), the model pays more for giving high weights to features, so it prefers to use common features such as suffixes to explain the data and avoids using the word identity features. Figure 5.5 shows the accuracy of HMM+ME as we change the regularizer value using the default feature set. Setting the strength very high (0.01) always hurts the model. This is natural, since it forces the model to give small weights to most of the features, relying mostly on suffix features, which do not discriminate between some word classes. As we decrease the regularizer strength, the model performance increases significantly. We note that the model is not very sensitive to the smaller regularization strengths (1, 10, 100).

Figure 5.5: Accuracy (using the 1-Many mapping) of the HMM+ME model on the different corpora using different strengths for the regularizer. All values correspond to an average over 5 different runs. The error bars extend one standard deviation.

5.4 Tag Sparsity Results

In this section we analyse both sparsity-inducing approaches (HMM+Sp, HMM+VB). Going back to Figure 2.8, we see that both approaches achieve an average $\ell_1/\ell_\infty$ close to the true distribution. However, the results in Figure 5.1 are quite different: while HMM+Sp always improves performance, HMM+VB more often hurts than helps. Figure 2.8 only shows the global distribution of $\ell_1/\ell_\infty$ for each method, with no information about the difference in sparsity for individual words. This is shown in Figure 5.6, which displays the average sum of absolute differences in sparsity between each model and the true tags. In all languages, both sparsity-inducing models have a smaller average sparsity difference per word, with HMM+Sp having the slightly smaller difference. It is also interesting to see that the HMM+ME model also improves on this measure, although it does not deal with sparsity directly.

Figure 5.6: Average absolute $\ell_1/\ell_\infty$ difference for each word type between each training method and the supervised initialization.
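A minimal sketch of the per-word comparison behind Figure 5.6, reusing the per-type posterior layout of the l1_linf sketch from Subsection 3.4.2 (names illustrative):

    # Minimal sketch of the per-word sparsity gap behind Figure 5.6: the
    # average absolute difference in l1/linf between a model's posteriors
    # and the supervised reference, computed word type by word type.
    def avg_l1linf_gap(model_post, gold_post):
        # both map word type -> array of shape (num_occurrences, num_tags)
        gaps = [abs(model_post[v].max(axis=0).sum()
                    - gold_post[v].max(axis=0).sum())
                for v in gold_post]
        return sum(gaps) / len(gaps)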
So far, these measures do not explain the significant differences in performance between HMM+Sp and HMM+VB. Figure 5.7 shows scatter plots of the performance of each method against the ℓ1/ℓ∞ value it achieves. HMM+Sp always achieves better performance under the 1-Many mapping, while HMM+VB achieves better performance under the 1-1 mapping. However, the good behaviour of HMM+VB under the 1-1 mapping is not due to the quality of the induced clusters, but is a consequence of the defects of the 1-1 mapping described in Subsection 2.5.3.

Figure 5.8 shows the number of tokens assigned to each cluster at decoding time, in frequency rank order. While both HMM and HMM+Sp exhibit a fast decrease in the size of the clusters, HMM+VB more closely matches the power-law-like distribution of the gold labels. This difference explains the improvement under the 1-1 mapping, where HMM+VB assigns larger clusters to the most frequent tags. However, HMM+VB achieves this power-law distribution at the expense of the mutual information with the gold labels, as we see in Figure 5.1. Of all the methods, HMM+VB has the lowest mutual information, while HMM+Sp has the highest. In Section 5.6 we will show further evidence that mutual information is in fact the best measure of the quality of the induced clusters, and that it correlates with the gains obtained when using the induced clusters as features for a supervised part of speech tagger.

Figure 5.6: Average absolute ℓ1/ℓ∞ difference for each word type between each training method (HMM, HMM+VB, HMM+ME, HMM+Sp, HMM+ME+Sp) and the supervised initialization, for the En, Pt and Es corpora.

Figure 5.7: ℓ1/ℓ∞ versus accuracy. Left: 1-Many mapping; right: 1-1 mapping. Top: Pt corpus; bottom: En corpus.

Figure 5.8: Token distribution per cluster. Left: Bg corpus; middle: Pt corpus; right: En corpus. Vertical axis: number of tokens, in tens of thousands.
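Both evaluation quantities used above, the 1-Many mapping accuracy and the mutual information between induced clusters and gold tags, are straightforward to compute from decoded output. Here is a minimal sketch, assuming clusters and gold tags are given as parallel token-level sequences (the function names are ours):

    import math
    from collections import Counter, defaultdict

    def one_to_many_accuracy(clusters, gold):
        """1-Many mapping: map each induced cluster to its most frequent
        gold tag, then score every token under that mapping."""
        by_cluster = defaultdict(Counter)
        for c, g in zip(clusters, gold):
            by_cluster[c][g] += 1
        correct = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
        return correct / len(gold)

    def mutual_information(clusters, gold):
        """Empirical mutual information (in nats) between cluster and tag."""
        n = len(gold)
        pc, pg = Counter(clusters), Counter(gold)
        joint = Counter(zip(clusters, gold))
        return sum((ncg / n) * math.log(ncg * n / (pc[c] * pg[g]))
                   for (c, g), ncg in joint.items())

Note that the 1-Many score can be inflated by splitting frequent tags across clusters, which is one reason the mutual information is the more robust of the two.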
As with the HMM+ME model, there is no principled way to choose the minimum number of word occurrences for the cutoff when choosing which words to constrain. On the one hand, it does not make sense to constrain word types with very few occurrences, since we do not have enough counts to decide which cluster the probability mass should be pushed to. On the other hand, if we only constrain words that occur many times, we leave most of the word types, and possibly the more ambiguous ones, unconstrained. Figure 5.9 shows the performance of the method for different cutoffs. We see that in all languages constraining more words seems to always help. The difference is most significant for Es, since it has a smaller corpus and hence fewer word types with a high number of occurrences. Figure 5.10 shows that the model is not very sensitive to different choices of σ.

Figure 5.9: Accuracy (using 1-Many mapping) of the HMM+Sp model for the different corpora using different cutoffs on the number of occurrences required for a word type to be regularized. All values correspond to an average over 5 different runs. The error bars extend one standard deviation.

Figure 5.10: Accuracy (using 1-Many mapping) of the HMM+Sp model for the different σ values. All values correspond to an average over 5 different runs. The error bars extend one standard deviation.

Figure 5.11: Induced tags by the HMM model (left) and by the HMM+ME+Sp model (right) on the En corpus. Each column represents a hidden state, and is labeled by its 1-Many mapping. Unused true tags are grouped into the cluster named "Others".

5.5 Error Analysis

Figure 5.11 shows the distribution of true tags and clusters for both the HMM model (left) and the HMM+ME+Sp model (right) on the English corpus. Each bar represents a cluster, labeled by the tag assigned to it after performing the 1-Many mapping. The colors represent the number of words with each true tag. To avoid clutter, true tags that were never used to label a cluster are grouped into "Others".

We see in Figure 5.11 that both models split common tags such as nouns into several hidden states. This splitting accounts for much of the errors of both models. By using 5 states for nouns instead of 7, HMM+ME+Sp is able to use more states for adjectives. Another improvement comes from a better grouping of prepositions. For example, "to" is grouped with punctuation on the left, while on the right it is correctly mapped to prepositions. Although this should be the correct behaviour, it actually hurts: since the tagset has a separate true tag "TO", all occurrences of the word "to" are counted as incorrectly assigned, which amounts to a loss of 2.2 points of absolute accuracy. This is arguably not a very serious error, since "to" is a preposition most of the time. By contrast, the HMM has a state mapped to the tag "TO", but the word "to" comprises only one fifth of that state. The most common error made by HMM+ME+Sp is to include the word "The" in the second noun induced tag in Figure 5.11 (right).
This induced tag contains mostly capitalized nouns and pronouns, which often precede nouns of other induced tags. The capitalization feature is potentially the culprit for the word "The".

Portuguese morphology is much richer than English's, which explains the better performance of the feature-based models. Figure 5.12 shows the induced clusters for the Pt corpus. The HMM+ME+Sp model improves over the HMM for all tags except adjectives. Both models have trouble distinguishing nouns from adjectives. The reduced accuracy on adjectives for HMM+ME+Sp is explained by the mapping of a cluster to adjectives in the HMM model and to nouns in the HMM+ME+Sp model. Removing the noun/adjective distinction, as suggested by Zhao and Marcus [2009], would increase the performance of both models by about 6%. One qualitative difference we observed is that the HMM+ME+Sp model uses a single induced cluster for proper nouns rather than spreading them across different clusters.

Figure 5.12: Induced tags by the HMM model (left) and by the HMM+ME+Sp model (right) on the Pt corpus. Each column represents a hidden state, and is labeled by its 1-Many mapping. Unused true tags are grouped into the cluster named "Others".

5.6 Using The Clusters

As a further comparison of the models trained using the different methods, we use them to generate features for a semi-supervised PoS tagger. The basic supervised model has the same features as the HMM+ME model (Subsection 2.5.1), except that we use all word identities and suffixes regardless of frequency. We train the semi-supervised model using the averaged perceptron for a number of iterations chosen as follows: split the training set into 20% for development and 80% for training, and pick the number of iterations ν that optimizes accuracy on the development set. Finally, we train on the full training set for ν iterations and report results on a 500-sentence test set. We augment the supervised features with the state identity for the current token, based on the automatically generated models. For each unsupervised training procedure (HMM, HMM+Sp, HMM+ME, HMM+ME+Sp), we train 10 models using different random initializations to get 10 state identities for each token, and add these cluster identities as features to the supervised model.
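As an illustration of this training loop, here is a minimal sketch of an averaged perceptron over per-token features. It decodes each token independently for brevity, whereas the actual tagger decodes sequences, and the names (train_averaged_perceptron, feat_fn) are ours; treat it only as a sketch of the technique.

    import random
    from collections import defaultdict

    def train_averaged_perceptron(data, n_iter, feat_fn, tags):
        """Averaged-perceptron tagger over per-token features (sketch).

        data: list of (tokens, gold_tags) sentences.
        feat_fn(tokens, i): feature strings for token i, which here would
        include the 10 cluster-identity features from the unsupervised models,
        e.g. "cluster3=17", alongside word and suffix features.
        """
        w = defaultdict(float)      # current weights, keyed by (feature, tag)
        u = defaultdict(float)      # running sum of t * update, for averaging
        t = 1
        for _ in range(n_iter):
            random.shuffle(data)
            for tokens, gold in data:
                for i, g in enumerate(gold):
                    feats = feat_fn(tokens, i)
                    pred = max(tags, key=lambda y: sum(w[(f, y)] for f in feats))
                    if pred != g:
                        for f in feats:
                            w[(f, g)] += 1.0;    u[(f, g)] += t
                            w[(f, pred)] -= 1.0; u[(f, pred)] -= t
                    t += 1
        # standard lazy-averaging trick: averaged weights = w - u / t
        return {k: w[k] - u[k] / t for k in w}

The number of iterations n_iter plays the role of ν above, chosen on the 20% development split before retraining on the full training set.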
Figure 5.13 shows the average accuracy of the supervised model for the Es and Tr languages as we vary the type of unsupervised features. The average is taken over 10 random samples from the training set at each training set size. We can see from Figure 5.13 that using semi-supervised features from any of the models except the HMM improves performance, even when we have 500 labeled sentences. In particular, to get an error rate comparable to the supervised model with 500 labeled sentences, we need fewer than 200 labeled sentences for most languages. Figure 5.14 shows the absolute error reduction obtained by using the HMM+ME+Sp model features instead of the HMM model features, for all languages. The error reduction is especially dramatic in low data conditions, but it is still significant after the supervised model observes 500 training examples, across all languages.

Figure 5.13: Tagging error rate as a function of labeled data size for the semi-supervised model using the induced tags as features (None, HMM, MaxEnt, HMM L1L∞, MaxEnt L1L∞), on the Es corpus (left) and the Tr corpus (right).

Figure 5.14: Error reduction from using the clusters produced by the HMM+ME+Sp model as features, as opposed to the clusters produced by the HMM model, as a function of the labeled data size, for all six languages.

6 Conclusion and Future Work

6.1 Conclusions

In this thesis we developed a novel learning framework, the Posterior Regularization framework, for incorporating domain knowledge into the learning algorithm while keeping the base model simple and tractable. The framework defines a rich language to specify prior knowledge about the desired hidden structure. The prior knowledge is expressed as inequalities on the expected values, under the posterior distribution, of user-defined constraint features. In this thesis we showed the expressiveness of the framework by expressing three different types of constraints: (1) restricting the degree of the hidden state values; (2) making two different models agree on the value of the state posteriors during training; and (3) enforcing sparsity over groups of hidden variables. Moreover, we gave a concrete example of how enforcing the constraints directly on the posterior distribution is better than encoding the same intuition as parameter priors.

For the SWA task we applied two constraints: (i) bijectivity, "one word in one language should not translate to many words in the other language"; and (ii) symmetry, "the translation of words should be the same when translating from language A to language B and from language B to A". Using these constraints we showed consistent and significant improvements on six different language pairs, even when compared with a more complex model such as IBM Model 4. Besides mitigating the known "garbage collector" effect, we showed that the posterior distributions obtained better model the desired alignments. Given the intuition that both constraints bias the models towards 1-1 alignments, we also identified some systematic mistakes that the constraints introduce and suggested future work on how to solve them. In addition, we tested the alignments produced by the PR framework within two different tasks that rely on word alignments, syntax transfer and statistical phrase-based MT, where the improved alignments led to an increase in performance. In the case of syntax transfer, we showed that the number of edges of a dependency tree that can be transferred from one language to the other increases, due to the decrease in incorrect alignment points. On phrase-based MT the improvements are harder to explain, but over three different language pairs we obtained improvements using the new alignments. We noted, however, that the current heuristics for phrase extraction and phrase scoring do not take into account the quality (or probability) of the base alignments, and aim at extracting as many phrases as possible (leading to huge phrase tables), which obfuscates the relation between word alignment quality and MT quality.

We also applied the new learning framework to the task of fully unsupervised PoSI in six different languages. We addressed the problems of the common approach of using an HMM trained with EM, and investigated two efficient modifications to the model that effectively control the model complexity. First, we re-parametrized the standard HMM using a maximum entropy observation model with orthographic and morphological features, reducing the number of parameters, and showed that this simple change produces significant gains in all languages. Second, we used the PR framework to effectively constrain the allowable ambiguity between words and tags, which achieves significant additive gains in all six languages. The combination of the two approaches achieved remarkable improvements in all languages, ranging from 4.7% to 15.4% absolute error reduction. Moreover, we showed that the output of these unsupervised systems can be very useful when training a supervised learning system on a small labeled corpus, reducing the amount of labeled data needed by roughly a half. However, we also found some unsolved challenges with this approach.
Firstly, it is not clear which features help in unsupervised learning and how they interact. It is surprising that morphological features do not help on this particular task, since they are particularly useful in supervised systems. Secondly, the proposed model for this task has many free parameters that influence its outcome, and finding a way to pick good parameter values in the unsupervised setting is still an open research question.

6.2 Future Work

Beyond the two constraints discussed for the SWA task, many others are worth investigating, especially after a better understanding of which systematic errors the current models still make. One example is linguistically informed constraints, such as:

• Enforce small distortion as measured by dependency tree distance, as tried by Lopez and Resnik [2005] and DeNero and Klein [2007];
• Enforce that aligned words should belong to compatible PoS tags; conditioning on PoS tags has shown good results in Toutanova et al. [2002];
• Enforce alignments to respect syntactic boundaries;
• Enforce similar words to align to each other with probability one. This would avoid the example in Figure 2.4, where the proper name "baldwin" acts as a garbage collector even though the word is identical in both languages.

A general question regarding these constraints, and the PR framework in general, is what kind of prior knowledge is better expressed as a constraint in expectation and what kind directly on the parameters. The word similarity constraint, for instance, could be enforced directly on the parameters by initializing the probability p("baldwin" | "baldwin") to one, thus discarding all possible alternatives. Likewise, for the tree distortion model, the distortion probability could be a convex combination of tree distortion and absolute position distortion. A different kind of constraint would be to train word alignment models on several languages at the same time and enforce pairwise agreement between languages; using several languages at the same time has been tried before with success (for instance, Kumar et al. [2007]).

To simulate fertility, we could replace the bijectivity constraint with a fertility constraint, where the probability mass of each word would have to sum not to one, but to a constant depending on the word identity or the word class (see the sketch below). In fact, some small experiments in which we constrained the fertility of each word to be equal to its fertility on the development corpus produced a systematic improvement in performance.
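As a rough illustration (this formalization is ours, not one evaluated in the thesis), such a fertility constraint could be written in the PR constraint language as a bound on the expected number of target words aligned to each source word:

E_q(z̄ | x̄)[ Σ_j 1(z_j = i) ] ≤ f(e_i)   for every source position i,

where z_j is the hidden alignment of target position j and f(e_i) is an assumed per-word fertility budget, estimated for example from the word class or from development data. Bijectivity is then the special case f(e_i) = 1 for all i.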
While for SWA both constraints were parameter-free and linguistically justifiable, leaving only the projection optimization parameters free, the task of PoSI has a wide range of parameters that need to be tuned. A first direction to explore is which features to use: against common wisdom in supervised learning, where the more features the better, in this application we saw that features we expected to be useful were actually harmful to the model. Furthermore, there are interactions between the features and the prior used during the optimization which are not completely clear. Finally, while the soft penalty version of PR has the advantage of removing the requirement of a non-empty set onto which to project the posteriors, it introduces an extra strength parameter that needs to be tuned, and the useful values of that parameter do not have a clear meaning, such as the average sparsity we wish to achieve.

We also wish to analyse the usefulness of the induced clusters in other tasks.

A Derivations

A.1 Convergence Proof

We show that minimizing F(q, θ) subject to the constraints also minimizes the PR objective.

Theorem: Local minima of F(q, θ) s.t. q(z̄ | x̄) ∈ Q_x̄ are local minima of L(θ) + E[KL(Q_x̄ ‖ p_θ(z̄ | x̄))].

Proof: By adding and subtracting E[Σ_z̄ q(z̄ | x̄) log p_θ(z̄ | x̄)] in F(q, θ), we get:

F(q, θ) = −E[ Σ_z̄ q(z̄ | x̄) log ( p_θ(x̄, z̄) / q(z̄ | x̄) ) ]
 = −E[ Σ_z̄ q(z̄ | x̄) log p_θ(x̄) ] + E[ Σ_z̄ q(z̄ | x̄) log ( q(z̄ | x̄) / p_θ(z̄ | x̄) ) ]   (A.1)
 = L(θ) + E[ KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)) ],   (A.2)

where L(θ) = −E[log p_θ(x̄)] is the negative marginal log-likelihood. Since the first term does not depend on q(z̄ | x̄), the second term is minimized at local minima by q(z̄ | x̄) = arg min_{q(z̄|x̄) ∈ Q_x̄} KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)).

A.2 Modified E-step Dual Derivation

The modified E-step involves a projection step that minimizes the Kullback-Leibler divergence:

E′: arg min_{q(z̄|x̄), ξ} KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)) s.t. E_q(z̄|x̄)[φ(x̄, z̄)] − b_x̄ ≤ ξ; ‖ξ‖₂² ≤ ε².   (A.3)

Assuming the set Q_x̄ = {q(z̄ | x̄) : E_q(z̄|x̄)[φ(x̄, z̄)] − b_x̄ ≤ ξ; ‖ξ‖₂² ≤ ε²} is non-empty, the corresponding Lagrangian problem is

max_{λ≥0, α≥0, γ} min_{q(z̄|x̄), ξ} L(q(z̄ | x̄), ξ, λ, α, γ),   (A.4)

where

L(q(z̄ | x̄), ξ, λ, α, γ) = KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)) + λᵀ(E_q[φ(x̄, z̄)] − b_x̄ − ξ) + α(‖ξ‖₂² − ε²) + γ(Σ_z̄ q(z̄ | x̄) − 1).   (A.5–A.6)

Setting the derivative with respect to q(z̄ | x̄) to zero,

∂L/∂q(z̄ | x̄) = log q(z̄ | x̄) + 1 − log p_θ(z̄ | x̄) + λᵀφ(x̄, z̄) + γ = 0   (A.7)
⟹ q(z̄ | x̄) = p_θ(z̄ | x̄) exp(−λᵀφ(x̄, z̄)) / (e · exp(γ)),   (A.8)

and with respect to ξ_i,

∂L/∂ξ_i = 2αξ_i − λ_i = 0 ⟹ ξ_i = λ_i / (2α).   (A.9)

Plugging q(z̄ | x̄) and ξ into L and taking the derivative with respect to γ:

∂L/∂γ = Σ_z̄ p_θ(z̄ | x̄) exp(−λᵀφ(x̄, z̄)) / (e · exp(γ)) − 1 = 0 ⟹ γ = log( Σ_z̄ p_θ(z̄ | x̄) exp(−λᵀφ(x̄, z̄)) / e ).   (A.10)

From this we can simplify q(z̄ | x̄) = p_θ(z̄ | x̄) exp(−λᵀφ(x̄, z̄)) / Z_λ, where Z_λ = Σ_z̄ p_θ(z̄ | x̄) exp(−λᵀφ(x̄, z̄)) ensures that q(z̄ | x̄) is properly normalized. Plugging γ into L and taking the derivative with respect to α, we get:

L(λ, α) = −log(Z_λ) − b_x̄ᵀλ − ‖λ‖₂²/(2α) + ‖λ‖₂²/(4α) − αε²   (A.11)
∂L/∂α = ‖λ‖₂²/(2α²) − ‖λ‖₂²/(4α²) − ε² = 0 ⟹ α = ‖λ‖₂ / (2ε).   (A.12)

Replacing α back into L(λ, α), we get the dual objective:

Dual E′: arg max_{λ≥0} −b_x̄ᵀλ − log(Z_λ) − ε‖λ‖₂.   (A.13)
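The dual (A.13) can be optimized by projected (sub)gradient ascent: its gradient with respect to λ is −b_x̄ + E_q[φ(x̄, z̄)] − ελ/‖λ‖₂. The following is a minimal sketch under the simplifying assumption that the posterior is represented explicitly as a finite distribution with p > 0 everywhere (for the structured models in this thesis, the expectations would instead be computed by dynamic programming); the function name project_posteriors and the fixed step size are ours.

    import numpy as np

    def project_posteriors(p, phi, b, eps, n_steps=200, lr=0.1):
        """Projected gradient ascent on the dual (A.13) of the modified E-step.

        p:   (K,) model posterior p_theta(z | x) over K assignments (all > 0).
        phi: (K, F) constraint features phi(x, z) for each assignment.
        b:   (F,) constraint bounds b_x.
        eps: bound on ||xi||_2 (the constraint slack).
        Returns the projected posterior q(z | x) and the dual variables lambda.
        """
        def q_of(lam):
            # q_lambda(z) proportional to p(z) exp(-lambda . phi(z))
            logq = np.log(p) - phi @ lam
            logq -= logq.max()              # numerical stability
            q = np.exp(logq)
            return q / q.sum()

        lam = np.zeros(phi.shape[1])
        for _ in range(n_steps):
            q = q_of(lam)
            # gradient of  -b.lam - log Z_lam - eps ||lam||_2
            grad = -b + phi.T @ q
            norm = np.linalg.norm(lam)
            if norm > 0:
                grad -= eps * lam / norm    # subgradient of -eps ||lam||_2
            lam = np.maximum(0.0, lam + lr * grad)  # project onto lam >= 0
        return q_of(lam), lam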
A.3 Modified Penalty E-step Dual Derivation

The penalty version of PR has to solve the following projection problem in the E′-step:

Primal penalty projection: min_{q(z̄|x̄), c_vh} KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)) + σ Σ_vh c_vh s.t. E_q[φ(v, h, i)] ≤ c_vh; 0 ≤ c_vh.   (A.14)

The Lagrangian becomes:

L(q(z̄ | x̄), c_vh, α_vh, λ_vhi) = KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)) + σ Σ_vh c_vh + Σ_vhi λ_vhi (E_q(z̄|x̄)[φ(v, h, i)] − c_vh) − Σ_vh α_vh c_vh,   (A.15)

where we maximize with respect to λ_vhi ≥ 0 and α_vh ≥ 0. Taking the derivative with respect to q(z̄ | x̄), and writing λ · φ(x̄, z̄) = Σ_vhi λ_vhi φ(v, h, i), we have:

∂L/∂q(z̄ | x̄) = log q(z̄ | x̄) + 1 − log p_θ(z̄ | x̄) + λ · φ(x̄, z̄).   (A.16)

Setting this to zero and ensuring that q(z̄ | x̄) normalizes, we get:

q(z̄ | x̄) = p_θ(z̄ | x̄) exp(−λ · φ(x̄, z̄)) / Z_λ.   (A.17)

Taking the derivative with respect to c_vh,

∂L/∂c_vh = σ − Σ_i λ_vhi − α_vh,   (A.18)

and setting this to zero gives α_vh = σ − Σ_i λ_vhi. Knowing that α_vh ≥ 0, this introduces the constraint Σ_i λ_vhi ≤ σ. Substituting q(z̄ | x̄) into the KL term yields:

KL(q(z̄ | x̄) ‖ p_θ(z̄ | x̄)) = Σ_z̄ q(z̄ | x̄) log( exp(−λ · φ(x̄, z̄)) / Z_λ ) = −log(Z_λ) − E_q(z̄|x̄)[λ · φ(x̄, z̄)].   (A.19)

The second term cancels with Σ_vhi λ_vhi E_q(z̄|x̄)[φ(v, h, i)], leaving us with:

L(λ) = −log(Z_λ) + σ Σ_vh c_vh − Σ_vh c_vh Σ_i λ_vhi − Σ_vh c_vh (σ − Σ_i λ_vhi) = −log(Z_λ).   (A.20)

So the objective becomes very simple:

Dual penalty projection: max_{λ_vhi ≥ 0} −log(Z_λ) s.t. Σ_i λ_vhi ≤ σ.   (A.21)

B Bibliography

Afonso, S., E. Bick, R. Haber, and D. Santos (2002). Floresta Sintá(c)tica: a treebank for Portuguese. In Proc. LREC, pp. 1698–1703.

Axelrod, A., R. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot (2005). Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proc. International Workshop on Spoken Language Translation (IWSLT).

Ayan, N. F. and B. J. Dorr (2006). Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proc. ACL.

Banko, M. and R. Moore (2004). Part of speech tagging in context. In Proc. COLING.

Bannard, C. and C. Callison-Burch (2005). Paraphrasing with bilingual parallel corpora. In ACL '05: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, pp. 597–604. Association for Computational Linguistics.

Bellare, K., G. Druck, and A. McCallum (2009). Alternating projections for learning with expectation constraints. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence, Corvallis, Oregon, pp. 43–50. AUAI Press.

Bertsekas, D. P. (1999). Nonlinear Programming: 2nd Edition. Athena Scientific.

Biemann, C., C. Giuliano, and A. Gliozzo (2007). Unsupervised part-of-speech tagging supporting supervised methods. In Proceedings of RANLP.

Brown, P. F., S. A. D. Pietra, V. J. D. Pietra, M. J. Goldsmith, J. Hajic, R. L. Mercer, and S. Mohanty (1993). But dictionaries are data too. In HLT '93: Proceedings of the Workshop on Human Language Technology, Morristown, NJ, USA, pp. 202–205. Association for Computational Linguistics.

Brown, P. F., S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2), 263–311.

Callison-Burch, C. (2007). Paraphrasing and Translation. Ph.D. thesis, University of Edinburgh.

Callison-Burch, C. (2008). Syntactic constraints on paraphrases extracted from parallel corpora. In Proc. EMNLP.

Charniak, E. and M. Elsner (2009). EM works for pronoun anaphora resolution. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 148–156. Association for Computational Linguistics.

Chiang, D., A. Lopez, N. Madnani, C. Monz, P. Resnik, and M. Subotin (2005, October). The Hiero machine translation system: extensions, evaluation, and analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 779–786. Association for Computational Linguistics.

Civit, M. and M. Martí (2004). Building Cast3LB: a Spanish treebank. Research on Language & Computation.

Clark, A. (2003).
Combining distributional and morphological information for part of speech induction. In Proc. EACL.

Cohen, S. and N. Smith (2009). The shared logistic normal distribution for grammar induction. In Proc. NAACL.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38.

DeNero, J. and D. Klein (2007, June). Tailoring word alignments to syntactic machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 17–24. Association for Computational Linguistics.

Deng, Y. and W. Byrne (2006). HMM word and phrase alignment for statistical machine translation. In Proc. IEEE.

Fraser, A. and D. Marcu (2007a, June). Getting the structure right for word alignment: LEAF. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 51–60.

Fraser, A. and D. Marcu (2007b). Measuring word alignment quality for statistical machine translation. Computational Linguistics 33(3), 293–303.

Galley, M., M. Hopkins, K. Knight, and D. Marcu (2004, May). What's in a translation rule? In S. Dumais, D. Marcu, and S. Roukos (Eds.), HLT-NAACL 2004: Main Proceedings, Boston, Massachusetts, USA, pp. 273–280. Association for Computational Linguistics.

Ganchev, K., J. Gillenwater, and B. Taskar (2009). Dependency grammar induction via bitext projection constraints. In ACL-IJCNLP '09: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, Morristown, NJ, USA, pp. 369–377. Association for Computational Linguistics.

Ganchev, K., J. Graça, J. Gillenwater, and B. Taskar (2009). Posterior regularization for structured latent variable models. Technical Report MS-CIS-09-16, University of Pennsylvania Department of Computer and Information Science.

Ganchev, K., J. V. Graça, J. Blitzer, and B. Taskar (2008). Multi-view learning over structured and non-identical outputs. In Proc. UAI.

Ganchev, K., J. V. Graça, and B. Taskar (2008, June). Better alignments = better translations? In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 986–993. Association for Computational Linguistics.

Gao, J. and M. Johnson (2008, October). A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In Proc. EMNLP, Honolulu, Hawaii, pp. 344–352. ACL.

Goldwater, S. and T. Griffiths (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. In Proc. ACL.

Graça, J., K. Ganchev, F. Pereira, and B. Taskar (2009). Parameter vs. posterior sparsity in latent variable models. In Proc. NIPS.

Graça, J. V., K. Ganchev, and B. Taskar (2007, December). Expectation maximization and posterior constraints. In J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, Cambridge, MA, pp. 569–576. MIT Press.

Graça, J. V., J. P. Pardal, L. Coheur, and D. Caseiro (2008, May). Building a golden collection of parallel multi-language word alignment. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Haghighi, A. and D. Klein (2006). Prototype-driven learning for sequence models. In Proc. HLT-NAACL. ACL.

Headden III, W. P., M. Johnson, and D. McClosky (2009, June).
Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, pp. 101–109. Association for Computational Linguistics.

Headden III, W. P., D. McClosky, and E. Charniak (2008). Evaluating unsupervised part-of-speech tagging for grammar induction. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pp. 329–336. Association for Computational Linguistics.

Hoang, H., A. Birch, C. Callison-Burch, R. Zens, A. Constantin, M. Federico, N. Bertoldi, C. Dyer, B. Cowan, W. Shen, C. Moran, and O. Bojar (2007, June). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, pp. 177–180. Association for Computational Linguistics.

Huang, F. and A. Yates (2009). Distributional representations for handling sparsity in supervised sequence-labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 1, pp. 495–503. Association for Computational Linguistics.

Hwa, R., P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak (2005). Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering 11(3), 311–325.

Johnson, M. (2007). Why doesn't EM find good HMM POS-taggers? In Proc. EMNLP-CoNLL.

Klein, D. and C. Manning (2004). Corpus-based induction of syntactic structure: models of dependency and constituency. In Proc. ACL.

Koehn, P. (2002). Europarl: a multilingual corpus for evaluation of machine translation.

Koehn, P., F. J. Och, and D. Marcu (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL), Morristown, NJ, USA, pp. 48–54. Association for Computational Linguistics.

Kromann, M., L. Mikkelsen, and S. Lynge (2003). Danish Dependency Treebank. In Proceedings of TLT 2003. Växjö University Press, Sweden.

Kumar, S. and W. Byrne (2002). Minimum Bayes-risk word alignments of bilingual texts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 140–147.

Kumar, S., F. J. Och, and W. Macherey (2007). Improving word alignment with bridge languages. In Proc. EMNLP, pp. 42–50.

Lambert, P., A. de Gispert, R. Banchs, and J. B. Mariño (2005). Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation 39(4), 267–285.

Liang, P., M. I. Jordan, and D. Klein (2009). Learning from measurements in exponential families. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, pp. 641–648. ACM.

Liang, P., B. Taskar, and D. Klein (2006, June). Alignment by agreement. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, New York City, USA, pp. 104–111. Association for Computational Linguistics.

Lopez, A. and P. Resnik (2005). Improved HMM alignment models for languages with scarce resources. In Proc. ACL.

Lopez, A. and P. Resnik (2006). Word-based alignment, phrase-based translation: what's the link?
In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA): Visions for the Future of Machine Translation, Boston, MA, pp. 90–99.

Mann, G. and A. McCallum (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning, pp. 593–600. ACM.

Mann, G. S. and A. McCallum (2008, June). Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 870–878. Association for Computational Linguistics.

Marcus, M., M. Marcinkiewicz, and B. Santorini (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19(2), 313–330.

Matusov, E., N. Ueffing, and H. Ney (2006). Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proc. EACL, pp. 33–40.

McCallum, A., D. Freitag, and F. Pereira (2000). Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML. Morgan Kaufmann.

Merialdo, B. (1994). Tagging English text with a probabilistic model. Computational Linguistics 20(2), 155–171.

Moore, R. C. (2004). Improving IBM word-alignment Model 1. In Proc. ACL, pp. 518–525.

Neal, R. M. and G. E. Hinton (1998). A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 355–368. Kluwer.

Nocedal, J. and S. J. Wright (1999). Numerical Optimization. Springer.

Och, F. J. and H. Ney (2000). Improved statistical alignment models. In ACL '00: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, pp. 440–447. Association for Computational Linguistics.

Och, F. J. and H. Ney (2003). A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51.

Oflazer, K., B. Say, D. Hakkani-Tür, and G. Tür (2003). Building a Turkish treebank. In Treebanks: Building and Using Parsed Corpora.

Papineni, K., S. Roukos, T. Ward, and W. Zhu (2002). BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286.

Ravi, S. and K. Knight (2009). Minimized models for unsupervised part-of-speech tagging. In Proc. ACL.

Rogati, M., S. McCarley, and Y. Yang (2003). Unsupervised learning of Arabic stemming using a parallel corpus. In Proc. ACL.

Schütze, H. (1995). Distributional part-of-speech tagging. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pp. 141–148. Morgan Kaufmann Publishers Inc.

Simov, K., P. Osenova, M. Slavcheva, S. Kolkovska, E. Balabanova, D. Doikoff, K. Ivanova, A. Simov, E. Simov, and M. Kouylekov (2002). Building a linguistically interpreted corpus of Bulgarian: the BulTreeBank. In Proc. LREC.

Smith, N. and J. Eisner (2005). Contrastive estimation: training log-linear models on unlabeled data. In Proc. ACL.

Smith, N. A. and J. Eisner (2006). Annealing structural bias in multilingual weighted grammar induction. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, pp. 569–576. Association for Computational Linguistics.

Snyder, B. and R.
Barzilay (2008, June). Unsupervised multilingual learning for morphological segmentation. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 737–745. Association for Computational Linguistics.

Snyder, B., T. Naseem, J. Eisenstein, and R. Barzilay (2009). Adding more languages improves unsupervised multilingual part-of-speech tagging: a Bayesian non-parametric approach. In Proc. NAACL.

Toutanova, K., H. Ilhan, and C. Manning (2002). Extensions to HMM-based statistical word alignment models. In Proc. EMNLP.

Toutanova, K. and M. Johnson (2007). A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Proc. NIPS 20.

Vilar, D., M. Popović, and H. Ney (2006). AER: do we need to "improve" our alignments? In Proc. IWSLT.

Vogel, S., H. Ney, and C. Tillmann (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, Morristown, NJ, USA, pp. 836–841. Association for Computational Linguistics.

Wang, Q. and D. Schuurmans (2005). Improved estimation for unsupervised part-of-speech tagging. In Proc. IEEE NLP-KE.

Yarowsky, D. and G. Ngai (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 1–8. Association for Computational Linguistics, Morristown, NJ, USA.

Zhao, Q. and M. Marcus (2009). A simple unsupervised learner for POS disambiguation rules given only a minimal lexicon. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.