
Translating into Morphologically Rich Languages One Word at a Time

Chris Dyer (Carnegie Mellon University)

Translation into morphologically rich languages (MRLs) is an important but recalcitrant problem in machine translation. When confronted with the large vocabulary sizes of MRLs, the independence assumptions made by standard translation models mean that vast amounts of parallel training data (which generally does not exist) would be necessary to reliably estimate the numerous required parameters. Previous attempts to remedy this situation have been unsatisfying, either because they were highly language-dependent or because they failed from a modeling perspective (e.g., they improved performance on long-tail types at the expense of frequent types).

We introduce a simple and effective approach that deals with the problem of translation into MRLs using a two-phase model. In the first phase, for each sentence that is to be translated, a morphology-aware translation model is used to generate translation candidates for words and short phrases in the input. This local model works by first picking a target-language lemma or sequence of lemmas and then (independently) predicting the inflection for each lemma, conditioned on rich features of the relevant source word's sentential context. In the second phase, the set of generated translation candidates is used to augment the inventory of context-independent translation rules obtained using standard translation rule extraction techniques, and finally, a complete translation for the input is pieced together using standard decoding techniques. Our approach relies on a morphological analyzer in the target language that decomposes inflected words into tuples of lemmas and inflectional features. Since supervised analyzers may not always be available, we show that an unsupervised Bayesian method for inferring morphological analyses can be used with similar effect. We report significant improvements in translation quality when translating from English to three typologically distinct MRLs: Russian, Swahili, and Hebrew.
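The two phases described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the speaker's implementation: the lemma table, feature weights, surface-form table, and all function names are hypothetical, the Russian forms are transliterated, and a simple softmax over hand-set weights stands in for whatever inflection model the actual system uses. For brevity, the inflection distribution here depends only on the source context, not on the chosen lemma.

```python
import math

# Hypothetical toy tables; every entry is illustrative only.
# Source word -> {target lemma: probability}
LEMMA_PROBS = {"cat": {"koshka": 0.7, "kot": 0.3}}

# (source-context feature, inflection tag) -> weight
INFLECTION_WEIGHTS = {
    ("prev=the", "nom.sg"): 1.0,
    ("prev=of", "gen.sg"): 2.0,
}
INFLECTIONS = ["nom.sg", "gen.sg"]

# (lemma, inflection tag) -> inflected surface form
SURFACE = {
    ("koshka", "nom.sg"): "koshka",
    ("koshka", "gen.sg"): "koshki",
    ("kot", "nom.sg"): "kot",
    ("kot", "gen.sg"): "kota",
}


def context_features(tokens, i):
    """Features of the source word's sentential context (here: neighbors only)."""
    prev = tokens[i - 1] if i > 0 else "<s>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [f"prev={prev}", f"next={nxt}"]


def inflection_distribution(features):
    """Softmax over inflection tags given the context features."""
    scores = {
        tag: sum(INFLECTION_WEIGHTS.get((f, tag), 0.0) for f in features)
        for tag in INFLECTIONS
    }
    z = sum(math.exp(s) for s in scores.values())
    return {tag: math.exp(s) / z for tag, s in scores.items()}


def generate_candidates(tokens, i):
    """Phase 1: pick a lemma, then predict its inflection from context."""
    infl = inflection_distribution(context_features(tokens, i))
    candidates = []
    for lemma, p_lemma in LEMMA_PROBS.get(tokens[i], {}).items():
        for tag, p_tag in infl.items():
            surface = SURFACE.get((lemma, tag))
            if surface is not None:
                candidates.append((surface, p_lemma * p_tag))
    return sorted(candidates, key=lambda c: -c[1])


def augment_rules(static_rules, tokens):
    """Phase 2: add sentence-specific candidates to the context-independent rules."""
    rules = {src: list(tgts) for src, tgts in static_rules.items()}
    for i, word in enumerate(tokens):
        existing = {tgt for tgt, _ in rules.get(word, [])}
        for surface, score in generate_candidates(tokens, i):
            if surface not in existing:
                rules.setdefault(word, []).append((surface, score))
                existing.add(surface)
    return rules


sentence = ["of", "cat"]
candidates = generate_candidates(sentence, 1)
augmented = augment_rules({"cat": [("koshka", 0.9)]}, sentence)
```

In this sketch the augmented rule table would then be handed to a standard decoder, which is outside the scope of the example.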

Speaker bio

Chris Dyer is an assistant professor in the Language Technologies Institute and affiliated faculty in the Machine Learning Department in the School of Computer Science at Carnegie Mellon University.

