
FRINATEK-Fri prosj.st. mat.,naturv.,tek

Dialogue Modelling for Statistical Machine Translation

Alternative title: Dialogmodellering for statistisk maskinoversettelse

Awarded: NOK 3.2 mill.

The project investigated how to improve machine translation technology in dialogue domains. Machine translation, known to the general public through applications such as Google Translate, is the automatic translation from one language to another by a computer algorithm - for instance, translating from Japanese to Norwegian or vice versa. Although great progress has been made over the last decade, machine translation technology often remains poor at adapting its translations to the relevant context. To translate a dialogue (say, film subtitles from English to Norwegian), current translation systems typically operate one utterance at a time and ignore the global coherence and structure of the conversation.

The project aimed to make machine translation systems more "context-aware" by developing new translation methods that dynamically adapt their outputs to the surrounding dialogue context. More specifically, we sought to demonstrate how to automatically extract contextual factors from dialogues and integrate these factors into a state-of-the-art statistical machine translation system. The main goal was to show that this context-rich approach produces translations of higher quality than standard methods. In particular, the project examined how these new translation methods can be employed in practice to produce high-quality translations of film subtitles.

The project focused on two aspects. The first was the collection and preprocessing of conversational data in several languages. Together with colleagues, we have recently released a new, expanded and improved version of the "OpenSubtitles" corpus, a collection of about 3.2 million aligned subtitles of movies and TV series in 60 languages. The second was the use of new statistical models able to dynamically adapt translations to the context in both the source and target language. We have also shown in several research articles that these subtitles can be used to develop neural conversation models.

Although the project only conducted experiments with a limited set of languages, the translation techniques it developed are meant to be language-independent and could in principle be applied to any language pair. In the longer term, speech-to-speech interpretation (the task of automatically translating speech from one language to another in real time) is another possible application of the project's results.

The main practical outcome of the project was the release of the OpenSubtitles 2016 and OpenSubtitles 2018 datasets, which are (to the best of our knowledge) the world's largest collections of parallel corpora available in the public domain. These datasets are widely used in machine translation, especially for languages that otherwise lack sufficient language resources. Beyond machine translation, the datasets have also been used for other important NLP tasks such as language modelling, conversation modelling, and cross-lingual NLP research. As an indicator of their popularity, our 2016 paper describing the dataset has already received over 100 citations (according to Google Scholar) within two years. The OpenDial toolkit, which was released at the beginning of 2014 and is used to quickly develop spoken dialogue systems, has also gained some popularity in the field and has been employed for both teaching and research purposes in several countries.

The project sets out to enhance the quality of statistical machine translation technology through a better account of the translation context. In most current approaches to machine translation, documents are reduced to collections of isolated sentences without overarching structure. This assumption unfortunately ignores the vast amount of linguistic information that is expressed at the cross-sentential level. To remedy this shortcoming, researchers have recently started to pay more attention to the contextual aspects of machine translation. Most work so far has, however, focused on textual domains such as news articles and legal documents, while conversational domains have been neglected. The proposed project aims to fill this gap and will investigate how to optimise machine translation techniques for conversational domains. In particular, the project will develop new, adaptive translation methods that can dynamically modulate their outputs according to the surrounding dialogue context. In a dialogue, the contributions of the participants are not isolated utterances but build upon one another in tight sequence. The objective of the project is to provide an explicit account of these dependencies and demonstrate how to exploit them in order to produce more accurate and contextually relevant translations. To this end, the project will develop a range of new dialogue modelling techniques that allow rich contextual knowledge to be extracted from the dialogue history and integrated into the pipeline of a statistical machine translation architecture. To date, few researchers have studied these dialogue aspects of machine translation, which gives the project a highly innovative character. In addition to its scientific value, the proposed project also has broad technological relevance for several key sectors of the language industry, such as the translation of subtitles for audiovisual content and real-time speech-to-speech interpretation.

Publications from Cristin

No publications found


Funding scheme:

FRINATEK-Fri prosj.st. mat.,naturv.,tek