NMT TRAINING PILOT PROJECT

Introduction

For this pilot project, my MIIS colleagues Lea Bruzzo Delgadillo and Raina Carroll and I created a fictitious company, Excel Localization Services (XLS), and proposed and implemented a pilot project for the Director of Localization of a renowned journal of Latin American Studies (our fictitious client). Our client needed help choosing the most suitable Neural Machine Translation (NMT) engine to translate their research publications from English into Spanish.

NMT Engines

What does an NMT engine do? An NMT engine uses an artificial neural network to predict how likely a given sequence of words is. In our pilot project we trained and compared Microsoft Custom Translator (MCT) and SYSTRAN; both solutions allow users to build a customized NMT engine.
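For readers who want a hands-on sense of what an NMT engine does before any customization, here is a minimal Python sketch that translates one sentence with a generic pretrained English-to-Spanish model. This is only an illustration: our pilot used the vendors' own web interfaces, and the library and model name below (Hugging Face Transformers and Helsinki-NLP/opus-mt-en-es) are our choice for demonstration, not something either vendor requires.

```python
# Illustrative only: run a generic pretrained English->Spanish NMT model
# to see what an engine does before any domain customization.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"  # public model trained on OPUS data
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

source = "The study analyzes economic policy in Latin America."
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)

# Decode the predicted (most likely) Spanish word sequence.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```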

Client Needs

Our client provided specific requirements for language quality, budget, and efficiency. Based on this information, we defined the following project goals: (1) compare the translation quality output of two different machine translation engines and (2) select an engine capable of producing high-quality translations of research articles from English into Spanish.

Project Proposal

After the pilot project proposal was approved by the client, we began implementing the project as described below in the Pilot Project Initial Proposal:

Pilot Project Initial Proposal

OPUS & SCIELO

To train the engines we used the open-source SciELO corpus from OPUS, a site that provides English-centric multilingual corpora. The corpus comes from SciELO, a scientific electronic online library of research publications in Spanish from all over Latin America.

The SciELO English-Spanish TMX file contains 25.11 million words and 416,322 sentence pairs. We followed the minimum data requirements for both systems and divided the SciELO TMX file using Visual Studio Code. We also cleaned the files in Olifant and removed segments longer than 100 characters to meet the requirements of the free-trial version of MCT. Even though SYSTRAN did not impose any segment restrictions, we decided to use the same data set (segments of 100 characters or fewer) so that we could objectively compare BLEU scores from both engines and determine which one would best meet our requirements.
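As a rough sketch of this preparation step, the Python snippet below reads sentence pairs from a TMX file, drops segments longer than 100 characters, and splits the rest into non-overlapping training, tuning, and testing sets (the structure MCT expects). Our actual workflow used Visual Studio Code and Olifant rather than a script, and the file name and split sizes here are illustrative, not the exact figures from the pilot.

```python
# Sketch of TMX filtering and splitting; file name and split sizes are illustrative.
import random
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, src="en", tgt="es", max_chars=100):
    """Extract (source, target) segments, keeping only short segments."""
    pairs = []
    for tu in ET.parse(path).getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()[:2]
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang] = seg.text.strip()
        if src in segs and tgt in segs and \
           len(segs[src]) <= max_chars and len(segs[tgt]) <= max_chars:
            pairs.append((segs[src], segs[tgt]))
    return pairs

pairs = read_tmx_pairs("scielo_en_es.tmx")  # hypothetical file name
random.seed(42)
random.shuffle(pairs)

# Non-overlapping testing / tuning / training splits, as MCT requires.
test, tune, train = pairs[:2000], pairs[2000:4000], pairs[4000:]
print(len(train), len(tune), len(test))
```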

RESULTS

MCT and SYSTRAN LQA Results: Round 1 and Round 10.

After implementation of the pilot project, we provided an updated proposal to our client. The updated version also included our recommendation as to which engine was the better choice to proceed with, based on the BLEU scores and LQI results obtained.
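Both MCT and SYSTRAN report BLEU scores themselves, so no extra tooling was needed in the pilot. For readers curious about what the metric measures, here is a small hedged example that scores two hypothetical engine outputs against a reference translation using the sacrebleu library; the sentences are placeholders, not results from our evaluation.

```python
# Illustration of BLEU scoring with placeholder outputs (not our pilot data).
import sacrebleu

references = ["El estudio analiza la política económica en América Latina."]
engine_a_output = ["El estudio analiza la política económica en América Latina."]
engine_b_output = ["El estudio analiza políticas económicas en Latinoamérica."]

# corpus_bleu takes the system outputs and a list of reference streams.
bleu_a = sacrebleu.corpus_bleu(engine_a_output, [references])
bleu_b = sacrebleu.corpus_bleu(engine_b_output, [references])

print(f"Engine A BLEU: {bleu_a.score:.1f}")
print(f"Engine B BLEU: {bleu_b.score:.1f}")
```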

Project Video Presentation

In the following video we present our experience testing MCT and SYSTRAN, our lessons learned, and recommendations.

Conclusion

SYSTRAN’s UI is more straightforward and simpler to use. SYSTRAN also provides a training data set, so you don’t have to worry about finding a corpus, and it does not require tuning data to train the engine. You just need your test data and you can start training right away. This can make the entire process much faster than if you had to clean up a huge data set. On the other hand, to use MCT you need separate data for training, testing, and tuning (the data sets cannot overlap). Regardless of which engine you test, make sure to find a good data set that is relevant to your subject matter. If your data set is good, your engine’s BLEU scores and linguistic quality will reflect it!
