
Browsing by Subject "Machine translation"


  • Zhixu, Gu (2023)
    Neural machine translation (NMT) has become the mainstream method for the machine translation (MT) task. Despite its remarkable progress, NMT systems still face many challenges in low-resource scenarios. Common approaches to the data scarcity problem include exploiting monolingual data or parallel data in other languages. In this thesis, transformer-based NMT models are trained on Finnish-Simplified Chinese, a language pair with limited parallel data, and the models are improved using techniques such as hyperparameter tuning, transfer learning and back-translation. The best NMT system is an ensemble that combines different single models. The experiments show that different hyperparameter settings can cause a performance gap of up to 4 BLEU points, and the ensemble model yields a 35% improvement over the baseline. Overall, the experiments suggest that hyperparameter tuning is crucial for training vanilla NMT models, and that back-translation offers more benefit than transfer learning. Adding sampling to back-translation does not improve NMT performance in this low-data setting. The findings may be useful for future research on low-resource NMT, especially the Finnish-Simplified Chinese MT task.
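    The abstract above mentions back-translation with and without sampling as a way to create synthetic parallel data from monolingual target-side text. The following is a minimal sketch of that workflow under stated assumptions: `reverse_translate` is a hypothetical stand-in for a trained zh-to-fi model, and the thesis's actual NMT toolkit and data are not reproduced here.

```python
# Hedged sketch of back-translation augmentation for a low-resource
# Finnish<->Simplified Chinese setup. `reverse_translate` is a placeholder
# for a real zh->fi NMT model (an assumption, not the thesis's code).

from typing import Callable, List, Tuple


def back_translate(
    monolingual_zh: List[str],
    reverse_translate: Callable[[str, bool], str],
    use_sampling: bool = False,
) -> List[Tuple[str, str]]:
    """Turn monolingual target-side (zh) sentences into synthetic (fi, zh) pairs.

    With use_sampling=True the reverse model would sample its output instead of
    decoding deterministically; the abstract reports that sampling did not help
    in this low-data setting.
    """
    synthetic_pairs = []
    for zh_sentence in monolingual_zh:
        fi_synthetic = reverse_translate(zh_sentence, use_sampling)
        synthetic_pairs.append((fi_synthetic, zh_sentence))
    return synthetic_pairs


# Dummy reverse model so the sketch runs; a real system would call an
# actual zh->fi translation model here.
def dummy_reverse_translate(sentence: str, sample: bool) -> str:
    prefix = "sampled-fi" if sample else "greedy-fi"
    return f"<{prefix}: {sentence}>"


if __name__ == "__main__":
    mono = ["你好，世界。", "机器翻译很有趣。"]
    for pair in back_translate(mono, dummy_reverse_translate, use_sampling=False):
        print(pair)
```

    The synthetic (fi, zh) pairs would then be concatenated with the genuine parallel data before retraining the forward fi-to-zh model.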
  • Zolotilin, Mikhail (2024)
    Language tags are additional tokens in the source corpus that indicate the language of the corresponding sentence in the target corpus. Like all words, they receive their own vector representations in the translation model, which can then be used for various experiments. This work explores the use of language tag transformations in a multilingual translation model to produce mixed-language output, aiming to create an "intermediate" language variant. It examines the nuances of interpolating between multiple languages via their embeddings and the characteristics of language generation in these boundary regions. The experiments were conducted with two multilingual translation models, English-to-Slavic and Slavic-to-Slavic, with the target languages represented in both models so that their embeddings could be compared in vector space. The study investigates the conditions under which maximum language mixing occurs, examining how factors such as the source language, the target languages and the script influence the process. It analyzes outputs from pre-trained models and also trains several models with varied features to understand how these elements affect the potential for target-language mixing during interpolation. Since reference-based automatic evaluation was not possible, the degree of mixing was assessed with a language identification model. The study also conducts a detailed qualitative linguistic analysis of the mixed output, examining the extent to which the grammar and lexicon of several languages can be mixed. Findings indicate that the extent and location of mixing vary with the source and target languages. Notably, languages that share a script but differ grammatically yielded the most interesting results, suggesting that standardizing the script across the training data could enhance mixing quality. Several smaller multilingual translation models were trained from scratch, incorporating features such as alternative (character-based) word segmentation and script tags, enabling control over the script of the output, not just its language. Despite having significantly less data, these smaller models showed some of the same interpolation trends as the larger models, for example the influence of the script. Additionally, introducing even an extremely small number of alternative examples into the training corpus noticeably affected the model's perception of the script category. The results suggest that mixing or averaging multiple language variants is viable given a uniform script, effective segmentation and encoding, sufficient data, and an in-depth exploration of the space between embeddings to identify the most balanced interlanguage variant.
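    The core mechanism described above is interpolation between the embeddings of two target-language tags. The sketch below illustrates that idea under explicit assumptions: the tag names (`>>pol<<`, `>>ces<<`), the embedding dimension and the random stand-in embedding table are illustrative only; in the thesis the vectors come from a trained multilingual NMT model's embedding matrix.

```python
# Minimal sketch of language-tag interpolation: blend two target-language tag
# embeddings and probe the "boundary region" where mixed-language output is
# reported to appear. Tag names and the embedding table are assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in embedding table; a real experiment would read these rows from the
# trained source-side embedding matrix of the multilingual model.
tag_embeddings = {
    ">>pol<<": rng.normal(size=512),
    ">>ces<<": rng.normal(size=512),
}


def interpolate_tags(tag_a: str, tag_b: str, alpha: float) -> np.ndarray:
    """Linearly interpolate between two language-tag embeddings.

    alpha = 0.0 gives pure tag_a, alpha = 1.0 pure tag_b; values near 0.5
    correspond to the boundary region between the two target languages.
    """
    e_a = tag_embeddings[tag_a]
    e_b = tag_embeddings[tag_b]
    return (1.0 - alpha) * e_a + alpha * e_b


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        mixed = interpolate_tags(">>pol<<", ">>ces<<", alpha)
        print(f"alpha={alpha:.2f}  norm={np.linalg.norm(mixed):.3f}")
```

    In an actual experiment the interpolated vector would replace the tag embedding fed to the encoder, and the decoded output would then be scored with a language identification model, as the abstract describes.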