Browsing by Subject "Natural language processing"

  • Moisio, Mikko (2021)
    Semantic textual similarity (STS), the procedure of determining how similar pieces of text are in terms of their meaning, is an important problem in the rapidly evolving field of natural language processing (NLP). STS accelerates major information retrieval applications dealing with natural language text, such as web search engines. For computational efficiency reasons, pieces of text are often encoded into semantically meaningful real-valued vectors (sentence embeddings) that can be compared with similarity metrics. The majority of recent NLP research has focused on a small set of the largest Indo-European languages and on Chinese. Although much of the research is machine learning oriented and is thus often applicable across languages, languages with smaller speaker populations, such as Finnish, often lack the annotated data required to train, or even evaluate, complex models. BERT, a language representation framework building on transfer learning, is one of the recent quantum leaps in NLP research. BERT-type models take advantage of unsupervised pre-training, which reduces the amount of annotated data needed for supervised tasks. Furthermore, a BERT modification called Sentence-BERT enables us to extend and train BERT-type models to derive semantically meaningful sentence embeddings. However, even though the annotated data demands for conventional training of a Sentence-BERT model are relatively low, such data is often unavailable for low-resource languages. Multilingual knowledge distillation has been shown to be a working strategy for extending monolingual Sentence-BERT models to new languages. This technique allows transferring and merging the desired properties of two language models and, instead of annotated data, consumes bilingual parallel samples. In this thesis, we study using knowledge distillation to transfer STS properties learnt from English into a model pre-trained on Finnish, bypassing the lack of annotated Finnish data. Further, we experiment with distillation using different types of data (English-Finnish bilingual, English monolingual, and random pseudo samples) to observe which properties of the training data are really necessary. We acquire a bilingual English-Finnish test dataset by translating an existing annotated English dataset and use this set to evaluate the fit of our resulting models. We evaluate the performance of the models on different tasks (English, Finnish, and English-Finnish cross-lingual STS) to observe how well the transferred properties are captured and how well the models retain the desired properties they already have. We find that knowledge distillation is indeed a feasible approach for obtaining a relatively high-quality Sentence-BERT for Finnish. Surprisingly, in all setups a large portion of the desired properties is transferred to the Finnish model, and training with English-Finnish bilingual data yields the best Finnish sentence embedding model we are aware of.
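    To make the distillation setup concrete, the sketch below shows multilingual knowledge distillation with the sentence-transformers library's parallel-data utilities: an English teacher produces target embeddings, and a student initialised from a Finnish-capable encoder is trained with an MSE objective to reproduce them for both sides of English-Finnish sentence pairs. The model names and the parallel-data file are placeholder assumptions, not the exact configuration used in the thesis.

        from torch.utils.data import DataLoader
        from sentence_transformers import SentenceTransformer, losses
        from sentence_transformers.datasets import ParallelSentencesDataset

        # English teacher with the desired STS properties (placeholder model name)
        teacher = SentenceTransformer("paraphrase-MiniLM-L6-v2")
        # Student built from a Finnish pre-trained encoder (placeholder model name)
        student = SentenceTransformer("TurkuNLP/bert-base-finnish-cased-v1")

        # Tab-separated English<TAB>Finnish sentence pairs (placeholder path).
        # The teacher embeds the English side; the student is trained to match
        # that embedding for both the English and the Finnish sentence.
        parallel_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
        parallel_data.load_data("en-fi-parallel.tsv")

        loader = DataLoader(parallel_data, shuffle=True, batch_size=32)
        distillation_loss = losses.MSELoss(model=student)  # MSE between student and teacher embeddings

        student.fit(train_objectives=[(loader, distillation_loss)], epochs=1, warmup_steps=1000)

    Because the objective only needs parallel sentences rather than STS-annotated pairs, the same recipe applies to any low-resource target language with available translations.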
  • Kortesalmi, Ville (2024)
    Improving employee well-being is a key part of pension agency Keva’s mission statement. Recently, Keva launched ”Pulssi”, a tool for conducting repeated small-scale employee well-being surveys. With the number of responses reaching into the thousands, Keva has identified processing and organizing this data as a part of the process that could be improved using machine learning methods. In this thesis, we conducted a comprehensive investigation into using language models and sentiment classification as a solution. We tested three different methodologies for this purpose: traditional machine learning on learned embeddings, generative language models, and fine-tuned BERT models. To our knowledge, this is the first study evaluating the use of language models on the Finnish sentiment analysis task. Additionally, we evaluated the feasibility of implementing these methods based on their operating costs and the time required to produce classifications. We found that the traditional machine learning models trained on learned embeddings performed surprisingly well, achieving an accuracy of 91%. These models offer a fast and cost-effective alternative to the more cumbersome language models. Our fine-tuned BERT model, ”KevaBERT”, achieved an impressive accuracy of 93.6% when trained on GPT-4-generated predictions, suggesting a potential pathway for training data creation. Overall, our best performance, 93.9% accuracy, was achieved by the ”GPT-4 few-shot with context” model. Our accuracies rival or even surpass the state-of-the-art accuracies achieved on other datasets. Despite the near human-level performance, this model was slow and expensive to operate. Based on these findings, we recommend using our ”KevaBERT” model for sentiment classification and a separate GPT-4-based model for text summarization.
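    As an illustration of the fine-tuned BERT approach described above, the sketch below fine-tunes a Finnish BERT encoder for three-class sentiment classification with the Hugging Face transformers Trainer. The base checkpoint, label scheme, and the two toy survey responses are placeholder assumptions, not Keva's data or the exact ”KevaBERT” configuration.

        from datasets import Dataset
        from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                                  Trainer, TrainingArguments)

        # Toy survey responses with sentiment labels (0 = negative, 1 = neutral, 2 = positive);
        # in practice these would be GPT-4-generated or human-annotated training examples.
        train_data = Dataset.from_dict({
            "text": ["Työilmapiiri on parantunut selvästi.",
                     "Palaverit venyvät aivan liian pitkiksi."],
            "label": [2, 0],
        })

        base_model = "TurkuNLP/bert-base-finnish-cased-v1"  # placeholder Finnish BERT checkpoint
        tokenizer = AutoTokenizer.from_pretrained(base_model)
        model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=3)

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

        train_data = train_data.map(tokenize, batched=True)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="kevabert", num_train_epochs=3,
                                   per_device_train_batch_size=16),
            train_dataset=train_data,
        )
        trainer.train()

    Once fine-tuned, such a classifier runs locally at a fixed cost per batch, which is the practical advantage over a hosted GPT-4-based classifier noted in the abstract.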