Browsing by Subject "Easy Finnish"

  • Tarvainen, Jonna (2023)
    Monolingual paraphrases are semantically equivalent sentences in a single language that convey the same meaning without necessarily using the same words. Conversely, the same word can have different meanings in different contexts. Understanding the meaning behind the words of a text is essential for many natural language processing and deep learning tasks, such as machine translation, plagiarism detection, question answering, and information extraction. Paraphrases have been studied extensively, but mainly from an English-only or occasionally multilingual perspective. Few studies address paraphrase detection in Finnish, and even fewer address detecting paraphrases between different registers of Finnish, such as Standard Finnish and Easy Finnish. In this thesis, three pre-trained sentence-BERT models are tested in a paraphrase detection task. The aim of the task is to find paraphrase pairs and triples across three distinct registers of Finnish: Standard Finnish, Easy Finnish, and Colloquial Finnish. The data consist of Yle News articles in Standard and Easy Finnish, mostly from 2014, and Ylilauta online discussions. The models applied are the paraphrase-multilingual-MiniLM-L12-v2 sentence-transformers model, a version of that model fine-tuned on data from the Finnish paraphrase corpus, and the FinBERT model. According to a manual evaluation of the models' precision, the fine-tuned model outperforms the other two. The same three models are also tested on two different balanced test sets of 50 paraphrase sentence pairs and 50 non-paraphrase sentence pairs. The FinBERT model reaches the best F1 score in this research setting. In addition to precision and F1 score, the average sentence lengths and the repetitiveness of the detected paraphrase pairs and triples are compared and discussed.
    The FinBERT model detected the shortest sentences and the most repetition, but its total number of detected sentence pairs was also the highest. As a result of this study, a new Easy Finnish - Standard Finnish paraphrase corpus has been collected to facilitate further studies of paraphrase detection or simplification in Finnish. The corpus is presented in this thesis. It contains 5881 sentence pairs, of which approximately 98% can be assumed to be true paraphrases, based on a manual evaluation of randomly selected sentence pairs. The corpus was created with the fine-tuned paraphrase-multilingual-MiniLM-L12-v2 sentence-transformers model and includes paraphrase sentence pairs from Yle News articles in Easy Finnish and Standard Finnish from the years 2014-2018.
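
The evaluation described in the abstract (scoring candidate sentence pairs and reporting precision and F1 on a labelled test set) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the use of cosine similarity between sentence embeddings, the 0.8 threshold, the toy two-dimensional vectors standing in for real model embeddings, and the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_paraphrase(u, v, threshold=0.8):
    """Label a pair as a paraphrase if similarity reaches the threshold.

    The threshold value is a hypothetical choice, not one from the thesis.
    """
    return cosine(u, v) >= threshold


def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for binary paraphrase labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy test set: (embedding A, embedding B, gold label).
# Real embeddings would come from a sentence embedding model.
pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),
    ([0.0, 1.0], [0.1, 0.9], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([1.0, 1.0], [1.0, 0.0], False),
    ([1.0, 0.0], [0.6, 0.8], True),   # true paraphrase the threshold misses
]

gold = [label for _, _, label in pairs]
predicted = [is_paraphrase(a, b) for a, b, _ in pairs]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this toy data every predicted pair is correct, so precision is high, while the missed pair lowers recall and therefore F1. That is one reason a model can rank differently on a precision-based manual evaluation than on an F1-based test set.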