Browsing by study line "Language Technology"

  • Hotti, Helmiina (2023)
    Diagrams are a mode of communication that poses challenges for computational processing. The challenges arise from the multimodal nature of diagrams: they combine several types of expressive resources, such as textual elements, connective elements such as arrows and lines, and illustrations, to achieve their communicative purposes. Humans interpret diagrams by judging how these different expressive resources work together to reach the communicative goals set for the diagram. To do that, humans make inferences about the diagram layout and the implicit relations that exist between different parts of the diagram. To build computational methods for diagram understanding, large amounts of data annotated with these implicit relations are required. Traditionally, these types of discourse structure annotations have been produced by experts, due to the difficulty of the task and the requirement that the annotator be familiar with the theoretical framework used for describing discourse relations. The chosen theory for modeling discourse relations in diagrams is Rhetorical Structure Theory, originally developed for modeling textual coherence but applicable to multimodal data as well. This thesis explores the possibility of gathering discourse relation annotations for multimodal diagram data with crowdsourcing, that is, employing naive workers on crowdsourcing platforms to complete annotation tasks for a monetary reward. Adapting the task of discourse relation annotation to be feasible for naive workers has proven challenging in past research concerned only with textual data, and the multimodality of the data adds to the complexity of the task. This thesis presents a novel method for gathering multimodal discourse relation annotations using crowdsourcing and methods of natural language processing.
Two approaches are explored: an insertive annotation task, where the workers are asked to describe the relationship between two diagram elements in their own words, and a multiple-choice task, where the formal definitions of Rhetorical Structure Theory are converted into understandable phrases to annotate with. Natural language processing is used in the first approach to validate the language and structure of the crowdsourced descriptions. The results of the first approach highlight the difficulty of the task: the workers tend to rely heavily on the example descriptions shown in the task instructions and have difficulty grasping the differences between the more fine-grained relations. The multiple-choice approach seems more promising, with higher agreement with expert annotators than in previous research concerned with discourse relations in textual data. Manual inspection of the annotated diagrams shows that the disagreement between the crowdworkers and the expert annotators is often justifiable: both annotations represent a valid interpretation of the discourse relation. This highlights one of the main challenges of the task, which is the ambiguity of some of the relations. Future work is encouraged to consider this by adopting an approach that is less concerned with a pre-defined set of relations and more interested in how the different discourse relations are actually perceived.
  • Vehomäki, Varpu (2022)
    Social media provides huge amounts of potential data for natural language processing, but using this data may be challenging. Finnish social media text differs greatly from standard Finnish, and models trained on standard data may not be able to handle the differences adequately. Text normalization is the task of converting non-standard language into its standardized form. It provides a way both to process non-standard data with standard natural language processing tools and to get more data for training new tools for different tasks. In this thesis I experiment with bidirectional recurrent neural network (BRNN) models, models based on the ByT5 foundation model, and the Murre normalizer to see if existing tools are suitable for normalizing Finnish social media text. I manually normalize a small set of data from the Ylilauta and Suomi24 corpora to use as a test set. For training the models I use the Samples of Spoken Finnish corpus and Wikipedia data with added synthetic noise. The results of this thesis show that there are no existing tools suitable for normalizing Finnish written on social media. There is a lack of suitable data for training models for this task. The ByT5-based models perform better than the BRNN models.
  • Narkevich, Dmitry (2021)
    Hypernymy is a relationship between two words, where the hyponym carries a more specific meaning, and entails a hypernym that carries a more general meaning. A particular kind of verbal hypernymy is troponymy, where troponyms are verbs that encode a particular manner or way of doing something, such as “whisper” meaning “to speak in a quiet manner”. Recently, contextualized word vectors have emerged as a powerful tool for representing the semantics of words in a given context, in contrast to earlier static embeddings where every word is represented by a single vector regardless of sense. BERT, a pre-trained language model that uses contextualized word representations, achieved state of the art performance on various downstream NLP tasks such as question answering. Previous research identified knowledge of scalar adjective intensity in BERT, but not systematic knowledge of nominal hypernymy. In this thesis, we investigate systematic knowledge of troponymy and verbal hypernymy in the base English version of BERT. We compare the similarity of vector representations for manner verbs and adverbs of interest, to see if troponymy is represented in the vector space. Then, we evaluate BERT’s predictions for cloze tasks involving troponymy and verbal hypernymy. We also attempt to train supervised models to probe vector representations for this knowledge. Lastly, we perform clustering analyses on vector representations of words in hypernymy pairs. Data on troponymy and hypernymy relationships is extracted from WordNet and HyperLex, and sentences containing instances of the relevant words are obtained from the ukWaC corpus. We were unable to identify any systematic knowledge about troponymy and verb hypernymy in BERT. It was reasonably successful at predicting hypernyms in the masking experiments, but a general inability to go in the other direction suggests that this knowledge is not systematic. 
Our probing models were unsuccessful at recovering information related to hypernymy and troponymy from the representations. In contrast with previous work that finds type-level semantic information to be located in the lower layers of BERT, our cluster-based analyses suggest that the upper layers contain stronger or more accessible representations of hypernymy.
  • Rahman, Dean (2022)
    There are comprehensive requirements in Finland for procurement by any government organization to go through a tendering process where information about each tender is made available not only to vendors and service providers, but to everyone else in Finland as well. This is accomplished through the website Hilma and should make tenders easy to find. Moreover, in Finnish, variance in domain terminology is not thought to be the problem that it is in English. For instance, the last four years of tenders on Hilma never refer to jatkuva parantaminen as toiminnallinen erinomaisuus whereas “continuous improvement” and “operational excellence” could be used interchangeably in English. And yet, it is considered very difficult for a vendor or service provider to find applicable tenders on Hilma. Rather than lexical variability being the cause as it might be in English, the differences in concept paradigms between the private and public sectors in Finland pose the challenge. Whereas a taxi company representative would be looking for tenders about transportation services, a public officer could be posting a tender about social equity for the disabled. The second difficulty is that the Hilma search engine is purely Boolean with restrictive string match criteria rather than inviting natural language questions. Finally, the Hilma search engine does not account for Finnish being a highly inflecting and compounding language where single words usually morph instead of taking on adpositions, and where compound words are affixed together without hyphenation. Many information retrieval approaches would look outside the corpus for query expansion terms. Natural language processing might also offer the potential to look for paraphrases in existing parallel corpora on tenders throughout the European Union rather than in Hilma. 
However, this thesis focuses on clustering the tenders posted in Finnish on Hilma, applying the comprehensive workflow of the very recent BERTopic package for Python. All documents in each cluster are concatenated, and the highest TF-IDF-scoring words in the concatenated document are slated to be “search extension terms.” If one of the terms were to be entered by a Hilma user, the user would be invited to perform parallel searches with the remaining terms as well. The first main contribution of this thesis is to use state-of-the-art models and algorithms to represent the corpus, reduce the dimensionality of the representations and hierarchically cluster the representations. Second, this thesis develops analytical metrics to be used in automatic evaluation of the efficacy of the clusterings and in comparisons among model iterations that programmatically remove more and more of the distractions to the clustering that are discovered in the corpus. Finally, this thesis performs case studies on Hilma to demonstrate the remarkable efficacy of the search extension terms in generating tremendous numbers of additional useful matches, addressing paradigm-based differences in terminology, morphovariance and affixation.
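The idea of picking search extension terms from concatenated cluster documents can be sketched as follows. This is a minimal illustration in plain Python, not the thesis's implementation: it uses a textbook TF-IDF weighting rather than BERTopic's own class-based weighting, and the function name `search_extension_terms` and the toy clusters are hypothetical.

```python
import math
from collections import Counter

def search_extension_terms(clusters, top_n=3):
    """Return the top-scoring terms per cluster.

    clusters: dict mapping a cluster id to a list of tokenized documents.
    Plain TF-IDF is an assumption made for this sketch.
    """
    # Concatenate all documents of a cluster into one bag of words.
    merged = {cid: Counter(tok for doc in docs for tok in doc)
              for cid, docs in clusters.items()}
    n_clusters = len(merged)
    # Document frequency: number of concatenated cluster documents containing a term.
    df = Counter(tok for counts in merged.values() for tok in counts)
    terms = {}
    for cid, counts in merged.items():
        total = sum(counts.values())
        scores = {tok: (cnt / total) * math.log(n_clusters / df[tok])
                  for tok, cnt in counts.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        terms[cid] = ranked[:top_n]
    return terms

# Hypothetical toy clusters of tokenized tender texts.
clusters = {
    "transport": [["taksi", "kuljetus", "palvelu"],
                  ["kuljetus", "vammaiset", "palvelu"]],
    "it": [["ohjelmisto", "palvelu"], ["ohjelmisto", "lisenssi"]],
}
terms = search_extension_terms(clusters, top_n=2)
```

A word such as "palvelu" that occurs in every cluster scores zero, so it is never offered as an extension term; a user typing one term of a cluster would be offered the other terms of that cluster as parallel searches.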
  • Tapper, Suvi (2023)
    The purpose of this thesis is to examine the prosodic features of English dialects using WaveNet. The exact goal is to investigate whether the differences in prosody between the dialects are present in the data and the results, and whether the geographical distance between the cities included in the data has any influence on this. Another aim is to see how the prosodic features of the sentence types present in the data and their possible differences are manifested in the data and the results. Prosody is concerned with those characteristics of speech which cover more than just individual sounds. Prosodic features can further be divided into paralinguistic features, such as the rate of speech and pausing, and linguistic features, like intonation. Parameters useful for analysing prosody are fundamental frequency (f0), intensity and voice quality – we are interested in the first two. Fundamental frequency is the rate of vibration of the vocal folds while speaking. Intensity in turn is connected to the changes of air pressure while speaking. The data used for this study is the IViE corpus (Intonational Variation in English), comprising recordings made in nine cities in the British Isles – Belfast, Bradford, Cambridge, Cardiff, Dublin, Leeds, Liverpool, London and Newcastle – with approximately 12 speakers per city. In three of the cities, Bradford, Cardiff and London, the dialect is that of a minority. The part of the corpus chosen for this study is a set of 22 sentences consisting of five sentence types. The analysis was performed using WaveNet, a convolutional neural network. It uses causal convolutions, so that each prediction depends only on earlier time steps. In addition to being conditioned on the output of the network itself, it can also be conditioned using embeddings. The WaveNet implementation used here has two embedding layers – target and normalisation embeddings.
Before the analysis, the data was pre-processed and the relevant information concerning the fundamental frequency and intensity was extracted from the sound files. A corresponding *.time file was also created for each of the sound files, with the aim of minimising the influence of the possible differences in length between sentences and thus improving the network's ability to recognise the intonation contours correctly. The results are presented in the form of dendrograms, depicting the relationships between the dialects and sentence types – both separately and as a combination of the dialects and sentence types. It was shown that the differences in prosody were in fact manifested in the data for both dialects and sentence types, although not exactly as expected. Geographical proximity did not seem to influence the dialectal similarities as much as was assumed. In addition to other influences, this might also be due to some of the dialects being minority dialects in the cities, and therefore not necessarily as easily comparable to the dialects of the neighbouring area as the majority dialects might have been.
  • China-Kolehmainen, Elena (2021)
    Computer-Assisted Language Learning (CALL) is one of the sub-disciplines within the area of Second Language Acquisition. Clozes, also called fill-in-the-blank exercises, are widely used in language learning applications. A cloze is an exercise where the learner is asked to provide a fragment that has been removed from the text. For language learning purposes, in addition to open-end clozes, where one or more words are removed and the student must fill the gap, another type of cloze is commonly used, namely the multiple-choice cloze. In a multiple-choice cloze, a fragment is removed from the text and the student must choose the correct answer from multiple options. Multiple-choice exercises are a common way of practicing and testing grammatical knowledge. The aim of this work is to identify relevant learning constructs for Italian to be applied to automatic exercise creation based on authentic texts in the Revita Framework. Learning constructs are units that represent language knowledge. Revita is a free-to-use online platform that was designed to provide language learning tools with the aim of revitalizing endangered languages, including several Finno-Ugric languages such as North Saami. Non-endangered languages were added later. Italian is the first majority language to be added in a principled way. This work paves the way towards adding new languages in the future. Its purpose is threefold: it contributes to raising Italian from its beta status towards a full development stage; it formulates best practices for defining support for a new language; and it serves as documentation of what has been done, how, and what remains to be done. Grammars and linguistic resources were consulted to compile an inventory of learning constructs for Italian.
Analytic and pronominal verbs, verb government with prepositions, and noun phrase agreement were implemented by designing pattern rules that match sequences of tokens with specific parts-of-speech, surfaces and morphological tags. The rules were tested with test sentences that allowed further refining and correction of the rules. The current precision of the 47 rules for analytic and pronominal verbs on 177 test sentences is 100%, and recall is 96.4%. Both precision and recall for the 5 noun phrase agreement rules are 96.0% on the 34 test sentences. Analytic and pronominal verb patterns, as well as noun phrase agreement patterns, were used to generate open-end clozes. Verb government pattern rules were implemented as multiple-choice exercises where one of the four presented options is the correct preposition and the other three are prepositions that do not fit in context. The patterns were designed based on colligations, combinations of tokens (collocations) that are also explained by grammatical constraints. Verb government exercises were generated on a specifically collected corpus of 29074 words. The corpus included three types of text: biography sections from Wikipedia, Italian news articles and Italian language matriculation exams. The last text type generated the most exercises, with a rate of 19 exercises every 10000 words, suggesting that the semi-authentic text best matched the level of the verb government exercises because of appropriate vocabulary frequency and sentence structure complexity. Four native language experts, either teachers of Italian as L2 or linguists, evaluated the usability of the generated multiple-choice clozes, which reached 93.55%. This result suggests that minor adjustments, i.e. the exclusion of target verbs that cause multiple-admissibility, are sufficient to consider verb government patterns usable until the possibility of dealing with multiple-admissible answers is addressed.
The implementation of some of the most important learning constructs for Italian proved feasible with current NLP tools, although quantitative evaluation of the precision and recall of the designed rules is needed to evaluate the generation of exercises on authentic text. This work paves the way towards a full development stage of Italian in Revita and enables further pilot studies with actual learners, which will make it possible to measure learning outcomes in quantitative terms.
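A pattern rule of the kind described in this abstract, matching a sequence of tokens by part of speech, surface form and morphological tags, can be sketched as below. The token fields, tag names and the example rule are assumptions made for illustration; they are not Revita's actual rule format.

```python
def matches(token, constraint):
    """A token satisfies a constraint if every listed attribute has the required value."""
    return all(token.get(key) == value for key, value in constraint.items())

def find_pattern(tokens, pattern):
    """Return (start, end) index pairs of every token window that matches the pattern."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(matches(tok, con) for tok, con in zip(window, pattern)):
            hits.append((start, start + len(pattern)))
    return hits

# Hypothetical analysis of "Maria ha mangiato la mela".
sentence = [
    {"surface": "Maria", "pos": "PROPN"},
    {"surface": "ha", "pos": "AUX", "lemma": "avere"},
    {"surface": "mangiato", "pos": "VERB", "form": "participle"},
    {"surface": "la", "pos": "DET"},
    {"surface": "mela", "pos": "NOUN"},
]

# Hypothetical rule for an analytic verb: auxiliary followed by a participle.
analytic_verb = [{"pos": "AUX"}, {"pos": "VERB", "form": "participle"}]
hits = find_pattern(sentence, analytic_verb)
```

The matched span ("ha mangiato") is the fragment that an open-end cloze would remove from the sentence.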
  • An, Yu (2020)
    Maps of science, or cartography of scientific fields, provide insights into the state of scientific knowledge. Analogous to geographical maps, maps of science present the fields as positions and show the paths connecting them, which can serve as an intuitive illustration of the history of science or as a hint for spotting potential opportunities for collaboration. In this work, I investigate the reproducibility of a method to generate such maps. The idea of the method is to derive representations for the given scientific fields with topic models and then perform hierarchical clustering on these, which in the end yields a tree of scientific fields as the map. The result is found not to be reproducible, as my result obtained on the arXiv data set (~130k articles from arXiv Computer Science) shows a structure inconsistent with the one in the reference study. To investigate the cause of the inconsistency, I derive a second set of maps using the same method and an adjusted data set, which is constructed by re-sampling the arXiv data set to a more balanced distribution. The findings show that the confounding factors in the data cannot account for the inconsistency; instead, it is likely due to the stochastic nature of the unsupervised algorithm. I also improve the approach by using ensemble topic models to derive representations. It is found that the method to derive maps of science can be reproducible when it uses an ensemble topic model fused from a sufficient number of base models.
  • Williams, Salla (2023)
    Hostility in the player communication of video games (and by extension, mobile games) is a well-documented phenomenon that can have negative repercussions for the well-being of the individual being subjected to it, and the society in general. Existing research on detecting hostility in games through machine learning methods is scarce due to the unavailability of data, imbalanced existing data (few positive samples in a large data set), and the challenges involved in defining and identifying hostile communication. This thesis utilizes communication data from the Supercell game Brawl Stars to produce two distinct machine learning models: a support vector classifier and a multi-layer perceptron. Their performance is compared to each other as well as to that of an existing sentiment analysis classifier, VADER. Techniques such as oversampling and using additional data are also used in an attempt to reach better results by improving the balance of the data set. The support vector classifier model was found to have the best performance overall, with an F1 score of 64.15% when used on the pure data set and 65.74% when combined with the SMOTE oversampling algorithm. The thesis includes an appendix with a list of the words that were found to have the strongest influence on the hostile/non-hostile classification.
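The F1 score reported in this abstract is the harmonic mean of precision and recall, which is more informative than accuracy when positive samples are rare. A minimal sketch follows; the counts used are made up for illustration and are not the thesis's data.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced set: 1000 messages, 10 of them hostile.
# A classifier that labels everything non-hostile is 99% accurate ...
accuracy_all_negative = (1000 - 10) / 1000
# ... but finds no hostile messages at all, so its F1 is zero.
f1_all_negative = f1_score(tp=0, fp=0, fn=10)

# A classifier that finds 8 of the 10 hostile messages with 2 false alarms.
f1_useful = f1_score(tp=8, fp=2, fn=2)
```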
  • Nyholm, Sabine (2020)
    Universal sentence representations and multilingual language modelling are hot topics in language technology, specifically in the area concerned with natural language understanding. A sentence embedding is a numerical representation of a sequence of words corresponding to a whole phrase or sentence, specifically as the output of an encoder in machine learning. These representations are needed for automatic tasks in language technology that require understanding the meaning of a whole sentence, as opposed to combinations of the meanings of individual words. Such tasks include, for example, natural language inference (whether a pair of sentences is logically connected) and sentiment analysis. Universality refers to encoded meaning that is general enough to benefit other related tasks, such as classification. There is a call for clearer consensus on the strategies used to assess the quality of these embeddings, either by directly examining their linguistic properties or by using them as independent variables (features) in related models. Because it is costly to create high-quality resources and maintain sophisticated systems for all the languages used in the world, there is also great interest in scaling modern systems up to low-resource languages. The idea is the so-called transfer of knowledge not only between different tasks but also between different languages. Although the need for cross-lingual transfer methods is recognised in the research community, evaluation tools and benchmarks are still at an early stage. SentEval is an existing tool for evaluating sentence embeddings with particular emphasis on their universality. The aim of this thesis project is to attempt to extend this tool to support simultaneous evaluation on new tasks that cover several different languages.
The evaluation approach builds on the strategy of letting encoded sentences serve as features in so-called downstream tasks and observing whether the results improve. A modern multilingual model based on the transformer architecture is evaluated on an established inference task as well as a new emotion detection task, both of which include data in a variety of languages. Although the practical implementation remained largely experimental, some tentative results are reported in this thesis.
  • Tarvainen, Jonna (2023)
    Monolingual paraphrases are semantically equivalent sentences in a single language that convey the same meaning but do not necessarily use the same words. Also, the same word can have different meanings in different contexts. Understanding the meaning of a text behind its words is essential for many natural language processing and deep-learning tasks such as machine translation, plagiarism detection, question answering, and information extraction. Paraphrases have been studied extensively, mainly from an English-only or sometimes multilingual perspective. There are not many studies about paraphrase detection in Finnish and even fewer about detecting paraphrases between different registers of the Finnish language, such as Standard Finnish and Easy Finnish. In this thesis, three different pre-trained sentence-BERT models are tested in a paraphrase detection task. The aim of the task is to find paraphrase pairs and triples between three distinct registers of the Finnish language: Standard Finnish, Easy Finnish, and Colloquial Finnish. The data consist of Yle News articles in Standard and Easy Finnish, mostly from the year 2014, as well as Ylilauta online discussions. The applied BERT models are the paraphrase-multilingual-MiniLM-L12-v2 sentence-transformers model and the FinBERT model. The former is also fine-tuned with data from the Finnish paraphrase corpus. According to the manual evaluation based on the models' precision, the fine-tuned model outperforms the other two. The same three models are tested on two different balanced test sets of 50 paraphrase sentence pairs and 50 non-paraphrase sentence pairs. The FinBERT model reaches the best F1 score in this research setting. In addition to the precision and the F1 score, the average sentence lengths and the repetitiveness of the paraphrase sentence pairs and triples are compared and discussed.
The FinBERT model detected the shortest sentences and the most repetition, but its total number of detected sentence pairs was also the highest. As a result of this study, a new Easy Finnish - Standard Finnish paraphrase corpus is collected to facilitate further studies in paraphrase detection or simplification in Finnish. The corpus is presented in this thesis. It contains 5881 sentence pairs, of which approximately 98% can be assumed to be true paraphrases according to the manual evaluation of randomly selected sentence pairs. The corpus is created by using the fine-tuned paraphrase-multilingual-MiniLM-L12-v2 sentence-transformers model, and it includes paraphrase sentence pairs from Yle News articles in Easy Finnish and in Standard Finnish from the years 2014-2018.
  • Nikula, Ottilia (2023)
    Recent progress in natural language generation tools has raised concerns that the tools are being used to generate neural fake news. Fake news affects society in many ways: it has been used for monetization schemes and to tip political elections, and it has been shown to have a severe effect on people’s mental health. Accordingly, being able to detect neural fake news and counter its spread is becoming increasingly important. The aim of the thesis is to explore whether there are linguistic features that can help detect neural news. Using Grover, a neural language model, I generate a set of articles based on both real and fake human-written news. I then extract a range of linguistic features, previously found to differ between human-written real and fake news, to investigate three questions: whether the same features can be used to detect Grover-written news, whether there are features that can differentiate between Grover-written news with different source material, and whether, based on these features, Grover-written news is more similar to real or fake news. The data consists of 64 articles, of which 16 are real news sourced from reputable news sites and 16 are fake news articles from the ISOT Fake News Dataset. The other 32 articles are written by Grover, with either the real news or the fake news articles as source text (16 each). A broad range of linguistic features are extracted from the article bodies and titles to capture the style, complexity, and sentiment of the articles. The features measured include punctuation, quotes, syntax tree depths, and emotion counts. The results show that the same features which have been found to differ between real and fake news can, with some limitations, be used to discern Grover Fake News (Grover-written articles based on fake news). However, Grover Real News (Grover-written articles based on real news) cannot reliably be discerned from real news.
Moreover, while the features measured do not provide a reliable method for discerning Grover Real News and Grover Fake News from each other, there are still noticeable differences between the two groups. Grover Fake News can be differentiated from real news, but the texts can be considered of better quality than fake news. These findings also align with previous research, showcasing that Grover is adept at re-writing misinformation and making it more credible to readers, and that feature extraction alone cannot reliably distinguish neural fake news, but that human evaluation also needs to be considered.
  • Zhixu, Gu (2023)
    Neural machine translation (NMT) has become a mainstream method for the machine translation (MT) task. Despite its remarkable progress, NMT systems still face many challenges when dealing with low-resource scenarios. Common approaches to address the data scarcity problem include exploiting monolingual data or parallel data in other languages. In this thesis, transformer-based NMT models are trained on Finnish-Simplified Chinese, a language pair with limited parallel data, and the models are improved using various techniques such as hyperparameter tuning, transfer learning and back-translation. The best NMT system is an ensemble model that combines different single models. The results of our experiments also show that different hyperparameter settings can cause a performance gap of up to 4 BLEU points. The ensemble model shows a 35% improvement over the baseline model. Overall, the experiments suggest that hyperparameter tuning is crucial for training vanilla NMT models. Back-translation offers more benefits for model improvement than the transfer learning method. The results also show that adding sampling in back-translation does not improve NMT model performance in this low-data setting. The findings may be useful for future research on low-resource NMT, especially the Finnish-Simplified Chinese MT task.
  • Palma-Suominen, Saara (2021)
    This master's thesis deals with multilingual named entity recognition. The thesis tests two approaches to multilingual named entity recognition: transferring annotated data to other languages and building a multilingual model. In addition, the two approaches are combined. The aim is to find methods with which named entity recognition can be done reliably also for smaller languages for which annotated named entity recognition datasets are not widely available. Models are trained and tested on four languages: Finnish, Estonian, Dutch and Spanish. In the first method, annotated data is transferred from one language to another using a multilingual parallel corpus, and the resulting data is used to train a neural-network-based machine learning model. The second method uses a multilingual BERT model. The model is trained on annotated corpora that are combined into a multilingual training set. In the third method, the two previous methods are combined, and the data transferred from one language to another is used to train the multilingual BERT model. All three approaches are tested on an annotated test set for each language, and the results are compared with each other. The method in which a multilingual BERT model was built achieved clearly the best results in named entity recognition. The neural network models trained on annotations transferred from one language to another achieved clearly weaker results. Training the BERT model on transferred annotations also produced weak results. Transferring annotations from one language to another proved challenging, and the resulting data contained errors. The weak results were also affected by the training and test data belonging to different genres. According to the thesis, the multilingual BERT model is the best-performing of the methods tested, and it is also suitable for languages for which little annotated data is available.
  • Kylliäinen, Ilmari (2022)
    Automatic question answering and question generation are two closely related natural language processing tasks. They both have been studied for decades, and both have a wide range of uses. While systems that can answer questions formed in natural language can help with all kinds of information needs, automatic question generation can be used, for example, to automatically create reading comprehension tasks and improve the interactivity of virtual assistants. These days, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets. So far, no models that can answer or generate questions purely in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently big dataset suitable for question answering or question generation fine-tuning are required. Although some suitable models that have been pre-trained with Finnish or multilingual data are already available, a big bottleneck is the lack of annotated data needed for fine-tuning the models. In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of an existing dataset (SQuAD) and automatic normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use monolingual BERT and GPT-2 models as well as a multilingual BERT model. 
The results show that the transformer architecture is well suited also for Finnish question answering and question generation. They also indicate that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models which have been pre-trained with only Finnish data. The fine-tuned multilingual BERT models come in close, whereas fine-tuned GPT-2 models are generally found to underperform. The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.
  • Keturi, Joonas (2022)
    The subject of the thesis is the comparison of lexical semantics and phonetics. The thesis uses computational methods to investigate whether words that belong to the same semantic domain show significantly more phonetic variance than phonetically similar words from other semantic domains. In other words, phonetically very similar words, and especially phonological minimal pairs, would be in separate semantic domains. The method clusters word embedding vectors and distinctive phonological feature vectors from multiple languages; the phonetic and semantic standard deviations are calculated for each cluster, and the mean standard deviations of the cluster sets are compared. In addition to the semantic and phonetic clusters, two sets of test clusters are constructed with the same number and sizes of clusters as the semantic clusters. The first set of test clusters uses the words from the phonetic clusters in order, and the second is randomly permuted. These different cluster sets are compared by their mean standard deviations and a cluster set similarity index. The results imply that a semantic domain rarely contains phonetically very similar words; such words are usually in separate semantic domains.
  • Hynynen, Jussi-Veikka (2023)
    Using language that is easy to understand when presenting information in a written form is critical for ensuring effective communication. Yet, using language that is too complex or technical for its intended audience is a common pitfall in many domains, such as legal and medical text. Automatic text simplification (ATS) aims to automate the conversion of complex text into a simpler, more easily comprehensible form. This study explores ATS models for English that can be controlled in terms of the readability of the output text. Readability is measured with an automatically calculated readability level that corresponds to a school grade level. The readability-controlled models take a readability level as a parameter and simplify input text to match the reading level of the intended audience corresponding to the parameter value. In total, six readability-controlled sentence simplification models with different control attribute configurations are trained in this study. The models use a pretrained sequence-to-sequence model architecture that is finetuned on a dataset of sentence pairs in regular and simple English. The trained models are evaluated using automatic evaluation metrics and compared to each other and to ATS systems from previous research. Additionally, the simplified sentences produced by the best performing model are evaluated manually to identify errors and the types of text transformations that the model employs to simplify sentences. When the readability level input value is optimized to maximise model performance on validation data, the readability-controlled models surpass systems from previous works in terms of automatic evaluation metrics, suggesting that the addition of readability level as a control attribute results in improved simplification quality. Manual evaluation shows that readability-controlled models are capable of splitting long sentences into multiple shorter sentences to reduce the syntactic complexity of text.
This finding suggests that readability level metrics can be used to effectively control syntactic complexity in ATS models as a lightweight alternative to previously applied, more computationally demanding methods that rely on dependency parsing. Finally, this study discusses the different types of errors produced by the models, their potential causes, and ways to reduce errors in future ATS systems.
  • Pöyhönen, Teemu (2023)
    While natural language generation (NLG) and large language models (LLMs) seem to be transforming many industries, video games have yet to be affected. This study investigates the potential of using NLG systems to generate dialogue for non-playable characters (NPCs) in role-playing games (RPGs). For this, dialogue data is extracted from six popular RPGs and is then used to fine-tune Microsoft’s GODEL to create an “RPG chatbot” (RPG-GPT). Motivated by computational creativity frameworks, a survey and an interactive experiment were conducted to evaluate the creativity and the effectiveness of RPG-GPT in generating relevant and engaging responses to player input. Survey respondents rated dialogues on a 5-point agree-disagree Likert scale, with questions related to, for example, the relevance of the NPC answers. Results indicate that RPG-GPT can provide relevant responses, with mean relevance ratings of 3.93 for the original game dialogue vs. 3.85 for RPG-GPT (p=0.0364). Also, the participants of the interactive experiment reported engagement when interacting with RPG-GPT. Overall, the results suggest that creative NLG has the potential to enhance gaming experiences through task-oriented game dialogue (TOGD) systems. In this framework, creative TOGD systems could solve a common issue where pre-written NPCs are unable to provide the specific information sought by players. Additionally, the study discusses a concept of how players, through their interaction with the NLG models, can expand the lore of a game, which is a new consideration for game designers and developers when implementing such systems. Future work could explore ways to incorporate external knowledge and context to improve the performance of a TOGD system.
  • Vahtola, Teemu (2020)
    Modern word embedding methods, such as Word2vec, do not model lexical ambiguity, because they rely on a single vector representation for each word. Lexical ambiguity therefore causes problems for machine translation systems and can often lead the translations of ambiguous words astray. This thesis examines the possibility of modeling ambiguous words with sense embeddings and uses sense embeddings in training an unsupervised machine translation system for the English-German language pair. Whereas word embedding methods learn one vector representation per word, sense embedding methods can learn several representations, depending on the number of senses identified in the data. Consequently, one of the basic methods of unsupervised machine translation, the mapping of word embeddings from a set of monolingual source- and target-language vector representations into a shared bilingual vector space, can produce a better mapping in which ambiguous words are modeled better in the shared vector space. This way of modeling can have a positive effect on the translation system's ability to translate ambiguous words. In the thesis, sense embedding models are used to disambiguate the senses of tokens, and each token in the training data of the machine translation model is thereby annotated with a sense label. The machine translation model thus uses sense embeddings instead of word embeddings when learning to translate between the source and target languages. A statistical machine translation model is trained using a conventional word embedding method. In addition, both a statistical and a neural machine translation model are trained using the sense embedding method. The data used is the WMT-14 News Crawl dataset. The results of the trained models are compared to a statistical machine translation model that has performed well in automatic evaluation in earlier machine translation research.
In addition, a qualitative evaluation of the results is carried out, focusing on the translation of individual ambiguous words. The results show that translation models can benefit from the sense embedding method. Based on the examined examples, machine translation models that use sense embeddings translate ambiguous words more accurately, in line with the translations chosen in the reference translations, than the model that uses word embeddings. Thus, when the qualitative evaluation focuses on the translation of individual ambiguous words, sense embeddings appear to be useful in training machine translation models.
  • Bedretdin, Ümit (2022)
    This thesis presents the development process of a text classifier based on supervised machine learning from the perspective of media research. The chosen approach makes it possible to harness a media researcher's expert knowledge for large-scale computational analysis and the processing of large datasets. The thesis develops a neural-network-based text classifier, which is used to compare how well different classification features extracted from text can model the framing tactics and topics of journalistic texts. The datasets used in the development work were annotated as part of two media research projects. The first of these studies the ways in which the counter-media outlet MV-lehti reframes articles from the mainstream media. Its data consists of 37,185 MV-lehti articles, from which three different framing tactics have been isolated (Toivanen et al. 2021); the classifier is meant to recognize these automatically from text. The second project focuses on the discussion of alcohol policy in the mainstream media, for which a dataset of 33,902 articles was collected from the news of Yle, Iltalehti, and STT (the ongoing Vallan virrat research project). The classifier's task is to identify the articles in the data that contain discussion of alcohol policy. The purpose of the thesis is to find out which features of the text are best suited as classification features for each task, and which of them lead to the best classification accuracy. The classification features used are sentence-level contextual information extracted from a BERT language model, properties related to the formatting of the article, such as bolding and HTML code, and article-specific topic distributions produced with topic modeling. Preliminary experiments with a classifier using only contextual information were promising, but even their accuracy did not reach the required level. It was therefore necessary to find out whether the classifier's performance improves when different features are combined.
The hypothesis is plausible because, for example, BERT-based embeddings encode linguistic and distributional information about a sequence a few sentences long, whereas a topic model contains broader structural information. These features would complement each other in an article-level classification task. Combining the contextual information of texts with topic modeling has recently led to improvements in various text classification benchmarks and applications (Peinelt et al. 2020, Glazkova 2021). In this thesis, however, combining contextual features with topic model information yields only marginal improvements, and only in certain settings. Nevertheless, the developed classifier performs better in many classification tasks than a classifier using only contextual features. In addition, potential areas for development are identified that could lead to still better classification accuracy. Based on the experiments, automatic classification based on frame analysis is possible with neural networks, but there is still room for improvement in the accuracy and interpretability of the classifiers, and they are not yet accurate enough for conclusions that require high certainty.
  • Koho, Tiina (2022)
    Text normalization is the process of converting non-standard written language into a standardized form. Dialects are one example of non-standard language, which can differ considerably from the standardized language. In addition, Finnish orthography is largely phonemic, which makes it possible to bring out the characteristics of spoken language in writing. Especially on informal platforms and in colloquial contexts, such as social media, Finnish speakers may write words the way they would normally pronounce them when speaking. Data consisting of such non-standard language can also be found for natural language processing purposes, for example on Twitter. However, natural language processing tools designed for conventional standard-language text do not necessarily achieve the desired results when applied to colloquial data, in which case text normalization can be used as an intermediate step. In the normalization process, the input text, which is colloquial or otherwise contains non-standard language, is converted into a standardized spelling that natural language processing tools understand better. This thesis builds on earlier research on the normalization of Finnish dialects. Earlier studies have found that character-based BRNN (Bidirectional Recurrent Neural Network) models achieve good results in normalizing Finnish dialects when the input consists of words in chunks of three. This means that the system receives a set of three words at a time as input, and each word is further split into characters separated by spaces. This thesis aimed to use the same methods and data as the earlier research so that the results would be comparable.
The data used is the Suomen kielen näytteitä (Samples of Spoken Finnish) corpus maintained by Kotimaisten kielten keskus (the Institute for the Languages of Finland), and the open-source library OpenNMT was used for normalization. The results of the experiments carried out in the thesis appear to confirm the findings of the earlier research, but there are also indications that neural network models might benefit from inputs consisting of longer chunks. In addition to the BRNN model, other neural network architectures are also tried, but a comparison of the WER (Word Error Rate) values, which measure the proportion of word errors, shows that the BRNN model performs better in the normalization task than the other neural network architectures.
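
The input format and the evaluation metric described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the function names, the `_` word-boundary symbol, and the sample words are assumptions made for illustration only, and the abstract does not state how word boundaries were marked inside a chunk. The sketch splits a word sequence into chunks of three words with space-separated characters, and computes the word error rate as word-level edit distance divided by the reference length.

```python
def to_char_chunks(words, chunk_size=3, boundary="_"):
    """Group words into chunks of `chunk_size` and spell each word out as
    space-separated characters. The `boundary` symbol between words is an
    assumption of this sketch, not something the abstract specifies."""
    chunks = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        spelled = [" ".join(word) for word in chunk]  # "oon" -> "o o n"
        chunks.append(f" {boundary} ".join(spelled))
    return chunks


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


if __name__ == "__main__":
    # Made-up colloquial input, for illustration only.
    print(to_char_chunks(["mie", "oon", "täälä", "nyt"]))
    print(word_error_rate(["minä", "olen", "täällä", "nyt"],
                          ["minä", "oon", "täällä", "nyt"]))
```

With the made-up sample above, the four input words become two chunks (`m i e _ o o n _ t ä ä l ä` and `n y t`), and one substituted word out of four reference words gives a WER of 0.25. A longer `chunk_size` would correspond to the longer input chunks that the abstract suggests might be beneficial.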