
Browsing by Subject "NLP"


  • Leppämäki, Tatu (2022)
    Ever more data is available and shared through the internet. The big data masses often have a spatial dimension and can take many forms, one of which is digital texts, such as articles or social media posts. The geospatial links in these texts are made through place names, also called toponyms, but traditional GIS methods are unable to deal with this fuzzy linguistic information. This creates the need to transform linguistic location information into an explicit coordinate form. Several geoparsers have been developed to recognize and locate toponyms in free-form texts: the task of these systems is to be a reliable source of location information. Geoparsers have been applied to topics ranging from disaster management to literary studies. The major language of study in geoparser research has been English, and geoparsers tend to be language-specific, which threatens to leave the experiences provided by studying and expressed in smaller languages unexplored. This thesis seeks to answer three research questions related to geoparsing: What are the most advanced geoparsing methods? What linguistic and geographical features complicate this multi-faceted problem? And how can the reliability and usability of geoparsers be evaluated? The major contributions of this work are an open-source geoparser for Finnish texts, Finger, and two test datasets, or corpora, for testing Finnish geoparsers. One of the datasets consists of tweets and the other of news articles. All of these resources, including the relevant code for acquiring the test data and evaluating the geoparser, are shared openly. Geoparsing can be divided into two sub-tasks: recognizing toponyms amid text flows and resolving them to the correct coordinate location. Both tasks have seen a recent turn to deep learning methods and models, where the input texts are encoded as, for example, word embeddings. Geoparsers are evaluated against gold standard datasets where toponyms and their coordinates are marked.
Performance is measured with equivalence-based and distance-based metrics for toponym recognition and resolution, respectively. Finger uses a toponym recognition classifier built on a Finnish BERT model and a simple gazetteer query to resolve the toponyms to coordinate points. The program outputs structured geodata containing the input texts along with the recognized toponyms and their coordinate locations. While the datasets represent different text types in terms of formality and topics, there is little difference in performance when evaluating Finger against them. The overall performance is comparable to that of geoparsers for English texts. Error analysis reveals multiple error sources, caused either by the inherent ambiguity of the studied language and the geographical world or by the processing itself, for example by the lemmatizer. Finger can be improved in multiple ways, such as refining how it analyzes texts and creating more comprehensive evaluation datasets. Similarly, the geoparsing task should move towards more complex linguistic and geographical descriptions than just toponyms and coordinate points. Finger is not, in its current state, a ready source of geodata. However, the system has the potential to be a first step for Finnish geoparsers and a stepping stone for future applied research.
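The distance-based resolution metric mentioned above can be sketched as a mean error distance over resolved toponyms, here using the haversine great-circle distance (a minimal sketch; the thesis's exact metric and implementation may differ):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def mean_error_distance(predicted, gold):
    """Mean distance between predicted and gold coordinates for resolved toponyms."""
    dists = [haversine_km(p[0], p[1], g[0], g[1]) for p, g in zip(predicted, gold)]
    return sum(dists) / len(dists)
```

A resolution run scoring a lower mean error distance places toponyms closer, on average, to their gold coordinates.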
  • Joosten, Rick (2020)
    In the past two decades, an increasing number of discussions have been held via online platforms such as Facebook or Reddit. The most common source of disruption in these discussions is trolls. Traditional trolls try to derail the discussion into a nonconstructive argument. One strategy to achieve this is to give asymmetric responses, responses that don't follow the conventional patterns. In this thesis we propose a modern machine learning NLP method called ULMFiT to automatically detect the discourse acts of online forum posts in order to detect these conversational patterns. ULMFiT fine-tunes the language model before training its classifier in order to create a more accurate language representation of the domain language. This task of discourse act recognition is unique in that it attempts to classify the pragmatic role of each post within a conversation, as opposed to the functional role, which relates to tasks such as question-answer retrieval, sentiment analysis, or sarcasm detection. Furthermore, most discourse act recognition research has focused on synchronous conversations where all parties can directly interact with each other, while this thesis looks at asynchronous online conversations. Trained on a dataset of Reddit discussions, the proposed model achieves a Matthews correlation coefficient of 0.605 and an F1-score of 0.69 in predicting the discourse acts. Other experiments show that this model is also effective at question-answer classification and that language model fine-tuning has a positive effect on both classification performance and the required size of the training data. These results could be beneficial for current trolling detection systems.
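The Matthews correlation coefficient reported above can be computed from confusion-matrix counts; this is the standard binary form (the thesis's task is multiclass, for which a generalised MCC would be used):

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns +1 for perfect prediction, 0 for random, -1 for total disagreement.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC stays informative on the imbalanced class distributions typical of discourse act labels.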
  • Hrin, Adam (2023)
    Understanding the behaviour of machine learning models is becoming increasingly important as models grow in complexity. This thesis proposes a framework for validating machine-learned signals using performance metrics and model explainability tools, applied to the context of Digital Humanities and Social Sciences. The framework allows investigating whether the real-world problem that the model tries to represent is well-defined and whether the model accurately captures the phenomena at hand. Explainability techniques such as SHAP, LIME and gradient-based methods have been used. These produce feature importance scores on which the model bases its decisions. The cases presented in this thesis are related to research in Computational History and Historical Discourse Analysis with High Performance Computing. The subject of analysis is the large language model BERT, fine-tuned on Eighteenth Century Collections Online (ECCO) documents, which classifies books into genres. Investigating the performance of the classifier with precision-recall curves suggests that the class signals might be overlapping and not clearly delineated. Further results do not suggest that the noise elements present in the data, caused by the OCR digitising process, have significant importance for the decision making of the model. The explainability techniques helped uncover the model's inner workings by showing that the model gets its signal mostly from the beginnings of samples. In a proxy task, a simpler linear model was trained to perform a projection from keywords to genres and revealed inconsistencies in the explainability method. Different subsets of the data have been investigated, as given by the cells of a confusion matrix, the prediction confidence, or additional metadata. Investigating individual samples allows for qualitative analysis as well as more detailed signal understanding.
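Perturbation-based explainability methods like those above assign importance to input features by observing how the model's score changes when a feature is removed. A minimal occlusion-style sketch (the toy `score` function here is hypothetical; SHAP and LIME use more principled weighting):

```python
def occlusion_importance(score, tokens, mask="[MASK]"):
    """Importance of each token = drop in the model's score when it is masked."""
    base = score(tokens)
    importances = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        importances.append(base - score(occluded))
    return importances
```

For example, with a toy scorer that counts occurrences of a genre keyword, only masking that keyword changes the score, so only it receives nonzero importance.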
  • Salmenkivi, Essi (2020)
    This work introduces a system for generating radio play scripts. Generating dramatic dialogue presents unique challenges in language generation. In addition to fluency of language, dramatic text should exhibit plot and the characters' affective stances towards each other and events. Character relationships and affect may be expressed beneath the surface level of everyday conversation topics. In the affect-driven dialogue generation system introduced by this thesis, characters have goals, relationships and a three-dimensional model of mood which influences their behaviour. Given conflicting goals, characters navigate the web of conversation, making choices that influence others to accept their goal while simultaneously trying to maintain their relationships with others. Characters react emotionally to each other's speech acts and express their own affective state in how they speak. The system separates the form of a sentence from its content, allowing it to generate a wide range of coherent, dramatic conversations by combining affect-expressing sentence templates with goal-expressing content. Because content and form are independent of each other, only a finite number of sentence templates need to be prepared to generate conversations about any content.
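A three-dimensional mood model of this kind can be sketched as a bounded vector that drifts toward emotional impulses from speech acts. The pleasure-arousal-dominance (PAD) axis names and the update rule below are assumptions for illustration, not the thesis's actual model:

```python
from dataclasses import dataclass

@dataclass
class Mood:
    """Three-dimensional mood vector, each axis in [-1, 1] (PAD axes assumed)."""
    pleasure: float = 0.0
    arousal: float = 0.0
    dominance: float = 0.0

    def react(self, impulse, weight=0.3):
        """Pull the mood toward an emotional impulse (a 3-tuple), with clamping."""
        clamp = lambda x: max(-1.0, min(1.0, x))
        self.pleasure = clamp(self.pleasure + weight * impulse[0])
        self.arousal = clamp(self.arousal + weight * impulse[1])
        self.dominance = clamp(self.dominance + weight * impulse[2])
```

A character whose mood reacts to an insulting speech act, say `react((-1.0, 0.5, -0.5))`, would then select more negatively tinged sentence templates.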
  • Palma-Suominen, Saara (2021)
    This Master's thesis deals with multilingual named entity recognition. The thesis tests two approaches to multilingual named entity recognition: transferring annotated data to other languages, and building a multilingual model. In addition, these two approaches are combined. The aim is to find methods with which named entity recognition can be performed reliably also for smaller languages for which annotated named entity recognition datasets are not widely available. Models are trained and tested in four languages: Finnish, Estonian, Dutch and Spanish. In the first method, annotated data is transferred from one language to another using a multilingual parallel corpus, and the resulting data is used to train a neural network based machine learning model. The second method uses a multilingual BERT model, trained on annotated corpora combined into a multilingual training set. In the third method, the two previous methods are combined, and the data transferred across languages is used to train a multilingual BERT model. All three approaches are tested on each language's annotated test set, and the results are compared. The method that built a multilingual BERT model clearly achieved the best named entity recognition results. The neural network models trained on annotations transferred across languages performed clearly worse. Training a BERT model on the transferred annotations also produced weak results. Transferring annotations from one language to another proved challenging, and the resulting data contained errors. The weak results were also affected by the training and test data belonging to different genres. According to the thesis, the multilingual BERT model is the best-performing of the tested methods and is also suitable for languages with few annotated datasets available.
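The annotation transfer in the first method can be sketched as projecting entity tags through a word alignment between parallel sentences. This is a toy version assuming a precomputed one-to-one alignment; real projection over a parallel corpus is noisier, which is one source of the errors noted above:

```python
def project_annotations(source_tags, alignment, target_len):
    """Project NER tags from a source sentence onto an aligned target sentence.

    `alignment` maps source token index -> target token index; unaligned
    target tokens keep the outside tag "O".
    """
    target_tags = ["O"] * target_len
    for src_i, tag in enumerate(source_tags):
        if tag != "O" and src_i in alignment:
            target_tags[alignment[src_i]] = tag
    return target_tags
```

Alignment errors propagate directly into the projected training data, which helps explain why models trained on transferred annotations underperformed.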
  • Hynynen, Jussi-Veikka (2023)
    Using language that is easy to understand when presenting information in written form is critical for ensuring effective communication. Yet, using language that is too complex or technical for its intended audience is a common pitfall in many domains, such as legal and medical text. Automatic text simplification (ATS) aims to automatize the conversion of complex text into a simpler, more easily comprehensible form. This study explores ATS models for English that can be controlled in terms of the readability of the output text. Readability is measured with an automatically calculated readability level that corresponds to a school grade level. The readability-controlled models take a readability level as a parameter and simplify input text to match the reading level of the intended audience corresponding to the parameter value. In total, six readability-controlled sentence simplification models with different control attribute configurations are trained in this study. The models use a pretrained sequence-to-sequence model architecture that is finetuned on a dataset of sentence pairs in regular and simple English. The trained models are evaluated using automatic evaluation metrics and compared to each other and to ATS systems from previous research. Additionally, the simplified sentences produced by the best performing model are evaluated manually to identify errors and the types of text transformations that the model employs to simplify sentences. When the readability level input value is optimized to maximise model performance on validation data, the readability-controlled models surpass systems from previous works in terms of automatic evaluation metrics, suggesting that the addition of readability level as a control attribute results in improved simplification quality. Manual evaluation shows that readability-controlled models are capable of splitting long sentences into multiple shorter sentences to reduce the syntactic complexity of the text.
This finding suggests that readability level metrics can be used to effectively control syntactic complexity in ATS models as a lightweight alternative to previously applied, more computationally demanding methods that rely on dependency parsing. Finally, this study discusses the different types of errors produced by the models, their potential causes and ways to reduce errors in future ATS systems.
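A school-grade readability level of the kind used as a control attribute can be computed from surface counts; the Flesch-Kincaid grade level is one common such formula (shown for illustration; the thesis's exact metric may differ):

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw text counts.

    Higher values mean harder text; the result roughly tracks US school grades.
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Note how both terms reward splitting: more sentences lowers words-per-sentence, and simpler word choices lower syllables-per-word, which matches the sentence-splitting behaviour observed in the manual evaluation.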
  • Pöyhönen, Teemu (2023)
    While natural language generation (NLG) and large language models (LLMs) seem to be transforming many industries, video games have yet to be affected. This study investigates the potential of using NLG systems to generate dialogue for non-playable characters (NPCs) in role-playing games (RPGs). For this, dialogue data is extracted from six popular RPGs and used to fine-tune Microsoft's GODEL to create an "RPG chatbot" (RPG-GPT). Motivated by computational creativity frameworks, a survey and an interactive experiment were conducted to evaluate the creativity and effectiveness of RPG-GPT in generating relevant and engaging responses to player input. Survey respondents rated dialogues on a 5-point agree-disagree Likert scale, with questions related to e.g. the relevance of the NPC answers. Results indicate that RPG-GPT can provide relevant responses, with mean relevance ratings of 3.93 for original game dialogue vs. 3.85 for RPG-GPT (p=0.0364). The participants of the interactive experiment also reported feeling engaged when interacting with RPG-GPT. Overall, the results suggest that creative NLG has the potential to enhance gaming experiences through task-oriented game dialogue (TOGD) systems. In this framework, creative TOGD systems could solve a common issue where pre-written NPCs are unable to provide the specific information sought by players. Additionally, the study discusses how players, through their interactions with the NLG model, can expand the lore of a game, which is a new consideration for game designers and developers when implementing such systems. Future work could explore ways to incorporate external knowledge and context to improve the performance of a TOGD system.
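The data preparation step described above, turning extracted game dialogue into training pairs for sequence-to-sequence fine-tuning, can be sketched as follows. The `Instruction:`/`[CONTEXT]` layout below is a hypothetical format for illustration; GODEL's actual preprocessing conventions may differ:

```python
def format_example(npc_name, player_line, npc_reply,
                   instruction="Given a dialog context, respond as the NPC."):
    """Shape one extracted dialogue pair into a source/target text pair
    for seq2seq fine-tuning (hypothetical format, not GODEL's real one)."""
    source = f"Instruction: {instruction} [CONTEXT] Player: {player_line}"
    return {"source": source, "target": f"{npc_name}: {npc_reply}"}
```

Running this over the dialogue trees of the six RPGs would yield a flat list of source/target pairs suitable for a standard seq2seq training loop.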
  • Nevalainen, Janne (2020)
    Modern neural network based language models can reach state-of-the-art performance on a wide range of natural language tasks. Their success is based on the capability to learn from large unlabeled data by pretraining, using transfer learning to learn strong representations of the language and to transfer what is learned to new domains and tasks. I look at how language models produce transfer learning for NLP, especially from the viewpoint of classification, and ask how transfer learning can be formally defined. I compare different LM implementations in theory and also use two example datasets to empirically test their performance on very small labeled training data.
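One widely used formalisation of the transfer learning question asked above, following Pan and Yang's survey, defines a domain and a task and then states the transfer condition:

```latex
% A domain is a feature space with a marginal distribution,
% and a task is a label space with a predictive function:
\mathcal{D} = \{\mathcal{X},\, P(X)\}, \qquad
\mathcal{T} = \{\mathcal{Y},\, f(\cdot)\}

% Transfer learning: given a source pair (\mathcal{D}_S, \mathcal{T}_S) and a
% target pair (\mathcal{D}_T, \mathcal{T}_T) with
% \mathcal{D}_S \neq \mathcal{D}_T \;\text{or}\; \mathcal{T}_S \neq \mathcal{T}_T,
% improve the learning of the target function f_T(\cdot)
% using the knowledge in \mathcal{D}_S and \mathcal{T}_S.
```

Language model pretraining fits this template with the unlabeled corpus as the source and the small labeled classification dataset as the target.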
  • Leal, Rafael (2020)
    In modern Natural Language Processing, document categorisation tasks can achieve success rates of over 95% using fine-tuned neural network models. However, so-called "zero-shot" situations, where specific training data is not available, are researched much less frequently. The objective of this thesis is to investigate how pre-trained Finnish language models fare when classifying documents in a completely unsupervised way: by relying only on their general "knowledge of the world" obtained during training, without using any additional data. Two datasets are created expressly for this study, since labelled and openly available datasets in Finnish are very uncommon: one is built using around 5k news articles from Yle, the Finnish Broadcasting Company, and the other from 100 pieces of Finnish legislation obtained from the Semantic Finlex data service. Several language representation models are built, based on the vector space model, by combining modular elements: different kinds of textual representations for documents and category labels, different algorithms that transform these representations into vectors (TF-IDF, Annif, fastText, LASER, FinBERT, S-BERT), and different similarity measures and post-processing techniques (such as SVD and ensemble models). This approach allows a variety of models to be tested. The combination of Annif for extracting keywords and fastText for producing word embeddings out of them achieves F1 scores of 0.64 on the Finlex dataset and 0.73-0.74 on the Yle datasets. Model ensembles are able to raise these figures by up to three percentage points. SVD can bring these numbers to 0.7 and 0.74-0.75 respectively, but these gains are not necessarily reproducible on unseen data. These results are distant from those obtained from state-of-the-art supervised models, but this method is flexible, can be quickly deployed and, most importantly, does not depend on labelled data, which can be slow and expensive to produce.
A reliable way to set the input parameter for SVD would be an important next step for the work done in this thesis.
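The vector space approach above reduces zero-shot classification to a similarity lookup: embed the document and each category label, then pick the closest label. A minimal sketch with cosine similarity (the embedding step, here assumed to produce plain float lists, would come from one of the models listed above):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_classify(doc_vec, label_vecs):
    """Return the label whose embedding is most similar to the document's."""
    return max(label_vecs, key=lambda label: cosine(doc_vec, label_vecs[label]))
```

Because no classifier is trained, adding a new category only requires embedding its label, which is what makes the approach deployable without labelled data.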