Browsing by study line "Language Technology"
-
(2024) Large Language Models (LLMs) demonstrate increasingly impressive capabilities as they grow in size, but these ever-growing models come at the expense of high training, inference, storage, and deployment costs. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a response to the growing cost of performance and have demonstrated success when used with general language models. PEFT methods have also been applied to train models with fewer than one billion parameters on code tasks such as code summarization. However, few studies have compared multiple PEFT approaches when training models on code generation tasks. We investigate the training methods' impact on model performance on code generation tasks by training five model families, ranging from 124 million to 15.5 billion parameters, using four PEFT approaches and regular fine-tuning. We find that the impact of each PEFT method varies depending on model size and on dataset size and quality. Larger models required fewer updated parameters and saw the best performance with prompt-tuning and LoRA approaches, while models smaller than 1.5 billion parameters saw the best results with more parameter updates, such as with full fine-tuning. In addition to differences in performance results, we also find that as model sizes increase, memory savings and training speeds become increasingly similar. Surprisingly, we see a decline in model performance after training large models. We hypothesize that this is due to misalignment with the pre-training data and to sub-optimal training hyperparameters. The results of this study suggest that LoRA, when applied to all linear layers, is an effective and competitive training method for code generation tasks across various model sizes. For models with fewer than 1.5 billion parameters, full fine-tuning should be used for optimal performance if the resources are available; this is not the case for larger models.
We also report all training hyperparameters to aid others in determining the best hyperparameters for their use case. Finally, this study discusses the benefits and criticisms of commonly used metrics, and their impact on evaluating model performance.
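To illustrate why LoRA updates far fewer parameters than full fine-tuning, the following minimal sketch works through the parameter arithmetic for a single linear layer. The layer dimensions and rank are illustrative assumptions, not values reported in the thesis.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 8
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA: {lora} trainable vs. {full} full ({lora / full:.2%})")
```

Because the low-rank factors grow linearly with the layer width while the dense matrix grows quadratically, the trainable fraction shrinks as layers get wider.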
-
Annotating multimodal discourse relations by combining crowdsourcing and natural language processing (2023) Diagrams are a mode of communication that poses challenges for computational processing. The challenges arise from the multimodal nature of diagrams: they combine several types of expressive resources, such as textual elements, connective elements such as arrows and lines, and illustrations, to achieve their communicative purposes. Humans interpret diagrams by judging how these different expressive resources work together to reach the communicative goals set for the diagram. To do that, humans make inferences about the diagram layout and the implicit relations that exist between different parts of the diagram. To build computational methods for diagram understanding, large amounts of data annotated with these implicit relations are required. Traditionally, these types of discourse structure annotations have been produced by experts, due to the difficulty of the task and the requirement that the annotator be familiar with the theoretical framework used for describing discourse relations. The chosen theory for modeling discourse relations in diagrams is Rhetorical Structure Theory, originally developed for modeling textual coherence but applicable to multimodal data as well. This thesis explores the possibility of gathering discourse relation annotations for multimodal diagram data with crowdsourcing, that is, employing naive workers on crowdsourcing platforms to complete annotation tasks for a monetary reward. Adapting the task of discourse relation annotation to be feasible for naive workers has proven challenging in past research concerned with only textual data, and the multimodality of the data adds to the complexity of the task. This thesis presents a novel method for gathering multimodal discourse relation annotations using crowdsourcing and methods of natural language processing.
Two approaches are explored: an insertive annotation task, where the workers are asked to describe the relationship between two diagram elements in their own words, and a multiple-choice task, where the formal definitions of Rhetorical Structure Theory are converted into understandable phrases to annotate with. Natural language processing is used in the first approach to validate the language and structure of the crowdsourced descriptions. The results of the first approach highlight the difficulty of the task: the workers tend to rely heavily on example descriptions shown in the task instructions and have difficulty grasping the differences between the more fine-grained relations. The multiple-choice approach seems more promising, with agreement with expert annotators higher than in previous research concerned with discourse relations in textual data. Manual inspection of the annotated diagrams shows that the disagreement between the crowdworkers and expert annotators is often justifiable: both annotations represent a valid interpretation of the discourse relation. This highlights one of the main challenges of the task, which is the ambiguity of some of the relations. Future work is encouraged to consider this by adopting an approach that is less concerned with a pre-defined set of relations and more interested in how the different discourse relations are actually perceived.
-
(2022) Social media provides huge amounts of potential data for natural language processing, but using this data may be challenging. Finnish social media text differs greatly from standard Finnish, and models trained on standard data may not be able to adequately handle the differences. Text normalization is the process of converting non-standard language into its standardized form. It provides a way both to process non-standard data with standard natural language processing tools and to get more data for training new tools for different tasks. In this thesis I experiment with bidirectional recurrent neural network (BRNN) models and models based on the ByT5 foundation model, as well as the Murre normalizer, to see if existing tools are suitable for normalizing Finnish social media text. I manually normalize a small set of data from the Ylilauta and Suomi24 corpora to use as a test set. For training the models I use the Samples of Spoken Finnish corpus and Wikipedia data with added synthetic noise. The results of this thesis show that there are no existing tools suitable for normalizing Finnish written on social media. There is a lack of suitable data for training models for this task. The ByT5-based models perform better than the BRNN models.
-
(2021) Hypernymy is a relationship between two words, where the hyponym carries a more specific meaning and entails a hypernym that carries a more general meaning. A particular kind of verbal hypernymy is troponymy, where troponyms are verbs that encode a particular manner or way of doing something, such as “whisper” meaning “to speak in a quiet manner”. Recently, contextualized word vectors have emerged as a powerful tool for representing the semantics of words in a given context, in contrast to earlier static embeddings where every word is represented by a single vector regardless of sense. BERT, a pre-trained language model that uses contextualized word representations, achieved state-of-the-art performance on various downstream NLP tasks such as question answering. Previous research identified knowledge of scalar adjective intensity in BERT, but not systematic knowledge of nominal hypernymy. In this thesis, we investigate systematic knowledge of troponymy and verbal hypernymy in the base English version of BERT. We compare the similarity of vector representations for manner verbs and adverbs of interest, to see if troponymy is represented in the vector space. Then, we evaluate BERT’s predictions for cloze tasks involving troponymy and verbal hypernymy. We also attempt to train supervised models to probe vector representations for this knowledge. Lastly, we perform clustering analyses on vector representations of words in hypernymy pairs. Data on troponymy and hypernymy relationships is extracted from WordNet and HyperLex, and sentences containing instances of the relevant words are obtained from the ukWaC corpus. We were unable to identify any systematic knowledge about troponymy and verb hypernymy in BERT. It was reasonably successful at predicting hypernyms in the masking experiments, but a general inability to go in the other direction suggests that this knowledge is not systematic.
Our probing models were unsuccessful at recovering information related to hypernymy and troponymy from the representations. In contrast with previous work that finds type-level semantic information to be located in the lower layers of BERT, our cluster-based analyses suggest that the upper layers contain stronger or more accessible representations of hypernymy.
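The comparison of vector representations mentioned in this abstract can be pictured with a small similarity computation. The sketch below uses cosine similarity, which is an assumption here (the abstract does not name the measure), and the three-dimensional vectors are invented stand-ins for contextual embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for contextual embeddings (invented values).
whisper = [0.9, 0.1, 0.3]
quietly = [0.8, 0.2, 0.4]
loudly = [-0.7, 0.9, 0.1]

# Check whether the manner verb sits closer to the matching adverb.
print(cosine_similarity(whisper, quietly) > cosine_similarity(whisper, loudly))
```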
-
(2022) There are comprehensive requirements in Finland for procurement by any government organization to go through a tendering process where information about each tender is made available not only to vendors and service providers, but to everyone else in Finland as well. This is accomplished through the website Hilma and should make tenders easy to find. Moreover, in Finnish, variance in domain terminology is not thought to be the problem that it is in English. For instance, the last four years of tenders on Hilma never refer to jatkuva parantaminen as toiminnallinen erinomaisuus, whereas “continuous improvement” and “operational excellence” could be used interchangeably in English. And yet, it is considered very difficult for a vendor or service provider to find applicable tenders on Hilma. Rather than lexical variability being the cause, as it might be in English, the differences in concept paradigms between the private and public sectors in Finland pose the challenge. Whereas a taxi company representative would be looking for tenders about transportation services, a public officer could be posting a tender about social equity for the disabled. The second difficulty is that the Hilma search engine is purely Boolean with restrictive string match criteria rather than inviting natural language questions. Finally, the Hilma search engine does not account for Finnish being a highly inflecting and compounding language where single words usually morph instead of taking on adpositions, and where compound words are affixed together without hyphenation. Many information retrieval approaches would look outside the corpus for query expansion terms. Natural language processing might also offer the potential to look for paraphrases in existing parallel corpora on tenders throughout the European Union rather than in Hilma.
However, this thesis focuses on clustering the tenders posted in Finnish on Hilma, applying the comprehensive workflow of the very recent BERTopic package for Python. All documents in each cluster are concatenated, and the highest TF-IDF-scoring words in the concatenated document are slated to be “search extension terms.” If one of the terms were to be entered by a Hilma user, the user would be invited to perform parallel searches with the remaining terms as well. The first main contribution of this thesis is to use state-of-the-art models and algorithms to represent the corpus, reduce the dimensionality of the representations, and hierarchically cluster the representations. Second, this thesis develops analytical metrics to be used in automatic evaluation of the efficacy of the clusterings and in comparisons among model iterations that programmatically remove more and more of the distractions to the clustering that are discovered in the corpus. Finally, this thesis performs case studies on Hilma to demonstrate the efficacy of the search extension terms in generating large numbers of additional useful matches, addressing paradigm-based differences in terminology, morphovariance and affixation.
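The step of concatenating each cluster's documents and keeping the highest TF-IDF-scoring words can be sketched as follows. This is a simplified illustration using plain TF-IDF and whitespace tokenization; it is not BERTopic's own class-based weighting, and the function name and toy tender snippets are invented for the example.

```python
import math
from collections import Counter


def search_extension_terms(clusters, top_n=3):
    """Concatenate each cluster's documents and rank words by TF-IDF across clusters."""
    # One bag of words per cluster (whitespace tokenization is an assumption).
    bags = {name: Counter(" ".join(docs).lower().split()) for name, docs in clusters.items()}
    n = len(bags)
    # Document frequency: in how many concatenated cluster documents a word occurs.
    df = Counter(word for bag in bags.values() for word in bag)
    terms = {}
    for name, bag in bags.items():
        total = sum(bag.values())
        scores = {w: (c / total) * math.log(n / df[w]) for w, c in bag.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        terms[name] = ranked[:top_n]
    return terms


# Toy, invented tender snippets (not taken from Hilma).
clusters = {
    "transport": ["taxi transport services", "school transport services taxi"],
    "cleaning": ["office cleaning services", "school cleaning office"],
}
result = search_extension_terms(clusters, top_n=2)
print(result)
```

Words that appear in every cluster, such as "services" in the toy data, receive a score of zero and therefore never surface as extension terms.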
-
Comparative analysis of prosodic characteristics of dialects of the English language using WaveNet (2023) The purpose of this thesis is to examine the prosodic features of English dialects using WaveNet. The exact goal is to investigate whether the differences in prosody between the dialects are present in the data and the results, and whether the geographical distance between the cities included in the data has any influence on this. Another aim is to see how the prosodic features of the sentence types present in the data and their possible differences are manifested in the data and the results. Prosody is concerned with those characteristics of speech which cover more than just individual sounds. Prosodic features can further be divided into paralinguistic features, such as the rate of speech and pausing, and linguistic features, like intonation. Parameters useful for analysing prosody are fundamental frequency (f0), intensity and voice quality – we are interested in the first two. Fundamental frequency is the rate of vibration of the vocal folds while speaking. Intensity in turn is connected to the changes of air pressure while speaking. The data used for this study is the IViE corpus (Intonational Variation in English), comprising recordings made in nine cities in the British Isles – Belfast, Bradford, Cambridge, Cardiff, Dublin, Leeds, Liverpool, London and Newcastle, with approximately 12 speakers per city. In three of the cities, Bradford, Cardiff and London, the dialect is that of a minority. The part of the corpus chosen for this study is a set of 22 sentences consisting of five sentence types. The analysis was performed using WaveNet, a convolutional neural network. It uses causal convolutions so that each prediction depends only on earlier time steps. In addition to being conditioned on the output of the network itself, it can also be conditioned using embeddings. The WaveNet implementation used here has two embedding layers – target and normalisation embeddings.
Before the analysis the data was pre-processed, and the relevant information concerning the fundamental frequency and intensity was extracted from the sound files. A corresponding *.time file was also created for each of the sound files, with the aim of minimising the influence of possible differences in length between sentences and thus improving the network's ability to recognise the intonation contours correctly. The results are presented in the form of dendrograms, depicting the relationships between the dialects and sentence types – both separately and as a combination of the dialects and sentence types. It was shown that the differences in prosody were in fact manifested in the data for both dialects and sentence types, although not exactly as expected. The geographical proximity did not seem to influence the dialectal similarities as much as was assumed – in addition to other influences, this might also be due to some of the dialects being minority dialects in the cities, and therefore not necessarily as easily comparable to the dialects of the neighbouring area as the majority dialects might have been.
-
(2021) Computer-Assisted Language Learning (CALL) is one of the sub-disciplines within the area of Second Language Acquisition. Clozes, also called fill-in-the-blank exercises, are widely used in language learning applications. A cloze is an exercise where the learner is asked to provide a fragment that has been removed from the text. For language learning purposes, in addition to open-end clozes where one or more words are removed and the student must fill the gap, another type of cloze is commonly used, namely the multiple-choice cloze. In a multiple-choice cloze, a fragment is removed from the text and the student must choose the correct answer from multiple options. Multiple-choice exercises are a common way of practicing and testing grammatical knowledge. The aim of this work is to identify relevant learning constructs for Italian to be applied to automatic exercise creation based on authentic texts in the Revita Framework. Learning constructs are units that represent language knowledge. Revita is a free-to-use online platform that was designed to provide language learning tools with the aim of revitalizing endangered languages, including several Finno-Ugric languages such as North Saami. Later, non-endangered languages were added. Italian is the first majority language to be added in a principled way. This work paves the way towards adding new languages in the future. Its purpose is threefold: it contributes to raising Italian from its beta status towards a full development stage; it formulates best practices for defining support for a new language; and it serves as documentation of what has been done, how, and what remains to be done. Grammars and linguistic resources were consulted to compile an inventory of learning constructs for Italian.
Analytic and pronominal verbs, verb government with prepositions, and noun phrase agreement were implemented by designing pattern rules that match sequences of tokens with specific parts of speech, surface forms and morphological tags. The rules were tested with test sentences that allowed further refining and correction of the rules. The current precision of the 47 rules for analytic and pronominal verbs on 177 test sentences is 100%, and recall is 96.4%. Both precision and recall for the 5 noun phrase agreement rules are 96.0% on the 34 test sentences. Analytic and pronominal verb patterns, as well as noun phrase agreement patterns, were used to generate open-end clozes. Verb government pattern rules were implemented as multiple-choice exercises where one of the four presented options is the correct preposition and the other three are prepositions that do not fit in context. The patterns were designed based on colligations, combinations of tokens (collocations) that are also explained by grammatical constraints. Verb government exercises were generated on a specifically collected corpus of 29074 words. The corpus included three types of text: biography sections from Wikipedia, Italian news articles and Italian language matriculation exams. The last text type generated the most exercises, with a rate of 19 exercises per 10000 words, suggesting that the semi-authentic text best matched the level of verb government exercises because of appropriate vocabulary frequency and sentence structure complexity. Four native language experts, either teachers of Italian as L2 or linguists, evaluated the usability of the generated multiple-choice clozes, which was rated at 93.55%. This result suggests that minor adjustments, i.e. the exclusion of target verbs that cause multiple admissibility, are sufficient to consider verb government patterns usable until the possibility of dealing with multiple-admissible answers is addressed.
The implementation of some of the most important learning constructs for Italian proved feasible with current NLP tools, although quantitative evaluation of precision and recall of the designed rules is needed to evaluate the generation of exercises on authentic text. This work paves the way towards a full development stage of Italian in Revita and enables further pilot studies with actual learners, which will allow learning outcomes to be measured in quantitative terms.
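The multiple-choice verb government exercise described in this abstract, with one correct preposition and three distractors, can be sketched as below. This is a hypothetical illustration and not Revita's implementation: the function name, the sentence and the preposition pool are assumptions, and the sketch picks distractors at random without checking that they really do not fit the context, which is the multiple-admissibility issue the abstract raises.

```python
import random


def make_preposition_cloze(tokens, target_index, distractor_pool, rng):
    """Blank out the preposition at target_index and offer four shuffled options."""
    answer = tokens[target_index]
    # Pick three distractors that differ from the correct preposition.
    candidates = [p for p in distractor_pool if p != answer]
    options = rng.sample(candidates, 3) + [answer]
    rng.shuffle(options)
    text = " ".join("____" if i == target_index else tok for i, tok in enumerate(tokens))
    return {"text": text, "options": options, "answer": answer}


# Illustrative sentence and preposition list; both are assumptions, not Revita data.
tokens = ["Penso", "a", "te", "ogni", "giorno"]
pool = ["a", "di", "da", "in", "con", "su", "per"]
exercise = make_preposition_cloze(tokens, 1, pool, random.Random(0))
print(exercise["text"])
print(exercise["options"])
```

In a real system the target index would come from a pattern rule that matched a verb followed by its governed preposition, rather than being passed in by hand.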
-
(2020) Maps of science, or cartography of scientific fields, provide insights into the state of scientific knowledge. Analogous to geographical maps, maps of science present the fields as positions and show the paths connecting them, which can serve as an intuitive illustration of the history of science or a hint for spotting potential opportunities for collaboration. In this work, I investigate the reproducibility of a method to generate such maps. The idea of the method is to derive representations for the given scientific fields with topic models and then perform hierarchical clustering on these, which in the end yields a tree of scientific fields as the map. The result is found to be unreproducible, as my result obtained on the arXiv data set (~130k articles from arXiv Computer Science) shows a structure inconsistent with the one in the reference study. To investigate the cause of the inconsistency, I derive a second set of maps using the same method and an adjusted data set, which is constructed by re-sampling the arXiv data set to a more balanced distribution. The findings show that the confounding factors in the data cannot account for the inconsistency; instead, it is more likely due to the stochastic nature of the unsupervised algorithm. I also improve the approach by using ensemble topic models to derive representations. It is found that the method to derive maps of science can be reproducible when it uses an ensemble topic model fused from a sufficient number of base models.
-
(2023) Hostility in the player communication of video games (and by extension, mobile games) is a well-documented phenomenon that can have negative repercussions for the well-being of the individual subjected to it, and for society in general. Existing research on detecting hostility in games through machine learning methods is scarce due to the unavailability of data, imbalanced existing data (few positive samples in a large data set), and the challenges involved in defining and identifying hostile communication. This thesis utilizes communication data from the Supercell game Brawl Stars to produce two distinct machine learning models: a support vector classifier and a multi-layer perceptron. Their performance is compared to each other as well as to that of an existing sentiment analysis classifier, VADER. Techniques such as oversampling and using additional data are also used in an attempt to reach better results by improving the balance of the data set. The support vector classifier model was found to have the best performance overall, with an F1 score of 64.15% when used on the pure data set and 65.74% when combined with the SMOTE oversampling algorithm. The thesis includes an appendix with a list of the words that were found to have the strongest influence on the hostile/non-hostile classification.
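The core idea of SMOTE oversampling, creating a synthetic minority sample by interpolating between an existing sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration under assumed toy data, not the implementation used in the thesis; the function name and feature vectors are hypothetical.

```python
import random


def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    base = rng.choice(minority)
    # Sort the other minority samples by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Invented 2-dimensional feature vectors for the minority (hostile) class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(1))
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, so the minority class grows without exact duplication of existing rows.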
-
(2024) This thesis investigates relative clauses in Russian and Belarusian, focusing on syntactic structures, usage patterns, and the distribution of linguistic elements within relative clauses. The study employs a corpus-based approach, analyzing data from the Universal Dependencies (UD) Russian SynTagRus corpus and the Belarusian UD HSE treebank. The research explores various aspects of relative clauses, including the position of the relative clause, the syntactic function of the head, the function of the relativizer, and the part-of-speech distribution within relative clauses. Through quantitative analysis, the study identifies consistent patterns and similarities between Russian and Belarusian relative clauses. Key findings include the predominance of left-headed relative clauses in both languages, with rare instances of right-headed structures primarily occurring in specific linguistic contexts such as correlatives. Nominal subjects emerge as the most frequent syntactic function of the head in both languages, reflecting universal principles governing the syntactic organization of relative clauses. The analysis of relativizers reveals that relative pronouns are the predominant type, with "который" and "які" being the most prevalent in Russian and Belarusian, respectively. Furthermore, both languages exhibit similar distributions of relative adverbs and conjunctions, indicating a shared syntactic strategy for introducing and connecting subordinate clauses to main clauses. Regarding part-of-speech distribution within relative clauses, nouns and verbs emerge as the most prevalent categories, highlighting their role in specifying entities and actions. While variations exist, particularly with certain adjective-verb and adverb-noun relations, the overall patterns remain largely similar between Russian and Belarusian.
-
(2020) Universal sentence representations and multilingual language modelling are hot topics in language technology, specifically in the area concerned with natural language understanding. A sentence embedding is a numerical representation of a sequence of words corresponding to a whole phrase or sentence, specifically as the output of an encoder in machine learning. These representations are needed for automatic tasks in language technology that require an understanding of the meaning of a whole sentence, as opposed to combinations of the meanings of individual words. Such tasks include, for example, natural language inference (whether a pair of sentences is logically related) and sentiment analysis. Universality refers to encoded meaning that is general enough to benefit other related tasks, such as classification. There is a call for clearer consensus on the strategies used to assess the quality of these embeddings, either by directly examining their linguistic properties or by using them as features (independent variables) in related models. Because it is costly to create high-quality resources and maintain sophisticated systems for all languages used in the world, there is also great interest in scaling modern systems to low-resource languages. The idea is a so-called transfer of knowledge, not only between different tasks but also between different languages. Although the need for cross-lingual transfer methods is acknowledged in the research community, evaluation tools and benchmarks are still at an early stage. SentEval is an existing tool for evaluating sentence embeddings with special emphasis on their universality. The aim of this thesis project is to attempt to extend this tool to support simultaneous evaluation on new tasks that cover several different languages.
The evaluation approach builds on the strategy of letting encoded sentences serve as features in so-called downstream tasks and observing whether the results improve. A modern multilingual model based on the so-called transformer architecture is evaluated on an established inference task as well as on a new emotion detection task, both of which include data in a variety of languages. Although the practical implementation remained largely experimental, some tentative results are reported in this thesis.
-
(2023) Monolingual paraphrases are semantically equivalent sentences in a single language that transmit the same meaning but do not necessarily use the same words. Also, the same word can have different meanings in different contexts. Understanding the meaning of a text behind its words is essential for many natural language processing and deep-learning tasks such as machine translation, plagiarism detection, question-answering, and information extraction. Paraphrases have been studied extensively, mainly from an English-only or sometimes multilingual perspective. There are not many studies about paraphrase detection in Finnish and even fewer about detecting paraphrases between different registers of the Finnish language, such as Standard Finnish and Easy Finnish. In this thesis, three different pre-trained sentence-BERT models are tested in a paraphrase detection task. The aim of the task is to find paraphrase pairs and triples between three distinct registers of the Finnish language: Standard Finnish, Easy Finnish, and Colloquial Finnish. The data consists of Yle News articles in Standard and Easy Finnish, mostly from the year 2014, as well as Ylilauta online discussions. The applied BERT models are the paraphrase-multilingual-MiniLM-L12-v2 sentence-transformers model and the FinBERT model. The former is also fine-tuned with data from the Finnish paraphrase corpus. According to the manual evaluation based on the models' precision, the fine-tuned model outperforms the other two. The same three models are tested on two different balanced test sets of 50 paraphrase sentence pairs and 50 non-paraphrase sentence pairs. The FinBERT model reaches the best F1 score in this research setting. Alongside the precision and the F1 score, the average sentence lengths and the repetitiveness of the paraphrase sentence pairs and triples are compared and discussed.
The FinBERT model detected the shortest sentences and the most repetition, but its total number of detected sentence pairs was also the highest. As a result of this study, a new Easy Finnish - Standard Finnish paraphrase corpus is collected to facilitate further studies in paraphrase detection or simplification in Finnish. The corpus is presented in this thesis. It contains 5881 sentence pairs, of which approximately 98% can be assumed to be true paraphrases according to the manual evaluation of randomly selected sentence pairs. The corpus is created by using the fine-tuned paraphrase-multilingual-MiniLM-L12-v2 sentence-transformers model, and it includes paraphrase sentence pairs from Yle News articles in Easy Finnish and in Standard Finnish from the years 2014-2018.
-
(2024) The topic of the thesis is integrating Natural Language Processing (NLP) and Computer-Assisted Language Learning (CALL) into teacher-led Spanish instruction. The aim is to present a development process and a CALL application to be used to study learning results. The study seeks answers to questions on how an NLP-based CALL application can be used to investigate learning, and how its usage rate and usage affect learning outcomes. The study also addresses usability, asking how usable the students consider the application to be and what kind of open feedback they give on it. In total, 108 secondary school students and four teachers from the Helsinki Metropolitan region participated in the study, in which a gamified application creates a competitive setting between five teaching groups. The students use the application to solve textbook-based cloze exercises that are generated using a combination of a neural language model and rule-based exercise creation. The vocabulary tests measure learning by selecting test words according to the usage analytics so that they come from outside the cloze fields of exercise sentences. The students who used the application were divided into two groups: those (N=26) who encountered the test words in the application and those (N=31) who did not. The results are compared to those of the control group (N=8), who did not use the application. The results show that the group encountering the test words performed 11.39 percentage points better than the control group. Interestingly, the students who did not encounter the words performed 25.21 percentage points better in tests than the control group. Despite the positive results, statistical analysis revealed a significant relationship only between usage rate and encountering the test words, not between the test words and the vocabulary test results.
This may be explained by the different sizes of the groups, the random way in which the application selected exercises, and the fact that the students did not encounter the words often enough. The method requires many enhancements before it can be utilised on a larger scale. The students evaluated the application's usability as good, and they left 18 open feedback responses, which were mostly positive.
-
(2024) Topic modelling is an unsupervised machine learning method that can be used for extracting topics from a collection of documents. Topic models discover shared themes across the collection and return, as their output, a distribution over words for each topic and a distribution over topics for each document. This thesis introduces and compares three different topic modelling techniques and their evaluation methods. Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) and a type of neural topic model called Contextual Topic Model are presented and their distinguishing features are described. Then, intrinsic and extrinsic evaluation methods and metrics of the topic models are described. Intrinsic metrics such as coherence describe how interpretable the created topics are for humans. Coherence can be approximated by metrics that can be calculated computationally, which allows iterative optimisation of topic models. Finally, to complete the survey part, visualisation tools and libraries are discussed. This thesis applies these three different modelling techniques to the domain of mobile game descriptions and seeks answers to two research questions: (1) To what extent can topic modelling be used to identify latent game genres or game features? (2) How well do the genres extracted from the text descriptions correlate with the existing categories in the dataset? First, a dataset consisting of 13,000 game descriptions and the associated metadata is constructed, and then the three different topic modelling techniques are applied. All of the models are optimised towards the best coherence metric and the results are compared. The best results, i.e. the most coherent topics, are acquired from the NMF topic model, although all techniques show promise of being effective as long as they are properly utilised.
In answer to the research questions, topic modelling is shown to help extract information about mobile games that correlates with the existing category information in the dataset and can be used to identify new facets regarding the game settings and themes.
-
(2023)Recent progress in natural language generation tools has raised concerns that the tools are being used to generate neural fake news. Fake news impacts our society in many ways: it has been used for monetization schemes and to tip political elections, and it has been shown to have a severe effect on people's mental health. Accordingly, being able to detect neural fake news and counter its spread is becoming increasingly important. The aim of the thesis is to explore whether there are linguistic features that can help detect neural news. Using Grover, a neural language model, I generate a set of articles based on both real and fake human-written news. I then extract a range of linguistic features, previously found to differ between human-written real and fake news, to investigate whether the same features can be used to detect Grover-written news, whether there are features that can differentiate between Grover-written news based on different source material, and whether, based on these features, Grover-written news is more similar to real or fake news. The data consists of 64 articles, of which 16 are real news sourced from reputable news sites and 16 are fake news articles from the ISOT Fake News Dataset. The other 32 articles are written by Grover, with either the real news or the fake news articles as source text (16 each). A broad range of linguistic features are extracted from the article bodies and titles to capture the style, complexity, and sentiment of the articles. The features measured include punctuation, quotes, syntax tree depths, and emotion counts. The results show that the same features that have been found to differ between real and fake news can, with some limitations, be used to discern Grover Fake News (Grover-written articles based on fake news). However, Grover Real News (Grover-written articles based on real news) cannot reliably be discerned from real news.
Moreover, while the features measured do not provide a reliable method for discerning Grover Real News and Grover Fake News from each other, there are still noticeable differences between the two groups. Grover Fake News can be differentiated from real news, but the texts can be considered of better quality than fake news. These findings also align with previous research, showing that Grover is adept at re-writing misinformation and making it more credible to readers, and that feature extraction alone cannot reliably distinguish neural fake news, so human evaluation also needs to be considered.
-
(2023)Neural machine translation (NMT) has become a mainstream method for the machine translation (MT) task. Despite their remarkable progress, NMT systems still face many challenges in low-resource scenarios. Common approaches to the data scarcity problem include exploiting monolingual data or parallel data in other languages. In this thesis, transformer-based NMT models are trained on Finnish-Simplified Chinese, a language pair with limited parallel data, and the models are improved using various techniques such as hyperparameter tuning, transfer learning and back-translation. The best NMT system is an ensemble model that combines different single models. The results of our experiments also show that different hyperparameter settings can cause a performance gap of up to 4 BLEU points. The ensemble model shows a 35% improvement over the baseline model. Overall, the experiments suggest that hyperparameter tuning is crucial for training vanilla NMT models. Back-translation offers more benefits for model improvement than the transfer learning method. The results also show that adding sampling in back-translation does not improve NMT model performance in this low-data setting. The findings may be useful for future research on low-resource NMT, especially the Finnish-Simplified Chinese MT task.
-
(2024)Language tags are additional tokens in the source corpus that indicate the language of the corresponding sentence in the target corpus. Like all words, they receive their own numerical vector representations in the translation model, which can then be used for various experiments. This work explores the use of language tag transformations in a multilingual translation model to produce mixed-language output, aiming to create an "intermediate" language variant. It delves into the nuances of interpolating between multiple languages via their embeddings and the language generation characteristics at these boundary regions. The experiments in this work were conducted with two multilingual translation models, English to Slavic languages and Slavic-to-Slavic languages, with target languages that are represented in both models and whose embeddings are compared in vector space. The study investigates the conditions under which maximum language mixing occurs, examining how factors such as the source language, target languages, and script influence the process. It analyzes outputs from pre-trained models and trains several models with varied features to understand how these elements affect the potential for target language mixing during interpolation. Because no reference-based automatic evaluation was available, the degree of mixing was assessed using a language identification model. The study also conducts a detailed qualitative linguistic analysis of the mixed generated output, examining the level and extent to which the grammar and lexicon of several languages can be mixed. Findings indicate that the extent and location of mixing vary according to the source and target languages. Notably, languages that have similar scripts but differ grammatically yielded the most interesting results, suggesting that standardizing the script across training data could enhance mixing quality.
Several smaller multilingual translation models were trained from scratch, incorporating features such as alternative word segmentation (character-based) and script tags, enabling control over the script, not just the language of the output. For the smaller models, despite significantly less data, the interpolation showed some trends in common with similar experiments on larger models, for example the influence of the script. Additionally, introducing an extremely small number of alternative examples into the training corpus of the model noticeably affected its perception of the script category. The results suggest that mixing or averaging multiple language variants is viable with a uniform script, effective segmentation/encoding, sufficient data, and in-depth exploration of the spaces between embeddings to identify the most balanced and optimal interlanguage variant.
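The interpolation between language tag embeddings described in this abstract can be illustrated with a minimal sketch. The abstract does not state which interpolation scheme the thesis uses, so plain linear interpolation is assumed here; the function name `interpolate_tags` and the toy two-dimensional vectors are hypothetical and are not taken from the thesis or from any translation toolkit.

```python
def interpolate_tags(vec_a, vec_b, alpha):
    """Linearly interpolate between two language tag embeddings.

    alpha = 0.0 returns vec_a, alpha = 1.0 returns vec_b, and values in
    between give a point on the straight line connecting the two vectors.
    (Assumption: the thesis may use a different interpolation scheme.)
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("embeddings must have the same dimensionality")
    return [(1 - alpha) * a + alpha * b for a, b in zip(vec_a, vec_b)]


# Toy embeddings for two hypothetical language tags (invented values).
tag_lang_a = [0.0, 2.0]
tag_lang_b = [4.0, 0.0]

# Sweep from one language tag towards the other in five steps.
for step in range(5):
    alpha = step / 4
    print(alpha, interpolate_tags(tag_lang_a, tag_lang_b, alpha))
```

In an actual experiment, each interpolated vector would replace the language tag embedding at the model input, and the output would then be checked for mixing, for example with a language identification model as the abstract describes.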
-
(2024)Communicative efficiency principles are an area of great interest in linguistics research. One line of analysis seeks to determine how the potentially infinite outputs of human language can be formed within the bounds of limited memory. One way in which the cognitive burden of a sentence is measured is through dependency distances. In this thesis, the idea that morphological marking could be used to alleviate communicative memory burdens was evaluated using token-based quantitative typological methods to extract tendencies of language use. Large, multilingual, labeled corpora were parsed to find and evaluate more than 300,000 simple transitive sentences for patterns of morphological agreement and case-marking in relation to dependency distances. No significant, meaningful, cross-linguistic correlation was found between morphological agreement and dependency distances when examined in usual patterns of sentence construction. Nor was a correlation found to suggest that marking would allow for longer dependencies in exceptional circumstances, indicating that marking was not of any assistance in alleviating memory burdens. Preliminary evidence was discovered that may suggest an inverse correlation between agreement and dependency distance, which calls for future work on whether the process of ensuring agreement increases cognitive burden.
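As a rough illustration of the dependency distance measure mentioned in this abstract, the sketch below assumes the common definition: the absolute difference in linear position between a word and its syntactic head. The abstract does not give the thesis's exact definition, and the function names and the example parse are hypothetical. Head indices follow the 1-based convention used in CoNLL-U style treebanks, with 0 marking the root.

```python
def dependency_distances(heads):
    """Return the dependency distance of every non-root token.

    heads[i] is the 1-based position of the head of token i + 1;
    a head of 0 marks the root, which has no incoming dependency.
    """
    return [abs(head - position)
            for position, head in enumerate(heads, start=1)
            if head != 0]


def mean_dependency_distance(heads):
    """Average dependency distance of a sentence (0.0 if it has no dependencies)."""
    distances = dependency_distances(heads)
    return sum(distances) / len(distances) if distances else 0.0


# Hypothetical parse of "The dog chased the cat":
# The -> dog, dog -> chased, chased = root, the -> cat, cat -> chased
heads = [2, 3, 0, 5, 3]
print(dependency_distances(heads))      # [1, 1, 1, 2]
print(mean_dependency_distance(heads))  # 1.25
```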
-
(2021)This master's thesis deals with multilingual named entity recognition. The thesis tests two approaches to multilingual named entity recognition: transferring annotated data to other languages, and creating a multilingual model. In addition, the two approaches are combined. The aim is to find methods that allow named entity recognition to be performed reliably for smaller languages as well, for which annotated named entity recognition datasets are not widely available. Models are trained and tested in four languages: Finnish, Estonian, Dutch and Spanish. In the first method, annotated data is transferred from one language to another using a multilingual parallel corpus, and the resulting data is used to train a neural-network-based machine learning model. The second method uses a multilingual BERT model. The model is trained on annotated corpora that are combined into a multilingual training set. In the third method, the two previous methods are combined, and data transferred from one language to another is used to train the multilingual BERT model. All three approaches are tested on an annotated test set for each language, and the results are compared with one another. The method in which a multilingual BERT model was built achieved clearly the best results in named entity recognition. The neural network models trained on annotations transferred from one language to another obtained clearly weaker results. Training the BERT model on transferred annotations also produced weak results. Transferring annotations from one language to another proved challenging, and the resulting data contained errors. The weak results were also partly due to the training data and the test data belonging to different genres. According to the thesis, the multilingual BERT model is the best-performing of the tested methods, and it is also suitable for languages for which little annotated data is available.
-
(2022)Automatic question answering and question generation are two closely related natural language processing tasks. Both have been studied for decades, and both have a wide range of uses. While systems that can answer questions formed in natural language can help with all kinds of information needs, automatic question generation can be used, for example, to automatically create reading comprehension tasks and improve the interactivity of virtual assistants. These days, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets. So far, no models that can answer or generate questions purely in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently large dataset suitable for question answering or question generation fine-tuning are required. Although some suitable models that have been pre-trained with Finnish or multilingual data are already available, a major bottleneck is the lack of annotated data needed for fine-tuning the models. In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of an existing dataset (SQuAD) and automatic normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use monolingual BERT and GPT-2 models as well as a multilingual BERT model.
The results show that the transformer architecture is also well suited to Finnish question answering and question generation. They also indicate that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models which have been pre-trained with only Finnish data. The fine-tuned multilingual BERT models come close, whereas fine-tuned GPT-2 models are generally found to underperform. The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.