
Browsing by Subject "natural language processing"


  • Koppatz, Maximilian (2022)
    Automatic headline generation has the potential to significantly assist editors charged with headlining articles. Approaches to automation in the headlining process range from tools that serve as creative aids to complete end-to-end automation. The latter is difficult to achieve, as journalistic requirements imposed on headlines must be met with little room for error, and the requirements depend on the news brand in question. This thesis investigates automatic headline generation in the context of the Finnish newsroom. The primary question I seek to answer is how well the current state of text generation using deep neural language models can be applied to the headlining process in Finnish news media. To answer this, I have implemented and pre-trained a Finnish generative language model based on the Transformer architecture. I have fine-tuned this language model for headline generation as autoregression of headlines conditioned on the article text. I have designed and implemented a variation of the Diverse Beam Search algorithm, with additional parameters, to generate a diverse set of headlines for a given text. The evaluation of the generative capabilities of this system was done with real-world usage in mind. I asked domain experts in headlining to evaluate a generated set of text-headline pairs; the task was to accept or reject the individual headlines on key criteria. The survey responses were then analyzed quantitatively and qualitatively. Based on the analysis and feedback, the model can already be useful as a creative aid in the newsroom, despite being far from ready for automation. I have identified concrete directions for improvement based on the most common error types, which provide interesting future work.
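The abstract above mentions generating a diverse set of candidate headlines with a variation of Diverse Beam Search. As a rough illustration of the underlying idea (not the thesis's actual implementation), the sketch below runs beam search in groups, where later groups are penalized for picking tokens that earlier groups already chose at the same step. The vocabulary and the toy scoring function are invented stand-ins for the thesis's Transformer model.

```python
import math

# Toy stand-in for a Transformer language model: given a token sequence,
# return log-probabilities for the next token. Vocabulary and distribution
# are invented for illustration only.
VOCAB = ["uutinen", "otsikko", "malli", "<eos>"]

def toy_logprobs(seq):
    base = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]
    shift = len(seq) % len(VOCAB)
    return {tok: base[(i + shift) % len(base)] for i, tok in enumerate(VOCAB)}

def diverse_beam_search(logprob_fn, num_groups=2, group_size=2,
                        max_len=4, diversity_penalty=1.0):
    """Beam search with a Hamming-style diversity penalty between groups."""
    # Each group holds (sequence, cumulative log-prob) beams.
    groups = [[((), 0.0)] for _ in range(num_groups)]
    for _ in range(max_len):
        chosen_at_step = []  # tokens already picked by earlier groups this step
        new_groups = []
        for beams in groups:
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == "<eos>":  # finished beams stay frozen
                    candidates.append((seq, score))
                    continue
                for tok, lp in logprob_fn(seq).items():
                    # Penalize tokens that earlier groups chose at this step.
                    penalty = diversity_penalty * chosen_at_step.count(tok)
                    candidates.append((seq + (tok,), score + lp - penalty))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:group_size]
            chosen_at_step.extend(seq[-1] for seq, _ in beams if seq)
            new_groups.append(beams)
        groups = new_groups
    return [seq for beams in groups for seq, _ in beams]

headlines = diverse_beam_search(toy_logprobs)
```

The penalty term is what pushes the groups toward different headline candidates; the thesis's "additional parameters" presumably tune this trade-off between likelihood and diversity.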
  • Leppämäki, Tatu (2022)
    Ever more data is available and shared through the internet. These big data masses often have a spatial dimension and can take many forms, one of which is digital text, such as articles or social media posts. The geospatial links in these texts are made through place names, also called toponyms, but traditional GIS methods are unable to deal with such fuzzy linguistic information. This creates the need to transform linguistic location information into an explicit coordinate form. Several geoparsers have been developed to recognize and locate toponyms in free-form texts: the task of these systems is to be a reliable source of location information. Geoparsers have been applied to topics ranging from disaster management to literary studies. The major language of study in geoparser research has been English, and geoparsers tend to be language-specific, which threatens to leave the experiences studied and expressed in smaller languages unexplored. This thesis seeks to answer three research questions related to geoparsing: What are the most advanced geoparsing methods? What linguistic and geographical features complicate this multi-faceted problem? And how should the reliability and usability of geoparsers be evaluated? The major contributions of this work are an open-source geoparser for Finnish texts, Finger, and two test datasets, or corpora, for testing Finnish geoparsers. One of the datasets consists of tweets and the other of news articles. All of these resources, including the relevant code for acquiring the test data and evaluating the geoparser, are shared openly. Geoparsing can be divided into two sub-tasks: recognizing toponyms amid text flows and resolving them to the correct coordinate location. Both tasks have recently turned to deep learning methods and models, in which the input texts are encoded as, for example, word embeddings. Geoparsers are evaluated against gold-standard datasets in which toponyms and their coordinates are marked. Performance is measured with equivalence-based metrics for toponym recognition and distance-based metrics for toponym resolution. Finger uses a toponym recognition classifier built on a Finnish BERT model and a simple gazetteer query to resolve the toponyms to coordinate points. The program outputs structured geodata with the input texts, the recognized toponyms, and their coordinate locations. While the datasets represent different text types in terms of formality and topics, there is little difference in performance when evaluating Finger against them. The overall performance is comparable to that of geoparsers for English texts. Error analysis reveals multiple error sources, caused either by the inherent ambiguity of the studied language and the geographical world, or by the processing itself, for example by the lemmatizer. Finger can be improved in multiple ways, such as refining how it analyzes texts and creating more comprehensive evaluation datasets. Similarly, the geoparsing task should move towards more complex linguistic and geographical descriptions than just toponyms and coordinate points. Finger is not, in its current state, a ready source of geodata. However, the system has the potential to be the first step for geoparsers for Finnish, and it can be a stepping stone for future applied research.
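The two-stage pipeline described above (toponym recognition, then gazetteer-based resolution to coordinates) can be sketched in a few lines. This is not Finger's implementation: the real system uses a Finnish BERT NER model and lemmatization, while here a plain dictionary match stands in for both, and the gazetteer entries are illustrative coordinates, not thesis data.

```python
# Tiny stand-in gazetteer: place name -> (latitude, longitude).
# Entries are illustrative, not taken from the thesis.
GAZETTEER = {
    "helsinki": (60.1699, 24.9384),
    "tampere": (61.4978, 23.7610),
    "oulu": (65.0121, 25.4651),
}

def recognize_toponyms(text):
    """Toponym recognition: return tokens found in the gazetteer (toy NER)."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t in GAZETTEER]

def resolve(toponyms):
    """Toponym resolution: map each recognized name to a coordinate point."""
    return {t: GAZETTEER[t] for t in toponyms}

def geoparse(text):
    """Full pipeline: structured geodata from free-form text."""
    return {"text": text, "toponyms": resolve(recognize_toponyms(text))}

result = geoparse("Junalla Helsinki, sitten Tampere.")
```

A dictionary lookup like this fails on inflected Finnish forms ("Helsingissä"), which is exactly why Finger needs a lemmatizer before the gazetteer query, and why the abstract lists the lemmatizer as one error source.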
  • Huhtilainen, Heli (2023)
    This thesis studies the application of language models to improve search in an online shop specialising in the wholesale of building-trade products and services. The first aim was to determine whether a Finnish language model could capture the meaning behind the search query words to improve the match between queries and product descriptions. Secondly, it was investigated whether the model could be trained to recognise which products users wanted to find with the search terms they used. Finally, it was investigated whether the model could be used for search ranking. Three models were trained using FinBERT as a model checkpoint, with domain-specific product and clickthrough data for fine-tuning. The first two models were trained to classify online store products into product categories. The task was completed with an F-measure of 0.98 for the model with 55 target categories and 0.81 for the model with 762 target categories. The third model was trained to estimate relevance probabilities for search query-product pairs. The model generally judged more products relevant than the current search engine solution does. The F-measure for the model was 0.90, and in qualitative evaluation the model's predictions made sense semantically. The main restriction on the practical use of the third model for search ranking is inference speed: predictions would need to be made faster to rank a large number of products.
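The third model described above scores query-product pairs with a relevance probability and could, in principle, rank products by that score. The sketch below shows the shape of such a ranking step; a simple token-overlap heuristic stands in for the fine-tuned FinBERT classifier, and the catalog entries and threshold are invented for illustration.

```python
def relevance_probability(query, description):
    """Stand-in scorer: fraction of query tokens found in the description.
    In the thesis this role is played by a fine-tuned FinBERT classifier."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rank_products(query, products, threshold=0.5):
    """Score each product for the query, drop low-probability pairs,
    and rank the rest by predicted relevance (highest first)."""
    scored = [(name, relevance_probability(query, desc))
              for name, desc in products.items()]
    relevant = [(n, p) for n, p in scored if p >= threshold]
    return sorted(relevant, key=lambda x: x[1], reverse=True)

PRODUCTS = {  # hypothetical catalog entries
    "A100": "kipsilevy 13 mm vakio",
    "B200": "kipsilevy 13 mm kosteussuojattu",
    "C300": "betoniharkko 200 mm",
}

ranking = rank_products("kipsilevy 13 mm", PRODUCTS)
```

The structure also makes the abstract's closing point concrete: ranking requires one scorer call per candidate product, so a slow per-pair BERT inference quickly becomes the bottleneck for a large catalog.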