
Browsing by Subject "BERT"


  • Jokinen, Olli (2024)
    The rise of large language models (LLMs) has revolutionized natural language processing, particularly through transfer learning and fine-tuning paradigms that enhance the understanding of complex textual data. This thesis builds on the concept of fine-tuning to improve the understanding of Finnish Wikipedia articles. Specifically, a BERT-based language model is fine-tuned to create high-quality document representations from Finnish texts. The learned representations are applied to downstream tasks, where the model's performance is evaluated against baseline models. This thesis draws on the SPECTER paper, published in 2020, which introduced a training framework for fine-tuning a general-purpose document embedder. SPECTER was trained with a document-level objective that leveraged document link information. Originally, SPECTER was designed for scientific articles and utilized citations between them: training instances consisted of triplets of query, positive, and negative papers, with the aim of capturing the semantic similarity of the documents. This work extends the SPECTER framework to Finnish Wikipedia data. Where scientific articles have citations, Wikipedia's cross-references are used to build a document graph that captures the relatedness between articles. Additionally, Wikipedia is publicly available as a full data dump, making it an attractive dataset for this thesis. One objective is to demonstrate the flexibility of the SPECTER framework on a new dataset with a networked structure similar to that of scientific articles. The fine-tuned model can be used as a general-purpose tool for various tasks and applications; in this thesis, its performance is measured on topic classification and cross-reference ranking. The Transformer-based language model produces fixed-length embeddings, which serve as features in the topic classification task and as vectors for measuring L2 distances between articles in the cross-reference ranking task. This thesis shows that the proposed model, WikiSpecter, optimized with a document-level objective, outperformed baseline models in both tasks. The performance indicates that Finnish Wikipedia provides relevant cross-references that help the model capture relationships across a range of topics.
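    To make the training objective concrete, the sketch below implements a SPECTER-style triplet margin loss over BERT [CLS] document embeddings, with L2 distance as in the abstract. It is a minimal sketch, not the thesis code: the checkpoint name TurkuNLP/bert-base-finnish-cased-v1, the margin value, and the toy article texts are assumptions for illustration.

```python
# Minimal sketch of a SPECTER-style triplet objective over BERT embeddings.
# Assumes PyTorch and Hugging Face transformers; the Finnish BERT checkpoint
# below is an assumption standing in for whatever encoder the thesis used.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
encoder = AutoModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def embed(texts):
    """Fixed-length document embedding: the [CLS] vector of the final layer."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0, :]

def triplet_loss(query, positive, negative, margin=1.0):
    """Pull linked (query, positive) articles together and push unlinked
    negatives apart, measured with L2 distance, as in SPECTER."""
    d_pos = torch.norm(embed(query) - embed(positive), dim=1)
    d_neg = torch.norm(embed(query) - embed(negative), dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# One toy step: an article, an article it cross-references, and a random one.
loss = triplet_loss(["Helsingin yliopisto on Suomen suurin yliopisto."],
                    ["Suomen yliopistot ovat julkisia tutkimuslaitoksia."],
                    ["Jääkiekon MM-kilpailut pelataan vuosittain."])
loss.backward()  # gradients flow into the BERT encoder during fine-tuning
```

    The same embed function also covers the evaluation setup the abstract mentions: the fixed-length vectors can be fed to a topic classifier as features, and candidate cross-references can be ranked by ascending L2 distance to the query article.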
  • Tulijoki, Juha-Pekka (2024)
    A tag is a freely chosen keyword that a user attaches to an item. Offering a simple, cheap, and natural way to describe content, tagging has become popular in contemporary web applications. The tag genome is a data structure that contains item-tag relevance scores, i.e., numbers on a continuous scale from 0 to 1 indicating how relevant a tag is for an item. For example, the tag romantic comedy has a relevance score of 0.97 for the movie Love Actually. With sufficient data, a tag genome dataset can be constructed for any domain; to the best of available knowledge, such datasets currently exist for movies and books. The tag genome for movies is used in a movie recommender and for various purposes in recommender systems research, such as detecting filter bubbles and serendipity. Creating a diverse tag genome dataset requires an effective machine learning solution, as manual assessment of item-tag relevance scores is impractical. The current state-of-the-art solution, called TagDL, feeds features extracted from user-generated tags, reviews, and ratings into a multilayer perceptron that predicts the item-tag relevance scores. This study aims to enhance TagDL by extracting additional features from embeddings of the textual content, namely tags, user reviews, and item titles, using Bidirectional Encoder Representations from Transformers (BERT). The results show that features based on BERT embeddings can have a positive impact on item-tag relevance score prediction. However, the improvement does not generalize to both tag genome datasets: only the movie dataset benefits. This may indicate that the new features have a stronger impact when less training data is available, as is the case for the movie dataset. Finally, this thesis discusses ideas for future work and implementation possibilities.
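    To illustrate the prediction step the abstract describes, the sketch below trains a small multilayer perceptron that maps a concatenated feature vector, hand-crafted features plus BERT embeddings of the tag, review, and title text, to a relevance score in [0, 1]. The layer sizes, the feature count of 30, and the 768-dimensional embeddings are hypothetical; the real TagDL architecture and feature set differ in detail.

```python
# Minimal sketch of a TagDL-style relevance regressor (hedged; hypothetical
# dimensions). Each input row represents one (item, tag) pair: assumed
# 30 hand-crafted features plus three 768-dim BERT [CLS] embeddings
# (tag text, user review, item title) concatenated together.
import torch
import torch.nn as nn

class RelevanceMLP(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # squash to a 0..1 relevance score
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

n_features = 30 + 3 * 768            # hand-crafted + 3 BERT embeddings
model = RelevanceMLP(n_features)

features = torch.randn(8, n_features)  # stand-in batch of (item, tag) pairs
target = torch.rand(8)                 # stand-in ground-truth relevance scores
loss = nn.MSELoss()(model(features), target)
loss.backward()                        # one training step's gradients
```

    The design choice mirrors the abstract: the BERT embeddings are used only as additional input features to the perceptron, so the encoder itself is not fine-tuned in this setup.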