
Browsing by Subject "semantics"


  • Venekoski, Viljami (2016)
    Advances in computational linguistics have made analyzing large quantities of text data a more feasible task than ever before. In particular, recent distributional language models hold the promise of effective semantic analysis at a low computational cost. Semantics, however, is a multifaceted phenomenon, and although various language model architectures have been presented, there is relatively little research evaluating the semantic validity of such models. The aim of this research is to evaluate the semantic validity of different distributional language models, particularly as tools for representing Finnish language online text data. The models and methods are evaluated based on their performance in three empirical studies, each estimating a different aspect of semantic representation. The language models in the studies were built using the word2vec architecture. The models were trained on approximately 2.6 billion tokens from the Suomi24 corpus of Finnish language social media discussions. In total, 18 models were built, each with a different combination of feature processing methods. For Study I, a resource consisting of 300 similarity ratings for word pairs from 55 human annotators was collected. This resource was used as an evaluation task by comparing model-estimated similarity scores to the human-rated similarity judgments. Study II investigated relational semantics as an evaluation method, operationalized in the form of an analogy task, for which a Finnish language resource is presented. In Study III, the language models were evaluated based on their performance in classifying Suomi24 messages into their respective topics. The results of the studies indicate that each presented evaluation task is a sufficiently reliable method for estimating the semantic validity of a language model. In turn, distributed language models are shown to be able to represent semantics given morphologically rich yet fragmentary Finnish language social media data. Feature processing methods are shown to increase the semantic accuracy of language models in most cases, but only to a limited extent. If evaluated as valid, semantic language technologies are proposed to hold widespread applicability across scientific as well as commercial fields.
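
    The comparison described for Study I, between model-estimated similarity scores and human similarity ratings, can be illustrated with a minimal sketch. The abstract does not say which similarity or correlation measure was used, so cosine similarity and Spearman rank correlation are assumptions here. The word vectors and ratings are hypothetical toy values, not data from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical 3-dimensional word vectors (real models use far more dimensions).
vectors = {
    "kissa": [0.9, 0.1, 0.0],  # cat
    "koira": [0.8, 0.2, 0.1],  # dog
    "auto": [0.0, 0.9, 0.4],   # car
    "bussi": [0.1, 0.8, 0.5],  # bus
}

# Hypothetical mean human similarity ratings for word pairs.
human_ratings = {
    ("kissa", "koira"): 8.5,
    ("auto", "bussi"): 7.9,
    ("kissa", "auto"): 1.2,
    ("koira", "bussi"): 1.5,
}

pairs = list(human_ratings)
model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
human_scores = [human_ratings[pair] for pair in pairs]

# Higher rho means the model orders the pairs more like the annotators did.
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

    A rank correlation is chosen for this sketch because it compares only the ordering of the pairs, so the model scores and the human ratings do not need to share a scale.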