Browsing by Author "Oksanen, Joni"

Now showing items 1-1 of 1

Effects of Corpus Size on Word Similarity Model

Oksanen, Joni (2020)

Text mining methods provide a solution to the task of extracting relevant information from large text datasets. These methods can be applied to extract the relevant parts of the Suomi24 internet health discussion in order to analyze how people discuss and negotiate their health through words that represent medications or symptoms. Semantic similarities between these two concepts can be examined by learning word vector representations from the data and exploring the vector space using Word2Vec, a popular word embedding method. This thesis reviews how enlarging the corpus by means of text retrieval methods affects the training of word similarity models.

The effects of corpus size are examined by comparing the cosine similarities measured between word vector representations in two different vector spaces. The word vector representations are learned from two corpora of different sizes. The first corpus includes only messages from the health discussion area of Suomi24. The second corpus includes the same messages as the first, together with messages from other discussion areas that contain health-related words. Cosine similarities are evaluated using concept vocabularies of relevant health-related words. Increasing the number of training examples by almost 30% did not have a drastic effect on the qualities of the training data. The results did not indicate a distinct connection between corpus size and the measured cosine similarities between the word vector representations of health-related words.
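The comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the function names, the example words, and the two-dimensional vectors are all hypothetical, and real Word2Vec embeddings would have far more dimensions. The sketch computes the cosine similarity of a word pair separately in each vector space and then reports the difference between the two values.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_shift(space_a, space_b, word_pairs):
    """For each word pair, return (similarity in space_b) - (similarity in space_a).

    Each space maps a word to its vector. Similarities are computed inside
    one space at a time; vectors from different spaces are never mixed.
    """
    shifts = {}
    for w1, w2 in word_pairs:
        sim_a = cosine_similarity(space_a[w1], space_a[w2])
        sim_b = cosine_similarity(space_b[w1], space_b[w2])
        shifts[(w1, w2)] = sim_b - sim_a
    return shifts


# Made-up toy vectors standing in for embeddings learned from the
# smaller and the larger corpus (hypothetical words and values).
small_corpus_space = {"ibuprofen": [1.0, 0.0], "headache": [1.0, 1.0]}
large_corpus_space = {"ibuprofen": [1.0, 0.0], "headache": [1.0, 0.0]}

shifts = similarity_shift(
    small_corpus_space,
    large_corpus_space,
    [("ibuprofen", "headache")],
)
print(shifts)
```

Similarities are computed within one space at a time because two separately trained embedding spaces do not share coordinate axes, so only within-space quantities such as the cosine similarity of a word pair can be compared directly.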

University of Helsinki