
Scalable Bayesian Induction of Word Embeddings

Show simple item record

dc.date.accessioned 2015-12-18T14:40:19Z und
dc.date.accessioned 2017-10-24T12:24:04Z
dc.date.available 2015-12-18T14:40:19Z und
dc.date.available 2017-10-24T12:24:04Z
dc.date.issued 2015-12-18T14:40:19Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/5243 und
dc.identifier.uri http://hdl.handle.net/10138.1/5243
dc.title Scalable Bayesian Induction of Word Embeddings en
ethesis.discipline Computer science en
ethesis.discipline Tietojenkäsittelytiede fi
ethesis.discipline Datavetenskap sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Sakaya, Joseph Hosanna
dct.issued 2015
dct.language.ISO639-2 eng
dct.abstract Traditional natural language processing has been shown to rely excessively on human-annotated corpora. However, the recent successes of machine translation and speech recognition, ascribed to the effective use of the increasing availability of web-scale data in the wild, have given momentum to a resurgent interest in modelling natural language with simple statistical models, such as the n-gram model, that are easily scaled. Indeed, words and word combinations provide all the representational machinery one needs for solving many natural language tasks. The degree of semantic similarity between two words is a function of the similarity of the linguistic contexts in which they appear. Word representations are mathematical objects, often vectors, that capture the syntactic and semantic properties of a word. As a result, words that are semantic cognates have similar word representations, an important property that we use widely. We claim that word representations provide a superb framework for unsupervised learning on unlabelled data by compactly representing the distributional properties of words. The current state-of-the-art word representation adopts the skip-gram model to train shallow neural networks and presents negative sampling, an idea borrowed from Noise Contrastive Estimation, as an efficient method of inducing embeddings. An alternative approach contends that the inherent multi-contextual nature of words calls for a more Canonical Correlation Analysis-like approach for best results. In this thesis we develop the first fully Bayesian model to induce word embeddings. The prominent contributions of this thesis are: 1. A crystallisation of the best practices from the previous literature on word embeddings and matrix factorisation into a single hierarchical Bayesian model. 2. A scalable matrix factorisation technique for structured sparse data. 3. Representation of the latent dimensions as continuous Gaussian densities instead of point estimates. We analyse a corpus of 170 million tokens and learn for each word form a vectorial representation based on the 8 surrounding context words, with a negative sampling rate of 2 per token. We stress that while we certainly hope to beat the state of the art, our primary goal is to develop a stochastic and scalable Bayesian model. We evaluate the quality of the word embeddings on word analogy tasks as well as other tasks such as word similarity and chunking, and demonstrate competitive performance on standard benchmarks. en
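The abstract's reference point, skip-gram with negative sampling, can be illustrated with a minimal toy sketch. This is not code from the thesis (which replaces point estimates with a Bayesian treatment); it is a small numpy illustration of the baseline technique, with all names, corpus, window size, and hyperparameters chosen purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus and vocabulary (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                      # vocabulary size, embedding dimension

# Separate input ("word") and output ("context") embedding matrices,
# as in skip-gram; these are the point estimates the thesis replaces
# with Gaussian densities.
W_in = 0.01 * rng.standard_normal((V, D))
W_out = 0.01 * rng.standard_normal((V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context, k=2, lr=0.1):
    """One SGD step on a (center, context) pair with k negative samples.

    The observed context word gets label 1; k words drawn from the
    (here uniform) noise distribution get label 0, so the model learns
    to discriminate data from noise rather than normalise a softmax.
    """
    negatives = rng.integers(0, V, size=k)
    targets = np.concatenate(([context], negatives))
    labels = np.array([1.0] + [0.0] * k)
    v = W_in[center]                      # (D,)
    u = W_out[targets]                    # (k+1, D)
    g = sigmoid(u @ v) - labels           # gradient of the logistic loss
    W_in[center] -= lr * (g @ u)
    W_out[targets] -= lr * np.outer(g, v)

# Slide a window of 2 context words on each side over the corpus.
for epoch in range(200):
    for i, w in enumerate(corpus):
        for j in range(max(0, i - 2), min(len(corpus), i + 3)):
            if j != i:
                train_pair(idx[w], idx[corpus[j]])

def cos(a, b):
    """Cosine similarity between the learned vectors of two words."""
    va, vb = W_in[idx[a]], W_in[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Because "cat" and "dog" appear in the same linguistic contexts in the toy corpus, their learned vectors should drift toward each other, which is exactly the distributional property the abstract appeals to.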
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112252226
dc.type.dcmitype Text

Files in this item

Files Size Format View
joseph_sakaya_thesis.pdf 1.187Mb PDF

This item appears in the following Collection(s)
