
Scalable Bayesian Induction of Word Embeddings


Title: Scalable Bayesian Induction of Word Embeddings
Author(s): Sakaya, Joseph Hosanna
Contributor: University of Helsinki, Faculty of Science, Department of Computer Science
Discipline: Computer science
Language: English
Acceptance year: 2015
Abstract:
Traditional natural language processing relies heavily on human-annotated corpora. However, the recent successes of machine translation and speech recognition, ascribed to the effective use of increasingly available web-scale data in the wild, have given momentum to a resurgent interest in modelling natural language with simple, easily scaled statistical models such as the n-gram model. Indeed, words and word combinations provide all the representational machinery one needs for solving many natural language tasks. The degree of semantic similarity between two words is a function of the similarity of the linguistic contexts in which they appear. Word representations are mathematical objects, often vectors, that capture the syntactic and semantic properties of a word. As a result, words that are semantic cognates have similar word representations, an important property that we use extensively. We claim that word representations provide a superb framework for unsupervised learning on unlabelled data by compactly representing the distributional properties of words. The current state-of-the-art word representation adopts the skip-gram model to train shallow neural networks and presents negative sampling, an idea borrowed from Noise Contrastive Estimation, as an efficient method of inducing embeddings. An alternative approach contends that the inherently multi-contextual nature of words calls for a more Canonical Correlation Analysis-like approach for best results. In this thesis we develop the first fully Bayesian model to induce word embeddings. The prominent contributions of this thesis are:
1. A crystallisation of the best practices from previous literature on word embeddings and matrix factorisation into a single hierarchical Bayesian model.
2. A scalable matrix factorisation technique for structured sparse data.
3. Representation of the latent dimensions as continuous Gaussian densities instead of as point estimates.
We analyse a corpus of 170 million tokens and learn a vectorial representation for each word form, based on the 8 surrounding context words, with a negative sampling rate of 2 per token. We would like to stress that while we certainly hope to beat the state of the art, our primary goal is to develop a stochastic and scalable Bayesian model. We evaluate the quality of the word embeddings on word analogy tasks as well as on word similarity and chunking, and demonstrate competitive performance on standard benchmarks.
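
As background for the baseline cited in the abstract, the following is a minimal NumPy sketch of the skip-gram negative-sampling objective for a single (centre word, context word) pair. It illustrates only that baseline, not the hierarchical Bayesian model developed in the thesis; the vocabulary size, embedding dimension, initialisation, and uniform negative-sampling scheme are placeholder assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 100                             # assumed vocabulary size and embedding dimension
W_in = rng.normal(scale=0.01, size=(V, D))     # "input" (word) vectors
W_out = rng.normal(scale=0.01, size=(V, D))    # "output" (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    # Negative log-likelihood for one (centre, context) pair with k negative
    # samples, as in skip-gram with negative sampling.
    v = W_in[center]
    pos = np.log(sigmoid(W_out[context] @ v))
    neg = np.log(sigmoid(-W_out[negatives] @ v)).sum()
    return -(pos + neg)

# Example call: 2 negative samples per token, mirroring the rate quoted in the abstract.
# Here the negatives are drawn uniformly for simplicity; word2vec-style implementations
# use a smoothed unigram distribution instead.
negatives = rng.integers(0, V, size=2)
print(sgns_loss(center=42, context=7, negatives=negatives))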


Files in this item

File Size Format
joseph_sakaya_thesis.pdf 1.187 MB PDF
