Novel dynamic topic models for modelling sequential document collections

Title: Novel dynamic topic models for modelling sequential document collections
Author(s): Liye, He
Contributor: University of Helsinki, Faculty of Science, Department of Computer Science
Language: English
Acceptance year: 2014
Abstract:
In this thesis, we concentrate on the problem of modelling real document collections, especially sequential document collections. The goal is to discover the important hidden topics in a collection automatically through statistical modelling of its content. For sequential document collections, we also want to capture how the topics change over time. To date, several computational tools, such as latent Dirichlet allocation (LDA), have been developed for modelling document collections. In this thesis, we develop new topic models for modelling the dynamic characteristics of a sequential document collection such as a news archive. We are, for example, interested in splitting the topics into long-term topics, such as 'Eurozone crisis', that are discussed over years, and short-term topics, such as 'Winter Olympics in 2014', that are popular for only several weeks. We first review the popular models for detecting hidden topics and their evolution, and then propose two new approaches for detecting these two kinds of topics. To provide real-world data for evaluating our new approaches, we additionally design a pipeline for constructing sequential document collections by collecting documents from the Web. To investigate the performance of our new approaches from different aspects, we conduct qualitative and quantitative experiments on two different kinds of datasets: news documents collected by the pipeline and 17 years of documents from the Neural Information Processing Systems (NIPS) conferences. The qualitative experiments evaluate the quality of the discovered topics, whereas the quantitative experiments measure the ability of the approaches to predict new words in unseen documents.


Files in this item

File: Masterthesis_LiyeHe.pdf
Size: 1.289Mb
Format: PDF
