
Novel dynamic topic models for modelling sequential document collections


dc.date.accessioned 2014-05-08T09:36:44Z und
dc.date.accessioned 2017-10-24T12:23:41Z
dc.date.available 2014-05-08T09:36:44Z und
dc.date.available 2017-10-24T12:23:41Z
dc.date.issued 2014-05-08T09:36:44Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/3703 und
dc.identifier.uri http://hdl.handle.net/10138.1/3703
dc.title Novel dynamic topic models for modelling sequential document collections en
ethesis.department.URI http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Liye, He
dct.issued 2014
dct.language.ISO639-2 eng
dct.abstract In this thesis, we concentrate on the problem of modelling real document collections, especially sequential document collections. The goal is to discover important hidden topics in a collection automatically by statistical modelling of its content. For sequential document collections, we also want to capture how the topics change over time. To date, several computational tools, such as latent Dirichlet allocation (LDA), have been developed for modelling document collections. In this thesis, we develop new topic models for modelling the dynamic characteristics of a sequential document collection such as a news archive. We are, for example, interested in splitting the topics into long-term topics such as 'Eurozone crisis' that are discussed over years, and short-term topics such as 'Winter Olympics in 2014' that are popular for only several weeks. We first review the popular models for detecting hidden topics and their evolution, and then propose two new approaches to detect these two kinds of topics. To provide real-world data for evaluating our new approaches, we additionally design a pipeline for constructing sequential document collections by collecting documents from the Web. To investigate the performance of our new approaches from different aspects, we conduct qualitative and quantitative experiments on two different kinds of datasets: news documents collected by the pipeline and 17 years of documents from the Neural Information Processing Systems (NIPS) conferences. The qualitative experiments aim at evaluating the quality of the discovered topics, whereas the quantitative experiments concern their ability to predict new words in unseen documents. en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
ethesis.degreeprogram Algorithms and Machine Learning en
dct.identifier.urn URN:NBN:fi-fe2017112251881
dc.type.dcmitype Text

Files in this item

Files Size Format
Masterthesis_LiyeHe.pdf 1.289Mb PDF
