
A Mixture Model for Heterogeneous Data with Application to Public Healthcare Data Analysis


dc.date.accessioned 2016-10-04T10:35:25Z und
dc.date.accessioned 2017-10-24T12:22:03Z
dc.date.available 2016-10-04T10:35:25Z und
dc.date.available 2017-10-24T12:22:03Z
dc.date.issued 2016-10-04T10:35:25Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/5787 und
dc.identifier.uri http://hdl.handle.net/10138.1/5787
dc.title A Mixture Model for Heterogeneous Data with Application to Public Healthcare Data Analysis en
ethesis.discipline Applied Mathematics en
ethesis.discipline Soveltava matematiikka fi
ethesis.discipline Tillämpad matematik sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/2646f59d-c072-44e7-b1c1-4e4b8b798323
ethesis.department.URI http://data.hulib.helsinki.fi/id/61364eb4-647a-40e2-8539-11c5c0af8dc2
ethesis.department Institutionen för matematik och statistik sv
ethesis.department Department of Mathematics and Statistics en
ethesis.department Matematiikan ja tilastotieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Sirola, Johannes
dct.issued 2016
dct.language.ISO639-2 eng
dct.abstract In this thesis we present an algorithm for mixture modeling of heterogeneous data collections. Our model supports both Gaussian and Bernoulli distributions, which makes it applicable to many different kinds of data. A major focus is on developing scalable inference for the proposed model, so that the algorithm can analyze even large amounts of data relatively fast. At the beginning of the thesis we review the required concepts from probability theory and then present the basic theory of an approximate inference framework called variational inference. We then present the mixture modeling framework, with the Gaussian and Bernoulli mixture models as examples. These models are then combined into a joint model, which we call GBMM, for Gaussian and Bernoulli Mixture Model. We develop scalable and efficient variational inference for the proposed model using state-of-the-art results in Bayesian inference. More specifically, we use a novel data augmentation scheme for the Bernoulli part of the model, coupled with overall algorithmic improvements such as incremental variational inference and a multicore implementation. The efficiency of the proposed algorithm over standard variational inference is demonstrated in a simple toy data experiment. Additionally, we demonstrate a scalable initialization for the main inference algorithm using a state-of-the-art random projection algorithm coupled with k-means++ clustering. The quality of the initialization is studied in an experiment with two separate datasets. As an extension to the GBMM model, we also develop inference for categorical features. This proves to be rather difficult, and our presentation covers only the derivation of the required inference algorithm, without a concrete implementation. We apply the developed mixture model to analyze a dataset consisting of electronic patient records collected in a major Finnish hospital.
We cluster the patients based on their usage of the hospital's services over 28-day time intervals across 7 years to find patterns that help in understanding the data better. This is done by running the GBMM algorithm on a large feature matrix with 269 columns and more than 1.7 million rows. We show that the proposed model is able to extract useful insights from the complex data, and that the results can serve as a guideline and/or preprocessing step for further, more detailed analysis, which is left for future work. en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112251685
dc.type.dcmitype Text

Files in this item

Files Size Format
thesis_sirola_johannes_2016.pdf 920.7 KB PDF
