Matrix Factorization for Learning Metagenomic Pathways and Species

dc.date.accessioned 2015-01-21T07:25:18Z und
dc.date.accessioned 2017-10-24T12:21:39Z
dc.date.available 2015-01-21T07:25:18Z und
dc.date.available 2017-10-24T12:21:39Z
dc.date.issued 2015-01-21T07:25:18Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/4408 und
dc.identifier.uri http://hdl.handle.net/10138.1/4408
dc.title Matrix Factorization for Learning Metagenomic Pathways and Species en
ethesis.discipline Applied Mathematics en
ethesis.discipline Soveltava matematiikka fi
ethesis.discipline Tillämpad matematik sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/2646f59d-c072-44e7-b1c1-4e4b8b798323
ethesis.department.URI http://data.hulib.helsinki.fi/id/61364eb4-647a-40e2-8539-11c5c0af8dc2
ethesis.department Institutionen för matematik och statistik sv
ethesis.department Department of Mathematics and Statistics en
ethesis.department Matematiikan ja tilastotieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Polvi-Huttunen, Silja
dct.issued 2015
dct.language.ISO639-2 eng
dct.abstract This work considers learning meaningful sets of chemical reactions, called pathways, and groups of species, called Operational Taxonomical Units (OTUs), from metagenomic data. The methods are based on Nonnegative Matrix Factorization (NMF). The rows of our data matrix correspond to metagenomic samples and the columns correspond to chemical reactions present in the samples. In order to learn both pathways and OTUs as well as the relationships between them, we consider ways to factorize the data matrix into three factors instead of two. Denoting the samples-by-reactions data matrix by V, our factorization problem is to find nonnegative matrices W, H and P such that V is approximately WHP. The matrix W tells which OTUs are present in each of the samples, P defines pathways as combinations of reactions, and H describes which pathways are implemented by which OTUs. We first discuss two standard NMF algorithms based on different objective functions and four sparsity-constrained variants. Sparsity-constrained variants are designed to produce output matrices with few values significantly above zero. We are interested in sparser variants because metagenomic pathways are short, so the method should find a representation where only a small set of reactions is present in each pathway. We describe how applying a standard two-factor NMF method twice yields a three-factor representation. We briefly mention an existing method, Nonnegative Matrix Tri-factorization (NMTF), that learns all three matrices W, H and P simultaneously. However, this method applies hard orthogonality constraints, i.e. it only finds solutions where the matrices W and P are orthogonal. Because of this constraint, NMTF is not suitable for our biological problem setting. We introduce an unconstrained method called NMF3 as well as a sparsity-constrained variant, SNMF3, based on Sparse Nonnegative Matrix Factorization (SNMF), and show how both of these algorithms can be derived.
In order to compare the performance of the different algorithms, we have built two synthetic data sets. Both sets are based on human intestinal species and pathway information available in an existing biological database. One of the data matrices can be exactly factorized into the underlying matrices used to generate the data. The other data set is built by simulating a sampling process that introduces noise and strictly limits the number of observed reactions per sample. We tested the factorization methods discussed in the thesis on both data sets, using 100 to 1500 samples. We compare the methods and discuss the results. We found differences between NMF variants that use different objective functions. Many methods perform well on our task, surprisingly even in the case where the number of pathways is greater than the number of samples. Varying the number of samples affected the results less than we expected. Instead, we found that all algorithms performed significantly better on the factorizable data than on the simulated set. We conclude that the number of available metagenomic samples does not dramatically affect the performance of the factorization methods. More important is the quality of the samples. en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112252073
dc.type.dcmitype Text
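
The abstract notes that applying a standard two-factor NMF method twice yields a three-factor representation V ≈ WHP. The following is a minimal sketch of that idea, not the thesis's own implementation. It assumes the well-known multiplicative update rules for the squared Euclidean (Frobenius) objective, and it assumes the order of the two steps (first V ≈ WA, then A ≈ HP); the record does not say which order or objective the thesis uses. The function names, the intermediate matrix A, and all parameters are hypothetical.

```python
import numpy as np


def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Two-factor NMF, V ~ W @ H, using multiplicative updates
    for the Frobenius objective (an assumed choice of objective)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive random start so the updates cannot get stuck at zero.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def two_step_trifactorization(V, n_otus, n_pathways, **kwargs):
    """Hypothetical two-step scheme: run two-factor NMF twice.

    V: samples x reactions
    W: samples x OTUs, H: OTUs x pathways, P: pathways x reactions
    """
    # Step 1 (assumed order): V ~ W @ A, where A is OTUs x reactions.
    W, A = nmf(V, n_otus, **kwargs)
    # Step 2: factorize the intermediate matrix, A ~ H @ P.
    H, P = nmf(A, n_pathways, **kwargs)
    return W, H, P


if __name__ == "__main__":
    # Toy data that is exactly factorizable by construction.
    rng = np.random.default_rng(1)
    V = rng.random((30, 4)) @ rng.random((4, 3)) @ rng.random((3, 20))
    W, H, P = two_step_trifactorization(V, n_otus=4, n_pathways=3)
    rel_err = np.linalg.norm(V - W @ H @ P) / np.linalg.norm(V)
    print(f"relative reconstruction error: {rel_err:.3f}")
```

Because each step is fitted separately, the second factorization cannot correct errors made in the first. A method that learns W, H and P simultaneously, as the abstract describes for NMF3, avoids that limitation.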

Files in this item

Files Size Format View
GraduDec8inclAbstract.pdf 1.006Mb PDF
