Protein Function Prediction using Biomedical Literature

Protein Function Prediction using Biomedical Literature

dc.date.accessioned	2017-06-20T11:16:29Z	und
dc.date.accessioned	2017-10-24T12:24:24Z
dc.date.available	2017-06-20T11:16:29Z	und
dc.date.available	2017-10-24T12:24:24Z
dc.date.issued	2017-06-20T11:16:29Z
dc.identifier.uri	http://radr.hulib.helsinki.fi/handle/10138.1/6128	und
dc.identifier.uri	http://hdl.handle.net/10138.1/6128
dc.title	Protein Function Prediction using Biomedical Literature	en
ethesis.discipline	Computer science	en
ethesis.discipline	Tietojenkäsittelytiede	fi
ethesis.discipline	Datavetenskap	sv
ethesis.discipline.URI	http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI	http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department	Institutionen för datavetenskap	sv
ethesis.department	Department of Computer Science	en
ethesis.department	Tietojenkäsittelytieteen laitos	fi
ethesis.faculty	Matematisk-naturvetenskapliga fakulteten	sv
ethesis.faculty	Matemaattis-luonnontieteellinen tiedekunta	fi
ethesis.faculty	Faculty of Science	en
ethesis.faculty.URI	http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI	http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university	Helsingfors universitet	sv
ethesis.university	University of Helsinki	en
ethesis.university	Helsingin yliopisto	fi
dct.creator	Zosa, Elaine
dct.issued	2017
dct.language.ISO639-2	eng
dct.abstract	Protein function prediction aims to identify the function of a given protein using, for example, sequence data, protein-protein interaction or evolutionary relationships. The use of biomedical literature to predict protein function, however, is a relatively under-studied topic given the vast amount of readily available data. This thesis explores the use of abstracts from biomedical literature to predict protein functions using the terms specified in the Gene Ontology (GO). The Gene Ontology (GO) is a standardised method of cataloguing protein functions where the functions are organised in a directed acyclic graph (DAG). The GO is composed of three separate ontologies: cellular component (CC), molecular function (MF) and biological process (BP). Hierarchical classification is a classification method that assigns an instance to one or more classes where the classes are hierarchically related to each other, as in the GO. We build a hierarchical classifier that assigns GO terms to abstracts by training individual binary Naïve Bayes classifiers to recognise each GO term. We present three different methods of mining abstracts from PubMed. Using these methods we assembled four datasets to train our classifiers. Each classifier is tested in three different ways: (a) in the paper-centric approach, we assign GO terms to a single abstract, (b) in the protein-centric approach, we assign GO terms to a concatenation of abstracts relating to single protein; and (c) the term-centric approach is a complement of the protein-centric approach where the goal is to assign proteins to a GO term. We evaluate the performance of our method using two evaluation metrics: maximum F-measure (F-max) and minimum semantic distance (S-min). Our results show that the best dataset for training our classifier depends on the evaluation metric, the ontology and the proteins being annotated. We also find that there is a negative correlation between the F-max score of a GO term and its information content (IC) and a positive correlation between the F-max and the term's centrality in the DAG. Lastly we compare our method with GOstruct, the state-of-the-art literature-based protein annotation program. Our method outperforms GOstruct on human proteins, showing a significant improvement for the MF ontology.	en
dct.language	en
ethesis.language.URI	http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language	English	en
ethesis.language	englanti	fi
ethesis.language	engelska	sv
ethesis.thesistype	pro gradu-avhandlingar	sv
ethesis.thesistype	pro gradu -tutkielmat	fi
ethesis.thesistype	master's thesis	en
ethesis.thesistype.URI	http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
ethesis.degreeprogram	Bioinformatics	en
dct.identifier.urn	URN:NBN:fi-fe2017112251766
dc.type.dcmitype	Text

Files in this item

Files	Size	Format	View
Thesis_Elaine_Zosa.pdf	6.273Mb	PDF

This item appears in the following Collection(s)

Faculty of Science [4203]

Show simple item record

Protein Function Prediction using Biomedical Literature

Files in this item

This item appears in the following Collection(s)

Yhteystiedot

HELSINGIN YLIOPISTO