Protein Function Prediction using Biomedical Literature

2017-06-20T11:16:29Z und 2017-10-24T12:24:24Z 2017-06-20T11:16:29Z und 2017-10-24T12:24:24Z 2017-06-20T11:16:29Z
dc.identifier.uri und
dc.title Protein Function Prediction using Biomedical Literature en
ethesis.discipline Computer science en
ethesis.discipline Tietojenkäsittelytiede fi
ethesis.discipline Datavetenskap sv
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI Helsingfors universitet sv University of Helsinki en Helsingin yliopisto fi
dct.creator Zosa, Elaine
dct.issued 2017
dct.language.ISO639-2 eng
dct.abstract Protein function prediction aims to identify the function of a given protein using, for example, sequence data, protein-protein interactions or evolutionary relationships. The use of biomedical literature to predict protein function, however, is a relatively under-studied topic given the vast amount of readily available data. This thesis explores the use of abstracts from biomedical literature to predict protein functions using the terms specified in the Gene Ontology (GO). The GO is a standardised method of cataloguing protein functions where the functions are organised in a directed acyclic graph (DAG). It is composed of three separate ontologies: cellular component (CC), molecular function (MF) and biological process (BP). Hierarchical classification is a classification method that assigns an instance to one or more classes where the classes are hierarchically related to each other, as in the GO. We build a hierarchical classifier that assigns GO terms to abstracts by training individual binary Naïve Bayes classifiers to recognise each GO term. We present three different methods of mining abstracts from PubMed. Using these methods, we assembled four datasets to train our classifiers. Each classifier is tested in three different ways: (a) in the paper-centric approach, we assign GO terms to a single abstract; (b) in the protein-centric approach, we assign GO terms to a concatenation of abstracts relating to a single protein; and (c) in the term-centric approach, which complements the protein-centric approach, the goal is to assign proteins to a GO term. We evaluate the performance of our method using two evaluation metrics: maximum F-measure (F-max) and minimum semantic distance (S-min). Our results show that the best dataset for training our classifier depends on the evaluation metric, the ontology and the proteins being annotated.
We also find that there is a negative correlation between the F-max score of a GO term and its information content (IC), and a positive correlation between the F-max score and the term's centrality in the DAG. Lastly, we compare our method with GOstruct, the state-of-the-art literature-based protein annotation program. Our method outperforms GOstruct on human proteins, showing a significant improvement for the MF ontology. en
dct.language en
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.degreeprogram Bioinformatics en
dct.identifier.urn URN:NBN:fi-fe2017112251766
dc.type.dcmitype Text
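
The abstract names maximum F-measure (F-max) as one of the two evaluation metrics. As a rough illustration only, and not the thesis's own code, the sketch below assumes the protein-centric definition commonly used in CAFA-style evaluations: at each score threshold, precision is averaged over proteins that receive at least one prediction, recall is averaged over all annotated proteins, and F-max is the best harmonic mean over thresholds. The function name f_max, the dictionary layout, the 100-step threshold grid and the placeholder term labels are all assumptions introduced here.

```python
from typing import Dict, Set


def f_max(predictions: Dict[str, Dict[str, float]],
          truth: Dict[str, Set[str]],
          steps: int = 100) -> float:
    """Protein-centric maximum F-measure (assumed CAFA-style definition).

    predictions maps protein -> {GO term -> score in [0, 1]}.
    truth maps protein -> set of true GO terms.
    """
    # Only proteins with at least one true term contribute to recall.
    annotated = {p: t for p, t in truth.items() if t}
    if not annotated:
        return 0.0

    best = 0.0
    for k in range(1, steps + 1):
        threshold = k / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in annotated.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= threshold}
            hits = len(predicted & true_terms)
            # Precision counts only proteins with a prediction at this threshold.
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(annotated)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Placeholder proteins and term labels, invented for this example.
    example_predictions = {
        "P1": {"GO:A": 0.9, "GO:B": 0.4},
        "P2": {"GO:C": 0.8},
    }
    example_truth = {"P1": {"GO:A"}, "P2": {"GO:C", "GO:D"}}
    print(round(f_max(example_predictions, example_truth), 3))
```

The term-centric variant described in the abstract would swap the roles of proteins and terms; that is not shown here.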

Files in this item

Files Size Format View
Thesis_Elaine_Zosa.pdf 6.273Mb PDF
