
Browsing by Subject "HMM"


  • Viljamaa, Venla (2022)
    In bioinformatics, new genomes are sequenced at an increasing rate. To utilize this data in various bioinformatics problems, it must first be annotated. Genome annotation is a computational problem that has traditionally been approached with statistical methods such as the Hidden Markov model (HMM). However, implementing these methods is often time-consuming and requires domain knowledge. Neural network-based approaches have also been developed for the task, but they typically require a large amount of pre-labeled data. Genomes and natural language share many properties, not least the fact that they both consist of letters. Genomes also have their own grammar, semantics, and context-based meanings, just like phrases in natural language. These similarities motivate the use of Natural language processing (NLP) techniques in genome annotation. In recent years, pre-trained Transformer neural networks have been widely used in NLP. This thesis shows that, due to the linguistic properties of genomic data, the Transformer network architecture is also suitable for gene prediction. The model used in the experiments, DNABERT, is pre-trained on the full human genome. Using task-specific labeled datasets, the model is then trained to classify DNA sequences into genes and non-genes. The main fine-tuning dataset is the genome of the Escherichia coli bacterium, but preliminary experiments are also performed on human chromosome data. The fine-tuned models are evaluated for accuracy, F1-score, and Matthews correlation coefficient (MCC). A customized evaluation method is developed, in which the predictions are compared to ground-truth labels at the nucleotide level. Based on that, the best models achieve a 90.15% accuracy and an MCC value of 0.4683 on the Escherichia coli dataset. The model correctly classifies even the minority label, and the execution times are measured in minutes rather than hours. These results suggest that the NLP-based Transformer network is a powerful tool for learning the characteristics of gene and non-gene sequences.
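    The nucleotide-level evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual code: it assumes predictions and ground truth are aligned per-nucleotide binary label sequences (1 = gene, 0 = non-gene) and computes accuracy and MCC from the confusion-matrix counts.

    ```python
    import math

    def nucleotide_metrics(truth, pred):
        """Compare per-nucleotide gene/non-gene labels (1 = gene, 0 = non-gene)
        and return (accuracy, Matthews correlation coefficient)."""
        tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
        acc = (tp + tn) / len(truth)
        # MCC balances all four confusion-matrix cells, which is why it is
        # informative for imbalanced gene/non-gene label distributions.
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        return acc, mcc

    # Toy example with hypothetical labels for an 8-nucleotide stretch.
    truth = [1, 1, 0, 0, 1, 0, 0, 1]
    pred  = [1, 0, 0, 0, 1, 0, 1, 1]
    acc, mcc = nucleotide_metrics(truth, pred)
    ```

    Unlike accuracy, which the abstract notes can look strong even when the minority class is misclassified, an MCC near 0 would reveal that failure mode.
    
    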