Browsing by master's degree program "Master's Programme in Life Science Informatics"

  • Varvarà, Giulia (2022)
    Species factories are defined as times and places in the fossil record where and when an exceptionally large number of new species occurs. While several tailored solutions for the mammalian record have been proposed, how to identify species factories computationally in a standardized way is still an open question. To quantify what is exceptional, we first need to quantify what is regular. One of the main challenges in this identification process is to account for sampling unevenness, which depends on several methodological decisions, including the scale of the analysis (aggregation radius). In this thesis we used Capture-Mark-Recapture (CMR) methods, with spatial aggregation guided by network modelling, to estimate the sampling probabilities for the species in the NOW database of mammalian fossil occurrences. Since the mammalian record is sparse and most localities include only a few species, we coupled CMR with tailored spatial aggregation approaches to estimate the sampling probabilities. We then used these sampling probabilities to quantify background speciation rates and assess what rates are abnormal. We represented aggregated fossil data as a bipartite network and used community detection to evaluate how the choice of an aggregation radius impacts the modular structure. After aggregating the data according to the radius chosen using network analysis, we estimated sampling probabilities using CMR. These probabilities allow adjustment for sampling unevenness, so that findings can be compared across locations and differences between them cannot be attributed to differences in sampling. We identified as species factories the locations whose adjusted origination rate was in the highest 5% per time unit.
Once the species factories had been identified, we looked for paleoecological patterns in these places that may be lacking elsewhere, finding that species factories present a lower number of findings and of different species among findings, but a higher ratio between the number of different species and the total number of findings than the rest of the locations. This would indicate that, even if species factories might accommodate fewer species, they present a higher diversity. To make sure these results were not only due to chance, we performed the same analysis on 100 randomized experiments obtained using a modified version of the Curveball Algorithm and compared the values obtained from the original dataset with the ones obtained from the randomized ones. This comparison showed us that species factories tend to have more extreme values than the ones obtained through randomization, which would indicate that species factories present specific paleoecological patterns that are not present in other locations.
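The randomization step can be illustrated with the standard Curveball algorithm, which shuffles a presence-absence matrix while preserving how many species each locality holds and how many localities each species occurs in. The sketch below is the textbook version, not the modified variant used in the thesis; the function names, the toy localities and the number of trades are illustrative assumptions.

```python
import random
from collections import Counter


def curveball(localities, n_trades, rng):
    """Randomize species lists with pairwise trades (standard Curveball).

    localities: list of sets, one set of species per locality.
    Row sizes and per-species occurrence counts are preserved.
    """
    rows = [set(r) for r in localities]
    for _ in range(n_trades):
        # Pick two distinct localities to trade between.
        i, j = rng.sample(range(len(rows)), 2)
        shared = rows[i] & rows[j]
        only_i = rows[i] - shared
        only_j = rows[j] - shared
        # Pool the species unique to either locality and redistribute them,
        # keeping each locality's richness unchanged.
        pool = sorted(only_i | only_j)
        rng.shuffle(pool)
        rows[i] = shared | set(pool[:len(only_i)])
        rows[j] = shared | set(pool[len(only_i):])
    return rows


def species_counts(localities):
    """Number of localities in which each species occurs."""
    return Counter(s for row in localities for s in row)


# Hypothetical toy data: four localities, five species.
observed = [{"A", "B", "C"}, {"B", "D"}, {"A", "D", "E"}, {"C"}]
randomized = curveball(observed, n_trades=100, rng=random.Random(0))
```

Repeating the call with different seeds would give a set of null datasets against which statistics from the observed data can be compared.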
  • Dovydas, Kičiatovas (2021)
    Cancer cells accumulate somatic mutations in their DNA throughout their lifetime. Advances in cancer prevention and treatment methods call for a deeper understanding of carcinogenesis at the level of the genetic sequence. Mutational signatures present a novel and promising way to capture somatic mutation patterns and define their causes, making it possible to summarize the mutational landscape of cancer as a combination of distinct mutagenic processes acting with different levels of strength. While the majority of previous studies assume an additive relationship between the mutational processes, this Master’s thesis provides tentative evidence that contemporary methods with additivity constraints, e.g. non-negative matrix factorization (NMF), are not sufficient to comprehensively explain the observed mutations in cancer genomes, and that the observed deviations are not random. To quantify these residues, two metrics are defined (additive and multiplicative residues), and hierarchical clustering algorithms are used to identify cancer subsets with similar residual profiles. It is shown that in certain cancer sample subsets there is a systematic mutational burden overestimation that can only be solved by a multiplicatively acting process, as well as non-random underestimation, requiring additional mutational signatures. Here an extension to the additive mutational signature model is proposed: a probabilistic model that incorporates a selectively active modulatory mutational process that is able to act in a multiplicative manner together with the known mutational signatures, reducing systematic variability.
  • Leppiniemi, Samuel Albert (2023)
    High-grade serous carcinoma (HGSC) is a highly lethal cancer type characterised by high genomic instability and frequent copy number alterations. This study examines the relationships between genetic variants in tumour germline and gene expression levels to obtain a better understanding of gene regulation in HGSC. This would in turn improve knowledge of the cancer mechanisms in order to find, for example, potential new treatment targets and biomarkers. The aim is to find significantly associated variant-gene pairs in HGSC. Expression quantitative trait loci (eQTL) analysis is a well-suited method for exploring these associations. It is also suitable for analysing variants located in non-coding genomic regions, which previous genome-wide association studies have indicated to contain many disease-linked germline variants. The current eQTL analysis methods are, however, not applicable to association testing between genes and variants in the context of HGSC because of the special genomic features of the cancer. Therefore, a new eQTL analysis approach, SegmentQTL, was developed for this study to accommodate the copy-number-driven nature of the disease. Careful input processing is of particular importance in eQTL analysis, as it has a notable effect on the number of significantly associated variant-gene pairs. It is also important to maintain adequate statistical power, which affects the reliability of the findings. In all, this study uses eQTL analysis to uncover variant-gene associations. This helps to improve knowledge of gene regulation mechanisms in HGSC in order to find new treatments. To apply the analysis to the HGSC data, a novel eQTL analysis method was developed. Additionally, appropriate input processing is important prior to running the analysis to ensure reliable results.
  • Soukainen, Arttu (2023)
    Insect pests substantially impact global agriculture, and pest control is essential for global food production. However, some pest control measures, such as intensive insecticide use, can have adverse ecological and economic effects. Consequently, there is a growing need for advanced pest management tools that can be integrated into intelligent farming strategies and precision agriculture. This study explores the potential of a machine learning tool to automatically identify and quantify fruit fly pests from images in the context of Ghanaian mango orchards in West Africa. Fruit flies pose a special challenge for computer vision-based deep learning due to their small size and taxonomic diversity. Insects were captured using sticky traps together with attractant pheromones. The traps were then photographed in the field using regular smartphone cameras. The image data contained 1434 examples of the targeted pests, and it was used to train a convolutional neural network (CNN) model for counting and classifying the fruit flies into two different genera: Bactrocera and Ceratitis. High-resolution images were used to train the YOLOv7 object detection algorithm. The training involved manual hyperparameter optimization emphasizing pre-selected hyperparameters. The focus was on employing appropriate evaluation metrics during model training. The final model had a mean average precision (mAP) of 0.746 and was able to identify 82% of the Ceratitis and 70% of the Bactrocera examples in the validation data. The results support the advantages of a computer vision-based solution for automated multi-class insect identification and counting. Low-effort data collection using smartphones is sufficient to train a modern CNN model efficiently, even with a limited number of field images. Further research is needed to effectively integrate this technology into decision-making systems for precision agriculture in tropical Africa.
Nevertheless, this work serves as a proof of concept, showcasing the serious potential of computer vision-based models in automated or semi-automated pest monitoring. Such models can enable new strategies for monitoring pest populations and targeting pest control methods. The same technology has potential not only in agriculture but in insect monitoring in general.
  • Gu, Chunhao (2021)
    Along with the rapid scale-up of biological knowledge bases, mechanistic models, especially metabolic network models, are becoming more accurate. At the same time, machine learning has been widely applied in biomedical research as large amounts of omics data have become available in recent years. It is therefore worthwhile to study the integration of metabolic network models and machine learning, as the combination may lead to biological discoveries. In 2019, MIT researchers proposed an approach called 'White-Box Machine Learning': they used fluxomics data derived from in silico simulation of a genome-scale metabolic (GEM) model, together with experimental antibiotic lethality measurements (IC50 values) of E. coli under hundreds of screening conditions, to train a linear regression-based machine learning model, and they extracted the coefficients of the model to discover metabolic mechanisms involved in antibiotic lethality. In this thesis, we propose a new approach based on the framework of 'White-Box Machine Learning'. We replace the GEM model with another state-of-the-art metabolic network model, the expression and thermodynamics flux (ETFL) formulation. We also replace the linear regression-based machine learning model with a novel nonlinear regression model, the multi-task elastic net multilayer perceptron (MTENMLP). We apply the approach to the same experimental antibiotic lethality measurements (IC50 values) of E. coli from the 'White-Box Machine Learning' study. Finally, we validate their conclusions and make some new discoveries. In particular, our results show that ppGpp metabolism is active under antibiotic stress, which is supported by some literature. This implies that our approach has the potential to make biological discoveries even when no candidate conclusion is known in advance.
  • Balaz, Melanie (2023)
    Gene editing holds tremendous potential for treating a variety of diseases, but concerns about safety, particularly the risk of edited cells becoming cancerous, must be addressed. This thesis explores a safety mechanism to prevent unwanted cell proliferation and tumor formation in induced pluripotent stem cells that have been edited for use in gene therapy. The mechanism is based on the genetic disruption (knockout) of the thymidylate synthase gene (TYMS), which encodes the only enzyme responsible for synthesizing deoxythymidine monophosphate (dTMP), an essential building block of DNA. Without dTMP, cells cannot successfully proliferate, while RNA synthesis remains unaffected. Through RNA sequencing analysis, we investigate the early response of TYMS knockout cells to dTMP withdrawal and find evidence of the activation of apoptosis and stress pathways, as well as differentiation and changes in the cell cycle. In addition, we demonstrate the effectiveness of the TYMS knockout mechanism in preventing proliferation of cancerous cells in a laboratory setting.
  • Riikonen, Juha (2023)
    Population structure refers to the patterns of genetic variation within and between populations, which arise from various evolutionary processes such as genetic drift, natural selection and migration. Understanding this structure in human populations provides insights into our own evolutionary history and past migration patterns. Controlling for underlying population structure is also an essential step in genetic association analyses to ensure that the associations between genetic variants and traits of interest are not confounded by differences in ancestry. Results from such analyses are essential for the research and development of personalised medicine. Principal component analysis (PCA) is a method that has been widely used to study the patterns of genetic variability within populations. In this study, PCA is applied to a genotype data set of 38,113 samples born in Finland using data from the Finnish study cohorts FINRISK, GeneRISK, FinHealth 2017 and Health 2000. The first ten principal components are extracted using PLINK 2.0 software. Novel discoveries of association between genetic variants and a disease often motivate further studies on the geographical distribution of such risk variants. Here, the genetic population structure is proposed as an alternative, higher dimensional space for studying the distribution of genetic variants within a population. This study presents a framework for quantifying and visualising the allele frequency variability across the genetic structure defined by principal components. Using an empirical Bayes model, the posterior minor allele frequency is estimated in discrete areas of the principal component space. The variability of these estimates is visualised as heatmaps, using a colouring scheme that provides statistical guarantees for frequency differences between different colours. The framework is demonstrated on five biallelic variants known to be associated with a disease or a disorder.
The results show that visualising the pairwise components complemented with data on sample birth location reveals the major patterns of genetic variability within the Finnish population. The framework is able to distinguish areas in the genetic structure with differing levels of allele frequency, and visualise this variability as heatmaps that enable meaningful visual interpretation. The levels of allele frequency differences found in the principal component space are comparable to the differences found geographically, which suggests that studying individual variants within the genetic structure on top of geographical frequency maps can provide additional information on their distribution in a population.
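To make the empirical Bayes step concrete, the sketch below uses a Beta-binomial model: a Beta prior is fitted to the raw per-cell frequencies by the method of moments, and each cell's posterior mean is then a compromise between its own frequency and the prior mean. This is one common way to set up such a model and is an assumption here, since the abstract does not specify the thesis's prior or fitting procedure; the cell counts are invented, and the crude moment fit ignores binomial sampling noise.

```python
from statistics import mean, pvariance


def fit_beta_prior(freqs):
    """Method-of-moments Beta(a, b) fit to raw per-cell allele frequencies.

    Assumes the variance is smaller than m * (1 - m), otherwise a and b
    would not be positive.
    """
    m = mean(freqs)
    v = pvariance(freqs)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def posterior_maf(minor_count, n_samples, a, b):
    """Posterior mean minor allele frequency for one cell.

    Each diploid sample contributes two alleles.
    """
    n_alleles = 2 * n_samples
    return (a + minor_count) / (a + b + n_alleles)


# Hypothetical cells of the principal component space:
# (minor allele count, number of samples in the cell).
cells = [(12, 100), (30, 150), (1, 5), (45, 300), (8, 40)]
raw = [x / (2 * n) for x, n in cells]
a, b = fit_beta_prior(raw)
prior_mean = a / (a + b)
shrunk = [posterior_maf(x, n, a, b) for x, n in cells]
```

Cells with few samples, such as the third one, are pulled most strongly towards the prior mean, which is the behaviour that makes sparse areas of the component space comparable with dense ones.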
  • Zogjani, Yllza (2023)
    The increasing demand for comprehensive datasets to address complex diseases has resulted in the widespread popularity of biobank-based research. However, the collection of biobank-level data may be susceptible to biases when fundamental aspects of study design, such as the sampling approach, are overlooked. FinnGen is a large-scale cohort study aiming to improve diagnoses and prevent diseases through genetic research by combining biobank data with registry data. However, FinnGen’s hospital-based recruitment strategy exposes it to selection bias and thus makes it epidemiologically less representative of its sampling population. In this study, we examine the profound impact of selection bias in FinnGen. We use well-established epidemiological methods and leverage representative data on the Finnish population to try to correct for the bias. By comparing key demographic characteristics and association statistics of interest between FinnGen and a comprehensive registry-based study, FinRegistry, we highlight the extent to which selection bias within FinnGen results in distorted association estimates and a dataset that is highly non-representative of its underlying population. In response to these findings, we compute Iterative Proportional Fitting (IPF) weights to estimate association statistics that are representative of the true sampling population of FinnGen and unaffected by selection bias. By comparing weighted associations estimated in FinnGen with associations estimated using FinRegistry data, we infer that the use of our IPF weights mitigates volunteer bias in FinnGen.
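Iterative Proportional Fitting (also called raking) can be sketched in a few lines: sample weights are rescaled one variable at a time until the weighted category shares match known population shares. The example below is a generic illustration with invented variables (`sex`, `age`) and target shares; it does not reproduce the variables, margins or software used in the thesis.

```python
def weighted_shares(samples, weights, var):
    """Weighted share of each category of `var` in the sample."""
    total = sum(weights)
    shares = {}
    for w, s in zip(weights, samples):
        shares[s[var]] = shares.get(s[var], 0.0) + w / total
    return shares


def ipf_weights(samples, targets, n_iter=200, tol=1e-12):
    """Rake unit weights so weighted margins match the target shares.

    samples: list of dicts mapping variable name -> category.
    targets: {variable: {category: population share}}, shares summing to 1.
    """
    weights = [1.0] * len(samples)
    for _ in range(n_iter):
        max_shift = 0.0
        for var, shares in targets.items():
            current = weighted_shares(samples, weights, var)
            for k, s in enumerate(samples):
                # Scale every unit in a category by target / current share.
                factor = shares[s[var]] / current[s[var]]
                weights[k] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:
            break
    return weights


# Hypothetical cohort that over-represents older men.
samples = (
    [{"sex": "F", "age": "young"}] * 1
    + [{"sex": "F", "age": "old"}] * 3
    + [{"sex": "M", "age": "young"}] * 2
    + [{"sex": "M", "age": "old"}] * 4
)
targets = {"sex": {"F": 0.5, "M": 0.5}, "age": {"young": 0.4, "old": 0.6}}
weights = ipf_weights(samples, targets)
```

The resulting weights would then be passed to a weighted regression or weighted prevalence estimate in place of treating every participant equally.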
  • Nebelung, Hanna (2023)
    Single-cell RNA sequencing (scRNA-seq) captures a static picture of a cell's transcriptome, including the abundances of unspliced and spliced RNA. RNA velocity methods offer the opportunity to infer future RNA abundances, and thus future states of a cell, based on the temporal change in these unspliced and spliced RNA abundances. Early RNA velocity methods have shed light on transcriptional dynamics in many biological processes. However, due to strict assumptions in the underlying model, these methods are not reliable when analysing and inferring velocity for genes with complex expression dynamics, such as genes with transcriptional boosts. These genes can, for example, be observed in erythropoietic and hematopoietic data. Several new RNA velocity methods have been proposed recently. Among these, veloVI and Pyro-Velocity both employ Bayesian methods to estimate the reaction rates and latent parameters. The problem of estimating RNA velocity is thus turned into one of posterior probability inference, which allows for more flexible inference of model parameters and the quantification of uncertainty. The objective of this thesis was to investigate the newly published RNA velocity methods veloVI and Pyro-Velocity in comparison to the established tool scVelo. To achieve this, we applied the methods to data obtained from scRNA-seq of healthy and ERCC6L2 disease bone marrow cells. ERCC6L2 disease can cause bone marrow failure with a risk of progression to acute myeloid leukemia with erythroid predominance. Specifically, we evaluated whether RNA velocity results reflect hematopoietic differentiation, whether genes with transcriptional boosts affect the velocity results, and whether RNA velocity analysis can indicate why erythropoiesis in ERCC6L2 disease is affected. We find that the new RNA velocity methods cannot produce velocity estimates that are fully in line with what is known of hematopoiesis in our data. Further, the results suggest that velocity estimates by veloVI are affected by genes with transcriptional boosts.
Moreover, the RNA velocity methods examined in this thesis are not robust and cannot reliably predict cell transitions based on the estimated velocity. Consequently, velocity estimates for disease data such as ERCC6L2 disease must be evaluated carefully before drawing any conclusion about the differentiation process. In conclusion, this thesis highlights the need for models that can capture complex transcription kinetics. Still, as this field is growing rapidly and promising new methods are being developed, improvement of RNA velocity analysis in general is possible.
  • Lehtonen, Leevi (2021)
    Sex differences can be found in most human phenotypes, and they play an important role in human health and disease. Females and males have different sex chromosomes, which are known to cause sex differences, as are differences in the concentration of sex hormones such as testosterone, estradiol and progesterone. However, the role of the autosomes has remained more debated. The primary aim of this thesis is to assess the magnitude and relevance of human sex-specific genetic architecture in the autosomes. This is done by calculating sex-specific heritability estimates and genetic correlation estimates between females and males, as well as comparing these to sex differences on the phenotype level. Additionally, the heritability and genetic correlation estimates are compared between two populations, in order to assess the magnitude of sex differences compared to differences between populations. The analyses in this thesis are based on sex-stratified genome-wide association study (GWAS) data from 48 phenotypes in the UK Biobank (UKB), which contains genotype data from approximately 500 000 individuals as well as thousands of phenotype measurements. A replication of the analyses using three phenotypes was also made on data from the FinnGen project, with a dataset from approximately 175 000 individuals. The 48 phenotypes used in this study range from biomarkers such as serum testosterone and albumin levels to general traits such as height and blood pressure. The heritability and genetic correlation estimates were calculated using linkage disequilibrium score regression (LDSC). LDSC fits a linear regression model between test statistic values of GWAS variants and linkage disequilibrium (LD) scores calculated from a reference population. For most phenotypes, the heritability and genetic correlation results show little evidence of sex differences. 
Serum testosterone level and waist-to-hip ratio are exceptions to this, showing strong evidence of sex differences both on the genetic and the phenotype level. However, the overall correlation between phenotype level sex differences and sex differences in heritability or genetic correlation estimates is low. The replication in the FinnGen dataset for height, weight and body mass index (BMI) showed that for these traits the differences in heritability estimates and genetic correlations between the Finnish and UK populations are comparable to or larger than the differences found between males and females.
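The core of the LD score regression idea can be shown with an ordinary least-squares fit: under the usual model the expected GWAS chi-square statistic of a variant grows linearly with its LD score, with slope N·h²/M, where N is the sample size and M the number of variants. The sketch below recovers h² from that slope on noise-free synthetic numbers. It is a simplification under stated assumptions: the LDSC software additionally uses regression weights and a block jackknife for standard errors, and all values here are invented.

```python
def ols(x, y):
    """Slope and intercept of a simple least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ldsc_heritability(ld_scores, chi2, n_samples, n_variants):
    """Heritability estimate from the slope of chi-square on LD score.

    Model: E[chi2_j] = intercept + (n_samples * h2 / n_variants) * ld_j.
    """
    slope, intercept = ols(ld_scores, chi2)
    return slope * n_variants / n_samples, intercept


# Synthetic, noise-free example with a known heritability of 0.4.
n_samples, n_variants, true_h2 = 100_000, 1_000_000, 0.4
ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
chi2 = [1.0 + n_samples * true_h2 / n_variants * ld for ld in ld_scores]
h2, intercept = ldsc_heritability(ld_scores, chi2, n_samples, n_variants)
```

Running the same fit separately on female-only and male-only summary statistics would give the kind of sex-specific heritability estimates that the abstract compares.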
  • Viitikko, Tanja (2023)
    Pathogens are everywhere in nature, so organisms have developed various mechanisms to defend themselves against them. Two of these defense mechanisms are known as resistance and tolerance. Resistance describes the host's ability to avoid being infected by the pathogen, while tolerance describes the host's ability to reduce the fitness loss caused by the infection. We assume that investing in resistance reduces the transmission rate of the pathogen and that investing in tolerance reduces the virulence experienced by the host. Developing the defense mechanisms is costly to the host. In this thesis, we assume that the resources invested in resistance and tolerance are taken away from the host's fecundity. The independent but simultaneous evolution of resistance and tolerance is modeled with an SIS model. The model is studied with the methods of adaptive dynamics. We concentrate on finding continuously stable strategies, which serve as the evolutionary end points for the population. We vary the ecological parameters to determine which strategies are optimal for the host in different environments. We find that for low values of the transmission rate, the hosts favor resistance over tolerance. When the transmission rate increases, resistance is traded for tolerance and the host benefits more from high tolerance. Low values of virulence result in tolerance being favored over resistance. Increasing virulence leads to a change in the defense mechanism, as for high values of virulence investing in resistance is more beneficial to the host. The same holds for the recovery rate: tolerance is favored for low values of the recovery rate and exchanged for resistance when the recovery rate increases. Patterns and associations between resistance and tolerance are also studied. A positive correlation between resistance and tolerance is found with low values of the transmission rate, low and high values of virulence, and high values of the recovery rate.
Resistance and tolerance correlate negatively with high values of transmission rate, intermediate values of virulence and low values of recovery rate.
  • Purmonen, Noora (2022)
    The purpose of this thesis is to present and examine ways in which statistical uncertainty can be explained and visualised. The target audience for this communication of statistical uncertainty is, in particular, readers who have little prior experience with statistical concepts or methods. COVID-19 data are used as the application for presenting these methods of visual communication. Communication about the COVID-19 infectious disease has had very diverse target audiences, but in communication about the progress of the epidemic aimed at, for example, the entire population of Finland, what has mattered is specifically communication to an audience that does not consist of experts in the field. The thesis is based on COVID-19 data and the infectious disease situation of 2020, when hardly any immunity to the disease had yet developed in the population. The thesis begins by introducing the SEIR infectious disease model, which describes the development of an epidemic in a population through four different disease stages. The SEIR model was also used in COVID-19 modelling in the early phase of the epidemic, because COVID-19 was thought to behave as an epidemic in the same way with respect to these four stages. In addition to introducing the model, the thesis briefly considers how the parameters used in the model, such as the basic reproduction number, affect the development of the epidemic situation. The COVID-19 modelling of the Finnish Institute for Health and Welfare (Terveyden ja hyvinvoinnin laitos) is also presented from the perspective of the SEIR model and the development of case numbers in early 2020. Here, the effect on the epidemic situation of the restrictions in place in 2020, which reduced the number of contacts between individuals, is also brought out through the reproduction number. With regard to statistical uncertainty, this thesis focuses on its causes, as uncertainty may stem from a lack of available information or from its randomness. Understanding the underlying causes is essential for explaining and illustrating the overall picture and its parts.
The thesis considers in particular the uncertainty present in COVID-19 modelling and in testing for infections. In addition, it examines ways of expressing statistical uncertainty, such as the sampling-related standard deviation or standard error and confidence intervals, and later, among other things, the visualisation and communication of these concepts. The communication of statistical uncertainty is presented especially through various graphs, such as box-and-whisker plots and scatter plots, while considering the benefits and challenges of the different graph types. Towards the end of the thesis, the factors that affect the interpretation of graphs are examined from a communication perspective, as are the goals of communicating uncertainty, for example in terms of the trust or emotions that the communication evokes. Finally, the thesis gathers together possible challenges in the visual presentation of statistical uncertainty, which may arise, for example, from interpretations made by the target audience or from the use of irrelevant graphs.
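The SEIR model mentioned in this abstract moves people through four compartments (susceptible, exposed, infectious, removed), and contact restrictions enter through the transmission rate and hence the basic reproduction number R0 = beta / gamma. The sketch below integrates the standard SEIR equations with a simple Euler scheme and compares an unrestricted scenario with one where contacts are reduced. All parameter values are illustrative assumptions and are not taken from the thesis or from the Finnish Institute for Health and Welfare's modelling.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, rec0, days, dt=0.1):
    """Euler integration of the standard SEIR model.

    beta: transmission rate, sigma: 1 / latent period,
    gamma: 1 / infectious period. Returns one (S, E, I, R) tuple per day.
    """
    n = s0 + e0 + i0 + rec0
    s, e, i, r = float(s0), float(e0), float(i0), float(rec0)
    daily = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        daily.append((s, e, i, r))
    return daily


POPULATION = 1_000_000
common = dict(sigma=0.2, gamma=0.2, s0=POPULATION - 100, e0=0, i0=100,
              rec0=0, days=300)

# Hypothetical scenarios: R0 = 0.5 / 0.2 = 2.5 without restrictions,
# R0 = 0.15 / 0.2 = 0.75 when contacts are strongly reduced.
baseline = simulate_seir(beta=0.5, **common)
restricted = simulate_seir(beta=0.15, **common)
```

Plotting the infectious compartment of both runs side by side gives a simple picture of how pushing the reproduction number below one changes the course of an epidemic.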
  • Ba, Yue (2021)
    Ringed seals (Pusa hispida) and grey seals (Halichoerus grypus) are known to have hybridized in captivity despite belonging to different taxonomic genera. Earlier genetic analyses have indicated hybridization in the wild, and the resulting introgression of genetic material across species boundaries could potentially explain the intermediate phenotypes observed e.g. in their dentition. Introgression can be detected using genome data, but existing inference methods typically require phased genotype data or cannot separate heterozygous and homozygous introgression tracts. In my thesis, I will present a method based on Hidden Markov Models (HMM) to identify genomic regions with a high density of single nucleotide variants (SNVs) of foreign ancestry. Unlike other methods, my method can use unphased genotype data and can separate heterozygous and homozygous introgression tracts. I will apply this method to study introgression in Baltic ringed seals and grey seals. I will compare my method to an alternative method and assess it with simulated data in terms of precision and recall. Then, I will apply it to seal data to search for introgression. Finally, I will discuss future directions for improving the method.
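As a rough illustration of how an HMM can separate heterozygous from homozygous introgression tracts, the sketch below decodes a sequence of per-window counts of foreign-ancestry SNVs with the Viterbi algorithm. The three states, the Poisson emission rates and the transition probabilities are all assumptions chosen for the toy example; the abstract does not describe the actual state space or emission model of the thesis.

```python
import math

# Hypothetical hidden states: 0 = no introgression,
# 1 = heterozygous tract, 2 = homozygous tract.
RATES = [0.5, 5.0, 11.0]  # assumed mean foreign SNV count per window


def poisson_logpmf(k, lam):
    """Log probability of observing count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi(counts, rates, stay=0.9):
    """Most likely state path for windowed SNV counts."""
    n_states = len(rates)
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [
        [math.log(stay if a == b else switch) for b in range(n_states)]
        for a in range(n_states)
    ]
    # Uniform initial distribution over states.
    score = [
        math.log(1.0 / n_states) + poisson_logpmf(counts[0], r) for r in rates
    ]
    back = []
    for k in counts[1:]:
        prev = score
        pointers, score = [], []
        for b in range(n_states):
            best = max(range(n_states), key=lambda a: prev[a] + log_trans[a][b])
            pointers.append(best)
            score.append(
                prev[best] + log_trans[best][b] + poisson_logpmf(k, rates[b])
            )
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy counts of foreign-ancestry SNVs in consecutive genomic windows.
counts = [0, 0, 0, 5, 5, 5, 5, 12, 12, 12, 12, 0, 0, 0]
path = viterbi(counts, RATES)
```

Because the model works on counts per window rather than on haplotypes, a sketch like this needs no phasing, which is the property the abstract highlights.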
  • Niinikoski, Eerik (2020)
    The aim of this thesis is to predict the total career racing performance of Finnish trotter horses by using the trotters' early career racing performance and other early career variables. The thesis presents a brief introduction to harness racing and the horses used in Finnish trotting sport. The data is presented and prepared for prediction, with descriptive statistics shown as tables and figures. The machine learning method of random forests for regression is introduced and used in the predictions. After training the model, the thesis presents the prediction accuracy and the most important variables for predicting total career racing performance for both the Finnhorse trotter and the Finnish Standardbred trotter populations. Finally, the writer discusses the shortcomings of the work and possible improvements for future research. The data for this thesis was provided by the Finnish trotting and breeding association (Suomen Hippos ry) and included all information on harness races run in Finland from 1984 to the end of 2019. From almost three million rows, the data was summarised into a table of 46704 rows of trotters that started their careers in the three earliest permitted age groups. A total of 37 independent variables were used to predict three outcomes (total career earnings, total number of career starts and total number of career first placings) as separate models. The predictors are derived from other studies that estimate the environmental and genetic factors of a trotter's racing performance. The three models performed poorly to moderately, with total earnings having the highest prediction accuracy. The model predicted larger amounts of earnings quite well, but tended to predict some earnings when there were in fact none. The prediction accuracy for the total number of starts was poor, especially when the true number of starts was low. The model that predicted the total number of career first placings performed the worst.
This can partially be explained by the fact that winning is a rare event for a trotter in general. The models fit better for Finnish Standardbred trotters than for Finnhorse trotters. This thesis works as a good basis for similar future research in which massive amounts of data and machine learning are used to predict a trotter’s career, racing performance or other factors. The results show that predicting total career racing performance as a classification problem could be a better fit than regression. Suitable classes, as well as possibly better predictors and suitable imputation of missing values, should be discussed with experts in harness racing.