Browsing by master's degree program "Master's Programme in Life Science Informatics"

  • Ersalman, Murat (2024)
    Integrated population models (IPMs) are a promising approach to assess and manage wildlife populations in dynamic and uncertain conditions. By combining multiple data sources into a single, unified model, they enable the parametrization of versatile, mechanistic population models that can predict population dynamics in novel circumstances. This is in contrast to traditional approaches where independent empirical estimates for demographic parameters are typically incorporated into a population projection matrix such as a Leslie matrix. A major limitation of conventional methods is their inability to fully utilize all available information, as the synergies between different data sources are not exploited. The Baltic ringed seal (Pusa hispida botnica) presents an example illustrating the limitation of conventional monitoring approaches. Despite the availability of long-term monitoring data, population assessment is hindered by dynamic environmental conditions, varying reproductive rates, and the recently re-introduced hunting, thus limiting the quality of information available to managers regarding, for example, hunting quotas. In particular, population counts of ringed seals from aerial surveys have exhibited unexpected trends and large fluctuations during the last decade, making it impossible to obtain reliable estimates of population growth from survey data alone. This thesis presents a Bayesian IPM for the ringed seal population inhabiting the Bothnian Bay in the Baltic Sea. The central aim of this work is to outline an approach that can overcome some of the challenges that have crippled Baltic ringed seal monitoring efforts during the last decade, and support science-based management decisions. The thesis broadly consists of three parts. First, a state-space model is presented for the Bothnian Bay ringed seal population. 
Demographic processes are described through a stochastic age and sex structured population model that includes both hunting mortality and the hypothesized effects of environmental variables such as pollution and sea ice cover on demographic parameters and seal behaviour. Next, the model is fit to census and various demographic and reproductive data, as well as hunting statistics, from 1988 to 2023 under a Bayesian framework where posterior samples of model parameters are obtained using Markov Chain Monte Carlo methods. Finally, posterior estimates of model parameters are used to construct a Leslie matrix, and model behavior is analyzed using methods developed for matrix projection models. Future population dynamics are also simulated under alternative management scenarios to inform ringed seal management decisions. In general, this thesis demonstrates the value of mechanistic IPMs for monitoring and managing natural populations under changing environments, and supporting science-based management decisions.
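The Leslie matrix step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a single-sex, age-structured population with made-up fecundity and survival values; the thesis model is age and sex structured and fitted to data, so none of the numbers or function names below come from it.

```python
def leslie_matrix(fecundity, survival):
    """Build a Leslie matrix from per-age-class fecundities and survival rates.

    fecundity[i] is the expected offspring per individual in age class i;
    survival[i] is the probability of moving from age class i to i + 1.
    All values used with this function here are hypothetical.
    """
    n = len(fecundity)
    if len(survival) != n - 1:
        raise ValueError("need one survival rate per transition between age classes")
    matrix = [[0.0] * n for _ in range(n)]
    matrix[0] = [float(f) for f in fecundity]   # first row: reproduction
    for i, s in enumerate(survival):
        matrix[i + 1][i] = float(s)             # sub-diagonal: survival
    return matrix


def project(matrix, population):
    """Advance the age-structured population vector by one time step."""
    return [sum(a * x for a, x in zip(row, population)) for row in matrix]


def growth_rate(matrix, steps=200):
    """Approximate the asymptotic growth rate (dominant eigenvalue) by power iteration."""
    population = [1.0 / len(matrix)] * len(matrix)
    rate = 1.0
    for _ in range(steps):
        nxt = project(matrix, population)
        total = sum(nxt)
        rate = total                             # population sums to 1, so this is the ratio
        population = [x / total for x in nxt]
    return rate


if __name__ == "__main__":
    m = leslie_matrix([0.0, 1.0, 1.0], [0.5, 0.5])
    print(project(m, [10.0, 10.0, 10.0]))
    print(growth_rate(m))
```

A growth rate above 1 indicates a growing population and below 1 a declining one. Power iteration is used here only to keep the sketch dependency-free; it assumes the matrix has a single dominant eigenvalue.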
  • Szabo, Angela (2024)
    The advancement of high-throughput imaging technologies has revolutionized the study of the tumor microenvironment (TME), including in high-grade serous ovarian carcinoma (HGSOC), a cancer type characterized by genetic instability and high intra-tumor heterogeneity. HGSOC is often diagnosed at advanced stages and has a high relapse rate following initial treatment, presenting significant clinical challenges. Understanding the dynamic and complex tumor microenvironment in HGSOC is crucial for developing effective therapeutic strategies, as it includes various interacting cells and structures. Currently, most methods focus on deciphering the TME at the single-cell level, but the volume of the data poses a challenge in large-scale studies. This thesis focuses on developing a comprehensive pipeline for accurate detection and phenotyping of immune cells within the TME using tissue cyclic immunofluorescence imaging. The proposed pipeline integrates Napari, an advanced visualization tool, and several existing computational methods to handle large-scale imaging datasets efficiently. The primary aim is to create Napari plugins for fast browsing and detailed visualization of these datasets, enabling precise cell phenotyping and quality control. Large images were handled using Zarr and Dask, enabling efficient data management. Key image processing methodologies include the use of the StarDist algorithm for cell segmentation, preprocessing steps for fluorescence intensity normalization, and the Tribus tool for semi-automated cell type classification. In total, we annotated 976,082 single cells on three HGSOC samples originating from pre- or post-neoadjuvant chemotherapy tumor sections. The accurate annotation of immune sub-populations was enhanced by visual evaluation steps, addressing the limitations of the discussed methods.
Accurately annotating dense tissue areas is crucial for describing the cellular composition of samples, particularly tumor-infiltrating immune populations. The results indicate that the proposed pipeline not only enhances the understanding of the TME in HGSOC but also provides a robust framework for future studies involving large-scale imaging data.
  • Rantala, Frans (2023)
    Cancer consists of heterogeneous cell populations that repeatedly undergo natural selection. These cell populations compete with each other for space and nutrients and try to generate phenotypes that maximize their ecological fitness. To achieve this, they evolve evolutionarily stable strategies. When an oncologist starts to treat cancer, another game emerges. While it is affected by the cellular evolution processes, the modeling of this game draws on the results of classical game theory. This thesis investigates the theoretical foundations of adaptive cancer treatment. It draws from two game-theoretical approaches, evolutionary game theory and the Stackelberg leader-follower game. The underlying hypothesis of the adaptive regimen is that the patient's cancer burden can be managed by leveraging the resource competition between treatment-sensitive and treatment-resistant cells. The intercellular competition is mathematically modelled as an evolutionary game using the G function approach. The properties of evolutionary stability, such as ESS, the ESS maximum principle, and convergence stability, that are relevant to tumorigenesis and intra-tumoral dynamics, are elaborated. To mitigate the patient's cancer burden, it is necessary to find an optimal modulation and frequency of treatment doses. The Stackelberg leader-follower game, adopted from the economic studies of duopoly, provides a promising framework to model the interplay between a rationally playing oncologist as a leader and the evolutionarily evolving tumor as a follower. The two game types, applied simultaneously to cancer therapy strategizing, can complement each other and improve the planning of the adaptive regimen. Hence, the characteristics of the Stackelberg game are mathematically studied and a preliminary dose-optimization function is presented. The applicability of the combination of the two games in the planning of cancer therapy strategies is tested with a theoretical case.
The results are critically discussed from three perspectives: the biological veracity of the eco-evolutionary model, the applicability of the Stackelberg game, and the clinical relevance of the combination. The current limitations of the model are considered to invite further research on the subject.
  • Pohjonen, Joona (2020)
    Prediction of the pathological T-stage (pT) in men undergoing radical prostatectomy (RP) is crucial for disease management as curative treatment is most likely when prostate cancer (PCa) is organ-confined (OC). Although multiparametric magnetic resonance imaging (MRI) has been shown to predict pT findings and the risk of biochemical recurrence (BCR), none of the currently used nomograms allow the inclusion of MRI variables. This study aims to assess the possible added benefit of MRI when compared to the Memorial Sloan Kettering, Partin table and CAPRA nomograms and a model built from available preoperative clinical variables. Logistic regression is used to assess the added benefit of MRI in the prediction of non-OC disease and Kaplan-Meier survival curves and Cox proportional hazards in the prediction of BCR. For the prediction of non-OC disease, all models with the MRI variables had significantly higher discrimination and net benefit than the models without the MRI variables. For the prediction of BCR, MRI prediction of non-OC disease separated the high-risk group of all nomograms into two groups with significantly different survival curves but in the Cox proportional hazards models the variable was not significantly associated with BCR. Based on the results, it can be concluded that MRI does offer added value to predicting non-OC disease and BCR, although the results for BCR are not as clear as for non-OC disease.
  • Backlund, Sofia Maria (2022)
    Coral reefs form important marine ecosystems and simultaneously are at risk of deterioration due to rapidly changing environments as a consequence of human actions. Understanding their dynamics is thus important in order to be able to protect them from being destroyed. In this thesis we construct a lattice model for two life-history strategies of corals, brooders and spawners. These two strategies differ mainly in their modes of sexual reproduction, but also differences in growth and death rates as well as competitive ability are considered. We use pair approximation to help analyse the model while keeping its spatial structure. Numerical analysis is used to find the equilibria of the system as well as their stabilities, first for a single strategy and then for the two-strategy system. We find that the two strategies are able to coexist if the spawners have a higher growth rate and higher death rate and are competitively superior to brooders. This requires some reproduction over distance and a trade-off between growth and death rates. Thus we find that brooders are focusing a bigger part of their energy on long-distance reproduction, while spawners are dominating over short distances and having a higher turnover. We also find that both mutual invasibility and coexistence in the broader sense are only possible for low rates of sexual reproduction for both strategies. For higher rates of sexual reproduction we find that whichever strategy invades the lattice first will stay and the other cannot invade. Lastly we look at the effect of a change in environmental conditions, namely the acidification and temperature increase of oceans, on the two strategies and find that it affects the two strategies differently. The spawners are quickly driven to extinction by the change in environmental conditions, while brooders initially benefit from the changing conditions and only start to suffer themselves after the spawners have gone extinct.
  • Suppula, Joni Johan Mikael (2023)
    Progressive Multifocal Leukoencephalopathy (PML) is a rare but often fatal central nervous system demyelination disease caused by the reactivation of persistent JC polyomavirus (JCPyV) in immunosuppressed individuals. JCPyV infects oligodendrocytes in the brain, causing lysis of the glial cells, which leads to progressive demyelination and destruction of neurons seen as lesions in the white matter. The cause of JCPyV reactivation and how it reaches the brain are not well understood. MicroRNAs (miRNAs) are short non-coding RNAs which negatively regulate gene expression by marking mRNAs for destruction or by preventing translation. A single miRNA can have multiple mRNA targets and multiple miRNAs can target the same mRNA, making miRNA-induced gene regulation a complex process affecting multiple different signaling pathways and cellular processes. The focus of the thesis is to study miRNA differential expression in PML patients compared to healthy individuals to find miRNAs and their target genes affected by JCPyV, while showing expertise in the data handling and data analysis of a miRNA sequencing experiment. The study was conducted by collecting miRNA samples from eight PML patients and two controls and using next-generation sequencing and the QuickMIRSeq analysis tool to collect miRNA counts for differential expression analysis. The analysis identified twelve miRNAs upregulated in the PML brain and multiple target genes interacting with two or more of the identified miRNAs. The miRNAs were found to have connections to JCPyV replication, PML and important cellular processes such as neuroinflammation and blood-brain barrier (BBB) integrity.
  • Hämäläinen, Kreetta (2021)
    Personalized medicine tailors therapies for the patient based on predicted risk factors. Some tools used for making predictions on the safety and efficacy of drugs are genetics and metabolomics. This thesis focuses on identifying biomarkers for the activity level of the drug transporter organic anion transporting polypeptide 1B1 (OATP1B1) from data acquired from untargeted metabolite profiling. OATP1B1 transports various drugs, such as statins, from portal blood into the hepatocytes. OATP1B1 is a genetically polymorphic influx transporter, which is expressed in human hepatocytes. Statins are low-density lipoprotein cholesterol-lowering drugs, and decreased or poor OATP1B1 function has been shown to be associated with statin-induced myopathy. Based on genetic variability, individuals can be classified into those with normal, decreased or poor OATP1B1 function. These activity classes were employed to identify metabolomic biomarkers for OATP1B1. To find the most efficient way to predict the activity level and find the biomarkers that associate with the activity level, five different machine learning models were tested with a dataset that consisted of 356 fasting blood samples with 9152 metabolite features. The models included both a Random Forest regressor and a classifier, a Gradient Boosted Decision Tree regressor and classifier, and a Deep Neural Network regressor. Hindrances specific to this type of data were the collinearity between the features and the large number of features compared to the number of samples, which led to issues in determining the important features of the neural network model. To adjust for this, the data was clustered according to Spearman's rank-order correlation. Feature importances were calculated using two methods. For the neural network, the feature importances were calculated with permutation feature importance using mean squared error, whereas random forest and gradient boosted decision trees used Gini impurity.
The performance of each model was measured, and all classifiers had a poor ability to predict the decreased and poor function classes. All regressors performed very similarly to each other. The gradient boosted decision tree regressor performed the best by a slight margin, but the random forest regressor and the neural network regressor performed nearly as well. The best features from all three models were cross-referenced with the features found from y-aware PCA analysis. The y-aware PCA analysis indicated that the 14 best features cover 95% of the explained variance, so 14 features were picked from each model and cross-referenced with each other. Cross-referencing the highest-scoring features reported by the best models found multiple features that showed up as important in many models. Taken together, machine learning methods provide powerful tools to identify potential biomarkers from untargeted metabolomics data.
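Permutation feature importance with mean squared error, as named in the abstract above, can be sketched in a few lines. This is a generic, dependency-free illustration under stated assumptions: the `predict` callable, the toy data and the number of repeats are all hypothetical and are not taken from the thesis.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn.

    predict: callable mapping one feature row (a list) to a prediction.
    rows:    list of feature rows, all of the same length.
    A larger increase means the model relies more on that feature.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            error = mean_squared_error(targets, [predict(r) for r in shuffled])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


if __name__ == "__main__":
    # Toy data: the target depends only on the first feature.
    rows = [[float(i), float(i % 3)] for i in range(20)]
    targets = [2.0 * r[0] for r in rows]
    print(permutation_importance(lambda r: 2.0 * r[0], rows, targets))
```

With strongly collinear features, shuffling one feature at a time can understate importance because correlated features carry the same signal, which is consistent with the abstract's motivation for clustering correlated features first.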
  • Rögnvaldsson, Sölvi (2023)
    Seasonal variation has affected human societies throughout history, shaping various aspects of life including agriculture, migration patterns and culture. This influence is observed, among other things, in the occurrence of diseases such as viral and bacterial infections, cardiovascular disease and mental disorders. While a multitude of environmental and behavioral factors influence the timing of disease diagnoses, the role of genetics has, to the best of our knowledge, not been explored. The aim of this thesis was to relate genetic variation to seasonal disease risk. To achieve this, the seasonality of 1,759 disease endpoints was assessed in the Finnish population. A subset of 14 diseases was selected and used as input into a statistical modeling framework that was developed to search for genetic variants associated with seasonal disease risk in the FinnGen study population. A total of 9 genome-wide significant loci affecting seasonality were identified, including a top-sQTL, rs41273830[T], in ITGB8 for major depression and a stop-gain variant, rs601338[A], in FUT2 for intestinal infections, the latter also being protective against disease risk. This introduces a new aspect to genetic research, which can contribute both to a better understanding of how known disease variants affect disease and to finding new disease variants whose effects are currently obscured by seasonal variation.
  • Kinnula, Ville (2021)
    In inductive inference, phenomena from the past are modeled in order to make predictions of the future. The mathematical concept of exchangeability for random sequences provides a mathematical justification for the assumption that observations are independently and identically distributed given some underlying parameters estimable from the empirical distribution of the observations. The theory of exchangeability contains basic elements for inductive inference, such as the de Finetti representation theorem for the probability of a general exchangeable sequence, prior probability distributions for the parameters in the representation theorem, as well as the predictive probabilities, or rule of succession, for new observations from the random sequence under consideration. However, entirely unanticipated observations pose a problem for inductive inference. How can one assign a probability to an event that has never been seen before? This is called the sampling of species problem. Under exchangeability, the number of possible different events t has to be known beforehand to be able to assign an equal prior probability 1/t to each event. In the sampling of species problem, an assumption of infinitely many possible events has to be made, leading to the prior probability 1/∞ for each event, which is impossible. Exchangeability is thus inadequate to handle probability distributions for infinitely many possible events. It turns out that a solution to the sampling of species problem arises from partition exchangeability. Exchangeable random sequences have the same probability of occurring if the observations in the sequences have identical frequencies. Under partition exchangeability, the sequences have the same probability of occurring when they share identical frequencies of frequencies. In this thesis, partition exchangeability is introduced as a framework of inductive inference by juxtaposing it with the more familiar type of exchangeability for random sequences.
Partition exchangeability has parallel elements to exchangeability, in the Kingman representation theorem, the Poisson-Dirichlet distribution for the prior probability distribution, and a corresponding rule of succession. The rules of succession are required in the problem of supervised classification to provide product predictive probabilities to be maximized by assigning the test data into pre-defined classes based on training data. A Bayesian construction of supervised classification is discussed in this thesis. In theory, the best classification performance is gained when assigning the class labels to the test data simultaneously, but because of computational complexity, an assumption is often made where the test data points are i.i.d. with regards to each other. In the case of a known set of possible events these simultaneous and marginal classifiers converge in their test data predictive probabilities as the amount of training data tends to infinity, justifying the use of the simpler marginal classifier with enough training data. These two classifiers are implemented in this thesis under partition exchangeability, and it is shown in theory and in practice with a simulation study that the same asymptotic convergence between the simultaneous and marginal classifiers applies with partition exchangeable data as well. Finally, a small application in single cell RNA expression is explored.
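The distinction between frequencies and frequencies of frequencies, and the two kinds of rule of succession mentioned above, can be made concrete with a small sketch. It rests on assumptions not stated in the abstract: the exchangeable case uses a uniform prior over t known events (the Laplace rule), and the partition exchangeable case uses the one-parameter Poisson-Dirichlet (Ewens) form with a hypothetical parameter `theta`. The thesis may use a different parametrization.

```python
from collections import Counter
from fractions import Fraction


def frequencies(sequence):
    """How many times each distinct event occurs."""
    return Counter(sequence)


def frequencies_of_frequencies(sequence):
    """How many distinct events occur once, twice, and so on."""
    return Counter(Counter(sequence).values())


def laplace_rule(sequence, event, t):
    """P(next = event) with t known possible events and a uniform prior."""
    counts = Counter(sequence)
    return Fraction(counts[event] + 1, len(sequence) + t)


def prob_seen(sequence, event, theta):
    """Ewens-type probability that the next observation repeats a seen event."""
    counts = Counter(sequence)
    return counts[event] / (len(sequence) + Fraction(theta))


def prob_new(sequence, theta):
    """Ewens-type probability that the next observation is a previously unseen event."""
    return Fraction(theta) / (len(sequence) + Fraction(theta))


if __name__ == "__main__":
    # Different frequencies, identical frequencies of frequencies.
    print(frequencies("aab"), frequencies("bba"))
    print(frequencies_of_frequencies("aab"), frequencies_of_frequencies("bba"))
    print(laplace_rule("aab", "a", t=3))
    print(prob_seen("aab", "a", theta=1), prob_new("aab", theta=1))
```

Note that `prob_new` does not need the number of possible events, which is how this form sidesteps the 1/∞ problem described in the abstract.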
  • Koski, Jessica (2021)
    Acute lymphoblastic leukemia (ALL) is a hematological malignancy that is characterized by uncontrolled proliferation and blocked maturation of lymphoid progenitor cells. It is divided into B- and T-cell types, both of which have multiple subtypes defined by different somatic genetic changes. In addition, germline predisposition has been found to play an important role in multiple hematological malignancies, and several germline variants that contribute to ALL risk have already been identified in pediatric and familial settings. Only a few studies include adult ALL patients, but findings in acute myeloid leukemia, where germline predisposition was found to concern adult patients as well, have increased interest in studying adult patients. The prognosis of adult ALL patients is much worse than that of pediatric patients, and many still lack clear genetic markers for diagnosis. Thus, identifying genetic lesions affecting ALL development is important in order to improve treatments and prognosis. Germline studies can provide additional insight into the predisposition to and development of ALL when there are no clear somatic biomarkers. Single nucleotide variants are usually of interest when identifying biomarkers from the genome, but structural variants can also be studied. Their coverage of the genome is higher than that of single nucleotide variants, which makes them suitable candidates for exploring association with prognosis. Copy number changes can be detected from next-generation sequencing data, although the detection specificity and sensitivity vary considerably between different software. The current approach is to identify the most likely regions with copy number change by using multiple tools and to later validate the findings experimentally. In this thesis, the copy number changes in germline samples of 41 adult ALL patients were analyzed using ExomeDepth, CODEX2 and CNVkit.
  • Suhonen, Sannimari (2023)
    Polygenic risk scores (PRSs) estimate the genetic risk of an individual for a certain polygenic disease trait by summing up the effects of multiple variants across the genome affecting the disease risk. Currently, PRSs are calculated from imputed array genotyping data, which is inexpensive to produce and use and has standard procedures and pipelines available. However, genotyping arrays are prone to ascertainment bias, which can also lead to biased PRS results in some populations. If PRSs are utilized in healthcare for screening rare diseases, usage of whole-genome sequencing (WGS) instead of array genotyping is desirable, because individual samples can also be analyzed easily. While high-coverage WGS is still significantly more expensive than array genotyping, low-coverage whole-genome sequencing (lcWGS) with imputation has been proposed as an alternative to genotyping arrays. In this project, the utility of imputed lcWGS data in PRS estimation compared to genotyping array data and the impact of the choice of imputation tool for lcWGS data were studied. Down-sampled WGS data with six different low coverages (0.1x-2x) was used to represent lcWGS data. Two different pipelines were used in genotype imputation and haplotype phasing: in the first one, pre-phasing and imputation were performed directly for the genotype likelihoods (GLs) calculated from the down-sampled data, whereas in the second one, the GLs were converted to genotype calls before imputation and phasing. In both pipelines, PRSs for 27 disease phenotypes were calculated from the imputed and phased lcWGS data. Imputation and PRS calculation accuracy of the two pipelines were calculated in relation to both genotyping array and high-coverage whole-genome sequencing (hcWGS) data. In both pipelines, imputation and PRS calculation accuracy increased when the down-sampled coverage increased.
The second imputation and phasing pipeline led to better results in both imputation and PRS calculation accuracy. Some differences in PRS accuracy between different phenotypes were also detected. The results show patterns similar to those reported in comparable publications. However, imputation and PRS accuracy did not quite reach the levels seen in earlier studies, although possible limitations leading to the lower accuracy could be identified. The results also emphasize the importance of choosing suitable imputation and phasing methods for lcWGS data and suggest that methods and pipelines designed particularly for lcWGS should be developed and published.
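The summation that defines a PRS in the abstract above can be written as a weighted sum of allele dosages. The sketch below is illustrative only: the variant identifiers, effect sizes and dosages are hypothetical, and treating a missing variant as dosage 0 is an assumption of this example rather than something the thesis states.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of per-variant effect sizes weighted by the individual's allele dosage.

    dosages:      variant id -> effect-allele dosage in [0, 2]; imputed data
                  may give fractional values.
    effect_sizes: variant id -> effect size (for example a log odds ratio).
    Variants absent from `dosages` contribute 0 (an assumption of this sketch).
    """
    score = 0.0
    for variant, beta in effect_sizes.items():
        score += beta * dosages.get(variant, 0.0)
    return score


if __name__ == "__main__":
    # Hypothetical variants and weights, for illustration only.
    effects = {"variant_a": 0.2, "variant_b": -0.1, "variant_c": 0.05}
    imputed = {"variant_a": 1.5, "variant_b": 2.0}
    print(polygenic_risk_score(imputed, effects))
```

Because the score depends directly on the dosages, any imputation error at a weighted variant propagates into the PRS, which is why imputation accuracy and PRS accuracy are evaluated together.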
  • Pirttikoski, Anna (2022)
    Ovarian cancer is the most lethal gynecological cancer, and high-grade serous ovarian cancer (HGSOC) is its most common type. HGSOC is often diagnosed in advanced stages and most patients will relapse after optimal first-line treatment. One reason for the lack of successful treatment in HGSOC is high tumor heterogeneity, including differences across the tumors in distinct patients, and even within each tumor. This heterogeneity is the result of genetic and non-genetic factors. Phenotypic variability also exists within cancer cells that have the same genetic background. This is due to the fact that a cell can exist in more than one stable state where its genome is in a specific configuration and it expresses certain genes. Diverse cell states and transitions between them initially offer a path for tumor development, and later enable essential tumor behavior, such as metastasis and survival under variable environmental pressures, such as those posed by anti-cancer therapies. Generally, phenotypic heterogeneity is acquired from the cell of origin for a tumor. This thesis studies cell states in HGSOC cancer cells and their normal counterparts, fallopian tube epithelial (FTE) cells. Exploration of cell states is based on gene expression data of individual cells. Gene expression data was analyzed with state-of-the-art tools and computational methods. Gene modules representing cell states were constructed using genes found in differential gene expression analysis of cancer cells, normal cells and tumor microenvironment. Differentially expressed gene (DEG) groups of cancer, normal FTE and shared epithelial genes were grouped separately into gene modules based on gene-gene associations and community detection. Potential dynamical relationships between cell states were addressed by pseudo-temporal ordering using an RNA velocity modeling approach.
With the chosen research strategy, we are able to capture biologically meaningful cell states which are relevant in the development of HGSOC. The cell states found represent processes such as epithelial-mesenchymal transition, inflammation and stress response, which are known to have a role in cancer development. The transition patterns showed consistent tendencies across the samples, and the trajectories for normal samples presented more directionality than those of cancer specimens. The results indicate the existence of shared epithelial states which stay in fixed positions in the developmental trajectory of normal and cancer cells. For example, both epithelial stem cells and stem-like cancer cells seem to utilize oxidative phosphorylation (OXPHOS) for their metabolic needs. On the other hand, cell states that are more terminal showed higher activities of tumor necrosis factor alpha and Wnt/beta-catenin pathways that were both mutually exclusive with OXPHOS. Overall, this thesis presents a novel approach to studying cell states, the characterization of which is essential in understanding tumorigenesis and cancer cell plasticity.
  • Malmsten, Kim (2021)
    Genomic structural variants are large events that change the structure of the genome. These can cause changes in the functions of cells by breaking genes and genomic regulatory regions. Multiple factors are known to affect the formation of structural variants, and previous studies have shown that the sequence content in a genomic region often plays a role in their formation. This study aims to characterize the sequence content around the breakpoints of structural variants detected in human tissue samples that were whole-genome sequenced using nanopore sequencing. The characterization was done by looking at the genomic repetitive elements found around the breakpoints, by analyzing the GC-content around the breakpoints, and by studying what kind of enriched DNA motifs were found in the sequences around the breakpoints and how these were located in these sequences. Multiple different repetitive elements were seen to occur near the breakpoint regions, and the kinds of repetitive elements seen also differed between types of structural variants. The sequences around different kinds of structural variants also had distinctly different GC-content profiles. In addition, various enriched motifs were found in the sequences, and many of these showed distinct variation in how they were located around the breakpoints. These results support previous findings by showing that here, too, sequence content plays a role in the formation of structural variants, although not all of the results could be directly explained by previous studies. In these results, the GC-content was higher in sequences that had been affected by an event that causes structural variant formation.
Also, many of the found DNA motifs were distinctly skewed around the breakpoint sequences, possibly hinting that the sequences containing these motifs would be prone to the formation of structural variants.
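A GC-content profile around a breakpoint, one of the analyses described above, can be sketched as windowed GC fractions over the flanking sequence. The window size, flank length and example sequence below are hypothetical choices for illustration; the abstract does not specify how the profiles were computed.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc_profile(sequence, breakpoint, flank, window):
    """GC content in consecutive windows covering `flank` bases on each side.

    breakpoint is a 0-based position in `sequence`; windows are clipped to
    the sequence boundaries, so the last window may be shorter.
    """
    start = max(0, breakpoint - flank)
    end = min(len(sequence), breakpoint + flank)
    return [
        gc_content(sequence[i:min(i + window, end)])
        for i in range(start, end, window)
    ]


if __name__ == "__main__":
    # Toy sequence: AT-rich upstream of the breakpoint, GC-rich downstream.
    print(gc_profile("ATATGCGC", breakpoint=4, flank=4, window=4))
```

Averaging such profiles over all breakpoints of one structural variant type would give a per-type profile that can be compared between types.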
  • Lindgren, Himmi (2024)
    Unsupervised learning techniques can detect clinically relevant structure in population cohort data of human gut microbiota. While the gut microbiota composition is influenced by individual factors such as diet, medication, and development of the immune system during early childhood, it is proposed that individuals maintain a relatively stable microbiota ecosystem throughout adulthood. This stability allows individuals to be divided into subgroups based on their gut microbiota characteristics, which define the key features of microbiota community types within the population. For this, I compared three probabilistic unsupervised learning techniques, optimization-based Non-negative Matrix Factorization, and Bayesian modelling techniques, Dirichlet Multinomial Mixtures and Latent Dirichlet Allocation, with a naive benchmark clustering based on dominant taxa. I used the strength of association with all-cause mortality as a quantitative metric to distinguish biologically relevant structure in a large Finnish population cohort with almost 18 years of follow-up. The techniques defined microbiota assemblages as either discrete enterotypes, which assigned each sample to a single community type, or continuous enterosignatures, which identified patterns of co-occurrence of microbiota community types within each sample. I found five rather robust community types, characterized by the bacterial genera Bacteroides, Alistipes, Agathobacter, Escherichia, and Prevotella. Latent Dirichlet Allocation detected the strongest early mortality signal using Cox regression, outperforming all other techniques. The replicability of Latent Dirichlet Allocation was assessed using cross-validation. The predicted community types uncovered an ecological landscape similar to that of the community types obtained using the entire data, confirming the clinical relevance, robustness, and scalability of the technique.
  • Ottensmann, Linda (2020)
    It is challenging to identify causal genes and pathways explaining the associations with diseases and traits found by genome-wide association studies (GWASs). To solve this problem, a variety of methods that prioritize genes based on the variants identified by GWASs have been developed. In this thesis, the methods Data-driven Expression Prioritized Integration for Complex Traits (DEPICT) and Multi-marker Analysis of GenoMic Annotation (MAGMA) are used to prioritize causal genes based on the most recently published publicly available schizophrenia GWAS summary statistics. The two methods are compared using the Benchmarker framework, which allows an unbiased comparison of gene prioritization methods. The study has four aims. Firstly, to explain the differences between the gene prioritization methods DEPICT and MAGMA and how the two methods work. Secondly, to explain how the Benchmarker framework can be used to compare gene prioritization methods in an unbiased way. Thirdly, to compare the performance of DEPICT and MAGMA in prioritizing genes based on the latest schizophrenia summary statistics from 2018 using the Benchmarker framework. Lastly, to compare the performance of DEPICT and MAGMA on a schizophrenia GWAS with a smaller sample size by using Benchmarker. Firstly, the published results of the Benchmarker analyses using the schizophrenia GWAS from 2014 were replicated to make sure that the framework was run correctly. The results were very similar, and both the original and the replicated results show that DEPICT and MAGMA do not perform significantly differently. Furthermore, they show that the intersection of genes prioritized by DEPICT and MAGMA outperforms the outersection, which is defined as genes prioritized by only one of these methods. Secondly, Benchmarker was used to compare the performance of DEPICT and MAGMA in prioritizing genes using the schizophrenia GWAS from 2018.
The results of the Benchmarker analyses suggest that DEPICT and MAGMA perform similarly with the GWAS from 2018 compared to the GWAS from 2014. Furthermore, an earlier schizophrenia GWAS from 2011 was used to check if the performance of DEPICT and MAGMA differs when a GWAS with lower statistical power is used. The results of the Benchmarker analyses make clear that MAGMA performs better than DEPICT in prioritizing genes using this smaller data set. Furthermore, for the schizophrenia GWAS from 2011 the outersection of genes prioritized by DEPICT and MAGMA outperforms the intersection. To conclude, the Benchmarker framework is a useful tool for comparing gene prioritization methods in an unbiased way. For the most recently published schizophrenia GWAS from 2018 there is no significant difference between the performance of DEPICT and MAGMA in prioritizing genes according to Benchmarker. For the smaller schizophrenia GWAS from 2011, however, MAGMA outperformed DEPICT.
  • Kortelainen, Milla (2023)
    Sequence alignment is a widely studied problem in the field of bioinformatics. The exact solution takes quadratic time to compute, and thus is not practical for long sequences. A number of heuristic approaches have been developed to overcome the quadratic time complexity. This thesis reviews the average-case time analysis of two such heuristics: banded alignment by Ganesh and Sy in "Near-Linear Time Edit Distance for Indel Channels", WABI 2020, and seed-chain-extend by Shaw and Yu in "Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic", Genome Research 2023. These heuristics reduce the quadratic average-case time complexity of sequence alignment to log-linear. The approach of the thesis is to outline the proofs of the original analyses and to provide supporting materials to aid the reader in studying them. The experiments of this thesis compare four different approaches to compute the exact match anchors of the seed-chain-extend sequence alignment heuristic. A Bi-Directional Burrows-Wheeler Transformation (BDBWT), suffix tree based Mummer and Minimap2 based exact match anchors are computed. The anchors are then given to a chaining algorithm to compare the performance of each anchoring technique. The qualities of the chains are compared using a Jaccard index applied to the sequences. The highest Jaccard index is obtained for the maximal exact match and the unique maximal exact match anchors of the Mummer and BDBWT approaches. Increasing the minimum length of the exact matches seems to increase the Jaccard index and reduce the running time of the chaining algorithm.
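    As an illustration of the banded alignment idea mentioned in the abstract above, the following minimal Python sketch restricts the edit distance dynamic programme to cells within a fixed distance of the main diagonal. It is a generic textbook-style band, not the specific algorithm analysed by Ganesh and Sy; the function name and the `band` parameter are assumptions made for this example.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance computed only over DP cells with |i - j| <= band.

    Returns None when the length difference exceeds the band, because no
    alignment path can then stay inside it. The result is exact whenever
    the true edit distance is at most `band`; otherwise it is an upper bound.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: transforming the empty prefix of `a` into b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                best = i  # delete the first i characters of `a`
            else:
                best = inf
                if j - 1 in prev:  # match or substitution (diagonal move)
                    best = min(best, prev[j - 1] + (a[i - 1] != b[j - 1]))
                if j - 1 in cur:  # insertion (move within the row)
                    best = min(best, cur[j - 1] + 1)
            if j in prev:  # deletion (move from the row above)
                best = min(best, prev[j] + 1)
            cur[j] = best
        prev = cur
    return prev[m]
```

    Each row touches at most 2 * band + 1 cells, so the running time is proportional to the sequence length times the band width rather than to the product of the two sequence lengths.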
  • Lintula, Johannes (2023)
    This work examines how neural networks can be used to qualitatively analyze systems of differential equations depicting population dynamics. We present a novel numerical method derived from physics-informed learning, capable of extracting equilibria and bifurcations from population dynamics models. The potential of the framework is showcased on three different example problems: a logistic model with outside inference, the Rosenzweig-MacArthur model, and one model from a recent population dynamics paper. The key idea behind the method is having a neural network learn the dynamics of a free parameter ODE system, and then using the derivatives of the neural network to find equilibria and bifurcations. We, a bit clunkily, refer to these networks as physics-informed neural networks with free parameters and variable initial conditions. In addition to these examples, we also survey how and where these neural networks could be further utilized in the context of population dynamics. To answer the how, we document our experiences choosing good hyperparameters for these networks, even venturing into previously unexplored territory. For the where, we suggest potentially useful neural network frameworks to answer questions from an external survey concerning contemporary open questions in population dynamics. The research of the work is preceded by a short dive into qualitative population dynamics, where we ponder what problems we want to solve and what tools we have available for that. Special attention is paid to parameter sensitivity analysis of ordinary differential equation systems through bifurcation theory. We also provide a beginner-friendly introduction to deep learning, so that the research can be understood even by someone not previously familiar with the field. The work was written, and all included contents were selected, with the goal of establishing a basis for future research.
  • Maljanen, Katri (2021)
    Cancer is a leading cause of death worldwide. Contrary to what its name suggests, cancer is not a single disease. It is a group of diseases that arise from the expansion of a somatic cell clone. This expansion is thought to be a result of mutations that confer a selective advantage to the cell clone. Mutations that are advantageous to cells, resulting in their proliferation and escape from normal cell constraints, are called driver mutations. The genes that contain driver mutations are known as driver genes. Studying these mutations and genes is important for understanding how cancer forms and evolves. Various methods have been developed that can discover these mutations and genes. This thesis focuses on a method called Deep Mutation Modelling, a deep learning based approach to predicting the probability of mutations. Deep Mutation Modelling’s output probabilities offer the possibility of creating sample and cancer type specific probability scores for mutations that reflect the pathogenicity of the mutations. Most methods in the past have produced scores that are the same for all cancer types. Deep Mutation Modelling offers the opportunity to make a more personalised score. The main objectives of this thesis were to examine the Deep Mutation Modelling output, as it was unknown what kind of features it has, to see how the output compares against other scoring methods, and to see how the probabilities behave in mutation hotspots. A final objective was to test whether the probabilities could be used in a common driver gene discovery method. Overall, the goal was to see if Deep Mutation Modelling works and if it is competitive with other known methods. The findings indicate that Deep Mutation Modelling works in predicting driver mutations, but that it does not have sufficient power to do this reliably and requires further improvements.
  • Pfeil, Rebecca Katharina (2024)
    Single-cell RNA sequencing (scRNA-seq) allows the analysis of differences in RNA expression between individual cells. While this is usually performed by short read sequencing, long read sequencing technologies like Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) are also applied by researchers. As long read technologies can capture entire RNA molecules, combining them with single-cell sequencing enables the exploration of cell-specific isoform expression patterns. In single-cell sequencing, each cell is tagged with a different oligonucleotide, called a barcode, to enable the identification of the origin of each read. With short reads, these are straightforward to identify and correct. However, with the higher error rate of long reads, the identification of the barcodes becomes more challenging. Tools exist for the identification and correction of barcodes in short reads and for combinations of long and short reads, but only a few tools work with long reads exclusively. Additionally, most tools are focused on one specific scRNA-seq protocol. While most protocols work in a similar way, the location, length or other characteristics of the barcodes might differ, meaning not all tools work for all protocols. This thesis introduces a novel barcode calling algorithm for long reads called BArcoDe callinG via Edit distance gRaph, or Badger, which can accommodate different scRNA-seq protocols. The algorithm uses a novel data structure called the edit distance graph, which is based on the Hamming distance graph. Within the graph, every distinct barcode is represented by a node. Edges are added between nodes whose barcodes have an edit distance below a certain threshold. As calculating the edit distance is computationally expensive, a filter is used to find similar barcodes, and the edit distance is calculated only between those.
Additionally, the algorithm is implemented and its performance evaluated, both on its own and in comparison to the existing method scTagger, where Badger outperforms scTagger in both precision and recall.
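    The edit distance graph described in the abstract above can be sketched in a few lines of Python. This is only an illustration of the data structure as the abstract describes it, not Badger's implementation: the function names are hypothetical, and the length-difference check stands in for the unspecified filter (an actual filter would presumably avoid comparing all pairs).

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def build_edit_distance_graph(barcodes, threshold):
    """Return an adjacency map with one node per distinct barcode.

    An edge joins two barcodes whose edit distance is below `threshold`.
    """
    nodes = sorted(set(barcodes))
    graph = {bc: set() for bc in nodes}
    for u, v in combinations(nodes, 2):
        # Assumed stand-in filter: the length difference is a lower bound
        # on the edit distance, so such pairs can be skipped cheaply.
        if abs(len(u) - len(v)) >= threshold:
            continue
        if edit_distance(u, v) < threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph
```

    For example, with the barcodes ACGT, ACGA and TTTT and a threshold of 2, only ACGT and ACGA are connected, since they differ by a single substitution.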
  • Detrois, Kira Elaine (2023)
    Background/Objectives: Various studies have shown the advantage of incorporating polygenic risk scores (PRSs) in models with classic risk factors. However, systematic comparisons of PRSs with non-genetic factors are lacking. In particular, many studies on PRSs do not even report the predictive performance of the confounders included in the model, such as age and sex, which are already very predictive for most diseases. We looked at the ability of PRSs to predict the onset of 18 diseases in FinnGen R8 (N=342,499) and compared PRSs with the known non-genetic risk factors age, sex, Education, and Charlson Comorbidity Index (CCI). Methods: We set up individual studies for the 18 diseases. A single study consisted of an exposure (1999-2009), a washout (2009-2011), and an observation period (2011-2019). Eligible individuals could not have the selected disease of interest inside the disease-free period, which ranged from birth until the beginning of the observation period. We then defined the case and control status based on the diagnoses in the observation period and calculated the phenotypic scores during the exposure period. The PRSs were calculated using MegaPRS and the latest publicly available genome-wide association study summary statistics. We then fitted separate Cox proportional hazards models for each disease to predict disease onset during the observation period. Results: In FinnGen, the model’s predictive ability (c-index) with all predictors ranged from 0.565 (95% CI: 0.552-0.576) for Acute Appendicitis to 0.838 (95% CI: 0.834-0.841) for Atrial Fibrillation. The PRSs outperformed the phenotypic predictors, CCI and Education, for 6/18 diseases and still significantly enhanced onset prediction for 13/18 diseases when added to a model with only non-genetic predictors. Conclusion: Overall, we showed that for many diseases PRSs add predictive power over commonly used predictors - such as age, sex, CCI, and Education.
However, many important challenges must be addressed before implementing PRSs in clinical practice. Notably, we will need disease-specific cost-benefit analyses and studies to assess the direct impact of including PRSs in clinical use. Nonetheless, as more research is being conducted, PRSs could play an increasingly valuable role in identifying individuals at higher risk for certain diseases and enabling targeted interventions to improve health outcomes.