
Browsing by master's degree program "Magisterprogrammet i informatik inom livsvetenskaperna"


  • Rantala, Frans (2023)
    Cancer consists of heterogeneous cell populations that repeatedly undergo natural selection. These cell populations compete with each other for space and nutrients and try to generate phenotypes that maximize their ecological fitness. To achieve this, they evolve evolutionarily stable strategies (ESSs). When an oncologist starts to treat cancer, another game emerges. While affected by the cellular evolution processes, the modeling of this game draws on the results of classical game theory. This thesis investigates the theoretical foundations of adaptive cancer treatment. It draws on two game-theoretical approaches: evolutionary game theory and the Stackelberg leader-follower game. The underlying hypothesis of an adaptive regimen is that the patient's cancer burden can be managed by leveraging the resource competition between treatment-sensitive and treatment-resistant cells. The intercellular competition is mathematically modelled as an evolutionary game using the G function approach. The properties of evolutionary stability that are relevant to tumorigenesis and intra-tumoral dynamics, such as the ESS, the ESS maximum principle, and convergence stability, are elaborated. To mitigate the patient's cancer burden, it is necessary to find an optimal modulation and frequency of treatment doses. The Stackelberg leader-follower game, adopted from the economic studies of duopoly, provides a promising framework to model the interplay between a rationally playing oncologist as a leader and the evolving tumor as a follower. Applied simultaneously to cancer therapy strategizing, the two game types can inform each other and improve the planning of an adaptive regimen. Hence, the characteristics of the Stackelberg game are mathematically studied and a preliminary dose-optimization function is presented. The applicability of the combination of the two games in the planning of cancer therapy strategies is tested with a theoretical case.
    The results are critically discussed from three perspectives: the biological veracity of the eco-evolutionary model, the applicability of the Stackelberg game, and the clinical relevance of the combination. The current limitations of the model are considered to invite further research on the subject.
  • Backlund, Sofia Maria (2022)
    Coral reefs form important marine ecosystems and simultaneously are at risk of deterioration due to rapidly changing environments as a consequence of human actions. Understanding their dynamics is thus important in order to be able to protect them from being destroyed. In this thesis we construct a lattice model for two life-history strategies of corals, brooders and spawners. These two strategies differ mainly in their modes of sexual reproduction, but also differences in growth and death rates as well as competitive ability are considered. We use pair approximation to help analyse the model while keeping its spatial structure. Numerical analysis is used to find the equilibria of the system as well as their stabilities, first for a single strategy and then for the two-strategy system. We find that the two strategies are able to coexist if the spawners have a higher growth rate and higher death rate and are competitively superior to brooders. This requires some reproduction over distance and a trade-off between growth and death rates. Thus we find that brooders are focusing a bigger part of their energy on long-distance reproduction, while spawners are dominating over short distances and having a higher turnover. We also find that both mutual invasibility and coexistence in the broader sense are only possible for low rates of sexual reproduction for both strategies. For higher rates of sexual reproduction we find that whichever strategy invades the lattice first will stay and the other cannot invade. Lastly we look at the effect of a change in environmental conditions, namely the acidification and temperature increase of oceans, on the two strategies and find that it affects the two strategies differently. The spawners are quickly driven to extinction by the change in environmental conditions, while brooders initially benefit from the changing conditions and only start to suffer themselves after the spawners have gone extinct.
  • Suppula, Joni Johan Mikael (2023)
    Progressive Multifocal Leukoencephalopathy (PML) is a rare but often fatal demyelinating disease of the central nervous system caused by the reactivation of persistent JC polyomavirus (JCPyV) in immunosuppressed individuals. JCPyV infects oligodendrocytes in the brain, causing lysis of the glial cells, which leads to progressive demyelination and destruction of neurons seen as lesions in the white matter. The cause of JCPyV reactivation and how it reaches the brain are not well understood. MicroRNAs (miRNAs) are short non-coding RNAs which negatively regulate gene expression by marking mRNAs for destruction or by preventing translation. A single miRNA can have multiple mRNA targets and multiple miRNAs can target the same mRNA, making miRNA-induced gene regulation a complex process affecting multiple signaling pathways and cellular processes. The focus of the thesis is to study the differential expression of miRNAs in PML patients compared to healthy individuals in order to find miRNAs and their target genes affected by JCPyV, while showing expertise in the data handling and data analysis of a miRNA sequencing experiment. The study was conducted by collecting miRNA samples from eight PML patients and two controls and using next-generation sequencing and the QuickMIRSeq analysis tool to collect miRNA counts for differential expression analysis. The analysis identified twelve miRNAs upregulated in the PML brain and multiple target genes interacting with two or more of the identified miRNAs. The miRNAs were found to have connections to JCPyV replication, PML and important cellular processes such as neuroinflammation and blood-brain barrier (BBB) integrity.
  • Rögnvaldsson, Sölvi (2023)
    Seasonal variation has affected human societies throughout history, shaping various aspects of life including agriculture, migration patterns and culture. This influence is observed, among other things, in the occurrence of diseases such as viral and bacterial infections, cardiovascular disease and mental disorders. While a multitude of environmental and behavioral factors influence the timing of disease diagnoses, the role of genetics has, to the best of our knowledge, not been explored. The aim of this thesis was to relate genetic variation to seasonal disease risk. To achieve this, the seasonality of 1,759 disease endpoints was assessed in the Finnish population. A subset of 14 diseases was selected and used as input to a statistical modeling framework that was developed to search for genetic variants associated with seasonal disease risk in the FinnGen study population. A total of 9 genome-wide significant loci affecting seasonality were identified, including a top-sQTL, rs41273830[T], in ITGB8 for major depression and a stop-gain variant, rs601338[A], in FUT2 for intestinal infections, the latter also being protective against disease risk. This introduces a new aspect to genetic research, which can contribute both to a better understanding of how known disease variants affect disease and to finding new disease variants whose effects are currently obscured by seasonal variation.
  • Kinnula, Ville (2021)
    In inductive inference, phenomena from the past are modeled in order to make predictions about the future. The concept of exchangeability for random sequences provides a mathematical justification for the assumption that observations are independently and identically distributed given some underlying parameters estimable from the empirical distribution of the observations. The theory of exchangeability contains basic elements for inductive inference, such as the de Finetti representation theorem for the probability of a general exchangeable sequence, prior probability distributions for the parameters in the representation theorem, as well as the predictive probabilities, or rule of succession, for new observations from the random sequence under consideration. However, entirely unanticipated observations pose a problem for inductive inference. How can one assign a probability to an event that has never been seen before? This is called the sampling of species problem. Under exchangeability, the number of possible different events t has to be known beforehand to be able to assign an equal prior probability 1/t to each event. In the sampling of species problem an assumption of infinitely many possible events has to be made, leading to the prior probability 1/∞ for each event, which is impossible. Exchangeability is thus inadequate to handle probability distributions over infinitely many possible events. It turns out that a solution to the sampling of species problem arises from partition exchangeability. Exchangeable random sequences have the same probability of occurring if the observations in the sequences have identical frequencies. Under partition exchangeability, the sequences have the same probability of occurring when they share identical frequencies of frequencies. In this thesis, partition exchangeability is introduced as a framework of inductive inference by juxtaposing it with the more familiar type of exchangeability for random sequences.
    Partition exchangeability has elements parallel to those of exchangeability: the Kingman representation theorem, the Poisson-Dirichlet distribution for the prior probability distribution, and a corresponding rule of succession. The rules of succession are required in the problem of supervised classification to provide product predictive probabilities to be maximized by assigning the test data into pre-defined classes based on training data. A Bayesian construction of supervised classification is discussed in this thesis. In theory, the best classification performance is gained when assigning the class labels to the test data simultaneously, but because of computational complexity, an assumption is often made that the test data points are i.i.d. with regard to each other. In the case of a known set of possible events, these simultaneous and marginal classifiers converge in their test data predictive probabilities as the amount of training data tends to infinity, justifying the use of the simpler marginal classifier with enough training data. These two classifiers are implemented in this thesis under partition exchangeability, and it is shown in theory and in practice with a simulation study that the same asymptotic convergence between the simultaneous and marginal classifiers applies with partition exchangeable data as well. Finally, a small application in single cell RNA expression is explored.
  • Suhonen, Sannimari (2023)
    Polygenic risk scores (PRSs) estimate the genetic risk of an individual for a certain polygenic disease trait by summing up the effects of multiple variants across the genome affecting the disease risk. Currently, PRSs are calculated from imputed array genotyping data, which is inexpensive to produce and use and has standard procedures and pipelines available. However, genotyping arrays are prone to ascertainment bias, which can also lead to biased PRS results in some populations. If PRSs are utilized in healthcare for screening rare diseases, usage of whole-genome sequencing (WGS) instead of array genotyping is desirable, because individual samples can also be analyzed easily. While high-coverage WGS is still significantly more expensive than array genotyping, low-coverage whole-genome sequencing (lcWGS) with imputation has been proposed as an alternative to genotyping arrays. In this project, the utility of imputed lcWGS data in PRS estimation compared to genotyping array data and the impact of the choice of imputation tool for lcWGS data were studied. Down-sampled WGS data with six different low coverages (0.1x-2x) was used to represent lcWGS data. Two different pipelines were used in genotype imputation and haplotype phasing: in the first one, pre-phasing and imputation were performed directly for the genotype likelihoods (GLs) calculated from the down-sampled data, whereas in the second one, the GLs were converted to genotype calls before imputation and phasing. In both pipelines, PRSs for 27 disease phenotypes were calculated from the imputed and phased lcWGS data. Imputation and PRS calculation accuracy of the two pipelines were calculated in relation to both genotyping array and high-coverage whole-genome sequencing (hcWGS) data. In both pipelines, imputation and PRS calculation accuracy increased when the down-sampled coverage increased.
    The second imputation and phasing pipeline led to better results in both imputation and PRS calculation accuracy. Some differences in PRS accuracy between different phenotypes were also detected. The results show patterns similar to those reported in comparable publications. However, imputation and PRS accuracy did not reach the levels seen in earlier studies, although possible limitations leading to the lower accuracy could be identified. The results also emphasize the importance of choosing suitable imputation and phasing methods for lcWGS data and suggest that methods and pipelines designed particularly for lcWGS should be developed and published.
  • Pirttikoski, Anna (2022)
    Ovarian cancer is the most lethal gynecological cancer and high-grade serous ovarian cancer (HGSOC) is its most common type. HGSOC is often diagnosed in advanced stages and most patients will relapse after optimal first-line treatment. One reason for the lack of successful treatment in HGSOC is high tumor heterogeneity, including differences across the tumors in distinct patients, and even within each tumor. This heterogeneity is the result of genetic and non-genetic factors. Phenotypical variability also exists within cancer cells that have the same genetic background. This is due to the fact that a cell can exist in more than one stable state where its genome is in a specific configuration and it expresses certain genes. Diverse cell states and transitions between them initially offer a path for tumor development, and later enable essential tumor behavior, such as metastasis and survival under variable environmental pressures, such as those posed by anti-cancer therapies. Generally, phenotypic heterogeneity is acquired from the cell of origin of a tumor. This thesis studies cell states in HGSOC cancer cells and their normal counterparts, fallopian tube epithelial (FTE) cells. Exploration of cell states is based on gene expression data of individual cells. Gene expression data was analyzed with state-of-the-art tools and computational methods. Gene modules representing cell states were constructed using genes found in differential gene expression analysis of cancer cells, normal cells and tumor microenvironment. Differentially expressed gene (DEG) groups of cancer, normal FTE and shared epithelial genes were grouped separately into gene modules based on gene-gene associations and community detection. Potential dynamical relationships between cell states were addressed by pseudo-temporal ordering using an RNA velocity modeling approach.
    With the chosen research strategy, we are able to capture biologically meaningful cell states which are relevant in the development of HGSOC. The cell states found represent processes such as epithelial-mesenchymal transition, inflammation and stress response, which are known to have a role in cancer development. The transition patterns showed consistent tendencies across the samples, and the trajectories for normal samples presented more directionality than those of cancer specimens. The results indicate the existence of shared epithelial states which stay in fixed positions in the developmental trajectory of normal and cancer cells. For example, both epithelial stem cells and stem-like cancer cells seem to utilize oxidative phosphorylation (OXPHOS) for their metabolic needs. On the other hand, cell states that are more terminal showed higher activities of tumor necrosis factor alpha and Wnt/beta-catenin pathways that were both mutually exclusive with OXPHOS. Overall, this thesis presents a novel approach to study cell states, the characterization of which is essential in understanding tumorigenesis and cancer cell plasticity.
  • Malmsten, Kim (2021)
    Genomic structural variants are large events that change the structure of the genome. These can cause changes in the functions of cells by breaking genes and genomic regulatory regions. Multiple factors are known to affect the formation of structural variants, and previous studies have shown that the sequence content in a genomic region often plays a role in their formation. This study aims to characterize the sequence content around the breakpoints of structural variants detected in human tissue samples that have been whole-genome sequenced with nanopore sequencing. The characterization was done by looking at the genomic repetitive elements found around the breakpoints, by analyzing the GC-content around the breakpoints, and by studying which enriched DNA motifs were found in the sequences around the breakpoints and how these were located in these sequences. Multiple different repetitive elements were seen to occur near the breakpoint regions, and it was also observed that different repetitive elements were seen around different types of structural variants. The sequences around different kinds of structural variants also had distinctly different GC-content profiles. In addition, various enriched motifs were found in the sequences, and many of these showed distinct variation in how they were located around the breakpoints. These results support the previous findings by showing that the sequence content plays a role in the formation of structural variants here as well, but not all of the results could be directly explained by previous studies. In these results, it was seen that the GC-content was higher in sequences that have been affected by an event that causes structural variant formation.
    Also, many of the DNA motifs found were distinctly skewed around the breakpoint sequences, possibly hinting that the sequences containing these motifs would be prone to the formation of structural variants.
  • Kortelainen, Milla (2023)
    Sequence alignment is a widely studied problem in the field of bioinformatics. The exact solution takes quadratic time to compute, and thus is not practical for long sequences. A number of heuristic approaches have been developed to overcome the quadratic time complexity. This thesis reviews the average-case time analysis of two such heuristics: banded alignment by Ganesh and Sy in "Near-Linear Time Edit Distance for Indel Channels", WABI 2020, and seed-chain-extend by Shaw and Yu in "Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic", Genome Research 2023. These heuristics reduce the quadratic average-case time complexity of sequence alignment to log-linear. The approach of the thesis is to outline the proofs of the original analyses and provide supporting materials to aid the reader in studying them. The experiments of this thesis compare four different approaches to computing the exact match anchors of the seed-chain-extend sequence alignment heuristic. Exact match anchors based on a Bi-Directional Burrows-Wheeler Transform (BDBWT), the suffix tree based MUMmer, and Minimap2 are computed. The anchors are then given to a chaining algorithm to compare the performance of each anchoring technique. The qualities of the chains are compared using a Jaccard index applied to the sequences. The highest Jaccard index is obtained for the maximal exact match and the unique maximal exact match anchors of the MUMmer and BDBWT approaches. Increasing the minimum length of the exact matches seems to increase the Jaccard index and reduce the running time of the chaining algorithm.
  • Lintula, Johannes (2023)
    This work examines how neural networks can be used to qualitatively analyze systems of differential equations depicting population dynamics. We present a novel numerical method derived from physics-informed learning, capable of extracting equilibria and bifurcations from population dynamics models. The potential of the framework is showcased on three different example problems: a logistic model with outside inference, the Rosenzweig-MacArthur model and one model from a recent population dynamics paper. The key idea behind the method is having a neural network learn the dynamics of a free-parameter ODE system, and then using the derivatives of the neural network to find equilibria and bifurcations. We, a bit clunkily, refer to these networks as physics-informed neural networks with free parameters and variable initial conditions. In addition to these examples, we also survey how and where these neural networks could be further utilized in the context of population dynamics. To answer the how, we document our experiences choosing good hyperparameters for these networks, even venturing into previously unexplored territory. For the where, we suggest potentially useful neural network frameworks to answer questions from an external survey concerning contemporary open questions in population dynamics. The research of the work is preceded by a short dive into qualitative population dynamics, where we ponder which problems we want to solve and which tools we have available for that. Special attention is paid to parameter sensitivity analysis of ordinary differential equation systems through bifurcation theory. We also provide a beginner-friendly introduction to deep learning, so that the research can be understood even by someone not previously familiar with the field. The work was written, and all included contents were selected, with the goal of establishing a basis for future research.
  • Detrois, Kira Elaine (2023)
    Background/Objectives: Various studies have shown the advantage of incorporating polygenic risk scores (PRSs) in models with classic risk factors. However, systematic comparisons of PRSs with non-genetic factors are lacking. In particular, many studies on PRSs do not even report the predictive performance of the confounders included in the model, such as age and sex, which are already very predictive for most diseases. We looked at the ability of PRSs to predict the onset of 18 diseases in FinnGen R8 (N=342,499) and compared PRSs with the known non-genetic risk factors age, sex, Education, and Charlson Comorbidity Index (CCI). Methods: We set up individual studies for the 18 diseases. A single study consisted of an exposure (1999-2009), a washout (2009-2011), and an observation period (2011-2019). Eligible individuals could not have the selected disease of interest inside the disease-free period, which ranged from birth until the beginning of the observation period. We then defined the case and control status based on the diagnoses in the observation period and calculated the phenotypic scores during the exposure period. The PRSs were calculated using MegaPRS and the latest publicly available genome-wide association study summary statistics. We then fitted separate Cox proportional hazards models for each disease to predict disease onset during the observation period. Results: In FinnGen, the model's predictive ability (c-index) with all predictors ranged from 0.565 (95% CI: 0.552-0.576) for Acute Appendicitis to 0.838 (95% CI: 0.834-0.841) for Atrial Fibrillation. The PRSs outperformed the phenotypic predictors, CCI and Education, for 6/18 diseases and still significantly enhanced onset prediction for 13/18 diseases when added to a model with only non-genetic predictors. Conclusion: Overall, we showed that for many diseases PRSs add predictive power over commonly used predictors such as age, sex, CCI, and Education.
    However, many important challenges must be addressed before implementing PRSs in clinical practice. Notably, we will need disease-specific cost-benefit analyses and studies to assess the direct impact of including PRSs in clinical use. Nonetheless, as more research is being conducted, PRSs could play an increasingly valuable role in identifying individuals at higher risk for certain diseases and enabling targeted interventions to improve health outcomes.
  • Dias, Diogo (2022)
    One of the biggest hurdles in cancer patient care is the lack of response to treatment. With the support of high-throughput drug screening, it is nowadays feasible to conduct vast numbers of drug sensitivity assays, aiding in the identification of samples that are sensitive or resistant to chemical perturbations. In an oncology setting, drug screening is the process by which patient cells are examined experimentally for response and activity to distinct drugs and analysed via dose-response curve fitting. However, reproducing and replicating drug screening outcomes with high confidence has proved to be a challenge that needs to be addressed. Inefficient experimental designs and a lack of standard protocols to control both biological and technical factors in such cell-based assays are at the core of a steep influx of experimental biases. Hence, additional effort is needed to provide less biased estimates of drug effects. This thesis work focuses on reducing erroneous inferences (i.e., bias) from dose-response data in the curve fitting step, thereby improving the reproducibility of drug sensitivity screening through efficient dose selection. A novel two-step experimental design is introduced which significantly improves the estimation of dose-response curves while keeping the amount of cellular and chemical materials feasible.
  • Varvarà, Giulia (2022)
    Species factories are defined as times and places in the fossil record where and when an exceptionally large number of new species occurs. While several tailored solutions for the mammalian record have been proposed, how to identify species factories computationally in a standardized way is still an open question. To quantify what is exceptional, we first need to quantify what is regular. One of the main challenges in this identification process is to account for sampling unevenness, which depends on several methodological decisions, including the scale of the analysis (aggregation radius). In this thesis we used Capture-Mark-Recapture methods (CMR) with spatial aggregation guided by network modelling, to estimate the sampling probabilities for the species in the NOW database of mammalian fossil occurrences. Since the mammalian record is sparse and most localities include only a few species, we coupled CMR with tailored spatial aggregation approaches to estimate the sampling probabilities. We then used these sampling probabilities to quantify background speciation rates and assess what rates are abnormal. We represented aggregated fossil data as a bipartite network and used community detection to evaluate how the choice of an aggregation radius impacts the modular structure. After aggregating the data according to the radius chosen using network analysis, we estimated sampling probabilities using CMR. These probabilities allow the adjustment for sampling unevenness so that the difference in findings can be compared across locations and cannot be due to differences in sampling. We identified as species factories the locations with origination rate in the highest 5% after adjustment per time unit.
    Once the species factories had been identified, we looked for paleoecological patterns in these places that may be lacking elsewhere, finding that species factories present a lower number of findings and of different species among findings, but a higher ratio between the number of different species and of total findings than the rest of the locations. This would indicate that, even if species factories might accommodate fewer species, they present a higher diversity. To make sure these results were not only due to chance, we performed the same analysis on 100 randomized experiments obtained using a modified version of the Curveball Algorithm and compared the values obtained from the original dataset and the ones obtained from the randomized ones. This comparison showed us that species factories tend to have more extreme values than the ones obtained through randomization, which would indicate that species factories present specific paleoecological patterns that are not present in other locations.
  • Dovydas, Kičiatovas (2021)
    Cancer cells accumulate somatic mutations in their DNA throughout their lifetime. The advances in cancer prevention and treatment methods call for a deeper understanding of carcinogenesis on the genetic sequence level. Mutational signatures present a novel and promising way to capture somatic mutation patterns and define their causes, making it possible to summarize the mutational landscape of cancer as a combination of distinct mutagenic processes acting with different levels of strength. While the majority of previous studies assume an additive relationship between the mutational processes, this Master's thesis provides tentative evidence that contemporary methods with additivity constraints, e.g. non-negative matrix factorization (NMF), are not sufficient to comprehensively explain the observed mutations in cancer genomes, and that the observed deviations are not random. To quantify these residues, two metrics are defined (additive and multiplicative residues), and hierarchical clustering algorithms are used to identify cancer subsets with similar residual profiles. It is shown that in certain cancer sample subsets there is a systematic mutational burden overestimation that can only be solved by a multiplicatively acting process, as well as non-random underestimation, requiring additional mutational signatures. Here an extension to the additive mutational signature model is proposed: a probabilistic model that incorporates a selectively active modulatory mutational process that is able to act in a multiplicative manner together with the known mutational signatures, reducing systematic variability.
  • Leppiniemi, Samuel Albert (2023)
    High-grade serous carcinoma (HGSC) is a highly lethal cancer type characterised by high genomic instability and frequent copy number alterations. This study examines the relationships between genetic variants in tumour germline and gene expression levels to obtain a better understanding of gene regulation in HGSC. This would then improve knowledge of the cancer mechanisms in order to find, for example, potential new treatment targets and biomarkers. The aim is to find significantly associated variant-gene pairs in HGSC. Expression quantitative trait loci (eQTL) analysis is a well-suited method to explore these associations. It is also suitable for analysing variants located in non-coding genomic regions, which previous genome-wide association studies have indicated to contain many disease-linked germline variants. The current eQTL analysis methods are, however, not applicable for association testing between genes and variants in the context of HGSC because of the special genomic features of the cancer. Therefore, a new eQTL analysis approach, SegmentQTL, was developed for this study to accommodate the copy-number-driven nature of the disease. Careful input processing is of particular importance in eQTL analysis, as it has a notable effect on the number of significantly associated variant-gene pairs. It is also relevant to maintain adequate statistical power, which affects the reliability of the findings. In all, this study uses eQTL analysis to uncover variant-gene associations. This helps to improve knowledge of gene regulation mechanisms in HGSC in order to find new treatments. To apply the analysis to the HGSC data, a novel eQTL analysis method was developed. Additionally, appropriate input processing is important prior to running the analysis to ensure reliable results.
  • Soukainen, Arttu (2023)
    Insect pests substantially impact global agriculture, and pest control is essential for global food production. However, some pest control measures, such as intensive insecticide use, can have adverse ecological and economic effects. Consequently, there is a growing need for advanced pest management tools that can be integrated into intelligent farming strategies and precision agriculture. This study explores the potential of a machine learning tool to automatically identify and quantify fruit fly pests from images in the context of Ghanaian mango orchards in West Africa. Fruit flies pose a special challenge for computer vision-based deep learning due to their small size and taxonomic diversity. Insects were captured using sticky traps together with attractant pheromones. The traps were then photographed in the field using regular smartphone cameras. The image data contained 1434 examples of the targeted pests, and it was used to train a convolutional neural network (CNN) model for counting and classifying the fruit flies into two different genera: Bactrocera and Ceratitis. High-resolution images were used to train the YOLOv7 object detection algorithm. The training involved manual hyperparameter optimization emphasizing pre-selected hyperparameters. The focus was on employing appropriate evaluation metrics during model training. The final model had a mean average precision (mAP) of 0.746 and was able to identify 82% of the Ceratitis and 70% of the Bactrocera examples in the validation data. The results demonstrate the advantages of a computer vision-based solution for automated multi-class insect identification and counting. Low-effort data collection using smartphones is sufficient to train a modern CNN model efficiently, even with a limited number of field images. Further research is needed to effectively integrate this technology into decision-making systems for precision agriculture in tropical Africa.
Nevertheless, this work serves as a proof of concept, showcasing the serious potential of computer vision-based models in automated or semi-automated pest monitoring. Such models can enable new strategies for monitoring pest populations and targeting pest control methods. The same technology has potential not only in agriculture but in insect monitoring in general.
  • Gu, Chunhao (2021)
    Along with the rapid scale-up of biological knowledge bases, mechanistic models, especially metabolic network models, are becoming more accurate. At the same time, machine learning has been widely applied in biomedical research as a large amount of omics data has become available in recent years. It is therefore worthwhile to study the integration of metabolic network models and machine learning, as such a method may lead to biological discoveries. In 2019, MIT researchers proposed an approach called 'White-Box Machine Learning', in which they used fluxomics data derived from in silico simulation of a genome-scale metabolic (GEM) model and experimental antibiotic lethality measurements (IC50 values) of E. coli under hundreds of screening conditions to train a linear regression-based machine learning model, and then extracted the coefficients of the model to discover metabolic mechanisms involved in antibiotic lethality. In this thesis, we propose a new approach based on the framework of 'White-Box Machine Learning'. We replace the GEM model with another state-of-the-art metabolic network model, the expression and thermodynamics flux (ETFL) formulation. We also replace the linear regression-based machine learning model with a novel nonlinear regression model, the multi-task elastic net multilayer perceptron (MTENMLP). We apply the approach to the same experimental antibiotic lethality measurements (IC50 values) of E. coli from the 'White-Box Machine Learning' study. Finally, we validate their conclusions and make some new discoveries. In particular, our results show that ppGpp metabolism is active under antibiotic stress, which is supported by some literature. This implies that our approach has the potential to make a biological discovery even when no candidate conclusion is known in advance.
  • Balaz, Melanie (2023)
    Gene editing holds tremendous potential for treating a variety of diseases, but concerns about safety, particularly the risk of edited cells becoming cancerous, must be addressed. This thesis explores a safety mechanism to prevent unwanted cell proliferation and tumor formation in induced pluripotent stem cells that have been edited for use in gene therapy. The mechanism is based on the genetic disruption (knockout) of the thymidylate synthase gene (TYMS), which encodes the only enzyme responsible for synthesizing deoxythymidine monophosphate (dTMP), an essential building block of DNA. Without dTMP, cells cannot successfully proliferate, while RNA synthesis remains unaffected. Through RNA sequencing analysis, we investigate the early response of TYMS knockout cells to dTMP withdrawal and find evidence of the activation of apoptosis and stress pathways, as well as differentiation and changes in the cell cycle. In addition, we demonstrate the effectiveness of the TYMS knockout mechanism in preventing proliferation of cancerous cells in a laboratory setting.
  • Riikonen, Juha (2023)
    Population structure refers to the patterns of genetic variation within and between populations, which arise from various evolutionary processes such as genetic drift, natural selection and migration. Understanding this structure in human populations provides insights into our own evolutionary history and past migration patterns. Controlling for underlying population structure is also an essential step in genetic association analyses to ensure that the associations between genetic variants and traits of interest are not confounded by differences in ancestry. Results from such analyses are essential for the research and development of personalised medicine. Principal component analysis (PCA) is a method that has been widely used to study the patterns of genetic variability within populations. In this study, PCA is applied to a genotype data set of 38,113 samples from individuals born in Finland, using data from the Finnish study cohorts FINRISK, GeneRISK, FinHealth 2017 and Health 2000. The first ten principal components are extracted using PLINK 2.0 software. Novel discoveries of association between genetic variants and a disease often motivate further studies on the geographical distribution of such risk variants. Here, the genetic population structure is proposed as an alternative, higher-dimensional space for studying the distribution of genetic variants within a population. This study presents a framework for quantifying and visualising the allele frequency variability across the genetic structure defined by principal components. Using an empirical Bayes model, the posterior minor allele frequency is estimated in discrete areas of the principal component space. The variability of these estimates is visualised as heatmaps, using a colouring scheme that provides statistical guarantees for frequency differences between different colours. The framework is demonstrated on five biallelic variants known to be associated with a disease or a disorder.
The results show that visualising the pairwise components complemented with data on sample birth location reveals the major patterns of genetic variability within the Finnish population. The framework is able to distinguish areas in the genetic structure with differing levels of allele frequency, and visualise this variability as heatmaps that enable meaningful visual interpretation. The levels of allele frequency differences found in the principal component space are comparable to the differences found geographically, which suggests that studying individual variants within the genetic structure on top of geographical frequency maps can provide additional information on their distribution in a population.
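The empirical Bayes step mentioned in this abstract can be sketched in a simplified form. The example below assumes a Beta prior shared by all bins of the principal component space, fitted by simple moment matching on the raw bin frequencies, and a binomial likelihood for the minor allele count in each bin. The bin counts, the function names, and the moment-matching choice are hypothetical; the thesis does not specify its model at this level of detail.

```python
from statistics import mean, pvariance


def fit_beta_prior(minor_counts, allele_totals):
    """Estimate Beta(alpha, beta) from raw per-bin frequencies by moment matching."""
    freqs = [k / n for k, n in zip(minor_counts, allele_totals)]
    m = mean(freqs)
    v = pvariance(freqs)
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moment matching requires 0 < variance < m * (1 - m)")
    strength = m * (1 - m) / v - 1  # alpha + beta, the prior 'sample size'
    return m * strength, (1 - m) * strength


def posterior_maf(minor_count, allele_total, alpha, beta):
    """Posterior mean of the minor allele frequency under a Beta-binomial model."""
    return (alpha + minor_count) / (alpha + beta + allele_total)


# Hypothetical bins of the PC space: minor allele counts and total alleles
# (two alleles per sample for a biallelic variant).
minor_counts = [2, 30, 45, 1]
allele_totals = [20, 400, 300, 4]

alpha, beta = fit_beta_prior(minor_counts, allele_totals)
posteriors = [posterior_maf(k, n, alpha, beta)
              for k, n in zip(minor_counts, allele_totals)]
```

Under these assumptions, bins with few samples are pulled strongly towards the overall frequency, while well-populated bins stay close to their observed frequency, which keeps sparsely populated regions of the component space from dominating a heatmap.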
  • Zogjani, Yllza (2023)
    The increasing demand for comprehensive datasets to address complex diseases has resulted in the widespread popularity of biobank-based research. However, the collection of biobank-level data may be susceptible to biases when fundamental aspects of study design, such as the sampling approach, are overlooked. FinnGen is a large-scale cohort study aiming to improve diagnoses and prevent diseases through genetic research by combining biobank data with registry data. However, FinnGen’s hospital-based recruitment strategy introduces selection bias and makes the cohort epidemiologically less representative of its sampling population. In this study, we examine the profound impact of selection bias in FinnGen. We use well-established epidemiological methods and leverage representative data on the Finnish population to try to correct for the bias. By comparing key demographic characteristics and association statistics of interest between FinnGen and a comprehensive registry-based study, FinRegistry, we highlight the extent to which selection bias within FinnGen results in distorted association estimates and a dataset that is highly non-representative of its underlying population. In response to these findings, we compute Iterative Proportional Fitting (IPF) weights to estimate association statistics that are representative of the true sampling population of FinnGen and unaffected by selection bias. By comparing weighted associations estimated in FinnGen with associations estimated using FinRegistry data, we infer that the use of our IPF weights mitigates volunteer bias in FinnGen.
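Iterative Proportional Fitting, as named in this abstract, can be illustrated with a small raking sketch: sample weights are rescaled one variable at a time until the weighted totals match known population margins. The variables, categories, and target totals below are hypothetical and are not taken from FinnGen or FinRegistry; the function names are likewise illustrative.

```python
def ipf_weights(rows, targets, n_iter=1000, tol=1e-10):
    """Rake unit weights so that weighted category totals match the target margins.

    rows:    list of dicts mapping variable name to category for each sample
    targets: {variable: {category: target total in the population}}
    """
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        max_change = 0.0
        for var, target in targets.items():
            # Current weighted total per category of this variable.
            totals = {}
            for row, w in zip(rows, weights):
                totals[row[var]] = totals.get(row[var], 0.0) + w
            # Scale each sample so its category total hits the target.
            for i, row in enumerate(rows):
                factor = target[row[var]] / totals[row[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:  # all margins already matched
            break
    return weights


def weighted_totals(rows, weights, var):
    """Weighted total per category of one variable."""
    totals = {}
    for row, w in zip(rows, weights):
        totals[row[var]] = totals.get(row[var], 0.0) + w
    return totals


# Hypothetical sample that over-represents older women.
rows = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
]
# Hypothetical population margins (both sum to 100).
targets = {"sex": {"F": 50.0, "M": 50.0},
           "age": {"young": 60.0, "old": 40.0}}

weights = ipf_weights(rows, targets)
sex_totals = weighted_totals(rows, weights, "sex")
age_totals = weighted_totals(rows, weights, "age")
```

The resulting weights could then be passed to any weighted estimator. Note that raking only aligns the margins it is given, so the choice of adjustment variables determines which part of the selection bias can be corrected.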