Browsing by master's degree program "Magisterprogrammet i informatik inom livsvetenskaperna"
Now showing items 21-34 of 34
-
(2021) Cancer cells accumulate somatic mutations in their DNA throughout their lifetime. The advances in cancer prevention and treatment methods call for a deeper understanding of carcinogenesis on the genetic sequence level. Mutational signatures present a novel and promising way to capture somatic mutation patterns and define their causes, making it possible to summarize the mutational landscape of cancer as a combination of distinct mutagenic processes acting with different levels of strength. While the majority of previous studies assume an additive relationship between the mutational processes, this Master’s thesis provides tentative evidence that contemporary methods with additivity constraints, e.g. non-negative matrix factorization (NMF), are not sufficient to comprehensively explain the observed mutations in cancer genomes and that the observed deviations are not random. To quantify these residues, two metrics are defined – additive and multiplicative residues – and hierarchical clustering algorithms are used to identify cancer subsets with similar residual profiles. It is shown that in certain cancer sample subsets there is a systematic overestimation of mutational burden that can only be resolved by a multiplicatively acting process, as well as non-random underestimation, which requires additional mutational signatures. Here an extension to the additive mutational signature model is proposed – a probabilistic model that incorporates a selectively active modulatory mutational process that is able to act in a multiplicative manner together with the known mutational signatures, reducing systematic variability.
-
(2023) High-grade serous carcinoma (HGSC) is a highly lethal cancer type characterised by high genomic instability and frequent copy number alterations. This study examines the relationships between genetic variants in tumour germline and gene expression levels to obtain a better understanding of gene regulation in HGSC. This would then improve knowledge of the cancer mechanisms in order to find, for example, potential new treatment targets and biomarkers. The aim is to find significantly associated variant-gene pairs in HGSC. Expression quantitative trait loci (eQTL) analysis is a well-suited method to explore these associations. eQTL analysis is also suitable for analysing variants located in non-coding genomic regions, which previous genome-wide association studies indicate contain many disease-linked germline variants. The current eQTL analysis methods are, however, not applicable for association testing between genes and variants in the context of HGSC because of the special genomic features of the cancer. Therefore, a new eQTL analysis approach, SegmentQTL, was developed for this study to accommodate the copy-number-driven nature of the disease. Careful input processing is of particular importance in eQTL analysis as it has a notable effect on the number of significantly associated variant-gene pairs. It is also relevant to maintain adequate statistical power, which affects the reliability of the findings. In all, this study uses eQTL analysis to uncover variant-gene associations. This helps to improve knowledge of gene regulation mechanisms in HGSC in order to find new treatments. To apply the analysis to the HGSC data, a novel eQTL analysis method was developed. Additionally, appropriate input processing is important prior to running the analysis to ensure reliable results.
-
(2023) Insect pests substantially impact global agriculture, and pest control is essential for global food production. However, some pest control measures, such as intensive insecticide use, can have adverse ecological and economic effects. Consequently, there is a growing need for advanced pest management tools that can be integrated into intelligent farming strategies and precision agriculture. This study explores the potential of a machine learning tool to automatically identify and quantify fruit fly pests from images in the context of Ghanaian mango orchards in West Africa. Fruit flies pose a special challenge for computer vision-based deep learning due to their small size and taxonomic diversity. Insects were captured using sticky traps together with attractant pheromones. The traps were then photographed in the field using regular smartphone cameras. The image data contained 1434 examples of the targeted pests, and it was used to train a convolutional neural network (CNN) model for counting and classifying the fruit flies into two different genera: Bactrocera and Ceratitis. High-resolution images were used to train the YOLOv7 object detection algorithm. The training involved manual hyperparameter optimization emphasizing pre-selected hyperparameters. The focus was on employing appropriate evaluation metrics during model training. The final model had a mean average precision (mAP) of 0.746 and was able to identify 82% of the Ceratitis and 70% of the Bactrocera examples in the validation data. The results highlight the advantages of a computer vision-based solution for automated multi-class insect identification and counting. Low-effort data collection using smartphones is sufficient to train a modern CNN model efficiently, even with a limited number of field images. Further research is needed to effectively integrate this technology into decision-making systems for precision agriculture in tropical Africa.
Nevertheless, this work serves as a proof of concept, showcasing the potential of computer vision-based models in automated or semi-automated pest monitoring. Such models can enable new strategies for monitoring pest populations and targeting pest control methods. The same technology has potential not only in agriculture but in insect monitoring in general.
-
(2021) Along with the rapid scale-up of biological knowledge bases, mechanistic models, especially metabolic network models, are becoming more accurate. At the same time, machine learning has been widely applied in biomedical research as large amounts of omics data have become available in recent years. It is therefore worthwhile to study the integration of metabolic network models and machine learning, as such a method may lead to biological discoveries. In 2019, MIT researchers proposed an approach called 'White-Box Machine Learning': they used fluxomics data derived from in silico simulation of a genome-scale metabolic (GEM) model and experimental antibiotic lethality measurements (IC50 values) of E. coli under hundreds of screening conditions to train a linear regression-based machine learning model, and they extracted the coefficients of the model to discover metabolic mechanisms involved in antibiotic lethality. In this thesis, we propose a new approach based on the 'White-Box Machine Learning' framework. We replace the GEM model with another state-of-the-art metabolic network model, the expression and thermodynamics flux (ETFL) formulation. We also replace the linear regression-based machine learning model with a novel nonlinear regression model, the multi-task elastic net multilayer perceptron (MTENMLP). We apply the approach to the same experimental antibiotic lethality measurements (IC50 values) of E. coli from the 'White-Box Machine Learning' study. Finally, we validate their conclusions and make some new discoveries. Specifically, our results show that ppGpp metabolism is active under antibiotic stress, which is supported by some literature. This implies that our approach has the potential to make a biological discovery even when no candidate conclusion is known in advance.
-
(2023) Gene editing holds tremendous potential for treating a variety of diseases, but concerns about safety, particularly the risk of edited cells becoming cancerous, must be addressed. This thesis explores a safety mechanism to prevent unwanted cell proliferation and tumor formation in induced pluripotent stem cells that have been edited for use in gene therapy. The mechanism is based on the genetic disruption (knockout) of the thymidylate synthase gene (TYMS), which encodes the only enzyme responsible for synthesizing deoxythymidine monophosphate (dTMP), an essential building block of DNA. Without dTMP, cells cannot successfully proliferate, while RNA synthesis remains unaffected. Through RNA sequencing analysis, we investigate the early response of TYMS knockout cells to dTMP withdrawal and find evidence of the activation of apoptosis and stress pathways, as well as differentiation and changes in the cell cycle. In addition, we demonstrate the effectiveness of the TYMS knockout mechanism in preventing proliferation of cancerous cells in a laboratory setting.
-
(2023) Population structure refers to the patterns of genetic variation within and between populations, which arise from various evolutionary processes such as genetic drift, natural selection and migration. Understanding this structure in human populations provides insights into our own evolutionary history and past migration patterns. Controlling for underlying population structure is also an essential step in genetic association analyses to ensure that the associations between genetic variants and traits of interest are not confounded by differences in ancestry. Results from such analyses are essential for the research and development of personalised medicine. Principal component analysis (PCA) is a method that has been widely used to study the patterns of genetic variability within populations. In this study, PCA is applied to a genotype data set of 38,113 samples born in Finland using data from Finnish study cohorts FINRISK, GeneRISK, FinHealth 2017 and Health 2000. The first ten principal components are extracted using PLINK 2.0 software. Novel discoveries of association between genetic variants and a disease often motivate further studies on the geographical distribution of such risk variants. Here, the genetic population structure is proposed as an alternative, higher dimensional space for studying the distribution of genetic variants within a population. This study presents a framework for quantifying and visualising the allele frequency variability across the genetic structure defined by principal components. Using an empirical Bayes model, the posterior minor allele frequency is estimated in discrete areas of the principal component space. The variability of these estimates is visualised as heatmaps, using a colouring scheme that provides statistical guarantees for frequency differences between different colours. The framework is demonstrated on five biallelic variants known to be associated with a disease or a disorder.
The results show that visualising the pairwise components complemented with data on sample birth location reveals the major patterns of genetic variability within the Finnish population. The framework is able to distinguish areas in the genetic structure with differing levels of allele frequency, and visualise this variability as heatmaps that enable meaningful visual interpretation. The levels of allele frequency differences found in the principal component space are comparable to the differences found geographically, which suggests that studying individual variants within the genetic structure on top of geographical frequency maps can provide additional information on their distribution in a population.
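As a rough illustration of the empirical Bayes step mentioned in this abstract, the sketch below shrinks per-area minor allele frequency estimates toward a shared Beta prior. The Beta-binomial form, the method-of-moments prior fit, and all names and counts are assumptions made for illustration; the thesis may use a different model.

```python
from statistics import mean, pvariance


def fit_beta_prior(freqs):
    """Method-of-moments Beta(a, b) fit to raw per-area frequencies (an assumed, simple choice)."""
    m = mean(freqs)
    v = pvariance(freqs)
    # The moment fit is only defined when 0 < variance < m * (1 - m)
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("method-of-moments fit is undefined for these frequencies")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def posterior_maf(minor_count, n_samples, a, b):
    """Posterior mean frequency given minor_count minor alleles among 2 * n_samples alleles."""
    n_alleles = 2 * n_samples  # biallelic variant, diploid samples
    return (a + minor_count) / (a + b + n_alleles)


# Hypothetical areas of the principal component space: (minor allele count, number of samples)
areas = [(12, 100), (3, 10), (40, 400), (30, 150)]
raw = [k / (2 * n) for k, n in areas]
a, b = fit_beta_prior(raw)
posterior = [posterior_maf(k, n, a, b) for k, n in areas]

for (k, n), r, p in zip(areas, raw, posterior):
    print(f"n={n:4d}  raw={r:.3f}  posterior={p:.3f}")
```

Areas with few samples are pulled more strongly toward the prior mean, which is the usual motivation for estimating frequencies this way before colouring a heatmap.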
-
(2024) Acute myeloid leukemia (AML) is a disease in which blood cell production is severely disrupted. Cell count and morphological analysis from bone marrow (BM) samples are key in the diagnosis of AML. Recent advances in computer vision have led to algorithms developed at the Hematoscope Lab that can automatically classify cells from these BM samples and calculate various cell-level statistics. This thesis investigated the use of cytomorphological data along with standard clinical data to predict progression-free survival (PFS). A benchmark study using penalized Cox regression, random survival forests, and survival support vector machines was conducted to study the utility of cytomorphology data. As features greatly outnumber samples, the methods are further compared over three feature filtering methods based on Spearman’s correlation coefficient, conditional Cox screening, and mutual information. In a dataset from the national VenEx trial, the penalized Cox regression method with ElasticNet penalization supplemented with Cox conditional screening was found to perform best in the nested CV benchmarking. A post-hoc dissection of two best-performing Cox models revealed potentially predictive cytomorphological features, while disease etiology and patient age were likewise important.
-
(2023) The increasing demand for comprehensive datasets to address complex diseases has resulted in a widespread popularity of biobank-based research. However, the collection of biobank-level data may be susceptible to biases when fundamental aspects of study design, such as sampling approach, are overlooked. FinnGen is a large-scale cohort study aiming to improve diagnoses and prevent diseases through genetic research by combining biobank data with registry data. However, FinnGen’s hospital-based recruitment strategy exposes FinnGen to selection bias and thus makes it epidemiologically less representative of its sampling population. In this study, we examine the profound impact of selection bias in FinnGen. We use well-established epidemiological methods and leverage representative data on the Finnish population to try to correct for the bias. By comparing key demographic characteristics and association statistics of interest between FinnGen and a comprehensive registry-based study, FinRegistry, we highlight the extent to which selection bias within FinnGen results in distorted association estimates and a dataset that is highly non-representative of its underlying population. In response to these findings, we compute Iterative Proportional Fitting (IPF) weights to estimate association statistics that are representative of the true sampling population of FinnGen and unaffected by selection bias. By comparing weighted associations estimated in FinnGen with associations estimated using FinRegistry data, we infer that the use of our IPF weights mitigates volunteer bias in FinnGen.
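To make the idea of Iterative Proportional Fitting concrete, here is a minimal sketch that rakes a two-way table of sample counts to known population margins and derives one weight per cell. The variables (sex by age group), the counts, and the function name `ipf_weights` are hypothetical and are not taken from the thesis, which presumably uses more variables and real registry margins.

```python
def ipf_weights(sample_counts, row_targets, col_targets, n_iter=100, tol=1e-10):
    """Rake a two-way table of (non-zero) sample counts to population margins.

    Returns a table of per-cell weights: fitted count / observed count.
    """
    n_rows, n_cols = len(sample_counts), len(sample_counts[0])
    fitted = [[float(c) for c in row] for row in sample_counts]
    for _ in range(n_iter):
        # Scale each row so that it sums to its population row total
        for i in range(n_rows):
            s = sum(fitted[i])
            fitted[i] = [x * row_targets[i] / s for x in fitted[i]]
        # Scale each column so that it sums to its population column total
        for j in range(n_cols):
            s = sum(fitted[i][j] for i in range(n_rows))
            for i in range(n_rows):
                fitted[i][j] *= col_targets[j] / s
        # Columns are exact after the column step, so only rows need checking
        row_err = max(abs(sum(fitted[i]) - row_targets[i]) for i in range(n_rows))
        if row_err < tol:
            break
    return [
        [fitted[i][j] / sample_counts[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical sample: rows = sex, columns = age group (younger, older)
sample = [[30, 70], [20, 80]]
row_targets = [500, 500]   # assumed population totals by sex
col_targets = [600, 400]   # assumed population totals by age group
weights = ipf_weights(sample, row_targets, col_targets)
```

Each participant then carries the weight of their cell, and weighted analyses reproduce the population margins even though the sample over-represents the older group.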
-
(2023) ScRNA-seq captures a static picture of a cell's transcriptome including abundances of unspliced and spliced RNA. RNA velocity methods offer the opportunity to infer future RNA abundances and thus future states of a cell based on the temporal change of these unspliced and spliced RNA. Early RNA velocity methods have shed light on transcriptional dynamics in many biological processes. However, due to strict assumptions in the underlying model, these methods are not reliable when analysing and inferring velocity for genes with complex expression dynamics such as genes with transcriptional boosts. These genes can for example be observed in erythropoietic and hematopoietic data. Several new RNA velocity methods have been proposed recently. Among these, veloVI and Pyro-Velocity both employ Bayesian methods to estimate the reaction rate and latent parameters. Thus the problem of estimating RNA velocity is turned into a posterior probability inference, which allows for more flexible inference of model parameters and the quantification of uncertainty. The objectives of this thesis were to investigate newly published RNA velocity methods, veloVI and Pyro-Velocity, in comparison to the established tool scVelo. To achieve this, we applied the methods to data obtained from scRNA-seq of healthy and ERCC6L2 disease bone marrow cells. ERCC6L2 disease can cause bone marrow failure with a risk of progression to acute myeloid leukemia with erythroid predominance. Specifically, we evaluated whether RNA velocity results reflect hematopoietic differentiation, whether genes with transcriptional boosts affect the velocity results, and whether RNA velocity analysis can indicate why erythropoiesis in ERCC6L2 disease is affected. We find that the new RNA velocity methods cannot produce velocity estimations that are fully in line with what is known of hematopoiesis in our data. Further, the results suggest that velocity estimations by veloVI are affected by genes with transcriptional boosts.
Moreover, RNA velocity methods examined in this thesis are not robust and cannot reliably predict cell transitions based on the estimated velocity. Subsequently, velocity estimations for disease data such as ERCC6L2 disease must be evaluated carefully before drawing any conclusion about the differentiation process. In conclusion, this thesis highlights the need for models that can model complex transcription kinetics. Still, as this field is rapidly growing and promising new methods are being developed, improvement of RNA velocity analysis, in general, is possible.
-
(2024) Modern developments in biology and bioinformatics have enabled unprecedented computational capabilities in developmental biology. Novel biological single cell-resolution assays, as well as advanced data processing algorithms and workflows, define the directions in which genomics research is moving in the 21st century. This thesis uses Single Cell Assay for Transposase Accessible Chromatin using Sequencing (scATAC-seq) high-throughput data to study gene regulatory principles responsible for the development of a distinct cellular lineage in the embryonic brainstem in mouse. In particular, a ventral area of the rhombomere 1 region (rV2) in the embryonic brainstem generates two antagonistic cellular lineages. These lineages, an inhibitory GABAergic lineage and an excitatory glutamatergic lineage, have been well characterized in recent literature. Functionally, the cells originating from these lineages have been shown to regulate fundamental, evolutionarily conserved behavioral traits that are also present in cognitively more complex organisms such as humans. Studying the cellular development of the two lineages is thus crucial for understanding the development of congenital disorders related to the higher-level behavioral traits. Coupled with auxiliary single cell mRNA sequencing (scRNA-seq) data analysis, the scATAC-seq data analysis conducted in this thesis first successfully rebuilds the lineage structure of the rV2 area in both scRNA-seq data and scATAC-seq data. Moreover, by the use of open-source data analysis libraries, the work presented in this thesis integrates the structure of the rV2 area across the two modalities, creating a multiomics data set in which both the transcriptomics and the chromatin accessibility landscapes of the rV2 lineages can be studied jointly. Previous research has described the molecular and genetic signature of the rV2 area.
In particular, the induction of the GABAergic rV2 lineage is known to depend on the presence of a combination of a few key transcription factors. The analysis of chromatin accessibility data obtained through the scATAC-seq assay enables the determination of regulatory elements putatively responsible for the activation of the Tal1 gene, a crucial fate-determining gene coding for one of the key transcription factors. Indeed, the work presented in this thesis shows that at least three proximal regulatory elements exhibit potential accessibility trends associated with elevated Tal1 gene expression. Through transcription factor footprinting analysis algorithms, this thesis finally predicts how a limited number of known transcription factors bind to the proximal Tal1 regulatory elements and orchestrate lineage-defining Tal1 activation in the rV2 area of the mouse brainstem. The thesis ends with a critical assessment of the analysis pipelines and computational tools used in the thesis, and suggests directions for research efforts that can computationally and biologically validate the observations and results of the thesis work.
-
(2024) Tobacco smoking has a huge impact on health, increasing the risk of cardiovascular diseases, respiratory diseases, and various types of cancer. Therefore, assessing a patient’s smoking history is crucial for identifying potential risk factors. Smoking also induces alterations in DNA methylation (DNAm). The large effect of smoking makes it a crucial confounding factor in epigenome-wide association studies (EWAS). However, smoking status information is not always available in the data. Even when it is available, it is not always reliable because it depends on self-reporting, which can cause bias in the analysis. DNAm can be used as an excellent biomarker for smoking since it can be measured in a cost-effective, non-invasive way through methylation arrays. Multiple DNAm-based smoking predictors are already available; some return a smoking score associated with smoking, and others return a smoking status, that is, whether the individual is a current, never, or former smoker. These predictors are based on the Infinium 450k array from Illumina, and there is no available predictor for the Infinium Methylation EPIC array, which contains almost twice as many CpG sites as the previous one. We developed two machine learning models (Model1, Model2) that can classify individuals into three smoking statuses: never-smoker, current-smoker, and former-smoker. Both models were LASSO logistic regressors trained on EPIC array DNAm data of the Young Finns Study cohort. Model1 was trained on the beta matrix pre-processed with the standard minfi pipeline, while Model2 was trained on a beta matrix derived from QN normalized intensity values. Model1 and Model2 were both evaluated on an independent test dataset, the Finnish Twin Cohort, resulting in overall accuracies of 57.4% and 64.29%, respectively. The models can separate the classes from each other with micro-average OvA AUCs of 0.79 and 0.81, respectively. They can distinguish never- and current-smoker categories with average OvO AUCs of 0.94 and 0.93, respectively.
Misclassifications aligned with the individuals’ smoking intensities and the methylation levels of the well-known smoking-associated CpG site, cg05575921.
-
(2023) Pathogens are everywhere in nature, so organisms have developed various defense mechanisms in order to defend themselves against the pathogens. Two of the defense mechanisms are known as resistance and tolerance. Resistance describes the host's ability to avoid being infected by the pathogen, while tolerance describes the host's ability to reduce the fitness loss caused by the infection. We assume that investing in resistance reduces the transmission rate of the pathogens and investing in tolerance reduces the host's virulence. Developing the defense mechanisms is costly to the host. In this thesis, we assume that the resources invested in resistance and tolerance are taken away from the host's fecundity. The independent but simultaneous evolution of resistance and tolerance is modeled with an SIS model. The model is studied with the methods of adaptive dynamics. We concentrate on finding continuously stable strategies, which serve as the evolutionary end points for the population. We vary the ecological parameters to determine which strategies are optimal for the host in different environments. We find that for low values of transmission rate, the hosts favor resistance over tolerance. When the transmission rate increases, resistance is traded for tolerance and the host benefits more from high tolerance. Low values of virulence result in tolerance being favored over resistance. Increasing virulence leads to a change in the defense mechanism, as for high values of virulence investing in resistance is more beneficial to the host. The same holds for recovery rate, as tolerance is favored for low values of recovery rate and exchanged for resistance when the recovery rate increases. Patterns and associations between resistance and tolerance are also studied. Positive correlation between resistance and tolerance is found with low values of transmission rate, low and high values of virulence and high values of recovery rate.
Resistance and tolerance correlate negatively with high values of transmission rate, intermediate values of virulence and low values of recovery rate.
-
(2022) The purpose of this thesis is to present and illustrate ways in which statistical uncertainty can be explained and visualised. In particular, the target audience for the communication of statistical uncertainty is readers with little or no prior experience of statistical concepts or methods. COVID-19 data are used as the application for presenting these visual communication methods. Communication about the COVID-19 infectious disease has had very diverse target audiences, but in communication about the progress of the epidemic that concerns, for example, the entire population of Finland, the essential task has been precisely to communicate with an audience that does not consist of experts in the field. The thesis is based on COVID-19 data and the infectious disease situation of 2020, when hardly any immunity to the disease had yet developed in the population. The beginning of the thesis introduces the SEIR infectious disease model, which describes the development of an epidemic in a population through four different disease stages. The SEIR model was also used in COVID-19 modelling in the early phase of the epidemic, since COVID-19 was thought to behave as an epidemic in the same way with respect to these four stages. In addition to introducing the model, the thesis briefly discusses how the parameters used in the model, such as the basic reproduction number, affect the development of the epidemic situation. The COVID-19 modelling of the Finnish Institute for Health and Welfare (Terveyden ja hyvinvoinnin laitos) is also presented from the perspective of the SEIR model and the development of case numbers in early 2020. Here, the effect on the epidemic situation of the restrictions in place in 2020, which reduced the number of contacts between individuals, is also shown through the reproduction number. Regarding statistical uncertainty, this thesis focuses on the causes of statistical uncertainty, since uncertainty can stem from a lack of available information or from its randomness. Understanding the underlying causes is essential for explaining and illustrating the overall picture and its parts.
The thesis considers in particular the uncertainty present in COVID-19 modelling and in testing for infections. In addition, the thesis examines ways of expressing statistical uncertainty, such as the sampling-related standard deviation or standard error and confidence intervals, and later, among other things, the visualisation and communication of these concepts. The communication of statistical uncertainty is presented especially through various visual graphs, such as box-and-whisker plots and scatter plots, while considering the benefits and challenges of the different graphs. Towards the end of the thesis, the factors affecting the interpretation of graphs are examined from a communication perspective, as are the goals of uncertainty communication, for example through the trust or emotions that the communication evokes. Finally, the possible challenges of presenting statistical uncertainty visually are summarised; these may arise, for example, from interpretations made by the target audience or from the use of irrelevant graphs.
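The four-stage SEIR model and the role of the basic reproduction number can be illustrated with a small simulation. The sketch below is a generic textbook SEIR model integrated with simple Euler steps; the population size, rates, and reproduction numbers are hypothetical values chosen for illustration and do not come from the thesis or from the Finnish Institute for Health and Welfare.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, rec0, days, dt=0.1):
    """Euler integration of a basic SEIR model; returns one (S, E, I, R) tuple per day."""
    n = s0 + e0 + i0 + rec0
    s, e, i, r = float(s0), float(e0), float(i0), float(rec0)
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt      # S -> E, driven by contacts
            new_infectious = sigma * e * dt          # E -> I, after the latent period
            new_recovered = gamma * i * dt           # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical parameters: 3-day latent period, 5-day infectious period
population = 1_000_000
sigma, gamma = 1 / 3, 1 / 5

# Basic reproduction number R0 = beta / gamma; fewer contacts lower beta and thus R0
no_restrictions = simulate_seir(2.5 * gamma, sigma, gamma, population - 10, 0, 10, 0, days=200)
with_restrictions = simulate_seir(1.5 * gamma, sigma, gamma, population - 10, 0, 10, 0, days=200)

peak_free = max(i for _, _, i, _ in no_restrictions)
peak_restricted = max(i for _, _, i, _ in with_restrictions)
print(f"peak infectious without restrictions: {peak_free:,.0f}")
print(f"peak infectious with restrictions:    {peak_restricted:,.0f}")
```

Comparing the two runs shows how lowering the reproduction number through contact restrictions flattens the peak, which is the kind of effect the abstract describes.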
-
(2024) Renewable energy is the key to a sustainable future in a world currently run by coal and oil, and one of these sources could be bioelectrochemical systems [McCormick et al., Energy Environ. Sci, 2015]. These differ from traditional renewable energy sources such as solar cells, whose production requires exotic materials or a relatively extensive manufacturing process [Ren et al., Solar Energy, 2020]. One type of bioelectrochemical system is the biophotovoltaic system, which utilizes solar energy and water to produce electrons or other reducing agents outside of the organism, which can then be harvested for external usage [McCormick et al., Energy Environ. Sci, 2015]. Efforts to improve the efficiency of this type of system have many different focuses, including substrate design, reactor design, and electrode properties [Anam et al., Sustainable Energy Fuels, 2021]. While these are important, there is another avenue to be explored, namely the exoelectrogenesis pathway itself [Okedi et al., bioRxiv, 2021]. This pathway has been explored briefly with Hilbert-Huang transforms to identify its oscillatory components, which have been partially mapped to photosystem II core expression [Okedi et al., bioRxiv, 2021]. In my analysis, I will use generated data from cyanobacteria which exhibit enhanced photosystem II and see if the exact mechanisms for this phenomenon can be captured. The data provided by the sequencing vendor comes in a FASTA Extension format, so the process and tools to translate this data into usable variant calling format files will be described. I will then describe the additional analyses: variant comparisons through strain concordance with gene comparisons, as well as phylogenetic trees. The first analysis compares a wild type to a mutated strain, and the subsequent analysis compares multiple wild type strains to each other.
Further analysis on phenotype expression compared to the variant calling will also be explored.