
Browsing by study line "Bioinformatics and Systems Medicine"


  • Suhonen, Sannimari (2023)
    Polygenic risk scores (PRSs) estimate the genetic risk of an individual for a certain polygenic disease trait by summing up the effects of multiple variants across the genome affecting the disease risk. Currently, PRSs are calculated from imputed array genotyping data, which is inexpensive to produce and use and has standard procedures and pipelines available. However, genotyping arrays are prone to ascertainment bias, which can also lead to biased PRS results in some populations. If PRSs are utilized in healthcare for screening rare diseases, usage of whole-genome sequencing (WGS) instead of array genotyping is desirable, because individual samples can also be analyzed easily. While high-coverage WGS is still significantly more expensive than array genotyping, low-coverage whole-genome sequencing (lcWGS) with imputation has been proposed as an alternative to genotyping arrays. In this project, the utility of imputed lcWGS data in PRS estimation compared to genotyping array data and the impact of the choice of imputation tool for lcWGS data were studied. Down-sampled WGS data with six different low coverages (0.1x-2x) was used to represent lcWGS data. Two different pipelines were used in genotype imputation and haplotype phasing: in the first one, pre-phasing and imputation were performed directly for the genotype likelihoods (GLs) calculated from the down-sampled data, whereas in the second one, the GLs were converted to genotype calls before imputation and phasing. In both pipelines, PRSs for 27 disease phenotypes were calculated from the imputed and phased lcWGS data. Imputation and PRS calculation accuracy of the two pipelines were calculated in relation to both genotyping array and high-coverage whole-genome sequencing (hcWGS) data. In both pipelines, imputation and PRS calculation accuracy increased when the down-sampled coverage increased. The second imputation and phasing pipeline led to better results in both imputation and PRS calculation accuracy. Some differences in PRS accuracy between different phenotypes were also detected. The results show patterns similar to those seen in other comparable publications. However, the imputation and PRS accuracy did not quite reach the levels reported in earlier studies, although possible limitations leading to the lower accuracy could be identified. The results also emphasize the importance of choosing suitable imputation and phasing methods for lcWGS data and suggest that methods and pipelines designed particularly for lcWGS should be developed and published.
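The summation that the abstract describes can be illustrated with a minimal sketch. It assumes a PRS is the sum of per-variant effect sizes multiplied by the effect-allele dosage (which may be fractional after imputation). The function name, variant identifiers, and numbers are hypothetical and are not taken from the thesis.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Return the weighted sum of effect-allele dosages.

    dosages:      dict mapping variant id -> dosage in [0, 2]
                  (imputed dosages may be fractional)
    effect_sizes: dict mapping variant id -> per-allele effect size
    Variants that are missing from `dosages` are skipped.
    """
    return sum(
        beta * dosages[variant]
        for variant, beta in effect_sizes.items()
        if variant in dosages
    )


# Hypothetical effect sizes and one hypothetical sample.
weights = {"var1": 0.12, "var2": -0.05, "var3": 0.30}
sample = {"var1": 2.0, "var2": 1.0, "var3": 0.4}

score = polygenic_risk_score(sample, weights)
print(round(score, 4))
```

Real pipelines typically also handle allele alignment, missing genotypes, and score normalization, none of which are shown here.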
  • Kortelainen, Milla (2023)
    Sequence alignment is a widely studied problem in the field of bioinformatics. The exact solution takes quadratic time to compute, and thus is not practical for long sequences. A number of heuristic approaches have been developed to overcome the quadratic time complexity. This thesis reviews the average-case time analysis of two such heuristics: banded alignment, by Ganesh and Sy in "Near-Linear Time Edit Distance for Indel Channels", WABI 2020, and seed-chain-extend, by Shaw and Yu in "Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic", Genome Research 2023. These heuristics reduce the quadratic average-case time complexity of sequence alignment to log-linear. The approach of the thesis is to outline the proofs of the original analyses and to provide supporting materials to aid the reader in studying them. The experiments of this thesis compare four different approaches to computing the exact match anchors of the seed-chain-extend sequence alignment heuristic. Exact match anchors based on a Bi-Directional Burrows-Wheeler Transformation (BDBWT), on the suffix-tree-based Mummer, and on Minimap2 are computed. The anchors are then given to a chaining algorithm to compare the performance of each anchoring technique. The qualities of the chains are compared using a Jaccard index applied to the sequences. The highest Jaccard index is obtained for the maximal exact match and the unique maximal exact match anchors of the Mummer and BDBWT approaches. Increasing the minimum length of the exact matches seems to increase the Jaccard index and reduce the running time of the chaining algorithm.
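To illustrate the general idea of banded alignment, the sketch below computes edit distance while filling only the dynamic-programming cells within a fixed distance of the main diagonal. It is a generic textbook-style banded DP, not the specific algorithm analysed by Ganesh and Sy. The function name and the `band` parameter are assumptions made for this example.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to DP cells (i, j) with |i - j| <= band.

    Runs in O(len(a) * band) time. The result equals the true edit
    distance whenever that distance is at most `band`; otherwise it is
    an upper bound. Returns None when the length difference alone
    exceeds the band, so no alignment fits inside it.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    width = 2 * band + 1
    # Row i stores cell (i, j) at index k = j - i + band.
    prev = [inf] * width
    for k in range(width):
        j = k - band
        if 0 <= j <= m:
            prev[k] = j  # row 0: j insertions
    for i in range(1, n + 1):
        cur = [inf] * width
        for k in range(width):
            j = i + k - band
            if j < 0 or j > m:
                continue
            if j == 0:
                cur[k] = i  # column 0: i deletions
                continue
            # (i-1, j-1) sits at index k of the previous row.
            substitute = prev[k] + (a[i - 1] != b[j - 1])
            # (i-1, j) sits at index k + 1 of the previous row.
            delete = prev[k + 1] + 1 if k + 1 < width else inf
            # (i, j-1) sits at index k - 1 of the current row.
            insert = cur[k - 1] + 1 if k >= 1 else inf
            cur[k] = min(substitute, delete, insert)
        prev = cur
    return prev[m - n + band]


print(banded_edit_distance("kitten", "sitting", 3))
```

With `band` equal to the longer sequence length this reduces to the full quadratic computation; the saving comes from choosing a band much narrower than the sequences.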
  • Dovydas, Kičiatovas (2021)
    Cancer cells accumulate somatic mutations in their DNA throughout their lifetime. The advances in cancer prevention and treatment methods call for a deeper understanding of carcinogenesis on the genetic sequence level. Mutational signatures present a novel and promising way to capture somatic mutation patterns and define their causes, allowing the mutational landscape of cancer to be summarized as a combination of distinct mutagenic processes acting with different levels of strength. While the majority of previous studies assume an additive relationship between the mutational processes, this Master’s thesis provides tentative evidence that contemporary methods with additivity constraints, e.g. non-negative matrix factorization (NMF), are not sufficient to comprehensively explain the observed mutations in cancer genomes and that the observed deviations are not random. To quantify these residues, two metrics are defined – additive and multiplicative residues – and hierarchical clustering algorithms are used to identify cancer subsets with similar residual profiles. It is shown that in certain cancer sample subsets there is a systematic mutational burden overestimation that can only be resolved by a multiplicatively acting process, as well as non-random underestimation, which requires additional mutational signatures. Here an extension to the additive mutational signature model is proposed – a probabilistic model that incorporates a selectively active modulatory mutational process that is able to act in a multiplicative manner together with the known mutational signatures, reducing systematic variability.
  • Leppiniemi, Samuel Albert (2023)
    High-grade serous carcinoma (HGSC) is a highly lethal cancer type characterised by high genomic instability and frequent copy number alterations. This study examines the relationships between genetic variants in tumour germline and gene expression levels to obtain a better understanding of gene regulation in HGSC. This would then improve knowledge of the cancer mechanisms in order to find, for example, potential new treatment targets and biomarkers. The aim is to find significantly associated variant-gene pairs in HGSC. Expression quantitative trait loci (eQTL) analysis is a well-suited method to explore these associations. eQTL analysis is also suitable for analysing variants located in non-coding genomic regions, which previous genome-wide association studies indicate contain many disease-linked germline variants. The current eQTL analysis methods are, however, not applicable for association testing between genes and variants in the context of HGSC because of the special genomic features of the cancer. Therefore, a new eQTL analysis approach, SegmentQTL, was developed for this study to accommodate the copy-number-driven nature of the disease. Careful input processing is of particular importance in eQTL analysis, as it has a notable effect on the number of significantly associated variant-gene pairs. It is also relevant to maintain adequate statistical power, which affects the reliability of the findings. In all, this study uses eQTL analysis to uncover variant-gene associations. This helps to improve knowledge of gene regulation mechanisms in HGSC in order to find new treatments. To apply the analysis to the HGSC data, a novel eQTL analysis method was developed. Additionally, appropriate input processing is important prior to running the analysis to ensure reliable results.
  • Balaz, Melanie (2023)
    Gene editing holds tremendous potential for treating a variety of diseases, but concerns about safety, particularly the risk of edited cells becoming cancerous, must be addressed. This thesis explores a safety mechanism to prevent unwanted cell proliferation and tumor formation in induced pluripotent stem cells that have been edited for use in gene therapy. The mechanism is based on the genetic disruption (knockout) of the gene encoding thymidylate synthase (TYMS), the only enzyme responsible for synthesizing deoxythymidine monophosphate (dTMP), an essential building block of DNA. Without dTMP, cells cannot successfully proliferate, while RNA synthesis remains unaffected. Through RNA sequencing analysis, we investigate the early response of TYMS knockout cells to dTMP withdrawal and find evidence of the activation of apoptosis and stress pathways, as well as differentiation and changes in the cell cycle. In addition, we demonstrate the effectiveness of the TYMS knockout mechanism in preventing proliferation of cancerous cells in a laboratory setting.
  • Zogjani, Yllza (2023)
    The increasing demand for comprehensive datasets to address complex diseases has resulted in the widespread popularity of biobank-based research. However, the collection of biobank-level data may be susceptible to biases when fundamental aspects of study design, such as the sampling approach, are overlooked. FinnGen is a large-scale cohort study aiming to improve diagnoses and prevent diseases through genetic research by combining biobank data with registry data. However, FinnGen’s hospital-based recruitment strategy exposes the study to selection bias and thus makes it epidemiologically less representative of its sampling population. In this study, we examine the profound impact of selection bias in FinnGen. We use well-established epidemiological methods and leverage representative data on the Finnish population to try to correct for the bias. By comparing key demographic characteristics and association statistics of interest between FinnGen and a comprehensive registry-based study, FinRegistry, we highlight the extent to which selection bias within FinnGen results in distorted association estimates and a dataset that is highly non-representative of its underlying population. In response to these findings, we compute Iterative Proportional Fitting (IPF) weights to estimate association statistics that are representative of the true sampling population of FinnGen and unaffected by selection bias. By comparing weighted associations estimated in FinnGen with associations estimated using FinRegistry data, we infer that the use of our IPF weights mitigates volunteer bias in FinnGen.
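As a rough illustration of how Iterative Proportional Fitting (also called raking) produces weights, the sketch below rescales per-record weights until the weighted sample matches known population shares for each variable in turn. The variables, categories, and target shares are hypothetical; the abstract does not state which characteristics or settings the thesis used.

```python
def ipf_weights(records, margins, iterations=50):
    """Rake per-record weights so weighted shares match target margins.

    records: list of dicts, e.g. {"sex": "F", "age": "young"}
    margins: {variable: {category: target share}}; the shares for each
             variable are assumed to sum to 1, and every category in the
             data is assumed to have a target.
    """
    weights = [1.0] * len(records)
    for _ in range(iterations):
        for variable, targets in margins.items():
            total = sum(weights)
            # Current weighted total for each category of this variable.
            current = {}
            for w, rec in zip(weights, records):
                current[rec[variable]] = current.get(rec[variable], 0.0) + w
            # Rescale so each category reaches its target share.
            weights = [
                w * targets[rec[variable]] * total / current[rec[variable]]
                for w, rec in zip(weights, records)
            ]
    return weights


def weighted_share(records, weights, variable, category):
    """Weighted fraction of records whose `variable` equals `category`."""
    hit = sum(w for w, rec in zip(weights, records) if rec[variable] == category)
    return hit / sum(weights)


# Hypothetical sample that over-represents older men.
sample = (
    [{"sex": "F", "age": "young"}] * 1
    + [{"sex": "F", "age": "old"}] * 3
    + [{"sex": "M", "age": "young"}] * 2
    + [{"sex": "M", "age": "old"}] * 4
)
targets = {"sex": {"F": 0.5, "M": 0.5}, "age": {"young": 0.5, "old": 0.5}}

w = ipf_weights(sample, targets)
print(round(weighted_share(sample, w, "sex", "F"), 3))
print(round(weighted_share(sample, w, "age", "young"), 3))
```

The weights only correct for the variables that are raked on; any selection mechanism not captured by those margins remains.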
  • Nebelung, Hanna (2023)
    ScRNA-seq captures a static picture of a cell's transcriptome, including abundances of unspliced and spliced RNA. RNA velocity methods offer the opportunity to infer future RNA abundances, and thus future states of a cell, based on the temporal change of unspliced and spliced RNA. Early RNA velocity methods have shed light on transcriptional dynamics in many biological processes. However, due to strict assumptions in the underlying model, these methods are not reliable when analysing and inferring velocity for genes with complex expression dynamics, such as genes with transcriptional boosts. These genes can, for example, be observed in erythropoietic and hematopoietic data. Several new RNA velocity methods have been proposed recently. Among these, veloVI and Pyro-Velocity both employ Bayesian methods to estimate the reaction rates and latent parameters. The problem of estimating RNA velocity is thus turned into posterior probability inference, which allows for more flexible inference of model parameters and the quantification of uncertainty. The objective of this thesis was to investigate the newly published RNA velocity methods veloVI and Pyro-Velocity in comparison to the established tool scVelo. To achieve this, we applied the methods to data obtained from scRNA-seq of healthy and ERCC6L2 disease bone marrow cells. ERCC6L2 disease can cause bone marrow failure with a risk of progression to acute myeloid leukemia with erythroid predominance. Specifically, we evaluated whether RNA velocity results reflect hematopoietic differentiation, whether genes with transcriptional boosts affect the velocity results, and whether RNA velocity analysis can indicate why erythropoiesis in ERCC6L2 disease is affected. We find that the new RNA velocity methods cannot produce velocity estimations that are fully in line with what is known of hematopoiesis in our data. Further, the results suggest that velocity estimations by veloVI are affected by genes with transcriptional boosts. Moreover, the RNA velocity methods examined in this thesis are not robust and cannot reliably predict cell transitions based on the estimated velocity. Consequently, velocity estimations for disease data such as ERCC6L2 disease must be evaluated carefully before drawing any conclusion about the differentiation process. In conclusion, this thesis highlights the need for models that can capture complex transcription kinetics. Still, as this field is rapidly growing and promising new methods are being developed, improvement of RNA velocity analysis in general is possible.
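For context on the strict assumptions of the early methods, the sketch below shows the simple steady-state formulation of RNA velocity for a single gene: with the splicing rate fixed to 1, a degradation rate gamma is fitted as the slope of unspliced against spliced counts, and velocity is the residual u - gamma * s. This is a simplified illustration, not the Bayesian models of veloVI or Pyro-Velocity. It fits the slope on all cells rather than on extreme quantiles, and the function name and counts are hypothetical.

```python
def steady_state_velocity(unspliced, spliced):
    """Estimate gamma and per-cell velocities for one gene.

    Assumes the steady-state model with splicing rate 1:
        velocity = unspliced - gamma * spliced
    where gamma is the least-squares slope through the origin of
    unspliced against spliced counts.
    """
    denominator = sum(s * s for s in spliced)
    if denominator == 0:
        raise ValueError("spliced counts are all zero")
    gamma = sum(u * s for u, s in zip(unspliced, spliced)) / denominator
    velocities = [u - gamma * s for u, s in zip(unspliced, spliced)]
    return gamma, velocities


# Hypothetical counts for one gene across four cells.
spliced_counts = [1.0, 2.0, 3.0, 4.0]
unspliced_counts = [0.5, 1.0, 1.5, 3.0]

gamma, velocities = steady_state_velocity(unspliced_counts, spliced_counts)
print(round(gamma, 4))
print([round(v, 4) for v in velocities])
```

A positive velocity means more unspliced RNA than the fitted steady state predicts, which is read as the gene being upregulated in that cell; a gene whose kinetics change over time, as with a transcriptional boost, violates the single-slope assumption.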