Browsing by study line "Bioinformatics and Systems Medicine"
Now showing items 1-14 of 14
-
(2024)The advancement of high-throughput imaging technologies has revolutionized the study of the tumor microenvironment (TME), including that of high-grade serous ovarian carcinoma (HGSOC), a cancer type characterized by genetic instability and high intra-tumor heterogeneity. HGSOC is often diagnosed at advanced stages and has a high relapse rate following initial treatment, presenting significant clinical challenges. Understanding the dynamic and complex tumor microenvironment in HGSOC is crucial for developing effective therapeutic strategies, as it includes various interacting cells and structures. Currently, most methods focus on deciphering the TME at the single-cell level, but the volume of the data poses a challenge in large-scale studies. This thesis focuses on developing a comprehensive pipeline for accurate detection and phenotyping of immune cells within the TME using tissue cyclic immunofluorescence imaging. The proposed pipeline integrates Napari, an advanced visualization tool, and several existing computational methods to handle large-scale imaging datasets efficiently. The primary aim is to create Napari plugins for fast browsing and detailed visualization of these datasets, enabling precise cell phenotyping and quality control. Handling large images was resolved through the implementation of Zarr and Dask methodologies, enabling efficient data management. Key image processing methodologies include the use of the StarDist algorithm for cell segmentation, preprocessing steps for fluorescence intensity normalization, and the Tribus tool for semi-automated cell type classification. In total, we annotated 976,082 single cells on three HGSOC samples originating from pre- or post-neoadjuvant chemotherapy tumor sections. The accurate annotation of immune sub-populations was enhanced by visual evaluation steps, addressing the limitations of the discussed methods.
Accurately annotating dense tissue areas is crucial for describing the cellular composition of samples, particularly tumor-infiltrating immune populations. The results indicate that the proposed pipeline not only enhances the understanding of the TME in HGSOC but also provides a robust framework for future studies involving large-scale imaging data.
-
(2023)Polygenic risk scores (PRSs) estimate the genetic risk of an individual for a certain polygenic disease trait by summing up the effects of multiple variants across the genome affecting the disease risk. Currently, PRSs are calculated from imputed array genotyping data, which is inexpensive to produce and use and has standard procedures and pipelines available. However, genotyping arrays are prone to ascertainment bias, which can also lead to biased PRS results in some populations. If PRSs are utilized in healthcare for screening rare diseases, usage of whole-genome sequencing (WGS) instead of array genotyping is desirable, because individual samples can also be analyzed easily. While high-coverage WGS is still significantly more expensive than array genotyping, low-coverage whole-genome sequencing (lcWGS) with imputation has been proposed as an alternative to genotyping arrays. In this project, the utility of imputed lcWGS data in PRS estimation compared to genotyping array data and the impact of the choice of imputation tool for lcWGS data were studied. Down-sampled WGS data with six different low coverages (0.1x-2x) was used to represent lcWGS data. Two different pipelines were used in genotype imputation and haplotype phasing: in the first one, pre-phasing and imputation were performed directly for the genotype likelihoods (GLs) calculated from the down-sampled data, whereas in the second one, the GLs were converted to genotype calls before imputation and phasing. In both pipelines, PRSs for 27 disease phenotypes were calculated from the imputed and phased lcWGS data. Imputation and PRS calculation accuracy of the two pipelines were calculated in relation to both genotyping array and high-coverage whole-genome sequencing (hcWGS) data. In both pipelines, imputation and PRS calculation accuracy increased when the down-sampled coverage increased.
The second imputation and phasing pipeline led to better results in both imputation and PRS calculation accuracy. Some differences in PRS accuracy between different phenotypes were also detected. The results show patterns similar to those seen in comparable publications. However, the imputation and PRS accuracy did not quite reach the levels seen in earlier studies, although possible limitations leading to the lower accuracy could be identified. The results also emphasize the importance of choosing suitable imputation and phasing methods for lcWGS data and suggest that methods and pipelines designed particularly for lcWGS should be developed and published.
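The summation described at the start of this abstract can be illustrated with a minimal sketch. It assumes a PRS is the sum of per-variant effect sizes multiplied by the individual's (possibly imputed, hence fractional) allele dosages; the variant IDs, weights and dosages below are hypothetical and are not taken from the thesis.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         effect_sizes: Dict[str, float]) -> float:
    """Sum effect-size-weighted allele dosages over shared variants.

    Variants missing from the genotype data are skipped, which is one
    simple (assumed) way to handle variants that were not imputed.
    """
    return sum(effect_sizes[v] * dosages[v]
               for v in effect_sizes if v in dosages)


# Hypothetical per-variant effect sizes (e.g. log odds ratios).
weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}
# Hypothetical dosages: 0-2 copies of the effect allele; imputed
# dosages may be fractional.
sample = {"rs1": 2.0, "rs2": 1.0, "rs3": 0.4}

score = polygenic_risk_score(sample, weights)
```

Comparing such scores between array-based and lcWGS-based dosages for the same individual is one way to picture the accuracy comparison the abstract describes.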
-
(2023)Sequence alignment is a widely studied problem in the field of bioinformatics. The exact solution takes quadratic time to compute, and thus is not practical for long sequences. A number of heuristic approaches have been developed to conquer the quadratic time complexity. This thesis reviews the average-case time analysis of two such heuristics: banded alignment by Ganesh and Sy in "Near-Linear Time Edit Distance for Indel Channels" (WABI 2020), and seed-chain-extend by Shaw and Yu in "Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic" (Genome Research 2023). These heuristics reduce the quadratic average-case time complexity of sequence alignment to log-linear. The approach of the thesis is to outline the proofs of the original analyses and to provide supporting materials to aid the reader in studying them. The experiments of this thesis compare four different approaches to computing the exact match anchors of the seed-chain-extend sequence alignment heuristic. Exact match anchors based on a Bi-Directional Burrows-Wheeler Transformation (BDBWT), the suffix-tree-based Mummer, and Minimap2 are computed. The anchors are then given to a chaining algorithm to compare the performance of each anchoring technique. The qualities of the chains are compared using a Jaccard index applied to the sequences. The highest Jaccard index is obtained for the maximal exact match and the unique maximal exact match anchors of the Mummer and BDBWT approaches. An increasing minimum length of the exact matches seems to increase the Jaccard index and reduce the running time of the chaining algorithm.
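To illustrate the general idea behind banded alignment, the following is a minimal sketch of the textbook banded edit distance: only dynamic-programming cells within a fixed distance of the main diagonal are filled, so the work is proportional to sequence length times band width rather than quadratic. This is a generic illustration, not the algorithm analysed by Ganesh and Sy; the function name and the choice to return None when the distance exceeds the band are assumptions.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance restricted to cells with |i - j| <= band.

    Returns the exact distance when it is at most `band`, otherwise None.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None  # the end cell lies outside the band
    inf = float("inf")
    # Row 0: distance from the empty prefix of `a` to prefixes of `b`.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(
                prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),  # (mis)match
                prev.get(j, inf) + 1,                           # deletion
                cur.get(j - 1, inf) + 1,                        # insertion
            )
        prev = cur
    dist = prev.get(m, inf)
    return dist if dist <= band else None


exact = banded_edit_distance("kitten", "sitting", band=3)
too_narrow = banded_edit_distance("kitten", "sitting", band=2)
```

With a band of 3 the call recovers the true edit distance, while a band of 2 is too narrow and the sketch reports that by returning None.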
-
(2024)Single-cell RNA sequencing (scRNA-seq) allows the analysis of differences in RNA expression between individual cells. While this is usually performed by short-read sequencing, long-read sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) are also applied by researchers. Because long-read technologies can capture entire RNA molecules, combining them with single-cell sequencing enables the exploration of cell-specific isoform expression patterns. In single-cell sequencing, each cell is tagged with a different oligonucleotide, called a barcode, during sequencing to enable the identification of the origin of each read. With short reads, these barcodes are straightforward to identify and correct. However, with the higher error rate of long reads, the identification of the barcodes becomes more challenging. Tools exist for the identification and correction of barcodes in short reads and for combinations of long and short reads, but only a few tools work with long reads exclusively. Additionally, most tools are focused on one specific scRNA-seq protocol. While most protocols work in a similar way, the location, length or other characteristics of the barcodes might differ, meaning not all tools work for all protocols. This thesis introduces a novel barcode calling algorithm for long reads called BArcoDe callinG via Edit distance gRaph, or Badger, which can accommodate different scRNA-seq protocols. The algorithm uses a novel data structure called the edit distance graph, which is based on the Hamming distance graph. Within the graph, every distinct barcode is represented by a node. Edges are added between nodes whose barcodes have an edit distance below a certain threshold. As calculating the edit distance is computationally expensive, a filter is used to find similar barcodes, and the edit distance is calculated only between those.
Additionally, the algorithm is implemented and its performance evaluated, both on its own and in comparison to the existing method scTagger, where Badger outperforms scTagger in both precision and recall.
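The graph construction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Badger's implementation: the abstract does not specify the filter, so a length-difference check (a valid lower bound on edit distance) stands in for it, all pairs are enumerated naively, and the function names and example barcodes are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Set


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_edit_distance_graph(barcodes: Iterable[str],
                              threshold: int) -> Dict[str, Set[str]]:
    """One node per distinct barcode; edge if edit distance < threshold."""
    nodes = sorted(set(barcodes))
    graph: Dict[str, Set[str]] = {bc: set() for bc in nodes}
    for a, b in combinations(nodes, 2):
        # Assumed stand-in filter: the length difference never exceeds the
        # edit distance, so such pairs can be skipped without computing it.
        if abs(len(a) - len(b)) >= threshold:
            continue
        if edit_distance(a, b) < threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical barcodes: three near-identical ones and one unrelated.
graph = build_edit_distance_graph(["ACGT", "ACGA", "ACG", "TTTT", "ACGT"], 2)
```

In this toy input the three similar barcodes form a connected cluster while the unrelated barcode stays isolated, which is the kind of structure a barcode-calling step could exploit.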
-
(2024)Essential thrombocythemia (ET) is a clonal hematopoietic disease characterized by an abnormal increase of platelets in the circulation, with increased risk of thrombosis and hemorrhage. Despite megakaryocytes having a central role in the disease, few studies have investigated their gene expression in ET. The aim of this study is to characterize the gene expression profiles of megakaryocytes from ET patients harboring different driver mutations, and increase the knowledge of the molecular mechanisms underlying the pathophysiology of the disease. In this study, samples were obtained from healthy donors and ET patients with JAK2 V617F, CALR Type I, CALR Type II driver mutations and triple-negative patients. Following megakaryocyte culture from peripheral blood and RNA sequencing, the data was pre-processed and analyzed using differential gene expression analysis. The downstream analysis was conducted using pathway enrichment analysis tools. The analysis revealed that all mutants shared common deregulated genes related to processes involving platelets and coagulation. However, it was shown that CALR and JAK2 V617F mutants also have distinct patterns of gene expression. CALR Type I mutants had a unique gene expression signature consisting of genes related to immune response, as well as metabolic, regulatory, proliferative, and inflammatory pathways, while CALR Type II mutants had unique genes related to ribosomes. The CALR mutants also shared a common anti-inflammatory response signature which set them apart from JAK2 V617F mutants. In conclusion, this study shows that the gene expression profiles of ET mutants are heterogeneous. Moreover, the results provide new insights into the gene expression profiles of CALR mutants that distinguish them from the other mutants. Further experiments using single-cell RNA sequencing methods could build upon these findings and uncover the observed gene expression discrepancies between CALR and JAK2 mutants with increased accuracy.
-
(2024)Proteins are the building blocks of life, and they play a crucial role in biological functions and activities as the expression products of genes within organisms. Annotating the functions of proteins is a critical challenge in bioinformatics. Historically, the annotation of protein functions has relied heavily on experimental approaches, which are time-consuming and cannot keep pace with the rapid generation of genomic data. Thus, automating functional annotation of proteins is becoming increasingly important. As machine learning methods have matured, researchers have used them to automate the annotation of experimentally uncharacterized proteins. To achieve effective results with machine learning models, preprocessing data and selecting appropriate hyperparameters are crucial but sometimes overlooked steps. This article explores using Bayesian search to find suitable hyperparameters from a vast search space. In addition, a drastic class imbalance problem is often encountered in protein function prediction. To address this, we used several functions to compute class weights, with the choice of function integrated as a hyperparameter, allowing for its optimization through Bayesian search. Additionally, we employ grid search to optimize the selection of preprocessing methods. The tested preprocessing methods are simple functions that alter the distribution of input variables. These methodological enhancements have improved machine learning performance in predicting protein functions, thereby supporting researchers in automated protein function prediction. Furthermore, these methodologies can be easily transferred to other tasks requiring machine learning.
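The idea of treating the class-weight function as a hyperparameter can be sketched as below. The abstract does not name the functions that were used, so the three schemes here are assumptions chosen for illustration: "balanced" follows the common n_samples / (n_classes * class_count) convention, and "sqrt_balanced" is a softened variant of it. The labels are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence


def uniform(count: int, n_samples: int, n_classes: int) -> float:
    """No reweighting: every class gets weight 1."""
    return 1.0


def balanced(count: int, n_samples: int, n_classes: int) -> float:
    """Inverse-frequency weight: rare classes get larger weights."""
    return n_samples / (n_classes * count)


def sqrt_balanced(count: int, n_samples: int, n_classes: int) -> float:
    """Square root of the balanced weight, a milder correction."""
    return math.sqrt(balanced(count, n_samples, n_classes))


# The key of this mapping is the categorical hyperparameter that a
# search procedure (Bayesian or otherwise) would choose among.
WEIGHT_FUNCTIONS: Dict[str, Callable[[int, int, int], float]] = {
    "uniform": uniform,
    "balanced": balanced,
    "sqrt_balanced": sqrt_balanced,
}


def class_weights(labels: Sequence[Hashable],
                  scheme: str) -> Dict[Hashable, float]:
    """Compute one weight per class using the selected scheme."""
    counts = Counter(labels)
    fn = WEIGHT_FUNCTIONS[scheme]
    return {label: fn(c, len(labels), len(counts))
            for label, c in counts.items()}


# Hypothetical imbalanced labels: 8 examples of one function, 2 of another.
labels = ["binds_dna"] * 8 + ["kinase"] * 2
weights = class_weights(labels, "balanced")
```

A hyperparameter search would then evaluate the model once per candidate scheme name and keep the one with the best validation score.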
-
(2021)Cancer cells accumulate somatic mutations in their DNA throughout their lifetime. The advances in cancer prevention and treatment methods call for a deeper understanding of carcinogenesis on the genetic sequence level. Mutational signatures present a novel and promising way to capture somatic mutation patterns and define their causes, making it possible to summarize the mutational landscape of cancer as a combination of distinct mutagenic processes acting with different levels of strength. While the majority of previous studies assume an additive relationship between the mutational processes, this Master’s thesis provides tentative evidence that contemporary methods with additivity constraints, e.g. non-negative matrix factorization (NMF), are not sufficient to comprehensively explain the observed mutations in cancer genomes and that the observed deviations are not random. To quantify these residues, two metrics are defined, additive and multiplicative residues, and hierarchical clustering algorithms are used to identify cancer subsets with similar residual profiles. It is shown that in certain cancer sample subsets there is a systematic mutational burden overestimation that can only be solved by a multiplicatively acting process, as well as non-random underestimation, requiring additional mutational signatures. Here an extension to the additive mutational signature model is proposed: a probabilistic model that incorporates a selectively active modulatory mutational process that is able to act in a multiplicative manner together with the known mutational signatures, reducing systematic variability.
-
(2023)High-grade serous carcinoma (HGSC) is a highly lethal cancer type characterised by high genomic instability and frequent copy number alterations. This study examines the relationships between genetic variants in tumour germline and gene expression levels to obtain a better understanding of gene regulation in HGSC. This would then improve knowledge of the cancer mechanisms in order to find, for example, potential new treatment targets and biomarkers. The aim is to find significantly associated variant-gene pairs in HGSC. Expression quantitative trait loci (eQTL) analysis is a well-suited method to explore these associations. eQTL analysis is also suitable for analysing variants located in non-coding genomic regions, which previous genome-wide association studies have indicated to contain many disease-linked germline variants. The current eQTL analysis methods are, however, not applicable for association testing between genes and variants in the context of HGSC because of the special genomic features of the cancer. Therefore, a new eQTL analysis approach, SegmentQTL, was developed for this study to accommodate the copy-number-driven nature of the disease. Careful input processing is of particular importance in eQTL analysis as it has a notable effect on the number of significantly associated variant-gene pairs. It is also relevant to maintain adequate statistical power, which affects the reliability of the findings. In all, this study uses eQTL analysis to uncover variant-gene associations. This helps to improve knowledge of gene regulation mechanisms in HGSC in order to find new treatments. To apply the analysis to the HGSC data, a novel eQTL analysis method was developed. Additionally, appropriate input processing is important prior to running the analysis to ensure reliable results.
-
(2023)Gene editing holds tremendous potential for treating a variety of diseases, but concerns about safety, particularly the risk of edited cells becoming cancerous, must be addressed. This thesis explores a safety mechanism to prevent unwanted cell proliferation and tumor formation in induced pluripotent stem cells that have been edited for use in gene therapy. The mechanism is based on the genetic disruption (knockout) of the thymidylate synthase gene (TYMS), which encodes the only enzyme responsible for synthesizing deoxythymidine monophosphate (dTMP), an essential building block of DNA. Without dTMP, cells cannot successfully proliferate, while RNA synthesis remains unaffected. Through RNA sequencing analysis, we investigate the early response of TYMS knockout cells to dTMP withdrawal and find evidence of the activation of apoptosis and stress pathways, as well as differentiation and changes in the cell cycle. In addition, we demonstrate the effectiveness of the TYMS knockout mechanism in preventing proliferation of cancerous cells in a laboratory setting.
-
(2024)Acute myeloid leukemia (AML) is a disease in which blood cell production is severely disrupted. Cell count and morphological analysis from bone marrow (BM) samples are key in the diagnosis of AML. Recent advances in computer vision have led to algorithms developed at the Hematoscope Lab that can automatically classify cells from these BM samples and calculate various cell-level statistics. This thesis investigated the use of cytomorphological data along with standard clinical data to predict progression-free survival (PFS). A benchmark study using penalized Cox regression, random survival forests, and survival support vector machines was conducted to study the utility of cytomorphology data. As features greatly outnumber samples, the methods are further compared over three feature filtering methods based on Spearman’s correlation coefficient, conditional Cox screening, and mutual information. In a dataset from the national VenEx trial, the penalized Cox regression method with ElasticNet penalization supplemented with Cox conditional screening was found to perform best in the nested CV benchmarking. A post-hoc dissection of two best-performing Cox models revealed potentially predictive cytomorphological features, while disease etiology and patient age were likewise important.
-
(2023)The increasing demand for comprehensive datasets to address complex diseases has resulted in a widespread popularity of biobank-based research. However, the collection of biobank-level data may be susceptible to biases when fundamental aspects of study design, such as sampling approach, are overlooked. FinnGen is a large-scale cohort study aiming to improve diagnoses and prevent diseases through genetic research by combining biobank data with registry data. However, FinnGen’s hospital-based recruitment strategy makes FinnGen suffer from selection bias and thus epidemiologically less representative of its sampling population. In this study, we examine the profound impact of selection bias in FinnGen. We use well-established epidemiological methods and leverage representative data on the Finnish population to try to correct for the bias. By comparing key demographic characteristics and association statistics of interest between FinnGen and a comprehensive registry-based study, FinRegistry, we highlight the extent to which selection bias within FinnGen results in distorted association estimates and a dataset that is highly non-representative of its underlying population. In response to these findings, we estimate Iterative Proportional Fitting (IPF) weights to estimate association statistics that are representative of the true sampling population of FinnGen and unaffected by selection bias. By comparing weighted associations estimated in FinnGen with associations estimated using FinRegistry data, we infer that the use of our IPF weights mitigates volunteer bias in FinnGen.
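As an illustration of how IPF (also known as raking) produces weights, the sketch below repeatedly rescales per-individual weights so that the weighted totals of each variable match known population margins. It is a generic minimal version, not the procedure used in the thesis; the variables, categories and margin totals are hypothetical.

```python
from typing import Dict, List


def ipf_weights(sample: List[Dict[str, str]],
                margins: Dict[str, Dict[str, float]],
                n_iter: int = 100, tol: float = 1e-9) -> List[float]:
    """Rake unit weights until weighted margins match the targets."""
    weights = [1.0] * len(sample)
    for _ in range(n_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total per category of this variable.
            totals: Dict[str, float] = {}
            for w, row in zip(weights, sample):
                totals[row[var]] = totals.get(row[var], 0.0) + w
            # Rescale each individual by target / current for its category.
            for i, row in enumerate(sample):
                factor = targets[row[var]] / totals[row[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_total(sample, weights, var, category):
    """Sum of weights of individuals in the given category."""
    return sum(w for w, row in zip(weights, sample) if row[var] == category)


# Hypothetical cohort in which older men are over-represented.
cohort = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "old"},
    {"sex": "M", "age": "old"},
]
# Hypothetical population margins (e.g. taken from a registry).
population = {"sex": {"F": 50.0, "M": 50.0},
              "age": {"young": 30.0, "old": 70.0}}

w = ipf_weights(cohort, population)
```

Weighted association estimates would then use these weights in place of treating every participant as equally representative.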
-
(2023)ScRNA-seq captures a static picture of a cell's transcriptome, including abundances of unspliced and spliced RNA. RNA velocity methods offer the opportunity to infer future RNA abundances, and thus future states of a cell, based on the temporal change of these unspliced and spliced RNAs. Early RNA velocity methods have shed light on transcriptional dynamics in many biological processes. However, due to strict assumptions in the underlying model, these methods are not reliable when analysing and inferring velocity for genes with complex expression dynamics, such as genes with transcriptional boosts. These genes can, for example, be observed in erythropoietic and hematopoietic data. Several new RNA velocity methods have been proposed recently. Among these, veloVI and Pyro-Velocity both employ Bayesian methods to estimate the reaction rate and latent parameters. Thus the problem of estimating RNA velocity is turned into posterior probability inference, which allows for more flexible inference of model parameters and the quantification of uncertainty. The objectives of this thesis were to investigate the newly published RNA velocity methods veloVI and Pyro-Velocity in comparison to the established tool scVelo. To achieve this, we applied the methods to data obtained from scRNA-seq of healthy and ERCC6L2 disease bone marrow cells. ERCC6L2 disease can cause bone marrow failure with a risk of progression to acute myeloid leukemia with erythroid predominance. Specifically, we evaluated whether RNA velocity results reflect hematopoietic differentiation, whether genes with transcriptional boosts affect the velocity results, and whether RNA velocity analysis can indicate why erythropoiesis in ERCC6L2 disease is affected. We find that the new RNA velocity methods cannot produce velocity estimations that are fully in line with what is known of hematopoiesis in our data. Further, the results suggest that velocity estimations by veloVI are affected by genes with transcriptional boosts.
Moreover, the RNA velocity methods examined in this thesis are not robust and cannot reliably predict cell transitions based on the estimated velocity. Consequently, velocity estimations for disease data such as ERCC6L2 disease must be evaluated carefully before drawing any conclusion about the differentiation process. In conclusion, this thesis highlights the need for models that can capture complex transcription kinetics. Still, as this field is rapidly growing and promising new methods are being developed, improvement of RNA velocity analysis, in general, is possible.
-
(2024)Modern developments in biology and bioinformatics have enabled unprecedented computational capabilities in developmental biology. Novel biological single-cell-resolution assays, as well as advanced data processing algorithms and workflows, define directions in which genomics research transitions in the 21st century. This thesis uses Single Cell Assay for Transposase Accessible Chromatin using Sequencing (scATAC-seq) high-throughput data to study gene regulatory principles responsible for the development of a distinguished cellular lineage in the embryonic brainstem in mouse. In particular, a ventral area of the rhombomere 1 region (rV2) in the embryonic brainstem generates two antagonistic cellular lineages. These lineages, an inhibitory GABAergic lineage and an excitatory glutamatergic lineage, have been well characterized in recent literature. Functionally, the cells originating from these lineages have been shown to regulate fundamental, evolutionary behavioral traits that are also present in cognitively more complex organisms such as humans. Studying the cellular development of the two lineages is thus crucial for understanding the development of congenital disorders related to the higher-level behavioral traits. Coupled with auxiliary single-cell mRNA sequencing (scRNA-seq) data analysis, the scATAC-seq data analysis conducted in this thesis first successfully rebuilds the lineage structure of the rV2 area in both scRNA-seq data and scATAC-seq data. Moreover, by the use of open-source data analysis libraries, the work presented in this thesis integrates the structure of the rV2 area across the two modalities, creating a multiomics data set in which both the transcriptomics and the chromatin accessibility landscapes of the rV2 lineages can be studied jointly. Previous research has described the molecular and genetic signature of the rV2 area.
In particular, the induction of the GABAergic rV2 lineage is known to depend on the presence of a combination of a few key transcription factors. The analysis of chromatin accessibility data obtained through the scATAC-seq assay enables the determination of regulatory elements putatively responsible for the activation of the Tal1 gene, a crucial fate-determining gene coding one of the key transcription factors. Indeed, the work presented in this thesis shows that at least three proximal regulatory elements exhibit potential accessibility trends associated with elevated Tal1 gene expression. Through transcription factor footprinting analysis algorithms, this thesis finally predicts how a limited number of known transcription factors bind to the proximal Tal1 regulatory elements and orchestrate lineage-defining Tal1 activation in the rV2 area of mouse brainstem. The thesis ends with a critical assessment of the analysis pipelines and computational tools used in the thesis, and suggests directions for research efforts that can computationally and biologically validate the observations and results of the thesis work.
-
(2024)Tobacco smoking has a huge impact on health, increasing the risk of cardiovascular diseases, respiratory diseases, and various types of cancer. Therefore, assessing a patient’s smoking history is crucial for identifying potential risk factors. Smoking also induces alterations in DNA methylation (DNAm). The large effect of smoking makes it a crucial confounding factor in epigenome-wide association studies (EWAS). However, smoking status information is not always available in the data. Even when it is available, it is not always reliable because it depends on self-reporting, which can cause bias in the analysis. DNAm can be used as an excellent biomarker for smoking since it can be measured in a cost-effective, non-invasive way through methylation arrays. Multiple DNAm-based smoking predictors are already available; some return a smoking score associated with smoking, and others return a smoking status, indicating whether the individual is a current, never, or former smoker. These predictors are based on the Infinium 450k array from Illumina, and there is no available predictor for the Infinium Methylation EPIC array, which contains almost twice as many CpG sites as the previous one. We developed two machine learning models (Model1, Model2) that can classify individuals into three smoking statuses: never-smoker, current-smoker, and former-smoker. Both models were LASSO logistic regressors trained on EPIC array DNAm data of the Young Finns Study cohort. Model1 was trained on the beta matrix pre-processed with the standard minfi pipeline, while Model2 was trained on a beta matrix derived from QN normalized intensity values. Model1 and Model2 were both evaluated on an independent test dataset, the Finnish Twin Cohort, resulting in overall accuracies of 57.4% and 64.29%, respectively. The models can separate the classes from each other with micro-average one-vs-all (OvA) AUCs of 0.79 and 0.81. They can distinguish the never- and current-smoker categories with average one-vs-one (OvO) AUCs of 0.94 and 0.93.
Misclassifications aligned with the individuals’ smoking intensities and the methylation levels of the well-known smoking-associated CpG site, cg05575921.