
Browsing by Subject "bioinformatics"

  • Szabo, Angela (2024)
    The advancement of high-throughput imaging technologies has revolutionized the study of the tumor microenvironment (TME), including that of high-grade serous ovarian carcinoma (HGSOC), a cancer type characterized by genetic instability and high intra-tumor heterogeneity. HGSOC is often diagnosed at advanced stages and has a high relapse rate following initial treatment, presenting significant clinical challenges. Understanding the dynamic and complex tumor microenvironment in HGSOC is crucial for developing effective therapeutic strategies, as it includes various interacting cells and structures. Most current methods focus on deciphering the TME at the single-cell level, but the volume of the data poses a challenge in large-scale studies. This thesis focuses on developing a comprehensive pipeline for accurate detection and phenotyping of immune cells within the TME using tissue cyclic immunofluorescence imaging. The proposed pipeline integrates Napari, an advanced visualization tool, and several existing computational methods to handle large-scale imaging datasets efficiently. The primary aim is to create Napari plugins for fast browsing and detailed visualization of these datasets, enabling precise cell phenotyping and quality control. Large images were handled with Zarr and Dask, which enable efficient data management. Key image processing methodologies include the StarDist algorithm for cell segmentation, preprocessing steps for fluorescence intensity normalization, and the Tribus tool for semi-automated cell type classification. In total, we annotated 976,082 single cells in three HGSOC samples originating from pre- or post-neoadjuvant chemotherapy tumor sections. The accurate annotation of immune sub-populations was enhanced by visual evaluation steps, addressing the limitations of the discussed methods. Accurately annotating dense tissue areas is crucial for describing the cellular composition of samples, particularly tumor-infiltrating immune populations. The results indicate that the proposed pipeline not only enhances the understanding of the TME in HGSOC but also provides a robust framework for future studies involving large-scale imaging data.
  • Koski, Jessica (2021)
    Acute lymphoblastic leukemia (ALL) is a hematological malignancy characterized by uncontrolled proliferation and blocked maturation of lymphoid progenitor cells. It is divided into B- and T-cell types, both of which have multiple subtypes defined by different somatic genetic changes. Germline predisposition has also been found to play an important role in multiple hematological malignancies, and several germline variants that contribute to ALL risk have already been identified in pediatric and familial settings. Only a few studies include adult ALL patients, but findings in acute myeloid leukemia, where germline predisposition was found to concern adult patients as well, have increased interest in studying adults. The prognosis of adult ALL patients is much worse than that of pediatric patients, and many still lack clear genetic markers for diagnosis. Thus, identifying genetic lesions affecting ALL development is important in order to improve treatments and prognosis. Germline studies can provide additional insight into the predisposition to and development of ALL when there are no clear somatic biomarkers. Single nucleotide variants are usually of interest when identifying biomarkers from the genome, but structural variants can also be studied. Their coverage of the genome is higher than that of single nucleotide variants, which makes them suitable candidates for exploring association with prognosis. Copy number changes can be detected from next-generation sequencing data, although detection specificity and sensitivity vary considerably between software tools. The current approach is to identify the most likely regions with copy number change by using multiple tools and to later validate the findings experimentally. In this thesis, the copy number changes in germline samples of 41 adult ALL patients were analyzed using ExomeDepth, CODEX2 and CNVkit.
  • Hellsten, Kirsi (2023)
    Triglycerides are a type of lipid that enters our body with fatty food. High triglyceride levels are often caused by an unhealthy diet, poor lifestyle, poorly treated diseases such as diabetes and too little exercise. Other risk factors found in various studies are HIV, menopause, inherited lipid metabolism disorder and South Asian ancestry. Complications of high triglycerides include pancreatitis, carotid artery disease, coronary artery disease, metabolic syndrome, peripheral artery disease, and strokes. Migration has made Singapore diverse, and it contains several subpopulations. One third of the population has genetic ancestry in China. The second largest group has genetic ancestry in Malaysia, and the third largest has genetic ancestry in India. Even though Singapore has one of the highest life expectancies in the world, unhealthy lifestyles such as poor diet, lack of exercise and smoking are still visible in everyday life. The purpose of this thesis was to introduce GWAS analysis for quantitative traits and apply it to real data, and also to see whether there are associations between some variants and triglycerides in the three main subpopulations in Singapore and to compare the results to previous studies. The research questions that this thesis answered are: what is GWAS analysis and what is it used for, how can GWAS be applied to data containing quantitative traits, and are there associations between some SNPs and triglycerides in the three main populations in Singapore. GWAS stands for genome-wide association studies, which are designed to identify statistical associations between genetic variants and phenotypes or traits. One reason for developing GWAS was to learn to identify different genetic factors which have an impact on significant phenotypes, for instance susceptibility to certain diseases. Such information can eventually be used to predict the phenotypes of individuals. GWAS have been used globally in, for example, anthropology, biomedicine, biotechnology, and forensics. The studies enhance the understanding of human evolution and natural selection and help advance many areas of biology. The study used several quality control methods, linear models, and Bayesian inference to study associations. The research results were examined, among other things, with the help of various visual methods. The dataset used in this thesis was an open dataset used by Saw, W., Tantoso, E., Begum, H. et al. in their previous study. This study showed that there are associations between 6 different variants and triglycerides in the three main subpopulations in Singapore. The study results were compared with the results of two previous studies, which differed from the results of this study, suggesting that the results are significant. In addition, the thesis reviewed the ethics of GWAS and the limitations and benefits of GWAS. Most studies like this have been done in Europe, so more research is needed in different parts of the world. This research can also be continued with different methods and variables.
  • Kähkönen, Harri (2023)
    The volume of data generated by high-throughput DNA sequencing has grown to a magnitude that leads to substantial computational challenges in storing and searching the data. To tackle this problem, various computational methodologies have been developed in recent years to space-efficiently index collections of data sets and enable efficient searches. One of the most recent indexing methods, the Spectral Burrows-Wheeler Transform (SBWT), represents all distinct k-mers of a DNA sequence using only 4 bits and a small additional space for the rank data structures per k-mer. In addition to being space-efficient, it also enables k-mer membership queries in linear time relative to k, and constant time relative to the number of distinct k-mers in the sequence. The queries rely on rank queries over bit vectors. Experiments run on a single CPU thread have shown that hundreds of thousands of k-mer membership queries per second can be performed over SBWT. By parallelizing the queries on a CPU, it is possible to execute millions of queries per second. However, Graphics Processing Units (GPUs) have much more parallelization potential. The main contribution of the thesis is an implementation of the k-mer membership queries over SBWT with GPU computing. Optimizing the queries to be performed on a GPU made it possible to perform over a billion queries per second. Furthermore, the thesis presents a new enhancement for the queries over SBWT called presearching, which doubles the speed of the original SBWT search query. The rank query needed for the membership queries is implemented using space-efficient poppy rank data structures and their derivative, the cumulative-poppy data structure, which is one of the contributions of the thesis.
  • Nebelung, Hanna (2023)
    ScRNA-seq captures a static picture of a cell's transcriptome, including abundances of unspliced and spliced RNA. RNA velocity methods offer the opportunity to infer future RNA abundances, and thus future states of a cell, based on the temporal change of unspliced and spliced RNA. Early RNA velocity methods have shed light on transcriptional dynamics in many biological processes. However, due to strict assumptions in the underlying model, these methods are not reliable when analysing and inferring velocity for genes with complex expression dynamics, such as genes with transcriptional boosts. These genes can, for example, be observed in erythropoietic and hematopoietic data. Several new RNA velocity methods have been proposed recently. Among these, veloVI and Pyro-Velocity both employ Bayesian methods to estimate the reaction rates and latent parameters. The problem of estimating RNA velocity thus becomes a posterior inference problem, which allows for more flexible inference of model parameters and the quantification of uncertainty. The objective of this thesis was to investigate the newly published RNA velocity methods veloVI and Pyro-Velocity in comparison to the established tool scVelo. To achieve this, we applied the methods to data obtained from scRNA-seq of healthy and ERCC6L2 disease bone marrow cells. ERCC6L2 disease can cause bone marrow failure with a risk of progression to acute myeloid leukemia with erythroid predominance. Specifically, we evaluated whether RNA velocity results reflect hematopoietic differentiation, whether genes with transcriptional boosts affect the velocity results, and whether RNA velocity analysis can indicate why erythropoiesis in ERCC6L2 disease is affected. We find that the new RNA velocity methods cannot produce velocity estimations that are fully in line with what is known of hematopoiesis in our data. Further, the results suggest that velocity estimations by veloVI are affected by genes with transcriptional boosts. Moreover, the RNA velocity methods examined in this thesis are not robust and cannot reliably predict cell transitions based on the estimated velocity. Consequently, velocity estimations for disease data such as ERCC6L2 disease must be evaluated carefully before drawing any conclusion about the differentiation process. In conclusion, this thesis highlights the need for models that can capture complex transcription kinetics. Still, as this field is rapidly growing and promising new methods are being developed, improvement of RNA velocity analysis in general is possible.
  • Ingervo, Eliel (2024)
    The problem of safety in a general graph is the problem of finding walks in the graph that are subwalks of a walk in any possible solution within a given model (Tomescu and Medvedev, 2017). Safety has proven to be a valuable problem for bioinformatics in the context of genome assembly and other assembly problems. When working with perfect data, safe walks in a graph correspond to correct sequences of the genome. When working with real (erroneous) data, safe walks correspond to close-to-correct sequences, where the errors correspond only to errors already in the data. Safety has previously been considered in two different models: a model where the graph holds only the information of graph topology, and a model where the graph holds the information of graph topology and network flow. Subpath constraints are paths in a graph that restrict the solution set. Restricting a solution set increases the lengths of safe walks, so using a model that holds the information of subpath constraints is advantageous. However, subpath constraints have never been considered with the problem of safety. In this work, we introduce subwalk constraints, which are subpath constraints that can have cycles and are not limited to directed acyclic graphs (DAGs). Then, we present a new model, where the graph holds the information of graph topology, network flow and subwalk constraints. In the new model, we present three methodologies to compute safe walks that become possible because of subwalk constraints. Then, we present an algorithm that computes safe walks in polynomial time in the new model utilizing the three new methodologies. Due to the new model, all safe walks produced by other algorithms are subwalks of the safe walks generated by our new algorithm.
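
To make the notion of safety in the last abstract concrete, the following is a minimal sketch in a deliberately simplified toy model, not the model or algorithm of the thesis. It assumes that every solution is a single walk from a source to a target in a directed graph without parallel edges, and it uses topology only (no network flow, no subwalk constraints). In that toy model, a single edge is safe exactly when it lies on every source-to-target walk, which holds when removing the edge disconnects the target from the source. The names `reachable` and `safe_edges` are hypothetical and chosen for illustration.

```python
from collections import deque


def reachable(adj, source, target, skip_edge=None):
    """Breadth-first search from source to target, optionally ignoring one edge."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in adj.get(u, ()):
            if (u, v) == skip_edge or v in seen:
                continue
            seen.add(v)
            queue.append(v)
    return False


def safe_edges(adj, source, target):
    """Return edges that lie on every source-to-target walk (toy model only).

    Assumption: adj maps each node to a list of successors, with no
    parallel edges. An edge is reported as safe when deleting it makes
    the target unreachable from the source.
    """
    if not reachable(adj, source, target):
        return []  # no solution exists, so nothing is reported
    return [
        (u, v)
        for u in adj
        for v in adj[u]
        if not reachable(adj, source, target, skip_edge=(u, v))
    ]


# Hypothetical example graph: two alternative routes between a and d.
graph = {
    "s": ["a"],
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["t"],
    "t": [],
}
print(safe_edges(graph, "s", "t"))
```

In this example only the edges entering and leaving the branching region are shared by all solutions, so the edges through `b` and `c` are not reported. Adding constraints that rule out one of the two routes would shrink the solution set and make more edges safe, which is the effect the abstract attributes to subpath constraints.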