Browsing by Subject "bioinformatics"

  • Peltola, Sanni (2019)
    In recent decades, ancient DNA recovered from old and degraded samples, such as bones and fossils, has presented novel prospects in the fields of genetics, archaeology and anthropology. In Finland, ancient DNA research is constrained by the poor preservation of bones: they are quickly degraded by acidic soils, limiting the age of DNA that can be recovered from physical remains. However, some soil components can bind DNA and thus protect the molecules from degradation. Ancient DNA from soils and sediments has previously been used to reconstruct paleoenvironments, to study ancient parasites and diet and to demonstrate the presence of a species at a given site, even when there are no visible fossils present. In this pilot study, I explored the potential of archaeological sediments as an alternative source of ancient human DNA. I collected sediment samples from five Finnish Neolithic Stone Age (6,000–4,000 years ago) settlement sites, located in woodland. In addition, I analysed a lakebed sample from a submerged Mesolithic (10,000–7,000 years ago) settlement site, and a soil sample from an Iron Age burial with bones present to compare DNA yields between the two materials. Soil samples were converted into Illumina sequencing libraries and enriched for human mtDNA. I analysed the sequencing data with a customised metagenomics-based bioinformatic analysis workflow. I also tested program performance with simulated data. The results suggested that human DNA preservation in Finnish archaeological sediments may be poor or very localised. I detected small amounts of human mtDNA in three Stone Age woodland settlement sites and a submerged Mesolithic settlement site. One Stone Age sample exhibited terminal damage patterns suggestive of DNA decay, but the time of deposition is difficult to estimate. Interestingly, no human DNA was recovered from the Iron Age burial soil, suggesting that body decomposition may not serve as a significant source of sedimentary ancient DNA. 
Additional complications may arise from the high inhibitor content of the soil and the abundance of microbial and other non-human DNA present in environmental samples. In the future, a more refined sampling approach, such as targeting microscopic bone fragments, could be a strategy worth trialling.
  • Koski, Jessica (2021)
    Acute lymphoblastic leukemia (ALL) is a hematological malignancy that is characterized by uncontrolled proliferation and blocked maturation of lymphoid progenitor cells. It is divided into B- and T-cell types, both of which have multiple subtypes defined by different somatic genetic changes. Germline predisposition has also been found to play an important role in multiple hematological malignancies, and several germline variants that contribute to ALL risk have already been identified in pediatric and familial settings. Only a few studies have included adult ALL patients, but findings in acute myeloid leukemia, where germline predisposition was found to concern adult patients as well, have increased interest in studying adults. The prognosis of adult ALL patients is much worse than that of pediatric patients, and many still lack clear genetic markers for diagnosis. Thus, identifying genetic lesions affecting ALL development is important in order to improve treatments and prognosis. Germline studies can provide additional insight into the predisposition to and development of ALL when there are no clear somatic biomarkers. Single nucleotide variants are usually of interest when identifying biomarkers from the genome, but structural variants can also be studied. Their coverage of the genome is higher than that of single nucleotide variants, which makes them suitable candidates for exploring association with prognosis. Copy number changes can be detected from next generation sequencing data, although detection specificity and sensitivity vary considerably between software tools. The current approach is to identify the most likely regions with copy number change by using multiple tools and to later validate the findings experimentally. In this thesis, the copy number changes in germline samples of 41 adult ALL patients were analyzed using ExomeDepth, CODEX2 and CNVkit.
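The multi-tool strategy mentioned in the abstract above (keeping the regions that several copy number callers agree on) can be illustrated with a minimal sketch. This is not the thesis's pipeline: the function names, the `min_support` parameter, the tool labels and the coordinates are all hypothetical. It assumes each caller's output has already been reduced to half-open `(start, end)` intervals on a single chromosome.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so a tool counts once per position."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def consensus_regions(calls: Dict[str, List[Interval]], min_support: int = 2) -> List[Interval]:
    """Return regions covered by calls from at least `min_support` tools."""
    events: List[Tuple[int, int]] = []
    for intervals in calls.values():
        for start, end in merge_intervals(intervals):
            events.append((start, 1))   # a tool's call begins
            events.append((end, -1))    # a tool's call ends
    events.sort()  # at equal positions, ends (-1) are processed before starts (+1)

    regions: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_support <= depth:
            region_start = pos
        elif depth < min_support <= previous:
            if regions and regions[-1][1] == region_start:
                # Support dipped and recovered at the same position: extend.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((region_start, pos))
    return regions


# Hypothetical calls from three tools (made-up coordinates).
example_calls = {
    "tool_a": [(100, 200), (300, 400)],
    "tool_b": [(150, 250)],
    "tool_c": [(180, 320)],
}
print(consensus_regions(example_calls, min_support=2))
```

Raising `min_support` trades sensitivity for specificity. Regions that survive would still need experimental validation, as the abstract notes.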
  • Reinikka, Siiri (2020)
    Endometrial polyps are one of the most common benign uterine lesions, affecting approximately 10% of all adult women. While endometrial polyps have a high prevalence, their molecular pathogenesis and genetic background are largely undefined. Accordingly, the aim of this thesis was to characterize the somatic mutational landscape of endometrial polyps – to identify mutations in cancer-associated genes, and to identify mutational signatures contributing towards the somatic mutational spectrum. The present study was conducted using whole exome sequencing of 23 endometrial polyps and 18 matching normal blood samples. Mutational signature analysis was conducted using MutationalPatterns and SigProfiler. Endometrial polyps were found to carry a varying number of somatic mutations in their exomes, most of them present at a low allelic fraction. Moreover, 43% (10/23) of the polyps were identified to carry one to four cancer-associated mutations, including mutations in well-established cancer driver genes such as PIK3CA (17%, 4/23), KRAS (13%, 3/23) and ERBB1 (9%, 2/23). Cancer-associated mutational signatures do not have a notable contribution towards the somatic mutational spectrum of endometrial polyps. However, a novel signature, ‘signature B’, characterized by T>G mutations, was found to affect a subset of polyp samples. To conclude, the whole exome sequencing of endometrial polyps revealed several mutations in cancer-associated genes and a novel mutational signature, which may contribute to the development of these benign tumours. However, further research is required to confirm and validate the novel signature, and to define the genetic alterations leading to polyp pathogenesis.
  • Jokinen, Vilja (2021)
    Uterine leiomyomas are benign smooth muscle tumors arising in the myometrium. They are very common, and the incidence in women is up to 70% by the age of 50. Usually, leiomyomas are asymptomatic, but some patients suffer from various symptoms, including abnormal uterine bleeding, pelvic pain, urinary frequency, and constipation. Uterine leiomyomas may also cause subfertility. Genetic alterations in the known driver genes MED12, HMGA2, FH, and COL4A5-6 account for about 90% of all leiomyomas. These initiator mutations result in distinct molecular subtypes of leiomyomas. The majority of whole-genome sequencing (WGS) studies analyzing chromosomal rearrangements have been performed using fresh frozen tissues. One aim of this study was to examine the feasibility of detecting chromosomal rearrangements from WGS data of formalin-fixed paraffin-embedded (FFPE) tissue samples. Previous results from 3’RNA-sequencing data revealed a subset of uterine leiomyoma samples that displayed gene expression patterns similar to those of HMGA2-positive leiomyomas but were previously classified as HMGA2-negative by immunohistochemistry. According to 3’RNA-sequencing, all these tumors overexpressed PLAG1, and some of them overexpressed HMGA2 or HMGA1. Thus, the second aim of this study was to identify driver mutations in these leiomyoma samples using WGS. In this study, WGS was performed for 16 leiomyoma and 4 normal myometrium FFPE samples. The following bioinformatic tools were used to detect somatic alterations at multiple levels: Delly for chromosomal rearrangements, CNVkit for copy-number alterations, and Mutect for point mutations and small insertions and deletions. Sanger sequencing was used to validate findings. The quality of WGS data obtained from FFPE samples was sufficient for detecting chromosomal rearrangements, although the number of calls was quite high. We identified recurrent chromosomal rearrangements affecting HMGA2, HMGA1, and PLAG1, mutually exclusively.
One sample did not harbor any of these rearrangements, but a deletion in COL4A5-6 was found. Biallelic loss of DEPDC5 was seen in one sample with an HMGA2 rearrangement and in another sample with an HMGA1 rearrangement. HMGA2 and HMGA1 encode architectural chromatin proteins regulating several transcription factors. It is well known that HMGA2 upregulates PLAG1 expression. The structure and functionality of HMGA2 and HMGA1 are very similar and conserved, so HMGA1 may also regulate PLAG1 expression. The results of this study suggest that HMGA2 and HMGA1 drive tumorigenesis by regulating PLAG1, and thus, PLAG1 rearrangements resulting in PLAG1 overexpression can also drive tumorigenesis. A few samples, previously classified as HMGA2-negative by immunohistochemistry, were revealed to harbor HMGA2 rearrangements, suggesting that the proportion of HMGA2-positive leiomyomas might be underestimated in previous studies using immunohistochemistry. Only one study has previously reported biallelic inactivation of DEPDC5 in leiomyomas, and the results of this study support the idea that biallelic loss of DEPDC5 is a secondary driver event in uterine leiomyomas.
  • Koivunen, Sampo (2019)
    The Oxford Nanopore MinION is a third generation sequencer utilizing nanopore sequencing technology. The nanopore sequencing method allows sequencing of either DNA or RNA strands as they pass through the membrane-embedded nanopores. By measuring the corresponding fluctuations in the ion flow passing through the nanopore, the passing strands can be sequenced directly without additional second-hand reactions or measurements. MinION sequencing has distinctly different characteristics compared to the market leaders of the sequencing field. The small form factor of the device further sets it apart from the alternatives. However, the technology has only been on the market for a very short time, and thus very few gold standards regarding its capabilities or usage have been established. This thesis describes our experiences testing the capabilities of the MinION sequencer both before its commercial release, as part of a special early access program, and in our continued experiments with the device following its commercial launch. The main results of this study include successfully sequencing and aligning E. coli and human gDNA samples to their respective reference genomes. Using our sequencing and analysis pipeline specifically tuned to the MinION, we were able to sequence the entire E. coli genome on a single MinION flow cell with an average depth of around 180. Over the course of the thesis project, the MinION sequencing protocol was evaluated and optimized in order to determine whether it has the potential to achieve our ultimate goal of reliably sequencing the previously inaccessible regions of the human genome. The possibility of augmenting the sequencing protocol by adding pre-sequencing target enrichment was also explored. Ultimately, we were able to confirm that the MinION sequencer can be used to sequence long DNA fragments from a multitude of sample types.
The majority of the produced reads could successfully be aligned against a reference genome. However, the limited yield and sequencing quality of a single experiment does limit the applicability of the method for more complicated genomic studies. These issues can be addressed with various techniques, chiefly target enrichment, but adapting such methods into the sequencing pipeline has its own challenges.
  • Arsin, Sila (2019)
    Mycosporines and mycosporine-like amino acids (MAAs) are small molecules that provide UV protection in a broad range of organisms. Cyanobacteria produce a diverse set of MAA chemical variants, many of which are glycosylated. Even though the biosynthetic pathway for the production of a common cyanobacterial MAA, shinorine, is known, the biosynthetic origins of the glycosylated variants remain unclear. In this work, bioinformatics analyses were performed to catalogue the genetic diversity encoded in the MAA gene clusters in cyanobacterial genomes and identify a set of enzymes that might be involved in MAA biosynthesis. A total of 211 cyanobacterial genomes were found to contain the MAA gene cluster, with six containing glycosyltransferase genes within the gene cluster. Afterwards, 38 strains from the University of Helsinki Culture Collection were tested for the production of MAAs using QTOF-LC/MS analyses. This resulted in the identification of several novel glycosylated MAA chemical variants from Nostoc sp. UHCC 0302, which contained a 7.4 kb MAA biosynthetic gene cluster consisting of 7 genes, including two for glycosyltransferases and one for dioxygenase. Heterologous expression of this gene cluster in Escherichia coli TOP10 resulted in the production of a glycosylated porphyra-334 variant of 509 m/z by the transformant cells, showing that colanic acid biosynthesis glycosyltransferases can catalyse the addition of hexose to MAAs. These results suggested a biosynthetic route for the production of glycosylated MAAs in cyanobacteria and allowed us to propose a putative role for dioxygenases in MAA biosynthesis. Further characterization of additional glycosyltransferases is necessary to improve our understanding of glycosylated MAA biosynthesis and functionality, which could be applied to large scale processes and be used in industrial applications.
  • Hellsten, Kirsi (2023)
    Triglycerides are a type of lipid that enters our body with fatty food. High triglyceride levels are often caused by an unhealthy diet, poor lifestyle, poorly treated diseases such as diabetes and too little exercise. Other risk factors found in various studies are HIV, menopause, inherited lipid metabolism disorder and South Asian ancestry. Complications of high triglycerides include pancreatitis, carotid artery disease, coronary artery disease, metabolic syndrome, peripheral artery disease, and strokes. Migration has made Singapore diverse, and it contains several subpopulations. One third of the population has genetic ancestry in China. The second largest group has genetic ancestry in Malaysia, and the third largest has genetic ancestry in India. Even though Singapore has one of the highest life expectancies in the world, unhealthy lifestyles such as poor diet, lack of exercise and smoking are still visible in everyday life. The purpose of this thesis was to introduce GWAS analysis for quantitative traits and apply it to real data, and also to see whether there are associations between some variants and triglycerides in the three main subpopulations in Singapore and to compare the results to previous studies. The research questions that this thesis answered are: what is GWAS analysis and what is it used for, how can GWAS be applied to data containing quantitative traits, and are there associations between some SNPs and triglycerides in the three main populations in Singapore. GWAS stands for genome-wide association studies, which are designed to identify statistical associations between genetic variants and phenotypes or traits. One reason for developing GWAS was to learn to identify different genetic factors which have an impact on significant phenotypes, for instance susceptibility to certain diseases. Such information can eventually be used to predict the phenotypes of individuals. GWAS have been globally used in, for example, anthropology, biomedicine, biotechnology, and forensics.
The studies enhance the understanding of human evolution and natural selection and help advance many areas of biology. The study used several quality control methods, linear models, and Bayesian inference to study associations. The research results were examined, among other things, with the help of various visual methods. The dataset used in this thesis was open data used by Saw, W., Tantoso, E., Begum, H. et al. in their previous study. This study showed that there are associations between 6 different variants and triglycerides in the three main subpopulations in Singapore. The study results were compared with the results of two previous studies, which differed from the results of this study, suggesting that the results are significant. In addition, the thesis reviewed the ethics of GWAS and the limitations and benefits of GWAS. Most studies like this have been done in Europe, so more research is needed in different parts of the world. This research can also be continued with different methods and variables.
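As an illustration of how a linear model can test one variant against a quantitative trait, the sketch below fits a simple least-squares regression of trait value on genotype dosage (0, 1 or 2 copies of an allele). It is only a sketch under stated assumptions, not the analysis performed in the thesis: the function name and the toy data are hypothetical, and covariates, quality control and multiple-testing correction are omitted.

```python
import math
from typing import List, Tuple


def snp_association(genotypes: List[int], trait: List[float]) -> Tuple[float, float, float]:
    """Regress a quantitative trait on genotype dosage for one variant.

    Returns (effect size, standard error, t-statistic) of the slope.
    """
    n = len(genotypes)
    if n != len(trait) or n < 3:
        raise ValueError("need at least three paired observations")

    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))

    beta = sxy / sxx                 # estimated per-allele effect
    alpha = mean_y - beta * mean_g   # intercept
    rss = sum((y - alpha - beta * g) ** 2 for g, y in zip(genotypes, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se if se > 0 else math.inf
    return beta, se, t_stat


# Made-up toy data: dosages and a trait measured on six individuals.
dosages = [0, 0, 1, 1, 2, 2]
values = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
beta, se, t_stat = snp_association(dosages, values)
print(f"beta={beta:.3f} se={se:.3f} t={t_stat:.2f}")
```

In a genome-wide setting the same test would be repeated for every variant, which is why a stringent significance threshold is normally applied to the resulting statistics.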
  • Kähkönen, Harri (2023)
    The volume of data generated by high-throughput DNA sequencing has grown to a magnitude that leads to substantial computational challenges in storing and searching the data. To tackle this problem, various computational methodologies have been developed in recent years to space-efficiently index collections of data sets and enable efficient searches. One of the most recent indexing methods, the Spectral Burrows-Wheeler Transform (SBWT), represents all distinct k-mers of a DNA sequence using only 4 bits and a small additional space for the rank data structures per k-mer. In addition to being space-efficient, it also enables k-mer membership queries in linear time relative to k, and constant time relative to the number of distinct k-mers in the sequence. The queries rely on rank queries over bit vectors. Experiments run on a single CPU thread have shown that in one second, hundreds of thousands of k-mer membership queries can be performed over SBWT. By parallelizing the queries on a CPU, it is possible to execute millions of queries per second. However, Graphics Processing Units (GPUs) have much more parallelization potential. The main contribution of the thesis is an implementation of the k-mer membership queries over SBWT with GPU computing. Optimizing the queries to be performed on a GPU made it possible to perform over a billion queries per second. Furthermore, the thesis presents a new enhancement for the queries over SBWT called presearching, which doubles the speed of the original SBWT search query. The rank query needed for the membership queries is implemented using space-efficient poppy rank data structures and a derivative, the cumulative-poppy data structure, which is one of the contributions of the thesis.
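The rank query that the abstract above builds on can be sketched as follows: `rank1(i)` counts the 1-bits before position `i`, and precomputed per-block counts keep the work per query bounded by the block size. This is a simplified illustration only, not the poppy or cumulative-poppy structures from the thesis and not a GPU implementation. The class name and the `block_size` parameter are hypothetical.

```python
from typing import List


class RankBitVector:
    """Bit vector with block-wise precomputed counts for rank queries."""

    def __init__(self, bits: List[int], block_size: int = 64) -> None:
        self.bits = bits
        self.block_size = block_size
        # cumulative[b] = number of 1-bits in bits[0 : b * block_size]
        self.cumulative = [0]
        for start in range(0, len(bits), block_size):
            ones_in_block = sum(bits[start:start + block_size])
            self.cumulative.append(self.cumulative[-1] + ones_in_block)

    def rank1(self, i: int) -> int:
        """Return the number of 1-bits in bits[0:i] (position i excluded)."""
        if not 0 <= i <= len(self.bits):
            raise IndexError("rank position out of range")
        block = i // self.block_size
        # Precomputed count up to the block start, plus a scan inside the block.
        return self.cumulative[block] + sum(self.bits[block * self.block_size:i])


bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], block_size=4)
print(bv.rank1(7))
```

Practical implementations pack bits into machine words and use hardware popcount instructions instead of Python lists. The list-based version here only shows the division of work between precomputed counts and an in-block scan.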
  • Scheinin, Ilari (2011)
    Ewing sarcoma is an aggressive and poorly differentiated malignancy of bone and soft tissue. It primarily affects children, adolescents, and young adults, with a slight male predominance. It is characterized by a translocation between chromosomes 11 and 22 resulting in the EWSR1-FLI1 fusion transcription factor. The aim of this study is to identify putative Ewing sarcoma target genes through an integrative analysis of three microarray data sets. Array comparative genomic hybridization is used to measure changes in DNA copy number, and analyzed to detect common chromosomal aberrations. mRNA and miRNA microarrays are used to measure expression of protein-coding and miRNA genes, and these results are integrated with the copy number data. Chromosomal aberrations typically also contain bystanders in addition to the driving tumor suppressor genes and oncogenes, and integration with expression helps to identify the true targets. Correlation between expression of miRNAs and their predicted target mRNAs is also evaluated to assess the results of post-transcriptional miRNA regulation on mRNA levels. The highest frequencies of copy number gains were identified in chromosome 8, 1q, and X. Losses were most frequent in 9p21.3, which also showed an enrichment of copy number breakpoints relative to the rest of the genome. Copy number losses in 9p21.3 were found to have a statistically significant effect on the expression of MTAP, but not on CDKN2A, which is a known tumor suppressor in the same locus. MTAP was also down-regulated in the Ewing sarcoma cell lines compared to mesenchymal stem cells. Genes exhibiting elevated expression in association with copy number gains and up-regulation compared to the reference samples included DCAF7, ENO2, MTCP1, and STK40. Differentially expressed miRNAs were detected by comparing Ewing sarcoma cell lines against mesenchymal stem cells.
21 up-regulated and 32 down-regulated miRNAs were identified, including miR-145, which has been previously linked to Ewing sarcoma. The EWSR1-FLI1 fusion gene represses miR-145, which in turn targets FLI1, forming a mutually repressive feedback loop. In addition to higher expression linked to copy number gains and compared to mesenchymal stem cells, STK40 was also found to be a target of four different miRNAs that were all down-regulated in Ewing sarcoma cell lines compared to the reference samples. SLCO5A1 was identified as the only up-regulated gene within a frequently gained region in chromosome 8. This region was gained in over 90% of the cell lines, and also with a higher frequency than the neighboring regions. In addition, SLCO5A1 was found to be a target of three miRNAs that were down-regulated compared to the mesenchymal stem cells.
  • Nebelung, Hanna (2023)
    ScRNA-seq captures a static picture of a cell's transcriptome including abundances of unspliced and spliced RNA. RNA velocity methods offer the opportunity to infer future RNA abundances and thus future states of a cell based on the temporal change of these unspliced and spliced RNAs. Early RNA velocity methods have shed light on transcriptional dynamics in many biological processes. However, due to strict assumptions in the underlying model, these models are not reliable when analysing and inferring velocity for genes with complex expression dynamics such as genes with transcriptional boosts. These genes can for example be observed in erythropoietic and hematopoietic data. Several new RNA velocity methods have been proposed recently. Among these, veloVI and Pyro-Velocity both employ Bayesian methods to estimate the reaction rate and latent parameters. Thus the problem of estimating RNA velocity is turned into a posterior probability inference that allows for more flexible inference of model parameters and the quantification of uncertainty. The objectives of this thesis were to investigate newly published RNA velocity methods, veloVI and Pyro-Velocity, in comparison to the established tool scVelo. To achieve this, we applied the methods to data obtained from scRNA-seq of healthy and ERCC6L2 disease bone marrow cells. ERCC6L2 disease can cause bone marrow failure with a risk of progression to acute myeloid leukemia with erythroid predominance. Specifically, we evaluated whether RNA velocity results reflect hematopoietic differentiation, whether genes with transcriptional boosts affect the velocity results, and whether RNA velocity analysis can indicate why erythropoiesis in ERCC6L2 disease is affected. We find that new RNA velocity methods cannot produce velocity estimations that are fully in line with what is known of hematopoiesis in our data. Further, the results suggest that velocity estimations by veloVI are affected by genes with transcriptional boosts.
Moreover, RNA velocity methods examined in this thesis are not robust and cannot reliably predict cell transitions based on the estimated velocity. Consequently, velocity estimations for disease data such as ERCC6L2 disease must be evaluated carefully before drawing any conclusion about the differentiation process. In conclusion, this thesis highlights the need for models that can model complex transcription kinetics. Still, as this field is rapidly growing and promising new methods are being developed, improvement of RNA velocity analysis, in general, is possible.