Browsing by study line "Biostatistics and Bioinformatics"
Now showing items 1-12 of 12

(2023) Seasonal variation has affected human societies throughout history, shaping various aspects of life including agriculture, migration patterns and culture. This influence is observed, among other things, in the occurrence of diseases such as viral and bacterial infections, cardiovascular disease and mental disorders. While a multitude of environmental and behavioral factors influence the timing of disease diagnoses, the role of genetics has, to the best of our knowledge, not been explored. The aim of this thesis was to relate genetic variation to seasonal disease risk. To achieve this, the seasonality of 1,759 disease endpoints was assessed in the Finnish population. A subset of 14 diseases was selected and used as input to a statistical modeling framework that was developed to search for genetic variants associated with seasonal disease risk in the FinnGen study population. A total of 9 genome-wide significant loci affecting seasonality were identified, including a topsQTL, rs41273830[T], in ITGB8 for major depression and a stop-gain variant, rs601338[A], in FUT2 for intestinal infections, the latter also being protective against disease risk. This introduces a new aspect to genetic research, which can contribute both to a better understanding of how known disease variants affect disease and to finding new disease variants whose effects are currently obscured by seasonal variation.

(2021) In inductive inference, phenomena from the past are modeled in order to make predictions about the future. The mathematical concept of exchangeability for random sequences provides a mathematical justification for the assumption that observations are independently and identically distributed given some underlying parameters estimable from the empirical distribution of the observations. The theory of exchangeability contains basic elements for inductive inference, such as the de Finetti representation theorem for the probability of a general exchangeable sequence, prior probability distributions for the parameters in the representation theorem, as well as the predictive probabilities, or rule of succession, for new observations from the random sequence under consideration. However, entirely unanticipated observations pose a problem for inductive inference. How can one assign a probability to an event that has never been seen before? This is called the sampling of species problem. Under exchangeability, the number of possible different events t has to be known beforehand in order to assign an equal prior probability 1/t to each event. In the sampling of species problem an assumption of infinitely many possible events has to be made, leading to the prior probability 1/∞ for each event, which is impossible. Exchangeability is thus inadequate for handling probability distributions over infinitely many possible events. It turns out that a solution to the sampling of species problem arises from partition exchangeability. Exchangeable random sequences have the same probability of occurring if the observations in the sequences have identical frequencies. Under partition exchangeability, the sequences have the same probability of occurring when they share identical frequencies of frequencies. In this thesis, partition exchangeability is introduced as a framework of inductive inference by juxtaposing it with the more familiar type of exchangeability for random sequences. 
Partition exchangeability has elements parallel to those of exchangeability: the Kingman representation theorem, the Poisson-Dirichlet distribution as the prior probability distribution, and a corresponding rule of succession. The rules of succession are required in the problem of supervised classification to provide product predictive probabilities, which are maximized by assigning the test data into predefined classes based on training data. A Bayesian construction of supervised classification is discussed in this thesis. In theory, the best classification performance is gained when the class labels are assigned to the test data simultaneously, but because of computational complexity, the test data points are often assumed to be i.i.d. with respect to each other. In the case of a known set of possible events, these simultaneous and marginal classifiers converge in their test data predictive probabilities as the amount of training data tends to infinity, justifying the use of the simpler marginal classifier when enough training data is available. These two classifiers are implemented in this thesis under partition exchangeability, and it is shown in theory and in practice with a simulation study that the same asymptotic convergence between the simultaneous and marginal classifiers applies to partition exchangeable data as well. Finally, a small application in single-cell RNA expression is explored.
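
The distinction between the two sufficient statistics named in the abstract above, frequencies and frequencies of frequencies, can be illustrated with a small sketch. The sequences and function names below are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter


def frequencies(sequence):
    """Count how many times each distinct event occurs in the sequence."""
    return Counter(sequence)


def frequencies_of_frequencies(sequence):
    """Count how many distinct events occur exactly once, twice, and so on."""
    return Counter(frequencies(sequence).values())


# Hypothetical toy sequences (assumed for illustration only).
seq_a = ["a", "b", "a", "c", "a", "b"]
seq_b = ["x", "y", "y", "z", "y", "x"]  # different labels, same partition shape
seq_c = ["b", "a", "a", "a", "c", "b"]  # a reordering of seq_a

# Exchangeability: reorderings share the same frequencies.
same_frequencies = frequencies(seq_a) == frequencies(seq_c)

# Partition exchangeability: only the frequencies of frequencies need to match,
# so sequences over entirely different labels can be equiprobable.
same_partition = (
    frequencies_of_frequencies(seq_a) == frequencies_of_frequencies(seq_b)
)
```

Here `seq_a` and `seq_b` have different frequencies (they share no labels) but the same frequencies of frequencies: one event seen three times, one seen twice and one seen once.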

(2024) Unsupervised learning techniques can detect clinically relevant structure in population cohort data of human gut microbiota. While the gut microbiota composition is influenced by individual factors such as diet, medication, and development of the immune system during early childhood, it is proposed that individuals maintain a relatively stable microbiota ecosystem throughout adulthood. This stability makes it possible to divide individuals into subgroups based on their gut microbiota characteristics, which define the key features of microbiota community types within the population. For this, I compared three probabilistic unsupervised learning techniques, the optimization-based Non-negative Matrix Factorization and the Bayesian modelling techniques Dirichlet Multinomial Mixtures and Latent Dirichlet Allocation, with a naive benchmark clustering based on dominant taxa. I used the strength of association with all-cause mortality as a quantitative metric to distinguish biologically relevant structure in a large Finnish population cohort with almost 18 years of follow-up. The techniques defined microbiota assemblages as either discrete enterotypes, which assigned each sample to a single community type, or continuous enterosignatures, which identified patterns of co-occurrence of microbiota community types within each sample. I found five rather robust community types, characterized by the bacterial genera Bacteroides, Alistipes, Agathobacter, Escherichia, and Prevotella. Latent Dirichlet Allocation detected the strongest early mortality signal using Cox regression, outperforming all other techniques. The replicability of Latent Dirichlet Allocation was assessed using cross-validation. The predicted community types uncovered an ecological landscape similar to that of the community types obtained using the entire data set, confirming the clinical relevance, robustness, and scalability of the technique.
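
A naive benchmark of the kind mentioned above, clustering by dominant taxon, might be sketched as follows. The abstract does not describe the implementation, so the data layout, sample identifiers and abundance values here are hypothetical.

```python
def dominant_taxon(abundances):
    """Return the taxon with the highest abundance in one sample."""
    return max(abundances, key=abundances.get)


def cluster_by_dominant_taxon(samples):
    """Group sample identifiers by their dominant taxon."""
    clusters = {}
    for sample_id, abundances in samples.items():
        clusters.setdefault(dominant_taxon(abundances), []).append(sample_id)
    return clusters


# Hypothetical relative abundances per genus (illustrative values only).
samples = {
    "s1": {"Bacteroides": 0.55, "Prevotella": 0.10, "Alistipes": 0.35},
    "s2": {"Bacteroides": 0.20, "Prevotella": 0.70, "Alistipes": 0.10},
    "s3": {"Bacteroides": 0.60, "Prevotella": 0.05, "Alistipes": 0.35},
}

clusters = cluster_by_dominant_taxon(samples)
```

Each sample lands in exactly one group, which corresponds to the discrete, enterotype-style assignment rather than the continuous enterosignatures.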

(2020) It is challenging to identify causal genes and pathways explaining the associations with diseases and traits found by genome-wide association studies (GWASs). To solve this problem, a variety of methods that prioritize genes based on the variants identified by GWASs have been developed. In this thesis, the methods Data-driven Expression Prioritized Integration for Complex Traits (DEPICT) and Multi-marker Analysis of GenoMic Annotation (MAGMA) are used to prioritize causal genes based on the most recently published publicly available schizophrenia GWAS summary statistics. The two methods are compared using the Benchmarker framework, which allows an unbiased comparison of gene prioritization methods. The study has four aims. Firstly, to explain how the gene prioritization methods DEPICT and MAGMA work and how they differ. Secondly, to explain how the Benchmarker framework can be used to compare gene prioritization methods in an unbiased way. Thirdly, to compare the performance of DEPICT and MAGMA in prioritizing genes based on the latest schizophrenia summary statistics from 2018 using the Benchmarker framework. Lastly, to compare the performance of DEPICT and MAGMA on a schizophrenia GWAS with a smaller sample size by using Benchmarker. Firstly, the published results of the Benchmarker analyses using the schizophrenia GWAS from 2014 were replicated to make sure that the framework was run correctly. The results were very similar, and both the original and the replicated results show that DEPICT and MAGMA do not perform significantly differently. Furthermore, they show that the intersection of genes prioritized by DEPICT and MAGMA outperforms the outersection, which is defined as the genes prioritized by only one of these methods. Secondly, Benchmarker was used to compare the performance of DEPICT and MAGMA in prioritizing genes using the schizophrenia GWAS from 2018. 
The results of the Benchmarker analyses suggest that DEPICT and MAGMA perform similarly with the GWAS from 2018 compared to the GWAS from 2014. Furthermore, an earlier schizophrenia GWAS from 2011 was used to check if the performance of DEPICT and MAGMA differs when a GWAS with lower statistical power is used. The results of the Benchmarker analyses make clear that MAGMA performs better than DEPICT in prioritizing genes using this smaller data set. Furthermore, for the schizophrenia GWAS from 2011 the outersection of genes prioritized by DEPICT and MAGMA outperforms the intersection. To conclude, the Benchmarker framework is a useful tool for comparing gene prioritization methods in an unbiased way. For the most recently published schizophrenia GWAS from 2018 there is no significant difference between the performance of DEPICT and MAGMA in prioritizing genes according to Benchmarker. For the smaller schizophrenia GWAS from 2011, however, MAGMA outperformed DEPICT.
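
The intersection and outersection of two prioritized gene lists, as defined in the abstract above, map directly onto set operations. This sketch uses placeholder gene identifiers, which are assumptions and not results from the thesis.

```python
def intersection(genes_a, genes_b):
    """Genes prioritized by both methods."""
    return set(genes_a) & set(genes_b)


def outersection(genes_a, genes_b):
    """Genes prioritized by exactly one of the two methods."""
    return set(genes_a) ^ set(genes_b)


# Placeholder identifiers standing in for prioritized genes (hypothetical).
depict_genes = {"GENE_A", "GENE_B", "GENE_C"}
magma_genes = {"GENE_B", "GENE_C", "GENE_D"}

both = intersection(depict_genes, magma_genes)
only_one = outersection(depict_genes, magma_genes)
```

The two sets are disjoint and together cover every gene prioritized by either method.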

(2023) Background/Objectives: Various studies have shown the advantage of incorporating polygenic risk scores (PRSs) in models with classic risk factors. However, systematic comparisons of PRSs with non-genetic factors are lacking. In particular, many studies on PRSs do not even report the predictive performance of the confounders included in the model, such as age and sex, which are already very predictive for most diseases. We looked at the ability of PRSs to predict the onset of 18 diseases in FinnGen R8 (N=342,499) and compared PRSs with the known non-genetic risk factors age, sex, Education, and Charlson Comorbidity Index (CCI). Methods: We set up individual studies for the 18 diseases. A single study consisted of an exposure period (1999-2009), a washout period (2009-2011), and an observation period (2011-2019). Eligible individuals could not have the selected disease of interest inside the disease-free period, which ranged from birth until the beginning of the observation period. We then defined the case and control status based on the diagnoses in the observation period and calculated the phenotypic scores during the exposure period. The PRSs were calculated using MegaPRS and the latest publicly available genome-wide association study summary statistics. We then fitted separate Cox proportional hazards models for each disease to predict disease onset during the observation period. Results: In FinnGen, the model’s predictive ability (c-index) with all predictors ranged from 0.565 (95% CI: 0.552-0.576) for Acute Appendicitis to 0.838 (95% CI: 0.834-0.841) for Atrial Fibrillation. The PRSs outperformed the phenotypic predictors, CCI and Education, for 6/18 diseases and still significantly enhanced onset prediction for 13/18 diseases when added to a model with only non-genetic predictors. Conclusion: Overall, we showed that for many diseases PRSs add predictive power over commonly used predictors such as age, sex, CCI, and Education. 
However, many important challenges must be addressed before implementing PRSs in clinical practice. Notably, we will need disease-specific cost-benefit analyses and studies to assess the direct impact of including PRSs in clinical use. Nonetheless, as more research is being conducted, PRSs could play an increasingly valuable role in identifying individuals at higher risk for certain diseases and enabling targeted interventions to improve health outcomes.
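
The c-index reported above measures how often a model ranks the individual with the earlier disease onset as higher risk. A minimal sketch of Harrell's concordance index for right-censored data follows; it ignores tied event times, and it is not necessarily the estimator used in the study. All input values are made up for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when individual i had the event and did so
    before individual j's event or censoring time. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored individual cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times (years), event indicators and model risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risk_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported range of 0.565 to 0.838.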

(2021) Traditional parametric statistical inference methods, such as maximum likelihood and Bayesian inference, cannot be used to learn parameter estimates if the likelihood is intractable, for example due to the complexity of the studied phenomenon. This can be overcome by using likelihood-free inference, which is used with simulator-based models to learn parameter estimates. In addition, traditional methods for estimating the uncertainty of the parameter estimates typically require a likelihood function, which is why they cannot be applied in likelihood-free inference. In this thesis, we present a novel way to compute confidence sets for parameter estimates obtained from likelihood-free inference using the Jensen-Shannon divergence. We consider two test statistics that are based on the mean Jensen-Shannon divergence and propose hypothesised asymptotic distributions for them. We test whether these hypothesised distributions can be used in the computation of confidence sets for parameter estimates obtained from likelihood-free inference, and we evaluate the produced confidence sets by studying their frequentist behaviour, summarised with coverage probabilities. Using three different models, we compare the frequentist behaviour of the Jensen-Shannon divergence estimates and confidence sets, obtained from grid evaluation of Monte Carlo estimates and from Bayesian optimisation for likelihood-free inference (BOLFI), to that of maximum likelihood inference with Wald’s and log-likelihood-ratio confidence sets. We also use a simulator-based model with an intractable likelihood to study the proposed confidence sets with BOLFI. In order to study the influence of observations on the parameter estimates and their confidence sets, we conducted these experiments with a varying number of observations. We show that Jensen-Shannon divergence based confidence sets meet the expected frequentist behaviour.
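
For readers unfamiliar with the quantity behind these confidence sets, the Jensen-Shannon divergence between two discrete distributions can be sketched as below. This shows only the divergence itself, with natural logarithms; the test statistics and asymptotic distributions proposed in the thesis are not given in the abstract and are not reproduced here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetrised divergence of p and q from their equal-weight mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Illustrative distributions over three outcomes (assumed values).
observed = [0.5, 0.3, 0.2]
simulated = [0.4, 0.4, 0.2]

divergence = jensen_shannon_divergence(observed, simulated)
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and bounded above by log 2, which it reaches for distributions with disjoint support.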

(2022) Species factories are defined as times and places in the fossil record where and when an exceptionally large number of new species occurs. While several tailored solutions for the mammalian record have been proposed, how to identify species factories computationally in a standardized way is still an open question. To quantify what is exceptional, we first need to quantify what is regular. One of the main challenges in this identification process is to account for sampling unevenness, which depends on several methodological decisions, including the scale of the analysis (aggregation radius). In this thesis we used Capture-Mark-Recapture (CMR) methods with spatial aggregation guided by network modelling to estimate the sampling probabilities for the species in the NOW database of mammalian fossil occurrences. Since the mammalian record is sparse and most localities include only a few species, we coupled CMR with tailored spatial aggregation approaches to estimate the sampling probabilities. We then used these sampling probabilities to quantify background speciation rates and assess what rates are abnormal. We represented aggregated fossil data as a bipartite network and used community detection to evaluate how the choice of an aggregation radius impacts the modular structure. After aggregating the data according to the radius chosen using network analysis, we estimated sampling probabilities using CMR. These probabilities allow adjustment for sampling unevenness, so that differences in findings can be compared across locations and cannot be due to differences in sampling. We identified as species factories the locations with an origination rate in the highest 5% after adjustment per time unit. 
Once the species factories had been identified, we looked for paleoecological patterns in these places that may be lacking elsewhere. We found that species factories have fewer findings and fewer different species among the findings, but a higher ratio of the number of different species to the total number of findings, than the rest of the locations. This would indicate that, even if species factories might accommodate fewer species, they present a higher diversity. To make sure these results were not due to chance alone, we performed the same analysis on 100 randomized experiments obtained using a modified version of the Curveball Algorithm and compared the values obtained from the original dataset with those obtained from the randomized ones. This comparison showed that species factories tend to have more extreme values than those obtained through randomization, which would indicate that species factories present specific paleoecological patterns that are not present in other locations.

(2023) Population structure refers to the patterns of genetic variation within and between populations, which arise from various evolutionary processes such as genetic drift, natural selection and migration. Understanding this structure in human populations provides insights into our own evolutionary history and past migration patterns. Controlling for underlying population structure is also an essential step in genetic association analyses to ensure that the associations between genetic variants and traits of interest are not confounded by differences in ancestry. Results from such analyses are essential for the research and development of personalised medicine. Principal component analysis (PCA) is a method that has been widely used to study the patterns of genetic variability within populations. In this study, PCA is applied to a genotype data set of 38,113 samples from individuals born in Finland, using data from the Finnish study cohorts FINRISK, GeneRISK, FinHealth 2017 and Health 2000. The first ten principal components are extracted using PLINK 2.0 software. Novel discoveries of association between genetic variants and a disease often motivate further studies on the geographical distribution of such risk variants. Here, the genetic population structure is proposed as an alternative, higher-dimensional space for studying the distribution of genetic variants within a population. This study presents a framework for quantifying and visualising the allele frequency variability across the genetic structure defined by principal components. Using an empirical Bayes model, the posterior minor allele frequency is estimated in discrete areas of the principal component space. The variability of these estimates is visualised as heatmaps, using a colouring scheme that provides statistical guarantees for frequency differences between different colours. The framework is demonstrated on five biallelic variants known to be associated with a disease or a disorder. 
The results show that visualising the pairwise components complemented with data on sample birth location reveals the major patterns of genetic variability within the Finnish population. The framework is able to distinguish areas in the genetic structure with differing levels of allele frequency, and visualise this variability as heatmaps that enable meaningful visual interpretation. The levels of allele frequency differences found in the principal component space are comparable to the differences found geographically, which suggests that studying individual variants within the genetic structure on top of geographical frequency maps can provide additional information on their distribution in a population.

(2021) Sex differences can be found in most human phenotypes, and they play an important role in human health and disease. Females and males have different sex chromosomes, which are known to cause sex differences, as are differences in the concentration of sex hormones such as testosterone, estradiol and progesterone. However, the role of the autosomes has remained more debated. The primary aim of this thesis is to assess the magnitude and relevance of human sex-specific genetic architecture in the autosomes. This is done by calculating sex-specific heritability estimates and genetic correlation estimates between females and males, as well as comparing these to sex differences on the phenotype level. Additionally, the heritability and genetic correlation estimates are compared between two populations, in order to assess the magnitude of sex differences compared to differences between populations. The analyses in this thesis are based on sex-stratified genome-wide association study (GWAS) data from 48 phenotypes in the UK Biobank (UKB), which contains genotype data from approximately 500 000 individuals as well as thousands of phenotype measurements. A replication of the analyses using three phenotypes was also made on data from the FinnGen project, with a dataset from approximately 175 000 individuals. The 48 phenotypes used in this study range from biomarkers such as serum testosterone and albumin levels to general traits such as height and blood pressure. The heritability and genetic correlation estimates were calculated using linkage disequilibrium score regression (LDSC). LDSC fits a linear regression model between test statistic values of GWAS variants and linkage disequilibrium (LD) scores calculated from a reference population. For most phenotypes, the heritability and genetic correlation results show little evidence of sex differences. 
Serum testosterone level and waist-to-hip ratio are exceptions to this, showing strong evidence of sex differences both on the genetic and the phenotype level. However, the overall correlation between phenotype-level sex differences and sex differences in heritability or genetic correlation estimates is low. The replication in the FinnGen dataset for height, weight and body mass index (BMI) showed that for these traits the differences in heritability estimates and genetic correlations between the Finnish and UK populations are comparable to or larger than the differences found between males and females.
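
The regression idea behind LDSC described above can be sketched with ordinary least squares: the expected chi-square statistic of a variant grows linearly with its LD score, and under the standard LDSC model the slope equals N·h²/M for N samples and M variants. The sketch below is a simplified, unweighted illustration on synthetic, noise-free data; the actual LDSC software uses weighted regression and block jackknife standard errors, and all numbers here are assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def heritability_from_slope(slope, n_samples, n_variants):
    """Convert the regression slope N * h2 / M into a heritability estimate."""
    return slope * n_variants / n_samples


# Synthetic example (assumed values): N = 100,000 samples, M = 1,000,000
# variants, true h2 = 0.4, so the true slope is 0.04 and the intercept is 1.
n_samples, n_variants = 100_000, 1_000_000
ld_scores = [10.0, 50.0, 100.0, 200.0]
chi_squares = [1.0 + 0.04 * score for score in ld_scores]

slope, intercept = fit_line(ld_scores, chi_squares)
h2_estimate = heritability_from_slope(slope, n_samples, n_variants)
```

Running the same fit separately on female-only and male-only summary statistics would give the kind of sex-specific heritability estimates compared in the thesis.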

(2022) The purpose of this thesis is to present and illustrate ways in which statistical uncertainty can be explained and visualised. The target audience for this communication of statistical uncertainty is, in particular, readers with little or no prior experience of statistical concepts or methods. COVID-19 data sets are used as the application for presenting these methods of visual communication. Communication about the COVID-19 infectious disease has had very different target audiences, but in communication about the progress of the epidemic addressed to, for example, the entire population of Finland, the essential task has been precisely to reach an audience that does not consist of experts in the field. The thesis is based on the COVID-19 data and the epidemic situation of 2020, when hardly any immunity to the disease had yet developed in the population. The thesis begins by introducing the SEIR infectious disease model, which describes the development of an epidemic in a population through four stages of infection. The SEIR model was also used in COVID-19 modelling in the early phase of the epidemic, because COVID-19 was thought to behave as an epidemic in the same way with respect to these four stages. In addition to introducing the model, the thesis briefly considers how the parameters used in the model, such as the basic reproduction number, affect the development of the epidemic. The COVID-19 modelling of the Finnish Institute for Health and Welfare (Terveyden ja hyvinvoinnin laitos) is also presented from the perspective of the SEIR model and the development of case numbers in early 2020. The effect on the epidemic, via the reproduction number, of the restrictions in place in 2020 that reduced the number of contacts between individuals is also brought up. As regards statistical uncertainty, this thesis focuses on its sources, since uncertainty may arise from a lack of available information or from its randomness. Understanding the underlying causes is essential for explaining and illustrating the overall picture and its parts. 
The thesis considers in particular the uncertainty present in COVID-19 modelling and in testing for infections. In addition, the thesis examines ways of presenting statistical uncertainty, such as the standard deviation or standard error associated with sampling and confidence intervals, and later, among other things, the visualisation and communication of these concepts. The communication of statistical uncertainty is presented especially through various visual graphs, such as box plots and scatter plots, while considering the benefits and challenges of the different graphs. Towards the end, the thesis looks, from a communication perspective, at factors that affect the interpretation of graphs and at the goals of communicating uncertainty, for example in terms of the trust or emotions that communication evokes. Finally, the thesis gathers possible challenges in the visual presentation of statistical uncertainty, which may arise, for example, from interpretations made by the target audience or from the use of irrelevant graphs.
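
The four-stage SEIR model and the role of the basic reproduction number can be sketched with a simple forward Euler simulation. This is a minimal textbook form of the model, not the implementation used in the thesis or by the Finnish Institute for Health and Welfare, and all parameter values are illustrative assumptions.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, rec0, days, dt=0.1):
    """Simulate susceptible, exposed, infectious and removed counts over time.

    beta: transmission rate, sigma: rate of leaving the exposed stage,
    gamma: removal rate. The basic reproduction number is beta / gamma.
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(rec0)
    n = s + e + i + r
    history = [(s, e, i, r)]
    for _ in range(int(round(days / dt))):
        new_exposed = beta * s * i / n * dt   # S -> E
        new_infectious = sigma * e * dt       # E -> I
        new_removed = gamma * i * dt          # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        history.append((s, e, i, r))
    return history


def peak_infectious(history):
    """Largest number of simultaneously infectious individuals."""
    return max(state[2] for state in history)


# Assumed parameters: 4-day latent period, 5-day infectious period.
sigma, gamma = 0.25, 0.2
population, initial_infectious = 1_000_000, 10

# Basic reproduction number 2.5 without restrictions, 0.8 with fewer contacts.
unrestricted = simulate_seir(2.5 * gamma, sigma, gamma,
                             population - initial_infectious, 0,
                             initial_infectious, 0, days=200)
restricted = simulate_seir(0.8 * gamma, sigma, gamma,
                           population - initial_infectious, 0,
                           initial_infectious, 0, days=200)
```

Comparing `peak_infectious(unrestricted)` with `peak_infectious(restricted)` shows how lowering the reproduction number below one, for example through contact restrictions, prevents the number of infectious individuals from growing.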

(2020) The aim of this thesis is to predict the total career racing performance of Finnish trotter horses by using the trotters' early career racing performance and other early career variables. This thesis presents a brief introduction to harness racing and the horses used in Finnish trotting sport. The data is presented and modified for predictions, with descriptive statistics shown as tables and visuals. The machine learning method of random forests for regression is introduced and used in the predictions. After training the model, this thesis presents the prediction accuracy and variable importance for the predictions of total career racing performance for both the Finnhorse trotter and the Finnish Standardbred trotter populations. Finally, the writer discusses the shortcomings and possible improvements for future research. The data for this thesis was provided by the Finnish trotting and breeding association (Suomen Hippos ry) and included all information on harness races raced in Finland from 1984 to the end of 2019. From almost three million rows, the data was summarised into a data table of 46,704 rows of trotters that started their careers in the three earliest allowed age groups. A total of 37 independent variables were used to predict three outcomes, total career earnings, total number of career starts and total number of career first placings, as separate models. The predictors are derived from other studies that estimate the environmental and genetic factors of the racing performance of a trotter. The three models performed poorly to moderately, with total earnings having the highest prediction accuracy. The model predicted larger amounts of earnings quite well, but was prone to predicting some earnings when there in fact were none. The prediction accuracy of the total number of starts was poor, especially when the true number of starts was low. The model that predicted the total number of career first placings performed the worst. 
This can partially be explained by the fact that winning is a rare event for a trotter in general. The models fit better for Finnish Standardbred trotters than for Finnhorse trotters. This thesis works as a good basis for similar future research in which massive amounts of data and machine learning are used to predict a trotter’s career, racing performance or other factors. The results show that treating the prediction of total career racing performance as a classification problem could be a better fit than regression. Adequate classes, as well as possibly better predictors and suitable imputation values for missing data, should be chosen in consultation with experts in harness racing.

(2024) Renewable energy is the key to a sustainable future in a world currently run by coal and oil, and one of these sources could be bioelectrochemical systems [McCormick et al., Energy Environ. Sci, 2015]. This is very different from traditional renewable energy sources, in that traditionally the process for producing solar cells requires exotic materials or a relatively extensive manufacturing process [Ren et al., Solar Energy, 2020]. One type of bioelectrochemical system is the biophotovoltaic system, which utilizes solar energy and water to produce electrons or other reducing agents outside of the organism, which can then be harvested for external usage [McCormick et al., Energy Environ. Sci, 2015]. Efforts to improve the efficiency of this type of system have many different focuses, including substrate design, reactor design, and electrode properties [Anam et al., Sustainable Energy Fuels, 2021]. While these are important, there is another avenue to be explored, namely the exoelectrogenesis pathway itself [Okedi et al., bioRxiv, 2021]. This pathway has been analysed briefly with Hilbert-Huang transforms to identify its oscillatory components, which have been partially mapped to photosystem II core expression [Okedi et al., bioRxiv, 2021]. In my analysis, I will use data generated from cyanobacteria that exhibit enhanced photosystem II and see if the exact mechanisms for this phenomenon can be captured. The data provided by the sequencing vendor comes in a FASTA Extension format, so the process and tools to translate this data into usable variant calling format files will be described. I will then describe the additional analysis, consisting of variant comparisons through strain concordance with gene comparisons, as well as phylogenetic trees. The first analysis compares a wild type to a mutated strain, and the subsequent analysis compares multiple wild type strains to each other. 
Further analysis on phenotype expression compared to the variant calling will also be explored.