Browsing by Subject "GTM"

  • Malmberg, Anni (2020)
    A population is said to be genetically structured when it can be divided into subpopulations based on genetic differences between individuals. In Finland, for example, the population has been shown to consist of genetic subpopulations that correspond strongly to geographical subgroups. Such information can be useful when studying the settlement and migration history of a population. Information about genetic population structure is also required, for example, in studies looking for associations between genetic variants and a heritable disease, to ensure that the groups with and without a diagnosis of the disease resemble each other genetically except for the variants causing the disease. In my thesis, I have compared how two different mathematical models, principal component analysis (PCA) and generative topographic mapping (GTM), visualize ancestry and identify genetic structure in the Finnish population. PCA was introduced as early as 1901 and is nowadays a standard tool for identifying genetic structure and visualizing ancestry. GTM, in contrast, was published relatively recently, in 1998, and has not yet been applied in population structure studies as widely as PCA. Both PCA and GTM transform high-dimensional data into a low-dimensional, interpretable representation in which relationships between observations are summarized. For data containing genetic heterogeneity between individuals, this representation gives a visual approximation of the genetic structure of the population. However, Héléna A. Gaspar and Gerome Breen found in 2018 that GTM is able to classify the ancestry of populations from around the world more accurately than PCA: the differences recognized by PCA were mainly between the geographically most distant populations, while GTM also detected more of their subpopulations.
My aims in the thesis were to examine whether applying the methods to Finnish data would give similar results, and to give thorough presentations of the mathematical background of both methods. I also discuss how the results fit into what is currently known about the genetic population structure in Finland. The study results are based on data from the FINRISK Study Survey, collected by the National Institute for Health and Welfare (THL) in 1992-2012, and include 35 499 samples. After performing quality control on the data, I analysed it with the SmartPCA program and the ugtm Python package, which implement PCA and GTM, respectively. The final results are presented for the 2010 individuals who participated in the FINRISK Study Survey in 1997 and whose parents were both born close to each other. I assigned the individuals to distinct geographical subgroups according to the birthplaces of their mothers to find out whether PCA and GTM identify individuals of similar geographical origin as genetically close to each other. Based on the results, the genetic structure in Finland is clearly geographically clustered, which agrees with what is known from earlier studies. The results were also similar to those observed by Gaspar and Breen: both methods identified the genetic substructure, but GTM was able to recognize more subtle differences in ancestry between the geographically defined subgroups than PCA. For example, GTM found that the group corresponding to the region of Northern Ostrobothnia consists of four smaller separate subgroups, while PCA interpreted the individuals of Northern Ostrobothnian origin as genetically rather homogeneous. Locating these individuals on the map of Finland according to the birthplaces of their mothers reveals that they also form four geographical clusters corresponding to the genetic subpopulations detected by GTM.
As a final conclusion, I state that GTM is a noteworthy alternative to PCA for studying genetic population structure, especially when it comes to identifying substructures in a population that PCA may interpret as genetically homogeneous. I also note that the reason why GTM generally seems to be capable of more fine-grained clustering than PCA is probably that PCA, as a linear model, may introduce more bias into the results than GTM, which also accounts for non-linear relationships when transforming the data into a more interpretable form.
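The linear projection that the abstract attributes to PCA can be illustrated with a small, self-contained sketch. It is not the thesis pipeline: it uses neither SmartPCA nor ugtm, and the two simulated subpopulations, their allele frequencies, and all function names are hypothetical choices made only for this illustration. Genotypes are coded as 0, 1 or 2 copies of an allele, columns are centred, and the first principal component is found by power iteration on the Gram matrix of the individuals.

```python
import math
import random


def simulate_genotypes(n_per_group, n_snps, seed=0):
    """Simulate 0/1/2 genotypes for two hypothetical subpopulations, A and B."""
    rng = random.Random(seed)
    freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    # Subpopulation B: allele frequencies shifted away from A (assumed drift).
    freqs_b = [min(0.95, max(0.05, f + rng.uniform(-0.3, 0.3))) for f in freqs_a]
    rows, labels = [], []
    for label, freqs in (("A", freqs_a), ("B", freqs_b)):
        for _ in range(n_per_group):
            # Two independent allele draws per SNP give a genotype of 0, 1 or 2.
            rows.append([sum(rng.random() < f for _ in range(2)) for f in freqs])
            labels.append(label)
    return rows, labels


def center_columns(rows):
    """Subtract the per-SNP mean from every genotype."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_pc_scores(rows, n_iter=300, seed=1):
    """Return each individual's score on the first principal component.

    Power iteration on the Gram matrix X X^T yields its leading eigenvector;
    scaling it by the square root of the eigenvalue gives the PC1 scores
    (defined only up to an overall sign).
    """
    x = center_columns(rows)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(g * vj for g, vj in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    eigenvalue = sum(vi * sum(g * vj for g, vj in zip(row, v))
                     for vi, row in zip(v, gram))
    return [math.sqrt(eigenvalue) * vi for vi in v]


if __name__ == "__main__":
    rows, labels = simulate_genotypes(n_per_group=20, n_snps=400)
    scores = first_pc_scores(rows)
    for group in ("A", "B"):
        members = [s for s, l in zip(scores, labels) if l == group]
        print(group, "mean PC1 score:", sum(members) / len(members))
```

With two strongly differentiated groups like these, a single linear component is usually enough to separate them. The finer, possibly non-linear substructure discussed in the abstract is the case where a method such as GTM is reported to do better.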