Calculating polygenic risk scores using down-sampled low-coverage whole-genome sequencing data in a Finnish cohort

Title: Calculating polygenic risk scores using down-sampled low-coverage whole-genome sequencing data in a Finnish cohort
Author(s): Suhonen, Sannimari
Contributor: University of Helsinki, Faculty of Science
Degree program: Master's Programme in Life Science Informatics
Specialisation: Bioinformatics and Systems Medicine
Language: English
Acceptance year: 2023
Polygenic risk scores (PRSs) estimate an individual's genetic risk for a polygenic disease trait by summing the effects of multiple variants across the genome that affect disease risk. Currently, PRSs are calculated from imputed array genotyping data, which is inexpensive to produce and has standard procedures and pipelines available. However, genotyping arrays are prone to ascertainment bias, which can lead to biased PRS results in some populations. If PRSs are used in healthcare for screening rare diseases, whole-genome sequencing (WGS) is preferable to array genotyping, because individual samples can also be analyzed easily. While high-coverage WGS (hcWGS) is still significantly more expensive than array genotyping, low-coverage WGS (lcWGS) with imputation has been proposed as an alternative to genotyping arrays. This project studied the utility of imputed lcWGS data in PRS estimation compared to genotyping array data, as well as the impact of the choice of imputation tool for lcWGS data. Down-sampled WGS data at six different low coverages (0.1x-2x) was used to represent lcWGS data. Two different pipelines were used for genotype imputation and haplotype phasing: in the first, pre-phasing and imputation were performed directly on the genotype likelihoods (GLs) calculated from the down-sampled data, whereas in the second, the GLs were converted to genotype calls before imputation and phasing. In both pipelines, PRSs for 27 disease phenotypes were calculated from the imputed and phased lcWGS data. The imputation and PRS calculation accuracy of the two pipelines was assessed in relation to both genotyping array and hcWGS data. In both pipelines, imputation and PRS calculation accuracy increased with increasing down-sampled coverage.
The second imputation and phasing pipeline led to better results in both imputation and PRS calculation accuracy. Some differences in PRS accuracy between phenotypes were also detected. The results show patterns similar to those reported in comparable publications. Imputation and PRS accuracy did not quite reach the levels seen in earlier studies, but possible limitations leading to the lower accuracy could be identified. The results also emphasize the importance of choosing suitable imputation and phasing methods for lcWGS data and suggest that methods and pipelines designed specifically for lcWGS should be developed and published.
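The abstract describes a PRS as a sum of variant effects across the genome. As a minimal sketch of that idea, assuming the common weighted-sum form in which each variant contributes its effect-allele dosage multiplied by a per-variant effect weight, the score for one individual could be computed as below. The function name, the dosage encoding (0 to 2 copies of the effect allele), and the example values are illustrative assumptions and are not taken from the thesis.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of effect-allele dosages for one individual.

    dosages: effect-allele dosage per variant (assumed to lie in [0, 2]).
    weights: per-variant effect weights (hypothetical values here).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical example: three variants with made-up weights.
example_dosages = [0.0, 1.0, 2.0]
example_weights = [0.10, -0.20, 0.05]
example_score = polygenic_risk_score(example_dosages, example_weights)
```

Imputed data often provides fractional dosages rather than integer genotype calls, which is why the sketch accepts floats.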
Keyword(s): Polygenic risk scores; Low-coverage whole-genome sequencing; Genotype imputation

Files in this item

File: Suhonen_Sannimari_thesis_2023.pdf (PDF, 1.056 Mb)
