Browsing by Subject "statistics"

  • Laiho, Aleksi (2022)
    In statistics, data are often high-dimensional, with a very large number of variables that frequently exceeds the number of samples. In such cases, a relevant configuration of significant variables often needs to be selected. One such case arises in genetics, especially in genome-wide association studies (GWAS). Various statistical methods exist for selecting the relevant variables from high-dimensional data, many of them based on Bayesian statistics. This thesis reviews and compares two such methods, FINEMAP and Sum of Single Effects (SuSiE). The methods are evaluated on how accurately they identify the relevant configurations of variables and on their computational efficiency, especially when the dataset contains high inter-variable correlations. The methods were also compared with more conventional variable selection methods, such as LASSO. The results show that both FINEMAP and SuSiE outperform LASSO in selection accuracy and efficiency, with FINEMAP producing slightly more accurate results than SuSiE at the expense of computation time. These results can be used as guidelines for choosing an appropriate variable selection method based on the study and data.
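As a rough illustration of the conventional baseline mentioned in the abstract above, the following sketch runs LASSO by cyclic coordinate descent on simulated data with more variables than samples. It is not FINEMAP or SuSiE, and it is not the thesis's code: the data, the penalty value, and the function names `soft_threshold` and `lasso_coordinate_descent` are assumptions made for this example.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + penalty * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, and b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes j's own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Simulated (hypothetical) data: 50 samples, 80 variables, only variables 0 and 1 matter.
rng = random.Random(0)
n_samples, n_variables = 50, 80
X = [[rng.gauss(0.0, 1.0) for _ in range(n_variables)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]

beta = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected variables:", selected)
```

The penalty controls how many variables survive: a larger value zeroes out more coefficients but also shrinks the true effects more strongly. The simulated variables here are independent, so the sketch does not reproduce the highly correlated setting that the thesis focuses on.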
  • Lintuvuori, Meri (2010)
    The number of Finnish pupils attending special education has increased for more than a decade (Tilastokeskus 1999, 2000, 2001, 2003, 2004, 2005a, 2006b, 2007b, 2008b, 2008e, 2009b; Virtanen and Ratilainen 1996). In 2007, nearly a third of Finnish comprehensive school pupils took part in special needs education. According to the latest statistics, in the autumn of 2008 approximately 47 000 pupils had been admitted or transferred to special education, and approximately 126 000 pupils received part-time special education during the 2007-2008 academic year (Tilastokeskus 2008b, 2009b). The Finnish special education system is currently under review. The reform, both in legislation and in practice, began nationwide in 2008 (e.g. the Special education strategy document, November 2007, and the development project Kelpo). The aim of the study was twofold: to describe the Finnish special education system statistically, and to gain a deeper understanding of the system and its quantitative increase through analysis of nationwide statistical information. Earlier studies have shown that the growth in special education is affected by multiple independent variables and cannot be explained by pupil characteristics alone. The statistical overview and analysis were carried out in two parts. In the first part, the description and analysis were based on statistical time series from the academic year 1979-1980 until 2008. The second part presents a more detailed description and analysis based on comparable time series from 1995 to 2008 and from 2001-2002 to 2007-2008. A historical perspective formed one part of the study, which sought reasons for the observed growth in special needs education from the late 1960s to 2008. Most of the research was based on nationwide statistical information. In addition, materials including literature on educational legislation, various records of special education, and preceding studies were used to support the research. The main results of the study are two statistical descriptions and a time series analysis of the quantitative increase in special needs education. A summary of the plausible factors behind the change in the special education system and its quantitative increase is also presented. The conclusions of the study can be summarised as follows: the comparable statistical time series analysis suggests that the growth in special education after 1999 could be a consequence of changes in the structure of special education and of a new group of pupils having been directed to special needs education.
  • Hellsten, Kirsi (2023)
    Triglycerides are a type of lipid that enters the body with fatty food. High triglyceride levels are often caused by an unhealthy diet, a poor lifestyle, too little exercise, and poorly treated diseases such as diabetes. Other risk factors found in various studies are HIV, menopause, inherited lipid metabolism disorders, and South Asian ancestry. Complications of high triglycerides include pancreatitis, carotid artery disease, coronary artery disease, metabolic syndrome, peripheral artery disease, and strokes. Migration has made Singapore diverse, and its population contains several subpopulations. One third of the population has genetic ancestry in China. The second largest group has genetic ancestry in Malaysia, and the third largest has genetic ancestry in India. Even though Singapore has one of the highest life expectancies in the world, unhealthy lifestyles such as poor diet, lack of exercise, and smoking are still visible in everyday life. The purpose of this thesis was to introduce GWAS analysis for quantitative traits and apply it to real data, to examine whether there are associations between variants and triglycerides in the three main subpopulations in Singapore, and to compare the results with previous studies. The research questions were: what is GWAS analysis and what is it used for, how can GWAS be applied to data containing quantitative traits, and are there associations between SNPs and triglycerides in the three main populations in Singapore? GWAS stands for genome-wide association studies, which are designed to identify statistical associations between genetic variants and phenotypes or traits. One reason for developing GWAS was to learn to identify genetic factors that affect significant phenotypes, for instance susceptibility to certain diseases. Such information can eventually be used to predict the phenotypes of individuals. GWAS have been used globally in, for example, anthropology, biomedicine, biotechnology, and forensics. The studies enhance the understanding of human evolution and natural selection and help advance many areas of biology. The study used several quality control methods, linear models, and Bayesian inference to study associations. The results were examined with, among other things, various visual methods. The dataset used in this thesis was open data used by Saw, W., Tantoso, E., Begum, H. et al. in their previous study. The study showed associations between 6 different variants and triglycerides in the three main subpopulations in Singapore. The results were compared with those of two previous studies, which differed from the results of this study, suggesting that the results are significant. In addition, the thesis reviewed the ethics, limitations, and benefits of GWAS. Most studies of this kind have been done in Europe, so more research is needed in other parts of the world. This research can also be continued with different methods and variables.
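The linear-model step of a GWAS on a quantitative trait, as described in the abstract above, can be sketched as a simple regression of the trait on the allele count of one SNP, repeated for every SNP. This is a minimal sketch under assumptions: the function name `snp_association` and the genotype and trait values are invented for the example and are not the thesis's data or code. It also omits the quality control, covariates, and Bayesian inference that the abstract mentions.

```python
import math


def snp_association(genotypes, trait):
    """Regress a quantitative trait on allele counts (0, 1, 2) for a single SNP.

    Returns (slope, standard error of the slope, t statistic).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    slope = sxy / sxx  # estimated effect of one extra copy of the allele
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se, slope / se


# Hypothetical allele counts and trait values for six individuals.
genotypes = [0, 1, 2, 0, 1, 2]
trait = [1.0, 1.6, 2.0, 1.2, 1.4, 2.2]

slope, se, t_stat = snp_association(genotypes, trait)
print(f"slope={slope:.3f} se={se:.3f} t={t_stat:.2f}")
```

In a real analysis the t statistic would be converted to a p-value and compared against a genome-wide significance threshold, because the same test is repeated for a very large number of variants.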
  • Kämäräinen, Emma (2018)
    The subject of this thesis, modelling and predicting the service life of mobile phones, is part of the device model of the telecommunications operator DNA Oyj. The device model covers predicting the purchase time, price, and manufacturer of a customer's next phone. An estimate of the purchase time is essential information for companies that sell mobile devices, because it can be used to time device recommendations and to take timely actions towards the customer. For modelling the service life, data were retrieved from DNA Oyj's database and processed further to suit the modelling. New data accumulate continuously, so the data used in the modelling change as often as daily. The device model runs in DNA Oyj's production environment, and its results are in operational use. At the beginning of the thesis I introduce the random forest algorithm used in the modelling, a method based on a collection of decision trees. I first briefly describe the algorithm's history and theoretical background. To explain how the algorithm works, I also introduce other machine learning methods that are an essential part of it. The random forest method has many good properties, which I specify in the theory part. For example, the importance of each explanatory variable in the modelling can be computed while the method runs. After presenting the theory, I define a few metrics that I use in the modelling phase to analyse and validate the results. Next I describe the data used in the thesis. Two data retrievals were made: one for the model's training data and one for the data on which the final predictions are made. The datasets contain many variables, so I present them in two parts, first the device-related features and then the customer-related information. The response variable of the modelling, the phone's usage time, was derived from the devices' purchase dates and classified at a resolution of three months. In addition to the purchase date, many technical features of the phone are known, including its operating system and 4G capability. Of the customer information, the modelling used demographic data such as gender and age. Broadband availability, determined from the address provided by the customer, and variables related to mobile data usage were also used. After describing the data, I turn to the modelling itself. During the modelling I study the effect of different parameters on the prediction results. Class predictions for the service life of mobile devices were created using the optimal parameters. One property of the random forest algorithm is that, while the method runs, its results can be evaluated on data that the method did not use for building the model in that run. The evaluation used metrics suited to classification methods, according to which the algorithm successfully predicts a large part of the data. After the parameters were determined and the model trained, classes were assigned to the prediction data. The accuracy of the final predictions cannot be assessed until the customer buys a new phone. In some cases the replacement may take several years. I conclude the thesis by assessing how well the method works and by considering the variables behind device replacement. Although rich data were available, what is probably the most common reason for replacing a phone, a device fault, was not available at the time of the work. Adding data on the reasons for device replacement would further improve the modelling results. Finally, I also discuss the challenges of a model that runs in production and changes daily. One factor affecting the results is the fixed parameters, which may no longer yield the best predictions as the data change. DNA Oyj intends to develop the device model further.
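The abstract above notes that a random forest can be evaluated on data that a given run did not use for building the model. This rests on bootstrap sampling: each tree is trained on rows drawn with replacement, and the rows never drawn (the out-of-bag rows) serve as a validation set for that tree. The sketch below shows only this sampling step, not a full random forest. The function name `bootstrap_split`, the seed, and the dataset size are assumptions made for the example.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw a bootstrap sample of row indices and return it with the out-of-bag rows."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # sampled with replacement
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))  # rows never drawn
    return in_bag, out_of_bag


rng = random.Random(2018)  # arbitrary seed, for reproducibility only
n_rows = 10_000  # hypothetical number of rows in the training data

in_bag, out_of_bag = bootstrap_split(n_rows, rng)
oob_fraction = len(out_of_bag) / n_rows
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large datasets the probability that a row is left out of a bootstrap sample is (1 - 1/n)^n, which approaches 1/e, so roughly a third of the rows are available for evaluating each tree without a separate hold-out set.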
  • Forsman, Cecilia (University of Helsinki / Helsingin yliopisto / Helsingfors universitet, 2014)
    Clinical signs associated with equine gastric ulceration are commonly reported in the literature, but are vague and often unsubstantiated. Clinical signs of gastric ulceration in yearlings and mature horses are less well recognized than in foals, but may be more important economically. There are no studies in the literature that have investigated the statistical association between clinical signs and gastric ulceration. The aim of this study was to determine whether there is a statistical association between commonly reported clinical signs of gastric ulceration and gastric ulcer severity as determined by endoscopic examination of the stomach. The hypothesis of this study was that there is no association between the severity of gastric ulceration and the owner's perception of clinical signs of gastric ulceration. To achieve statistical significance, the study included 100 horses. A gastroscopic examination was performed on all the horses and documented on video. Owners were then asked to fill in a questionnaire documenting the clinical signs exhibited by their horses in the 3 months prior to the examination. The ulcers were graded into four categories: 1) presence or absence of gastric ulcers; 2) presence or absence of clinically significant gastric ulcers (i.e. needing treatment or not); 3) presence or absence of glandular ulcers; and 4) presence or absence of non-glandular ulcers. The four categories were compared to the clinical signs using a Pearson Chi-Square or Mann-Whitney U-test. Significance was set at p<0.05. A statistical association was found between clinically significant ulcers and losing weight (p=0.01) and between ulcer or no ulcer and losing weight (p=0.051). The results suggest that an owner's perception of their horse losing weight could be associated with the presence of gastric ulcers and an increased severity of gastric ulcers, and can be used as an indication to perform gastroscopy on these individuals. There was no association between gastric ulcer severity and the owner's perception of colic, crib-biting, flank-biting, fussy eating, changes in behaviour, chronic diarrhoea, bruxism, poor body condition, poor coat condition, and poor performance, and requests from owners to have gastroscopy performed on their horses based upon these clinical signs should be approached with caution.
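The Pearson chi-square comparison named in the abstract above can be sketched for a single 2x2 table, for example ulcer presence against owner-reported weight loss. The counts below are invented for illustration and are not the study's data. The function name `chi_square_2x2` is an assumption, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p-value) using 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts for 100 horses:
#                   weight loss   no weight loss
# ulcers                 20             30
# no ulcers               8             42
statistic, p_value = chi_square_2x2(20, 30, 8, 42)
print(f"chi-square={statistic:.3f} p={p_value:.4f}")
print("significant at p<0.05" if p_value < 0.05 else "not significant at p<0.05")
```

The closed-form p-value holds only for a 2x2 table, where the test has one degree of freedom. When expected cell counts are small, an exact test is usually preferred over the chi-square approximation.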