Browsing by study line "Tilastotiede"
Now showing items 1-20 of 41
-
(2022) Social benefits have undergone many kinds of changes over the years, and the laws concerning them are continuously being developed. Even social assistance (toimeentulotuki), the very last-resort form of financial support provided by the state, has been the target of significant measures, which has affected the lives of many Finns. Of these measures, the transfer of basic social assistance to the responsibility of the Social Insurance Institution of Finland (Kansaneläkelaitos, Kela) in particular has demanded a great deal of adaptability from the parties processing and applying for the benefit. This may have provoked strong opinions, and discussion forums are a natural platform for expressing them. Finland's largest discussion forum, Suomi24, contains many discussion threads on society and politics, and mapping their content on topics of interest can, with the right methods, yield interesting and useful information. This thesis uses natural language processing methods, more specifically topic modelling, to investigate whether the amendment to the Social Assistance Act that came into force in 2017 is in some way visible in the discussions about social assistance on the Suomi24 forum. The study is carried out by illustrating the selected data with various visualizations and by applying the LDA algorithm, which are used to detect the central topics of the discussions and the concepts related to them. If the amendment to the Social Assistance Act has provoked discussion, this could show up in the topics and in how the use of the words they contain is distributed between the periods before and after the amendment. The delimitation of the data, its extraction from the database, and the preprocessing of the data for topic modelling also make up a significant part of the study. The data are analysed twice in total, because the first round reveals shortcomings in the preprocessing phase and in fitting the model. Iteration is not unusual in studies of this kind, since issues that should have been taken into account in earlier phases may only come up when interpreting the results. In the second round, some interesting observations emerged from the contents of the topics, but based on them it is difficult to conclude whether the amendment to the Social Assistance Act is visible in the messages of the discussion platform.
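As an illustration of the topic-modelling step described above, the following is a minimal sketch of fitting an LDA model with the gensim library. The token lists, topic count and other settings are hypothetical placeholders, not the thesis's actual Suomi24 pipeline.

```python
# Minimal sketch of LDA topic modelling on preprocessed forum messages (gensim).
from gensim import corpora, models

docs = [
    ["toimeentulotuki", "kela", "hakemus", "käsittely"],
    ["perustoimeentulotuki", "siirto", "kela", "laki"],
]  # in practice: lemmatized Suomi24 messages with stop words removed

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```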
-
(2023) Sampling methods are a fundamental part of science and research whenever any measurable phenomenon is to be observed and measured. The goal of well-conducted sampling is to obtain a sample that represents the target population as well as possible. If the sampling has not been carried out carefully and in accordance with certain basic principles, the research results are not reliable and statistical inference cannot be made. This thesis examines sampling with probability and non-probability methods. Probability sampling is based on every unit having a probability greater than zero of being selected into the sample. Non-probability sampling refers to sampling methods that do not follow the basic principles of probability sampling. Such a situation can arise, for example, in studies where the probability of a unit being selected into the sample is not known. A sample is also a non-probability sample if the selection probability of one or more units of the population is zero. Non-probability samples often exhibit bias. When the sample has not been selected randomly and according to proper methods, the bias present in the sample can be corrected with various methods, which are presented in this thesis. The data used in the thesis is the Kela Occupational Health Care study (OHC) dataset, which was collected in 1985 to evaluate the implementation and execution of the Occupational Health Care Act. In the analyses, the OHC data was used as the population, from which samples of different sizes were drawn with different sampling methods and compared. Finally, calibration with weights, i.e. bias correction, was also tried. The results show that, as expected, the traditional methods give reliable results, but when used correctly, non-probability sampling can also be used to collect important information.
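A minimal sketch of the core contrast discussed above: a simple random sample versus a non-probability sample whose inclusion depends on the study variable. The data and the selection mechanism are simulated for illustration and do not come from the OHC dataset.

```python
# Compare an SRS estimate with a non-probability sample whose inclusion propensity
# depends on the study variable y, so that selection bias becomes visible.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
y = rng.normal(50, 10, N)                         # study variable in the population

# Probability sample: simple random sampling without replacement.
srs = rng.choice(y, size=1_000, replace=False)

# Non-probability sample: units with larger y are more likely to be "reachable".
p = 1 / (1 + np.exp(-(y - 55) / 5))               # unknown, y-dependent propensity
nonprob = y[rng.random(N) < p * (1_000 / p.sum())]

print("population mean:", y.mean())
print("SRS estimate:   ", srs.mean())             # unbiased in expectation
print("non-probability:", nonprob.mean())         # biased upwards
```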
-
(2022) Colorectal cancer (CRC) accounts for one in 10 new cancer cases worldwide. CRC risk is determined by a complex interplay of constitutional, behavioral, and environmental factors. Patients with ulcerative colitis (UC) are at increased risk of CRC, but effect estimates are heterogeneous, and many studies are limited by small numbers of events. Furthermore, it has been challenging to distinguish the effects of age at UC diagnosis and duration of UC. Multistate models provide a useful statistical framework for analyses of cancers and premalignant conditions. This thesis has three aims: to review the mathematical and statistical background of multistate models; to study maximum likelihood estimation in the illness-death model with piecewise constant hazards; and to apply the illness-death model to UC and CRC in a population-based cohort study in Finland in 2000–2017, considering UC as a premalignant state that may precede CRC. A likelihood function is derived for multistate models under noninformative censoring. The multistate process is considered as a multivariate counting process, and product integration is reviewed. The likelihood is constructed by partitioning the study time into subintervals and finding the limit as the number of subintervals tends to infinity. Two special cases of the illness-death model with piecewise constant hazards are studied: a simple Markov model and a non-Markov model with multiple time scales. In the latter case, the likelihood is factorized into terms proportional to Poisson likelihoods, which permits estimation with standard software for generalized linear models. The illness-death model was applied to study the relationship between UC and CRC in a population-based sample of 2.5 million individuals in Finland in 2000–2017. Dates of UC and CRC diagnoses were obtained from the Finnish Care Register for Health Care and the Finnish Cancer Registry, respectively. Individuals with prevalent CRC were excluded from the study cohort. Individuals in the study cohort were followed from January 1, 2000, to the date of first CRC diagnosis, death from another cause, emigration, or December 31, 2017, whichever came first. A total of 23,533 incident CRCs were diagnosed during 41 million person-years of follow-up. In addition to 8,630 patients with prevalent UC, there were 19,435 cases of incident UC. Of the 23,533 incident CRCs, 298 (1.3%) were diagnosed in patients with pre-existing UC. In the first year after UC diagnosis, the hazard ratio (HR) for incident CRC was 4.67 (95% CI: 3.07, 7.09) in females and 7.62 (95% CI: 5.65, 10.3) in males. In patients with UC diagnosed 1–3 or 4–9 years earlier, CRC incidence did not differ from persons without UC. When 10–19 years had passed from UC diagnosis, the HR for incident CRC was 1.63 (95% CI: 1.19, 2.24) in females and 1.29 (95% CI: 0.96, 1.75) in males, and after 20 years, the HR was 1.61 (95% CI: 1.13, 2.31) in females and 1.74 (95% CI: 1.31, 2.31) in males. Early-onset UC (age <40 years) was associated with a markedly increased long-term risk of CRC. The HR for CRC in early-onset UC was 4.13 (95% CI: 2.28, 7.47) 4–9 years from UC diagnosis, 4.88 (95% CI: 3.46, 6.88) between 10–19 years, and 2.63 (95% CI: 2.01, 3.43) after 20 years. In this large population-based cohort study, we estimated CRC risk in persons with and without UC in Finland in 2000–2017, considering both the duration of UC and age at UC diagnosis. Patients with early-onset UC are at increased risk of CRC, but the risk is likely to depend on disease duration, extent of disease, attained age, and other risk factors. Increased CRC risk in the first year after UC diagnosis may be in part due to detection bias, whereas chronic inflammation may underlie the long-term excess risk of CRC in patients with UC.
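The factorization into Poisson likelihood terms mentioned above means that piecewise constant hazards can be fitted with standard GLM software, for instance as a Poisson regression of event counts on interval indicators with the logarithm of person-time as an offset. The sketch below shows this on a small, entirely hypothetical aggregated table; it is not the registry analysis of the thesis.

```python
# Piecewise constant hazards via a Poisson GLM with a log person-time offset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

agg = pd.DataFrame({
    "events":       [12, 30, 25, 18],
    "person_years": [1.0e5, 9.0e4, 4.0e4, 1.5e4],
    "uc_duration":  ["none", "<1y", "1-9y", "10+y"],  # time since UC diagnosis (hypothetical)
})

model = smf.glm(
    "events ~ C(uc_duration, Treatment(reference='none'))",
    data=agg,
    family=sm.families.Poisson(),
    offset=np.log(agg["person_years"]),
).fit()

# Exponentiated coefficients: baseline rate (intercept) and rate ratios vs. 'none'.
print(np.exp(model.params))
```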
-
(2024) Buildings consume approximately 40% of global energy; hence, understanding and analyzing the energy consumption patterns of buildings is essential for providing useful insights to building management stakeholders for better decision-making and energy efficiency. Based on a specific use case of a Finnish building management company, this thesis presents the challenge of optimizing energy consumption forecasting and building management by addressing the shortcomings of current individual building-level forecasting approaches and the dynamic nature of building energy use. The research investigates the plausibility of a system of building clusters by studying the representative cluster profiles and dynamic cluster changes. We focus on a dataset comprising hourly energy consumption time series from a variety of Finnish university buildings, employing these as subjects to implement a novel stream clustering approach called ClipStream. ClipStream is an attribute-based stream clustering algorithm that performs continuous online clustering of batches of time series data and involves iterative data abstraction, clustering, and change detection phases. This thesis shows that it was plausible to build clusters of buildings based on energy consumption time series: 23 buildings were successfully clustered into 3–5 clusters during each two-week window of the period of investigation. The study's findings revealed distinct and evolving energy consumption clusters of buildings and characterized 7 predominant cluster profiles, which reflected significant seasonal variations and operational changes over time. Qualitative analyses of the clusters primarily confirmed the noticeable shifts in energy consumption patterns from 2019 to 2022, underscoring the potential of our approach to enhance forecasting efficiency and management effectiveness. These findings could be further extended to inform energy policy, building management practices, and broader sustainability efforts. This suggests that improved energy efficiency can be achieved through the application of machine learning techniques such as cluster analysis.
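To make the batch-wise clustering idea concrete, here is a deliberately simplified stand-in: ordinary k-means on per-window feature abstractions of each building's load. This is not the ClipStream algorithm itself (which adds online updating and change detection), and the data are simulated.

```python
# Simplified stand-in for window-wise clustering of building load profiles.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_buildings, hours_per_window = 23, 24 * 14          # two-week windows of hourly data
window = rng.gamma(shape=2.0, scale=10.0, size=(n_buildings, hours_per_window))

# Abstract each building's window into a small feature vector (mean level,
# variability, average day/night ratio) before clustering.
daily = window.reshape(n_buildings, -1, 24)
day = daily[:, :, 8:18].mean(axis=(1, 2))
night = daily[:, :, 0:6].mean(axis=(1, 2))
features = np.column_stack([window.mean(axis=1), window.std(axis=1), day / night])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(labels)   # cluster membership of the 23 buildings in this window
```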
-
(2022) The purpose of this thesis is to examine the suitability of robust estimators, in particular the BMM estimator, for estimating the parameters of an ARMA(p, q) process. Robust estimators are estimators that aim to control the influence of deviating observations, i.e. outliers, on the estimates. A robust estimator thus tolerates outliers in the sense that the presence of outliers in the observations does not have a significant effect on the estimates. The protection gained against outliers, however, usually shows up as a loss of efficiency relative to the maximum likelihood method. The BMM estimator is an extension of the MM estimator introduced by Muler, Peña and Yohai in their article Robust estimation for ARMA models (2009). The BMM estimator is based on the BIP-ARMA model, an auxiliary model for the ARMA model in which the influence of the innovation term is restricted with a filter. The idea is to control in this way the influence of outliers occurring in the innovations of the ARMA model. In the thesis, the BMM and MM estimators are compared with the classical methods of maximum likelihood (ML) and least squares (LS). The thesis begins by presenting the necessary concepts of probability theory, time series analysis and robust methods. The reader is introduced to robust estimators and the motivation behind robust methods. Time series containing outliers are treated as realizations of an asymptotically contaminated ARMA process, and definitions are given for the most central outlier processes known in the literature. In addition, the computation of the BMM, MM, ML and LS estimators is described. In connection with the estimators, the initial value methods used to choose the starting values for the estimators' minimization algorithms are also discussed. The theoretical part of the thesis presents theorems and proofs on the consistency and asymptotic normality of the MM estimator. No proof of the corresponding properties of the BMM estimator is known in the literature; the same properties are instead conjectured to hold for the BMM estimator as well. The results section presents simulations that replicate the simulations of the article by Muler et al. for more complex ARMA models. In the simulations, the BMM and MM estimators are compared with the ML and LS estimators in terms of mean squared error, while also comparing the different initial value methods. In addition, the asymptotic robustness properties of the estimators are discussed. The computation of the estimators has been implemented in R, with the computation of the BMM and MM estimators implemented mainly in C++. The appendix contains the source code needed for computing the BMM and MM estimators.
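As a toy illustration of the robustness idea (downweighting large residuals), the sketch below computes a Huber-type M-estimate of an AR(1) coefficient and compares it with least squares on a series containing one additive outlier. It is not the BMM estimator or the BIP-ARMA filtering studied in the thesis.

```python
# Huber-type M-estimation of an AR(1) coefficient vs. plain least squares.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, phi_true = 500, 0.6
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal()
x[100] += 15.0                      # inject an additive outlier

def huber_loss(r, k=1.345):
    a = np.abs(r)
    return np.where(a <= k, 0.5 * r**2, k * a - 0.5 * k**2)

def objective(phi):
    resid = x[1:] - phi * x[:-1]
    return huber_loss(resid).sum()

phi_robust = minimize_scalar(objective, bounds=(-0.99, 0.99), method="bounded").x
phi_ls = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)   # least squares for comparison
print(phi_robust, phi_ls)           # the robust estimate is less affected by the outlier
```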
-
(2022) Can a day be classified into the correct season on the basis of its hourly weather observations using a neural network model, and how accurately can this be done? This is the question this thesis aims to answer. The weather observation data was retrieved from the Finnish Meteorological Institute's website, and it includes the hourly weather observations from the Kumpula observation station from the years 2010–2020. The weather observations used for the classification were cloud amount, air pressure, precipitation amount, relative humidity, snow depth, air temperature, dew-point temperature, horizontal visibility, wind direction, gust speed and wind speed. There are four distinct seasons that can be experienced in Finland. In this thesis the seasons were defined as three-month periods, with winter consisting of December, January and February, spring consisting of March, April and May, summer consisting of June, July and August, and autumn consisting of September, October and November. The days in the weather data were classified into these seasons with a convolutional neural network model. The model included a convolutional layer followed by a fully connected layer, with the width of both layers being 16 nodes. The accuracy of the classification with this model was 0.80. The model performed better than a multinomial logistic regression model, which had an accuracy of 0.75. It can be concluded that the classification task was satisfactorily successful. An interesting finding was that neither model ever confused summer and winter with each other.
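A minimal sketch of a classifier with the shape described above: one convolutional layer and one fully connected layer, both 16 units wide, with a four-class softmax output. The input shape (24 hours × 11 weather variables) follows the abstract; the kernel size, optimizer and loss are assumptions.

```python
# Small 1D-convolutional season classifier (sketch).
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(24, 11)),                 # 24 hourly observations, 11 variables
    layers.Conv1D(16, kernel_size=3, activation="relu"),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(4, activation="softmax"),        # winter / spring / summer / autumn
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=20, validation_data=(X_val, y_val))
```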
-
(2022) In the thesis we assess the ability of two different models to predict cash flows in private credit investment funds. One model is stochastic and the other deterministic, which makes them quite different. The data obtained for the analysis is divided into three subsamples: mature funds, liquidated funds, and all funds. The data consists of 62 funds in total, with the subsample of mature funds containing 36 funds and the subsample of liquidated funds 17 funds. Both models are fitted to all subsamples. The parameters of the models are estimated with different techniques: the parameters of the Stochastic model are estimated with the conditional least squares method, and the parameters of the Yale model with numerical methods. After the estimation of the parameters, their values are explained in detail and their effect on the cash flows is investigated. This helps to understand which properties of the cash flows the models are able to capture. In addition, we assess both models' ability to predict future cash flows. This is done by using the coefficient of determination, QQ-plots, and a comparison of predicted and observed cumulated cash flows. With the coefficient of determination we describe how well the models explain the variation between the observed and predicted values. With QQ-plots we examine whether the values produced by the process follow the normal distribution. Finally, with the cumulated cash flows of contributions and distributions we examine whether the models are able to predict the cumulated committed capital and the returns of the fund in the form of distributions. The results show that the Stochastic model performs better in its prediction of contributions and distributions. However, this is not the case for all the subsamples: the Yale model does better for the cumulated contributions of the subsample of mature funds. Still, the flexibility of the Stochastic model makes it more suitable for different types of cash flows and subsamples. Therefore, it is suggested that the Stochastic model should be the model used in the prediction and modelling of private credit funds. It is harder to implement than the Yale model, but it provides more accurate results in its prediction.
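For orientation, the deterministic ("Yale") model referred to above is commonly attributed to Takahashi and Alexander; a hedged sketch of one common textbook formulation is given below. The parameter values and the exact order of operations within a period are assumptions and may differ from the specification used in the thesis.

```python
# Hedged sketch of a Takahashi-Alexander-style ("Yale") cash flow model.
# CC = committed capital, RC = contribution rate, G = growth rate,
# B controls how quickly the distribution rate rises over the fund's life.
def yale_model(CC=100.0, RC=0.25, G=0.10, B=2.5, life=12):
    nav, paid_in = 0.0, 0.0
    rows = []
    for t in range(1, life + 1):
        contribution = RC * (CC - paid_in)          # draw a share of unfunded commitments
        paid_in += contribution
        rd = (t / life) ** B                        # distribution rate schedule
        nav = nav * (1 + G) + contribution
        distribution = rd * nav
        nav -= distribution
        rows.append((t, contribution, distribution, nav))
    return rows

for t, c, d, nav in yale_model():
    print(f"year {t:2d}: contribution {c:6.2f}  distribution {d:6.2f}  NAV {nav:6.2f}")
```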
-
(2022) In statistics, data can often be high-dimensional with a very large number of variables, often larger than the number of samples themselves. In such cases, selection of a relevant configuration of significant variables is often needed. One such case is in genetics, especially genome-wide association studies (GWAS). To select the relevant variables from high-dimensional data, there exist various statistical methods, with many of them relating to Bayesian statistics. This thesis aims to review and compare two such methods, FINEMAP and the Sum of Single Effects (SuSiE). The methods are reviewed according to their accuracy in identifying the relevant configurations of variables and their computational efficiency, especially in the case where there are high inter-variable correlations within the dataset. The methods were also compared to more conventional variable selection methods, such as LASSO. The results show that both FINEMAP and SuSiE outperform LASSO in terms of selection accuracy and efficiency, with FINEMAP producing slightly more accurate results at the expense of computation time compared to SuSiE. These results can be used as guidelines in selecting an appropriate variable selection method based on the study and data.
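The LASSO baseline mentioned above is easy to sketch; FINEMAP and SuSiE themselves are separate tools (SuSiE is distributed as the R package susieR), so only the LASSO comparison is illustrated here, on simulated data with correlated columns.

```python
# LASSO variable selection on correlated, high-dimensional simulated data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 1000
base = rng.normal(size=(n, p))
X = base + 0.8 * rng.normal(size=(n, 1))          # shared noise induces correlation
beta = np.zeros(p)
beta[[10, 250, 700]] = [1.0, -0.8, 0.6]           # three truly relevant variables
y = X @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("selected variables:", selected)
```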
-
(2024) This thesis is an empirical comparison of various methods of statistical matching applied to Finnish income and consumption data. The comparison is performed in order to map out some possible matching strategies for Statistics Finland to use in this imputation task and to compare the applicability of the strategies within specific datasets. For Statistics Finland, the main point of performing these imputations is in assessing consumption behaviour in years when consumption-related data is not explicitly collected. Within this thesis I compared the imputation of consumption data by imputing 12 consumption variables as well as their sum using the following matching methods: draws from the conditional distribution, distance hot deck, predictive mean matching, local residual draws, and a gradient boosting approach. The donor dataset is a sample of households collected for the 2016 Finnish Household Budget Survey (HBS). The recipient dataset is a sample of households collected for the 2019 Finnish Survey of Income and Living Conditions (EU-SILC). In order to assess the quality of the imputations, I used numerical and visual assessments concerning the similarity of the weighted distributions of the consumption variables. The applied numerical assessments were the Kolmogorov-Smirnov (KS) test statistic as well as the Hellinger distance (HD), the latter of which was calculated for a categorical transformation of the consumption variables. Additionally, the similarities of the correlation matrices were assessed using the correlation matrix distance. Generally, distance hot deck and predictive mean matching fared relatively well in the imputation tasks. For example, in the imputation of transport-related expenditure, both produced KS test statistics of approximately 0.01–0.02 and an HD of approximately 0.05, whereas the next best-performing method received scores of 0.04 and 0.09, thus representing slightly larger discrepancies. Comparing the two methods, particularly in the imputation of semicontinuous consumption variables, distance hot deck fared notably better than the predictive mean matching approach. As an example, in the consumption expenditure on alcoholic beverages and tobacco, distance hot deck produced values of the KS test statistic and HD of approximately 0.01 and 0.02 respectively, whereas the corresponding scores for predictive mean matching were 0.21 and 0.16. Ultimately, I would recommend for further application a consideration of both predictive mean matching and distance hot deck depending on the imputation task. This is because predictive mean matching can be applied more easily in different contexts, but in certain kinds of imputation tasks distance hot deck clearly outperforms predictive mean matching. Further assessment of this data should be done; in particular, the results should be validated with additional data.
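A minimal sketch of the predictive mean matching step described above: fit a prediction model on the donor file, predict for both files, and copy the observed value of the donor with the nearest predicted mean. The data and variable names are simulated placeholders, not the HBS or EU-SILC files.

```python
# Predictive mean matching between a donor and a recipient survey (simulated).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
donor = pd.DataFrame({"income": rng.normal(40_000, 10_000, 1_000)})
donor["consumption"] = 0.6 * donor["income"] + rng.normal(0, 3_000, 1_000)
recipient = pd.DataFrame({"income": rng.normal(41_000, 10_000, 800)})

# 1) Fit the prediction model on the donor data using the common covariates.
model = LinearRegression().fit(donor[["income"]], donor["consumption"])

# 2) Predict the mean for both files and, for each recipient, copy the *observed*
#    consumption of the donor with the nearest predicted mean.
pred_donor = model.predict(donor[["income"]])
pred_recipient = model.predict(recipient[["income"]])
nearest = np.abs(pred_recipient[:, None] - pred_donor[None, :]).argmin(axis=1)
recipient["consumption_imputed"] = donor["consumption"].to_numpy()[nearest]
print(recipient.head())
```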
-
(2021) Children's height and weight development remains a subject of interest especially due to the increasing prevalence of overweight and obesity in children. With statistical modeling, height and weight development can be examined as separate or connected outcomes, aiding the understanding of the phenomenon of growth. As a biological connection between height and weight development can be assumed, their joint modeling is expected to be beneficial. A further advantage of joint modeling is the convenience it brings to Body Mass Index (BMI) prediction. In the thesis, we modeled longitudinal data on children's heights and weights from the Finlapset register of the Finnish Institute for Health and Welfare (THL). The research aims were to predict the modeled quantities together with the BMI, to interpret the obtained parameters in relation to the phenomenon of growth, as well as to investigate the impact of municipalities on the growth of children. The dataset's irregular, register-based nature together with positively skewed, heteroscedastic weight distributions and within- and between-subject variability suggested Hierarchical Linear Models (HLMs) as the modeling method of choice. We used HLMs in a Bayesian setting with the benefits of incorporating existing knowledge and obtaining the full posterior predictive distribution for the outcome variables. HLMs were compared with the less suitable classical linear regression model, and bivariate and univariate HLMs with or without area as a covariate were compared in terms of their posterior predictive precision and accuracy. One of the main research questions was the model's ability to predict the BMI of the child, which we assessed with various posterior predictive checks (PPC). The most suitable model was used to estimate the growth parameters of 2–6-year-old males and females in Vihti, Kirkkonummi and Tuusula. With the parameter estimates, we could compare the growth of males and females, assess the effects of within-subject and between-subject variability on growth, and examine the correlation between height and weight development. Based on the work, we could conclude that the bivariate HLM constructed provided the most accurate and precise predictions, especially for the BMI. The area covariates did not provide additional advantage to the models. Overall, Bayesian HLMs are a suitable tool for the register-based dataset of the work, and together with a log-transformation of height and weight they can be used to model skewed and heteroscedastic longitudinal data. However, the modeling would ideally require more observations per individual than we had, and proper out-of-sample predictive evaluation would ensure that the current models are not over-fitted with regard to the data. Nevertheless, the built models can already provide insight into contemporary Finnish childhood growth and can be used to simulate and create predictions for future population BMI distributions.
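As a simplified illustration of the hierarchical modelling approach, the following PyMC sketch fits a univariate random-intercept model on log-weight only; it is not the bivariate height-weight model of the thesis, and the data are simulated.

```python
# Bayesian random-intercept model for children's log-weights (simulated data).
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_children, n_obs = 50, 4
child = np.repeat(np.arange(n_children), n_obs)
age = rng.uniform(2, 6, size=n_children * n_obs)
log_w = (2.0 + 0.15 * age
         + rng.normal(0, 0.1, n_children)[child]       # child-level deviation
         + rng.normal(0, 0.05, child.size))            # measurement noise

with pm.Model():
    mu_a = pm.Normal("mu_a", 2.0, 1.0)
    sigma_a = pm.HalfNormal("sigma_a", 0.5)
    a = pm.Normal("a", mu_a, sigma_a, shape=n_children)   # child-specific intercepts
    b = pm.Normal("b", 0.0, 0.5)                           # common age slope
    sigma = pm.HalfNormal("sigma", 0.5)
    pm.Normal("obs", a[child] + b * age, sigma, observed=log_w)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```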
-
(2022) The Finnish Customs collects and maintains the statistics of Finnish intra-EU trade with the Intrastat system. Companies with significant intra-EU trade are obligated to give monthly Intrastat declarations, and the statistics of Finnish intra-EU trade are compiled based on the information collected with the declarations. If a company does not give the declaration in time, an estimation method is needed for the missing values. In this thesis we propose an automatic multivariate time series forecasting process for the estimation of the missing Intrastat import and export values. The forecasting is done separately for each company with missing values. For forecasting we use two-dimensional time series models, where one component is the import or export value of the company to be forecasted, and the other component is the import or export value of the company's industrial group. To complement the time series forecasting we use forecast combining. Combined forecasts, for example the averages of the obtained forecasts, have been found to perform well in terms of forecast accuracy compared to the forecasts created by individual methods. In the forecasting process we use two multivariate time series models, the Vector Autoregressive (VAR) model and a specific VAR model called the Vector Error Correction (VEC) model. The choice of the model is based on the stationarity properties of the time series to be modelled. An alternative option for the VEC model is the so-called augmented VAR model, which is an over-fitted VAR model. We use the VEC model and the augmented VAR model together by using the average of the forecasts created with them as the forecast for the missing value. When the usual VAR model is used, only the forecast created by the single model is used. The forecasting process is designed to be as automatic and as fast as possible; therefore the estimation of a time series model for a single company is made as simple as possible. Thus, only statistical tests which can be applied automatically are used in the model building. We compare the forecast accuracy of the forecasts created with the automatic forecasting process to the forecast accuracy of forecasts created with two simple forecasting methods. For the time series deemed non-stationary, the naïve forecast performs well in terms of forecast accuracy compared to the time series model based forecasts. On the other hand, for the time series deemed stationary, the average over the past 12 months performs well as a forecast compared to the time series model based forecasts. We also consider forecast combinations, where the combinations are created by calculating the average of the time series model based forecasts and the simple forecasts. In line with the literature, the forecast combinations perform overall better in terms of forecast accuracy than the forecasts based on the individual models.
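A minimal sketch of the two-dimensional forecasting step with forecast combination, using statsmodels: for a series pair deemed non-stationary, a VECM forecast is averaged with a forecast from an over-fitted ("augmented") VAR. The series are simulated stand-ins for a company's trade value and its industrial group's value, and the lag orders and cointegration rank are assumptions.

```python
# Combine a VECM forecast with an over-fitted VAR forecast for a bivariate series.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from statsmodels.tsa.vector_ar.vecm import VECM

rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(0.5, 1.0, 120))
data = pd.DataFrame({
    "company": trend + rng.normal(0, 2, 120),
    "industry_group": 3 * trend + rng.normal(0, 5, 120),
})

vecm_fc = VECM(data, k_ar_diff=2, coint_rank=1).fit().predict(steps=1)

var_res = VAR(data).fit(3)                               # "augmented" (over-fitted) VAR
var_fc = var_res.forecast(data.values[-var_res.k_ar:], steps=1)

combined = (vecm_fc + var_fc) / 2
print("combined one-step forecast:", combined)
```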
-
(2022) Often in spatial statistics the modelled domain contains physical barriers that can have an impact on how the modelled phenomenon behaves. The barrier can be, for example, land when modelling a fish population, or a road for different animal populations. A common model used in spatial statistics is the stationary Gaussian model, because of its modest computational requirements and the relatively easy interpretation of its results. A physical barrier has no effect on this type of model unless the barrier is transformed into a variable, but this can cause issues in the polygon selection. In this thesis I discuss how a non-stationary Gaussian model can be deployed when the spatial domain contains physical barriers. This non-stationary model reduces spatial correlation continuously towards zero in areas that are considered a physical barrier. When the correlation is chosen to decrease smoothly to zero, the model is more likely to produce similar output with slightly different polygons. The advantage of the barrier model is that it is as fast to train as the stationary model, because both models can be trained using the finite element method (FEM). With FEM we can solve stochastic partial differential equations (SPDEs). This method interprets the continuous random field as a discrete mesh, and the computational requirements increase as the number of nodes in the mesh increases. In order to create the stationary and non-stationary models, I describe the required methods, such as Bayesian statistics, stochastic processes, and covariance functions, in the second chapter. I use these methods to define a spatial random effect model; one commonly used spatial model is the Gaussian latent variable model. At the end of the second chapter, I describe how the barrier model is created and what kinds of requirements this model has. The barrier model is based on a Matérn model, which is a Gaussian random field that can be represented using the Matérn covariance function. The second chapter ends with a description of how to create the mesh mentioned above and how FEM is used to solve the SPDE. The performance of the stationary and non-stationary Gaussian models is first tested by training both models with simulated data. The simulated data is a random sample from a polygon of Helsinki where the coastline is interpreted as a physical barrier. The results show that the barrier model estimates the true parameters better than the stationary model. The last chapter contains a data analysis of the rat populations in Helsinki. The data contains the number of rat observations in each zip code and a set of covariates. Both models, stationary and non-stationary, are trained with and without covariates, and the best model out of these four was the stationary model with covariates.
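The Matérn covariance that both models build on is easy to write down; the sketch below evaluates it with SciPy. The barrier model itself additionally requires the SPDE/FEM machinery (typically R-INLA), which is not reproduced here, and the parameter values are arbitrary.

```python
# Evaluate the Matérn covariance C(d) = sigma^2 * 2^(1-nu)/Gamma(nu) * (kappa d)^nu * K_nu(kappa d).
import numpy as np
from scipy.special import gamma, kv

def matern_cov(d, sigma2=1.0, kappa=2.0, nu=1.0):
    """Matérn covariance as a function of distance d (d = 0 handled separately)."""
    d = np.asarray(d, dtype=float)
    cov = np.full_like(d, sigma2)                 # C(0) = sigma^2
    nz = d > 0
    scaled = kappa * d[nz]
    cov[nz] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    return cov

distances = np.linspace(0.0, 3.0, 7)
print(matern_cov(distances))   # correlation decays smoothly towards zero with distance
```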
-
(2023) Triglycerides are a type of lipid that enters our body with fatty food. High triglyceride levels are often caused by an unhealthy diet, poor lifestyle, poorly treated diseases such as diabetes, and too little exercise. Other risk factors found in various studies are HIV, menopause, inherited lipid metabolism disorders and South Asian ancestry. Complications of high triglycerides include pancreatitis, carotid artery disease, coronary artery disease, metabolic syndrome, peripheral artery disease, and strokes. Migration has made Singapore diverse, and it contains several subpopulations. One third of the population has genetic ancestry in China. The second largest group has genetic ancestry in Malaysia, and the third largest has genetic ancestry in India. Even though Singapore has one of the highest life expectancies in the world, unhealthy lifestyles such as poor diet, lack of exercise and smoking are still visible in everyday life. The purpose of this thesis was to introduce GWAS analysis for quantitative traits and apply it to real data, to see if there are associations between some variants and triglycerides in the three main subpopulations in Singapore, and to compare the results to previous studies. The research questions that this thesis answered are: what is GWAS analysis and what is it used for, how can GWAS be applied to data containing quantitative traits, and are there associations between some SNPs and triglycerides in the three main populations in Singapore. GWAS stands for genome-wide association studies, which are designed to identify statistical associations between genetic variants and phenotypes or traits. One reason for developing GWAS was to learn to identify different genetic factors which have an impact on significant phenotypes, for instance susceptibility to certain diseases. Such information can eventually be used to predict the phenotypes of individuals. GWAS have been used globally in, for example, anthropology, biomedicine, biotechnology, and forensics. The studies enhance the understanding of human evolution and natural selection and help advance many areas of biology. The study used several quality control methods, linear models, and Bayesian inference to study the associations. The results were examined, among other things, with the help of various visual methods. The dataset used in this thesis was an open dataset used by Saw, W., Tantoso, E., Begum, H. et al. in their previous study. This study showed that there are associations between 6 different variants and triglycerides in the three main subpopulations in Singapore. The study results were compared with the results of two previous studies, which differed from the results of this study, suggesting that the results are significant. In addition, the thesis reviewed the ethics of GWAS and the limitations and benefits of GWAS. Most studies like this have been done in Europe, so more research is needed in different parts of the world. This research can also be continued with different methods and variables.
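A minimal sketch of the per-variant association test at the core of a quantitative-trait GWAS: regress the trait on genotype dosage one SNP at a time, adjusting for covariates. The data, covariates and effect sizes are simulated placeholders, not the Singapore dataset.

```python
# Per-SNP linear regression association test for a quantitative trait.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_snps = 1_000, 50
genotypes = rng.binomial(2, 0.3, size=(n, n_snps))        # 0/1/2 allele dosages
age = rng.uniform(20, 70, n)
sex = rng.integers(0, 2, n)
trait = 0.02 * age + 0.3 * genotypes[:, 7] + rng.normal(size=n)   # e.g. log-triglycerides

pvalues = []
for j in range(n_snps):
    X = sm.add_constant(np.column_stack([genotypes[:, j], age, sex]))
    fit = sm.OLS(trait, X).fit()
    pvalues.append(fit.pvalues[1])                          # p-value of the SNP term
print("smallest p-value at SNP", int(np.argmin(pvalues)))
```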
-
(2023) Hawkes processes are a special class of inhomogeneous Poisson processes used to model events exhibiting interdependencies. Initially introduced in Hawkes [1971], Hawkes processes have since found applications in various fields such as seismology, finance, and criminology. The defining feature of Hawkes processes lies in their ability to capture self-exciting behaviour, where the occurrence of an event increases the risk of experiencing subsequent events. This effect is quantified in their conditional intensity function, which takes into account the history of the process through the kernel. This thesis focuses on the modeling of event histories using Hawkes processes. We define both the univariate and multivariate forms of Hawkes processes and discuss the selection of kernels, which determine whether the process is a jump or a non-jump process. In a jump Hawkes process, the conditional intensity spikes at the occurrence of an event and the risk of experiencing new events is highest immediately after an event. For non-jump processes, the risk increases more gradually and can be more flexible. Additionally, we explore the choice of baseline intensity and the inclusion of covariates in the conditional intensity of the process. For parameter estimation, we derive the log-likelihood functions and discuss goodness-of-fit methods. We show that by employing the time-rescaling theorem to transform event times, assessing the fit of a Hawkes process reduces to that of a unit-rate Poisson process. Finally, we illustrate the application of Hawkes processes by exploring whether an exponential Hawkes process can be used to model occurrences of diabetes-related comorbidities using data from the Diabetes Register of the Finnish Institute for Health and Welfare (THL). Based on our analysis, the process did not adequately describe our data; however, exploring alternative kernel functions and incorporating time-varying baseline intensities holds potential for improving the fit.
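A sketch of the exponential Hawkes log-likelihood referred to above, with conditional intensity λ(t) = μ + α Σ_{t_i < t} exp(−β(t − t_i)), evaluated with the standard recursion and maximized numerically. The event times and starting values are placeholders.

```python
# Log-likelihood of a univariate exponential Hawkes process and its numerical MLE.
import numpy as np
from scipy.optimize import minimize

def exp_hawkes_loglik(params, times, T):
    mu, alpha, beta = params
    if mu <= 0 or alpha < 0 or beta <= 0 or alpha >= beta:   # require stationarity
        return -np.inf
    A, loglik, prev = 0.0, 0.0, None
    for t in times:
        A = 0.0 if prev is None else np.exp(-beta * (t - prev)) * (1.0 + A)
        loglik += np.log(mu + alpha * A)          # sum of log-intensities at event times
        prev = t
    # Compensator: integral of the intensity over [0, T].
    compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - times)))
    return loglik - compensator

times = np.sort(np.random.default_rng(0).uniform(0, 100, 60))
res = minimize(lambda p: -exp_hawkes_loglik(p, times, 100.0),
               x0=[0.5, 0.3, 1.0], method="Nelder-Mead")
print(res.x)   # estimated (mu, alpha, beta)
```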
-
(2023) Survey research collects information about the surrounding world. It is often based on sampling, that is, on collecting information from selected study units. Respondents and other information sources are selected randomly or in some other way. Sampling saves the researcher time and costs but introduces uncertainty into the study. The uncertainty stems from the fact that the drawn sample does not correspond to the population under study. Some of the study units cannot be reached, some do not respond, or the responses remain partly incomplete. Sample surveys therefore produce data whose representativeness is not perfect. Representativeness is corrected with statistical methods, such as weighting the obtained responses and imputing (filling in) missing responses. This thesis focuses on correcting the sample by weighting. The thesis reviews a nonresponse adjustment method called post-stratification. The method is presented and a concrete example related to it is worked through. The concrete example comes from the textbook by Risto Lehtonen and Erkki Pahkinen [Lehtonen and Pahkinen, 2004]. In it, the number of unemployed persons and the relative proportion of unemployed persons are predicted for the area of the former province of Central Finland. The textbook example was repeated in its main parts using a different sample than the one used in the textbook. Because of the municipalities that ended up in the sample, the results differed somewhat from those presented in the textbook. Post-stratification is a logical way of weighting the data that takes place after the data have been collected. It makes use of auxiliary information, for example from registers or previous studies. The history of post-stratification goes back under this name to the 1970s, and under other names at least to the 1940s. Post-stratification is based on dividing the data into homogeneous cells after the data have been collected. Working with homogeneous cells reduces the uncertainty involved in conducting a study. The reduction in uncertainty shows up as a decrease in the variances of the estimators.
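A minimal sketch of post-stratification weighting: after the data have been collected, each respondent in cell h receives the weight N_h / n_h, the ratio of the known population cell size to the realized sample cell size. The cell labels and counts are hypothetical, not the Central Finland example of the thesis.

```python
# Post-stratification weights N_h / n_h and a post-stratified total and proportion.
import numpy as np
import pandas as pd

# Known population sizes of the post-strata (e.g. from a register).
N_h = pd.Series({"urban": 120_000, "semi_urban": 60_000, "rural": 20_000})

# Collected sample with the study variable (1 = unemployed, 0 = employed).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "cell": rng.choice(N_h.index, size=500, p=[0.7, 0.2, 0.1]),
    "unemployed": rng.integers(0, 2, size=500),
})

n_h = sample["cell"].value_counts()
sample["weight"] = sample["cell"].map(N_h / n_h)          # post-stratification weights

total_unemployed = (sample["weight"] * sample["unemployed"]).sum()
proportion = total_unemployed / N_h.sum()
print(round(total_unemployed), round(proportion, 3))
```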
-
(2023) The need for statistical literacy has long been recognized as one of the civic skills of an adult functioning in society. There is, however, no full consensus among experts on the definition of statistical literacy. The thesis first examines the definitions of statistics offered by statistics textbooks and popular science books, as well as, with the help of the books' tables of contents, which statistical topics the books cover. After this, the aim is both to find definitions for statistical literacy and to map the elements of which statistical literacy consists. The mathematical literacy model of the PISA study is examined as one possibility for defining the components and processes of statistical literacy. Models of statistical literacy make it possible to refine the teaching content of introductory statistics courses; they can also be used in developing teaching and in assessing students' competence. Although the PISA study offers an interesting perspective on statistical literacy, its framework should be examined and developed in the light of the views of statistics education researchers on statistical literacy. Definitions of statistics emphasize the importance of collecting observations in order to describe and analyze the real world, to manage uncertainty and to make better decisions. Traditional definitions of statistical literacy additionally emphasize the understanding of numerical observations, context, communication, critical thinking, attitudes and motivation. In addition to statistics textbooks, the practical approach of popular science books would be worth utilizing in the planning of teaching. The focus of teaching should shift from producing numerical answers to interpreting numbers in context. Particular attention should be paid in teaching to practising critical skills. Dispositional factors, such as attitudes and motivation, are part of the statistical literacy models proposed by researchers, and they should not be set aside in teaching either.
-
(2023) In this thesis, we model the graduation of Mathematics and Statistics students at the University of Helsinki. The interest is in the graduation and drop-out times of bachelor's and master's degree program students. Our aim is to understand how studies lead up to graduation or drop-out, and which students are at a higher risk of dropping out. As the modeled quantity is time-to-event, the modeling is performed with survival analysis methods. Chapter 1 gives an introduction to the subject, while in Chapter 2 we explain our objectives for the research. In Chapter 3, we present the available information and the possible variables for modeling. The dataset covers a 12-year period from 2010/11 to 2021/22 and includes information for 2268 students in total. There were many limitations, and the depth of the data allowed the analysis to focus only on the post-2017/18 bachelor's program. In Chapter 4, we summarize the data with visual presentation and some basic statistics of the follow-up population and different cohorts. The statistical methods are presented in Chapter 5. After introducing the characteristic concepts of time-to-event analysis, the main focus is on two alternative model choices: the Cox regression and the accelerated failure time models. The modeling itself was conducted with the programming language R, and the results are given in Chapter 6. In Chapter 7, we present the main findings of the study and discuss how the research could be continued in the future. We found that most drop-outs happen early, during the first and second study year, with the grades from early courses such as Raja-arvot providing some early indication of future success in studies. Most graduations in the post-2017/18 program occur between the end of the third study year and the end of the fourth study year, with the median graduation time being 3.2 years after enrollment. Including the known graduation times from the pre-2017/18 data, the median graduation time over the whole follow-up period was 3.8 years. Other relevant variables in modeling the graduation times were gender and whether or not a student was studying in the Econometrics study track. Female students graduated faster than male students, and students in the Econometrics study track graduated slower than students in other study tracks. In future continuation projects, the presence of more specific period-wise data is crucial, as it would allow the implementation of more complex models and a reliable validation of the results presented in this thesis. Additionally, more accuracy could be attained for the estimated drop-out times.
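One of the model choices mentioned above, Cox proportional hazards regression for time to graduation with right-censoring, can be sketched with the lifelines package as follows. The data frame is simulated and the covariate names (gender, Econometrics track, an early-course grade) are stand-ins for the thesis variables.

```python
# Cox proportional hazards fit for time-to-graduation with right-censoring (lifelines).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "years_to_event": rng.gamma(shape=6, scale=0.6, size=n),   # graduation or censoring time
    "graduated": rng.integers(0, 2, size=n),                   # 1 = graduated, 0 = censored
    "female": rng.integers(0, 2, size=n),
    "econometrics_track": rng.integers(0, 2, size=n),
    "early_course_grade": rng.integers(1, 6, size=n),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_to_event", event_col="graduated")
cph.print_summary()   # hazard ratios for graduation by covariate
```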
-
(2024) Fish spawning activity is closely related to habitat conditions. This research focuses on the number of pikeperch larvae caught during the warmer third and fourth quarters along the Finnish coastline from 2007 to 2014. The study was based on a Bayesian species distribution model with a Poisson distribution to simulate the reproduction of pikeperch and predict the distribution of pikeperch larvae. In addition to utilizing environmental covariates, the study also incorporated phenological information to model the periodic pattern of larval density in pikeperch, and used spatially varying coefficients to see how temperature can influence it. I validated the explanatory power and predictive ability of this model, which simulated and predicted the larval density of pikeperch, and concluded that shallow coastal waters with higher cumulative spring temperatures are suitable for pikeperch reproduction and spawning, and that the most likely time for spawning is from early to mid-June. By modeling the seabed environment, we can better understand the marine ecosystem and assess fisher behavior.
-
(2023) This thesis focuses on statistical topics that proved important during a research project involving quality control in chemical forensics. This includes general observations about the goals and challenges a statistician may face when working together with a researcher. The research project involved analyzing a dataset with high dimensionality compared to the sample size in order to figure out if parts of the dataset can be considered distinct from the rest. Principal component analysis and Hotelling's T^2 statistic were used to answer this research question. Because of this, the thesis introduces the ideas behind both procedures as well as the general idea behind multivariate analysis of variance. Principal component analysis is a procedure that is used to reduce the dimension of a sample. On the other hand, Hotelling's T^2 statistic is a method for conducting multivariate hypothesis testing for a dataset consisting of one or two samples. One way of detecting outliers in a sample transformed with principal component analysis involves the use of Hotelling's T^2 statistic. However, using both procedures together breaks the theory behind Hotelling's T^2 statistic. Due to this, the resulting information is considered more of a guideline than a hard rule for the purposes of outlier detection. To figure out how the different attributes of the transformed sample influence the number of outliers detected according to Hotelling's T^2 statistic, the thesis includes a simulation experiment. The simulation experiment involves generating a large number of datasets. Each observation in a dataset contains the number of outliers according to Hotelling's T^2 statistic in a sample that is generated from a specific multivariate normal distribution and transformed with principal component analysis. The attributes that are used to create the transformed samples vary between the datasets, and in some datasets the samples are instead generated from two different multivariate normal distributions. The datasets are observed and compared against each other to find out how the specific attributes affect the frequencies of different numbers of outliers in a dataset, and to see how much the datasets differ when a part of the sample is generated from a different multivariate normal distribution. The results of the experiment indicate that the only attributes that directly influence the number of outliers are the sample size and the number of principal components used in the principal component analysis. The mean number of outliers divided by the sample size is smaller than the significance level used for the outlier detection and approaches the significance level when the sample size increases, implying that the procedure is consistent and conservative. In addition, when some part of the sample is generated from a different multivariate normal distribution than the rest, the frequency of outliers can potentially increase significantly. This indicates that the number of outliers according to Hotelling's T^2 statistic in a sample transformed with principal component analysis can potentially be used to confirm that some part of the sample is distinct from the rest.
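A sketch of the flagging procedure discussed above: project the sample onto its first k principal components and flag observations whose Hotelling's T^2 computed on the scores exceeds an F-based control limit. The data are simulated, and, as the thesis itself notes, the combination makes the threshold a guideline rather than an exact test.

```python
# PCA scores + Hotelling's T^2 with an approximate F-based control limit.
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, k, alpha = 60, 20, 3, 0.05
X = rng.normal(size=(n, p))
X[0] += 4.0                                   # one shifted observation

scores = PCA(n_components=k).fit_transform(X)
# Scores are uncorrelated, so T^2 is a sum of squared standardized scores.
t2 = np.sum(scores**2 / scores.var(axis=0, ddof=1), axis=1)

# F-distribution based control limit commonly used with PCA scores.
limit = k * (n - 1) * (n + 1) / (n * (n - k)) * stats.f.ppf(1 - alpha, k, n - k)
print("flagged observations:", np.flatnonzero(t2 > limit))
```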
-
(2024) The ever-changing world of e-commerce prompted the case company to develop a new, improved online store for its business functions, which also created the need to understand the relevant metrics. The aim of the research is to find the customer behaviour metrics that have explanatory power for the response variable, which is the count of transactions. Examining these key metrics provides an opportunity to create a sustainable foundation for future analytics. Based on the results, the case company can develop its analytics as well as understand the weaknesses and strengths of the online store. The data come from the Google Analytics service and each variable receives a daily value, but the data are not treated as a time series. The response variable is not normally distributed, so a linear model was not suitable. Instead, the natural choice was generalized linear models, as they can also accommodate non-normally distributed response variables. Two different models were fitted: a Poisson model and a Gamma model. The models were compared in many ways, but no clear difference in their performance was found, so the results were combined from both models. The results provided by the models were quite similar, but there were differences. For this reason, the explanatory variables were divided into three categories: key variables, variables with differing results, and non-significant variables. Key variables have explanatory power for the response variable, and the results of the models were consistent. For variables with differing results, the results of the models were different, and for non-significant variables, there was no explanatory power for the response variable. This categorization facilitates understanding of the results. In total, six explanatory variables were categorized as key variables, one as a variable with differing results, and two as non-significant. In conclusion, it matters which variables are tracked when the efficiency of the web store is developed based on the efficiency of transactions.
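A minimal sketch of fitting the two GLM families mentioned above to a daily transaction count, using statsmodels. The data and metric names are simulated placeholders, and shifting the count by one for the Gamma fit is only an illustrative device to keep the response strictly positive.

```python
# Fit a Poisson GLM and a Gamma GLM (log link) to a daily transaction count.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 365
df = pd.DataFrame({
    "sessions": rng.poisson(800, n),
    "ad_clicks": rng.poisson(120, n),
})
df["transactions"] = rng.poisson(0.02 * df["sessions"] + 0.05 * df["ad_clicks"])
df["transactions_pos"] = df["transactions"] + 1   # Gamma needs a strictly positive response

poisson_fit = smf.glm("transactions ~ sessions + ad_clicks", data=df,
                      family=sm.families.Poisson()).fit()
gamma_fit = smf.glm("transactions_pos ~ sessions + ad_clicks", data=df,
                    family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Rough comparison only: the Gamma model uses the shifted response, so the AICs
# are not strictly comparable across the two fits.
print(poisson_fit.aic, gamma_fit.aic)
```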