Browsing by study line "Tilastotiede" (Statistics)

  • Nenonen, Veera (2022)
    Social benefits have undergone many kinds of changes over the years, and the laws governing them are under continuous development. Social assistance (toimeentulotuki), the last-resort form of financial support provided by the state, has also been the target of significant measures, which has affected the lives of many Finns. Of these measures, the transfer of basic social assistance to the responsibility of the Social Insurance Institution of Finland (Kela) in particular has demanded considerable adaptability from those who process the benefit and those who apply for it. This may have aroused strong opinions, and discussion forums are a natural platform for expressing them. Suomi24, the largest discussion forum in Finland, contains many discussion threads on society and politics, and mapping their content on topics of interest can, with the right methods, yield interesting and useful information. This thesis uses natural language processing methods, more specifically topic modelling, to investigate whether the amendment to the Social Assistance Act that entered into force in 2017 is in some way reflected in the Suomi24 discussions on social assistance. The study is carried out by illustrating the selected data with various visualizations and by applying the LDA algorithm, with the aim of identifying the central topics of the discussions and the concepts related to them. If the amendment has stirred discussion, this could show in the topics and in how the use of the words they contain is distributed between the periods before and after the change. Delimiting the data and extracting it from the database, as well as preprocessing it for topic modelling, also make up a significant part of the study. The data is analysed twice in total, because the first round revealed shortcomings in the preprocessing stage and in the model fitting.
Iteration is not unusual in studies of this kind, since issues that should have been taken into account in earlier stages may surface only when the results are interpreted. In the second round, some interesting observations emerged from the contents of the topics, but on their basis it is difficult to draw conclusions about whether the amendment to the Social Assistance Act is visible in the messages on the discussion platform.
  • Aholainen, Kusti (2022)
    The purpose of this thesis is to examine the suitability of robust estimators, in particular the BMM estimator, for estimating the parameters of an ARMA(p, q) process. Robust estimators are estimators designed to control the influence of deviating observations, i.e. outliers, on the estimates. A robust estimator tolerates outliers in the sense that their presence in the observations has no significant effect on the estimates. The protection gained against outliers, however, usually shows up as a loss of efficiency relative to the maximum likelihood method. The BMM estimator is an extension of the MM estimator introduced by Muler, Peña and Yohai in the article Robust estimation for ARMA models (2009). It is based on the BIP-ARMA model, developed as an auxiliary model for the ARMA model, in which the effect of the innovation term is bounded by a filter. The idea is thereby to control the influence of outliers occurring in the innovations of the ARMA model. In the thesis, the BMM and MM estimators are compared with two classical methods, maximum likelihood (ML) and least squares (LS). The beginning of the thesis presents the necessary concepts of probability theory, time series analysis and robust methods. The reader is introduced to robust estimators and to the motivation behind robust methods. Time series containing outliers are treated in the thesis as realizations of an asymptotically contaminated ARMA process, and definitions are given for the most important outlier processes known in the literature. In addition, the computation of the BMM, MM, ML and LS estimators is described. In connection with the estimators, the thesis also discusses initial value methods, which are used to choose the starting values for the estimators' minimization algorithms. The theoretical part presents theorems and proofs of the consistency and asymptotic normality of the MM estimator.
No proof of the corresponding properties is known in the literature for the BMM estimator, however; instead, the same properties are conjectured to hold for it as well. The results section presents simulations that repeat the simulations of the Muler et al. article for more complex ARMA models. In the simulations, the BMM and MM estimators are compared with the ML and LS estimators in terms of mean squared error, and different initial value methods are compared at the same time. In addition, the asymptotic robustness properties of the estimators are discussed. The estimators are computed in R, with the computation of the BMM and MM estimators implemented mainly in C++. The appendix contains the source code needed to compute the BMM and MM estimators.
  • Kukkola, Johanna (2022)
    Can a day be classified into the correct season on the basis of its hourly weather observations using a neural network model, and how accurately can this be done? This is the question this thesis aims to answer. The weather observation data was retrieved from the Finnish Meteorological Institute’s website, and it includes the hourly weather observations from the Kumpula observation station for the years 2010-2020. The weather observations used for the classification were cloud amount, air pressure, precipitation amount, relative humidity, snow depth, air temperature, dew-point temperature, horizontal visibility, wind direction, gust speed and wind speed. There are four distinct seasons that can be experienced in Finland. In this thesis the seasons were defined as three-month periods, with winter consisting of December, January and February, spring consisting of March, April and May, summer consisting of June, July and August, and autumn consisting of September, October and November. The days in the weather data were classified into these seasons with a convolutional neural network model. The model included a convolutional layer followed by a fully connected layer, with the width of both layers being 16 nodes. The accuracy of the classification with this model was 0.80. The model performed better than a multinomial logistic regression model, which had an accuracy of 0.75. It can be concluded that the classification task was satisfactorily successful. An interesting finding was that neither model ever confused summer and winter with each other.
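The season definition above (three-month periods starting from December) can be written as a small labelling function. This is only a sketch of how the class labels could be derived from dates; the names `SEASON_BY_MONTH` and `season_of` are hypothetical and are not taken from the thesis.

```python
from datetime import date

# Month -> season, following the three-month definition in the abstract.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def season_of(day: date) -> str:
    """Return the season label of a calendar day (hypothetical helper)."""
    return SEASON_BY_MONTH[day.month]
```

For example, `season_of(date(2015, 12, 31))` gives `"winter"`, and the next day, 1 January 2016, belongs to the same class.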
  • Virtanen, Jussi (2022)
    In the thesis we assess the ability of two different models to predict cash flows in private credit investment funds. One model is stochastic and the other deterministic, which makes them quite different. The data obtained for the analysis is divided into three subsamples: mature funds, liquidated funds and all funds. The full data set consists of 62 funds, of which 36 belong to the subsample of mature funds and 17 to the subsample of liquidated funds. Both models are fitted to all subsamples. The parameters of the models are estimated with different techniques. The parameters of the Stochastic model are estimated with the conditional least squares method. The parameters of the Yale model are estimated with numerical methods. After the parameters have been estimated, their values are explained in detail and their effect on the cash flows is investigated. This helps to understand which properties of the cash flows the models are able to capture. In addition, we assess both models' ability to predict future cash flows. This is done using the coefficient of determination, QQ-plots and a comparison of predicted and observed cumulated cash flows. With the coefficient of determination we assess how well the models explain the variation in the observed values. With QQ-plots we try to determine whether the values produced by the process follow the normal distribution. Finally, with the cumulated cash flows of contributions and distributions we try to determine whether the models are able to predict the cumulated committed capital and the returns of the fund in the form of distributions. The results show that the Stochastic model performs better in its prediction of contributions and distributions. However, this is not the case for all the subsamples. The Yale model seems to do better in cumulated contributions of the subsample of mature funds.
Nevertheless, the flexibility of the Stochastic model makes it more suitable for different types of cash flows and subsamples. Therefore, it is suggested that the Stochastic model should be used in the prediction and modelling of private credit funds. It is harder to implement than the Yale model, but it does provide more accurate predictions.
  • Laiho, Aleksi (2022)
    In statistics, data can often be high-dimensional with a very large number of variables, often larger than the number of samples. In such cases, selection of a relevant configuration of significant variables is often needed. One such case is in genetics, especially genome-wide association studies (GWAS). To select the relevant variables from high-dimensional data, there exist various statistical methods, many of them relating to Bayesian statistics. This thesis aims to review and compare two such methods, FINEMAP and Sum of Single Effects (SuSiE). The methods are reviewed according to their accuracy in identifying the relevant configurations of variables and their computational efficiency, especially in the case where there exist high inter-variable correlations within the dataset. The methods were also compared to more conventional variable selection methods, such as LASSO. The results show that both FINEMAP and SuSiE outperform LASSO in terms of selection accuracy and efficiency, with FINEMAP producing slightly more accurate results than SuSiE at the expense of computation time. These results can be used as guidelines in selecting an appropriate variable selection method based on the study and data.
  • Kauppala, Tuuli (2021)
    Children’s height and weight development remains a subject of interest, especially due to the increasing prevalence of overweight and obesity in children. With statistical modeling, height and weight development can be examined as separate or connected outcomes, aiding the understanding of the phenomenon of growth. As a biological connection between height and weight development can be assumed, their joint modeling is expected to be beneficial. One more advantage of joint modeling is that it makes prediction of the Body Mass Index (BMI) convenient. In the thesis, we modeled longitudinal data of children’s heights and weights from a dataset obtained from the Finlapset register of the Finnish Institute for Health and Welfare (THL). The research aims were to predict the modeled quantities together with the BMI, to interpret the obtained parameters in relation to the phenomenon of growth, and to investigate the impact of municipalities on the growth of children. The dataset’s irregular, register-based nature together with positively skewed, heteroscedastic weight distributions and within- and between-subject variability suggested Hierarchical Linear Models (HLMs) as the modeling method of choice. We used HLMs in a Bayesian setting, with the benefits of incorporating existing knowledge and obtaining the full posterior predictive distribution for the outcome variables. HLMs were compared with the less suitable classical linear regression model, and bivariate and univariate HLMs with or without area as a covariate were compared in terms of their posterior predictive precision and accuracy. One of the main research questions was the model’s ability to predict the BMI of the child, which we assessed with various posterior predictive checks (PPC). The most suitable model was used to estimate growth parameters of 2-6 year old males and females in Vihti, Kirkkonummi and Tuusula.
With the parameter estimates, we could compare the growth of males and females, assess the differences in within-subject and between-subject variability in growth, and examine the correlation between height and weight development. Based on the work, we could conclude that the bivariate HLM constructed provided the most accurate and precise predictions, especially for the BMI. The area covariates did not provide additional advantage to the models. Overall, Bayesian HLMs are a suitable tool for the register-based dataset of the work, and together with log-transformation of height and weight they can be used to model skewed and heteroscedastic longitudinal data. However, the modeling would ideally require more observations per individual than we had, and proper out-of-sample predictive evaluation would be needed to ensure that the current models are not over-fitted to the data. Nevertheless, the built models can already provide insight into contemporary Finnish childhood growth and can be used to simulate and predict future BMI population distributions.
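One reason joint modeling on the log scale makes BMI prediction convenient is that BMI is a simple function of the two outcomes: log BMI = log weight - 2 log height. The sketch below illustrates only this identity. The function names are hypothetical, and the units (kilograms and metres) are an assumption; the abstract does not state the units used in the thesis.

```python
import math


def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index from weight (kg) and height (m); units are assumed."""
    return weight_kg / height_m ** 2


def bmi_from_logs(log_weight: float, log_height: float) -> float:
    """BMI from log-scale predictions: exp(log w - 2 * log h)."""
    return math.exp(log_weight - 2.0 * log_height)
```

Applied to each joint posterior predictive draw of (log weight, log height), the second function would give a draw from the predictive distribution of BMI.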
  • Lahdensuo, Sofia (2022)
    The Finnish Customs collects and maintains the statistics of the Finnish intra-EU trade with the Intrastat system. Companies with significant intra-EU trade are obligated to give monthly Intrastat declarations, and the statistics of the Finnish intra-EU trade are compiled based on the information collected with the declarations. In case a company does not give the declaration in time, an estimation method is needed for the missing values. In this thesis we propose an automatic multivariate time series forecasting process for the estimation of the missing Intrastat import and export values. The forecasting is done separately for each company with missing values. For forecasting we use two-dimensional time series models, where one component is the import or export value of the company to be forecasted, and the other component is the import or export value of the industrial group of the company. To complement the time series forecasting we use forecast combining. Combined forecasts, for example the averages of the obtained forecasts, have been found to perform well in terms of forecast accuracy compared to the forecasts created by individual methods. In the forecasting process we use two multivariate time series models, the Vector Autoregressive (VAR) model, and a specific VAR model called the Vector Error Correction (VEC) model. The choice of the model is based on the stationarity properties of the time series to be modelled. An alternative to the VEC model is the so-called augmented VAR model, which is an over-fitted VAR model. We use the VEC model and the augmented VAR model together by using the average of the forecasts created with them as the forecast for the missing value. When the usual VAR model is used, only the forecast created by the single model is used. The forecasting process is designed to be as automatic and as fast as possible; therefore the estimation of a time series model for a single company is made as simple as possible.
Thus, only statistical tests which can be applied automatically are used in the model building. We compare the forecast accuracy of the forecasts created with the automatic forecasting process to the forecast accuracy of forecasts created with two simple forecasting methods. For the time series deemed non-stationary, the Naïve forecast performs well in terms of forecast accuracy compared to the time series model based forecasts. On the other hand, for the time series deemed stationary, the average over the past 12 months performs well as a forecast in terms of forecast accuracy compared to the time series model based forecasts. We also consider forecast combinations created by calculating the average of the time series model based forecasts and the simple forecasts. In line with the literature, the forecast combinations perform overall better in terms of forecast accuracy than the forecasts based on the individual models.
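The two simple benchmark forecasts and the averaging form of forecast combining described above are easy to state in code. The following is a minimal sketch with hypothetical function names; it assumes a one-step-ahead forecast from a plain list of monthly values and does not reproduce the VAR or VEC models of the thesis.

```python
from typing import Sequence


def naive_forecast(series: Sequence[float]) -> float:
    """Naive forecast: the next value equals the last observed value."""
    return series[-1]


def mean_forecast(series: Sequence[float], window: int = 12) -> float:
    """Average of the most recent `window` observations (12 months here)."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def combine(forecasts: Sequence[float]) -> float:
    """Forecast combination as the plain average of individual forecasts."""
    return sum(forecasts) / len(forecasts)
```

For a series of the values 1 to 24, the naive forecast is 24, the 12-month average is 18.5, and their combination is 21.25. A model-based forecast would simply be one more element of the list passed to `combine`.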
  • Nikkanen, Leo (2022)
    In spatial statistics the modelled domain often contains physical barriers that can affect how the modelled phenomenon behaves. Such a barrier can be, for example, land when modelling a fish population, or a road for various animal populations. A common model in spatial statistics is the stationary Gaussian model, because of its modest computational requirements and the relatively easy interpretation of its results. A physical barrier has no effect on this type of model unless the barrier is transformed into a variable, but this can cause issues in the polygon selection. In this thesis I discuss how a non-stationary Gaussian model can be deployed in cases where the spatial domain contains physical barriers. This non-stationary model reduces spatial correlation continuously towards zero in areas that are considered a physical barrier. When the correlation is set to decrease smoothly to zero, the model is more likely to produce similar output with slightly different polygons. The advantage of the barrier model is that it is as fast to train as the stationary model, because both models can be trained using the finite element method (FEM). With FEM we can solve stochastic partial differential equations (SPDE). This method represents a continuous random field on a discrete mesh, and the computational requirements increase as the number of nodes in the mesh increases. In order to create stationary and non-stationary models, I describe the required concepts, such as Bayesian statistics, stochastic processes and covariance functions, in the second chapter. I use these concepts to define a spatial random effect model; one commonly used spatial model is the Gaussian latent variable model. At the end of the second chapter, I describe how the barrier model is created and what requirements it has. The barrier model is based on a Matérn model, which is a Gaussian random field that can be represented using the Matérn covariance function.
The second chapter ends with a description of how to create the mesh mentioned above and how the FEM is used to solve the SPDE. The performance of the stationary and non-stationary Gaussian models is first tested by training both models with simulated data. This simulated data is a random sample from a polygon of Helsinki where the coastline is interpreted as a physical barrier. The results show that the barrier model estimates the true parameters better than the stationary model. The last chapter contains a data analysis of the rat populations in Helsinki. The data contains the number of rat observations in each zip code and a set of covariates. Both models, stationary and non-stationary, are trained with and without covariates, and the best of these four models was the stationary model with covariates.
  • Halonen, Pyry (2022)
    Prostate cancer is the second most common cancer among men, and risk evaluation of the cancer prior to treatment can be critical. Risk evaluation of prostate cancer is based on multiple factors, such as clinical assessment. Biomarkers are studied because they could also be beneficial in the risk evaluation. In this thesis we assess the predictive ability of biomarkers regarding prostate cancer relapse. The statistical method we utilize is the logistic regression model. It is used to model the probability of a dichotomous outcome variable. In this case the outcome variable indicates whether the cancer of the observed patient has relapsed. The four biomarkers AR, ERG, PTEN and Ki67 form the explanatory variables. They are the most studied biomarkers in prostate cancer tissue. The biomarkers are usually detected by visual assessment of the expression status or abundance of staining. Artificial intelligence image analysis is not yet in common clinical use, but it is being studied as a potential diagnostic aid. For each biomarker, the data contains a visually obtained variable and a variable obtained by artificial intelligence. In the analysis we compare the predictive power of these two differently obtained sets of variables. Due to the large number of explanatory variables, we seek the best fitting model, using the glmulti algorithm for the selection of the explanatory variables. The predictive power of the models is measured by the receiver operating characteristic curve and the area under the curve. The data classifies each prostate cancer according to whether it was visible in magnetic resonance imaging (MRI). The classification is not exclusive, since a patient could have had both an MRI visible and an MRI invisible cancer. The data was split into three datasets: MRI visible cancers, MRI invisible cancers and the two datasets combined.
By splitting the data we could further analyze whether the MRI visible cancers differ from the MRI invisible cancers in relapse prediction. In the analysis we find that none of the variables from MRI invisible cancers are significant in the prostate cancer relapse prediction. In addition, none of the variables regarding the biomarker AR has predictive power. The best biomarker for predicting prostate cancer relapse is Ki67, where a high staining percentage indicates a greater probability of relapse. The variables of the biomarker Ki67 were significant in multiple models, whereas the biomarkers ERG and PTEN had significant variables only in a few models. Artificial intelligence variables give more accurate predictions than the visually obtained variables, but we could not conclude that the artificial intelligence variables are strictly better. We learn instead that the visual and the artificial intelligence variables complement each other in predicting cancer relapse.
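The area under the ROC curve used above as the measure of predictive power equals the probability that a randomly chosen relapsed patient receives a higher predicted score than a randomly chosen non-relapsed patient, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name and the 0/1 label coding are assumptions made for illustration, and the thesis itself works in R.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 = event (e.g. relapse), 0 = no event (assumed coding).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # correctly ordered pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect separation. The double loop is quadratic in the sample size, which is acceptable for a sketch but not for large data.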
  • Vuoristo, Varpu (2021)
    Party support between elections is measured by means of surveys. In everyday language these opinion polls are called gallups. This work looks into the history of political opinion polling and gives a brief overview of the current state of polling in Finland. The master's thesis uses data collected through surveys. In the data, respondents were asked about their voting behaviour in the following elections: the 2012 municipal elections, the 2015 parliamentary elections and the 2017 municipal elections. The thesis presents the wording of the survey questions, the steps taken to clean the data, and the grounds for deciding which information is needed to fit a statistical model. The theoretical part introduces generalized linear models. As the method, a generalized linear model is fitted to selected and cleaned subsets of the original data. These subsets contain information on the respondents' voting behaviour among eight parliamentary parties. In addition, for fitting the statistical model, the subsets contain the respondents' gender and place of residence according to the NUTS 2 regional classification. Gender and the five regions serve as explanatory variables in the model, and party support as the response variable. The data processing was carried out with the R software. The results present, in tabular form, the effect of the explanatory variables on voting for the party under consideration, both as independent predictors and as interactions. Each of the eight parties is examined separately for all three election data sets. The tools of analysis are the maximum likelihood estimates and their confidence intervals.
  • Halme, Topi (2021)
    In a quickest detection problem, the objective is to detect abrupt changes in a stochastic sequence as quickly as possible, while limiting the rate of false alarms. The development of algorithms that after each observation decide either to stop and declare that a change has happened, or to continue the monitoring process, has been an active line of research in mathematical statistics. The algorithms seek to optimally balance the inherent trade-off between the average detection delay in declaring a change and the likelihood of declaring a change prematurely. Change-point detection methods have applications in numerous domains, including monitoring the environment or the radio spectrum, target detection, financial markets, and others. Classical quickest detection theory focuses on settings where only a single data stream is observed. In modern day applications facilitated by the development of sensing technology, one may be tasked with monitoring multiple streams of data for changes simultaneously. Wireless sensor networks or mobile phones are examples of technology where devices can sense their local environment and transmit data in a sequential manner to some common fusion center (FC) or cloud for inference. When performing quickest detection tasks on multiple data streams in parallel, classical tools of quickest detection theory focusing on false alarm probability control may become insufficient. Instead, controlling the false discovery rate (FDR) has recently been proposed as a more useful and scalable error criterion. The FDR is the expected proportion of false discoveries (false alarms) among all discoveries. In this thesis, novel methods and theory related to quickest detection in multiple parallel data streams are presented. The methods aim to minimize detection delay while controlling the FDR. In addition, scenarios where not all of the devices communicating with the FC can remain operational and transmitting to the FC at all times are considered.
The FC must choose which subset of data streams it wants to receive observations from at a given time instant. Intelligently choosing which devices to turn on and off may extend the devices’ battery life, which can be important in real-life applications, while affecting the detection performance only slightly. The performance of the proposed methods is demonstrated in numerical simulations to be superior to existing approaches. Additionally, the topic of multiple hypothesis testing in spatial domains is briefly addressed. In a multiple hypothesis testing problem, one tests multiple null hypotheses at once while trying to control a suitable error criterion, such as the FDR. In a spatial multiple hypothesis problem each tested hypothesis corresponds to e.g. a geographical location, and the non-null hypotheses may appear in spatially localized clusters. It is demonstrated that implementing a Bayesian approach that accounts for the spatial dependency between the hypotheses can greatly improve testing accuracy.
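To make the FDR criterion concrete in the static multiple hypothesis testing setting mentioned above, the classical Benjamini-Hochberg step-up procedure can serve as an illustration. It is a standard textbook method and is not the sequential or spatial method developed in the thesis; the function name below is hypothetical. Under independence of the p-values it controls the FDR at level `alpha`.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis using the BH step-up rule."""
    m = len(p_values)
    # Indices of the hypotheses sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a hypothesis can be rejected even if its own p-value exceeds its own threshold, as long as a larger-ranked p-value passes. With four p-values all equal to 0.04 and `alpha = 0.05`, all four are rejected.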
  • Laine, Riku (2021)
    People with a drug use disorder have a high risk of death following release from criminal sanctions due to an increased risk of overdose. Time in prison has been associated with increased mortality from natural causes of death and suicides. In this thesis, the association of criminal sanctions with the mortality and causes of death of Finnish treatment-seeking individuals with substance use disorder was studied. Prior research on the topic is scarce and dated. The data was the Register-based follow-up study on criminality, health and taxation of inpatients and outpatients entered into substance abuse treatment (RIPE, n = 10 887). The patients had been clients of A-Clinic Foundation between 1990 and 2009. Mortality was modelled with logistic regression from 1 January 1992 to 26 August 2015. The time was divided into one-week episodes. For each client it was marked whether they were free, in prison or performing community service, and whether they had died during the episode. Causes of death were studied using death records from 1992 to 2018. There was a 2.5-fold increase in overall mortality during the first two weeks after sentences. The risk stayed elevated even after the first 12 weeks (odds ratio 1.20; 95% confidence interval 1.08-1.32). The risk of a drug-related death (DRD) was almost 8.5-fold during the first two weeks. Poisonings (excluding alcohol poisoning) and assaults were more likely causes of death for patients with a criminal history. DRD was over three times more likely among patients with criminal records. After validations, 33 individuals who had died during their sentence were identified from the data, of whom 14 (42.4%) had committed suicide. Approximately 10 percent of other deaths were suicides. Thus, it can be concluded that Finland has a similar increased risk of death after sentences as has been observed in other countries, despite frequent use of buprenorphine. Sentences affect causes of death for 2-5 years after the last sentence.
Additionally, first signs of elevated mortality during community sanctions were observed, but further studies are required to confirm the finding.
  • Viholainen, Olga (2020)
    Poisson regression is a well-known generalized linear model that relates the expected value of a count to a linear combination of explanatory variables. Outliers severely affect the classical maximum likelihood estimator of the Poisson regression. Several robust alternatives to the maximum likelihood (ML) estimator have been developed, such as the conditionally unbiased bounded-influence (CU) estimator, the Mallows quasi-likelihood (MQ) estimator and M-estimators based on transformations (MT). The purpose of the thesis is to study the robustness of the robust Poisson regression estimators in different conditions. Another goal is to compare their performance with each other. The robustness of the Poisson regression estimators is investigated by performing a simulation study, where the estimators used are the ML, CU, MQ and MT estimators. The robust estimators MQ and MT are studied with two different weight functions, C and H, and also without a weight function. The simulation is executed in three parts: the first part handles a situation without any outliers, in the second part the outliers are in the X space, and in the third part the outliers are in the Y space. The results of the simulation show that all the robust estimators are less affected by the outliers than the classical ML estimator, but the outliers nevertheless severely weaken the results of the CU estimator and the MQ based estimators. The MT based estimators, and especially the MT and H-MT estimators, have by far the lowest medians of the mean squared errors when the data are contaminated with outliers. When there are no outliers in the data, they compare favorably with the other estimators. Therefore the MT and H-MT estimators are an excellent option for fitting the Poisson regression model.
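For reference, the model behind the opening sentence can be written out. The abstract does not state the link function; the sketch below assumes the canonical log link, and the symbols $y_i$, $x_i$ and $\beta$ are notation introduced here rather than taken from the thesis.

```latex
% Poisson regression with log link (assumed) and its log-likelihood.
\begin{align*}
  Y_i \mid x_i &\sim \mathrm{Poisson}(\mu_i), &
  \log \mu_i &= x_i^{\top}\beta, \\
  \ell(\beta) &= \sum_{i=1}^{n}
    \bigl( y_i\, x_i^{\top}\beta - \exp(x_i^{\top}\beta) - \log y_i! \bigr).
\end{align*}
```

The ML estimator maximizes $\ell(\beta)$. Each observation enters the score equations through the unbounded term $(y_i - \mu_i)\,x_i$, which is why outliers in either the Y space or the X space can have a large effect on the ML estimate.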
  • Jeskanen, Juuso-Markus (2021)
    Developing reliable, regulatory compliant and customer-oriented credit risk models requires thorough knowledge of credit risk phenomenon. Tight collaboration between stakeholders is necessary and hence models need to be transparent, interpretable and explainable as well as accurate, for experts without statistical background. In the context of credit risk, one can speak of explainable artificial intelligence (XAI). Hence, practice and market standards are also underlined in this study. So far, credit risk research has mainly focused on the estimation of the probability of default parameter. However, as systems and processes have evolved to comply with regulation in the last decade, recovery data has improved, which has raised loss given default (LGD) up to the heart of credit risk. In the context of LGD, most of the studies have emphasized estimation of one-stage models. However, in practice, market standards support a multi-stage approach which follows the institution's simplified recovery processes. Generally, multi-stage models are more transparent and have better predictive power and compliant status with the regulation. This thesis presents a framework to analyze and execute sensitivity analysis for multi-stage LGD model. The main contribution of the study is to increase the knowledge of LGD modelling by giving insights to the sensitivity of discriminatory power between risk drivers, model components and LGD score. The study aims to answer two questions. Firstly, how sensitive the predictive power of multi-stage LGD model is on the correlation of risk drivers and individual components? Secondly, how to identify the most driving risk factors that need to be considered in multi-stage LGD modelling to achieve adequate level LGD score? The experimental part of this thesis is divided into two parts. The first one presents the motivation, study design and experimental setup used in this thesis to execute the study. 
The second part focuses on the sensitivity analysis of risk drivers, components and the LGD score. The sensitivity analysis gives important knowledge of the behavior of the multi-stage LGD model and of the dependencies between independent risk drivers, components and the LGD score with regard to correlations and model performance metrics. The sensitivity framework introduced can be used to assess the need and schedule for model calibrations in relation to changes in the application portfolio. In addition, the framework and results can be used to recognize the need for updates to the monthly IFRS 9 ECL calculation. The study also gives input for model stress testing, where different scenarios and impacts are analyzed with respect to changes in macroeconomic conditions. Even though the focus of this study is on credit risk, the methods presented are also applicable in fields outside the financial sector.
  • Talvensaari, Mikko (2022)
    Gaussian processes are stochastic processes that are particularly well suited to modelling data that exhibit temporal or spatial dependence. Their ease of application follows from the fact that any finite subset of the process follows a multivariate normal distribution, which is fully determined by the mean function and the covariance function of the process. The problem with a likelihood function based on the multivariate normal distribution is poor scalability, because the inversion of the covariance matrix required to evaluate the likelihood has a time complexity that is cubic in the size of the data. This thesis describes a representation for temporal Gaussian processes that is based on vector-valued Markov processes defined by systems of stochastic differential equations. The gain in time efficiency rests on the Markov property of the vector process, that is, on the fact that the future of the process depends only on the current value of a low-dimensional vector. From the vector process defined by the system of stochastic differential equations, a discrete-time linear-Gaussian state-space model is further derived, and its likelihood function can be evaluated in linear time. The theoretical part of the thesis shows, using the spectral representation of stationary Gaussian processes, that the definitions based on systems of stochastic differential equations and on covariance functions are equivalent for certain stationary Gaussian processes. Exact state-space forms are presented for Matérn-type covariance functions and for a seasonal covariance function. In addition, the theoretical part introduces the basic operations for applying state-space models, from Kalman filtering to smoothing and prediction, together with efficient algorithms for carrying them out. In the applied part of the thesis, Gaussian processes in state-space form were used to model and forecast user data throughput at the base stations of a 3G cellular network.
Following Bayesian practice, uncertainty about the model parameters was expressed by assigning prior distributions to the parameters. Two models were fitted to the 15 time series in the data: a model defined for individual time series and a multi-series model in which a posterior distribution was derived for the covariance between the time series. On the 15-series data, the five-week forecasts of the multi-series model were on average marginally better than those of the single-series model. The forecasts of both models were on average better than those of the widely used ARIMA models.
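The linear-time likelihood evaluation described in this abstract can be sketched for the simplest Matérn case (smoothness 1/2, the exponential covariance), where the state is one-dimensional. This is a minimal illustration under stated assumptions, not the thesis code: the function names, the parameter values and the synthetic data are hypothetical. The Kalman filter result is compared with the cubic-time evaluation based on the full covariance matrix.

```python
import numpy as np

def ou_kalman_loglik(t, y, sigma2, ell, noise_var):
    """Log-likelihood of y under a zero-mean GP with covariance
    sigma2 * exp(-|t - t'| / ell) plus white noise, computed in O(n)
    with a scalar Kalman filter (illustrative sketch)."""
    m, P = 0.0, sigma2                 # stationary prior of the state
    loglik = 0.0
    t_prev = t[0]
    for tk, yk in zip(t, y):
        a = np.exp(-(tk - t_prev) / ell)                  # discretised transition
        m, P = a * m, a * a * P + sigma2 * (1.0 - a * a)  # prediction step
        S = P + noise_var                                 # innovation variance
        v = yk - m                                        # innovation
        loglik += -0.5 * (np.log(2.0 * np.pi * S) + v * v / S)
        K = P / S                                         # Kalman gain
        m, P = m + K * v, (1.0 - K) * P                   # update step
        t_prev = tk
    return loglik

def dense_loglik(t, y, sigma2, ell, noise_var):
    """The same log-likelihood from the full covariance matrix, O(n^3)."""
    K = sigma2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(t)))
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(t) * np.log(2.0 * np.pi))

# Synthetic, irregularly sampled data (made up for the comparison).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(t) + 0.3 * rng.standard_normal(t.size)

ll_kalman = ou_kalman_loglik(t, y, sigma2=1.0, ell=2.0, noise_var=0.1)
ll_dense = dense_loglik(t, y, sigma2=1.0, ell=2.0, noise_var=0.1)
print(ll_kalman, ll_dense)
```

The two values agree up to floating-point error. Higher-order Matérn and seasonal covariance functions need a vector-valued state and matrix versions of the same prediction and update steps.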
  • Rautavirta, Juhana (2022)
    Comparison of amphetamine profiles is a task in forensic chemistry whose goal is to decide whether two samples of amphetamine originate from the same source or not. These decisions help to identify and prosecute the suppliers of amphetamine, which is an illicit drug in Finland. The traditional approach to comparing amphetamine samples involves computing the Pearson correlation coefficient between two real-valued sample vectors obtained by gas chromatography-mass spectrometry analysis. A two-sample problem, such as the problem of comparing drug samples, can also be tackled with methods such as a t-test or Bayes factors. Recently, a newer method called predictive agreement (PA), which compares the posterior predictive distributions induced by two samples, has been applied to the comparison of amphetamine profiles. In this thesis, we carried out a statistical validation of this newer method in amphetamine profile comparison by comparing its performance with that of the traditional method based on the Pearson correlation coefficient. Simulation and cross-validation were used in the validation. In the simulation part, we simulated enough data to compute 10 000 PA and correlation values between sample pairs. Cross-validation was used in a case study, where repeated 5-fold group cross-validation was used to study the effect of changes in the data used to train the model. In the cross-validation, the performance of the models was measured with the area under the curve (AUC) of the receiver operating characteristic (ROC) and precision-recall (PR) curves. For the validation, two separate datasets collected by the National Bureau of Investigation of Finland (NBI) were available. One of the datasets was a larger collection of amphetamine samples, whereas the other was a more curated group of samples for which we also know which samples are linked to each other.
In addition to these datasets, we simulated data representing amphetamine samples that were either from different sources or from the same source. The results showed that, with the simulated data, predictive agreement outperformed the traditional method in distinguishing sample pairs from different sources from sample pairs from the same source. The case study showed that changes in the training data have only a marginal effect on the performance of the predictive agreement method, and that with real-world data the PA method outperformed the traditional method in terms of AUC-ROC and AUC-PR values. Additionally, we concluded that the PA method has the benefit of interpretability: the PA value between two samples can be interpreted as the probability that the samples originate from the same source.
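The traditional comparison mentioned in this abstract, the Pearson correlation coefficient between two profile vectors, can be sketched as follows. The three profiles are made-up numbers chosen for illustration, not NBI data, and the function name `pearson_r` is an assumption. The predictive agreement method itself is not implemented here.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ac = a - a.mean()                  # centre both vectors
    bc = b - b.mean()
    return float(ac @ bc / np.sqrt((ac @ ac) * (bc @ bc)))

# Hypothetical profiles: a and b are similar, c has a different pattern.
profile_a = np.array([0.12, 0.40, 0.05, 0.30, 0.13])
profile_b = np.array([0.10, 0.42, 0.06, 0.29, 0.13])
profile_c = np.array([0.35, 0.05, 0.30, 0.10, 0.20])

r_ab = pearson_r(profile_a, profile_b)
r_ac = pearson_r(profile_a, profile_c)
print(r_ab, r_ac)
```

A high correlation is taken as evidence of a common source in the traditional approach. Unlike the PA value, the coefficient is not a probability.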
  • Tan, Shu Zhen (2021)
    In practice, outlying observations are not uncommon in many study domains. Without knowing the underlying causes of the outliers, it is tempting to eliminate them from the datasets. However, unless there is scientific justification, outlier elimination amounts to alteration of the datasets. Instead, heavy-tailed distributions should be adopted to model the larger-than-expected variability in an overdispersed dataset. The Poisson distribution is the standard model for the variation in count data. However, the empirical variability in observed datasets is often larger than that expected under the Poisson. This leads to unreliable inferences when estimating the true effect sizes of covariates in regression modelling. The Negative Binomial distribution is therefore often adopted as an alternative for overdispersed datasets. Nevertheless, it has been proven that neither the Poisson nor the Negative Binomial observation distribution is robust against outliers, in the sense that the outliers have a non-negligible influence on the estimation of the covariate effect size. On the other hand, the scale mixture of quasi-Poisson distributions (called the robust quasi-Poisson model), which is constructed similarly to the Student's t-distribution, is a heavy-tailed alternative to the Poisson. It is proven to be robust against outliers. The thesis presents theoretical evidence on the robustness of the three aforementioned models in a Bayesian framework. Lastly, the thesis considers two simulation experiments with different sources of outliers, process error and covariate measurement error, to compare the robustness of the Poisson, Negative Binomial and robust quasi-Poisson regression models in the Bayesian framework. Model robustness was assessed in terms of the ability of the model to infer the covariate effect size correctly under different combinations of error probability and error variability.
The experiments showed that the robust quasi-Poisson regression model was more robust than its counterparts, because its breakdown point was higher than theirs in both experiments.
  • Kari, Daniel (2020)
    Estimating the effect of random chance ('luck') has long been a question of particular interest in various team sports. In this thesis, we aim to determine the role of luck in a single ice hockey game by building a model that predicts the outcome from the course of events in the game. The prediction accuracy obtained should also reveal, to some extent, the effect of random chance. Using the course of events from over 10,000 games, we train feedforward and convolutional neural networks to predict the outcome and the final goal differential, which has been proposed as a more informative proxy for the outcome. Interestingly, we are not able to obtain distinctly higher accuracy than previous studies, which have focused on predicting the outcome with information available before the game. The results suggest that there might be an upper bound for prediction accuracy even if we knew 'everything' that went on in a game. This further implies that random chance could affect the outcome of a game, although assessing this is difficult, as we do not have a good quantitative metric for luck in the case of single ice hockey game prediction.