
Browsing by study line "Statistik"


  • Vuoristo, Varpu (2021)
    Party support between elections is measured with opinion surveys; in everyday language these polls are known as "gallups". This thesis reviews the history of political opinion polling and gives a brief overview of the current state of polling in Finland. The master's thesis uses survey datasets in which respondents were asked about their voting behaviour in the 2012 municipal elections, the 2015 parliamentary elections and the 2017 municipal elections. The thesis describes the survey question design, the data-cleaning steps and the information needed for fitting a statistical model. The theory part introduces generalized linear models. As the method, a generalized linear model is fitted to selected and cleaned subsets of the original datasets. These subsets contain the respondents' voting behaviour across eight parliamentary parties, together with the respondents' sex and place of residence according to the NUTS 2 regional classification. Sex and the five regions serve as explanatory variables, while party support is the response variable. Data processing was carried out with R. The results tabulate the effect of the explanatory variables on voting for the party under consideration, both as individual predictors and through their interactions. Each of the eight parties is examined separately in all three election datasets. The analysis relies on maximum likelihood estimates and their confidence intervals.
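As an illustration of the kind of model described above, the sketch below fits a binomial generalized linear model for a single party, with sex and region as categorical predictors and their interaction. This is not the thesis code (the thesis uses R); the file and column names are hypothetical.

```python
# Minimal sketch (not the thesis code): a binomial GLM for one party,
# with sex and NUTS 2 region as categorical predictors and their interaction.
# The file and column names (sex, region, voted_party) are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("survey_2017.csv")          # hypothetical cleaned sub-dataset
model = smf.glm(
    "voted_party ~ C(sex) * C(region)",      # main effects and interaction
    data=df,
    family=sm.families.Binomial(),
)
result = model.fit()                          # maximum likelihood estimates
print(result.summary())
print(result.conf_int())                      # confidence intervals for coefficients
```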
  • Mattila, Mari (2023)
    Statistics Finland wanted to improve its statistics describing the heating-oil consumption and heating methods of detached houses. The energy performance certificate register maintained by the Housing Finance and Development Centre of Finland (ARA) was seen as one possible data source for this development work. When the register data were examined, it was noticed that the buildings entered in the register are, on average, larger and newer than the building stock as a whole. This selectivity was expected to become the central problem in using the register. The research question became whether the heating-method and floor-area information of oil-heated detached houses in the energy performance certificate register can be generalized to the whole building stock. Because selectivity is one manifestation of missingness, the theoretical part of the study focuses on missing data. The missingness mechanism is one of its central concepts: the mechanism can be missing completely at random, missing at random, or missing not at random, and it determines which statistical methods are suitable for modelling the data. In this study the missingness was assumed to be not at random. When the mechanism is not at random, the missingness is usually treated as a random phenomenon: a statistical model is formulated for the data and for a missingness indicator, and likelihood inference can be applied to this model. The Heckman selection model was chosen as the model in this study. It is intended for situations in which the data are selected on the basis of the phenomenon under study; for example, oil consumption can be estimated only from data on households that heat with oil. The Heckman model can account for the fact that oil consumption is missing for the houses that do not heat with oil. After the Heckman model had been estimated, its performance was assessed by cross-validation, in which staying with oil heating was predicted. The model predicted only about 58% of the cases correctly. This success rate was considered too low for the model to be used at Statistics Finland for correcting energy-consumption data. One possible reason for the failure of the modelling is that switching away from oil heating happens within a long time window, and the effect of the explanatory variables on the response may vary over time. The model did not account for time; all variables describing the household-dwelling unit were averaged. The model equation may also have been wrong in the sense that it may have lacked important household-level predictors that were simply not available in the register data.
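For orientation, the sketch below shows the classical two-step Heckman selection correction (probit selection equation, then the inverse Mills ratio added to the outcome regression). It is only an illustration under hypothetical variable names; the thesis's actual model specification and estimation may differ.

```python
# Sketch of the two-step Heckman selection correction (hypothetical variables).
# Step 1: probit model for selection into the register (in_register).
# Step 2: outcome regression (e.g. floor area) with the inverse Mills ratio.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

df = pd.read_csv("buildings.csv")                     # hypothetical register + frame data
Z = sm.add_constant(df[["build_year", "municipality_size"]])
probit = sm.Probit(df["in_register"], Z).fit()        # step 1: selection equation

xb = Z @ probit.params                                # linear predictor of selection
mills = pd.Series(norm.pdf(xb) / norm.cdf(xb), index=df.index)  # inverse Mills ratio

sel = df["in_register"] == 1                          # outcome observed only if selected
X = sm.add_constant(df.loc[sel, ["build_year"]]).assign(mills=mills[sel])
outcome = sm.OLS(df.loc[sel, "floor_area"], X).fit()  # step 2: corrected outcome model
print(outcome.summary())
```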
  • Halme, Topi (2021)
    In a quickest detection problem, the objective is to detect abrupt changes in a stochastic sequence as quickly as possible while limiting the rate of false alarms. The development of algorithms that, after each observation, either stop and declare that a change has happened or continue the monitoring process has been an active line of research in mathematical statistics. The algorithms seek to optimally balance the inherent trade-off between the average detection delay in declaring a change and the likelihood of declaring a change prematurely. Change-point detection methods have applications in numerous domains, including monitoring the environment or the radio spectrum, target detection, financial markets, and others. Classical quickest detection theory focuses on settings where only a single data stream is observed. In modern applications facilitated by the development of sensing technology, one may be tasked with monitoring multiple streams of data for changes simultaneously. Wireless sensor networks or mobile phones are examples of technology where devices can sense their local environment and transmit data in a sequential manner to some common fusion center (FC) or cloud for inference. When performing quickest detection tasks on multiple data streams in parallel, classical tools of quickest detection theory focusing on false alarm probability control may become insufficient. Instead, controlling the false discovery rate (FDR) has recently been proposed as a more useful and scalable error criterion. The FDR is the expected proportion of false discoveries (false alarms) among all discoveries. In this thesis, novel methods and theory related to quickest detection in multiple parallel data streams are presented. The methods aim to minimize detection delay while controlling the FDR. In addition, scenarios are considered where not all of the devices communicating with the FC can remain operational and transmitting to the FC at all times. The FC must choose which subset of data streams it wants to receive observations from at a given time instant. Intelligently choosing which devices to turn on and off may extend the devices' battery life, which can be important in real-life applications, while affecting the detection performance only slightly. The performance of the proposed methods is demonstrated in numerical simulations to be superior to existing approaches. Additionally, the topic of multiple hypothesis testing in spatial domains is briefly addressed. In a multiple hypothesis testing problem, one tests multiple null hypotheses at once while trying to control a suitable error criterion, such as the FDR. In a spatial multiple hypothesis problem each tested hypothesis corresponds to, e.g., a geographical location, and the non-null hypotheses may appear in spatially localized clusters. It is demonstrated that implementing a Bayesian approach that accounts for the spatial dependency between the hypotheses can greatly improve testing accuracy.
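As background for the single-stream case mentioned above, the sketch below implements the classical CUSUM recursion for detecting a mean shift in Gaussian data. It is only a building-block illustration; the multi-stream, FDR-controlling procedures developed in the thesis are not reproduced here, and the parameters are illustrative.

```python
# Illustrative single-stream CUSUM for a mean shift in Gaussian data
# (a classical quickest-detection building block; the thesis's multi-stream
# FDR-controlling procedures are not reproduced here).
import numpy as np

rng = np.random.default_rng(0)
pre, post = rng.normal(0.0, 1.0, 300), rng.normal(1.0, 1.0, 200)
x = np.concatenate([pre, post])                       # change point at t = 300

mu0, mu1, sigma2 = 0.0, 1.0, 1.0                      # assumed pre/post-change means
llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma2    # per-observation log-likelihood ratios
threshold = 5.0                                       # trades detection delay vs. false alarms

s, alarm = 0.0, None
for t, l in enumerate(llr):
    s = max(0.0, s + l)                               # CUSUM recursion
    if s > threshold:
        alarm = t
        break
print("alarm raised at observation", alarm)
```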
  • Viholainen, Olga (2020)
    The Poisson regression is a well-known generalized linear model that relates the expected value of the count to a linear combination of explanatory variables. Outliers severely affect the classical maximum likelihood estimator of the Poisson regression. Several robust alternatives to the maximum likelihood (ML) estimator have been developed, such as the conditionally unbiased bounded-influence (CU) estimator, the Mallows quasi-likelihood (MQ) estimator and M-estimators based on transformations (MT). The purpose of the thesis is to study the robustness of these robust Poisson regression estimators under different conditions, and to compare their performance to each other. The robustness of the Poisson regression estimators is investigated in a simulation study, where the estimators used are the ML, CU, MQ and MT estimators. The robust estimators MQ and MT are studied with two different weight functions, C and H, and also without a weight function. The simulation is executed in three parts: the first part handles a situation without any outliers, in the second part the outliers are in the X space, and in the third part the outliers are in the Y space. The results of the simulation show that all the robust estimators are less affected by the outliers than the classical ML estimator, but the outliers nevertheless severely weaken the results of the CU estimator and the MQ-based estimators. The MT-based estimators, and especially the MT and H-MT estimators, have by far the lowest medians of the mean squared errors when the data are contaminated with outliers, and when there are no outliers in the data they compare favorably with the other estimators. Therefore the MT and H-MT estimators are an excellent option for fitting the Poisson regression model.
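The skeleton below shows one replication of the kind of simulation described above: generate Poisson regression data, contaminate the responses (Y-space outliers), fit the model, and record the coefficient error. Only the non-robust ML fit is shown, since the CU, MQ and MT estimators are not standard library routines; the settings are illustrative.

```python
# Skeleton of one simulation replication: Poisson regression data with outliers
# in the Y space, fitted by maximum likelihood.  The robust CU/MQ/MT estimators
# studied in the thesis are not standard library routines and are omitted here.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, beta = 500, np.array([0.5, 0.8])
X = sm.add_constant(rng.normal(size=n))
y = rng.poisson(np.exp(X @ beta))

outliers = rng.random(n) < 0.05               # contaminate 5 % of responses
y[outliers] = y[outliers] * 10                # inflate counts (Y-space outliers)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mse = np.mean((fit.params - beta) ** 2)       # error of the ML estimate
print(fit.params, mse)
```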
  • Smith, Dianna (2024)
    Statistician C. R. Rao made many contributions to multivariate analysis over the span of his career. Some of his earliest contributions continue to be used and built upon almost eighty years later, while his more recent contributions spur new avenues of research. This thesis discusses these contributions, how they helped shape multivariate analysis as we see it today, and what we may learn from reviewing his works. Topics include his extension of linear discriminant analysis, Rao’s perimeter test, Rao’s U statistic, his asymptotic expansion of Wilks’ Λ statistic, canonical factor analysis, functional principal component analysis, redundancy analysis, canonical coordinates, and correspondence analysis. The examination of his works shows that interdisciplinary collaboration and the utilization of real datasets were crucial in almost all of Rao’s impactful contributions.
  • Jeskanen, Juuso-Markus (2021)
    Developing reliable, regulatory-compliant and customer-oriented credit risk models requires thorough knowledge of the credit risk phenomenon. Tight collaboration between stakeholders is necessary, and hence models need to be transparent, interpretable and explainable, as well as accurate, for experts without a statistical background. In the context of credit risk, one can speak of explainable artificial intelligence (XAI). Hence, practice and market standards are also underlined in this study. So far, credit risk research has mainly focused on the estimation of the probability of default parameter. However, as systems and processes have evolved to comply with regulation in the last decade, recovery data has improved, which has raised loss given default (LGD) to the heart of credit risk. In the context of LGD, most studies have emphasized estimation of one-stage models. In practice, however, market standards support a multi-stage approach which follows the institution's simplified recovery processes. Generally, multi-stage models are more transparent, have better predictive power and are more compliant with regulation. This thesis presents a framework for analyzing and executing sensitivity analysis of a multi-stage LGD model. The main contribution of the study is to increase the knowledge of LGD modelling by giving insight into the sensitivity of discriminatory power with respect to risk drivers, model components and the LGD score. The study aims to answer two questions. Firstly, how sensitive is the predictive power of a multi-stage LGD model to the correlation between risk drivers and individual components? Secondly, how can one identify the most important risk drivers that need to be considered in multi-stage LGD modelling to achieve an adequate LGD score? The experimental part of this thesis is divided into two parts. The first presents the motivation, study design and experimental setup used to execute the study. The second focuses on the sensitivity analysis of risk drivers, components and the LGD score. The sensitivity analysis presented in this study gives important knowledge of the behaviour of a multi-stage LGD model and of the dependencies between independent risk drivers, components and the LGD score with regard to correlations and model performance metrics. The introduced sensitivity framework can be utilised in assessing the need and schedule for model calibrations in relation to changes in the application portfolio. In addition, the framework and results can be used to recognize the need for updates to the monthly IFRS 9 ECL calculations. The study also gives input for model stress testing, where different scenarios and impacts are analyzed with regard to changes in macroeconomic conditions. Even though the focus of this study is on credit risk, the methods presented are also applicable in fields outside the financial sector.
  • Sebag, Etienne (2024)
    Road crashes pose a serious safety risk, particularly under adverse weather conditions. A deeper understanding of crash patterns and of their connection to different meteorological factors is useful for targeted safety interventions. Many types of statistical and machine learning models seek to quantify the relationship between meteorological parameters and accident risk. This thesis presents a spatiotemporal generalized additive model to explain which weather conditions increase the risk of a crash in the Finnish regions of Uusimaa and Varsinais-Suomi. The work also explores the spatial and temporal trends associated with a heightened probability of a car accident. The emphasis throughout this work was on carefully engineering the model by selecting an appropriate temporal and spatial granularity at which to perform the analysis; a thoughtful study design and data aggregation procedure were paramount. Ultimately, the model assigns fitted probabilities to combinations of a smaller spatial unit and a specific hour, ranging from March 2017 to December 2021. The model employs MetCoOp Ensemble Prediction System (MEPS) data obtained from the Finnish Meteorological Institute. The constructed model explained 33.8% of the deviance and had a good fit according to diagnostic plots of the randomized quantile residuals. The model indicates that snow and sleet increase the log-odds of a crash. Other factors, such as rush hour and the fact that a crash happened nearby in the last two hours, also added explanatory power to the model. The highest probability of a car crash occurs around the Helsinki and Turku regions.
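A rough sketch of this type of crash-probability model is shown below, using unpenalized B-spline terms for the weather variables inside a binomial GLM. A full GAM with penalized smooths (as in the thesis, e.g. via mgcv in R) would go further; the data file and column names here are hypothetical.

```python
# Rough sketch of a crash-probability model with smooth weather effects, using
# unpenalized B-spline terms in a binomial GLM.  A full GAM with penalized
# smooths (as used in the thesis) is not reproduced; columns are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("crash_cells.csv")            # one row per spatial cell and hour
model = smf.glm(
    "crash ~ bs(temperature, df=4) + bs(precipitation, df=4)"
    " + C(rush_hour) + C(precip_type)",        # e.g. rain / sleet / snow
    data=df,
    family=sm.families.Binomial(),
)
print(model.fit().summary())
```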
  • Ren, Ruotian (2024)
    This research explores how bird abundance changes in response to both human influence and meteorological factors. We utilized a spatiotemporal Poisson regression model for the analysis and employed the integrated nested Laplace approximation (INLA) method to fit the model. Our study concentrated on twenty representative bird species, with five of them chosen to highlight the discoveries. By introducing the Human Influence Index (HII) and Human Footprint Index (HFI) to quantify human activities, we discovered that human influence and habitat conditions significantly impact bird populations, with varying sensitivities among species. In terms of meteorological factors, temperature plays a crucial role in species distribution, with lower temperatures in a specific range generally favoring higher bird densities. The spatiotemporal Poisson regression model and the INLA approach uncover the intricate interplay between human activities and natural elements. This research sheds light on how birds adapt to a changing environment and could offer insights for biodiversity preservation efforts. The results underscore the importance of considering both human activities and climate influences in conservation strategies to ensure the long-term viability of bird populations and their ecosystems.
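The simplified sketch below shows only the non-spatial core of such an abundance model: a Poisson regression of counts on a human-influence covariate and temperature, with site and year effects. The thesis itself fits a spatiotemporal model with INLA (R-INLA); the data file and column names here are hypothetical.

```python
# Simplified sketch of the abundance model: a Poisson regression of counts on a
# human-influence index and temperature, with site and year effects.  The thesis
# uses a spatiotemporal model fitted with INLA (R-INLA); columns are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("bird_counts.csv")
model = smf.glm(
    "count ~ hii + temperature + I(temperature**2) + C(year) + C(site)",
    data=df,
    family=sm.families.Poisson(),
)
print(model.fit().summary())
```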
  • Talvensaari, Mikko (2022)
    Gaussian processes are stochastic processes that are particularly well suited to modelling data exhibiting temporal or spatial dependence. Their ease of use stems from the fact that any finite subset of the process follows a multivariate normal distribution, which is fully determined by the mean function and the covariance function of the process. The problem with the likelihood based on the multivariate normal distribution is poor scalability, since inverting the covariance matrix, which is required to evaluate the likelihood, has time complexity cubic in the size of the data. This thesis describes a representation of temporal Gaussian processes based on vector-valued Markov processes defined by systems of stochastic differential equations. The computational advantage of the method rests on the Markov property of the vector process, that is, on the fact that the future of the process depends only on the current value of a low-dimensional vector. From the vector process defined by the system of stochastic differential equations, a discrete-time linear-Gaussian state-space model is derived whose likelihood can be evaluated in linear time. In the theoretical part of the thesis it is shown, using the spectral representation of stationary Gaussian processes, that the definitions based on systems of stochastic differential equations and on covariance functions are equivalent for certain stationary Gaussian processes. Exact state-space forms are presented for Matérn-type covariance functions and for a periodic covariance function. The theoretical part also presents the basic operations of state-space modelling, from Kalman filtering to smoothing and prediction, together with efficient algorithms for performing them. In the applied part of the thesis, state-space Gaussian processes were used to model and forecast user data throughput at base stations of a 3G cellular network. Following Bayesian practice, uncertainty about the model parameters was expressed by placing prior distributions on the parameters. The 15 time series in the data were fitted both with a model defined for individual series and with a multi-series model, in which a posterior distribution was derived for the covariance between the series. On the 15-series data, the five-week forecasts of the multi-series model were on average slightly better than those of the single-series model, and the forecasts of both models were on average better than those of the widely used ARIMA models.
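A minimal Kalman filter recursion for a discrete-time linear-Gaussian state-space model of the kind derived above is sketched below. The matrices are placeholders, not the exact Matérn or periodic state-space forms of the thesis; each observation costs a constant amount of work, so the filtered means (and the likelihood) are obtained in time linear in the length of the series.

```python
# Minimal Kalman filter for a linear-Gaussian state-space model
#   x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
#   y_t = H x_t + v_t,      v_t ~ N(0, R)
# The matrices are placeholders, not the Matérn or periodic forms of the thesis.
import numpy as np

def kalman_filter(y, A, H, Q, R, m0, P0):
    m, P = m0, P0
    means = []
    for yt in y:
        m, P = A @ m, A @ P @ A.T + Q              # predict
        S = H @ P @ H.T + R                        # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
        m = m + K @ (yt - H @ m)                   # update with observation
        P = P - K @ S @ K.T
        means.append(m)
    return np.array(means)                         # filtered state means
```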
  • Rautavirta, Juhana (2022)
    Comparison of amphetamine profiles is a task in forensic chemistry whose goal is to decide whether two samples of amphetamine originate from the same source or not. These decisions help identify and prosecute the suppliers of amphetamine, which is an illicit drug in Finland. The traditional approach to comparing amphetamine samples involves computing the Pearson correlation coefficient between two real-valued sample vectors obtained by gas chromatography-mass spectrometry analysis. A two-sample problem, such as the problem of comparing drug samples, can also be tackled with methods such as a t-test or Bayes factors. Recently, a newer method called predictive agreement (PA) has been applied in the comparison of amphetamine profiles, comparing the posterior predictive distributions induced by two samples. In this thesis, we carried out a statistical validation of the use of this newer method in amphetamine profile comparison. We compared the performance of the predictive agreement method to the traditional method based on the Pearson correlation coefficient. Techniques such as simulation and cross-validation were used in the validation. In the simulation part, we simulated enough data to compute 10 000 PA and correlation values between sample pairs. Cross-validation was used in a case study, where a repeated 5-fold group cross-validation was used to study the effect of changes in the data used to train the model. In the cross-validation, the performance of the models was measured with area under curve (AUC) values of receiver operating characteristic (ROC) and precision-recall (PR) curves. For the validation, two separate datasets collected by the National Bureau of Investigation of Finland (NBI) were available. One of the datasets was a larger collection of amphetamine samples, whereas the other was a more curated group of samples for which we also know which samples are linked to each other. On top of these datasets, we simulated data representing amphetamine samples that were either from different or the same source. The results showed that with the simulated data, predictive agreement outperformed the traditional method in distinguishing sample pairs consisting of samples from different sources from sample pairs consisting of samples from the same source. The case study showed that changes in the training data have quite a marginal effect on the performance of the predictive agreement method, and that with real-world data the PA method outperformed the traditional method in terms of AUC-ROC and AUC-PR values. Additionally, we concluded that the PA method has the benefit of interpretability: the PA value between two samples can be interpreted as the probability of the samples originating from the same source.
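The sketch below illustrates the traditional comparison score (the Pearson correlation between two impurity-profile vectors) and how a set of pairwise scores can be evaluated with ROC and PR area-under-curve values, as in the validation above. The arrays are hypothetical, and the PA computation itself is not reproduced.

```python
# Sketch of the traditional comparison score (Pearson correlation between two
# impurity-profile vectors) and of evaluating pairwise scores with ROC/PR AUC.
# Arrays are hypothetical; the predictive agreement (PA) score is not shown.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score, average_precision_score

profile_a = np.array([0.12, 0.80, 0.05, 0.33])
profile_b = np.array([0.10, 0.75, 0.07, 0.40])
r, _ = pearsonr(profile_a, profile_b)          # traditional similarity score

# scores for many pairs and labels (1 = same source, 0 = different source)
scores = np.array([0.92, 0.15, 0.78, 0.40, 0.88])
labels = np.array([1, 0, 1, 0, 1])
print(r, roc_auc_score(labels, scores), average_precision_score(labels, scores))
```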
  • Halme, Eetu (2024)
    Solving partial differential equations using methods of probability theory has a long history. In this thesis we show that the solutions of the conductivity equation in a Lipschitz domain D with Neumann boundary conditions and a uniformly elliptic, measurable conductivity parameter $\kappa$ can be represented using a Feynman-Kac formula for a reflecting diffusion process X on the domain D. We begin with history and the connection to statistical experiments in Chapter 1. Chapter 2 introduces Banach and Hilbert spaces with the spectral theory of bounded operators, together with Hölder and Sobolev spaces; Sobolev spaces provide the right setting for the boundary data and the solutions. In Chapter 3, we introduce the basics of stochastic processes, martingales and continuous semimartingales. We also need the theory of Markov processes, on which Hunt and Feller processes are based. Hunt processes are used, through their correspondence with Dirichlet forms, to define the reflecting diffusion process X. We also introduce the concept of the local time of a process. In Chapter 4 we introduce Dirichlet forms, their correspondence with self-adjoint operators, and the Revuz measure. In Chapter 5, we introduce the conductivity equation and the Dirichlet-to-Neumann map $\Lambda_\kappa$. The goal of Calderón's problem is to reconstruct the conductivity parameter $\kappa$ from the map $\Lambda_\kappa$, which is a difficult, non-linear and ill-posed inverse problem. Chapter 6 constitutes the main body of the thesis, and here we prove the Feynman-Kac formula for solutions of the conductivity equation. We use the correspondence between Dirichlet forms and self-adjoint operators to define a semigroup (T_t) of solutions to an abstract Cauchy equation, and with the semigroup (T_t) we can associate, by the Dunford-Pettis theorem, a transition density function p for the reflecting diffusion process X. We show, using De Giorgi-Nash-Moser estimates, that p is Hölder continuous and defined everywhere, and we prove that p converges exponentially to the stationary distribution. We generalise the concept of boundary local time using the Revuz measure and prove the occupation formula. These results, together with the Skorohod decomposition for Lipschitz conductivities, are used in the four-part proof of the Feynman-Kac formula. In Chapter 7, we introduce the boundary trace process, which is a pure jump process corresponding to the hitting times on the boundary of the reflecting diffusion process. We state that the Dirichlet-to-Neumann map $-\Lambda_\kappa$ is the infinitesimal generator of the trace process, which thus provides a probabilistic interpretation of Calderón's problem. We end with a discussion of applications of the theory and potential directions for new research. The main references of the thesis are the articles of Piiroinen and Simon, ''From Feynman–Kac Formulae to Numerical Stochastic Homogenization in Electrical Impedance Tomography'' and ''Probabilistic interpretation of the Calderón problem''.
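For reference, the Neumann boundary value problem for the conductivity equation discussed above can be written in the standard form (with $g$ the prescribed boundary current and $\nu$ the outward unit normal):

```latex
\begin{aligned}
\nabla \cdot \bigl(\kappa \nabla u\bigr) &= 0 && \text{in } D, \\
\kappa \,\frac{\partial u}{\partial \nu} &= g && \text{on } \partial D.
\end{aligned}
```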
  • Ovaskainen, Osma (2024)
    Objective: The objective of this thesis is to create methods for transforming the most accessible digitalized version of an apartment, the floor plan, into a format that can be analyzed by statistical modeling, and to use the created data to find whether there are any spatial or temporal effects in the geometry of apartment floor plans. Methods: The first part of the thesis was carried out using a mix of computer vision image manipulation methods combined with text recognition. The second part was performed using a one-way ANOVA model. Results: With the computer vision portion we were able to successfully classify a portion of the data; however, the recognition still leaves a lot of room for improvement. From the created data we were able to identify some key differences with respect to our parameters, location and year of construction. The analysis, however, suffers from a quite limited dataset in which a few housing corporations play a large role in the final results, so it would be wise to repeat the experiment with a more comprehensive dataset for more accurate results.
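A minimal sketch of the second stage is shown below: a one-way ANOVA comparing a floor-plan geometry measure across groups such as construction decades. The file and column names are hypothetical.

```python
# Minimal sketch of the second stage: a one-way ANOVA comparing a floor-plan
# geometry measure across groups (e.g. construction decades).  Columns are
# hypothetical placeholders.
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("floorplan_features.csv")
groups = [g["room_count"].to_numpy() for _, g in df.groupby("decade")]
stat, p = f_oneway(*groups)                    # F statistic and p-value
print(stat, p)
```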
  • Tan, Shu Zhen (2021)
    In practice, outlying observations are not uncommon in many study domains. Without knowing the underlying factors behind the outliers, it is appealing to eliminate them from the dataset. However, unless there is scientific justification, outlier elimination amounts to alteration of the data; otherwise, heavy-tailed distributions should be adopted to model the larger-than-expected variability in an overdispersed dataset. The Poisson distribution is the standard model for the variation in count data. However, the empirical variability in observed datasets is often larger than the amount expected under the Poisson, which leads to unreliable inferences when estimating the true effect sizes of covariates in regression modelling. It follows that the negative binomial distribution is often adopted as an alternative for overdispersed datasets. Nevertheless, it has been proven that both the Poisson and negative binomial observation distributions are not robust against outliers, in the sense that outliers have a non-negligible influence on the estimation of the covariate effect size. On the other hand, the scale mixture of quasi-Poisson distributions (called the robust quasi-Poisson model), which is constructed similarly to the Student's t-distribution, is a heavy-tailed alternative to the Poisson and has been proven to be robust against outliers. The thesis presents theoretical evidence on the robustness of the three aforementioned models in a Bayesian framework. Lastly, the thesis considers two simulation experiments with different kinds of outlier source, process error and covariate measurement error, to compare the robustness of the Poisson, negative binomial and robust quasi-Poisson regression models in the Bayesian framework. The robustness of the models was assessed, in terms of their ability to infer the covariate effect size correctly, under different combinations of error probability and error variability. It was shown that the robust quasi-Poisson regression model was more robust than its counterparts in both experiments, because its breakdown point was higher than that of the others.
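To illustrate the setting only, the sketch below contrasts Poisson and negative binomial GLM fits on overdispersed counts containing a few gross outliers. The robust quasi-Poisson (scale-mixture) model studied in the thesis, and its Bayesian treatment, are not standard library routines and are not reproduced here.

```python
# Illustration of the setting only: Poisson vs. negative binomial GLM fits on
# overdispersed counts with a few gross outliers.  The robust quasi-Poisson
# (scale-mixture) model studied in the thesis is not a standard library routine.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
X = sm.add_constant(rng.normal(size=n))
mu = np.exp(X @ np.array([0.3, 0.6]))
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # overdispersed counts with mean mu
y[:5] *= 20                                      # a few gross outliers

pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(pois.params, nb.params)                    # compare estimated effect sizes
```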
  • Törmi, Henrik (2024)
    This thesis examines how backcasts produced with VAR models are suited to the backward correction, for the years 2020-2000, of the time series of Statistics Finland's most recent Labour Force Survey, which came into force in 2021. The series considered are the monthly employment and unemployment figures of working-age men and women. Statistics Finland has reliably backward-corrected these series for the years 2020-2009. The thesis compares the backcasts of the estimated VAR models with Statistics Finland's official figures, namely the figures according to the Labour Force Survey in force before 2021 and Statistics Finland's backward-corrected series for 2020-2009. The purpose of the backward correction is to make the pre-2021 series consistent with the newest Labour Force Survey that came into force in 2021. The thesis does not attempt to find the best possible way to backward-correct the series of the new survey, nor to take a stand on whether the series should be backward-corrected using VAR backcasts. The data available were a systematic random sample from Statistics Finland's Labour Force Survey, containing monthly information on the respondents' labour market status and other key variables from January 2000 to February 2023, together with data from the Ministry of Economic Affairs and Employment on the numbers of registered unemployed over the same period. On the basis of these data, VAR models were estimated and the series were backcast using conditional expectations given the available data. In the estimation and backcasting, observed values of exogenous variables were used, namely the Ministry's registered unemployment counts and some of the key variables from the available Labour Force Survey data. The models were estimated by ordinary least squares, and their adequacy was checked by verifying stationarity, testing homoskedasticity and normality, and examining the standardized residuals and their auto- and cross-correlations. The more observations were used in estimating a VAR model, the closer the backcast values were to the official Labour Force Survey figures and the fewer radical changes they showed. The backcast series resemble the official figures in many respects: they follow similar seasonal variation and, for many series, a similar trend, although the level differences are at times large. The registered unemployment counts of the Ministry of Economic Affairs and Employment are a significant explanatory variable in the estimated VAR models and explain a substantial part of the backcast values.
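A minimal sketch of this kind of VAR-based backcasting is shown below: the series are reversed in time so that forecasts extend backwards. The thesis additionally uses exogenous registered-unemployment counts and conditional expectations given the available data, which this sketch does not capture; the data file and column names are hypothetical.

```python
# Minimal sketch (hypothetical data): a VAR fitted by OLS on monthly employment
# and unemployment series, with the time order reversed so that "forecasts"
# extend backwards in time.  The thesis's exogenous regressors and conditional
# expectations are not reproduced here.
import pandas as pd
from statsmodels.tsa.api import VAR

cols = ["employed_men", "employed_women", "unemployed_men", "unemployed_women"]
df = pd.read_csv("lfs_series.csv", index_col="month", parse_dates=True)
rev = df[cols].iloc[::-1].reset_index(drop=True)     # reverse time for backcasting

res = VAR(rev).fit(maxlags=12, ic="aic")             # OLS estimation, lag order by AIC
backcast = res.forecast(rev.values[-res.k_ar:], steps=24)   # 24 months further back
print(backcast[:3])
```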
  • Kari, Daniel (2020)
    Estimating the effect of random chance ('luck') has long been a question of particular interest in various team sports. In this thesis, we aim to determine the role of luck in a single ice hockey game by building a model to predict the outcome based on the course of events in a game. The obtained prediction accuracy should also, to some extent, reveal the effect of random chance. Using the course of events from over 10,000 games, we train feedforward and convolutional neural networks to predict the outcome and the final goal differential, which has been proposed as a more informative proxy for the outcome. Interestingly, we are not able to obtain distinctly higher accuracy than previous studies, which have focused on predicting the outcome with information available before the game. The results suggest that there might exist an upper bound on prediction accuracy even if we knew 'everything' that went on in a game. This further implies that random chance could affect the outcome of a game, although assessing this is difficult, as we do not have a good quantitative metric for luck in the case of single-game ice hockey prediction.
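A small sketch of the model type only is given below: a feedforward classifier predicting the game outcome from per-game event features. The thesis's exact features and network architectures are not reproduced, and the data here are random placeholders.

```python
# Small sketch of the model type only: a feedforward classifier predicting the
# game outcome from per-game event features.  The thesis's exact features and
# architectures are not reproduced; the data are random placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(10_000, 40))        # stand-in for per-game event features
y = rng.integers(0, 2, size=10_000)      # stand-in for home-win labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))             # held-out prediction accuracy
```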
  • Kartau, Joonas (2024)
    A primary goal of human genetics research is the investigation of associations between genetic variants and diseases. Due to the high number of genetic variants, sophisticated statistical methods for high dimensional data are required. A genome-wide association study (GWAS) is the initial analysis used to measure the marginal associations between genetic variants and biological traits, but because it ignores correlation between variants, identification of truly causal variants remains difficult. Fine-mapping refers to the statistical methods that aim to identify causal variants from GWAS results by incorporating information about correlation between variants. One such fine-mapping method is FINEMAP, a widely used Bayesian variable selection model. To make computations efficient, FINEMAP assumes a constant sample size for the measured genetic variants, but in a meta-analysis that combines data from several studies, this assumption may not hold. This results in miscalibration of the FINEMAP model with meta-analyzed data. In this thesis, a novel extension for FINEMAP is developed, named FINEMAP-MISS. With an additional inversion of the variants' correlation matrix and other less demanding computational adjustments, FINEMAP-MISS makes it possible to fine-map meta-analyzed GWAS data. To test the effectiveness of FINEMAP-MISS, genetic data from the UK Biobank is used to generate sets of simulated data, where a single variant has a non-zero effect on the generated trait. For each simulated dataset, a meta-analysis with missing information is emulated, and fine-mapping is performed with FINEMAP and FINEMAP-MISS. The results verify that with missing data FINEMAP-MISS clearly performs better than FINEMAP in identification of causal variants. Additionally, with missing data the posterior probability estimates provided by FINEMAP-MISS are properly calibrated, whereas the estimates by FINEMAP exhibit miscalibration. FINEMAP-MISS enables the use of fine-mapping for meta-analyzed genetic studies, allowing for greater power in the detection of causal genetic variants.