  • Riippa, Väinö (2016)
    This Master's thesis is an empirical literature review that studies open data in the field of healthcare. The study presents what open data is and how it has become the concept it stands for today. The first chapter looks at open data from a general viewpoint. The next chapter compares open data processes from the points of view of the publisher and the consumer. After the processes, we look at open data in the healthcare and welfare sectors. The study is carried out by examining the current practices, application solutions and expectations of open data. It offers the reader an informative review of process models for open data. After reading the thesis, the reader can use a process model in the data openings of their own organization.
  • Veteli, Peitsa (2020)
    A gap is perceived to exist between the worlds of teaching and research, and it can be considered a partial cause of the commonly observed low student motivation towards the natural sciences. In the same context, the concepts of authenticity and relevance arise, which can be used to describe the “genuineness” or meaningfulness of activities carried out in different ways. This work presents tools for meaningful programming developed in the Open Data in Education (Avoin data opetuksessa) project of the Helsinki Institute of Physics (HIP), which make use of, among other things, open particle physics data from the CMS (Compact Muon Solenoid) experiment at CERN. The transfer of these materials to teachers is supported by training courses, and the feedback collected from them is analysed in this thesis in the context of the broader discussion on authenticity in science education and the use of open data. The use of open data and its educational study are very young fields, and this work places itself at their forefront. Data were collected from both Finnish (n = 64) and international (n = 12) upper secondary teachers, and comments from student workshops (n = 62) are used as a point of comparison. The method is thematic analysis, whose results are comparable with other research literature on science education. The research question is: How does authenticity appear in teachers' feedback from courses on the educational use of open particle physics data, and how does it compare with the research literature on science education? The results show that the teachers' views align closely with what could have been expected on the basis of the reference literature, with the most common associations of authenticity emphasising ways of working comparable to those of researchers and “real world” challenges. The nearly unanimous positivity of the feedback gives a strong indication of the usefulness of the opportunities offered by the project and supports the value of the further research needed in the field.
  • Suoniemi, Sanni (2014)
    Physics is easily perceived as a theoretical subject, even though it is fundamentally an experimental science. Organising experimental work in the classroom can nevertheless be challenging, especially in modern physics. Open particle physics research data makes experimental work and genuine research possible in particle physics. By applying inquiry-based learning pedagogy, data-handling skills, collaboration skills and particle physics can be combined into an activity that can be carried out in the classroom. The thesis was carried out as design research consisting of two development cycles. The development phases included a case study conducted in connection with a Masterclass event and a survey aimed at upper secondary school physics teachers. The survey sought to determine the attitudes of upper secondary physics teachers towards the educational use of open particle physics research data. According to the results of the teacher survey, open particle physics research data would be well suited to upper secondary studies. The topic clearly interested teachers, and it was believed to interest students as well. The majority (80.3%) would be willing to use open particle physics research data in their teaching. Based on the teachers' experiences, the development of teaching should take into account time constraints, IT constraints, students' differing levels of skill, knowledge and motivation, the challenges posed by the teacher's own level of knowledge, good instructions for the material, and a focus on core content. According to the study, there is a need for further training and especially for Finnish-language support material. In the teachers' view, open particle physics research data would be suitable for use in several physics courses and a few advanced mathematics courses. Limited scheduling resources would restrict the time spent on the topic in the Matter and Radiation (Aine ja säteily) course to just under two lessons. In school-specific physics courses, teachers would be willing to spend as many as about eight lessons on the topic. With sufficient support and guidance, and with the development of Finnish-language material suited to the classroom, open particle physics research data could be used more widely in upper secondary mathematics and science courses. The topic could help increase the number of physics students and even out the gender distribution of students. The study produced visions for using open particle physics research data in teaching. As a result of the development work, a didactic reconstruction of the educational use of open particle physics research data was formulated, using as an example open research data from the CMS experiment at the particle research centre CERN. The study also yielded information on upper secondary physics teachers' attitudes towards particle physics, on particle physics teaching in upper secondary schools, and on students' attitudes towards informal particle physics teaching in connection with the Masterclass event.
  • Khalil, Hossam (2024)
    Understanding the baryonic physics on the galaxy group level is a prerequisite for cosmological studies of large-scale structures. While the majority of baryons in galaxy groups are located in their intragroup medium (IGrM), one poorly understood aspect of galaxy groups is their hot intragroup X-ray emission. In this thesis, a new all-sky catalogue of X-ray detected groups (AXES-2MRS) is presented, based on the identification of large X-ray sources discovered in the ROSAT All-Sky Survey (RASS) with the 2MRS Bayesian Group Catalogue. In addition to X-ray luminosity coming from the shallow survey data of RASS, detailed X-ray properties of the groups have been obtained by matching the catalogue to archival X-ray observations conducted by XMM-Newton. The relationship between X-ray and optical properties of AXES-2MRS is explored through scaling relations, namely $\sigma_{v}-L_{X}$, $\sigma_{v}-kT$, $\sigma_{v}-M$, and $kT-L_{X}$, which denote velocity dispersion vs. X-ray luminosity, velocity dispersion vs. X-ray temperature, velocity dispersion vs. hydrostatic mass, and X-ray temperature vs. X-ray luminosity, respectively. The scaling relations reveal similarities between our low-redshift catalogue and high-redshift studies, implying that our knowledge about galaxy groups is redshift-invariant. This study enhances the representation of the underexplored low-z, low-luminosity galaxy groups, particularly low-mass systems ($< 10^{14} M_{\odot}$), and thereby improves the completeness of galaxy group catalogues, addressing the persistent issue of missing faint, low-mass systems. Moreover, previous catalogues based on detecting the peak of the X-ray emission preferentially sample groups with high dark matter (DM) halo concentration, while AXES-2MRS includes many groups with low DM halo concentration.
  • Häggblom, Matilda (2022)
    Modal inclusion logic is modal logic extended with inclusion atoms. It is the modal variant of first-order inclusion logic, which was introduced by Galliani (2012). Inclusion logic is a main variant of dependence logic (Väänänen 2007). Dependence logic and its variants adopt team semantics, introduced by Hodges (1997). Under team semantics, a modal (inclusion) logic formula is evaluated in a set of states, called a team. The inclusion atom is a type of dependency atom, which describes that the possible values a sequence of formulas can obtain are values of another sequence of formulas. In this thesis, we introduce a sound and complete natural deduction system for modal inclusion logic, which is currently missing in the literature. The thesis consists of an introductory part, in which we recall the definitions and basic properties of modal logic and modal inclusion logic, followed by two main parts. The first part concerns the expressive power of modal inclusion logic. We review the result of Hella and Stumpf (2015) that modal inclusion logic is expressively complete: A class of Kripke models with teams is closed under unions, closed under k-bisimulation for some natural number k, and has the empty team property if and only if the class can be defined with a modal inclusion logic formula. Through the expressive completeness proof, we obtain characteristic formulas for classes with these three properties. This also provides a normal form for formulas in MIL. The proof of this result is due to Hella and Stumpf, and we suggest a simplification to the normal form by making it similar to the normal form introduced by Kontinen et al. (2014). In the second part, we introduce a sound and complete natural deduction proof system for modal inclusion logic. Our proof system builds on the proof systems defined for modal dependence logic and propositional inclusion logic by Yang (2017, 2022). We show the completeness theorem using the normal form of modal inclusion logic.
  • Alasuvanto, Toni (Helsingin yliopisto / Helsingfors universitet / University of Helsinki, 2008)
    This Master's thesis is a literature review of intramolecular aza-Wittig ring closures in the synthesis of fused nitrogen heterocycles. The work covers material mainly from 1980 onwards. The aza-Wittig reaction resembles the analogous Wittig reaction. Aza-Wittig reactions have been carried out almost exclusively between carbonyl groups and iminophosphoranes. The reaction mechanism proceeds as a two-step addition, which begins with the nucleophilic attack of the imino nitrogen on the carbonyl carbon and ends with a zwitterionic betaine forming an oxazaphosphetane intermediate. The intermediate decomposes spontaneously into an imine product and phosphine oxide. The reaction is often carried out under mild conditions, i.e. at room temperature, and moreover without expensive equipment. A common way to prepare the iminophosphorane is from an azide via the Staudinger reaction, and the Staudinger and aza-Wittig reactions are often combined so that the iminophosphorane is not isolated before the ring closure. All ring-closure reactions other than aza-Wittig, such as electrocyclic ring closures, have been excluded from this work. The material in the literature review is organised according to the type of carbonyl group with which the iminophosphorane reacts. This illustrates what kinds of typical product molecules have been obtained by ring closure onto a given type of carbonyl group. It has been found to be favourable if the carbonyl carbon is electron-deficient and the iminophosphorane nitrogen is electron-rich. The entropy of the ring-closing molecule, the electronic and steric nature of nearby substituents, and the thermodynamic favourability of the product molecule together influence the course of the reaction. If the reaction is expected to be slow, it is better to use alkyliminophosphoranes than aryliminophosphoranes. The choice of solvent has almost invariably been left unexplained in the publications reviewed, but in most cases ortho-xylene or toluene have been good solvents. Undesired tetrazole formation from the azide can be minimised by using a nonpolar solvent. Likewise, favourably placed nitrogen protecting groups can prevent intramolecular hydrogen bonds. In recent years the aza-Wittig reaction has increasingly been applied to the preparation of pharmacological products, which has increased interest in preparing large numbers of closely related product molecules. Such molecular libraries have advantageously been prepared on solid phase, which has made purification of the product markedly easier. A new area in aza-Wittig synthesis is asymmetric reactions, which will surely receive more attention in the future. This literature review found that many of the syntheses carried out in the field could be repeated with greater variation and more systematically with regard to reagents and reaction conditions.
  • Silander, Otto (2019)
    This study provides an overview of Babylonian mathematics, examines its achievements and special characteristics, and considers its greatest merits. It also investigates how Babylonian mathematics influenced the development of mathematics and how Babylonian inventions came into use by Greek mathematicians in particular. In addition to Babylonian mathematics, Babylonian astronomy is studied where applicable. The study also examines whether Babylonian mathematics has connections to the present day, and in particular to the practice of dividing the hour into 60 minutes, the minute into 60 seconds, and the full circle into 360 degrees. The study was carried out as a literature review, using as widely as possible both the standard works on Babylonian mathematics and the most recent articles. The transfer of mathematical achievements was approached by studying well-known key figures of Greek mathematics and astronomy and their connections to Babylonian mathematics. On this basis a coherent picture was formed of the achievements of Babylonian mathematics and the transfer of knowledge. Babylonian mathematics used an original and advanced sexagesimal system, whose base was 60 and which was the first known positional numeral system. Babylonian mathematicians can justifiably be called the best calculators of antiquity. They knew many well-known theorems such as the Pythagorean theorem and Thales's theorem, could solve quadratic equations, and used various efficient algorithms for computing approximations more than a thousand years before the Greeks. Thales and Pythagoras, whom the Greeks regarded as the first mathematicians, apparently learned their best-known results from the Babylonians, and their significance lies primarily in being carriers of knowledge and organisers of the different parts of mathematics. Babylonian astronomy was advanced, and the Greek Hipparchus made use not only of observations made by the Babylonians but also of the Babylonian method of calculation in his own research. On the basis of these choices the circle is still divided into 360 degrees, each of which is divided into 60 parts. By the same principle, based on Babylonian mathematics, hours and minutes are also divided into 60 parts.
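As an illustration of the base-60 place-value idea described in the abstract above, the following minimal Python sketch converts a non-negative integer to sexagesimal digits and back. The function names `to_sexagesimal` and `from_sexagesimal` are illustrative assumptions and do not come from the thesis.

```python
def to_sexagesimal(n):
    """Return the base-60 digits of a non-negative integer, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, remainder = divmod(n, 60)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]


def from_sexagesimal(digits):
    """Rebuild the integer from base-60 digits (most significant first)."""
    value = 0
    for d in digits:
        value = value * 60 + d
    return value
```

For instance, 3661 seconds decompose into the digits 1, 1, 1, which is the same arithmetic as reading the value as 1 hour, 1 minute and 1 second.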
  • Koivisto, Timo (2016)
    This thesis is a review of bandit algorithms in information retrieval. In information retrieval, a result list should include the most relevant documents, and the results should also be non-redundant and diverse. To achieve this, some form of feedback is required. This document describes implicit feedback collected from user interactions by using interleaving methods that allow alternative rankings of documents to be presented in result lists. Bandit algorithms can then be used to learn from user interactions in a principled way. The reviewed algorithms include dueling bandits, contextual bandits, and contextual dueling bandits. Additionally, coactive learning and preference learning are described. Finally, the algorithms are summarized by using regret as a performance measure.
  • Räsänen, Jenni (2014)
    The thesis examines two concepts of plane geometry: the barycentric coordinate system and the conjugation of a point with respect to a triangle. Barycentric coordinates are a homogeneous coordinate system by which the position of a point in the plane is given relative to a given triangle. Conjugation of a point with respect to a triangle is a mapping that maps points of the plane to other points under certain conditions that can typically be characterised geometrically. The concepts are related in that some interesting conjugation mappings can be defined using barycentric coordinates. Barycentric coordinates were introduced in the early 19th century by several people. They give the position of a point in the plane relative to a given triangle as an ordered triple of numbers, unlike the more commonly used Cartesian coordinates, which give the position of a point relative to a given origin (0,0) as a pair of numbers. Barycentric coordinates can be expressed in several mutually equivalent ways, but they are always determined with respect to some triangle. The definition can be made either by means of the ratios in which the sides of the triangle are divided, as determined by the point under study and the vertices of the triangle, or by using the ratios of the areas of the triangles formed by the point under study and the vertices of the triangle. The third chapter of the thesis presents the system of barycentric coordinates and gives examples of the coordinates of interesting points. Another homogeneous coordinate system similar to barycentric coordinates, trilinear coordinates, is also briefly introduced. The fourth chapter derives conversion formulas between trilinear and barycentric coordinates and between barycentric and Cartesian coordinates. Conjugation of a point with respect to a triangle is a special case of a point transformation. The fifth chapter first considers the concept of a point transformation in general, so that the conjugation of a point with respect to a triangle can be better understood. Isotomic and isogonal conjugation are interesting, much-studied special cases of conjugation of a point with respect to a triangle that are widely applied in geometry. They are also of interest for this work, since both barycentric and trilinear coordinates are used in their definition. Isotomic and isogonal conjugation are presented in the final chapter of the thesis.
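The area-ratio definition mentioned in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the function names are assumptions, points are plain `(x, y)` tuples, and the coordinates are normalised so that they sum to 1. The isotomic conjugate uses the standard barycentric formula (1/u : 1/v : 1/w), which requires all three coordinates to be non-zero.

```python
def signed_area(p, q, r):
    """Signed area of triangle pqr (positive if counter-clockwise)."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))


def barycentric(p, a, b, c):
    """Normalised barycentric coordinates of p with respect to triangle abc."""
    total = signed_area(a, b, c)
    if total == 0:
        raise ValueError("degenerate triangle")
    u = signed_area(p, b, c) / total  # weight of vertex a
    v = signed_area(a, p, c) / total  # weight of vertex b
    w = signed_area(a, b, p) / total  # weight of vertex c
    return u, v, w


def to_cartesian(bary, a, b, c):
    """Convert (possibly unnormalised) barycentric coordinates back to (x, y)."""
    u, v, w = bary
    s = u + v + w
    return ((u * a[0] + v * b[0] + w * c[0]) / s,
            (u * a[1] + v * b[1] + w * c[1]) / s)


def isotomic_conjugate(bary):
    """Isotomic conjugate (1/u : 1/v : 1/w), normalised to sum to 1."""
    u, v, w = bary
    inv = (1.0 / u, 1.0 / v, 1.0 / w)
    s = sum(inv)
    return tuple(x / s for x in inv)
```

With the triangle a = (0, 0), b = (1, 0), c = (0, 1), the point (0.25, 0.25) has coordinates (0.5, 0.25, 0.25), and the centroid (1/3, 1/3, 1/3) is mapped to itself by the isotomic conjugation.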
  • Sotala, Kaj (2015)
    This thesis describes the development of 'Bayes Academy', an educational game which aims to teach an understanding of Bayesian networks. A Bayesian network is a directed acyclic graph describing a joint probability distribution function over n random variables, where each node in the graph represents a random variable. To find a way to turn this subject into an interesting game, this work draws on the theoretical background of meaningful play. Among other requirements, actions in the game need to affect the game experience not only in the immediate moment, but also at later points in the game. This is accomplished by structuring the game as a series of minigames where observing the value of a variable consumes 'energy points', a resource whose use the player needs to optimize as the pool of points is shared across individual minigames. The goal of the game is to maximize the amount of 'experience points' earned by minimizing the uncertainty in the networks that are presented to the player, which in turn requires a basic understanding of Bayesian networks. The game was empirically tested on online volunteers who were asked to fill in a survey measuring their understanding of Bayesian networks both before and after playing the game. Players demonstrated an increased understanding of Bayesian networks after playing the game, in a manner that suggested a successful transfer of learning from the game to a more general context. The learning benefits were gained despite the players generally not finding the game particularly fun. ACM Computing Classification System (CCS): Applied computing → Computer games; Applied computing → Interactive learning environments; Mathematics of computing → Bayesian networks.
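To make the definition of a Bayesian network in the abstract above concrete, here is a minimal Python sketch of the smallest non-trivial case: two binary nodes, Rain → WetGrass, where observing the child updates the belief about the parent. The network, the function name and all probabilities are made-up assumptions for illustration; they are not taken from the game or the thesis.

```python
def posterior_parent_given_child(p_parent, p_child_given_parent, p_child_given_not_parent):
    """P(parent = true | child = true) in a two-node network parent -> child."""
    # Joint probabilities follow the factorisation P(parent) * P(child | parent).
    joint_true = p_parent * p_child_given_parent
    joint_false = (1.0 - p_parent) * p_child_given_not_parent
    return joint_true / (joint_true + joint_false)


# Hypothetical numbers: P(Rain) = 0.2, P(Wet | Rain) = 0.9, P(Wet | no Rain) = 0.1.
p_rain_given_wet = posterior_parent_given_child(0.2, 0.9, 0.1)
```

Observing wet grass raises the probability of rain from the prior 0.2 to 0.18 / 0.26, roughly 0.69, which is the kind of uncertainty reduction through observation that the abstract describes.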
  • Benner, Christian (2013)
    Background. DNA microarrays measure the expression levels of tens of thousands of genes simultaneously. Some differentially expressed genes may be useful as markers for the diagnosis of diseases. Available statistical tests examine genes individually, which causes challenges due to multiple testing and variance estimation. In this Master's thesis, Bayesian confirmatory factor analysis (CFA) is proposed as a novel approach for the detection of differential gene expression. Methods. The factor scores represent summary measures that combine the expression levels from biological samples under the same condition. Differential gene expression is assessed by utilizing their distributional assumptions. A mean-field variational Bayesian approximation is employed for computationally fast estimation. Results. Its estimation performance is equal to Gibbs sampling. Point estimation errors of model parameters decrease with increasing number of variables. However, mean centering of the data matrix and standardization of factor scores resulted in an inflation of the false positive rate. Conclusion. Avoiding mean centering and revision of the CFA model is required so that location parameters of factor score distributions can be estimated. The utility of CFA for the detection of differential gene expression needs also to be confirmed by a comparison with different statistical procedures to benchmark its false positive rate and statistical power.
  • Chen, Jun (2015)
    The thesis studies three different conditional correlation Multivariate GARCH (MGARCH) models. They are the Constant Conditional Correlation (CCC-) GARCH, Dynamic Conditional Correlation (DCC-) GARCH and Asymmetric Dynamic Conditional Correlation (ADCC-) GARCH, in which the time-varying volatilities are modelled by three univariate GARCH models with the error term assumed to have a Gaussian distribution. In order to compare the performance of these models, we apply them to the volatility analysis of two stocks. Regarding model inference, we adopt a Bayesian approach and implement a Markov Chain Monte Carlo (MCMC) algorithm, Metropolis Within Gibbs (MWG), instead of the regular maximum likelihood (ML) method. Finally, the estimated models are employed to compute Value at Risk (VaR) and their performance is discussed.
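The univariate building block behind the models in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's Bayesian MCMC procedure: it takes fixed, hypothetical GARCH(1,1) parameters `omega`, `alpha`, `beta`, assumes zero-mean returns, starts the recursion from the unconditional variance, and computes a Gaussian Value at Risk from a given volatility.

```python
from statistics import NormalDist


def garch11_variances(returns, omega, alpha, beta):
    """Conditional variances from sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1].

    Returns len(returns) + 1 values; the last one is the one-step-ahead forecast.
    """
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for a finite unconditional variance")
    var = omega / (1.0 - alpha - beta)  # unconditional variance as the starting value
    variances = [var]
    for r in returns:
        var = omega + alpha * r * r + beta * var
        variances.append(var)
    return variances


def gaussian_var(sigma, level=0.99):
    """One-period Value at Risk (as a positive loss) for a zero-mean Gaussian return."""
    return -NormalDist().inv_cdf(1.0 - level) * sigma
```

For example, with `omega=0.1, alpha=0.1, beta=0.8` the unconditional variance is 1.0, and a zero return lowers the next conditional variance to 0.9.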
  • Mäki, Niklas (2023)
    Most graph neural network architectures take the input graph for granted and do not assign any uncertainty to its structure. In real life, however, data is often noisy and may contain incorrect edges or exclude true edges. Bayesian methods, which consider the input graph as a sample from a distribution, have not been deeply researched, and most existing research only tests the methods on small benchmark datasets such as citation graphs. As is often the case with Bayesian methods, they do not scale well for large datasets. The goal of this thesis is to research different Bayesian graph neural network architectures for semi-supervised node classification and test them on larger datasets, trying to find a method that improves on the baseline model and is scalable enough to be used with graphs of tens of thousands of nodes with acceptable latency. All the tests are done twice with different amounts of training data, since Bayesian methods often excel with low amounts of data and in real life labeled data can be scarce. The Bayesian models considered are based on the graph convolutional network, which is also used as the baseline model for comparison. This thesis finds that the impressive performance of the Bayesian graph neural networks does not generalize to all datasets, and that the existing research relies too much on the same small benchmark graphs. Still, the models may be beneficial in some cases, and some of them are quite scalable and could be used even with moderately large graphs.
  • Nevala, Aapeli (2020)
    Thanks to modern medical advances, humans have developed tools for detecting diseases so early that a patient would be better off had the disease gone undetected. This is called overdiagnosis. Overdiagnosis is a problem especially common in acts where the target population of an intervention consists of mostly healthy people. Colorectal cancer (CRC) is a relatively rare disease. Thus screening for CRC affects a mostly cancer-free population. In this thesis I evaluate overdiagnosis in a guaiac faecal occult blood test (gFOBT) based CRC screening programme. In gFOBT CRC screening there are two goals: to detect known predecessors of cancers called adenomas and to remove them (cancer prevention), and to detect malignant CRCs early enough to be still treatable (early detection). Overdiagnosis can happen when detecting adenomas, but also when detecting cancers. This thesis focuses on overdiagnosis due to detection of adenomas that are non-progressive in their nature. Since there is no clinical means to distinguish between progressive and non-progressive adenomas, statistical methods must be applied. Classical methods to estimate overdiagnosis fail in quantifying this type of overdiagnosis for a couple of reasons: incidence data of adenomas is not available, and adenoma removal results in lowering cancer incidence in the screened population. While the latter is a desired effect of screening, it makes it impossible to estimate overdiagnosis by just comparing cancer incidences among screened and control populations. In this thesis a Bayesian Hidden Markov model is fitted, using the HMC NUTS algorithm via the software Stan, to simulate the natural progression of colorectal cancer. The five states included in the model were healthy (1), progressive adenoma (2), screen-detectable CRC (3), clinically apparent CRC (4) and non-progressive adenoma (5). Possible transitions are from 1 to 2, 1 to 5, 2 to 3 and 3 to 4.
The possible observations are screen-negative (1), detected adenoma (2), screen-detected CRC (3) and clinically manifested CRC (4). Three relevant estimands for evaluating this type of overdiagnosis with a natural history model are presented. Then the methods are applied to estimate the overdiagnosis proportion in the gFOBT-based CRC screening programme conducted in Finland between 2004 and 2016. The resulting mean overdiagnosis probability for all the patients who had an adenoma detected in the programme is 0.48 (95% credible interval 0.38 to 0.56). Different estimates for overdiagnosis in sex- and age-specific strata of the screened population are also provided. In addition to these findings, the natural history model can be used to gain more insight into the natural progression of colorectal cancer.
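The state structure listed in the abstract above can be sketched as a transition matrix. The following Python fragment is a simplified discrete-time illustration only: the thesis fits its model in Stan, and the per-step probabilities used here (`p12`, `p15`, `p23`, `p34`) are hypothetical placeholders, not estimates from the thesis. Only the allowed transitions (1 to 2, 1 to 5, 2 to 3, 3 to 4) are taken from the abstract; states are zero-indexed in the code.

```python
STATES = [
    "healthy",                  # state 1 in the abstract
    "progressive adenoma",      # state 2
    "screen-detectable CRC",    # state 3
    "clinically apparent CRC",  # state 4
    "non-progressive adenoma",  # state 5
]


def transition_matrix(p12, p15, p23, p34):
    """Row-stochastic matrix allowing only 1->2, 1->5, 2->3 and 3->4 (plus staying put)."""
    P = [[0.0] * 5 for _ in range(5)]
    P[0][1], P[0][4] = p12, p15
    P[0][0] = 1.0 - p12 - p15
    P[1][2], P[1][1] = p23, 1.0 - p23
    P[2][3], P[2][2] = p34, 1.0 - p34
    P[3][3] = 1.0  # clinically apparent CRC is absorbing in this sketch
    P[4][4] = 1.0  # non-progressive adenomas never progress
    return P


def step(dist, P):
    """Propagate a distribution over the hidden states by one time step."""
    return [sum(dist[i] * P[i][j] for i in range(5)) for j in range(5)]


# Hypothetical per-step probabilities, for illustration only.
P = transition_matrix(p12=0.02, p15=0.03, p23=0.10, p34=0.20)
after_one_step = step([1.0, 0.0, 0.0, 0.0, 0.0], P)
```

Starting from an entirely healthy population, one step puts mass only on the healthy state and the two adenoma states, which reflects that cancer states are reachable only through a progressive adenoma.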
  • Mäkinen, Ville (2020)
    Objectives: The objective of this thesis is to illustrate the advantages of Bayesian hierarchical models in housing price modeling. Methods: Five Bayesian regression models are estimated for the housing prices. The models use a robust Student’s t-distribution likelihood and are estimated with Hamiltonian Monte Carlo. Four of the models are hierarchical such that the apartments’ neighborhoods are used as a grouping. Model stacking is also used to produce an ensemble model. Model checks are conducted using the posterior predictive distributions. The predictive distributions are also evaluated in terms of calibration and sharpness and using the logarithmic score with leave-one-out cross validation. The logarithmic scores are calculated using Pareto smoothed importance sampling. The R^2-statistics from the point predictions averaged from the predictive distributions are also presented. Results: The results from the models are broadly reasonable as, for the most part, the coefficients of the explanatory variables and the predictive distributions behave as expected. The results are also consistent with the existence of a submarket in central Helsinki where the price mechanism differs markedly from the rest of the Helsinki-Espoo-Vantaa region. However, model checks indicate that none of the models is well-calibrated. Additionally, the models tend to underpredict the prices of expensive apartments.
  • Santana Vega, Carlos (2018)
    The scope of this project is to provide a set of Bayesian methods for the task of predicting potential energy barriers. Energy barriers define a physical property of atoms that can be used to characterise their molecular dynamics, with applications in quantum-mechanics simulations for the design of new materials. The goal is to replace the currently used artificial neural network (ANN) with a method that, apart from providing accurate predictions, can also assess the predictive certainty of the model. We propose several Bayesian methods and evaluate them on this task, demonstrating that sparse Gaussian processes (SGP) are capable of providing predictions, together with their confidence intervals, with a level of accuracy equivalent to the current ANN and with bounded computational complexity.
  • Kokko, Jan (2019)
    In this thesis we present a new likelihood-free inference method for simulator-based models. A simulator-based model is a stochastic mechanism that specifies how data are generated. Simulator-based models can be as complex as needed, but they must allow exact sampling. One common difficulty with simulator-based models is that learning model parameters from observed data is generally challenging, because the likelihood function is typically intractable. Thus, traditional likelihood-based Bayesian inference is not applicable. Several likelihood-free inference methods have been developed to perform inference when a likelihood function is not available. One popular approach is approximate Bayesian computation (ABC), which relies on the fundamental principle of identifying parameter values for which summary statistics of simulated data are close to those of observed data. However, traditional ABC methods tend to have a high computational cost. The cost is largely due to the need to repeatedly simulate data sets, and the absence of knowledge of how to specify the discrepancy between the simulated and observed data. We consider speeding up an earlier method, likelihood-free inference by ratio estimation (LFIRE), by replacing its computationally intensive grid evaluation with Bayesian optimization. The earlier method is an alternative to ABC that relies on transforming the original likelihood-free inference problem into a classification problem that can be solved using machine learning. This method is able to overcome two traditional difficulties with ABC: it avoids using a threshold value that controls the trade-off between computational and statistical efficiency, and it combats the curse of dimensionality by offering an automatic selection of relevant summary statistics when using a large number of candidates.
Finally, we measure the computational and statistical efficiency of the new method by applying it to three different real-world time series models with intractable likelihood functions. We demonstrate that the proposed method can reduce the computational cost by some orders of magnitude while the statistical efficiency remains comparable to the earlier method.
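The ABC principle described in the abstract above, keeping parameter values whose simulated summary statistics fall close to the observed ones, can be sketched as a basic rejection sampler. This illustrates plain ABC only, not LFIRE or the Bayesian optimization method of the thesis. The toy simulator (a Gaussian with unknown mean and unit variance), the uniform prior on [-5, 5], the sample mean as summary statistic and the threshold `eps` are all assumptions chosen for brevity.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator: n draws from a Gaussian with mean theta and unit variance."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed, n_draws, eps, rng):
    """Keep prior draws whose simulated sample mean is within eps of the observed one."""
    s_obs = statistics.fmean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw a candidate from the prior
        s_sim = statistics.fmean(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) < eps:    # discrepancy below the threshold
            accepted.append(theta)
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # pretend this is the real data
posterior_sample = abc_rejection(observed, n_draws=5000, eps=0.2, rng=rng)
```

Most of the 5000 simulations are rejected, which shows in miniature why the abstract attributes the high cost of traditional ABC to repeated simulation and to the choice of threshold.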
  • Paulamäki, Henri (2019)
    Tailoring a hybrid surface or any complex material to have functional properties that meet the needs of an advanced device or drug requires knowledge and control of the atomic level structure of the material. The atomistic configuration can often be the decisive factor in whether the device works as intended, because the materials' macroscopic properties - such as electrical and thermal conductivity - stem from the atomic level. However, such systems are difficult to study experimentally and have so far been infeasible to study computationally due to costly simulations. I describe the theory and practical implementation of a 'building block'-based Bayesian Optimization Structure Search (BOSS) method to efficiently address heterogeneous interface optimization problems. This machine learning method is based on accelerating the identification of a material's energy landscape with respect to the number of quantum mechanical (QM) simulations executed. The acceleration is realized by applying a likelihood-free Bayesian inference scheme to evolve a Gaussian process (GP) surrogate model of the target landscape. During this active learning, various atomic configurations are iteratively sampled by running static QM simulations. An approximation of using chemical building blocks reduces the search phase space to manageable dimensions. This way the most favored structures can be located with as little computation as possible. Thus it is feasible to do structure search with large simulation cells, while still maintaining high chemical accuracy. The BOSS method was implemented as a Python code called aalto-boss between 2016 and 2019, where I was the main author, in co-operation with Milica Todorović and Patrick Rinke. I conducted a dimensional scaling study using analytic functions, which quantified the scaling of BOSS efficiency for fundamentally different functions as the dimension increases.
The results revealed the important role of the target function's derivative in the optimization efficiency. The outcome will help users choose simulation variables that are efficient to optimize, as well as estimate roughly how many BOSS iterations may be needed until convergence. The predictive efficiency and accuracy of BOSS were showcased in the conformer search of the alanine dipeptide molecule. The two most stable conformers and the characteristic 2D potential energy map were found with greatly reduced effort compared to alternative methods. The value of BOSS in novel materials research was showcased in the surface adsorption study of biphenyldicarboxylic acid on a CoO thin film using DFT simulations. We found two adsorption configurations which had a lower energy than previous calculations and approximately supported the experimental data on the system. The three applications showed that BOSS can significantly reduce the computational load of atomistic structure search while maintaining predictive accuracy. It allows material scientists to study novel materials more efficiently, and thus helps tailor the materials' properties to better suit the needs of modern devices.
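The active-learning loop described in this abstract can be illustrated with a minimal sketch. It is not the aalto-boss implementation or its API: the one-dimensional `toy_energy` function is a hypothetical stand-in for a QM simulation, and the squared-exponential kernel, the lower-confidence-bound acquisition rule, and all parameter values are assumptions chosen only for illustration.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalars (assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Zero-mean GP posterior mean and variance at x_new given (xs, ys)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = 1.0 - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


def minimize(energy, n_iter=15, kappa=2.0):
    """Repeatedly evaluate the grid point with the lowest confidence bound."""
    grid = [i / 100 for i in range(101)]
    xs = [0.0, 0.5, 1.0]                 # initial configurations
    ys = [energy(x) for x in xs]
    for _ in range(n_iter):
        def lcb(x):
            m, v = gp_posterior(xs, ys, x)
            return m - kappa * math.sqrt(v)
        x_next = min((x for x in grid if x not in xs), key=lcb)
        xs.append(x_next)
        ys.append(energy(x_next))        # one "simulation" per iteration
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)


def toy_energy(x):
    """Hypothetical stand-in for a single QM energy evaluation."""
    return math.sin(6.0 * x) + 0.5 * x


x_best, e_best, n_evals = minimize(toy_energy)
```

The surrogate is refit after every evaluation, so each new sample goes where the model is either promising or uncertain. This is the mechanism by which a surrogate-based search can locate low-energy regions with few expensive evaluations.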
  • Sipola, Aleksi (2020)
    Most standard statistical inference methods rely on evaluating so-called likelihood functions. In some cases, however, the phenomenon of interest is too complex or the relevant data inapplicable, and as a result the likelihood function cannot be evaluated. Such a situation blocks frequentist methods based on e.g. maximum likelihood estimation, as well as Bayesian inference based on estimating posterior probabilities. Often, however, the phenomenon of interest can be modeled with a generative model that describes the supposed underlying processes and variables of interest. In such scenarios, likelihood-free inference, such as Approximate Bayesian Computation (ABC), can provide an option for overcoming the roadblock. Creating a simulator that implements such a generative model provides a way to explore the parameter space and approximate the likelihood function based on the similarity between real-world data and data simulated with various parameter values. ABC provides a well-defined and well-studied framework for carrying out such simulation-based inference with a Bayesian approach. ABC has been found useful, for example, in ecology, finance and astronomy, in situations where the likelihood function is not practically computable but models and simulators for generating simulated data are available. One such problem is the estimation of recombination rates of bacterial populations from genetic data, which are often unsuitable for typical statistical methods due to infeasibly massive modeling and computation requirements. Overcoming these hindrances should provide valuable insight into the evolution of bacteria and possibly aid in tackling significant challenges such as antimicrobial resistance. Still, ABC inference is not without its limitations. Considerable effort in defining distance functions, summary statistics and a similarity threshold is often required to make the comparison mechanism successful.
High computational costs can also be a hindrance in ABC inference: as increasingly complex phenomena, and thus models, are studied, the computations needed for sufficient exploration of the parameter space with simulation-comparison cycles can become too time- and resource-consuming. Efforts have therefore been made to improve the efficiency of ABC inference. One such improvement is the Bayesian Optimization for Likelihood-Free Inference algorithm (BOLFI), which provides an efficient method for optimizing the exploration of the parameter space, reducing the number of simulation-comparison cycles needed by up to several orders of magnitude. This thesis aims to describe some of the theoretical and applied aspects of complete likelihood-free inference pipelines using both the Rejection ABC and BOLFI methods. The thesis also presents a use case in which the neutral evolution recombination rate in a Streptococcus pneumoniae population is inferred from a well-studied real-world genome data set. This inference task is used to provide context and concrete examples for the theoretical aspects, and demonstrations of numerous applied aspects. The implementations, experiments and acquired results are also discussed in some detail.
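The simulation-comparison cycle of Rejection ABC mentioned in this abstract can be sketched in a few lines. This is a toy illustration, not the thesis pipeline: the Gaussian `simulate` function, the uniform prior, the sample mean as summary statistic, and the threshold value are all assumptions made for the example.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Hypothetical simulator: n draws from a unit-variance Gaussian with mean theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def rejection_abc(observed, n_draws, threshold, rng):
    """Keep prior draws whose simulated summary is within threshold of the observed one."""
    obs_summary = statistics.fmean(observed)      # summary statistic: sample mean
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)            # draw from an assumed uniform prior
        sim = simulate(theta, len(observed), rng)
        distance = abs(statistics.fmean(sim) - obs_summary)
        if distance < threshold:                  # comparison step
            accepted.append(theta)
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)                 # synthetic "observed" data, true mean 2.0
posterior_sample = rejection_abc(observed, n_draws=5000, threshold=0.1, rng=rng)
posterior_mean = statistics.fmean(posterior_sample)
```

Most of the 5000 simulations are rejected. That wasted work is the cost that surrogate-based approaches such as BOLFI aim to reduce.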
  • Mäkelä, Noora (2022)
    Sum-product networks (SPNs) are graphical models capable of handling large amounts of multidimensional data. Unlike many other graphical models, SPNs are tractable if certain structural requirements are fulfilled; a model is called tractable if probabilistic inference can be performed in polynomial time with respect to the size of the model. The learning of SPNs can be separated into two modes, parameter learning and structure learning. Many earlier approaches to SPN learning have treated the two modes as separate, but it has been found that good results can be achieved by alternating between them. One example of this kind of algorithm was presented by Trapp et al. in the article Bayesian Learning of Sum-Product Networks (NeurIPS, 2019). This thesis discusses SPNs and a Bayesian learning algorithm developed on the basis of the aforementioned algorithm, differing in some of the methods used. The algorithm by Trapp et al. uses Gibbs sampling in the parameter learning phase, whereas here Metropolis-Hastings MCMC is used. The algorithm developed for this thesis was used in two experiments, one with a small and simple SPN and one with a larger and more complex SPN. The effect of the data set size and the complexity of the data was also explored. The results were compared to those obtained by running the original algorithm developed by Trapp et al. The results show that having more data in the learning phase makes the results more accurate, as it is easier for the model to spot patterns in a larger set of data. It was also shown that the model was able to learn the parameters in the experiments if the data were simple enough, in other words, if the data contained only one distribution per dimension. In the case of more complex data, with multiple distributions per dimension, the results showed that the computation struggled.
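As a rough illustration of Metropolis-Hastings in a parameter learning step, the sketch below samples the weight of a single sum node over two fixed Gaussian leaves. It is not the algorithm developed in the thesis or the one by Trapp et al.: the two-leaf model, the uniform prior on the weight, the random-walk proposal, and all numeric settings are assumptions for the example.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian leaf with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def log_posterior(w, data):
    """Log posterior of the sum-node weight w under a uniform prior on (0, 1)."""
    if not 0.0 < w < 1.0:
        return -math.inf
    return sum(math.log(w * normal_pdf(x, -2.0) + (1.0 - w) * normal_pdf(x, 2.0))
               for x in data)


def metropolis_hastings(data, n_steps, step, rng):
    """Random-walk Metropolis-Hastings over the weight w."""
    w = 0.5
    current = log_posterior(w, data)
    chain = []
    for _ in range(n_steps):
        proposal = w + rng.gauss(0.0, step)       # symmetric Gaussian proposal
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            w, current = proposal, candidate
        chain.append(w)
    return chain


rng = random.Random(1)
# Synthetic data: 30% of points from the leaf at -2, the rest from the leaf at +2.
data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
        for _ in range(200)]
chain = metropolis_hastings(data, n_steps=3000, step=0.05, rng=rng)
burned = chain[500:]                               # discard burn-in samples
w_estimate = sum(burned) / len(burned)
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of posterior densities. Unlike Gibbs sampling, no conditional distribution has to be derived in closed form.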