Browsing by Subject "classification"

  • Kokko, Jan (2019)
    In this thesis we present a new likelihood-free inference method for simulator-based models. A simulator-based model is a stochastic mechanism that specifies how data are generated. Simulator-based models can be as complex as needed, but they must allow exact sampling. One common difficulty with simulator-based models is that learning model parameters from observed data is generally challenging, because the likelihood function is typically intractable. Thus, traditional likelihood-based Bayesian inference is not applicable. Several likelihood-free inference methods have been developed to perform inference when a likelihood function is not available. One popular approach is approximate Bayesian computation (ABC), which relies on the fundamental principle of identifying parameter values for which summary statistics of simulated data are close to those of observed data. However, traditional ABC methods tend to have a high computational cost. The cost is largely due to the need to repeatedly simulate data sets and to the lack of knowledge of how to specify the discrepancy between the simulated and observed data. We consider speeding up an earlier method, likelihood-free inference by ratio estimation (LFIRE), by replacing its computationally intensive grid evaluation with Bayesian optimization. LFIRE is an alternative to ABC that relies on transforming the original likelihood-free inference problem into a classification problem that can be solved using machine learning. It overcomes two traditional difficulties with ABC: it avoids using a threshold value that controls the trade-off between computational and statistical efficiency, and it combats the curse of dimensionality by automatically selecting relevant summary statistics from a large number of candidates.
Finally, we measure the computational and statistical efficiency of the new method by applying it to three different real-world time series models with intractable likelihood functions. We demonstrate that the proposed method can reduce the computational cost by some orders of magnitude while the statistical efficiency remains comparable to the earlier method.
  • Latva-Käyrä, Petri (2012)
    The intensity and frequency of insect outbreaks have increased in Finland in recent decades, and they are expected to increase even further in the future due to global climate change. In 1998-2001 Finland suffered the most severe insect outbreak ever recorded, covering over 500,000 hectares. The outbreak was caused by the common pine sawfly (Diprion pini L.). The outbreak has continued in the study area, Palokangas, ever since. To find a good method for monitoring this type of outbreak, the purpose of this study was to examine the efficacy of multitemporal ERS-2 and ENVISAT SAR imagery for estimating Scots pine defoliation. The study area, Palokangas, is located in the Ilomantsi district in eastern Finland and consists mainly of even-aged Scots pine forests on relatively dry soils. Most of the forests in the area are young or middle-aged managed forests. The study material comprised multitemporal ERS-2 and ENVISAT synthetic aperture radar (SAR) data. The images had been taken between the years 2001 and 2008. The field data consisted of 16 sample plots which had been measured seven times between the years 2002 and 2009. In addition, eight sample plots were added afterwards in places which were known to have had cuttings during the study period. Three methods were tested to estimate Scots pine defoliation: unsupervised k-means clustering, supervised linear discriminant analysis (LDA) and logistic regression. In addition, it was assessed whether harvested areas could be differentiated from the defoliated forest using the same methods. Two different speckle filters were used to determine the effect of filtering on the SAR imagery and subsequent results. Logistic regression performed best, producing a classification accuracy of 81.6% (kappa 0.62) with two classes (no defoliation, >20% defoliation). With two classes, LDA accuracy was at best 77.7% (kappa 0.54) and k-means accuracy 72.8% (kappa 0.46). In general, the largest speckle filter, with a 5 x 5 image window, performed best.
When additional classes were added, the accuracy usually degraded step by step. The results were good, but because of the restrictions of the study they should be confirmed with independent data before firm conclusions about their reliability can be drawn. The restrictions include the small size of the field data and, thus, the problems with accuracy assessment (no separate testing data), as well as the lack of meteorological data from the imaging dates.
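The accuracy and kappa figures quoted above are both derived from a confusion matrix. The following is a minimal sketch of that calculation in Python. The 2 x 2 matrix is hypothetical and chosen only for illustration; it is not the thesis data, and the function name `accuracy_and_kappa` is an assumption of this sketch.

```python
def accuracy_and_kappa(cm):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    cm[i][j] = number of samples with reference class i predicted as class j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    # Observed agreement: share of samples on the diagonal.
    p_observed = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_chance = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Hypothetical two-class matrix (no defoliation vs. defoliated), NOT from the thesis.
hypothetical_cm = [
    [40, 10],
    [5, 45],
]
acc, kappa = accuracy_and_kappa(hypothetical_cm)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, which is why it is lower than the raw accuracy in both the sketch and the reported results.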
  • Kyrö, Minna (2011)
    FTIR spectroscopy (Fourier transform infrared spectroscopy) is a fast method of analysis. The use of interferometers in Fourier devices enables the scanning of the whole infrared frequency region in a couple of seconds. There is no need for elaborate sample preparation when the FTIR spectrometer is equipped with an ATR accessory, and the method is therefore easy to use. The ATR accessory facilitates the analysis of various sample types. It is possible to measure infrared spectra from samples which are not suitable for traditional sample preparation methods. The data from FTIR spectroscopy is frequently combined with statistical multivariate analysis techniques. In cluster analysis the data from spectra can be grouped based on similarity. In hierarchical cluster analysis the similarity between objects is determined by calculating the distance between them. Principal component analysis reduces the dimensionality of the data and establishes new uncorrelated principal components. These principal components should preserve most of the variation of the original data. The possible applications of FTIR spectroscopy combined with multivariate analysis have been studied extensively. For example, in the food industry its feasibility in quality control has been evaluated. The method has also been used for the identification of chemical compositions of essential oils and for the detection of chemotypes in oil plants. In this study the use of the method was evaluated in the classification of hog's fennel extracts. FTIR spectra of extracts from different plant parts of hog's fennel were compared with the measured FTIR spectra of standard substances. The typical absorption bands in the FTIR spectra of standard substances were identified. The wave number regions of the intensive absorption bands in the spectra of furanocoumarins were selected for multivariate analyses.
Multivariate analyses were also performed in the fingerprint region of IR spectra, including the wave number region 1785-725 cm⁻¹. The aim was to classify extracts according to the habitat and coumarin concentration of the plants. Grouping according to habitat was detected, which could mainly be explained by coumarin concentrations as indicated by analyses of the wave number regions of the selected absorption bands. In these analyses extracts mainly grouped and differed by their total coumarin concentrations. In analyses of the wave number region 1785-725 cm⁻¹ grouping according to habitat was also detected but this could not be explained by coumarin concentrations. These groupings may have been caused by similar concentrations of other compounds in the samples. Analyses using other wave number regions were also performed, but the results from these experiments did not differ from previous results. Multivariate analyses of second-order derivative spectra in the fingerprint region did not reveal any noticeable changes either. In future studies the method could perhaps be further developed by investigating narrower carefully selected wave number regions of second-order derivative spectra.
  • Muukkonen, Ilkka (2018)
    Objectives: Faces provide an ideal platform to look into the ways in which our brains process multidimensional information. In order to still recognize an individual when their expression changes, our brain must be able to separate two overlapping sources of information. Previous fMRI studies have found several brain areas involved in face processing, especially the fusiform face area (FFA), occipital face area (OFA), and superior temporal sulcus (STS). EEG and MEG studies have also pointed out face-specific temporal components, mainly P1, N170, and N250. However, only a few studies have varied both expressions and identities at the same time, or combined spatially precise fMRI with temporally precise M/EEG. Methods: In separate experiments, EEG and fMRI were measured while participants (n=17) viewed morphed faces varying in their expression (neutral, happy, fearful and angry) and in identity. Classification accuracies were calculated using a support vector machine (SVM), both from different spatial locations in fMRI and from different timepoints in EEG. In addition, the classification information in fMRI and EEG was combined using representational similarity analysis (RSA). Results: In EEG, we found support for very early processing of expressions (at 110 ms), later processing of identities (at 250 ms) than expressions, and more sustained decoding of angry faces than faces with other expressions. In fMRI, coding of expressions was found in a broad area containing early visual areas and the face processing areas OFA, FFA, and STS. Results for identities, although less clear, implicated the FFA and middle frontal gyrus (MFG). RSA combining both EEG and fMRI showed progression of information from early visual areas at 130 ms to FFA at 150 ms, and to FFA and STS at 200 ms. Conclusions: Our results showed that with multivariate data analysis methods, temporal and spatial neural representations of faces can be studied simultaneously.
Consistent with neural models of face processing, our results suggest partially separate processing of expressions and identities in a spatially distributed brain network.
  • Räty, Matti (2020)
    SQL is among the recommended subjects of study in computer science. It is an effective way to store data regardless of context. However, SQL is a difficult subject for students to learn, and for this reason educational software is used alongside SQL teaching. Educational software lets students practise SQL hands-on, compensates for the large number of students relative to the number of teachers, and collects data on student performance. The data on student performance collected by learning software makes it possible to predict how students will perform on a course using machine learning methods. This thesis trains well-established machine learning algorithms on data from an SQL educational software system to produce models that can predict whether a student will solve an SQL exercise correctly on their next attempt. The aim is not to build a model that can grade SQL exercises. Rather, the intention is to let machine learning algorithms use other statistics collected from students to assess the correctness of an attempt without the student's actual solution. The thesis finds that several machine learning models work well for this purpose. Similar machine learning models can be used to identify students who are having difficulty with the exercises. This information is valuable, for example, to educational software that aims to give hints to students working on SQL exercises at a useful moment.
  • Kallela, Jenni; Jääskeläinen, Tiina; Kortelainen, Eija; Laivuori, Hannele (2016)
    Background: The Finnish Pre-eclampsia Consortium (FINNPEC) case-control cohort consisting of 1447 pre-eclamptic and 1068 non-pre-eclamptic women was recruited at the five Finnish university hospitals to study the genetic background of pre-eclampsia and fetal growth. Pre-eclampsia was defined by hypertension and proteinuria according to the modified American College of Obstetricians and Gynecologists (ACOG) 2002 classification. The ACOG Task Force Report on Hypertension in Pregnancy (2013) and the International Society for the Study of Hypertension in Pregnancy (ISSHP) (2014) have published new classifications, which change the paradigm that the diagnosis of pre-eclampsia always requires proteinuria. Here we studied how the new classifications would affect the pre-eclampsia diagnoses in the FINNPEC cohort. Methods: We re-evaluated the pre-eclampsia diagnosis using the ACOG 2013 and the ISSHP 2014 classifications in those pre-eclamptic women with the amount of proteinuria not exceeding 1+ in dipstick (N=68) and in women with gestational hypertension (N=138). Results: The number of women with pre-eclampsia increased by 0.5% (1454 vs. 1447) according to the ACOG 2013 criteria and decreased by 0.9% (1434 vs. 1447) according to the ISSHP 2014 criteria. All 68 women originally diagnosed as pre-eclamptic with the amount of proteinuria not exceeding 1+ in dipstick met the ACOG 2013 criteria, but only 20 women (29.4%) met the ISSHP 2014 criteria. Seven (5.1%) and 35 (25.4%) women with gestational hypertension were diagnosed with pre-eclampsia according to the ACOG 2013 and the ISSHP 2014 criteria, respectively. Conclusions: Only minor changes were observed in the total number of pre-eclamptic women in the FINNPEC cohort when comparing the modified ACOG 2002 classification with the ACOG 2013 and ISSHP 2014 classifications.
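The totals and percentages reported above can be checked with a few lines of arithmetic. The sketch below assumes, as the abstract implies, that only the two re-evaluated subgroups (68 and 138 women) can change status, so all other originally pre-eclamptic women keep their diagnosis. The variable and function names are illustrative.

```python
original_cases = 1447          # pre-eclamptic under the modified ACOG 2002 criteria
low_proteinuria = 68           # re-evaluated cases with proteinuria not exceeding 1+
gestational_hypertension = 138 # re-evaluated women with gestational hypertension

# Counts reported in the abstract for each new classification.
reported = {
    "ACOG 2013": {"low_proteinuria_still_pe": 68, "gest_htn_now_pe": 7},
    "ISSHP 2014": {"low_proteinuria_still_pe": 20, "gest_htn_now_pe": 35},
}


def new_total(counts):
    """Original cases, minus low-proteinuria cases that lose the diagnosis,
    plus gestational hypertension cases that gain it."""
    lost = low_proteinuria - counts["low_proteinuria_still_pe"]
    gained = counts["gest_htn_now_pe"]
    return original_cases - lost + gained


totals = {name: new_total(counts) for name, counts in reported.items()}
changes = {
    name: round(100 * (total - original_cases) / original_cases, 1)
    for name, total in totals.items()
}

for name in reported:
    print(f"{name}: total {totals[name]}, change {changes[name]:+.1f}%")
```

Under that assumption the recomputed totals match the reported 1454 and 1434, so the subgroup counts and the overall percentages in the abstract are mutually consistent.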
  • Savolainen, Dominic (2021)
    This study attempts to discover the best predictors of mathematics and language learning outcomes across Kenya, Mozambique, Nigeria, Uganda, and Tanzania by analysing World Bank SDI data and using machine learning methods for variable selection. Firstly, I use the SDI data to show the current fragilities in the quality of education service delivery, while also highlighting deficiencies in student learning outcomes. Then, I use CV Lasso, Adaptive Lasso, and Elastic Net regularisation methods to help discover the best predictors of learning outcomes. The results from the regularisation methods show that private schools, teacher subject knowledge, and teacher pedagogical skills are good predictors of learning outcomes in a sample combining observations from Kenya, Mozambique, Nigeria, Uganda, and Tanzania. However, the results cannot establish causality, because they do not distinguish whether unobservable factors are driving them. To quantify the relationship of key predictors, and for statistical significance testing, I then conduct subsequent OLS analysis. Although the true partial effects are not expected to be identical to the OLS coefficients presented in this study, the study highlights deficiencies in education service delivery and applies methods which help select key predictors of learning outcomes across the sampled schools in the SDI data.
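To illustrate why Lasso-type regularisation can serve as a variable selection tool, here is a minimal sketch of plain Lasso fitted by coordinate descent with a fixed penalty. It is not the study's procedure (which used CV Lasso, Adaptive Lasso, and Elastic Net), the data are synthetic, and the names `soft_threshold` and `lasso_coordinate_descent` are assumptions of this sketch.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of responses; no intercept is fitted,
    so the data are assumed to be centred.
    """
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of predictor j with the partial residual
            z = 0.0    # sum of squares of predictor j
            for i in range(n):
                partial_fit = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial_fit)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Synthetic, centred data: y depends strongly on column 0 and only weakly on column 1.
X = [[-2, 1], [-1, -2], [0, 0], [1, 2], [2, -1]]
y = [3 * x0 + 0.1 * x1 for x0, x1 in X]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

With this penalty the weak predictor's coefficient is set exactly to zero while the strong predictor is kept (slightly shrunk), which is the mechanism that lets the penalised fit double as a selection step. Selection of this kind identifies predictive variables only and, as the abstract notes, does not establish causality.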