
Browsing by master's degree program "Master's Programme in Data Science"


  • Rauth, Ella (2022)
    Northern peatlands are a large source of methane (CH4) to the atmosphere, and their emissions can vary strongly depending on local environmental conditions. However, few studies have mapped fine-grained CH4 fluxes at the landscape level. The aim of this study was to predict land cover and CH4 flux patterns in Pallastunturi, Finland, in a study area dominated by forests, peatlands, fells, and lakes. I used random forest models to map land cover types and CH4 fluxes with multi-source remote sensing data and upscaled CH4 fluxes based on land cover maps. The random forest classifier reliably detected the same land cover patterns as the CORINE Land Cover maps. The main differences between the land cover maps were forest type classification, misclassification between neighboring peatland types, and detection of sparsely vegetated areas on fells. The upscaled CH4 fluxes of sinks were very robust to changes in land cover classification, but shrub tundra and peatland CH4 fluxes were sensitive to the level of detail in the land cover classification. The random forest regression performed well (NRMSE 6.6%, R2 82%) and predicted CH4 flux patterns similar to the upscaled CH4 flux maps, although it predicted larger areas acting as CH4 sources. The random forest regressor also predicted CH4 fluxes in peatlands better, thanks to the added information about soil moisture content from the remote sensing data. Random forests are a good model choice for detecting landscape patterns and predicting CH4 patterns in northern peatlands based on remote sensing and topographic data.
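    A minimal sketch of the two-model setup described above (scikit-learn assumed; feature arrays, class labels, and hyperparameters are illustrative placeholders, not the thesis's data):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 12))             # per-pixel multi-source features
    land_cover = rng.integers(0, 5, size=1000)  # e.g. forest, peatland, fell, lake
    ch4_flux = rng.normal(size=1000)            # CH4 flux at measurement pixels

    # Land cover classification.
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X, land_cover)

    # CH4 flux regression.
    X_tr, X_te, y_tr, y_te = train_test_split(X, ch4_flux, random_state=0)
    reg = RandomForestRegressor(n_estimators=500, random_state=0)
    reg.fit(X_tr, y_tr)
    pred = reg.predict(X_te)

    rmse = mean_squared_error(y_te, pred) ** 0.5
    nrmse = rmse / (y_te.max() - y_te.min())  # one common normalisation of RMSE
    print(f"NRMSE {nrmse:.1%}, R2 {r2_score(y_te, pred):.0%}")
    ```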
  • Räisä, Ossi (2021)
    Differential privacy has over the past decade become a widely used framework for privacy-preserving machine learning. At the same time, Markov chain Monte Carlo (MCMC) algorithms, particularly Metropolis-Hastings (MH) algorithms, have become an increasingly popular method of performing Bayesian inference. Surprisingly, their combination has not received much attention in the literature. This thesis introduces the existing research on differentially private MH algorithms, proves tighter privacy bounds for them using recent developments in differential privacy, and develops two new differentially private MH algorithms: an algorithm using subsampling to lower privacy costs, and a differentially private variant of the Hamiltonian Monte Carlo algorithm. The privacy bounds of both new algorithms are proved, and convergence to the exact posterior is proven for the latter. The performance of both the old and the new algorithms is compared on several Bayesian inference problems, revealing that none of the algorithms is clearly better than the others, but subsampling is likely only useful to lower computational costs.
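    A toy sketch of one standard differentially private MH step (the penalty-style variant): Gaussian noise is added to a clipped log acceptance ratio, which gives a Gaussian-mechanism-style guarantee. The clipping bound C and noise scale sigma are illustrative assumptions, not the thesis's exact algorithms.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def private_mh_step(theta, proposal_std, log_lik_terms, log_prior, C, sigma):
        """One MH step with a per-example clipped, noised log-likelihood ratio."""
        theta_prop = theta + rng.normal(scale=proposal_std)
        # Per-data-point log-likelihood ratio, clipped to bound sensitivity.
        diffs = np.clip(log_lik_terms(theta_prop) - log_lik_terms(theta), -C, C)
        log_ratio = diffs.sum() + log_prior(theta_prop) - log_prior(theta)
        noisy = log_ratio + rng.normal(scale=sigma)  # Gaussian mechanism
        noisy -= sigma**2 / 2                        # penalty-method correction
        return theta_prop if np.log(rng.uniform()) < noisy else theta

    # Example: posterior of a Gaussian mean given data x.
    x = rng.normal(loc=2.0, size=100)
    log_lik = lambda th: -0.5 * (x - th) ** 2
    log_prior = lambda th: -0.5 * th ** 2
    theta = 0.0
    for _ in range(1000):
        theta = private_mh_step(theta, 0.5, log_lik, log_prior, C=1.0, sigma=0.1)
    print(theta)  # should drift toward the true mean, here 2.0
    ```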
  • Suihkonen, Sini (2023)
    The importance of protecting sensitive data from information breaches has increased in recent years, as companies and other institutions gather massive datasets about their customers, including personally identifiable information. Differential privacy is one of the state-of-the-art methods for providing provable privacy to these datasets, protecting them from adversarial attacks. This thesis focuses on studying existing differentially private random forest (DPRF) algorithms, comparing them, and constructing a version of the DPRF algorithm based on them. Twelve articles from the late 2000s to 2022, each implementing a version of the DPRF algorithm, are included in the review of previous work. The created algorithm, called DPRF_thesis, uses a privatized median as the method for splitting internal nodes of the decision trees. The class counts of the leaf nodes are released with the exponential mechanism. Tests of the DPRF_thesis algorithm were run on three binary classification UCI datasets, and its accuracy was mostly comparable with that of the two existing DPRF algorithms it was compared to. ACM Computing Classification System (CCS): Computing methodologies → Machine learning → Machine learning approaches → Classification and regression trees; Security and privacy → Database and storage security → Data anonymization and sanitization
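    A small sketch of the exponential mechanism applied to a leaf node's class counts: the utility of each class is its count (sensitivity 1), and a class is sampled with probability proportional to exp(eps * count / 2). This is the generic textbook mechanism, not necessarily the exact DPRF_thesis variant.

    ```python
    import numpy as np

    def exponential_mechanism_label(class_counts, epsilon, sensitivity=1.0):
        counts = np.asarray(class_counts, dtype=float)
        scores = epsilon * counts / (2 * sensitivity)
        scores -= scores.max()            # numerical stability before exp
        probs = np.exp(scores)
        probs /= probs.sum()
        return np.random.default_rng().choice(len(counts), p=probs)

    print(exponential_mechanism_label([40, 3, 7], epsilon=1.0))  # usually class 0
    ```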
  • Joosten, Rick (2020)
    In the past two decades, an increasing number of discussions have been held via online platforms such as Facebook or Reddit. The most common source of disruption in these discussions is trolls. Traditional trolls try to derail the discussion into a nonconstructive argument. One strategy to achieve this is to give asymmetric responses, responses that don't follow the conventional patterns. In this thesis we propose a modern machine learning NLP method called ULMFiT to automatically detect the discourse acts of online forum posts in order to detect these conversational patterns. ULMFiT fine-tunes the language model before training its classifier in order to create a more accurate language representation of the domain language. This task of discourse act recognition is unique since it attempts to classify the pragmatic role of each post within a conversation, as opposed to the functional role targeted by tasks such as question-answer retrieval, sentiment analysis, or sarcasm detection. Furthermore, most discourse act recognition research has focused on synchronous conversations where all parties can directly interact with each other, while this thesis looks at asynchronous online conversations. Trained on a dataset of Reddit discussions, the proposed model achieves a Matthews correlation coefficient of 0.605 and an F1-score of 0.69 for predicting the discourse acts. Other experiments also show that this model is effective at question-answer classification, and that language model fine-tuning has a positive effect both on classification performance and on the required size of the training data. These results could be beneficial for current trolling detection systems.
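    A schematic of the two-stage ULMFiT recipe using fastai: fine-tune the language model on the forum text first, then reuse its encoder for the discourse act classifier. The tiny DataFrame and label names are placeholders, not the thesis's Reddit dataset.

    ```python
    import pandas as pd
    from fastai.text.all import *

    df = pd.DataFrame({
        "text": ["What does this error mean?", "Try reinstalling the driver.",
                 "That worked, thanks!", "This thread is pointless."] * 50,
        "discourse_act": ["question", "answer", "appreciation", "disagreement"] * 50,
    })

    # Stage 1: fine-tune the pretrained language model on the domain text.
    dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
    lm = language_model_learner(dls_lm, AWD_LSTM)
    lm.fine_tune(1)
    lm.save_encoder("domain_encoder")

    # Stage 2: train the discourse act classifier on top of the tuned encoder.
    dls_clf = TextDataLoaders.from_df(df, text_col="text", label_col="discourse_act")
    clf = text_classifier_learner(dls_clf, AWD_LSTM)
    clf.load_encoder("domain_encoder")
    clf.fine_tune(1)
    ```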
  • Lange, Moritz Johannes (2020)
    In the context of data science and machine learning, feature selection is a widely used technique that focuses on reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by preventing data redundancy and over-fitting, but can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled and thus requires so-called semi-supervised techniques, which can utilise both labelled and unlabelled data. While unlabelled data is often obtainable in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which offers a way to do feature selection in a semi-supervised setup. Distribution matching is a wrapper method, which trains models to select the features that most improve model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function which approximates expected loss on unseen data. In experiments, the method is shown to successfully minimise the expected loss transparently on a synthetic dataset. Additionally, a comparison with related methods is performed on the more complex EMNIST dataset.
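    A rough sketch of a wrapper-style forward-selection loop of the kind described above. The score here is plain cross-validated accuracy; the thesis's distribution-matching cost additionally incorporates unlabelled data, which this sketch does not reproduce.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X, y, n_features):
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n_features:
            scores = {}
            for j in remaining:              # try adding each remaining feature
                cols = selected + [j]
                model = LogisticRegression(max_iter=1000)
                scores[j] = cross_val_score(model, X[:, cols], y, cv=3).mean()
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 2] + X[:, 5] > 0).astype(int)    # only features 2 and 5 matter
    print(forward_select(X, y, n_features=2))  # typically [2, 5] or [5, 2]
    ```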
  • Jokinen, Olli (2024)
    The rise of large language models (LLMs) has revolutionized natural language processing, particularly through transfer learning and fine-tuning paradigms that enhance the understanding of complex textual data. This thesis builds upon the concept of fine-tuning to improve the understanding of Finnish Wikipedia articles. Specifically, a BERT-based language model is fine-tuned to create high-quality document representations from Finnish texts. The learned representations are applied to downstream tasks, where the model’s performance is evaluated against baseline models. This thesis draws on the SPECTER paper, published in 2020, which introduced a training framework for fine-tuning a general-purpose document embedder. SPECTER was trained using a document-level training objective that leveraged document link information. Originally, SPECTER was designed for scientific articles, utilizing citations between articles. The training instances consisted of triplets of query, positive, and negative papers, with the aim of capturing the semantic similarity of the documents. This work extends the SPECTER framework to Finnish Wikipedia data. While scientific articles have citations, Wikipedia’s cross-references are used to build a document graph that captures the relatedness between articles. Additionally, Wikipedia data is publicly available as a full data dump, making it an attractive choice for the dataset in this thesis. One of the objectives is to demonstrate the flexibility of the SPECTER framework on a new dataset that has a similar networked structure to that of scientific articles. The fine-tuned model can be used as a general-purpose tool for various tasks and applications; however, in this thesis, its performance is measured in topic classification and cross-reference ranking. The Transformer-based language model produces fixed-length embeddings, which are used as features in the topic classification task and as vectors to measure the L2 distance of article vectors in the cross-reference prediction task. This thesis shows that the proposed model, WikiSpecter, optimized with a document-level objective, outperformed baseline models in both tasks. The performance indicates that Finnish Wikipedia provides relevant cross-references that help the model capture relationships across a range of topics.
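    A condensed sketch of SPECTER-style triplet training: embed query, positive, and negative documents with a BERT encoder ([CLS] pooling) and apply a triplet margin loss so cross-referenced articles land closer than unrelated ones. The Finnish encoder named below is one plausible choice, not necessarily the thesis's.

    ```python
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "TurkuNLP/bert-base-finnish-cased-v1"   # assumed encoder choice
    tok = AutoTokenizer.from_pretrained(name)
    enc = AutoModel.from_pretrained(name)

    def embed(texts):
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        return enc(**batch).last_hidden_state[:, 0]  # [CLS] token embedding

    query = embed(["Helsinki on Suomen pääkaupunki."])
    pos   = embed(["Suomi on valtio Pohjois-Euroopassa."])  # cross-referenced
    neg   = embed(["Fotosynteesi tuottaa happea."])         # unrelated article

    loss = torch.nn.TripletMarginLoss(margin=1.0, p=2)(query, pos, neg)
    loss.backward()  # gradients flow into the encoder for fine-tuning
    print(float(loss))
    ```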
  • Joensuu, Juhana (2022)
    Currency risk is an important yet neglected consideration for investors holding internationally diversified investment portfolios. The foreign exchange market is an extremely liquid and efficient market, with daily transaction volumes exceeding the equivalent of several trillion euros. International investors have to decide upon the level of exposure on various currency risks, typically by hedging some or all of the underlying currency exposure with currency derivative contracts. The currency overlay refers to an approach where the aggregate currency exposure from the investment portfolio is managed with a separate derivatives strategy, aimed at improving the overall portfolio’s risk adjusted returns. In this thesis, we develop a novel systematic, data-driven approach to manage the currency risk of investors holding diversified bond-equity portfolios, accounting for both risk minimization and expected returns maximization objectives on the portfolio level. The model is based upon modern portfolio theory, leveraging findings from prior literature in covariance modelling and expected currency returns. The focus of this thesis is in ensuring efficient risk diversification through the use of accurate covariance estimates fed by high-frequency data on exchange rates, bonds and equity indexes. As for the expected returns estimate, we identify purchasing power parity (PPP) and carry signals as credible alternatives to improve the expected risk-adjusted returns of the strategy. A block bootstrap simulation methodology is used to conduct empirical tests on different specifications of the developed dynamic overlay model. We find that dynamic risk-minimizing strategies significantly decrease portfolio risk relative to either unhedged or fully hedged portfolios. Using high-frequency-data-based covariance estimates of returns is likely to improve portfolio diversification relative to a simple daily data-based estimator. The empirical results are much less clear in terms of risk adjusted returns. We find tentative evidence that the tested dynamic strategies improve risk adjusted returns. Due to the limited data sample used in this study, the findings regarding expected returns are highly uncertain. Nevertheless, considering evidence from prior research covering much longer time-horizons, we expect that both the risk-minimizing and returns maximizing components of the developed model are likely to improve portfolio-level risk adjusted returns. We recommend using the developed model as an input to support the currency risk management decision for investors with globally diversified investment portfolios, along with other relevant considerations such as solvency or discretionary market views.
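    A stylised sketch of the risk-minimising overlay idea: choose currency-forward weights h to minimise the variance of portfolio-plus-hedge returns given a covariance estimate, via the closed form h* = -Cov(F,F)^{-1} Cov(F,p). The synthetic returns and dimensions are illustrative, not the thesis's data or exact model.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    R = rng.normal(size=(500, 4)) @ np.diag([0.01, 0.012, 0.009, 0.011])
    portfolio = R.sum(axis=1) + rng.normal(0, 0.005, 500)  # unhedged returns
    F = R[:, :3]                                           # currency forward returns

    Sigma_FF = np.cov(F, rowvar=False)                     # 3x3 forward covariance
    cov_Fp = np.array([np.cov(F[:, i], portfolio)[0, 1] for i in range(3)])
    h = -np.linalg.solve(Sigma_FF, cov_Fp)                 # min-variance hedge weights
    hedged = portfolio + F @ h
    print(portfolio.std(), hedged.std())                   # hedging reduces volatility
    ```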
  • Pajula, Ilari (2024)
    Combining data from visual and inertial sensors effectively reduces inherent errors in each modality, enhancing the robustness of sensor-fusion for accurate 6-DoF motion estimation over extended periods. While traditional SfM and SLAM frameworks are well established in literature and real-world applications, purely end-to-end learnable SfM and SLAM networks are still scarce. The adaptability of fully trained models in system configuration and navigation setup holds great potential for future developments in this field. This thesis introduces and assesses two novel end-to-end trainable sensor-fusion models using a supervised learning approach, tested on established navigation benchmarks and custom datasets. The first model utilizes optical flow, revealing its limitations in handling complex camera movements present in pedestrian motion. The second model addresses these shortcomings by using feature point-matching and a completely original design.
  • Ranta, Topi (2024)
    For machine learning, quantum computing provides effective new computation methods. The number of states a quantum register may express is exponential compared to a classical register of the same size, and this expressivity may be used in machine learning. It has been shown that in less than exponential time, a theoretical fault-tolerant quantum computer may perform computations that cannot be run on a classical computer in a feasible time. In machine learning, however, it has been shown that a classical machine learning method may learn a target model defined by an arbitrary quantum circuit if given a sufficient amount of training data. In other words, a machine learning method that utilizes quantum computing may gain a quantum prediction advantage over its classical counterpart if the amount of training data is small. However, this result does not address the noise of contemporary quantum computers. In this thesis, we use a simulation of a quantum circuit to test how gradually increased noise affects the ability of a hybrid quantum-classical machine learning system to retain the quantum prediction advantage. With a simulated quantum circuit, we embed classical data rows into the quantum Hilbert space, which acts as a feature space known from classical kernel theory. We project the data back to classical space, yielding a projected dataset that differs from the original. Using kernel matrices of the original and projected datasets, we then create an adversarial binary labeling. With little training data, this adversarial labeling is easy for a classical neural network to learn using the projected features but impossible to learn using the original data. We show that this quantum prediction advantage diminishes as a function of the error rate introduced in the simulation of the data-embedding quantum circuit. Our results suggest that the noise threshold for a feasible system lies slightly above that of contemporary hardware, indicating that our experiment should be tested on actual quantum hardware. We derive a parameter optimization scheme for an arbitrary hardware implementation, such that it may be concluded whether the quantum hardware in question can produce a quantum-advantage dataset beyond the simulation capability of classical computers.
  • Valkama, Bearjadat (2022)
    Above-ground biomass (AGB) estimation is an important tool for predicting carbon flux and the effects of global warming. This study describes a novel application of remote-sensing-based AGB estimation in the hemi-boreal vegetation zone of Finland, using Sentinel-1, Sentinel-2, ALOS-2 PALSAR-2, and the Multi-Source National Forest Inventory by the Natural Resources Institute Finland as sources of data. A novel method of extracting data from the features of the surrounding observations is proposed, and its effectiveness was evaluated. The method showed promising results, with the model trained using the extracted features achieving the highest evaluation scores in the study. In addition, the viability of using free and highly available satellite datasets for AGB estimation in hemi-boreal Finland was analyzed, with the results suggesting that the free Synthetic Aperture Radar (SAR) based products performed poorly. The features extracted from the optical data of Sentinel-2 produced well-performing models, although the accuracy might still be too low to be feasible.
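    A sketch of the neighbourhood-feature idea: augment each pixel's own raster features with aggregates over its surrounding pixels. The 3x3 mean used here is an assumption for illustration; the thesis's exact extraction scheme may differ.

    ```python
    import numpy as np
    from scipy.ndimage import uniform_filter

    band = np.random.default_rng(0).normal(size=(100, 100))  # one raster band
    neigh_mean = uniform_filter(band, size=3)                # 3x3 neighbourhood mean
    features = np.stack([band, neigh_mean], axis=-1)         # per-pixel feature stack
    print(features.shape)  # (100, 100, 2)
    ```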
  • Suomela, Samu (2021)
    Large graphs often have labels for only a subset of nodes. Node classification is a semi-supervised learning task where unlabeled nodes are assigned labels utilizing the known information of the graph. In this thesis, three node classification methods are evaluated based on two metrics: computational speed and node classification accuracy. The three methods are label propagation, harmonic functions with Gaussian fields, and the Graph Convolutional Neural Network (GCNN). Each method is tested on five citation networks of different sizes extracted from a large scientific publication graph, MAG240M-LSC. For each graph, the task is to predict the subject areas of scientific publications, e.g., cs.LG (Machine Learning). The motivation of the experiments is to give insight into whether the methods would be suitable for automatic labeling of scientific publications. The results show that label propagation and harmonic functions with Gaussian fields reached mediocre accuracy in the node classification task, while the GCNN had low accuracy. Label propagation was computationally slow compared to the other methods, whereas harmonic functions were exceptionally fast. Training the GCNN took a long time compared to harmonic functions, but its computational speed was acceptable. However, none of the methods reached a high enough classification accuracy to be utilized in automatic labeling of scientific publications.
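    A minimal sketch of iterative label propagation on a small graph: labelled nodes are clamped while unlabelled nodes repeatedly take the normalised label distribution of their neighbours. The toy adjacency matrix stands in for the citation networks; this is not the thesis's implementation.

    ```python
    import numpy as np

    A = np.array([[0, 1, 1, 0],       # toy undirected citation graph
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    P = np.diag(1 / A.sum(axis=1)) @ A  # row-normalised transition matrix

    Y = np.zeros((4, 2))
    Y[0] = [1, 0]                     # node 0 labelled class 0
    Y[3] = [0, 1]                     # node 3 labelled class 1
    labelled = [0, 3]

    F = Y.copy()
    for _ in range(50):
        F = P @ F                     # propagate labels to neighbours
        F[labelled] = Y[labelled]     # clamp the known labels
    print(F.argmax(axis=1))           # predicted classes for all nodes
    ```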
  • Wachirapong, Fahsinee (2023)
    The importance of topic modeling in the analysis of extensive textual data is magnified by the inefficiency of manual work due to its time-consuming nature. Data preprocessing is a critical step before feeding text data into analysis. This process ensures that irrelevant information is removed and the remaining text is suitably formatted for topic modeling. However, the absence of standard rules often leads practitioners to adopt undisclosed or poorly understood preprocessing strategies, which potentially impacts the reproducibility and comparability of research findings. This thesis examines text preprocessing steps, including lowercase conversion, non-alphabetic removal, stopword elimination, stemming, and lemmatization, and explores their influence on data quality, vocabulary size, and the topic interpretations generated by the topic model. Additionally, it examines variations in the order of preprocessing steps and their impact on the topic model's outcomes. Our examination spans 120 diverse preprocessing approaches on the Manifesto Project Dataset. The results underscore the substantial impact of preprocessing strategies on perplexity scores and demonstrate the challenges in determining the optimal number of topics and interpreting final results. Importantly, our study raises awareness of the role of data preprocessing in shaping the perceived themes and content of identified topics, and proposes recommendations for researchers to consider before performing data preprocessing.
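    A compact sketch of one possible ordering of the preprocessing steps named above: lowercasing, non-alphabetic removal, stopword elimination, then lemmatisation. The tiny stopword list and WordNet lemmatiser are stand-ins; the thesis compares 120 such pipelines, not this one alone.

    ```python
    import re
    from nltk.stem import WordNetLemmatizer  # needs: nltk.download("wordnet")

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
    lemmatise = WordNetLemmatizer().lemmatize

    def preprocess(text):
        text = text.lower()                           # 1. lowercase conversion
        text = re.sub(r"[^a-z\s]", " ", text)         # 2. non-alphabetic removal
        tokens = [t for t in text.split() if t not in STOPWORDS]  # 3. stopwords
        return [lemmatise(t) for t in tokens]         # 4. lemmatisation

    print(preprocess("The parties agreed to 3 new policies in 2020."))
    # ['party', 'agreed', 'new', 'policy']
    ```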
  • Nurmi, Akseli (2024)
    Deep neural networks are widely used in natural language processing. Large language models, trained on large corpora, enable improved information extraction from data that is too large for human processing. This thesis reviews the performance of a deep learning natural language processing pipeline in detecting and removing (anonymising) personal information. Methods for fast and accurate anonymisation or pseudonymisation of data containing sensitive information are vital to research and development in science and industry, as legislation demands extensive procedures for handling data with direct or indirect personal information. We propose a method that achieves state-of-the-art results on noisy data and good performance on a contemporary benchmark. Our comparison of anonymisation performance is one of the first for Finnish free texts.
  • Trigos-Raczkowski, Citlali (2024)
    This thesis examines components of an emerging topic: the interplay between immigration background and partnering in the modern Finnish context. It poses the question: how do various computational methods capture the ways that immigrant background status alters (1) the time to first union formation and (2) subsequent first union dissolutions in Finland from 1987 to 2020? Using longitudinal Finnish register data, the study focuses on all women residents in Finland observed from age 18 onwards during the specified period, categorized by their intergenerational immigration status. The study examines the relationship between immigration status and the two events of interest using the nonparametric Kaplan-Meier survivor function, the semiparametric Cox proportional hazards model, and a parametric survival model fitted with a generalized gamma distribution. The strengths, limitations, and findings from each analytic method are compared. The results suggest three main findings: firstly, there is a clear gradient in the risk of first union formation and dissolution across women with different immigrant backgrounds in Finland, with Native Finnish women experiencing the highest risk, followed by 2.5 generation women (women with one Native Finnish parent and one 1st generation immigrant parent), 1st generation immigrant women, and finally 2nd generation women (women with two 1st generation immigrant parents). Secondly, factors including educational attainment, region of origin, rural/urban residence, and partnership homogamy based on region of origin contribute to differences in the risks for both union formation and union dissolution. Finally, despite the unique assumptions and constraints of each method, results remain consistent across all models, indicating that a variety of computational methods can provide robust insights into the complex interplay between immigration and first union dynamics in Finland. In light of the growing immigrant population and the potential influence of their first union dynamics on population change, these findings suggest alignment with segmented assimilation theory, highlighting a non-linear assimilation process influenced by socio-economic status and socio-cultural resources. The observed differences between the 2.5 and 2nd generations raise intriguing questions about the experiences of immigrant children in Finland. The 2nd generation's particularly low risk of first union formation indicates potentially unique acculturation stressors that warrant further investigation.
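    A sketch of the three model families named above on synthetic data, using the lifelines library. Covariates, durations, and censoring are toy stand-ins; the register data itself is not publicly available.

    ```python
    import numpy as np
    import pandas as pd
    from lifelines import KaplanMeierFitter, CoxPHFitter, GeneralizedGammaFitter

    rng = np.random.default_rng(0)
    n = 300
    gen = rng.integers(0, 3, n)                       # toy immigrant-generation group
    duration = rng.gamma(2.0, 3.0 + gen)              # years from age 18 to first union
    event = (rng.uniform(size=n) < 0.8).astype(int)   # ~20% right-censored
    df = pd.DataFrame({"duration": duration, "event": event, "gen": gen})

    km = KaplanMeierFitter().fit(df["duration"], df["event"])  # nonparametric
    print(km.median_survival_time_)

    cox = CoxPHFitter().fit(df, duration_col="duration", event_col="event")
    cox.print_summary()                                        # semiparametric

    gg = GeneralizedGammaFitter().fit(df["duration"], df["event"])
    gg.print_summary()                                         # parametric
    ```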
  • Bovellán, Jonne (2022)
    A great deal of research has been done on time series forecasting, for example on stock market data and on data derived from social media platforms. Several powerful methods have been developed for it, including the neural-network-based Bidirectional Long Short-Term Memory (BiLSTM). TikTok is a social media platform that is focused on short videos. When a user posts a video to TikTok, they can also write a short textual description, which can include hashtags. Often these hashtags describe things, events and trends that happen in the physical world. An accurate forecast of the future popularity of TikTok hashtags would create financial potential for individuals and organisations. As part of this thesis, an experimental study was conducted to forecast the popularity of TikTok hashtags. An algorithm based on BiLSTM was created that forecasts the short-term and long-term popularity of a single hashtag based on its past. A dataset consisting of time series data for 9779 different TikTok hashtags was used in the development process. The created forecasting algorithm performs well for the short-term popularity of a hashtag, but its poor long-term performance makes it unsuitable for long-term forecasting.
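    A minimal Keras sketch of a BiLSTM forecaster of the kind described: sliding windows of a hashtag's past popularity predict the next value. The stand-in series, window length, and layer sizes are illustrative assumptions, not the thesis's configuration.

    ```python
    import numpy as np
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense

    series = np.sin(np.linspace(0, 20, 500))  # stand-in popularity time series
    window = 14
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    X = X[..., None]                          # (samples, timesteps, features)

    model = Sequential([
        Input(shape=(window, 1)),
        Bidirectional(LSTM(32)),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, verbose=0)
    print(model.predict(X[-1:], verbose=0))   # one-step-ahead forecast
    ```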
  • Wei, Haoyu (2022)
    Ultrasonic guided Lamb waves can be used to monitor the structural condition of pipes and other equipment in industry, for example to detect accumulated precipitation on the surface of pipes in a non-destructive and non-invasive way. The propagation of Lamb waves in a pipe is influenced by fouling on its surface, which makes fouling detection possible. In addition, the multiple helical propagation paths around the pipe structure provide rich information that allows spatial localization of the fouled area. Gaussian Processes (GPs) are widely used tools for estimating unknown functions. In this thesis, we propose machine learning models for fouling detection and spatial localization on potentially fouled pipes based on GPs. The research aims to develop a systematic machine learning approach for ultrasonic detection: to interpret fouling observations from wave signals, as well as to reconstruct fouling distribution maps from the observations. The Lamb wave signals are generated in physics experiments. We developed a Gaussian Process regression model as a detector to determine whether each propagation path crosses the fouling or not, based on comparison with a clean pipe. This binary classification can be regarded as one case of the different fouling observations. Latent variable Gaussian Process models are deployed to model the observations over the unknown fouling map. Hamiltonian Monte Carlo sampling is then utilized to perform full Bayesian inference for the GP hyperparameters, so that the fouling map can be reconstructed based on the estimated parameters. We investigate different latent variable GP models for different fouling observation cases. In this thesis, we present the first unsupervised machine learning methods for fouling detection and localization on the surface of a pipe based on guided Lamb waves. We evaluate the performance of our methods with a collection of synthetic data and also study the effect of noise on the localization accuracy.
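    A sketch of the per-path detector idea: fit a Gaussian Process to a clean-pipe baseline signal for one propagation path, then flag the path as fouled when the observed signal falls outside the GP's predictive band. The kernel choice, synthetic waveform, and 3-sigma rule are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 100)[:, None]        # time axis of the waveform
    clean = np.sin(2 * np.pi * 5 * t).ravel() + rng.normal(0, 0.05, 100)

    gp = GaussianProcessRegressor(RBF(0.05) + WhiteKernel(0.01)).fit(t, clean)
    mean, std = gp.predict(t, return_std=True)

    # Simulated fouled-path signal: a localised distortion of the baseline.
    observed = clean + 0.5 * np.exp(-((t.ravel() - 0.5) ** 2) / 0.01)
    fouled = np.any(np.abs(observed - mean) > 3 * std)
    print("path crosses fouling:", fouled)
    ```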
  • Säkkinen, Niko (2020)
    Predicting patient deterioration in an Intensive Care Unit (ICU) effectively is a critical health care task serving patient health and resource allocation. At times, the task may be highly complex for a physician, yet high-stakes and time-critical decisions need to be made based on it. In this work, we investigate the ability of a set of machine learning models to algorithmically predict future occurrence of in-hospital death based on Electronic Health Record (EHR) data of ICU patients. For one, we assess the generalizability of the models: we evaluate them on hospitals whose data was not considered when training the models. For another, we consider the case in which we have access to some EHR data for the patients treated at a hospital of interest. In this setting, we assess how EHR data from other hospitals can be used in an optimal way to improve the prediction accuracy. This study is important for the deployment and integration of such predictive models in practice, e.g., for real-time algorithmic deterioration prediction for clinical decision support. In order to address these questions, we use the eICU collaborative research database, a database containing EHRs of patients treated at a heterogeneous collection of hospitals in the United States. In this work, we use patient demographics, vital signs and the Glasgow coma score as the predictors. We devise and describe three computational experiments to test the generalization in different ways. The used models are the random forest, gradient boosted trees and the long short-term memory network. In our first experiment concerning generalization, we show that, with the chosen limited set of predictors, the models generalize reasonably across hospitals but that only a small data mismatch is observed. Moreover, with this setting, our second experiment shows that the model performance does not significantly improve when increasing the heterogeneity of the training set. Given these observations, our third experiment shows that
  • Cauchi, Daniel (2023)
    Alignment in genomics is the process of finding the positions where DNA strings fit best with one another, that is, where there are the fewest differences if they were placed side by side. This process, however, remains very computationally intensive, even with recent algorithmic advancements in the field. Pseudoalignment is emerging as an inexpensive alternative to full alignment, both in terms of memory needed and in terms of power consumption. The idea is to instead check for the existence of substrings within the target DNA, and this has been shown to produce good results for many use cases. New methods for pseudoalignment are still evolving, and the goal of this thesis is to provide an implementation that massively parallelises the current state of the art, Themisto, by using all resources available. The most intensive parts of the pipeline are put on the GPU, while the components which run on the CPU are heavily parallelised. Reading and writing of the files is also done in parallel, so that parallel I/O can be taken advantage of. Results on the Mahti supercomputer, using an NVIDIA A100, show a 10-fold end-to-end querying speedup over the best run of Themisto, using half as many CPU cores as Themisto, on the dataset used in this thesis.
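    A toy sketch of the pseudoalignment principle: instead of aligning a read, check which of its k-mers occur in each reference and report references containing (nearly) all of them. Themisto's index is far more sophisticated; this only illustrates the membership-query idea.

    ```python
    def kmers(s, k=5):
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    references = {
        "refA": "ACGTACGTGGAACCTTGG",
        "refB": "TTGGCCAATTGGCCAATT",
    }
    index = {name: kmers(seq) for name, seq in references.items()}

    read = "ACGTACGTGG"
    # Report references containing more than 90% of the read's k-mers.
    hits = {name for name, km in index.items()
            if len(kmers(read) & km) / len(kmers(read)) > 0.9}
    print(hits)  # {'refA'}
    ```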