
Browsing by master's degree program "Datatieteen maisteriohjelma"


  • Lange, Moritz Johannes (2020)
    In the context of data science and machine learning, feature selection is a widely used technique that focuses on reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by preventing data redundancy and over-fitting, but can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled and thus requires so-called semi-supervised techniques, which can utilise both labelled and unlabelled data. While unlabelled data is often obtainable in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which offers a way to perform feature selection in a semi-supervised setting. Distribution matching is a wrapper method, which trains models in order to select the features that contribute most to model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function that approximates the expected loss on unseen data. In experiments, the method is shown to successfully minimise the expected loss transparently on a synthetic dataset. Additionally, a comparison with related methods is performed on the more complex EMNIST dataset.
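    As a rough illustration of the wrapper idea (not of the thesis's distribution-matching cost, which additionally uses unlabelled data), the sketch below greedily adds features as long as the cross-validated accuracy of a model improves; the dataset, model, and stopping rule are placeholder assumptions.

    ```python
    # Illustrative wrapper-style forward feature selection (a generic sketch,
    # not the thesis's distribution-matching cost function).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        scores = {
            f: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:          # stop when no candidate feature improves the estimate
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = s_best

    print("selected features:", selected, "cv accuracy:", round(best_score, 3))
    ```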
  • Joensuu, Juhana (2022)
    Currency risk is an important yet neglected consideration for investors holding internationally diversified investment portfolios. The foreign exchange market is an extremely liquid and efficient market, with daily transaction volumes exceeding the equivalent of several trillion euros. International investors have to decide upon the level of exposure to various currency risks, typically by hedging some or all of the underlying currency exposure with currency derivative contracts. A currency overlay refers to an approach where the aggregate currency exposure from the investment portfolio is managed with a separate derivatives strategy, aimed at improving the overall portfolio's risk-adjusted returns. In this thesis, we develop a novel systematic, data-driven approach to manage the currency risk of investors holding diversified bond-equity portfolios, accounting for both risk-minimization and expected-returns-maximization objectives on the portfolio level. The model is based upon modern portfolio theory, leveraging findings from prior literature in covariance modelling and expected currency returns. The focus of this thesis is on ensuring efficient risk diversification through the use of accurate covariance estimates fed by high-frequency data on exchange rates, bonds and equity indexes. As for the expected returns estimate, we identify purchasing power parity (PPP) and carry signals as credible alternatives for improving the expected risk-adjusted returns of the strategy. A block bootstrap simulation methodology is used to conduct empirical tests on different specifications of the developed dynamic overlay model. We find that dynamic risk-minimizing strategies significantly decrease portfolio risk relative to either unhedged or fully hedged portfolios. Using high-frequency-data-based return covariance estimates is likely to improve portfolio diversification relative to a simple daily-data-based estimator. The empirical results are much less clear in terms of risk-adjusted returns. We find tentative evidence that the tested dynamic strategies improve risk-adjusted returns. Due to the limited data sample used in this study, the findings regarding expected returns are highly uncertain. Nevertheless, considering evidence from prior research covering much longer time horizons, we expect that both the risk-minimizing and returns-maximizing components of the developed model are likely to improve portfolio-level risk-adjusted returns. We recommend using the developed model as an input to support the currency risk management decision for investors with globally diversified investment portfolios, along with other relevant considerations such as solvency or discretionary market views.
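    To make the risk-minimization component concrete, the sketch below computes minimum-variance currency hedge weights from a sample covariance matrix of portfolio and currency-forward returns; the toy covariance values are invented, and the sketch omits the expected-returns and high-frequency components of the thesis's model.

    ```python
    # Minimum-variance currency hedge weights from a covariance estimate
    # (a simplified sketch of the risk-minimisation idea, not the full overlay model).
    import numpy as np

    rng = np.random.default_rng(0)
    # toy daily returns: column 0 = unhedged portfolio, columns 1..3 = currency forwards
    returns = rng.multivariate_normal(
        mean=np.zeros(4),
        cov=[[1.0, 0.3, 0.2, 0.1],
             [0.3, 1.0, 0.4, 0.2],
             [0.2, 0.4, 1.0, 0.3],
             [0.1, 0.2, 0.3, 1.0]],
        size=2500)

    cov = np.cov(returns, rowvar=False)
    sigma_fp = cov[1:, 0]        # covariance of currency forwards with the portfolio
    sigma_ff = cov[1:, 1:]       # covariance among currency forwards

    hedge = -np.linalg.solve(sigma_ff, sigma_fp)   # h* = -Sigma_ff^{-1} Sigma_fp
    hedged_var = cov[0, 0] + 2 * hedge @ sigma_fp + hedge @ sigma_ff @ hedge
    print("hedge weights:", hedge.round(3),
          "variance:", cov[0, 0].round(3), "->", hedged_var.round(3))
    ```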
  • Valkama, Bearjadat (2022)
    Above-ground biomass (AGB) estimation is an important tool for predicting carbon flux and the effects of global warming. This study describes a novel application of remote-sensing-based AGB estimation in the hemi-boreal vegetation zone of Finland, using Sentinel-1, Sentinel-2, ALOS-2 PALSAR-2, and the Multi-Source National Forest Inventory by the Natural Resources Institute Finland as sources of data. A novel method of extracting data from the features of the surrounding observations is proposed, and the method's effectiveness is evaluated. The method showed promising results, with the model trained on the extracted features achieving the highest evaluation scores in the study. In addition, the viability of using free and highly available satellite datasets for AGB estimation in hemi-boreal Finland was analyzed, with the results suggesting that the free Synthetic Aperture Radar (SAR) based products had low performance. The features extracted from the optical data of Sentinel-2 produced well-performing models, although the accuracy might still be too low to be feasible.
  • Suomela, Samu (2021)
    Large graphs often have labels only for a subset of nodes. Node classification is a semi-supervised learning task where unlabeled nodes are assigned labels utilizing the known information of the graph. In this thesis, three node classification methods are evaluated based on two metrics: computational speed and node classification accuracy. The three methods that are evaluated are label propagation, harmonic functions with Gaussian fields, and the Graph Convolutional Neural Network (GCNN). Each method is tested on five citation networks of different sizes extracted from a large scientific publication graph, MAG240M-LSC. For each graph, the task is to predict the subject areas of scientific publications, e.g., cs.LG (Machine Learning). The motivation of the experiments is to give insight into whether the methods would be suitable for automatic labeling of scientific publications. The results show that label propagation and harmonic functions with Gaussian fields reached mediocre accuracy in the node classification task, while the GCNN had low accuracy. Label propagation was computationally slow compared to the other methods, whereas harmonic functions were exceptionally fast. Training the GCNN took a long time compared to harmonic functions, but its computational speed was acceptable. However, none of the methods reached a high enough classification accuracy to be utilized in automatic labeling of scientific publications.
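    The simplest of the three methods, label propagation, can be sketched in a few lines: labels from seed nodes are repeatedly pushed along the edges of a row-normalised adjacency matrix until the soft labels stabilise. The toy graph and seed labels below are placeholders, not the citation networks used in the thesis.

    ```python
    # A minimal label-propagation sketch on a small graph (illustrative only).
    import numpy as np

    # adjacency matrix of a toy 6-node graph with two communities
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)

    labels = {0: 0, 5: 1}                     # known labels for two seed nodes
    n_classes, n_nodes = 2, A.shape[0]
    F = np.zeros((n_nodes, n_classes))        # soft label distribution per node
    for node, lab in labels.items():
        F[node, lab] = 1.0

    P = A / A.sum(axis=1, keepdims=True)      # row-normalised transition matrix
    for _ in range(50):                       # propagate, then clamp the seeds
        F = P @ F
        for node, lab in labels.items():
            F[node] = 0.0
            F[node, lab] = 1.0

    print("predicted classes:", F.argmax(axis=1))
    ```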
  • Wachirapong, Fahsinee (2023)
    The importance of topic modeling in the analysis of extensive textual data is magnified by the inefficiency of manual work due to its time-consuming nature. Data preprocessing is a critical step before feeding text data to analysis. This process ensures that irrelevant information is removed and the remaining text is suitably formatted for topic modeling. However, the absence of standard rules often leads practitioners to adopt undisclosed or poorly understood preprocessing strategies. This potentially impacts the reproducibility and comparability of research findings. This thesis examines text preprocessing, including lowercase conversion, non-alphabetic removal, stopword elimination, stemming, and lemmatization, and explores their influence on data quality, vocabulary size, and the topic interpretations generated by the topic model. Additionally, it examines variations in the order of the preprocessing steps and their impact on the topic model's outcomes. Our examination spans 120 diverse preprocessing approaches on the Manifesto Project Dataset. The results underscore the substantial impact of preprocessing strategies on perplexity scores and demonstrate the challenges in determining the optimal number of topics and interpreting the final results. Importantly, our study raises awareness of the role of data preprocessing in shaping the perceived themes and content of identified topics, and proposes recommendations for researchers to consider before performing data preprocessing.
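    A hedged sketch of one such preprocessing pipeline is shown below, using NLTK for stopword removal, stemming, and lemmatization; the example sentence and the chosen step order are illustrative, not taken from the Manifesto Project Dataset.

    ```python
    # One possible preprocessing pipeline: lowercasing, non-alphabetic removal,
    # stopword elimination, then stemming or lemmatization (a sketch, assuming NLTK).
    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOPWORDS = set(stopwords.words("english"))
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

    def preprocess(text, use_stemming=True):
        text = text.lower()                                           # lowercase conversion
        text = re.sub(r"[^a-z\s]", " ", text)                         # non-alphabetic removal
        tokens = [t for t in text.split() if t not in STOPWORDS]      # stopword elimination
        if use_stemming:
            return [stemmer.stem(t) for t in tokens]                  # stemming
        return [lemmatizer.lemmatize(t) for t in tokens]              # lemmatization

    print(preprocess("The parties' manifestos promised 120 new policies in 2023."))
    ```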
  • Bovellán, Jonne (2022)
    A great deal of research has been done on time-series forecasting, for example on stock market data and on data derived from social media platforms. Several powerful methods have been developed for time-series forecasting, including Bidirectional Long Short-Term Memory, which is based on neural networks. TikTok is a social media platform that is focused on short videos. When a user posts a video to TikTok, they can also write a short textual description, which can include hashtags. Often these hashtags describe things, events and trends that happen in the physical world. Accurate forecasts of the future popularity of TikTok hashtags would create financial potential for individuals and organisations. As part of this thesis, an experimental study was conducted in order to forecast the popularity of TikTok hashtags. An algorithm based on Bidirectional Long Short-Term Memory was created that forecasts the short-term and long-term popularity of a single hashtag based on its past popularity. A dataset consisting of time-series data for 9779 different TikTok hashtags was used in the development process. The created forecasting algorithm performs at a good level for forecasting the short-term popularity of a hashtag, but it is not suitable for long-term forecasting due to its poor performance.
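    A minimal sketch of a bidirectional LSTM forecaster in Keras is given below; the window length, layer sizes, and the synthetic series are placeholder assumptions, not the architecture or the TikTok data used in the thesis.

    ```python
    # A minimal Bidirectional LSTM forecasting sketch in Keras (illustrative only).
    import numpy as np
    import tensorflow as tf

    def make_windows(series, window=14):
        """Turn a 1-D popularity series into (window -> next value) training pairs."""
        X, y = [], []
        for i in range(len(series) - window):
            X.append(series[i:i + window])
            y.append(series[i + window])
        return np.array(X)[..., None], np.array(y)

    series = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.randn(400)  # toy series
    X, y = make_windows(series)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X.shape[1], 1)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)

    print("one-step forecast:", float(model.predict(X[-1:], verbose=0)))
    ```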
  • Wei, Haoyu (2022)
    Ultrasonic guided Lamb waves can be used to monitor the structural condition of pipes and other equipment in industry. An example is detecting accumulated precipitation on the surface of pipes in a non-destructive and non-invasive way. The propagation of Lamb waves in a pipe is influenced by fouling on its surface, which makes fouling detection possible. In addition, the multiple helical propagation paths around the pipe structure provide rich information that allows spatial localization of the fouled area. Gaussian Processes (GPs) are widely used tools for estimating unknown functions. In this thesis, we propose machine learning models for fouling detection and spatial localization on potentially fouled pipes based on GPs. The research aims to develop a systematic machine learning approach for ultrasonic detection, interpret fouling observations from wave signals, and reconstruct fouling distribution maps from the observations. The Lamb wave signals are generated in physics experiments. We developed a Gaussian Process Regression model as a detector to determine whether each propagation path crosses the fouling or not, based on a comparison with a clean pipe. This binary classification can be regarded as one case of the different fouling observations. Latent variable Gaussian Process models are deployed to model the observations over the unknown fouling map. Hamiltonian Monte Carlo sampling is then utilized to perform full Bayesian inference for the GP hyperparameters. Thus, the fouling map can be reconstructed based on the estimated parameters. We investigate different latent variable GP models for different fouling observation cases. In this thesis, we present the first unsupervised machine learning methods for fouling detection and localization on the surface of a pipe based on guided Lamb waves. We evaluate the performance of our methods with a collection of synthetic data and also study the effect of noise on the localization accuracy.
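    The detector idea can be sketched as follows: fit a GP to a clean-pipe reference signal and flag a propagation path as fouled when its signal falls outside the GP's predictive band. The signals, kernel, and decision threshold below are invented placeholders, not the thesis's experimental data or models.

    ```python
    # A hedged Gaussian-process detector sketch: compare a path signal against a GP
    # fitted on clean-pipe measurements and flag large deviations.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(1)
    x_clean = np.linspace(0, 1, 40)[:, None]           # e.g. a normalised time axis
    y_clean = np.sin(6 * x_clean).ravel() + 0.05 * rng.standard_normal(40)

    gp = GaussianProcessRegressor(kernel=RBF(0.1) + WhiteKernel(0.01), normalize_y=True)
    gp.fit(x_clean, y_clean)

    # a "measured" path signal, attenuated as if crossing fouling
    y_path = 0.6 * np.sin(6 * x_clean).ravel() + 0.05 * rng.standard_normal(40)
    mean, std = gp.predict(x_clean, return_std=True)
    fouled = np.mean(np.abs(y_path - mean) > 2 * std) > 0.3   # crude decision rule
    print("path flagged as fouled:", bool(fouled))
    ```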
  • Säkkinen, Niko (2020)
    Predicting patient deterioration in an Intensive Care Unit (ICU) effectively is a critical health care task serving patient health and resource allocation. At times, the task may be highly complex for a physician, yet high-stakes and time-critical decisions need to be made based on it. In this work, we investigate the ability of a set of machine learning models to algorithmically predict the future occurrence of in-hospital death based on Electronic Health Record (EHR) data of ICU patients. For one, we assess the generalizability of the models. We do this by evaluating the models on hospitals whose data has not been considered when training the models. For another, we consider the case in which we have access to some EHR data for the patients treated at a hospital of interest. In this setting, we assess how EHR data from other hospitals can be used in the optimal way to improve the prediction accuracy. This study is important for the deployment and integration of such predictive models in practice, e.g., for real-time algorithmic deterioration prediction for clinical decision support. In order to address these questions, we use the eICU collaborative research database, which contains EHRs of patients treated at a heterogeneous collection of hospitals in the United States. In this work, we use patient demographics, vital signs and the Glasgow coma score as the predictors. We devise and describe three computational experiments to test the generalization in different ways. The models used are the random forest, gradient boosted trees and a long short-term memory network. In our first experiment concerning the generalization, we show that, with the chosen limited set of predictors, the models generalize reasonably across hospitals and that only a small data mismatch is observed. Moreover, with this setting, our second experiment shows that the model performance does not significantly improve when increasing the heterogeneity of the training set. Given these observations, our third experiment shows that
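    As a rough sketch of the tree-based baselines mentioned above, the snippet below trains a random forest and gradient-boosted trees on synthetic tabular features standing in for demographics, vital signs and the Glasgow coma score; the eICU data itself is not used here.

    ```python
    # A minimal sketch of the tabular baseline models (random forest, gradient-boosted
    # trees) on synthetic imbalanced data; not the thesis's feature set or experiments.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # stand-in for demographics + vital signs + Glasgow coma score features
    X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for model in (RandomForestClassifier(n_estimators=200, random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(type(model).__name__, "AUC:", round(auc, 3))
    ```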
  • Cauchi, Daniel (2023)
    Alignment in genomics is the process of finding the positions where DNA strings fit best with one another, that is, where there are the fewest differences if they were placed side by side. This process, however, remains very computationally intensive, even with more recent algorithmic advancements in the field. Pseudoalignment is emerging as an inexpensive alternative to full alignment, both in terms of the memory needed and in terms of power consumption. The idea is to instead check for the existence of substrings within the target DNA, and this has been shown to produce good results for many use cases. New methods for pseudoalignment are still evolving, and the goal of this thesis is to provide an implementation that massively parallelises the current state of the art, Themisto, by using all available resources. The most intensive parts of the pipeline are put on the GPU, while the components that run on the CPU are heavily parallelised. Reading and writing of the files is also done in parallel, so that parallel I/O can be taken advantage of as well. Results on the Mahti supercomputer, using an NVIDIA A100, show a 10-fold end-to-end querying speedup over the best run of Themisto, while using half as many CPU cores as Themisto, on the dataset used in this thesis.
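    The substring-existence idea can be illustrated with a toy k-mer lookup: a read is "pseudoaligned" to whichever reference covers the largest share of its k-mers. This is only a conceptual sketch; Themisto's actual index is a succinct coloured de Bruijn graph, not a Python set.

    ```python
    # Toy pseudoalignment: check which reference's k-mer set covers most of a read's k-mers.
    def kmers(seq, k=5):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    references = {
        "refA": "ACGTACGTGGATCCAGTACGTTAGC",
        "refB": "TTGCAAGGCTTACGGATATCCGGAA",
    }
    index = {name: kmers(seq) for name, seq in references.items()}

    read = "ACGTGGATCCAGTA"
    read_kmers = kmers(read)
    scores = {name: len(read_kmers & ref_kmers) / len(read_kmers)
              for name, ref_kmers in index.items()}
    print("pseudoalignment scores:", scores)   # the read maps to the best-covered reference
    ```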
  • Laanti, Topi (2022)
    The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method of arranging sequences of, for example, DNA or RNA to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance. Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of the Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph. As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphics Processing Units (GPUs) in genome data processing. This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide up to 8 times speedups for specific operations, but for some other operations it has severe limitations that limit its processing power in terms of succinct de Bruijn graph index construction.
  • Lavikka, Kari (2020)
    Visualization is an indispensable method in the exploration of genomic data. However, the current state of the art in genome browsers – a class of interactive visualization tools – limits exploration by coupling the visual representations with specific file formats. Because the tools do not support the exploration of the visualization design space, they are difficult to adapt to atypical data. Moreover, although the tools provide interactivity, the implementations are often rudimentary, encumbering the exploration of the data. This thesis introduces GenomeSpy, an interactive genome visualization tool that improves upon the current state of the art by providing better support for exploration. The tool uses a visualization grammar that allows for implementing novel visualization designs, which can display the underlying data more effectively. Moreover, the tool implements GPU-accelerated interactions that better support navigation in the genomic space. For instance, smoothly animated transitions between loci or sample sets improve the perception of causality and help users stay in the flow of exploration. The expressivity of the visualization grammar and the benefit of fluid interactions are validated with two case studies. The case studies demonstrate visualization of high-grade serous ovarian cancer data at different analysis phases. First, GenomeSpy is used to create a tool for scrutinizing raw copy-number variation data along with segmentation results. Second, the segmentations, along with point mutations, are used in a GenomeSpy-based multi-sample visualization that allows for exploring and comparing both multiple data dimensions and samples at the same time. Although the focus has been on cancer research, the tool could be applied to other domains as well.
  • Sainio, Rita Anniina (2023)
    Node classification is an important problem on networks in many different contexts. Optimizing the graph embedding has great potential to help improve the classification accuracy. The purpose of this thesis is to explore how graph embeddings can be exploited in the node classification task in the context of citation networks. More specifically, this thesis looks into the impact of different kinds of embeddings on node classification, comparing their performance. Using three different similarity functions and embedding dimensions ranging from 1 to 800, we examined the impact of graph embeddings on accuracy in node classification using three benchmark datasets: Cora, Citeseer, and PubMed. Our experimental results indicate that there are some common tendencies in the way dimensionality impacts the graph embedding quality regardless of the graph. We also established that some network-specific hyperparameter tuning clearly affects classification accuracy.
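    A hedged sketch of the embed-then-classify pipeline is given below: a spectral embedding of a toy two-community graph is fed to a logistic-regression classifier trained on a handful of labelled nodes. The graph, embedding dimension, and classifier are illustrative choices, not those compared in the thesis.

    ```python
    # Embed a graph, then classify nodes from a few labels (illustrative pipeline).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.manifold import SpectralEmbedding

    rng = np.random.default_rng(0)
    n = 40
    # toy graph: two 20-node communities with dense intra- and sparse inter-links
    A = (rng.random((n, n)) < 0.05).astype(float)
    A[:20, :20] = (rng.random((20, 20)) < 0.4)
    A[20:, 20:] = (rng.random((20, 20)) < 0.4)
    A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops
    y = np.array([0] * 20 + [1] * 20)

    emb = SpectralEmbedding(n_components=8, affinity="precomputed").fit_transform(A)

    # five labelled nodes per class, the rest are test nodes
    train = np.concatenate([rng.choice(20, 5, replace=False),
                            20 + rng.choice(20, 5, replace=False)])
    test = np.setdiff1d(np.arange(n), train)
    clf = LogisticRegression(max_iter=1000).fit(emb[train], y[train])
    print("node classification accuracy:", clf.score(emb[test], y[test]))
    ```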
  • Koli, Jaakko (2022)
    Humans constantly need to reason about the unknown by utilising similar existing knowledge, as well as to explore the unknown to gather more information for the future. In this thesis, I investigate this kind of human exploration and extrapolation in simple conceptual and spatial tasks using Bayesian optimisation. My work extends the paper by Wu et al., Similarities and differences in spatial and non-spatial cognitive maps [Wu et al., 2020], in which they model human exploration and extrapolation with Bayesian optimisation, using an acquisition function and an activation function to represent human exploration and a Gaussian process to model the participant's belief about the environment based on the knowledge they acquire. Their model consists of a Gaussian process with a Radial Basis Function (RBF) kernel, an Upper Confidence Bound (UCB) acquisition function, and a softmax activation function to transform the output of the acquisition function. The model has three free parameters: the length scale of the RBF kernel λ, describing the extent of generalisation; the exploration bonus of UCB sampling β; and the temperature of the softmax activation function τ [Wu et al., 2020]. I extend their work by allowing the length scale parameter λ of the RBF kernel to change as participants explore the presented space and gather more information. This models how the participants learn the extent of generalisation as they explore the space and gain more knowledge of the underlying environment. The model with a changing length scale parameter improved the goodness of fit compared to the model used by Wu et al. [Wu et al., 2020], but it failed to capture all of the behavioural differences between spatial and conceptual tasks. It is possible that the values estimated for the length scale parameter λ absorbed information that would otherwise have allowed the other parameters τ and β to capture the differences between the spatial and conceptual tasks. This thesis provides a basis for further research on human exploration and extrapolation utilising Bayesian optimisation with a changing degree of generalisation, where the aforementioned shortcomings could be mitigated, for example, by designing the experiment in a way that provides more information about the participant's belief about the environment during each trial.
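    A minimal sketch of this kind of choice model is shown below: a GP with an RBF kernel gives posterior means and uncertainties over the options, UCB combines them with an exploration bonus β, and a softmax with temperature τ turns UCB values into choice probabilities. The parameter values and toy data are arbitrary, not fitted to participant behaviour.

    ```python
    # GP + UCB + softmax choice model sketch (Wu et al.-style; parameters are placeholders).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    lam, beta, tau = 1.0, 0.5, 0.1                 # length scale, exploration bonus, temperature

    options = np.arange(0, 30)[:, None]            # a 1-D grid of options ("spatial" positions)
    observed_x = np.array([[3], [12], [22]])       # options the participant has already tried
    observed_r = np.array([20.0, 45.0, 30.0])      # rewards they received

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=lam), alpha=1e-2)
    gp.fit(observed_x, observed_r)
    mu, sigma = gp.predict(options, return_std=True)

    ucb = mu + beta * sigma                        # upper confidence bound per option
    p = np.exp((ucb - ucb.max()) / tau)            # softmax with temperature tau
    p /= p.sum()
    print("most likely next choice:", int(options[np.argmax(p), 0]))
    ```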
  • Tobaben, Marlon (2022)
    Using machine learning to improve health care has gained popularity. However, most research in machine learning for health has ignored privacy attacks against the models. Differential privacy (DP) is the state-of-the-art concept for protecting individuals' data from privacy attacks. Using optimization algorithms such as the DP stochastic gradient descent (DP-SGD), one can train deep learning models under DP guarantees. This thesis analyzes the impact of changes to the hyperparameters and the neural architecture on the utility/privacy tradeoff, the main tradeoff in DP, for models trained on the MIMIC-III dataset. The analyzed hyperparameters are the noise multiplier, clipping bound, and batch size. The experiments examine neural architecture changes regarding the depth and width of the model, activation functions, and group normalization. The thesis reports the impact of the individual changes independently of other factors using Bayesian optimization and thus overcomes the limitations of earlier work. For the analyzed models, the utility is more sensitive to changes to the clipping bound than to the other two hyperparameters. Furthermore, the privacy/utility tradeoff does not improve when allowing for more training runtime. The changes to the width and depth of the model have a higher impact than other modifications of the neural architecture. Finally, the thesis discusses the impact of the findings and limitations of the experiment design and recommends directions for future work.
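    A hedged sketch of the mechanics behind DP-SGD (per-example gradient clipping to a bound C followed by Gaussian noise scaled by the noise multiplier) is shown below on a toy logistic-regression problem; the hyperparameter values are placeholders, not those tuned on MIMIC-III in the thesis, and no privacy accounting is performed.

    ```python
    # DP-SGD mechanics on toy logistic regression: clip per-example gradients, add noise.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 256, 10
    X = rng.standard_normal((n, d))
    y = (X @ rng.standard_normal(d) > 0).astype(float)
    w = np.zeros(d)

    noise_multiplier, clip_bound, lr, batch_size = 1.1, 1.0, 0.1, 64

    for step in range(100):
        idx = rng.choice(n, batch_size, replace=False)
        p = 1 / (1 + np.exp(-X[idx] @ w))                    # logistic predictions
        per_example_grads = (p - y[idx])[:, None] * X[idx]   # one gradient row per example
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads / np.maximum(1.0, norms / clip_bound)   # clip to C
        noise = rng.normal(0, noise_multiplier * clip_bound, size=d)        # Gaussian noise
        grad = (clipped.sum(axis=0) + noise) / batch_size
        w -= lr * grad

    acc = ((1 / (1 + np.exp(-X @ w)) > 0.5) == y).mean()
    print("training accuracy under (sketch) DP-SGD:", round(acc, 3))
    ```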
  • Vesalainen, Ari (2022)
    Digitization has changed historical research. Materials are readily available, and online archives make it easier to find the relevant information and speed up the search for it. The remaining challenge is how to use modern digital methods to analyze the text of historical documents in more detail. This is an active research topic in the digital humanities and in computer science. Document layout analysis is an area where computer vision object detection methods can be applied to historical documents to identify the objects (i.e., page elements) present on document pages. Recent developments in deep-learning-based computer vision provide excellent tools for this purpose. However, most reviewed systems focus on coarse-grained methods, where only the high-level page elements are detected (e.g., text, figures, tables). Fine-grained detection methods are required to analyze texts on a more detailed level; for example, footnotes and marginalia must be distinguished from the body text to enable proper analysis. This thesis studies how image segmentation techniques can be used for fine-grained OCR document layout analysis: how fine-grained page segmentation and region classification systems can be implemented in practice, and what the accuracy and the main challenges of such a system are. The thesis includes implementing a layout analysis model that uses an instance segmentation method (Mask R-CNN). This implementation is compared against an existing layout analysis approach that uses a semantic segmentation method (a U-net-based P2PaLA implementation).
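    A minimal instance-segmentation sketch with torchvision's Mask R-CNN is shown below; the class count stands in for page-element types (e.g., body text, marginalia, footnote, figure) and the random image is a placeholder for a scanned page. Training on annotated pages, which the thesis requires, is omitted, and torchvision 0.13 or newer is assumed.

    ```python
    # Mask R-CNN instance segmentation sketch for page elements (untrained placeholder model).
    import torch
    import torchvision

    # e.g. background + body text, marginalia, footnote, figure
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None, num_classes=5)
    model.eval()

    page = torch.rand(3, 1024, 768)            # stand-in for a scanned page image
    with torch.no_grad():
        pred = model([page])[0]                # dict with 'boxes', 'labels', 'scores', 'masks'
    print({k: tuple(v.shape) for k, v in pred.items()})
    ```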
  • Rintaniemi, Ari-Heikki (2024)
    In this thesis, a Retrieval-Augmented Generation (RAG) based Question Answering (QA) system is implemented. The RAG framework is composed of three components: a data storage, a retriever and a generator. To evaluate the performance of the system, a QA dataset is created from Prime Minister Orpo's Government Programme. The QA pairs are created by a human and also generated using transformer-based language models. Experiments are conducted using the created QA dataset to evaluate different options for implementing the retriever (both traditional algorithmic methods and transformer-based language models) and the generator (transformer-based language models). The language model options used in the generator component are the same that were used for generating QA pairs for the QA dataset. Mean reciprocal rank (MRR) and semantic answer similarity (SAS) are used to measure the performance of the retriever and generator components, respectively. The SAS metric turns out to be useful for providing an aggregated view of the performance of the QA system, but it is not an optimal evaluation metric for every scenario identified in the results of the experiments. Inference costs of the system are also analysed, as commercial language models are included in the evaluation. Analysis of the created QA dataset shows that the language models generate questions that tend to reveal information from the underlying paragraphs, or questions that do not provide enough context, making them difficult for the QA system to answer. The human-created questions are diverse and thus more difficult to answer than the language-model-generated questions. The QA pair source affects the results: the language models used in the generator component receive, on average, high-scoring answers to QA pairs which they had themselves generated. In order to create a high-quality QA dataset for QA system evaluation, human effort is needed for creating the QA pairs, but prompt engineering could also provide a way to generate more usable QA pairs. Evaluation approaches for the generator component need further research in order to find alternatives that would provide an unbiased view of the performance of the QA system.
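    The retriever-evaluation idea can be sketched with a TF-IDF retriever scored by mean reciprocal rank (MRR), as below; the passages, questions, and relevance labels are invented placeholders, not the Government Programme data or the retrievers compared in the thesis.

    ```python
    # TF-IDF retrieval over passages plus MRR scoring (illustrative data).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passages = [
        "The government will invest in research and development.",
        "Income taxation will be reduced over the government term.",
        "Public healthcare services will be reorganised regionally.",
    ]
    questions = ["How will taxation change?", "What happens to healthcare services?"]
    relevant = [1, 2]   # index of the passage that answers each question

    vec = TfidfVectorizer().fit(passages)
    P = vec.transform(passages)

    reciprocal_ranks = []
    for q, rel in zip(questions, relevant):
        sims = cosine_similarity(vec.transform([q]), P).ravel()
        ranking = np.argsort(-sims)                      # passages from best to worst
        rank = int(np.where(ranking == rel)[0][0]) + 1   # 1-based rank of the relevant passage
        reciprocal_ranks.append(1.0 / rank)

    print("MRR:", np.mean(reciprocal_ranks))
    ```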
  • Lauha, Patrik (2021)
    Automatic bird sound recognition has been studied by computer scientists since the late 1990s. Various techniques have been exploited, but no general method that comes even close to matching the performance of a human expert has yet been developed. In this thesis, the subject is approached by reviewing alternatives and refinements to cross-correlation as a similarity measure between two signals in template-based bird sound recognition models. Template-specific binary classification models are fitted with different methods and their performance is compared. The methods considered are template averaging and processing before applying cross-correlation, the use of texture features as additional predictors, and feature extraction through transfer learning with convolutional neural networks. It is shown that the classification performance of template-specific models can be improved by template refinement and by utilizing neural networks' ability to automatically extract relevant features from bird sound spectrograms.
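    The baseline similarity measure, cross-correlation between a template spectrogram and a recording spectrogram, can be sketched as below; the arrays are random placeholders with an embedded "call", not real bird recordings, and the normalisation is a simplification.

    ```python
    # Cross-correlation of mean/variance-normalised spectrograms as a template similarity measure.
    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    template = rng.random((20, 10))                       # frequency bins x time frames
    recording = rng.random((20, 200))
    recording[:, 120:130] += template                     # embed the "call" into the recording

    def normalise(a):
        return (a - a.mean()) / (a.std() + 1e-9)

    corr = correlate2d(normalise(recording), normalise(template), mode="valid").ravel()
    corr /= template.size
    print("best match at frame", int(np.argmax(corr)), "score", round(float(corr.max()), 3))
    ```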
  • Barin Pacela, Vitória (2021)
    Independent Component Analysis (ICA) aims to separate the observed signals into their underlying independent components responsible for generating the observations. Most research in ICA has focused on continuous signals, while the methodology for binary and discrete signals is less developed. Yet, binary observations are equally present in various fields and applications, such as causal discovery, signal processing, and bioinformatics. In the last decade, Boolean OR and XOR mixtures have been shown to be identifiable by ICA, but such models suffer from limited expressivity, calling for new methods to solve the problem. In this thesis, "Independent Component Analysis for Binary Data", we estimate the mixing matrix of ICA from binary observations and an additionally observed auxiliary variable by employing a linear model inspired by the Identifiable Variational Autoencoder (iVAE), which exploits the non-stationarity of the data. The model is optimized with a gradient-based algorithm that uses second-order optimization with limited memory, resulting in a training time in the order of seconds for the particular study cases. We investigate which conditions can lead to the reconstruction of the mixing matrix, concluding that the method is able to identify the mixing matrix when the number of observed variables is greater than the number of sources. In such cases, the linear binary iVAE can reconstruct the mixing matrix up to order and scale indeterminacies, which are considered in the evaluation with the Mean Cosine Similarity Score. Furthermore, the model can reconstruct the mixing matrix even under a limited sample size. Therefore, this work demonstrates the potential for applications in real-world data and also offers a possibility to study and formalize identifiability in future work. In summary, the most important contributions of this thesis are the empirical study of the conditions that enable the mixing matrix reconstruction using the binary iVAE, and the empirical results on the performance and efficiency of the model. The latter was achieved through a new combination of existing methods, including modifications and simplifications of a linear binary iVAE model and the optimization of such a model under limited computational resources.
  • Hovhannisyan, Karen (2023)
    Microbial growth dynamics play an important role in virtually any ecosystem. To know the underlying laws of growth would help in understanding how bacteria interact with each other and their environment. In this thesis we try to automate the process of scientific discovery of said dynamics, via symbolic regression. It has historically been implemented with genetic algorithms, and although many of the new implementations have different approaches, we stick with a highly optimized genetic-programming based package. Whatever the approach, the purpose of symbolic regression is to search for a mathematical expression that explains a response variable. We test the highly interpretable machine learning method on several datasets, each generated to mimic certain patterns of growth. Our findings confirm its ability to reverse-engineer theory from data. Even when the generating equations contain the latent nutrient variable, whose dynamics are not observable through the raw data, symbolic regression is able to find an analytically correct reparametrization and exact solution. In this thesis we discuss these results and give an overview of symbolic regression and its applications.
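    An illustrative genetic-programming symbolic-regression run is sketched below using the gplearn package; gplearn is chosen here only as an example and is not necessarily the package used in the thesis, and the target function is a toy Monod-like growth curve rather than the datasets generated for the thesis.

    ```python
    # Symbolic regression with genetic programming (gplearn assumed to be installed).
    import numpy as np
    from gplearn.genetic import SymbolicRegressor

    rng = np.random.default_rng(0)
    t = rng.uniform(0, 10, size=(300, 1))
    y = 5 * t.ravel() / (2 + t.ravel()) + 0.05 * rng.standard_normal(300)   # Monod-like growth

    est = SymbolicRegressor(
        population_size=2000,
        generations=20,
        function_set=("add", "sub", "mul", "div"),
        parsimony_coefficient=0.001,    # penalise overly long expressions
        random_state=0,
    )
    est.fit(t, y)
    print("discovered expression:", est._program)
    ```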