
Browsing by master's degree program "Master's Programme in Data Science"

  • Laanti, Topi (2022)
    The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance. Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of the Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph. As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphics Processing Units (GPUs) in genome data processing. This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide speedups of up to 8 times for specific operations, but has severe limitations for some other operations that constrain its usefulness in succinct de Bruijn graph index construction. (A toy dictionary-based de Bruijn graph construction is sketched after this listing.)
  • Lavikka, Kari (2020)
    Visualization is an indispensable method in the exploration of genomic data. However, the current state of the art in genome browsers – a class of interactive visualization tools – limits exploration by coupling the visual representations with specific file formats. Because the tools do not support the exploration of the visualization design space, they are difficult to adapt to atypical data. Moreover, although the tools provide interactivity, the implementations are often rudimentary, encumbering the exploration of the data. This thesis introduces GenomeSpy, an interactive genome visualization tool that improves upon the current state of the art by providing better support for exploration. The tool uses a visualization grammar that allows for implementing novel visualization designs, which can display the underlying data more effectively. Moreover, the tool implements GPU-accelerated interactions that better support navigation in the genomic space. For instance, smoothly animated transitions between loci or sample sets improve the perception of causality and help the users stay in the flow of exploration. The expressivity of the visualization grammar and the benefit of fluid interactions are validated with two case studies. The case studies demonstrate visualization of high-grade serous ovarian cancer data at different analysis phases. First, GenomeSpy is used to create a tool for scrutinizing raw copy-number variation data along with segmentation results. Second, the segmentations along with point mutations are used in a GenomeSpy-based multi-sample visualization that allows for exploring and comparing both multiple data dimensions and samples at the same time. Although the focus has been on cancer research, the tool could be applied to other domains as well.
  • Sainio, Rita Anniina (2023)
    Node classification is an important problem on networks in many different contexts. Optimizing the graph embedding has great potential to help improve the classification accuracy. The purpose of this thesis is to explore how graph embeddings can be exploited in the node classification task in the context of citation networks. More specifically, this thesis looks into the impact of different kinds of embeddings on node classification, comparing their performance. Using three different similarity functions and embedding vector dimensions ranging from 1 to 800, we examined the impact of graph embeddings on accuracy in node classification using three benchmark datasets: Cora, Citeseer, and PubMed. Our experimental results indicate that there are some common tendencies in the way dimensionality impacts the graph embedding quality regardless of the graph. We also established that some network-specific hyperparameter tuning clearly affects classification accuracy. (A toy example of similarity-based node classification in embedding space is sketched after this listing.)
  • Martti, Jyrki (2024)
    It is increasingly common for individuals to measure their daily activities with different kinds of wearable devices such as smart watches and smart rings. These devices can give individuals insight into, for example, their sleep and sport activities. Many of these devices utilize an optically obtained photoplethysmogram (PPG) for estimating, among other things, the heart rate (HR) of an individual. The PPG based approach is commonly used, as it is easy and cost efficient to implement. However, even if the PPG based HR estimates can have an error as small as 1.23 beats per minute (BPM) when the individual remains stationary, the estimates can be highly inaccurate if the individual is moving extensively. The HR values estimated by a PPG based device can indicate an HR that is as much as 40% higher than the true HR. This is hardly desirable for recreational activities and not acceptable for any medical surveillance. To address this challenge, we have used a distributed machine learning (ML) approach called federated learning (FL) for calibrating the PPG based HR estimates. We chose the FL approach because no local training data needs to be sent to a central location, which significantly helps protect the individual's privacy. In the FL approach there exists a central authority that delivers a global model to all the FL clients. After training the local models, the FL clients send the local model weights to the central location for forming the next version of the global model. Our main research objective was to compare the FL performance with the conventional ML approach where the ML model is trained centrally. Our results show that we can achieve similar results with FL as with centralized learning (CL). In addition, we observed that with local learning (LL), a derivative of the FL approach, we can achieve even better results than with FL or CL. As one of the challenges, and as one of the key findings, we observed that the FL approach carries an FL-specific risk of overfitting the local ML models, which can easily corrupt the global model if this challenge is not properly addressed.
  • Koli, Jaakko (2022)
    Humans constantly need to reason about the unknown by utilising similar existing knowledge, as well as to explore the unknown to gather more information for the future. I investigate this kind of human exploration and extrapolation in simple conceptual and spatial tasks in this thesis using Bayesian optimisation. My work extends the paper by Wu et al., Similarities and differences in spatial and non-spatial cognitive maps [Wu et al., 2020], where they model human exploration and extrapolation with Bayesian optimisation, using an acquisition function and an activation function to represent human exploration and a Gaussian process to model the participant's belief of the environment based on the knowledge they acquire. Wu et al. use Bayesian optimisation to model human behaviour in these tasks as their main model of choice. Their model consists of a Gaussian process with a Radial Basis Function (RBF) kernel, an Upper Confidence Bound (UCB) acquisition function and a softmax activation function to transform the output of the acquisition function. Their model has three free parameters: the length scale of the RBF kernel λ describing the extent of generalisation, the exploration bonus of UCB sampling β and the temperature of the softmax activation function τ [Wu et al., 2020]. I attempt to extend their work by allowing the length scale parameter λ of the RBF kernel to change when participants explore the presented space and gather more information. This models how the participants learn the extent of generalisation as they explore the space and gain more knowledge of the underlying environment. This model with a changing length scale parameter managed to improve the goodness of fit when compared to the model used by Wu et al. [Wu et al., 2020], but it failed to capture all of the behavioural differences between spatial and conceptual tasks. It is possible that the values estimated for the length scale parameter λ could have also absorbed information that would have otherwise allowed the other parameters τ and β to capture the differences between the spatial and conceptual tasks. This thesis provides a basis for further research into human exploration and extrapolation utilising Bayesian optimisation with a changing degree of generalisation, where the aforementioned shortcomings could be mitigated, for example, by designing the experiment in a way that provides more information about the participant's belief of the environment during each trial. (A minimal sketch of the GP-UCB-plus-softmax choice rule is given after this listing.)
  • Hommy, Antwan (2024)
    Machine learning (ML) is becoming increasingly important in the telecommunications industry. The purpose of machine learning models in telecommunications is to outperform a classical receiver’s performance by fine-tuning parameters. Since ML models have the advantage of being more concise, their performance is easier to evaluate, in contrast to a classical receiver’s multiple blocks, each with its own small errors. Evaluating these models, however, is challenging, and identifying the correct parameters is not trivial either. To address this issue, a coherent and reliable hyperparameter optimization method needs to be introduced. This thesis investigates how a hyperparameter optimization method can be implemented, and which one is best suited for the problem. It looks into the value such a method provides, the metrics displayed for each hyperparameter set during training and inference, and the challenges of realising such a system, in addition to various other qualities needed for an efficient training stage. The framework aims to provide valuable insight into model accuracy, validation loss, computing cost, signal-to-noise ratio improvement, and available resources when using hyperparameter tuning. The framework uses grid search optimization, Bayesian optimization, and genetic algorithm optimization to determine which performs best and to compare the results between them. Grid search acts as a reference baseline for the performance of the two other algorithms. The thesis is split into two parts: Phase One implements the system in a sandbox-like manner, essentially acting as a testing platform to assess the implementation compatibility. Phase Two examines a more realistic scenario suited to a 5G physical layer environment. The proposed framework uses modern, widely used orchestration and development tools, including ResNet, PyTorch, and sklearn.
  • Tobaben, Marlon (2022)
    Using machine learning to improve health care has gained popularity. However, most research in machine learning for health has ignored privacy attacks against the models. Differential privacy (DP) is the state-of-the-art concept for protecting individuals' data from privacy attacks. Using optimization algorithms such as DP stochastic gradient descent (DP-SGD), one can train deep learning models under DP guarantees. This thesis analyzes the impact of changes to the hyperparameters and the neural architecture on the utility/privacy tradeoff, the main tradeoff in DP, for models trained on the MIMIC-III dataset. The analyzed hyperparameters are the noise multiplier, clipping bound, and batch size. The experiments examine neural architecture changes regarding the depth and width of the model, activation functions, and group normalization. The thesis reports the impact of the individual changes independently of other factors using Bayesian optimization and thus overcomes the limitations of earlier work. For the analyzed models, the utility is more sensitive to changes to the clipping bound than to the other two hyperparameters. Furthermore, the utility/privacy tradeoff does not improve when allowing for more training runtime. The changes to the width and depth of the model have a higher impact than other modifications of the neural architecture. Finally, the thesis discusses the impact of the findings and limitations of the experiment design and recommends directions for future work. (The DP-SGD per-example clipping and noising step that these hyperparameters control is sketched after this listing.)
  • Vesalainen, Ari (2022)
    Digitization has changed history research. The materials are available, and online archives make it easier to find the correct information and speed up the search for it. The remaining challenge is how to use modern digital methods to analyze the text of historical documents in more detail. This is an active research topic in digital humanities and computer science. Document layout analysis is the task where computer vision object detection methods can be applied to historical documents to identify the objects (i.e., page elements) present on document pages. The recent development in deep learning based computer vision provides excellent tools for this purpose. However, most reviewed systems focus on coarse-grained methods, where only the high-level page elements are detected (e.g., text, figures, tables). Fine-grained detection methods are required to analyze texts on a more detailed level; for example, footnotes and marginalia must be distinguished from the body text to enable proper analysis. The thesis studies how image segmentation techniques can be used for fine-grained OCR document layout analysis: how can fine-grained page segmentation and region classification systems be implemented in practice, and what are the accuracy and the main challenges of such a system? The thesis includes implementing a layout analysis model that uses an instance segmentation method (Mask R-CNN). This implementation is compared against an existing layout analysis system that uses a semantic segmentation method (the U-net based P2PaLA implementation).
  • Rintaniemi, Ari-Heikki (2024)
    In this thesis a Retrieval-Augmented Generation (RAG) based Question Answering (QA) system is implemented. The RAG framework is composed of three components: a data store, a retriever and a generator. To evaluate the performance of the system, a QA dataset is created from Prime Minister Orpo's Government Programme. The QA pairs are created by humans and also generated by using transformer-based language models. Experiments are conducted using the created QA dataset to evaluate different options for implementing the retriever (both traditional algorithmic methods and transformer-based language models) and the generator (transformer-based language models). The language model options used in the generator component are the same that were used for generating QA pairs for the QA dataset. Mean reciprocal rank (MRR) and semantic answer similarity (SAS) are used to measure the performance of the retriever and the generator component, respectively. The SAS metric turns out to be useful for providing an aggregate-level view of the performance of the QA system, but it is not an optimal evaluation metric for every scenario identified in the results of the experiments. Inference costs of the system are also analysed, as commercial language models are included in the evaluation. Analysis of the created QA dataset shows that the language models generate questions that tend to reveal information from the underlying paragraphs, or that do not provide enough context, making the questions difficult to answer for the QA system. The human-created questions are diverse and thus more difficult to answer than the language model generated questions. The QA pair source affects the results: the language models used in the generator component receive, on average, high-scoring answers to QA pairs which they had themselves generated. In order to create a high-quality QA dataset for QA system evaluation, human effort is needed for creating the QA pairs, but prompt engineering could also provide a way to generate more usable QA pairs. Evaluation approaches for the generator component need further research in order to find alternatives that would provide an unbiased view of the performance of the QA system. (A minimal MRR computation for the retriever evaluation is sketched after this listing.)
  • Tulijoki, Juha-Pekka (2024)
    A tag is a freely chosen keyword that a user attaches to an item. Offering a simple, cheap, and natural way to describe content, tagging has become popular in contemporary web applications. The tag genome is a data structure that contains item-tag relevance scores, i.e., continuous scale numbers from 0 to 1 indicating how relevant a tag is for an item. For example, the tag romantic comedy has a relevance score of 0.97 for the movie Love Actually. With sufficient data, a tag genome dataset can be constructed for any domain. To the best of available knowledge, there are tag genome datasets for movies and books. The tag genome for movies is used in a movie recommender and for various purposes in recommender systems research, such as detecting filter bubbles and serendipity. Creating a diverse tag genome dataset requires an effective machine learning solution, as manual assessment of item-tag relevance scores is impractical. The current state-of-the-art solution, called TagDL, feeds features extracted from user-generated tags, reviews, and ratings into a multilayer perceptron architecture to predict the item-tag relevance scores. This study aims to enhance TagDL by extracting more features from the embeddings of textual content, namely tags, user reviews, and item titles, using Bidirectional Encoder Representations from Transformers (BERT). The results show that features based on BERT embeddings have a potential positive impact on item-tag relevance score prediction. However, the results do not generalize to both tag genome datasets, improving the results only for the movie dataset. This may indicate that the new features have a stronger impact if the amount of available training data is smaller, as with the movie dataset. Moreover, this thesis discusses future work ideas and implementation possibilities.
  • Trangcasanchai, Sathianpong (2024)
    Large language models (LLMs) have been proven to be state-of-the-art solutions for many NLP benchmarks. However, LLMs in real applications face many limitations. Although such models are seen to contain real-world knowledge, it is kept implicitly in their parameters and cannot be revised or extended unless expensive additional training is performed. These models can hallucinate by confidently producing human-like texts which might contain misleading information. The knowledge limitation and the tendency to hallucinate cause LLMs to struggle with out-of-domain settings. Furthermore, LLMs lack transparency in that their responses are products of big black-box models. While fine-tuning can mitigate some of these issues, it requires high computing resources. On the other hand, retrieval augmentation has been used to tackle knowledge-intensive tasks and has been proven by recent studies to be effective when coupled with LLMs. In this thesis, we explore Retrieval-Augmented Generation (RAG), a framework to augment generative LLMs with a neural retriever component, in a domain-specific question answering (QA) task. Empirically, we study how RAG helps LLMs in knowledge-intensive situations and explore design decisions in building a RAG pipeline. Our findings underscore the benefits of RAG in the studied situation by showing that leveraging retrieval augmentation yields significant improvement on QA performance over using a pre-trained LLM alone. Furthermore, incorporating RAG in an LLM-driven QA pipeline results in a QA system that accompanies its predictions with evidence documents, leading to more trustworthy and grounded AI applications. (A bare-bones retrieve-and-prompt step is sketched after this listing.)
  • Lauha, Patrik (2021)
    Automatic bird sound recognition has been studied by computer scientists since the late 1990s. Various techniques have been exploited, but no general method that could even nearly match the performance of a human expert has been developed yet. In this thesis, the subject is approached by reviewing alternative methods for cross-correlation as a similarity measure between two signals in template-based bird sound recognition models. Template-specific binary classification models are fit with different methods and their performance is compared. The contemplated methods are template averaging and processing before applying cross-correlation, the use of texture features as additional predictors, and feature extraction through transfer learning with convolutional neural networks. It is shown that the classification performance of template-specific models can be improved by template refinement and by utilizing neural networks’ ability to automatically extract relevant features from bird sound spectrograms.
  • Barin Pacela, Vitória (2021)
    Independent Component Analysis (ICA) aims to separate the observed signals into their underlying independent components responsible for generating the observations. Most research in ICA has focused on continuous signals, while the methodology for binary and discrete signals is less developed. Yet, binary observations are equally present in various fields and applications, such as causal discovery, signal processing, and bioinformatics. In the last decade, Boolean OR and XOR mixtures have been shown to be identifiable by ICA, but such models suffer from limited expressivity, calling for new methods to solve the problem. In this thesis, "Independent Component Analysis for Binary Data", we estimate the mixing matrix of ICA from binary observations and an additionally observed auxiliary variable by employing a linear model inspired by the Identifiable Variational Autoencoder (iVAE), which exploits the non-stationarity of the data. The model is optimized with a gradient-based algorithm that uses second-order optimization with limited memory, resulting in a training time in the order of seconds for the particular study cases. We investigate which conditions can lead to the reconstruction of the mixing matrix, concluding that the method is able to identify the mixing matrix when the number of observed variables is greater than the number of sources. In such cases, the linear binary iVAE can reconstruct the mixing matrix up to order and scale indeterminacies, which are considered in the evaluation with the Mean Cosine Similarity Score. Furthermore, the model can reconstruct the mixing matrix even under a limited sample size. Therefore, this work demonstrates the potential for applications in real-world data and also offers a possibility to study and formalize identifiability in future work. In summary, the most important contributions of this thesis are the empirical study of the conditions that enable the mixing matrix reconstruction using the binary iVAE, and the empirical results on the performance and efficiency of the model. The latter was achieved through a new combination of existing methods, including modifications and simplifications of a linear binary iVAE model and the optimization of such a model under limited computational resources.
  • Hovhannisyan, Karen (2023)
    Microbial growth dynamics play an important role in virtually any ecosystem. To know the underlying laws of growth would help in understanding how bacteria interact with each other and their environment. In this thesis we try to automate the process of scientific discovery of said dynamics, via symbolic regression. It has historically been implemented with genetic algorithms, and although many of the new implementations have different approaches, we stick with a highly optimized genetic-programming based package. Whatever the approach, the purpose of symbolic regression is to search for a mathematical expression that explains a response variable. We test the highly interpretable machine learning method on several datasets, each generated to mimic certain patterns of growth. Our findings confirm its ability to reverse-engineer theory from data. Even when the generating equations contain the latent nutrient variable, whose dynamics are not observable through the raw data, symbolic regression is able to find an analytically correct reparametrization and exact solution. In this thesis we discuss these results and give an overview of symbolic regression and its applications.
  • Bouri, Ioanna (2019)
    In model selection, it is necessary to select a model from a set of candidate models based on some observed data. The model should fit the data well, but without being overly complex, since an overly complex model would not generalize well to unseen data. Information criteria are widely used model selection methods. They estimate a score for each candidate model, and use that score to make a selection. A common way of estimating such a score rewards the candidate model for its goodness of fit on some observed data and penalizes it for the model complexity. Many popular information criteria, such as Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC), penalize model complexity by the feature dimension. However, in a non-standard setting with inherent dependencies, these criteria are prone to over-penalizing the complexity of the model. Motivated by how these commonly used criteria tend to over-penalize, we evaluate AIC and BIC on a multi-target setting with correlated features. We compare AIC and BIC with the Fisher Information Criterion (FIC), a criterion that takes into consideration correlations amongst features and does not penalize model complexity solely by the feature dimension of the candidate model. We evaluate the feature selection and predictive performance of the three information criteria in a linear regression setting with correlated features. We evaluate the precision, recall and F1 score of the set of features each criterion selects, compared to the feature set of the generative model. Under this setting's assumptions, we find that FIC yields the best results, compared to AIC and BIC, both in the feature selection and in the predictive performance evaluation. Finally, using FIC's properties for feature selection, we derive a formulation that allows approximating the effective feature dimension of models with correlated features in linear regression settings. (The standard AIC and BIC computations for a linear regression model are sketched after this listing.)
  • Melkas, Laila (2021)
    Multiple algorithms exist for the detection of causal relations from observational data, but they are limited by their required assumptions regarding the data or by the available computational resources. Only a limited amount of information can be extracted from finite data, but domain experts often have some knowledge of the underlying processes. We propose combining an expert’s prior knowledge with data likelihood to find models with high posterior probability. Our high-level procedure for interactive causal structure discovery contains three modules: discovery of initial models, navigation in the space of causal structures, and validation for model selection and evaluation. We present one way of formulating the problem and implementing the approach, assuming a rational, Bayesian expert; this assumption is used to model the user in simulated experiments. The expert navigates greedily in the structure space, using their prior information and the structures’ fit to data to find a local maximum a posteriori structure. Existing algorithms provide initial models for the navigation. Through simulated user experiments with synthetic data and use cases with real-world data, we find that the results of causal analysis can be improved by adding prior knowledge. Additionally, different initial models can lead the expert to find different causal models, and model validation helps detect overfitting and concept drift.
  • Ylitalo, Markku (2023)
    This Master’s thesis covers the visualization process of Finnish housing and mortgage markets by referring to Tamara Munzner’s nested visualization process model [25]. The work was implemented as an assignment for the Bank of Finland, the national monetary authority and central bank of Finland. The thesis includes a literature survey in which the different stages of the visualization task are examined by relating them to previous studies, and an experimental part that describes the actual implementation steps of the visualization ensemble, an encompassing collection of interactive dashboard sheets regarding Finnish housing and mortgage markets, which was made as a supporting analysis tool for the economists of the Bank of Finland. The domain aspects of this visualization task were validated by arranging an end user survey for the economists of the Bank of Finland. Nearly a hundred open answers were collected and processed, from which the fundamental guidelines of the desired end product were formed. By following these guidelines and leaning on the know-how of the previous studies, the concrete visualization task was completed successfully. According to the gathered feedback, the visualization ensemble managed to meet the expectations of the end users comprehensively and to fulfill its essential purpose as a macroeconomic analysis tool laudably. ACM Computing Classification System (CCS): Human-centered computing → Visualization → Visualization techniques; Human-centered computing → Visualization → Empirical studies in visualization
  • Ersalan, Muzaffer Gür (2019)
    In this thesis, Convolutional Neural Networks (CNN) and Inverse Mathematics methods are discussed for automated defect detection in materials that are used for radiation detectors. The first part of the thesis is dedicated to a literature review of the methods used. These include a general overview of Neural Networks, computer vision algorithms and Inverse Mathematics methods, such as wavelet transformations and total variation denoising. The Materials and Methods section examines how these methods can be utilized in this problem setting. The Results and Discussion section presents the outcomes and takeaways from the experiments. The focus of this thesis is on the CNN architecture that fits the task best, on how to optimize that chosen CNN architecture, and on how selected inputs created by Inverse Mathematics methods influence the Neural Network and its performance. The results of this research reveal that the initially chosen Retina-Net is well suited for the task and that the Inverse Mathematics methods utilized in this thesis provided useful insights.
  • Niemi, Roope Oskari (2022)
    DeepRx is a deep learning receiver which replaces much of the functionality of a traditional 5G receiver. It is a deep model which uses residual connections and a fully convolutional architecture to process an incoming signal, and it outputs log-likelihood ratios for each bit. However, the deep model can be computationally too heavy to use in a real environment. Nokia Bell Labs has recently developed an iterative version of the DeepRx, where a model with fewer layers is used iteratively. This thesis focuses on developing a neural network which determines how many iterations the iterative DeepRx needs to use. We trained a separate neural network, the stopping condition neural network, which will be used together with the iterative model. It predicts the number of iterations the model requires to process the input correctly, with the aim that each inference uses as few iterations as possible. The model also stops the inference early if it predicts that the required number of iterations is greater than the maximum amount. Our results show that an iterative model with a stopping condition neural network has significantly fewer parameters than the deep model. The results also show that while the stopping condition neural network could predict with a high accuracy which samples could be decoded, using it also increased the uncoded bit error rate of the iterative model slightly. Therefore, using a stopping condition neural network together with an iterative model seems to be a flexible lightweight alternative to the DeepRx model.
  • Holmberg, Daniel (2022)
    The LHC particle accelerator at CERN probes the elementary building blocks of matter by colliding protons at a center-of-mass energy of √s = 13 TeV. Collimated sprays of particles arise when quarks and gluons are produced at high energies; these sprays are reconstructed from measured data and clustered together into jets. Accurate measurements of the energy of jets are paramount for sensitive particle physics analyses at the CMS experiment. Jet energy corrections are for that reason used to map measurements towards Monte Carlo simulated truth values, which are independent of detector response. The aim of this thesis is to improve upon the standard jet energy corrections by utilizing deep learning. Recent advancements on learning from point clouds in the machine learning community have been adopted in particle physics studies to improve jet flavor classification accuracy. This includes representing jet constituents as an unordered set, or a so-called “particle cloud”. Two highly performant models suitable for such data are the set-based Particle Flow Network and the graph-based ParticleNet. A natural next step in the advancement of jet energy corrections is to adopt a similar methodology, only changing the problem statement from classification to regression. The deep learning models developed in this work provide energy corrections that are generically applicable to differently flavored jets. Their performance is presented in the form of jet energy response resolution and reduction in flavor dependence. The models achieve state-of-the-art performance for both metrics, significantly surpassing the standard corrections benchmark.
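
For the Laanti (2022) thesis above: the BOSS index described there is a succinct representation of a de Bruijn graph, which cannot be reproduced here, but a toy dictionary-based de Bruijn graph over k-mers gives a feel for the underlying structure. This is a minimal sketch; the sequence and k value are made up for illustration, and it is not the Spark-based construction used in the thesis.

```python
# Toy node-centric de Bruijn graph: each (k-1)-mer points to the (k-1)-mers
# that can follow it. Illustrative stand-in only, not the succinct BOSS structure.
from collections import defaultdict

def de_bruijn_graph(sequence: str, k: int) -> dict:
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])   # edge prefix -> suffix represents this k-mer
    return dict(graph)

if __name__ == "__main__":
    for node, successors in sorted(de_bruijn_graph("ACGTACGA", k=3).items()):
        print(node, "->", sorted(successors))
```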
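For the Sainio (2023) thesis above: the exact similarity functions compared in the thesis are not restated here, but a minimal nearest-neighbour node classification in embedding space, with cosine similarity as an illustrative stand-in, shows how embedding quality feeds into classification accuracy. All data below is synthetic.

```python
# Classify a node by the majority label of its k most similar nodes in a
# precomputed embedding space (cosine similarity chosen for illustration).
import numpy as np

def classify(query, embeddings, labels, k=5):
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-12)
    top = np.argsort(-sims)[:k]                      # indices of the k most similar nodes
    values, counts = np.unique([labels[i] for i in top], return_counts=True)
    return values[np.argmax(counts)]                 # majority vote

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 1.0, (50, 16)), rng.normal(3.0, 1.0, (50, 16))])
lab = ["A"] * 50 + ["B"] * 50
print(classify(emb[0] + 0.1 * rng.normal(size=16), emb, lab))   # expected "A"
```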
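For the Koli (2022) thesis above: a minimal sketch of the choice model named in the abstract (a GP with an RBF kernel, UCB acquisition, and a softmax over options), assuming scikit-learn and arbitrary values for the free parameters λ, β and τ; these are not the values estimated in the thesis, and the observed rewards are invented.

```python
# GP posterior over a 1-D grid of options, UCB scores, and softmax choice
# probabilities; lam, beta, tau stand for the free parameters λ, β, τ.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

lam, beta, tau = 1.0, 0.5, 0.1

options = np.arange(0, 30, dtype=float).reshape(-1, 1)    # discrete options
X_obs = np.array([[5.0], [12.0], [20.0]])                  # options sampled so far
y_obs = np.array([0.3, 0.8, 0.5])                          # observed (normalised) rewards

# optimizer=None keeps the length scale fixed at lam instead of refitting it
gp = GaussianProcessRegressor(kernel=RBF(length_scale=lam), alpha=1e-3, optimizer=None)
gp.fit(X_obs, y_obs)                                       # belief about the environment
mean, std = gp.predict(options, return_std=True)

ucb = mean + beta * std                                    # Upper Confidence Bound per option
probs = np.exp(ucb / tau)
probs /= probs.sum()                                       # softmax choice probabilities
print("most likely next choice:", int(options[np.argmax(probs), 0]))
```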
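For the Tobaben (2022) thesis above: a schematic numpy version of the DP-SGD step, showing where the analysed hyperparameters enter: the clipping bound C, the noise multiplier sigma, and the batch size. This sketches the mechanism only, not the training setup or privacy accounting used in the thesis.

```python
# Per-example gradient clipping and Gaussian noising, the core of DP-SGD.
import numpy as np

def dp_sgd_step(per_example_grads, params, C, sigma, lr, rng):
    batch_size = per_example_grads.shape[0]
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, C / (norms + 1e-12))  # each row now has norm <= C
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, sigma * C, size=params.shape)
    return params - lr * noisy_sum / batch_size                          # noisy average gradient step

rng = np.random.default_rng(0)
params = np.zeros(4)
grads = rng.normal(size=(32, 4))            # 32 per-example gradients for a 4-parameter model
print(dp_sgd_step(grads, params, C=1.0, sigma=1.1, lr=0.1, rng=rng))
```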
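For the Rintaniemi (2024) thesis above: mean reciprocal rank, the retriever metric named in the abstract, reduces to a few lines. The document identifiers below are hypothetical.

```python
# MRR: for each question, score 1/rank of the first relevant passage in the
# retrieved ranking, and 0 if the relevant passage was not retrieved at all.
def mean_reciprocal_rank(rankings, relevant):
    total = 0.0
    for ranking, gold in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(relevant)

# Gold passages found at ranks 1 and 3, and missing for the third question:
# MRR = (1 + 1/3 + 0) / 3 ≈ 0.44.
print(mean_reciprocal_rank([["p1", "p7"], ["p4", "p9", "p2"], ["p5"]], ["p1", "p2", "p8"]))
```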
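For the Trangcasanchai (2024) thesis above: a bare-bones retrieve-and-prompt step showing the shape of a RAG pipeline. The passages and embeddings below are random stand-ins and the prompt wording is illustrative; the thesis uses a neural retriever and a real LLM.

```python
# Retrieve the top-k passages by cosine similarity and prepend them to the
# prompt as evidence; the resulting string would be sent to the generator LLM.
import numpy as np

def retrieve(question_vec, passage_vecs, passages, k=2):
    sims = passage_vecs @ question_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-12)
    return [passages[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question, evidence):
    context = "\n".join(f"- {p}" for p in evidence)
    return (f"Answer using only the evidence below.\n"
            f"Evidence:\n{context}\nQuestion: {question}\nAnswer:")

rng = np.random.default_rng(0)
passages = ["Passage about topic A.", "Passage about topic B.", "Passage about topic C."]
evidence = retrieve(rng.normal(size=8), rng.normal(size=(3, 8)), passages)
print(build_prompt("What does the source say about topic B?", evidence))
```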
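For the Bouri (2019) thesis above: the standard AIC and BIC scores for a Gaussian linear regression model, showing the feature-dimension penalty the abstract argues can over-penalize when features are correlated. FIC is not sketched here, and the synthetic data is for illustration only.

```python
# AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, with the Gaussian log-likelihood
# evaluated at the maximum-likelihood variance RSS / n.
import numpy as np

def aic_bic(y, y_hat, k):
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    log_lik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # ordinary least squares fit
aic, bic = aic_bic(y, X @ beta, k=X.shape[1])
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```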