Skip to main content
Login | Suomeksi | På svenska | In English

Browsing by master's degree program "Master's Programme in Data Science"

Sort by: Order: Results:

  • Mäki, Niklas (2023)
    Most graph neural network architectures take the input graph as granted and do not assign any uncertainty to its structure. In real life, however, data is often noisy and may contain incorrect edges or exclude true edges. Bayesian methods, which consider the input graph as a sample from a distribution, have not been deeply researched, and most existing research only tests the methods on small benchmark datasets such as citation graphs. As often is the case with Bayesian methods, they do not scale well for large datasets. The goal of this thesis is to research different Bayesian graph neural network architectures for semi-supervised node classification and test them on larger datasets, trying to find a method that improves the baseline model and is scalable enough to be used with graphs of tens of thousands of nodes with acceptable latency. All the tests are done twice with different amounts of training data, since Bayesian methods often excel with low amounts of data and in real life labeled data can be scarce. The Bayesian models considered are based on the graph convolutional network, which is also used as the baseline model for comparison. This thesis finds that the impressive performance of the Bayesian graph neural networks does not generalize to all datasets, and that the existing research relies too much on the same small benchmark graphs. Still, the models may be beneficial in some cases, and some of them are quite scalable and could be used even with moderately large graphs.
  • Horn, Matthew (2024)
    Long term monitoring programs gather important data to understand population trends and man- age biodiversity, including phenological data. The sampling of such data can suffer from left- censoring where the first occurrence of an event coincides with the first sampling time. This can lead to overestimation of the timing of species’ life history events and obscure phenological trends. This study develops a Bayesian survival model to predict and impute the true first occurrence times of Finnish moths in a given sampling season in left-censored cases, thereby estimating the amount of left-censoring and effectively "decensoring" the data. A simulation study was done to test the model on synthetic data and explore how effect size, the severity of censoring, and sampling fre- quency effect the inference. Forward feature selection was done over environmental covariates for a generalized linear survival model with logit link, incorporating both left-censoring and interval censoring. Five-fold cross validation was done to select the best model and see what covariates would be added during the feature selection process. The validation tested the model both in its ability to predict points that were not left-censored and those that were artificially left-censored. The final model included terms for cumulative Growing Degree Days, cumulative Chilling Days, mean spring temperature, cumulative rainfall, and daily minimum temperature, in addition to an intercept term. It was trained on all of the data and predictions were made for the true first occurrence times of the left-censored sites and years.
  • Porna, Ilkka (2022)
    Despite development in many areas of machine learning in recent decades, still, changing data sources between the domain in a model is trained and the domain in the same model is used for predictions is a fundamental and common problem. In the area of domain adaptation, these circum- stances have been studied by incorporating causal knowledge about the information flow between features to be utilized in the feature selection for the model. That work has shown promising results to accomplish so-called invariant causal prediction, which means a prediction performance is immune to the change levels between domains. Within these approaches, recognizing the Markov blanket to the target variable has served as a principal workhorse to find the optimal starting point. In this thesis, we continue to investigate closely the property of invariant prediction performance within Markov blankets to target variable. Also, some scenarios with latent parents involved in the Markov blanket are included to understand the role of the related covariates around the latent parent effect to the invariant prediction properties. Before the experiments, we cover the concepts of Makov blankets, structural causal models, causal feature selection, covariate shift, and target shift. We also look into ways to measure bias between changing domains by introducing transfer bias and incomplete information bias, as these biases play an important role in the feature selection, often being a trade-off situation between these biases. In the experiments, simulated data sets are generated from structural causal models to conduct the testing scenarios with the changing conditions of interest. With different scenarios, we investigate changes in the features of Markov blankets between training and prediction domains. Some scenarios involve changes in latent covariates as well. As result, we show that parent features are generally steady predictors enabling invariant prediction. An exception is a changing target, which basically requires more information about the changes in other earlier domains to enable invariant prediction. Also, emerging with latent parents, it is important to have some real direct causes in the feature sets to achieve invariant prediction performance.
  • Anttila, Jesse (2020)
    Visual simultaneous localization and mapping (visual SLAM) is a method for consistent self-contained localization using visual observations. Visual SLAM can produce very precise pose estimates without any specialized hardware, enabling applications such as AR navigation. The use of visual SLAM in very large areas and over long distances is not presently possible due to a number of significant scalability issues. In this thesis, these issues are discussed and solutions for them explored, culminating in a concept for a real-time city-scale visual SLAM system. A number of avenues for future work towards a practical implementation are also described.
  • Laaksonen, Jenniina (2021)
    Understanding customer behavior is one of the key elements in any thriving business. Dividing customers into different groups based on their distinct characteristics can help significantly when designing the service. Understanding the unique needs of customer groups is also the basis for modern marketing. The aim of this study is to explore what types of customer groups exist in an entertainment service business. In this study, customer segmentation is conducted with k-prototypes, a variation of k-means clustering. K-prototypes is a machine learning approach partitioning a group of observations into subgroups. These subgroups have little variation within the group and clear differences when compared to other subgroups. The advantage of k-prototypes is that it can process both categorical and numeric data efficiently. The results show that there are significant and meaningful differences between customer groups emerging from k-prototypes clustering. These customer groups can be targeted based on their unique characteristics and their reactions to different types of marketing actions vary. The unique characteristics of the customer groups can be utilized to target marketing actions better. Other possibilities to benefit from customer segmentation include such as personalized views, recommendations and helping strategy level decision making when designing the service. Many of these require further technical development or deeper understanding of the segments. Data selection as well as the quality of the data has an impact on the results and those should be considered carefully when deciding future actions on customer segmentation.
  • Koivisto, Teemu (2021)
    Programming courses often receive large quantities of program code submissions to exercises which, due to their large number, are graded and students provided feedback automatically. Teachers might never review these submissions therefore losing a valuable source of insight into student programming patterns. This thesis researches how these submissions could be reviewed efficiently using a software system, and a prototype, CodeClusters, was developed as an additional contribution of this thesis. CodeClusters' design goals are to allow the exploration of the submissions and specifically finding higher-level patterns that could be used to provide feedback to students. Its main features are full-text search and n-grams similarity detection model that can be used to cluster the submissions. Design science research is applied to evaluate CodeClusters' design and to guide the next iteration of the artifact and qualitative analysis, namely thematic synthesis, to evaluate the problem context as well as the ideas of using software for reviewing and providing clustered feedback. The used study method was interviews conducted with teachers who had experience teaching programming courses. Teachers were intrigued by the ability to review submitted student code and to provide more tailored feedback to students. The system, while still a prototype, is considered worthwhile to experiment on programming courses. A tool for analyzing and exploring submissions seems important to enable teachers to better understand how students have solved the exercises. Providing additional feedback can be beneficial to students, yet the feedback should be valuable and the students incentivized to read it.
  • Pritom Kumar, Das (2024)
    Many researchers use dwell time, a measure of engagement with information, as a universal measure of importance in judging the relevance and usefulness of information. However, it may not fully account for individual cognitive variations. This study investigates how individual differences in cognitive abilities, specifically working memory and processing speed, can significantly impact dwell time. We examined the browsing behavior of 20 individuals engaged in information-intensive tasks, measuring their cognitive abilities, tracking their web page visits, and asses their perceived relevance and usefulness of the information encountered. Our findings shows a clear connection between working memory, processing speed, and dwell time. Based on this finding, we developed a model that combines both cognitive abilities and dwell time to predict the relevance and usefulness of web pages. Interestingly, our analysis reveals that cognitive abilities, particularly fluid intelligence, significantly influence dwell time. This highlights the importance of incorporating individual cognitive differences into prediction models to improve their accuracy. Thus, personalized services that set dwell time thresholds based on individual users' cognitive abilities could provide more accurate estimations of what users find relevant and useful.
  • Hyvärinen, Linda (2023)
    With the increased usage of machine learning models in various tasks and domains, the demand of understanding the models is emphasized. However, often modern machine learning models are difficult to understand and therefore do not provoke trust. Models can be understood by revealing their inner logic with explanations, but explanations can be difficult to interpret for non-expert users. We introduce an interactive visual interface to help non-expert users to understand and compare machine learning models. The interface visualizes explanations for multiple models in order to help the user to understand how the models generate predictions and whether the predictions can be trusted. We also explore current research in explainable AI visualizations, in order to compare our prototype to comparable systems present in research. The contributions of this paper are a system description and a use case for an interactive visualization interface to compare and explain machine learning models, as well as providing an understanding of the current state of research in explainable AI visualization systems and recommendations for future studies. We conclude that our system enables efficient visualizations for regression models unlike the papers covered in our survey. Another conclusion is that the field lacks precise terminology.
  • Nissilä, Viivi (2020)
    Origin-Destination (OD) data is a crucial part of price estimation in the aviation industry, and an OD flight is any number of flights a passenger takes in a single journey. OD data is a complex set of data that is both flow and multidimensional type of data. In this work, the focus is to design interactive visualization techniques to support user exploration of OD data. The thesis work aims to find which of the two menu designs suit better for OD data visualization: breadth-first or depth-first menu design. The two menus follow Schneiderman’s Task by Data Taxonomy, a broader version of the Information Seeking Mantra. The first menu design is a parallel, breadth-first menu layout. The layout shows the variables in an open layout and is closer to the original data matrix. The second menu design is a hierarchical, depth-first layout. This layout is derived from the semantics of the data and is more compact in terms of screen space. The two menu designs are compared in an online survey study conducted with the potential end users. The results of the online survey study are inconclusive, and therefore are complemented with an expert review. Both the survey study and expert review show that the Sankey graph is a good visualization type for this work, but the interaction of the two menu designs requires further improvements. Both of the menu designs received positive and negative feedback in the expert review. For future work, a solution that combines the positives of the two designs could be considered. ACM Computing Classification System (CCS): Human-Centered Computing → Visualization → Empirical Studies in Visualization Human-centered computing → Interaction design → Interaction design process and methods → Interface design prototyping
  • Koppatz, Maximilian (2022)
    Automatic headline generation has the potential to significantly assist editors charged with head- lining articles. Approaches to automation in the headlining process can range from tools as creative aids, to complete end to end automation. The latter is difficult to achieve as journalistic require- ments imposed on headlines must be met with little room for error, with the requirements depending on the news brand in question. This thesis investigates automatic headline generation in the context of the Finnish newsroom. The primary question I seek to answer is how well the current state of text generation using deep neural language models can be applied to the headlining process in Finnish news media. To answer this, I have implemented and pre-trained a Finnish generative language model based on the Transformer architecture. I have fine-tuned this language model for headline generation as autoregression of headlines conditioned on the article text. I have designed and implemented a variation of the Diverse Beam Search algorithm, with additional parameters, to perform the headline generation in order to generate a diverse set of headlines for a given text. The evaluation of the generative capabilities of this system was done with real world usage in mind. I asked domain-experts in headlining to evaluate a generated set of text-headline pairs. The task was to accept or reject the individual headlines in key criteria. The responses of this survey were then quantitatively and qualitatively analyzed. Based on the analysis and feedback, this model can already be useful as a creative aid in the newsroom despite being far from ready for automation. I have identified concrete improvement directions based on the most common types of errors, and this provides interesting future work.
  • Lipsanen, Mikko (2022)
    The thesis presents and evaluates a model for detecting changes in discourses in diachronic text corpora. Detecting and analyzing discourses that typically evolve over a period of time and differ in their manifestations in individual documents is a challenging task, and existing approaches like topic modeling are often not able to reach satisfactory results. One key problem is the difficulty of properly evaluating the results of discourse detection methods, due in large part to the lack of annotated text corpora. The thesis proposes a solution where synthetic datasets containing non-stable discourse patterns are generated from a corpus of news articles. Using the news categories as a proxy for discourses allows both to control the complexity of the data and to evaluate the model results based on the known discourse patterns. The complex task of extracting topics from texts is commonly performed using generative models, which are based on simplifying assumptions regarding the process of data generation. The model presented in the thesis explores instead the potential of deep neural networks, combined with contrastive learning, to be used for discourse detection. The neural network model is first trained using supervised contrastive loss function, which teaches the model to differentiate the input data based on the type of discourse pattern it belongs to. This pretrained model is then employed for both supervised and unsupervised downstream classification tasks, where the goal is to detect changes in the discourse patterns at the timepoint level. The main aim of the thesis is to find out whether contrastive pretraining can be used as a part of a deep learning approach to discourse change detection, and whether the information encoded into the model during contrastive training can generalise to other, closely related domains. The results of the experiments show that contrastive pretraining can be used to encode information that directly relates to its learning goal into the end products of the model, although the learning process is still incomplete. However, the ability of the model to generalise this information in a way that could be useful in the timepoint level classification tasks remains limited. More work is needed to improve the model performance, especially if it is to be used with complex real world datasets.
  • Zhao, Zhao (2023)
    This thesis aims to offer a practical solution for making cost-effective decisions regarding weather routing deployment to optimize computational costs. The study focuses on developing three collaborative model components that collectively address the challenge of rerouting decision-making. Model 1 involves training a neural network-based Ship Performance Model, which forms the foundation for the weather routing model. Model 2 is centered around constructing a time-dependent path-finding model that integrates real-time weather forecasts. This model optimizes routing within a designated experimental area, generating simulation training samples. Model 3 utilizes the outcomes of Model 2 to train a practical machine learning decision-making model. This model seeks to address the question: should the weather routing system be activated and the route be adjusted based on updated weather forecasts? The integration of these models supports informed maritime decision-making. While these methods represent a preliminary step towards optimizing weather routing deployment frequencies, they hold the potential for enhancing operational efficiency and responsible resource usage in maritime sector.
  • Kuivaniemi, Esa (2024)
    Machine Learning (ML) has experienced significant growth, fuelled by the surge in big data. Organizations leverage ML techniques to take advantage of the data. So far, the focus has predominantly been on increasing the value by developing ML algorithms. Another option would be to optimize resource consumption to reach cost optimality. This thesis contributes to cost optimality by identifying and testing frameworks that enable organizations to make informed decisions on cost-effective cloud infrastructure while designing and developing ML workflows. The two frameworks we introduce to model Cost Optimality are: "Cost Optimal Query Processing in the Cloud" for data pipelines and "PALEO" for ML model training pipelines. The latter focuses on estimating the training time needed to train a Neural Net, while the first one is more generic in assessing cost-optimal cloud setup for query processing. Through the literature review, we show that it is critical to consider both the data and ML training aspects when designing a cost-optimal ML workflow. Our results indicate that the frameworks provide accurate estimates about cost-optimal hardware configuration in the cloud for ML workflow. There are deviations when we dive into the details: our chosen version of the Cost Optimal Model does not consider the impact of larger memory. Also, the frameworks do not provide accurate execution time estimates: PALEO estimates our accelerated EC2 instance to execute the training workload with half of the time it took. However, the purpose of the study was not to provide accurate execution or cost estimates, but we aimed to see if the frameworks estimate the cost-optimal cloud infrastructure setup among the five EC2 instances that we chose to execute our three different workloads.
  • Haatanen, Henri (2022)
    In the modern era, using personalization when reaching out to potential or current customers is essential for businesses to compete in their area of business. With large customer bases, this personalization becomes more difficult, thus segmenting entire customer bases into smaller groups helps businesses focus better on personalization and targeted business decisions. These groups can be straightforward, like segmenting solely based on age, or more complex, like taking into account geographic, demographic, behavioral, and psychographic differences among the customers. In the latter case, customer segmentation should be performed with Machine Learning, which can help find more hidden patterns within the data. Often, the number of features in the customer data set is so large that some form of dimensionality reduction is needed. That is also the case with this thesis, which includes 12802 unique article tags that are desired to be included in the segmentation. A form of dimensionality reduction called feature hashing is selected for hashing the tags for its ability to be introduced new tags in the future. Using hashed features in customer segmentation is a balancing act. With more hashed features, the evaluation metrics might give better results and the hashed features resemble more closely the unhashed article tag data, but with less hashed features the clustering process is faster, more memory-efficient and the resulting clusters are more interpretable to the business. Three clustering algorithms, K-means, DBSCAN, and BIRCH, are tested with eight feature hashing bin sizes for each, with promising results for K-means and BIRCH.
  • Jurinec, Fran (2023)
    This thesis explores the applicability of open-source tools on addressing the challenges of data-driven fusion research. The issue is explored through a survey of the fusion data ecosystem and exploration of possible data architectures, which were used to derive the goals and requirements of a proof-of-concept data platform. This platform, developed using open-source software, namely InvenioRDM and Apache Airflow, enabled transforming existing machine learning (ML) workloads into reusable data-generating workflows, and the cataloging of resulting clean ML datasets. Through a survey of the fusion data ecosystem, a set of challenges and goals was established for the development of a fusion data platform. It was identified that many of the challenges for data-driven research stem from a heterogeneous and geographically scattered source data layer combined with a monolithic approach to ML research. These challenges could be alleviated through improved ML infrastructure, for which two approaches were identified: a query-based approach, which offers more data retrieval flexibility but requires improvements in querying functionality and source data access speeds, and a persisted dataset approach, which uses a centralized workflow to collect and clean data, but requires additional storage resources. Additionally, by cataloging metadata in a central location it would be possible to combine data discovery across heterogeneous sources, combining the benefits of various infrastructure developments. Building on these identified goals and the metadata-driven platform architecture, a proof-of-concept data platform was implemented and examined through a case study. This implementation used InvenioRDM as a metadata catalog to index and provide a dashboard for discovering ML-ready datasets, and Apache Airflow as a workflow orchestration platform to manage the data collection workflows. The case study, grounded in real-world fusion ML research, showcased the platform's ability to convert existing ML workloads into reusable data-generating workflows and to publish clean ML datasets without introducing significant complexity into the research workflows.
  • Matakos, Alexandros (2024)
    This thesis presents DeepGT, a 3D Convolutional Neural Network designed to enhance the spatial resolution of GNSS Tropospheric Tomography, a technique for estimating atmospheric water vapor distribution using GNSS signals. By utilizing Slant Wet Delays from dense GNSS networks and boundary meteorological data from Numerical Weather Prediction models, DeepGT refines low-resolution tomographic wet refractivity fields. The proposed method quadruples the horizontal resolution, while improving the accuracy of the tomographic reconstruction. Two experiments are conducted to validate this: one with real-world SWEPOS data and another with a hypothetical dense GNSS network. The results demonstrate the potential of deep learning models such as DeepGT in enhancing GNSS Meteorology, with implications for improved weather forecasting and climate studies.
  • Mäkinen, Sasu (2021)
    Deploying machine learning models is found to be a massive issue in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) has proven to streamline and accelerate deployments in the field of software development. Creating CI/CD pipelines in software that includes elements of Machine Learning (MLOps) has unique problems, and trail-blazers in the field solve them with the use of proprietary tooling, often offered by cloud providers. In this thesis, we describe the elements of MLOps. We study what the requirements to automate the CI/CD of Machine Learning systems in the MLOps methodology. We study if it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud provider agnostic way. We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of Machine Learning system. We motivated why Machine Learning systems should be included in the DevOps methodology. We studied what unique challenges machine learning brings to CI/CD pipelines, production environments and monitoring. We analyzed the pipeline’s design, architecture, and implementation details and its applicability and value to Machine Learning projects. We evaluate our solution as a promising MLOps pipeline, that manages to solve many issues of automating a reproducible Machine Learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud provider agnostic. Configuring the pipeline to fit the client needs uses easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead.
  • Savolainen, Outi (2022)
    Today, Global Navigation Satellite Systems (GNSS) provide services that many critical systems [1] as well as normal users, need in everyday life. These signals are threatened by unintentional and intentional interference. The received satellite signals are complex-valued by nature, however, state-of-the-art anomaly detection approaches operate in the real domain. Changing the anomaly detection into the complex domain allows for preserving the phase component of the signal data. In this thesis, I developed and tested a fully complex-valued Long Short-Term Memory (LSTM) based autoencoder for anomaly detection. I also developed a method for scaling of complex-numbers that forces both real and imaginary units into the range [-1,1] and does not change the direction of a complex vector. The model is trained and tested both in the time and frequency domains, and the frequency domain is divided into two parts: real and complex domain. The developed model’s training data consists only of clean sample data, and the output of the model is the reconstruction of the model’s input. In testing, it can be determined whether the output is clean or anomalous based on the reconstruction error and the computed threshold value. The results show that the autoencoder model in the real domain outperforms the model trained in the complex domain. This does not indicate that the anomaly detection in the complex domain does not work; rather, the model’s architecture needs improvements, and the amount of training data must be increased to reduce the overfitting of the complex domain and thus improve the anomaly detection capability. It was also investigated that some anomalous sample sequences contain a few large valued spikes while other values in the same data snapshot are smaller. After scaling, the values other than in the spikes get closer to zero. This phenomenon causes small reconstruction errors in the model and yields false predictions in the complex domain.
  • Rannisto, Meeri (2020)
    Bat monitoring is commonly based on audio analysis. By collecting audio recordings from large areas and analysing their content, it is possible estimate distributions of bat species and changes in them. It is easy to collect a large amount of audio recordings by leaving automatic recording units in nature and collecting them later. However, it takes a lot of time and effort to analyse these recordings. Because of that, there is a great need for automatic tools. We developed a program for detecting bat calls automatically from audio recordings. The program is designed for recordings that are collected from Finland with the AudioMoth recording device. Our method is based on a median clipping method that has previously shown promising results in the field of bird song detection. We add several modifications to the basic method in order to make it work well for our purpose. We use real-world field recordings that we have annotated to evaluate the performance of the detector and compare it to two other freely available programs (Kaleidoscope and Bat Detective). Our method showed good results and got the best F2-score in the comparison.
  • Unknown author (2023)
    This study focused on detecting horizontal and vertical collusion within Indonesian government procurement processes, leveraging data-driven techniques and statistical methods. Regarding horizontal collusion, we applied clustering techniques to categorize companies based on their supply patterns, revealing clusters with similar bidding practices that may indicate potential collusion. Additionally, we identified patterns where specific supplier groups consistently won procurements, raising questions about potential competitive advantages or strategic practices that need further examination for collusion. For vertical collusion, we examined the frequency of associations between specific government employees and winning companies. While high-frequency collaborations were observed, it is essential to interpret these results with caution as they do not definitively indicate collusion, and legitimate factors might justify such associations. Despite revealing important patterns, the study acknowledges its limitations, including the representativeness of the dataset and the reliance on quantitative methods. Nevertheless, our findings carry substantial implications for enhancing procurement monitoring, strengthening anti-collusion regulations, and promoting transparency in Indonesian government procurement processes. Future research could enrich these findings by incorporating qualitative methods, exploring additional indicators of collusion, and leveraging machine learning techniques to detect collusion.