
Browsing by master's degree program "Datatieteen maisteriohjelma"


  • Laaksonen, Jenniina (2021)
    Understanding customer behavior is one of the key elements in any thriving business. Dividing customers into groups based on their distinct characteristics can help significantly when designing a service, and understanding the unique needs of customer groups is also the basis of modern marketing. The aim of this study is to explore what types of customer groups exist in an entertainment service business. Customer segmentation is conducted with k-prototypes, a variation of k-means clustering. K-prototypes is a machine learning approach that partitions a set of observations into subgroups with little variation within each subgroup and clear differences between subgroups. The advantage of k-prototypes is that it can process both categorical and numeric data efficiently. The results show significant and meaningful differences between the customer groups that emerge from k-prototypes clustering. These groups can be targeted based on their unique characteristics, and their reactions to different types of marketing actions vary, so those characteristics can be used to target marketing actions better. Other ways to benefit from customer segmentation include personalized views, recommendations, and support for strategy-level decision-making when designing the service, although many of these require further technical development or a deeper understanding of the segments. Both data selection and data quality have an impact on the results and should be considered carefully when deciding on future customer segmentation work.
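    The core clustering step can be sketched with the open-source kmodes library; the column layout, parameter choices, and toy data below are illustrative assumptions, not the thesis's actual setup.

    ```python
    # Minimal k-prototypes sketch using the kmodes library (pip install kmodes).
    # The feature layout below is hypothetical, not the thesis's data.
    import numpy as np
    from kmodes.kprototypes import KPrototypes

    # columns: [monthly_spend, visits_per_month, subscription_tier, region]
    X = np.array([
        [25.0, 2, "basic",   "north"],
        [80.0, 9, "premium", "south"],
        [22.0, 1, "basic",   "north"],
        [95.0, 8, "premium", "east"],
    ], dtype=object)

    kp = KPrototypes(n_clusters=2, init="Cao", random_state=0)
    # categorical=[2, 3] marks which columns are categorical
    labels = kp.fit_predict(X, categorical=[2, 3])
    print(labels)
    ```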
  • Koivisto, Teemu (2021)
    Programming courses often receive large numbers of program code submissions which, due to their volume, are graded and given feedback automatically. Teachers might never review these submissions, thereby losing a valuable source of insight into student programming patterns. This thesis researches how these submissions could be reviewed efficiently using a software system, and a prototype, CodeClusters, was developed as an additional contribution. CodeClusters' design goals are to allow exploration of the submissions and, specifically, finding higher-level patterns that could be used to provide feedback to students. Its main features are full-text search and an n-gram similarity detection model that can be used to cluster the submissions. Design science research is applied to evaluate CodeClusters' design and to guide the next iteration of the artifact, while qualitative analysis, namely thematic synthesis, is used to evaluate the problem context as well as the ideas of using software for reviewing and providing clustered feedback. The study method was interviews conducted with teachers who had experience teaching programming courses. Teachers were intrigued by the ability to review submitted student code and to provide more tailored feedback to students. The system, while still a prototype, is considered worth experimenting with on programming courses. A tool for analyzing and exploring submissions seems important for enabling teachers to better understand how students have solved the exercises. Providing additional feedback can be beneficial to students, yet the feedback should be valuable and the students incentivized to read it.
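    The thesis's similarity model is not reproduced here, but the general idea of clustering code submissions by character n-gram similarity can be sketched with scikit-learn:

    ```python
    # Sketch of n-gram-based similarity clustering of code submissions,
    # in the spirit of CodeClusters (not its actual implementation).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    submissions = [
        "for i in range(10): print(i)",
        "i = 0\nwhile i < 10: print(i); i += 1",
        "for x in range(10): print(x)",
    ]

    # character trigrams approximate token-level similarity of short programs
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    X = vec.fit_transform(submissions)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # the two similar for-loops should share a cluster
    ```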
  • Hyvärinen, Linda (2023)
    With the increased use of machine learning models across tasks and domains, the demand for understanding these models has grown. However, modern machine learning models are often difficult to understand and therefore do not inspire trust. Models can be understood by revealing their inner logic with explanations, but explanations can be difficult to interpret for non-expert users. We introduce an interactive visual interface to help non-expert users understand and compare machine learning models. The interface visualizes explanations for multiple models in order to help the user understand how the models generate predictions and whether the predictions can be trusted. We also review current research on explainable AI visualizations in order to compare our prototype to comparable systems in the literature. The contributions of this thesis are a system description and a use case for an interactive visualization interface for comparing and explaining machine learning models, as well as an overview of the current state of research in explainable AI visualization systems and recommendations for future studies. We conclude that our system enables efficient visualizations for regression models, unlike the systems covered in our survey. Another conclusion is that the field lacks precise terminology.
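    As a rough, non-interactive stand-in for the comparison idea, one can contrast explanations for two regression models using scikit-learn's permutation importance; the dataset and model choices below are illustrative only, not the interface's implementation.

    ```python
    # Toy comparison of explanations for two regression models via
    # permutation importance; illustrates the comparison idea only.
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import Ridge
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for model in (Ridge().fit(X_tr, y_tr),
                  GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)):
        imp = permutation_importance(model, X_te, y_te, n_repeats=10,
                                     random_state=0)
        top = sorted(zip(X.columns, imp.importances_mean),
                     key=lambda t: -t[1])[:3]
        print(type(model).__name__, top)  # do the models agree on drivers?
    ```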
  • Nissilä, Viivi (2020)
    Origin-Destination (OD) data is a crucial part of price estimation in the aviation industry; an OD flight is any number of flights a passenger takes in a single journey. OD data is complex, being both flow data and multidimensional data. In this work, the focus is on designing interactive visualization techniques to support user exploration of OD data. The thesis aims to find which of two menu designs suits OD data visualization better: a breadth-first or a depth-first menu design. The two menus follow Shneiderman's Task by Data Taxonomy, a broader version of the Information Seeking Mantra. The first menu design is a parallel, breadth-first layout; it shows the variables in an open layout and is closer to the original data matrix. The second menu design is a hierarchical, depth-first layout; it is derived from the semantics of the data and is more compact in terms of screen space. The two menu designs are compared in an online survey study conducted with potential end users. The results of the online survey study are inconclusive and are therefore complemented with an expert review. Both the survey study and the expert review show that the Sankey graph is a good visualization type for this work, but the interaction of the two menu designs requires further improvement. Both menu designs received positive and negative feedback in the expert review. For future work, a solution that combines the strengths of the two designs could be considered. ACM Computing Classification System (CCS): Human-centered computing → Visualization → Empirical studies in visualization; Human-centered computing → Interaction design → Interaction design process and methods → Interface design prototyping.
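    A minimal Sankey diagram of OD flows, the visualization type the study found suitable, can be drawn with plotly; the airports and passenger volumes below are invented for illustration.

    ```python
    # Minimal Sankey diagram of OD passenger flows with plotly;
    # airports and volumes are toy data, not the thesis's dataset.
    import plotly.graph_objects as go

    nodes = ["HEL", "AMS", "JFK", "LAX"]
    fig = go.Figure(go.Sankey(
        node=dict(label=nodes),
        link=dict(
            source=[0, 0, 1, 1],      # indices into nodes
            target=[1, 2, 2, 3],
            value=[120, 40, 80, 60],  # passengers per link
        ),
    ))
    fig.update_layout(title="Origin-Destination flows (toy data)")
    fig.show()
    ```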
  • Koppatz, Maximilian (2022)
    Automatic headline generation has the potential to significantly assist editors charged with headlining articles. Approaches to automation in the headlining process can range from tools as creative aids to complete end-to-end automation. The latter is difficult to achieve, as the journalistic requirements imposed on headlines must be met with little room for error, and the requirements depend on the news brand in question. This thesis investigates automatic headline generation in the context of the Finnish newsroom. The primary question I seek to answer is how well the current state of text generation using deep neural language models can be applied to the headlining process in Finnish news media. To answer this, I have implemented and pre-trained a Finnish generative language model based on the Transformer architecture. I have fine-tuned this language model for headline generation as autoregression of headlines conditioned on the article text. I have designed and implemented a variation of the Diverse Beam Search algorithm, with additional parameters, to generate a diverse set of headlines for a given text. The evaluation of the generative capabilities of this system was done with real-world usage in mind: I asked domain experts in headlining to evaluate a generated set of text-headline pairs, accepting or rejecting the individual headlines on key criteria. The responses of this survey were then quantitatively and qualitatively analyzed. Based on the analysis and feedback, this model can already be useful as a creative aid in the newsroom despite being far from ready for automation. I have identified concrete improvement directions based on the most common types of errors, and these provide interesting avenues for future work.
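    The thesis implements its own Diverse Beam Search variant with additional parameters; as a hedged stand-in, the standard diverse beam search available in Hugging Face transformers looks like this (the "gpt2" checkpoint and the prompt are placeholders for the thesis's own pre-trained Finnish model and its input format).

    ```python
    # Standard diverse beam search in Hugging Face transformers, as a
    # stand-in for the thesis's variant. "gpt2" is a placeholder model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("Article: ... Headline:", return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        num_beams=6,
        num_beam_groups=3,      # groups are penalized against each other
        diversity_penalty=1.0,  # higher -> more diverse candidate groups
        num_return_sequences=6,
        pad_token_id=tok.eos_token_id,
    )
    for seq in out:
        print(tok.decode(seq, skip_special_tokens=True))
    ```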
  • Lipsanen, Mikko (2022)
    The thesis presents and evaluates a model for detecting changes in discourses in diachronic text corpora. Detecting and analyzing discourses that typically evolve over a period of time and differ in their manifestations in individual documents is a challenging task, and existing approaches like topic modeling are often not able to reach satisfactory results. One key problem is the difficulty of properly evaluating the results of discourse detection methods, due in large part to the lack of annotated text corpora. The thesis proposes a solution where synthetic datasets containing non-stable discourse patterns are generated from a corpus of news articles. Using the news categories as a proxy for discourses allows both controlling the complexity of the data and evaluating the model results against the known discourse patterns. The complex task of extracting topics from texts is commonly performed using generative models, which are based on simplifying assumptions regarding the process of data generation. The model presented in the thesis instead explores the potential of deep neural networks, combined with contrastive learning, for discourse detection. The neural network model is first trained using a supervised contrastive loss function, which teaches the model to differentiate the input data based on the type of discourse pattern it belongs to. This pretrained model is then employed for both supervised and unsupervised downstream classification tasks, where the goal is to detect changes in the discourse patterns at the timepoint level. The main aim of the thesis is to find out whether contrastive pretraining can be used as part of a deep learning approach to discourse change detection, and whether the information encoded into the model during contrastive training can generalise to other, closely related domains. The results of the experiments show that contrastive pretraining can be used to encode information that directly relates to its learning goal into the end products of the model, although the learning process is still incomplete. However, the ability of the model to generalise this information in a way that could be useful in the timepoint-level classification tasks remains limited. More work is needed to improve the model's performance, especially if it is to be used with complex real-world datasets.
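    A minimal version of the supervised contrastive loss (Khosla et al. 2020) used in such pretraining can be written in PyTorch as follows; this is a generic formulation, not the thesis's training code, and the embedding size and labels are toy values.

    ```python
    # Minimal supervised contrastive loss in PyTorch (generic sketch).
    import torch
    import torch.nn.functional as F

    def sup_con_loss(z, labels, tau=0.1):
        """z: (N, d) embeddings, labels: (N,) discourse-pattern ids."""
        z = F.normalize(z, dim=1)
        sim = z @ z.t() / tau                       # (N, N) similarities
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        sim = sim.masked_fill(self_mask, float("-inf"))
        # log-softmax over all other samples
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # average log-probability of same-label positives per anchor
        sum_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
        loss = -sum_pos / pos_mask.sum(1).clamp(min=1)
        return loss[pos_mask.any(1)].mean()  # skip anchors without positives

    z = torch.randn(8, 16, requires_grad=True)
    labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
    print(sup_con_loss(z, labels))
    ```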
  • Zhao, Zhao (2023)
    This thesis offers a practical solution for making cost-effective decisions about when to deploy weather routing, in order to optimize computational costs. The study develops three collaborative model components that collectively address the challenge of rerouting decision-making. Model 1 involves training a neural network-based Ship Performance Model, which forms the foundation of the weather routing model. Model 2 is centered around constructing a time-dependent path-finding model that integrates real-time weather forecasts; it optimizes routing within a designated experimental area, generating simulation training samples. Model 3 uses the outcomes of Model 2 to train a practical machine learning decision-making model that addresses the question: should the weather routing system be activated and the route adjusted based on updated weather forecasts? The integration of these models supports informed maritime decision-making. While these methods represent a preliminary step towards optimizing weather routing deployment frequencies, they hold potential for enhancing operational efficiency and responsible resource usage in the maritime sector.
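    Model 2's path finding is described only at a high level; a generic time-dependent (earliest-arrival) Dijkstra under a FIFO assumption, with a toy weather-dependent edge-time function, might look like the sketch below.

    ```python
    # Generic time-dependent Dijkstra (earliest arrival). Edge times depend
    # on departure time, e.g. via a weather forecast. Assumes FIFO edges
    # (departing later never means arriving earlier). Toy graph only.
    import heapq

    def earliest_arrival(graph, source, target, t0):
        """graph[u] -> list of (v, travel_time_fn); travel_time_fn(t) -> hours."""
        best = {source: t0}
        pq = [(t0, source)]
        while pq:
            t, u = heapq.heappop(pq)
            if u == target:
                return t
            if t > best.get(u, float("inf")):
                continue
            for v, travel_time in graph.get(u, []):
                arrive = t + travel_time(t)
                if arrive < best.get(v, float("inf")):
                    best[v] = arrive
                    heapq.heappush(pq, (arrive, v))
        return None

    # toy route: heavier weather after t=6 slows the second leg
    graph = {
        "A": [("B", lambda t: 3.0)],
        "B": [("C", lambda t: 2.0 if t < 6 else 4.0)],
    }
    print(earliest_arrival(graph, "A", "C", t0=0.0))  # 5.0
    ```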
  • Kuivaniemi, Esa (2024)
    Machine Learning (ML) has experienced significant growth, fuelled by the surge in big data. Organizations leverage ML techniques to take advantage of the data. So far, the focus has predominantly been on increasing value by developing ML algorithms. Another option is to optimize resource consumption to reach cost optimality. This thesis contributes to cost optimality by identifying and testing frameworks that enable organizations to make informed decisions on cost-effective cloud infrastructure while designing and developing ML workflows. The two frameworks we introduce to model cost optimality are "Cost Optimal Query Processing in the Cloud" for data pipelines and "PALEO" for ML model training pipelines. The latter focuses on estimating the time needed to train a neural network, while the former is more generic in assessing a cost-optimal cloud setup for query processing. Through the literature review, we show that it is critical to consider both the data and the ML training aspects when designing a cost-optimal ML workflow. Our results indicate that the frameworks provide accurate estimates of the cost-optimal hardware configuration in the cloud for an ML workflow. There are deviations in the details: our chosen version of the Cost Optimal Model does not consider the impact of larger memory, and the frameworks do not provide accurate execution time estimates; PALEO estimates that our accelerated EC2 instance executes the training workload in half the time it actually took. However, the purpose of the study was not to provide accurate execution or cost estimates; rather, we aimed to see whether the frameworks identify the cost-optimal cloud infrastructure setup among the five EC2 instances we chose to execute our three different workloads.
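    The cost-optimality comparison itself reduces to simple arithmetic: estimate the runtime per instance and pick the instance minimizing price times runtime. The prices and runtimes below are invented placeholders, not the thesis's measurements.

    ```python
    # Toy cost-optimality check: pick the EC2 instance minimizing
    # price x estimated runtime. All numbers are hypothetical.
    workload = {
        # instance: (usd_per_hour, estimated_runtime_hours)
        "m5.xlarge":  (0.19, 3.0),
        "c5.2xlarge": (0.34, 1.8),
        "p3.2xlarge": (3.06, 0.4),
    }
    costs = {k: price * hours for k, (price, hours) in workload.items()}
    best = min(costs, key=costs.get)
    print(best, round(costs[best], 2))  # cheapest total cost wins
    ```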
  • Haatanen, Henri (2022)
    In the modern era, using personalization when reaching out to potential or current customers is essential for businesses to compete. With large customer bases this personalization becomes more difficult, so segmenting entire customer bases into smaller groups helps businesses focus on personalization and targeted business decisions. These groups can be straightforward, like segments based solely on age, or more complex, taking into account geographic, demographic, behavioral, and psychographic differences among the customers. In the latter case, customer segmentation should be performed with machine learning, which can help find hidden patterns within the data. Often the number of features in the customer data set is so large that some form of dimensionality reduction is needed. That is also the case in this thesis, where 12,802 unique article tags are to be included in the segmentation. A form of dimensionality reduction called feature hashing is selected for hashing the tags because of its ability to accommodate new tags in the future. Using hashed features in customer segmentation is a balancing act: with more hashed features, the evaluation metrics might give better results and the hashed features more closely resemble the unhashed article tag data, but with fewer hashed features the clustering process is faster and more memory-efficient, and the resulting clusters are more interpretable to the business. Three clustering algorithms, K-means, DBSCAN, and BIRCH, are tested with eight feature hashing bin sizes each, with promising results for K-means and BIRCH.
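    The hashing step can be sketched with scikit-learn's FeatureHasher, whose fixed-size output space is what lets new tags be introduced later without retraining the encoding; the tag lists and bin size below are illustrative.

    ```python
    # Feature hashing of article tags with scikit-learn. Unseen future
    # tags hash into the same fixed-size space. Tags are toy examples.
    from sklearn.feature_extraction import FeatureHasher

    customers_tags = [
        ["politics", "economy", "elections"],
        ["sports", "ice-hockey"],
        ["economy", "sports"],
    ]

    hasher = FeatureHasher(n_features=64, input_type="string")
    X = hasher.transform(customers_tags)  # sparse (3, 64) matrix
    print(X.shape, X.nnz)
    ```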
  • Jurinec, Fran (2023)
    This thesis explores the applicability of open-source tools to addressing the challenges of data-driven fusion research. The issue is explored through a survey of the fusion data ecosystem and an exploration of possible data architectures, which were used to derive the goals and requirements of a proof-of-concept data platform. This platform, developed using open-source software, namely InvenioRDM and Apache Airflow, enabled transforming existing machine learning (ML) workloads into reusable data-generating workflows and cataloging the resulting clean ML datasets. Through the survey of the fusion data ecosystem, a set of challenges and goals was established for the development of a fusion data platform. Many of the challenges for data-driven research stem from a heterogeneous and geographically scattered source data layer combined with a monolithic approach to ML research. These challenges could be alleviated through improved ML infrastructure, for which two approaches were identified: a query-based approach, which offers more data retrieval flexibility but requires improvements in querying functionality and source data access speeds, and a persisted-dataset approach, which uses a centralized workflow to collect and clean data but requires additional storage resources. Additionally, by cataloging metadata in a central location, data discovery can be combined across heterogeneous sources, uniting the benefits of the various infrastructure developments. Building on these identified goals and the metadata-driven platform architecture, a proof-of-concept data platform was implemented and examined through a case study. The implementation used InvenioRDM as a metadata catalog to index and provide a dashboard for discovering ML-ready datasets, and Apache Airflow as a workflow orchestration platform to manage the data collection workflows. The case study, grounded in real-world fusion ML research, showcased the platform's ability to convert existing ML workloads into reusable data-generating workflows and to publish clean ML datasets without introducing significant complexity into the research workflows.
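    A minimal Apache Airflow DAG conveys the shape of such a data-generating workflow; the task bodies and the InvenioRDM publishing step are placeholders, not the platform's actual code (Airflow 2.x API assumed).

    ```python
    # Sketch of a collect -> clean -> publish data-generating workflow
    # in Apache Airflow 2.x. Task bodies are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def collect(): ...   # pull raw shot data from source systems
    def clean(): ...     # filter/align signals into an ML-ready dataset
    def publish(): ...   # register dataset metadata in InvenioRDM

    with DAG(
        dag_id="fusion_ml_dataset",
        start_date=datetime(2023, 1, 1),
        schedule=None,    # trigger manually per dataset build
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="collect", python_callable=collect)
        t2 = PythonOperator(task_id="clean", python_callable=clean)
        t3 = PythonOperator(task_id="publish", python_callable=publish)
        t1 >> t2 >> t3
    ```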
  • Matakos, Alexandros (2024)
    This thesis presents DeepGT, a 3D Convolutional Neural Network designed to enhance the spatial resolution of GNSS Tropospheric Tomography, a technique for estimating atmospheric water vapor distribution using GNSS signals. By utilizing Slant Wet Delays from dense GNSS networks and boundary meteorological data from Numerical Weather Prediction models, DeepGT refines low-resolution tomographic wet refractivity fields. The proposed method quadruples the horizontal resolution, while improving the accuracy of the tomographic reconstruction. Two experiments are conducted to validate this: one with real-world SWEPOS data and another with a hypothetical dense GNSS network. The results demonstrate the potential of deep learning models such as DeepGT in enhancing GNSS Meteorology, with implications for improved weather forecasting and climate studies.
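    DeepGT's architecture is not reproduced here; the toy PyTorch model below only illustrates the general idea of a 3D CNN that quadruples the two horizontal dimensions of a coarse wet refractivity grid while preserving the vertical one. Layer sizes are arbitrary assumptions.

    ```python
    # Toy 3D CNN that upsamples the horizontal dimensions 4x while keeping
    # the vertical dimension; not the thesis's DeepGT architecture.
    import torch
    import torch.nn as nn

    class ToySR3D(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=(1, 4, 4), mode="trilinear",
                            align_corners=False),
                nn.Conv3d(32, 1, kernel_size=3, padding=1),
            )

        def forward(self, x):  # x: (batch, 1, depth, height, width)
            return self.net(x)

    coarse = torch.randn(1, 1, 8, 16, 16)  # low-res refractivity field
    print(ToySR3D()(coarse).shape)         # torch.Size([1, 1, 8, 64, 64])
    ```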
  • Mäkinen, Sasu (2021)
    Deploying machine learning models has been found to be a major issue in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) have proven to streamline and accelerate deployments in software development. Creating CI/CD pipelines for software that includes elements of machine learning (MLOps) poses unique problems, and trail-blazers in the field solve them with proprietary tooling, often offered by cloud providers. In this thesis, we describe the elements of MLOps. We study the requirements for automating the CI/CD of machine learning systems in the MLOps methodology, and whether it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud provider-agnostic way. We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of a machine learning system. We motivated why machine learning systems should be included in the DevOps methodology, studied what unique challenges machine learning brings to CI/CD pipelines, production environments, and monitoring, and analyzed the pipeline's design, architecture, and implementation details as well as its applicability and value to machine learning projects. We evaluate our solution as a promising MLOps pipeline that manages to solve many issues of automating a reproducible machine learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud provider-agnostic. Configuring the pipeline to fit client needs uses easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead.
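    The declarative-configuration idea can be sketched as a YAML pipeline description loaded and validated in Python; the keys below are illustrative guesses, not the pipeline's actual schema.

    ```python
    # Sketch of declarative pipeline configuration: YAML parsed and
    # sanity-checked in Python. Keys are hypothetical, not the thesis's.
    import yaml  # pip install pyyaml

    config_text = """
    pipeline:
      train:
        image: ghcr.io/example/trainer:latest
        resources: {cpu: 4, memory: 8Gi}
      deploy:
        target: staging
        monitor: true
    """

    config = yaml.safe_load(config_text)
    assert set(config["pipeline"]) == {"train", "deploy"}
    print(config["pipeline"]["deploy"]["target"])  # staging
    ```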
  • Savolainen, Outi (2022)
    Today, Global Navigation Satellite Systems (GNSS) provide services that many critical systems [1], as well as ordinary users, need in everyday life. These signals are threatened by unintentional and intentional interference. The received satellite signals are complex-valued by nature, yet state-of-the-art anomaly detection approaches operate in the real domain. Moving anomaly detection into the complex domain allows preserving the phase component of the signal data. In this thesis, I developed and tested a fully complex-valued Long Short-Term Memory (LSTM) based autoencoder for anomaly detection. I also developed a method for scaling complex numbers that forces both the real and imaginary parts into the range [-1, 1] without changing the direction of a complex vector. The model is trained and tested both in the time and frequency domains, with the frequency domain divided into two parts: the real and the complex domain. The model's training data consists only of clean samples, and the output of the model is the reconstruction of its input. In testing, whether the output is clean or anomalous is determined from the reconstruction error and a computed threshold value. The results show that the autoencoder model in the real domain outperforms the model trained in the complex domain. This does not indicate that anomaly detection in the complex domain does not work; rather, the model's architecture needs improvement, and the amount of training data must be increased to reduce the overfitting of the complex-domain model and thus improve its anomaly detection capability. It was also observed that some anomalous sample sequences contain a few large-valued spikes while the other values in the same data snapshot are smaller; after scaling, the values outside the spikes move close to zero. This phenomenon causes small reconstruction errors in the model and yields false predictions in the complex domain.
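    One way to realize the scaling described above, assuming a single shared scale factor per data snapshot, is to divide by the largest absolute real or imaginary part; dividing by a positive real scalar never rotates a complex vector, so directions are preserved. This is an assumed reading of the method, not the thesis's verbatim code.

    ```python
    # Assumed sketch of the complex scaling: one positive real divisor per
    # snapshot forces Re and Im into [-1, 1] without changing directions.
    import numpy as np

    def scale_complex(z):
        s = max(np.abs(z.real).max(), np.abs(z.imag).max())
        return z / s if s > 0 else z

    z = np.array([3 + 4j, -0.5 + 2j, 1 - 6j])
    zs = scale_complex(z)
    print(zs.real.min(), zs.real.max(), zs.imag.min(), zs.imag.max())
    print(np.angle(z) - np.angle(zs))  # all zeros: directions unchanged
    ```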
  • Rannisto, Meeri (2020)
    Bat monitoring is commonly based on audio analysis. By collecting audio recordings from large areas and analysing their content, it is possible to estimate distributions of bat species and changes in them. It is easy to collect a large amount of audio recordings by leaving automatic recording units in nature and collecting them later. However, it takes a lot of time and effort to analyse these recordings, so there is a great need for automatic tools. We developed a program for detecting bat calls automatically from audio recordings. The program is designed for recordings collected in Finland with the AudioMoth recording device. Our method is based on a median clipping method that has previously shown promising results in the field of bird song detection. We add several modifications to the basic method in order to make it work well for our purpose. We use real-world field recordings that we have annotated to evaluate the performance of the detector and compare it to two other freely available programs (Kaleidoscope and Bat Detective). Our method showed good results and achieved the best F2-score in the comparison.
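    The core of basic median clipping is simple: keep the time-frequency cells of a spectrogram that exceed a multiple of both their row and column medians. The sketch below shows this baseline; the thesis's modifications are not reproduced here.

    ```python
    # Baseline median clipping on a magnitude spectrogram: keep cells above
    # k times both the row (frequency-band) and column (time-frame) medians.
    import numpy as np

    def median_clip(spec, k=3.0):
        row_med = np.median(spec, axis=1, keepdims=True)  # per frequency band
        col_med = np.median(spec, axis=0, keepdims=True)  # per time frame
        return (spec > k * row_med) & (spec > k * col_med)

    spec = np.abs(np.random.randn(64, 200))  # stand-in spectrogram
    mask = median_clip(spec)
    print(mask.shape, mask.mean())           # fraction of cells kept
    ```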
  • Unknown author (2023)
    This study focused on detecting horizontal and vertical collusion within Indonesian government procurement processes, leveraging data-driven techniques and statistical methods. Regarding horizontal collusion, we applied clustering techniques to categorize companies based on their supply patterns, revealing clusters with similar bidding practices that may indicate potential collusion. Additionally, we identified patterns where specific supplier groups consistently won procurements, raising questions about potential competitive advantages or strategic practices that need further examination for collusion. For vertical collusion, we examined the frequency of associations between specific government employees and winning companies. While high-frequency collaborations were observed, it is essential to interpret these results with caution as they do not definitively indicate collusion, and legitimate factors might justify such associations. Despite revealing important patterns, the study acknowledges its limitations, including the representativeness of the dataset and the reliance on quantitative methods. Nevertheless, our findings carry substantial implications for enhancing procurement monitoring, strengthening anti-collusion regulations, and promoting transparency in Indonesian government procurement processes. Future research could enrich these findings by incorporating qualitative methods, exploring additional indicators of collusion, and leveraging machine learning techniques to detect collusion.
  • Rauth, Ella (2022)
    Northern peatlands are a large source of methane (CH4) to the atmosphere, and their emissions can vary strongly depending on local environmental conditions. However, few studies have mapped fine-grained CH4 fluxes at the landscape level. The aim of this study was to predict land cover and CH4 flux patterns in Pallastunturi, Finland, in a study area dominated by forests, peatlands, fells, and lakes. I used random forest models to map land cover types and CH4 fluxes with multi-source remote sensing data, and upscaled CH4 fluxes based on the land cover maps. The random forest classifier reliably detected the same land cover patterns as the CORINE Land Cover maps. The main differences between the land cover maps were forest type classification, misclassification between neighboring peatland types, and detection of sparsely vegetated areas on fells. The upscaled CH4 fluxes of sinks were very robust to changes in land cover classification, but shrub tundra and peatland CH4 fluxes were sensitive to the level of detail in the land cover classification. The random forest regression performed well (NRMSE 6.6%, R2 82%) and predicted CH4 flux patterns similar to the upscaled CH4 flux maps, although it predicted larger source areas than the upscaled maps did. The random forest regressor also predicted CH4 fluxes in peatlands better, thanks to the added information about soil moisture content from the remote sensing data. Random forests are a good model choice for detecting landscape patterns and predicting CH4 patterns in northern peatlands based on remote sensing and topographic data.
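    The regression setup can be sketched with scikit-learn, including the NRMSE metric quoted above; the predictors and data below are synthetic placeholders, not the study's remote sensing data.

    ```python
    # Random forest regression of a flux-like target with NRMSE and R2;
    # the three predictors stand in for remote-sensing variables.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 500
    X = rng.normal(size=(n, 3))  # e.g. NDVI, soil moisture, elevation
    y = 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.3, size=n)  # toy flux

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    pred = rf.predict(X_te)

    rmse = np.sqrt(mean_squared_error(y_te, pred))
    nrmse = rmse / (y_te.max() - y_te.min())  # normalized by observed range
    print(f"NRMSE {nrmse:.1%}, R2 {r2_score(y_te, pred):.0%}")
    ```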
  • Räisä, Ossi (2021)
    Differential privacy has over the past decade become a widely used framework for privacy-preserving machine learning. At the same time, Markov chain Monte Carlo (MCMC) algorithms, particularly Metropolis-Hastings (MH) algorithms, have become an increasingly popular method of performing Bayesian inference. Surprisingly, their combination has not received much attention in the literature. This thesis introduces the existing research on differentially private MH algorithms, proves tighter privacy bounds for them using recent developments in differential privacy, and develops two new differentially private MH algorithms: an algorithm using subsampling to lower privacy costs, and a differentially private variant of the Hamiltonian Monte Carlo algorithm. The privacy bounds of both new algorithms are proved, and convergence to the exact posterior is proven for the latter. The performance of both the old and the new algorithms is compared on several Bayesian inference problems, revealing that none of the algorithms is clearly better than the others, but subsampling is likely only useful to lower computational costs.
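    For flavor, a noisy Metropolis-Hastings step in the style of this literature, the penalty approach where Gaussian noise of known variance is added to the log acceptance ratio and compensated by a -σ²/2 shift, can be sketched as follows; this is a generic illustration, not one of the thesis's algorithms.

    ```python
    # Illustrative noisy MH with the Gaussian penalty correction; not the
    # thesis's algorithms. The target here is a standard normal.
    import numpy as np

    def noisy_mh(log_target, x0, n_steps, prop_scale, sigma, rng):
        x, chain = x0, []
        for _ in range(n_steps):
            y = x + rng.normal(scale=prop_scale)
            log_ratio = log_target(y) - log_target(x)
            noisy = log_ratio + rng.normal(scale=sigma)
            # penalty shift keeps the accept test unbiased under the noise
            if np.log(rng.uniform()) < noisy - sigma**2 / 2:
                x = y
            chain.append(x)
        return np.array(chain)

    rng = np.random.default_rng(0)
    chain = noisy_mh(lambda x: -0.5 * x**2, 0.0, 5000, 1.0, sigma=0.5, rng=rng)
    print(chain.mean(), chain.std())  # roughly 0 and 1
    ```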
  • Suihkonen, Sini (2023)
    The importance of protecting sensitive data from information breaches has increased in recent years as companies and other institutions gather massive datasets about their customers, including personally identifiable information. Differential privacy is one of the state-of-the-art methods for providing provable privacy for these datasets, protecting them from adversarial attacks. This thesis focuses on studying existing differentially private random forest (DPRF) algorithms, comparing them, and constructing a version of the DPRF algorithm based on them. Twelve articles from the late 2000s to 2022, each implementing a version of the DPRF algorithm, are included in the review of previous work. The created algorithm, called DPRF_thesis, uses a privatized median as the method for splitting internal nodes of the decision trees, and the class counts of the leaf nodes are privatized with the exponential mechanism. Tests of the DPRF_thesis algorithm were run on three binary classification UCI datasets, and the accuracy results were mostly comparable with those of the two existing DPRF algorithms DPRF_thesis was compared to. ACM Computing Classification System (CCS): Computing methodologies → Machine learning → Machine learning approaches → Classification and regression trees; Security and privacy → Database and storage security → Data anonymization and sanitization.
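    The leaf step can be illustrated with the exponential mechanism: using class counts as the utility (sensitivity 1, since one record changes a count by at most one), each label is sampled with probability proportional to exp(ε·count/2). This is a generic sketch, not DPRF_thesis itself.

    ```python
    # Exponential mechanism for choosing a leaf label from class counts
    # (utility = count, sensitivity 1). Generic sketch, not DPRF_thesis.
    import numpy as np

    def exp_mech_label(counts, eps, rng):
        scores = eps * np.asarray(counts, dtype=float) / 2.0
        probs = np.exp(scores - scores.max())  # subtract max for stability
        probs /= probs.sum()
        return rng.choice(len(counts), p=probs)

    rng = np.random.default_rng(0)
    votes = [18, 3]  # class counts in a leaf
    picks = [exp_mech_label(votes, eps=1.0, rng=rng) for _ in range(1000)]
    print(np.mean(picks))  # mostly class 0, occasionally class 1
    ```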
  • Joosten, Rick (2020)
    In the past two decades, an increasing number of discussions have been held via online platforms such as Facebook or Reddit. The most common disruptors of these discussions are trolls. Traditional trolls try to derail the discussion into a nonconstructive argument; one strategy to achieve this is to give asymmetric responses, responses that don't follow the conventional patterns. In this thesis we propose a modern machine learning NLP method called ULMFiT to automatically detect the discourse acts of online forum posts in order to detect these conversational patterns. ULMFiT fine-tunes the language model before training its classifier in order to create a more accurate language representation of the domain language. This task of discourse act recognition is unique in that it attempts to classify the pragmatic role of each post within a conversation, as opposed to the functional role, which relates to tasks such as question-answer retrieval, sentiment analysis, or sarcasm detection. Furthermore, most discourse act recognition research has focused on synchronous conversations where all parties can directly interact with each other, while this thesis looks at asynchronous online conversations. Trained on a dataset of Reddit discussions, the proposed model achieves a Matthews correlation coefficient of 0.605 and an F1-score of 0.69 for predicting discourse acts. Other experiments also show that this model is effective at question-answer classification, and that language model fine-tuning has a positive effect both on classification performance and on the required size of the training data. These results could be beneficial for current trolling detection systems.
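    ULMFiT's two stages, fine-tuning a language model on the domain text and then reusing its encoder for classification, look roughly like this in fastai (v2 API assumed); the DataFrame is a placeholder for the Reddit data and far too small to actually train on.

    ```python
    # Sketch of ULMFiT in fastai v2: LM fine-tuning, then classification.
    # Replace the toy DataFrame with the real, realistically sized dataset.
    from fastai.text.all import *
    import pandas as pd

    df = pd.DataFrame({
        "text": ["What does this error mean?", "Try updating the driver."],
        "act":  ["question", "answer"],
    })

    # Stage 1: domain language-model fine-tuning
    dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
    lm = language_model_learner(dls_lm, AWD_LSTM)
    lm.fine_tune(1)
    lm.save_encoder("ft_enc")

    # Stage 2: discourse-act classifier on the fine-tuned encoder
    dls = TextDataLoaders.from_df(df, text_col="text", label_col="act",
                                  text_vocab=dls_lm.vocab)
    clf = text_classifier_learner(dls, AWD_LSTM)
    clf.load_encoder("ft_enc")
    clf.fine_tune(3)
    ```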
  • Lange, Moritz Johannes (2020)
    In the context of data science and machine learning, feature selection is a widely used technique that focuses on reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by preventing data redundancy and over-fitting, but can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled, which calls for so-called semi-supervised techniques that can utilise both labelled and unlabelled data. While unlabelled data is often obtainable in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which offers a way to do feature selection in a semi-supervised setup. Distribution matching is a wrapper method: it trains models in order to select the features that most improve model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function which approximates the expected loss on unseen data. In experiments, the method is shown to successfully minimise the expected loss transparently on a synthetic dataset. Additionally, a comparison with related methods is performed on the more complex EMNIST dataset.
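    A wrapper-style selection loop in the spirit of distribution matching can be skeletonized as below; the cost function here is a labeled hypothetical placeholder (plain labelled CV loss), whereas the thesis's actual objective additionally folds the unlabelled sample into its approximation of the expected loss.

    ```python
    # Skeleton of a wrapper feature-selection loop. The cost() body is a
    # hypothetical placeholder; the thesis's cost also uses X_unlab.
    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def cost(features, X_lab, y_lab, X_unlab):
        # placeholder: labelled CV loss only (the thesis's objective also
        # approximates expected loss on unseen data via X_unlab)
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X_lab[:, features], y_lab, cv=3).mean()
        return 1.0 - acc

    def select(X_lab, y_lab, X_unlab, k):
        best = min(combinations(range(X_lab.shape[1]), k),
                   key=lambda f: cost(list(f), X_lab, y_lab, X_unlab))
        return list(best)

    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(60, 5)); y_lab = (X_lab[:, 1] > 0).astype(int)
    X_unlab = rng.normal(size=(500, 5))
    print(select(X_lab, y_lab, X_unlab, k=2))  # should include feature 1
    ```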