
Browsing by master's degree program "Datatieteen maisteriohjelma"


  • Sykkö, Antti (2023)
    Official Statistics (OS) are crucial in facilitating informed and reliable decision-making. However, while the demand for diverse and precise information surges, challenges in obtaining accurate data emerge. The declining response rates to statistical surveys and escalating data collection costs further exacerbate the situation, particularly in surveys measuring rare events. This thesis explores the application of the Bayesian framework to statistical production. The concept of OS and the fundamental principles that guide their production are introduced. The suitability of the Bayesian approach for OS production is assessed from theoretical, philosophical, and practical standpoints. The core of statistical inference is explored, and the differences between the Bayesian and frequentist approaches are compared. General tools for Bayesian inference and their practical utilization are presented, focusing especially on the graphical representation of a probabilistic model. Furthermore, a progressive construction of the proposed baseline model for analyzing Recreational Fishing Survey data is illustrated, with attention given to the issue of selection bias. A Bayesian version of the Finnish Recreational Fishing Statistics 2020, with concise content produced through the developed model and the genuine data collected for OS, is also presented. While the thesis underscores that the proposed model should be regarded as a basis for further development, the results indicate that reasonable assessments can be obtained even with a simple Bayesian model. Overall, this thesis emphasizes the importance of adopting Bayesian thinking in statistical analysis to enhance knowledge-driven policy-making and adapt to evolving information needs.
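    As a toy illustration of the Bayesian workflow the abstract refers to, the sketch below estimates a rare-event survey proportion with a Beta-Binomial model in SciPy; the counts and the flat prior are assumptions for illustration, not the thesis's actual recreational-fishing model.

    ```python
    # Minimal Beta-Binomial sketch of Bayesian estimation for a rare-event
    # survey proportion (illustrative only; not the thesis's fishing model).
    from scipy import stats

    n_respondents = 2000   # hypothetical survey size
    n_positive = 37        # hypothetical count of the rare event

    # Flat Beta(1, 1) prior; the posterior is Beta(1 + k, 1 + n - k).
    posterior = stats.beta(1 + n_positive, 1 + n_respondents - n_positive)

    point_estimate = posterior.mean()
    interval = posterior.ppf([0.025, 0.975])   # 95% credible interval
    print(f"posterior mean: {point_estimate:.4f}")
    print(f"95% credible interval: [{interval[0]:.4f}, {interval[1]:.4f}]")
    ```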
  • Tene, Idan (2024)
    Accurate forest height estimates are critical for environmental, ecological, and economic reasons. They are a crucial parameter for developing forest management responses to climate change and for sustainable forest management practices, and are a good covariate for estimating biomass, volume, and biodiversity, among others. With the increased availability of Light Detection and Ranging (LiDAR) data and high-resolution images (both satellite and aerial), it has become more common to estimate forest heights from the sensory fusion of these instruments. However, comparing recent advancements in height estimation methods is challenging due to the lack of a framework that considers the impact of varying data resolutions (which can range from 1 meter to 100 meters) used with techniques like convolutional neural networks (CNNs). In this work, we address this gap and explore how resolution affects error metrics in forest height estimation. We implement and replicate three state-of-the-art convolutional neural networks, and analyse how their error metrics change as a function of the input and target resolution. Our findings suggest that as resolution decreases, the error metrics appear to improve. We hypothesize that this improvement does not reflect a true increase in accuracy, but rather a fundamental shift in what the model is learning at lower resolutions. We identify a possible change point between 3 meter and 5 meter resolution, where estimating forest height potentially transitions to estimating overall forest structure.
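    The sketch below illustrates, on assumed synthetic data rather than the thesis's LiDAR and CNN setup, why pixel-wise error metrics can look better at coarser resolutions: block-averaging both the reference and a noisy prediction cancels much of the per-pixel noise.

    ```python
    # Illustrative NumPy sketch: RMSE of a noisy height "prediction" shrinks
    # as both maps are block-averaged to coarser resolutions, even though no
    # real accuracy is gained (toy data, not the thesis's CNN estimates).
    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.uniform(5.0, 30.0, size=(300, 300))         # "1 m" canopy heights
    pred = truth + rng.normal(0.0, 4.0, size=truth.shape)   # noisy estimate

    def block_mean(a, k):
        """Average non-overlapping k x k blocks (crop edges that do not fit)."""
        h, w = a.shape[0] // k * k, a.shape[1] // k * k
        return a[:h, :w].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

    for k in (1, 3, 5, 10):
        rmse = np.sqrt(np.mean((block_mean(truth, k) - block_mean(pred, k)) ** 2))
        print(f"resolution factor {k:2d}: RMSE = {rmse:.2f} m")
    ```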
  • Vainionpää, Matti (2022)
    The aim of the research presented in this dissertation is to construct a model for personalised item recommendations in an online setting using a reinforcement learning approach, specifically Thompson sampling, which is part of the family of multi-armed bandit algorithms. Moreover, the setting involves an online shopfront where arriving customers are shown the recommended item and make purchasing decisions about it. The recommendations are conducted by the multi-armed bandit algorithm, which "plays" different arms, represented by the items, while learning, exploring and exploiting the underlying distributions of the data that is obtained. Thompson sampling and the theory behind it are introduced thoroughly, and a comparison against other bandit algorithms as well as a multinomial logistic regression model is conducted both on real-life data collected over time from an online environment and on a dummy data set. The experiments focus on the applicability of bandits in the setting, dealing with the challenges that a bandit algorithm may face and the strengths they have over more traditional and well-known models, such as the logistic regression model, in the setting at hand.
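    A minimal Beta-Bernoulli Thompson sampling loop in the spirit of the algorithm described above; the three items and their purchase probabilities are made up for illustration.

    ```python
    # Beta-Bernoulli Thompson sampling sketch for item recommendation
    # (illustrative; item purchase probabilities below are invented).
    import numpy as np

    rng = np.random.default_rng(42)
    true_purchase_prob = np.array([0.02, 0.05, 0.03])   # hypothetical items
    alpha = np.ones(3)   # Beta posterior "successes + 1" per arm
    beta = np.ones(3)    # Beta posterior "failures + 1" per arm

    for _ in range(10_000):
        theta = rng.beta(alpha, beta)      # sample a rate from each arm's posterior
        item = int(np.argmax(theta))       # recommend the best sampled item
        purchased = rng.random() < true_purchase_prob[item]
        alpha[item] += purchased           # update only the played arm
        beta[item] += 1 - purchased

    print("posterior means:", alpha / (alpha + beta))
    ```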
  • Kramar, Vladimir (2022)
    This work presents a novel concept of categorising failures within test logs using string similarity algorithms. The concept was implemented in the form of a tool that went through three major iterations to reach its final version. These iterations are the following: 1) utilising two state-of-the-art log parsing algorithms, 2) manual log parsing of the Pytest testing framework, and 3) parsing of .xml files produced by the Pytest testing framework. The unstructured test logs were automatically converted into a structured format using the three approaches. Then, the structured data was compared using five different string similarity algorithms (Sequence Matcher, Jaccard index, Jaro-Winkler distance, cosine similarity and Levenshtein ratio) to form the clusters. The results from each approach were validated across three different data sets. The concept was validated by implementing an open-source Test Failure Analysis (TFA) tool. The validation phase revealed the best implementation approach (approach 3) and the best string similarity algorithm for this task (cosine similarity). Lastly, the tool was deployed into an open-source project’s CI pipeline. The results of this integration, application and usage are reported. The resulting tool significantly reduces software engineers’ manual and error-prone work by utilising cosine similarity as a similarity score to form clusters of failures.
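    A rough sketch of the clustering idea: vectorize failure messages with TF-IDF, compute cosine similarity, and greedily group messages above a threshold. The example messages and the 0.5 threshold are assumptions, not taken from the TFA tool.

    ```python
    # Cosine-similarity clustering sketch over TF-IDF vectors of failure text
    # (greedy thresholding; illustrative only).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    failures = [
        "AssertionError: expected 200 got 500 in test_login",
        "AssertionError: expected 200 got 500 in test_logout",
        "TimeoutError: database connection timed out",
    ]
    tfidf = TfidfVectorizer().fit_transform(failures)
    sim = cosine_similarity(tfidf)

    clusters, assigned = [], [False] * len(failures)
    for i in range(len(failures)):
        if assigned[i]:
            continue
        members = [j for j in range(len(failures)) if not assigned[j] and sim[i, j] >= 0.5]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    print(clusters)   # e.g. [[0, 1], [2]]
    ```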
  • Rodriguez Beltran, Sebastian (2024)
    DP-SGD (Differentially Private Stochastic Gradient Descent) is the gold-standard approach for implementing privacy in deep learning settings. DP-SGD achieves this by clipping the gradients during training and then injecting noise. The algorithm aims to limit the impact of any data point from the training dataset on the model. Therefore, the gradient of an individual sample contributes information only up to a certain point, limiting the chances of an inference attack discovering the data used for the model training. While DP-SGD ensures the privacy of the model, there is no free lunch, and it has its downsides in terms of the utility-privacy trade-off and an increase in computational resources for training. This thesis aims to evaluate different DP-SGD implementations in terms of performance and computational efficiency. We compare the use of optimized clipping algorithms, different GPU architectures, speed-up by compilation, lower precision in the data representation, and distributed training. These strategies effectively reduce the computational cost of adding privacy to deep learning training compared to the non-private baseline.
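    A conceptual NumPy sketch of the per-sample clipping and noise injection at the core of DP-SGD; the gradients, clipping norm and noise multiplier are toy values, and a real implementation would use a library such as Opacus.

    ```python
    # One conceptual DP-SGD step: clip each per-sample gradient to norm C,
    # sum, add Gaussian noise, average, then take a normal SGD step.
    import numpy as np

    rng = np.random.default_rng(0)
    per_sample_grads = rng.normal(size=(32, 10))   # 32 samples, 10 parameters
    C, noise_multiplier, lr = 1.0, 1.1, 0.1

    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads / np.maximum(1.0, norms / C)   # clip to norm C
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, noise_multiplier * C, size=10)
    private_grad = noisy_sum / len(per_sample_grads)

    params = np.zeros(10)
    params -= lr * private_grad    # standard update with the privatized gradient
    ```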
  • Zubair, Maria (2022)
    The growing popularity of the Internet of Things (IoT) has massively increased the volume of data available for analysis. This data can be used to get detailed and precise insights about users, products, and organizations. Traditionally, organizations collect and process this data separately, which is a slow process and requires significant resources. Over the past decade, data sharing has become a popular trend, where several organizations have engaged in sharing their collected data with other organizations and processing it together for analysis. Digital marketplaces are developed to facilitate this data sharing. These marketplaces connect producers and consumers of data while ensuring that the data can be shared inside and outside the organization seamlessly and securely. This is achieved by implementing a fine-grained and efficient data access control method that restricts access to the data to authorized parties only. The data generated by IoT devices is voluminous, continuous, and heterogeneous. Therefore, traditional access control methods are no longer suitable for managing access to this data in a digital marketplace. IoT data requires an access control model that can handle large volumes of streaming data and provides IoT device owners with full control over and transparency of data access. In this thesis, we have designed and implemented a novel access control mechanism for a data distribution system developed by Nokia Bell Labs. We have outlined the requirements for designing an access control system to manage data access for data shared across multiple heterogeneous organizations. We have evaluated the proposed system to assess its feasibility and performance in various scenarios. The thesis also discusses the strengths and limitations of the proposed system and highlights future research perspectives in this domain. We expect this thesis to be helpful for researchers studying IoT data processing, access control methods for streaming (big) data, and digital marketplaces.
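    A deliberately simplified, hypothetical attribute-based check for a streamed IoT record; the policy and record fields below are invented for illustration and are not the access control design built for the Nokia Bell Labs system.

    ```python
    # Hypothetical fine-grained access check for one streamed IoT record.
    def is_access_allowed(policy: dict, consumer: dict, record: dict) -> bool:
        """Grant access only if the consumer's organization and purpose match
        the owner's policy and the record's topic is one the owner shares."""
        return (
            consumer["organization"] in policy["allowed_organizations"]
            and consumer["purpose"] in policy["allowed_purposes"]
            and record["topic"] in policy["shared_topics"]
        )

    policy = {
        "allowed_organizations": {"org-a", "org-b"},
        "allowed_purposes": {"analytics"},
        "shared_topics": {"temperature"},
    }
    print(is_access_allowed(policy,
                            {"organization": "org-a", "purpose": "analytics"},
                            {"topic": "temperature"}))   # True
    ```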
  • Bolanos Mejia, Tlahui Alberto (2021)
    Credit rating is one of the core tools for risk management within financial firms. Ratings are usually provided by specialized agencies which perform an overall study and diagnosis of a given firm’s financial health. Dealing with unrated entities is a common problem, as several risk models rely on the ratings’ completeness, and agencies cannot realistically rate every existing company. To solve this, credit rating prediction has been widely studied in academia. However, research on this topic tends to separate models amongst the different rating agencies due to the differences in both rating scales and composition. This work uses transfer learning, via label adaptation, to increase the number of samples for feature selection, and appends these adapted labels as an additional feature to improve the predictive power and stability of previously proposed methods. Accuracy on exact label prediction improved from 0.30 in traditional models up to 0.33 in the transfer learning setting. Furthermore, when measuring accuracy with a tolerance of 3 grade notches, accuracy increased by almost 0.10, from 0.87 to 0.96. Overall, transfer learning displayed better out-of-sample generalization.
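    The snippet below shows the two evaluation metrics quoted above, exact-label accuracy and accuracy within a tolerance of three notches, on made-up ordinal rating labels.

    ```python
    # Exact-label accuracy vs. within-3-notch accuracy on toy ordinal ratings
    # (e.g. AAA=0, AA+=1, AA=2, ...); the labels are invented.
    import numpy as np

    y_true = np.array([2, 5, 7, 10, 12])
    y_pred = np.array([2, 6, 11, 9, 15])

    exact_acc = np.mean(y_pred == y_true)
    notch_acc = np.mean(np.abs(y_pred - y_true) <= 3)
    print(f"exact accuracy: {exact_acc:.2f}, within-3-notch accuracy: {notch_acc:.2f}")
    ```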
  • Pelvo, Nasti (2024)
    Object detection and multi-object tracking are crucial components of computer vision systems aiming for comprehensive scene understanding and reliable autonomous decision making. While methods developed for visual input data are widely studied, they are susceptible to environmental factors such as poor lighting and weather conditions. Thermal imaging, on the other hand, is robust against most adverse environmental conditions and thus presents an intriguing alternative to visual photography. Due to the characteristics of thermal images, current state-of-the-art object detection and tracking methods perform poorly when presented with thermal input. Open-source thermal data for training large neural network models is not widely available: existing datasets are small and homogeneous, and the resulting models lack the generalizability required for their application to real-world input data. The effect is especially relevant for transformer-based methods, which lack visual inductive bias and thus require large-scale training. This thesis presents the first in-depth literature review and experimental study of transformer-based object detection and tracking on challenging thermal and aerial data. By conducting an analysis of existing transformer-based multi-object tracking methods, we argue for the application of the joint detection and tracking paradigm, where multi-object tracking is treated as an end-to-end problem. Our experiments on two transformer-based multi-object tracking models confirm that fully exploiting multi-frame input can increase the stability of object detection and enforce robustness against the domain issues prevalent in thermal images. Due to the high training data requirements of transformers, the methods are, however, held back by the lack of open-source training data. We thus introduce two novel data augmentation techniques which aim to supplement and diversify existing training data, and thus improve the transferability of detection and tracking methods between the visual and thermal domains.
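    As one hypothetical example of visual-to-thermal augmentation (not necessarily either of the two techniques introduced in the thesis), the sketch below turns an RGB frame into a rough pseudo-thermal image via grayscale conversion, inversion and blurring.

    ```python
    # Hypothetical pseudo-thermal augmentation of an RGB frame.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def rgb_to_pseudo_thermal(rgb: np.ndarray) -> np.ndarray:
        gray = rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114])
        inverted = 255.0 - gray                        # bright = "warm" objects
        return gaussian_filter(inverted, sigma=2.0)    # mimic low thermal detail

    frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
    thermal_like = rgb_to_pseudo_thermal(frame)
    print(thermal_like.shape)   # (480, 640)
    ```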
  • Viljamaa, Venla (2022)
    In bioinformatics, new genomes are sequenced at an increasing rate. To utilize this data in various bioinformatics problems, it must be annotated first. Genome annotation is a computational problem that has traditionally been approached by using statistical methods such as the Hidden Markov model (HMM). However, implementing these methods is often time-consuming and requires domain knowledge. Neural network-based approaches have also been developed for the task, but they typically require a large amount of pre-labeled data. Genomes and natural language share many properties, not least the fact that they both consist of letters. Genomes also have their own grammar, semantics, and context-based meanings, just like phrases in natural language. These similarities motivate the use of Natural Language Processing (NLP) techniques in genome annotation. In recent years, pre-trained Transformer neural networks have been widely used in NLP. This thesis shows that, due to the linguistic properties of genomic data, the Transformer network architecture is also suitable for gene prediction. The model used in the experiments, DNABERT, is pre-trained using the full human genome. Using task-specific labeled data sets, the model is then trained to classify DNA sequences into genes and non-genes. The main fine-tuning dataset is the genome of the Escherichia coli bacterium, but preliminary experiments are also performed on human chromosome data. The fine-tuned models are evaluated in terms of accuracy, F1-score and Matthews correlation coefficient (MCC). A customized evaluation method is developed, in which the predictions are compared to ground-truth labels at the nucleotide level. Based on that, the best models achieve a 90.15% accuracy and an MCC value of 0.4683 on the Escherichia coli dataset. The model correctly classifies even the minority label, and the execution times are measured in minutes rather than hours. These results suggest that the NLP-based Transformer network is a powerful tool for learning the characteristics of gene and non-gene sequences.
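    The sketch below shows the overlapping k-mer tokenization commonly used with DNA language models such as DNABERT (6-mers here), plus a nucleotide-level Matthews correlation coefficient computed with scikit-learn on toy labels.

    ```python
    # k-mer tokenization of a DNA sequence and an MCC score on toy labels.
    from sklearn.metrics import matthews_corrcoef

    def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
        """Return all overlapping k-mers of the sequence."""
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    print(kmer_tokenize("ATGCGTACGT"))   # overlapping 6-mers

    # Toy nucleotide-level labels: 1 = gene, 0 = non-gene.
    y_true = [0, 0, 1, 1, 1, 1, 0, 0]
    y_pred = [0, 1, 1, 1, 1, 0, 0, 0]
    print("MCC:", matthews_corrcoef(y_true, y_pred))
    ```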
  • Kivimäki, Juhani (2022)
    In this thesis, we give an overview of current methodology in the field of uncertainty estimation in machine learning, with focus on confidence scores and their calibration. We also present a case study, where we propose a novel method to improve uncertainty estimates of an in-production machine learning model operating in an industrial setting with real-life data. This model is used by a Finnish company Basware to extract information from invoices in the form of machine-readable PDFs. The solution we propose is shown to produce confidence estimates, which outperform the legacy estimates on several relevant metrics, increasing coverage of automated invoices from 65.6% to 73.2% with no increase in error rate.
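    A small sketch of the coverage versus error-rate trade-off behind confidence thresholds: only predictions whose confidence exceeds a threshold are automated. The confidences, labels and threshold are illustrative, not Basware's.

    ```python
    # Coverage and error rate among automated predictions at a confidence
    # threshold (toy numbers).
    import numpy as np

    confidence = np.array([0.99, 0.95, 0.80, 0.60, 0.97, 0.55])
    correct = np.array([1, 1, 1, 0, 1, 0])
    threshold = 0.9

    automated = confidence >= threshold
    coverage = automated.mean()
    error_rate = 1.0 - correct[automated].mean()
    print(f"coverage: {coverage:.2f}, error rate among automated: {error_rate:.2f}")
    ```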
  • Vilenius, Jaakko (2023)
    The objective of this thesis was to explore the article corpus of a domain-specific Finnish-language newspaper to generate a new set of content tags to replace an existing one of poor quality. The articles used as the dataset in this study had previously been assigned content tags by humans, but in the absence of a proper tagging strategy and guidelines, the assigned tags were found to include too much variation to be useful, e.g., as search terms. No supervised learning models were used, since there was no good-quality training material available specific to the topics of the data. Instead, we experimented with generating new tag sets using two unsupervised methods, along with a few variations, based on nouns and proper nouns in the text content of the articles. A proper tag set would be useful in tagging future articles automatically or in drafting guidelines for manual tagging of future articles by the journalists. A limited survey among subject matter experts and other respondents was conducted in order to evaluate the results generated by the methods. In general, the results were not encouraging, with the most basic model, TF-IDF, clearly performing better than the other models across all respondents. Further examination after topic modeling using Latent Dirichlet Allocation (LDA) revealed that somewhat better scores could be found for some topics. A further manual task of assessing and naming the topics, followed by a new tagging effort within topics, was suggested as a next step to overcome the deficiencies of the presented methods.
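    A minimal TF-IDF tag-candidate sketch with scikit-learn: rank each article's terms by weight and keep the top few. The toy corpus is in English; the thesis worked with Finnish text and noun filtering, which are omitted here.

    ```python
    # Rank each document's terms by TF-IDF weight and keep the top candidates.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    articles = [
        "central bank raises interest rates amid inflation concerns",
        "local football team wins championship after dramatic final",
    ]
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(articles)
    terms = np.array(vec.get_feature_names_out())

    for row in X.toarray():
        top = terms[np.argsort(row)[::-1][:3]]
        print(list(top))    # candidate tags for each article
    ```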
  • Gundyreva, Elina (2022)
    In this thesis, a novel method for linking scientific articles to taxonomy terms in the domain of food systems research is presented. With food systems at the center of 12 of the 17 United Nations Sustainable Development Goals, there has been an ever-growing number of scientific articles in this field. These articles are vital in understanding the complex nature of food systems and their inter-dependencies. However, finding relevant literature in this field is difficult for decision makers, given the interdisciplinary nature of the field and the fact that annotation and expert feedback are expensive. In the thesis, BERT-based models (SBERT, SPECTER and SciBERT) are adapted to the food systems area and fine-tuned for tasks such as text classification and text similarity, which represents a solution to the problem of finding relevant articles in the food systems domain. The proposed search system uses several taxonomies and data augmentation to achieve the results, which are visualized on a website created for this purpose. Linking food systems research articles to taxonomy terms shows good accuracy, with models fine-tuned on domain data achieving better performance on the classification task. The best fine-tuning strategy for SPECTER and SciBERT is the combination of domain adaptation and classification. Fine-tuning SBERT for text similarity improves its performance only slightly. The proposed method can be used in domains other than food systems.
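    A sketch of taxonomy linking by embedding similarity with Sentence-BERT; the model name, taxonomy terms and abstract below are illustrative stand-ins, not the fine-tuned models or data used in the thesis.

    ```python
    # Score taxonomy terms against an article abstract by cosine similarity
    # of sentence embeddings (model name is an example, not the thesis model).
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    taxonomy_terms = ["food security", "crop yields", "supply chains"]
    abstract = "We study how drought affects wheat harvests and grain logistics."

    term_emb = model.encode(taxonomy_terms, convert_to_tensor=True)
    abs_emb = model.encode(abstract, convert_to_tensor=True)
    scores = util.cos_sim(abs_emb, term_emb)[0]
    for term, score in zip(taxonomy_terms, scores):
        print(f"{term}: {float(score):.3f}")
    ```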
  • Lehtiranta, Jarkko (2023)
    With the growing concerns over data privacy and new regulations like the General Data Protection Regulation (GDPR), there has been increased attention to privacy-preserving synthetic data generation methods. However, the usability of these methods has received limited attention. This thesis focuses on the usability challenges associated with privacy-preserving synthetic data generation methods based on probabilistic graphical models. This thesis addresses usability challenges related to running time efficiency, applicability with continuous data and query selection with experiments conducted on different datasets. This thesis aims to bridge the gap between cutting-edge privacy-preserving synthetic data generation methods and their practical implementation with real-world datasets. Proposed solutions have the potential to make these methods more accessible and usable, thereby facilitating their broader adoption. The thesis concludes by summarizing the key results and emphasizing the importance of addressing usability challenges in privacy-preserving synthetic data generation methods.
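    A toy sketch of the basic building block behind graphical-model-based differentially private synthesis: measure a one-way marginal, add Laplace noise scaled by sensitivity/epsilon, and sample synthetic records from the normalized noisy marginal. Real methods over multiple marginals of a graphical model are considerably more involved.

    ```python
    # Noisy-marginal sketch for DP synthetic data on one categorical column.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.integers(0, 4, size=1000)            # one categorical column
    counts = np.bincount(data, minlength=4).astype(float)

    epsilon, sensitivity = 1.0, 1.0
    noisy = counts + rng.laplace(0.0, sensitivity / epsilon, size=4)
    probs = np.clip(noisy, 0, None)
    probs /= probs.sum()                            # normalized noisy marginal

    synthetic = rng.choice(4, size=1000, p=probs)   # synthetic column
    print(np.bincount(synthetic, minlength=4))
    ```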
  • Hussain, Zafar (2020)
    The National Library of Finland has digitized newspapers starting from the late eighteenth century. The digitized data of Finnish newspapers is a heterogeneous data set, which contains the content and metadata of historical newspapers. This research work focuses on studying this rich material to find a data-driven categorization of newspapers. Since the data is not known beforehand, the objective is to understand the development of newspapers and use statistical methods to analyze the fluctuations in the attributes of this metadata. An important aspect of this research work is to study the computational and statistical methods which can better express the complexity of Finnish historical newspaper metadata. Exploratory analyses are performed to get an understanding of the attributes and extract the patterns among them. To explicate the attributes’ dependencies on each other, Ordinary Least Squares and linear regression methods are applied. The results of these regression methods confirm the significant correlation between the attributes. To categorize the data, spectral and hierarchical clustering methods are studied for grouping the newspapers with similar attributes. The clustered data further helps in dividing and understanding the data over time and place. Decision trees are constructed to split the newspapers according to the attributes’ logical divisions. The results of Random Forest decision trees show the paths of development of the attributes. The goal of applying these various methods is to get a comprehensive interpretation of the attributes’ development based on language, time, and place, and to evaluate the usefulness of the methods on the newspaper data. From the features’ perspective, area appears to be the most important feature, and in a language-based comparison Swedish newspapers are ahead of Finnish newspapers in adopting the popular trends of the time. When the newspaper publishing places are divided into regions, small towns show more fluctuations in publishing trends, while from the perspective of time the second half of the twentieth century has seen a large increase in newspapers and publishing trends. This research work brings together information on the regions, language, page size, density, and area of newspapers and offers a robust statistical analysis of newspapers published in Finland.
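    A sketch of the kind of OLS fit used to relate newspaper attributes to each other with statsmodels; the column names and the synthetic relationship below are hypothetical, not the actual metadata schema.

    ```python
    # OLS regression of one newspaper attribute on two others (synthetic data).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "page_size": rng.uniform(1000, 3000, 200),
        "density": rng.uniform(0.1, 0.9, 200),
    })
    df["area"] = 0.5 * df["page_size"] + 100 * df["density"] + rng.normal(0, 50, 200)

    X = sm.add_constant(df[["page_size", "density"]])
    model = sm.OLS(df["area"], X).fit()
    print(model.summary().tables[1])   # coefficients and significance
    ```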
  • Toivanen, Pihla (2019)
    Fake news has in recent years become a significant topic of public debate both in Finland and abroad. For example, during the 2016 United States presidential election, some fake news stories spread more widely than the most popular mainstream media news, and fake news is believed to have contributed significantly to Trump's victory in that election. Previous Finnish research has shown that in Finland fake news does not always contain outright false information, which is why Finnish fake media outlets are also called counter media. It is also known that Finnish counter-media articles often frame mainstream media news to support the counter media's own agenda. In communication research, framing refers to the process by which the interpretation of a media presentation is shaped through selection, exclusion and devices such as metaphors and slogans. The concept of a frame and frame analysis originated in social psychology and have since spread to media research. Computationally, frame analysis has been carried out with both supervised and unsupervised machine learning methods, but none of these methods has become established because of the ambiguity of operationalizing the frame concept. The purpose of this thesis is to examine the processes by which a Finnish counter medium reframes mainstream media news, and to apply supervised machine learning to recognizing different ways of framing. To answer the research questions, a comprehensive dataset was collected from one Finnish counter-media outlet, and the articles containing a link to mainstream media were separated from the data. Three ways in which the counter medium frames mainstream media news were then identified qualitatively: criticizing the mainstream media, copying content, and using the mainstream media source as a tool for argumentation. In this thesis, a supervised machine learning model is built to identify these three framing processes. The model was built by classifying a random sample of 1,000 articles containing a mainstream media source into the three framing-process categories listed above. Various features were then extracted from the labeled data, and a classifier was built on top of them. Different random forest classifiers and support vector machines were compared, of which one random forest classifier performed best on the classification task. The classifier cannot, however, be considered accurate enough for most practical applications requiring very high accuracy. Nevertheless, the features that the classifier considers most important provide new information on how different words and textual formatting devices are used in the different ways of framing. For example, the number of links and the number of subheadings used in the articles were among the most important features for the classifier. The results suggest that in computational media research it is useful to extract not only words but also formatting data related to the article. Another key result is that supervised machine learning can be used to recognize different orientations toward a media source.
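    A sketch of the classification setup described above: bag-of-words features combined with simple layout features (link and subheading counts) fed to a random forest. The toy texts, labels and feature values are invented, and Finnish-specific preprocessing is omitted.

    ```python
    # Combine word counts with layout features and train a random forest.
    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["mainstream media hides the truth", "as reported by the newspaper",
             "copy of the original article text"]
    layout = np.array([[5, 2], [1, 0], [0, 3]])   # [link count, subheading count]
    labels = [0, 2, 1]                            # criticize / argue / copy

    X_words = CountVectorizer().fit_transform(texts)
    X = hstack([X_words, csr_matrix(layout)]).tocsr()
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
    print(clf.feature_importances_[-2:])          # importance of layout features
    ```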
  • Kovapohja, Fanni (2022)
    Time-dependent hierarchical data is a complex type of data that is difficult to visualize in a clear manner. It can be found in many real-life situations, for example in customer analysis, but the best practices for visualizing this type of data are not commonly known in the business world. This thesis focuses on visualizing changes over time in hierarchical customer data using the Plotly Python Graphing Library and is written as an assignment for a Finnish company. The thesis consists of a literature survey and an experimental part. The literature survey introduces the most common hierarchical visualization methods and the different possible encoding techniques for adding a time dimension on top of these methods. Moreover, the pros and cons of the different visualization techniques and encodings are discussed. In the experimental part of the thesis, visualization prototypes are designed using the Plotly Python Graphing Library. A customer data set of the commissioning company is partitioned into hierarchical customer segments using the hierarchical industrial classification TOL 2008, and changes over time in a continuous variable are visualized by these segments. Two hierarchical visualization techniques, the sunburst chart and the treemap, are used to create two prototype versions, and a combination of color, typography, and interaction is used to encode the time dimension in these prototypes. The same prototypes are also used to visualize customer segments by an artificial hierarchy created by combining multiple categorical features into a hierarchical structure. The prototypes are validated in the commissioning company by arranging an end-user study and an expert review. Concerning the prototypes based on the industrial classification: according to the end users and experts, both prototype versions are very useful and well implemented. Among the end users, there was no significant difference in which of the two prototype versions is faster to use, but a clear majority of the respondents regarded the sunburst chart version as their favorite prototype. The two experts who participated in the expert review had different opinions on which of the prototype versions they would select to be utilized in practice. Concerning the prototypes based on the artificial hierarchy: these prototypes also received positive feedback, but the possibility to change the order of features in the hierarchy was considered an extremely important development idea. ACM Computing Classification System (CCS): Human-Centered Computing → Visualization → Visualization Techniques; Human-Centered Computing → Visualization → Empirical Studies in Visualization
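    A minimal Plotly sunburst sketch for a two-level hierarchy; the TOL 2008-style classes and the value column are stand-ins for the commissioning company's customer data, and a treemap version would only swap px.sunburst for px.treemap.

    ```python
    # Two-level sunburst chart of a hierarchical segmentation with Plotly.
    import pandas as pd
    import plotly.express as px

    df = pd.DataFrame({
        "main_class": ["C Manufacturing", "C Manufacturing", "G Trade"],
        "sub_class": ["C10 Food", "C16 Wood", "G47 Retail"],
        "revenue": [120, 80, 200],
    })
    fig = px.sunburst(df, path=["main_class", "sub_class"], values="revenue")
    fig.show()
    ```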
  • Holappa, Hilma (2022)
    Electronic Health Record (EHR) data is one type of data that is continuously produced by hospitals all over the world. There are multiple applications for this data; one of them is the MIMIC dataset, whose applications are focused on machine learning; however, there is a lack of visualisations of this dataset. The topic of this thesis is visualising EHR data, and more specifically the MIMIC dataset, to help medical students gather insights into individual patient data. Considerations in applying this dataset to an individual patient visualisation are also examined. In this thesis I first familiarized myself with interactive visualisation design principles using Munzner's method. In the first stage I built a preliminary visualisation and gathered feedback with user interviews. In the second stage I built a final visualisation and gathered final feedback and insights with user interviews. The visualisation consists of a dot graph for medication, a line graph for patient response and a table with diagnoses. The insights were analysed with the Framework Method for qualitative data analysis. From the feedback on the visualisation and the insights gathered, it is concluded that there are some issues that need to be taken into account when considering this dataset for individual patient visualisations. Insights from this visualisation were mostly factual, and this type of visualisation seems to be slightly too challenging for medical students to gather deeper insights from. The medication graph part of the final visualisation was seen as a possible addition to current patient visualisations used in hospitals, and the idea of using this visualisation as a teaching tool could only be considered with further additions and a different dataset.
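    An illustrative sketch of the two main views described above, a dot plot of medication administrations and a line plot of a patient response value over time, using Plotly on synthetic values rather than MIMIC data.

    ```python
    # Medication dot plot and patient-response line plot on synthetic data.
    import pandas as pd
    import plotly.express as px

    meds = pd.DataFrame({"time": [1, 3, 5, 8], "drug": ["A", "A", "B", "A"]})
    response = pd.DataFrame({"time": range(10),
                             "heart_rate": [80, 82, 85, 90, 88, 86, 84, 83, 82, 81]})

    px.scatter(meds, x="time", y="drug", title="Medication administrations").show()
    px.line(response, x="time", y="heart_rate", title="Patient response").show()
    ```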
  • Sinisalo, Erkki (2022)
    This thesis investigates the problem of Visual Simultaneous Localization and Mapping (vSLAM) in changing environments. The vSLAM problem is to sequentially estimate the pose of a device with mounted cameras in a map generated based on images taken with those cameras. vSLAM algorithms face two main challenges in changing environments: moving objects and temporal appearance changes. Moving objects cause problems in pose estimation if they are mistaken for static objects. Moving objects also cause problems for loop closure detection (LCD), which is the problem of detecting whether a previously visited place has been revisited. The same moving object observed in two different places may cause false loop closures to be detected. Temporal appearance changes, such as those brought about by time of day or weather changes, cause long-term data association errors for LCD. These cause difficulties in recognizing previously visited places after they have undergone appearance changes. Focus is placed on LCD, which turns out to be the part of vSLAM that changing environments affect the most. In addition, several techniques and algorithms for Visual Place Recognition (VPR) in challenging conditions that could be used in the context of LCD are surveyed, and the performance of two modern state-of-the-art VPR algorithms in changing environments is assessed in an experiment in order to measure their applicability for LCD. The most severe performance-degrading appearance changes are found to be those caused by changes in season and illumination. Several algorithms and techniques that perform well in loop-closure-related tasks in specific environmental conditions are identified as a result of the survey. Finally, a limited experiment on the Nordland dataset implies that the tested VPR algorithms are usable as-is or can be modified for use in long-term LCD. As part of the experiment, a new simple neighborhood consistency check was also developed, evaluated, and found to be effective at reducing the false positives output by the tested VPR algorithms.
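    A hypothetical sketch of a neighborhood consistency check for VPR matches: a query-to-reference match is kept only if at least one adjacent query frame maps to a nearby reference frame. The thesis's actual check may differ; the matches array is invented.

    ```python
    # Filter place-recognition matches by consistency with neighboring frames.
    import numpy as np

    def consistent(matches, i, tol=2):
        """matches[i] = reference index matched to query frame i; keep the
        match if an adjacent query frame maps to a nearby reference frame."""
        checks = []
        for j in (i - 1, i + 1):
            if 0 <= j < len(matches):
                checks.append(abs(matches[j] - (matches[i] + (j - i))) <= tol)
        return any(checks)

    matches = np.array([10, 11, 40, 13, 14])   # frame 2 is an outlier match
    print([consistent(matches, i) for i in range(len(matches))])
    # [True, True, False, True, True] -> only the outlier is rejected
    ```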