Browsing by master's degree program "Master's Programme in Data Science"

  • Yeom Song, Victor Manuel (2024)
    Planning and decision making are active areas of research in cognitive neuroscience that strive to explain how the brain makes decisions in complex scenarios. Research in this field has traditionally been restricted to simplistic experiments such as two-alternative forced choice situations, and has relatively recently broken into more naturalistic settings with the help of computational modeling and games. Importantly, these computational models aim to be interpretable, meaning that they are crafted so that each parameter has a clear meaning, perhaps in contrast to massive neural networks. However, the latter may better capture more complex behaviors that a hand-crafted model could miss, so it may be desirable to use a neural network as a guide or "oracle" to study and improve the parameters to include in the interpretable model. In this thesis, we present GPT-4IAR, a transformer neural network architecture for modeling and predicting human behavior in the board game four-in-a-row (4IAR). Building upon previous studies that use fully connected neural networks to improve models around 4IAR, and the excellent capabilities of the GPT architecture in tasks where data is sequential, we train a transformer on millions of games of 4IAR to study biases that arise in human decision making. Experiments show that conditioning action predictions on longer histories of previous moves leads to improved accuracy over prior state-of-the-art models, hinting at longer-term strategic biases in human gameplay. Reaction time prediction is also explored, showing promise in capturing meaningful gameplay statistics beyond raw actions.
  • Martikainen, Jussi-Pekka (2019)
    Wood is the fuel for the forest industry. Harvestable wood is collected from the forests and must be transported to the mills, which are quite often very far away. In Finland, the most common means of long-distance wood transportation is by road, using wood trucks. The poor condition of the lower road network increases transportation costs not only for the forest industry but for the whole natural resources industry. Timely information about the condition of the lower road network is considered beneficial for wood transportation and for road maintenance planning, as it can reduce transportation-related costs. Acquiring such information is laborious for industry specialists because of the vast size of the road network in Finland. Until the recent developments in ubiquitous mobile computing, collecting road measurement data and detecting certain road anomalies from the measurements required expensive and specialized equipment. Crowdsensing with the capabilities of a modern smartphone is seen as an inexpensive means, with high potential, of acquiring timely information about the condition of the lower road network. In this thesis, a literature review is conducted to identify the factors behind the deterioration of the lower road network in Finland. Initial assumptions are drawn about the detectability of such factors from the inertial sensor data of a smartphone. The literature on computational methods for detecting road anomalies from accelerometer and gyroscope measurement data is reviewed. As a result, a summary of the usability of the reviewed computational methods for detecting the reviewed deteriorative factors is presented. Finally, suggestions are presented for further analysis, for obtaining more training data for machine learning methods, and for predicting road conditions.
  • Moisio, Mikko (2021)
    Semantic textual similarity (STS), the procedure of determining how similar pieces of text are in terms of their meaning, is an important problem in the rapidly evolving field of natural language processing (NLP). STS accelerates major information retrieval applications dealing with natural language text, such as web search engines. For reasons of computational efficiency, text pieces are often encoded into semantically meaningful real-valued vectors, sentence embeddings, that can be compared with similarity metrics. The majority of recent NLP research has focused on a small set of the largest Indo-European languages and Chinese. Although much of the research is machine learning oriented and is thus often applicable across languages, languages with smaller speaker populations, such as Finnish, often lack the annotated data required to train, or even evaluate, complex models. BERT, a language representation framework building on transfer learning, is one of the recent quantum leaps in NLP research. BERT-type models take advantage of unsupervised pre-training, which reduces the annotated data demands of supervised tasks. Furthermore, a BERT modification called Sentence-BERT enables us to extend and train BERT-type models to derive semantically meaningful sentence embeddings. However, although the annotated data demands for conventional training of a Sentence-BERT are relatively low, such data is often unavailable for low-resource languages. Multilingual knowledge distillation has been shown to be a working strategy for extending monolingual Sentence-BERT models to new languages. This technique allows transferring and merging desired properties of two language models and, instead of annotated data, consumes bilingual parallel samples. In this thesis we study using knowledge distillation to transfer STS properties learnt from English into a model pre-trained on Finnish while bypassing the lack of annotated Finnish data.
Further, we experiment with distillation using different types of data (English-Finnish bilingual, English monolingual, and random pseudo samples) to observe which properties of the training data are really necessary. We acquire a bilingual English-Finnish test dataset by translating an existing annotated English dataset and use this set to evaluate the fit of our resulting models. We evaluate the performance of the models in different tasks (English, Finnish, and English-Finnish cross-lingual STS) to observe how well the properties being transferred are captured, and how well the models retain the desired properties they already have. We find that knowledge distillation is indeed a feasible approach for obtaining a relatively high-quality Sentence-BERT for Finnish. Surprisingly, in all setups a large portion of the desired properties is transferred to the Finnish model, and training with English-Finnish bilingual data yields the best Finnish sentence embedding model we are aware of.
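The two ideas in the abstract above (comparing sentence embeddings with a similarity metric, and distilling a teacher model into a student using parallel sentences) can be illustrated with a minimal sketch. It assumes cosine similarity as the metric and a mean-squared-error distillation objective in which the student's embeddings of both the English sentence and its Finnish translation are pulled toward the teacher's English embedding. The abstract does not state the exact loss used in the thesis, and all function names below are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mse(u, v):
    """Mean squared error between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def distillation_loss(teacher_en, student_en, student_fi):
    """Assumed objective: the student should reproduce the teacher's English
    embedding for both the English sentence and its Finnish translation."""
    return mse(teacher_en, student_en) + mse(teacher_en, student_fi)

# Toy 2-dimensional "embeddings" (illustrative values, not real model output).
teacher_en = [1.0, 0.0]
student_en = [1.0, 0.0]
student_fi = [0.0, 0.0]
loss = distillation_loss(teacher_en, student_en, student_fi)
```

In this toy case the English term of the loss is zero and only the Finnish embedding contributes, which is the signal a student model would be trained to reduce.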
  • Eurasto, Felix (2024)
    G protein-coupled receptors (GPCRs) constitute the largest family of receptors in humans. They are involved in the regulation of major biological processes including sight, taste, and mood. Due to their prevalence in the human body and involvement in such a wide range of tasks, GPCRs are medically extremely important. GPCRs are cell-surface receptors, responsible for conveying biological messages from the extracellular domain to the cytoplasmic region. As such, GPCRs are constantly interacting with the lipids of the cell membrane. These interactions are thought to mediate the activation behaviour of the GPCRs, although the exact nature of these effects is often unknown. The beta-2 adrenergic receptor (β2AR) is a class-A GPCR whose native ligand is adrenaline. It plays a crucial role in the activation of the sympathetic nervous system to trigger the fight-or-flight response. Many GPCRs exhibit basal activity. That is, these receptors can activate even in the absence of an activating ligand. β2AR is one of these GPCRs. The specific cause and mechanism of basal activity are often unknown and, as of the start of the project presented in this thesis, were undetermined for β2AR. We used high-throughput fully atomistic molecular dynamics (MD) simulations coupled with machine learning (ML) methods to ascertain specific interactions between a highly conserved aspartate residue of β2AR and phosphatidylcholine (PC) lipids that stabilize the active state of β2AR. We also found that cholesterol plays a role in mediating these interactions. These results shed light on the effect of the lipid composition of one's cell membranes, and by extension one's lipid diet, on the activation behaviour of β2AR, a medically extremely relevant receptor.
  • Saukkoriipi, Mikko (2022)
    Two factors define the success of a deep neural network (DNN) based application: the training data and the model. Nowadays, many state-of-the-art DNN models are available free of charge, and training and deploying these models is easier than ever before. As a result, anyone can set up a state-of-the-art DNN algorithm within days or even hours. In the past, most of the focus has been on the model, as researchers were building faster and more accurate deep learning architectures. These research groups commonly use large and high-quality datasets in their work, which is not the case when one wants to train a new model for a specific use case. Training a DNN algorithm for a specific task requires collecting a vast amount of unlabelled data and then labeling the training data. To train a high-performance model, the labeled training dataset must be large and diverse enough to cover all relevant scenarios of the intended use case. This thesis presents an efficient and straightforward active learning method for sampling the most informative images to train a powerful anchor-free object detector that predicts Intersection over Union (IoU). Our method uses only classification confidences and IoU predictions to estimate image informativeness. By collecting the most informative images, we can cover the whole diversity of the images with fewer human-annotated training images. This saves time and resources, as we avoid labeling images that would not be beneficial.
  • Lehtoranta, Selina (2020)
    This thesis was carried out at the request of a Finnish food and logistics company, whose main goal is to get an answer to the question "Can the supply chain temperature be applied to the receiving temperature measurement required by the delivery customer?" The thesis presents and applies two clustering techniques: k-means clustering and the EM algorithm for Gaussian mixture models. The two clustering techniques are briefly compared to determine which of them is better suited to this kind of study. In addition to the traditional EM-GMM approach, the EM algorithm is applied to Gaussian mixture models with the help of principal component analysis. The research question is also answered using relative change.
  • Huttunen, Mika (2021)
    Predicting the future price formation of a security is of interest both to investors and to market participants who trade actively. By predicting a security's future price formation with sufficient accuracy, a market participant can buy the security before a possible rise in its market price or, when already holding it, hedge their portfolio if there is a risk that the security's market price will fall significantly over time. In my thesis I discuss the application of machine learning to technical analysis. I study whether the future price formation of a market or security can be predicted over the short term with sufficient accuracy on the basis of technical analysis. I explain how securities markets work and go through how technical analysis can be used to try to predict the future relationship between supply and demand in the market under study. I also give background on the technical analysis indicators and machine learning methods used in my own research and present earlier research on the problem. I found that predicting future market price formation is challenging. With the supervised learning methods I used, I did not succeed in producing a model that could predict, for the S&P 500 stock index, whether the market price at the end of the short interval following the time point under consideration is higher than, or at most equal to, the price at that time point. The trained models achieved at best only 50.8 to 51.4% prediction accuracy, whereas a naive classifier that predicts a rise in the market price at the end of every interval achieves 53.0% accuracy. The results I obtained for the wheat futures market were more promising: the trained models achieved at best 51.7 to 52.5% prediction accuracy on the same problem, which exceeded the naive classifier's 50.9% accuracy. I analysed my results and presented possibilities for further research to improve the models.
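The comparison against a naive classifier in the abstract above can be made concrete with a short sketch. The function below computes the accuracy of a classifier that predicts "price higher" for every interval, which is the baseline any trained model has to beat. The function name and the toy price changes are hypothetical and are not taken from the thesis.

```python
def always_up_accuracy(price_changes):
    """Accuracy of a naive classifier that predicts 'higher' for every interval.

    An interval counts as 'higher' only if the price change is strictly
    positive; zero or negative changes fall in the other class
    ('at most equal'), matching the two classes described in the abstract.
    """
    ups = sum(1 for change in price_changes if change > 0)
    return ups / len(price_changes)

# Toy data: two of the four intervals end with a strictly higher price.
baseline = always_up_accuracy([1.5, -0.3, 2.0, 0.0])
```

A trained model is only useful for this task if its accuracy exceeds this baseline on the same intervals, which is why a figure slightly above 50% can still be a negative result.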
  • Lampinen, Sebastian (2022)
    Modeling customer engagement helps a business identify its high-risk and high-potential customers. In a Software-as-a-Service (SaaS) business, these can be defined as the customers with a high potential to churn or to upgrade. Identifying high-risk and high-potential customers in time can help the business retain and grow revenue. This thesis uses churn and upgrade prediction classifiers to define a customer engagement score for a SaaS business. The classifiers used and compared in the research were logistic regression, random forest and XGBoost. The classifiers were trained using data from the case company, containing customer data such as user count and feature usage. To tackle class imbalance, the models were also trained with oversampled training data. The hyperparameters of each classifier were optimised using grid search. After training, the performance of the classifiers was evaluated on test data. In the end, the XGBoost classifiers outperformed the other classifiers in churn prediction. In predicting customer upgrades, the results were more mixed. Feature importances were also calculated, and the results showed that the importances differ for churn and upgrade prediction.
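As an illustration of the oversampling step mentioned in the abstract above, here is a minimal sketch of random oversampling, in which minority-class rows are duplicated until every class is as large as the largest one. The abstract does not say which oversampling technique the thesis used, so random duplication is an assumption, and the function name and toy data are hypothetical.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy training set: four retained customers (0) and one churned customer (1).
train_rows = [[10, 3], [12, 5], [8, 2], [15, 7], [2, 0]]
train_labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
```

Oversampling is applied to the training split only; the test data keeps its original class balance so that the evaluation reflects the real distribution.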
  • Hytönen, Jimi (2022)
    In recent years, significant progress has been made in computer vision regarding object detection and tracking which has allowed the emergence of various applications. These often focus on identifying and tracking people in different environments such as buildings. Detecting people allows us to get a more comprehensive view of people flow as traditional IoT data from elevators cannot track individual people and their trajectories. In this thesis, we concentrate on people detection in elevator lobbies which we can use to improve the efficiency of the elevators and the convenience of the building. We compare the performance and speed of various object detection algorithms. Additionally, we research an edge device's capability to run an object detection model on multiple cameras and whether a single device can cover the target building. We were able to train an object detection algorithm suitable for our application. This allowed accurate people detection that can be used for people counting. We found that out of the three object detection algorithms we trained, YOLOv3 was the only one capable of generalizing to unseen environments, which is essential for general purpose application. The performances of the other two models (SSD and Faster R-CNN) were poor in terms of either accuracy or speed. Based on these, we chose to deploy YOLOv3 to the edge device. We found that the edge device's inference time is linearly dependent on the number of cameras. Therefore, we can conclude that one edge device should be sufficient for our target building, allowing two cameras for each floor. We also demonstrated that the edge device allows easy addition of an object tracking layer, which is required for the solution to work in a real-life office building.
  • Muiruri, Dennis (2021)
    Ubiquitous sensing is transforming our societies and how we interact with our surrounding environment; sensors provide large streams of data while machine learning techniques and artificial intelligence provide the tools needed to generate insights from the data. These developments have taken place in almost every industry sector with topics such as smart cities and smart buildings becoming key topical issues as societies seek more sustainable ways of living. Smart buildings are the main context of this thesis. These are buildings equipped with various sensors used to collect data from the surrounding environment allowing the building to adapt itself and increasing its operational efficiency. Previously, most efforts in realizing smart buildings have focused on energy management and automation where the goal is to improve costs associated with heating, ventilation, and air conditioning. A less studied area involves smart buildings and their indoor environments especially relative to sub-spaces within a building. Increased developments in low-cost sensor technologies have created new opportunities to sense indoor environments in more granular ways that provide new possibilities to model finer attributes of spaces within a building. This thesis focuses on modeling indoor environment data obtained from a multipurpose building that serves primarily as a school. The aim is to explore the quality of the indoor environment relative to regulatory guidelines and also to explore suitable predictive models for thermal comfort and indoor air quality. Additionally, design science methodology is applied in the creation of a proof of concept software system. This system is aimed at demonstrating the use of Web APIs to provide sensor data to clients that may use the data to render analytics among other insights to a building's stakeholders.
Overall, the main technical contributions of this thesis are twofold: (i) a potential web-application design for indoor air quality IoT data and (ii) an exposition of modeling of indoor air quality data based on a variety of sensors and multiple spaces within the same building. Results indicate a software-based tool that supports monitoring the indoor environment of a building would be beneficial in maintaining the correct levels of various indoor parameters. Further, modeling data from different spaces within the building shows a need for heterogeneous models to predict variables in these spaces. This implies parameters used to predict thermal comfort and air quality are different in varying spaces especially where the spaces differ in size, indoor climate control settings, and other attributes such as occupancy control.
  • Noykova, Neli (2022)
    This work focuses on Bayesian hierarchical modeling of the geographical distribution of the marine species Coregonus lavaretus L. s.l. along the Gulf of Bothnia. Spatial dependences are modeled by Gaussian processes. The main modeling objective is to predict the distribution of whitefish larvae at previously unobserved spatial locations along the Gulf of Bothnia. To achieve this objective, we have to solve two main tasks: investigating the sensitivity of the posterior parameter estimates to different parameter priors, and solving the model selection task. In model selection, we have to choose, among all candidate models, the model with the best predictive performance. The candidate models were divided into two main groups: models that describe spatial effects, and models without such a description. The candidates in each group involved a different number (6 or 8) and different expressions of environmental variables. In the group describing spatial effects, we analyzed four different models of the Gaussian mean, and for every mean model we used four different combinations of prior parameters. The same four models of the latent function were used in the candidates where spatial dependences were not described. For every such model we assigned four different priors for the overdispersion parameter. Thus, all in all, 32 candidate models were analyzed. All candidate models were estimated with the Hamiltonian Monte Carlo MCMC algorithm. Model checks were conducted using the posterior predictive distributions. The predictive distributions were evaluated using the logarithmic score with 10-fold cross-validation. The analysis of the posterior estimates in models describing spatial effects revealed that these estimates were very sensitive to the choice of prior parameters. The sensitivity analysis helped us to choose the most suitable combination of priors.
The results from model selection showed that the model with the best predictive performance does not need to be very complicated or to include a description of spatial effects when the data are not informative enough to detect the spatial effects well. Although the selected model was simpler, the corresponding predictive maps of log larvae intensity correctly predicted the larvae distribution along the Gulf of Bothnia.
  • Niska, Päivö (2024)
    This thesis delves into the complex world of multi-model database migration, investigating its theoretical foundations, strategic implementation, and implications for modern data management. The research utilizes a mixed-methods approach, combining quantitative benchmarking tests with qualitative insights from industry practitioners to give a comprehensive knowledge of the migration process. The importance of smart migration techniques, as well as the crucial function of schema mapping in assuring data consistency are highlighted as key results. Success examples from a variety of industries highlight the practical relevance and advantages of multi-model database migration, while implications for theoretical advances and practical issues in organizational contexts are discussed. The strategic implementation framework leads businesses via rigorous project planning, schema mapping, and iterative optimization, stressing the joint efforts of multiple stakeholders. Future concerns include the influence of developing technologies, the dynamic interaction between migration and data security, and industry-specific subtleties impacting migration tactics as the technological environment advances. The synthesis of ideas leads to a common knowledge base, defining the data management strategy discourse. This investigation serves as a road map for informed decision-making, iterative optimization, and continual adaptation in database management, developing a better knowledge of multi-model database migration in the context of modern data ecosystems.
  • Chen, Cheng (2022)
    How to store data is an enduring topic in computer science, and traditional relational databases have done this well and are still widely used today. However, with the growth of non-relational data and the challenges of the big data era, a series of NoSQL databases have come into view. Thus, comparing, evaluating, and choosing a better database has become a worthy topic of research. In this thesis, we design an experiment that stores the same data set and executes the same tasks or workload on relational, graph and multi-model databases. The investigation proposes how to adapt relational data (tables) to a graph database and, conversely, how to store graph data in a relational database. Similarly, the tasks performed are unified across query languages. We conducted exhaustive experiments to compare and report the performance of the three databases. In addition, we propose a workload classification method to analyze the performance of the databases and compare multiple aspects of the databases from an end-user perspective. We selected PostgreSQL, ArangoDB, and Neo4j as representatives. In terms of task execution time, no database wins outright. The results show that relational databases have performance advantages for tasks such as data import, but the execution of multi-table join tasks is slow and graph algorithm support is lacking. The multi-model databases have impressive support for simultaneous storage of multiple data formats and unified language queries, but their performance is not outstanding. The graph database has strong graph algorithm support and intuitive support for a graph query language, but it is also important to consider whether the format and interrelationships of the original data can be converted well into graph format.
  • Gierlach, Mateusz Tadeusz (2020)
    Visual fashion understanding (VFU) is a discipline which aims to solve tasks related to clothing recognition, such as garment categorization, garment attribute prediction, or clothes retrieval, with the use of computer vision algorithms trained on fashion-related data. Having surveyed VFU-related scientific literature, I conclude that, because the same issue of visually understanding garments lies at the heart of all VFU tasks, those tasks are in fact related. I present a hypothesis that building larger multi-task learning models dedicated to predicting multiple VFU tasks at once might lead to better generalization properties of VFU models. I assess the validity of my hypothesis by implementing two deep learning solutions dedicated primarily to category and attribute prediction. The first solution uses the multi-task learning concept of sharing features from an additional branch dedicated to the localization task of predicting landmark positions. The second solution does not share knowledge from the localization branch. Comparison of those two implementations confirmed my hypothesis, as sharing knowledge between tasks increased category prediction accuracy by 53% and attribute prediction recall by 149%. I conclude that multi-task learning improves generalization properties of deep learning-based visual fashion understanding models across tasks.
  • Hätönen, Vili (2020)
    Recently it has been shown that sparse neural networks perform better than dense networks with a similar number of parameters. In addition, large overparameterized networks have been shown to contain sparse networks which, when trained in isolation, reach or exceed the performance of the large model. However, methods to explain the success of sparse networks are still lacking. In this work I study the performance of sparse networks using the network's activation regions and patterns, concepts from the neural network expressivity literature. I define network specialization, a novel concept that considers how distinctly a feed forward neural network (FFNN) has learned to process high level features in the data. I propose the Minimal Blanket Hypervolume (MBH) algorithm to measure the specialization of an FFNN. It finds parts of the input space that the network associates with some user-defined high level feature, and compares their hypervolume to the hypervolume of the input space. My hypothesis is that sparse networks specialize more to high level features than dense networks with the same number of hidden network parameters. Network specialization and MBH also contribute to the interpretability of deep neural networks (DNNs). The capability to learn representations on several levels of abstraction is at the core of deep learning, and MBH enables numerical evaluation of how specialized an FFNN is w.r.t. any abstract concept (a high level feature) that can be embodied in an input. MBH can be applied to FFNNs in any problem domain, e.g. visual object recognition, natural language processing, or speech recognition. It also enables comparison between FFNNs with different architectures, since the metric is calculated in the common input space. I test different pruning and initialization scenarios on the MNIST Digits and Fashion datasets.
I find that sparse networks approximate more complex functions, exploit redundancy in the data, and specialize to high level features better than dense, fully parameterized networks with the same number of hidden network parameters.
  • Li, Yinong (2024)
    This thesis develops a new neural network-based simulation-based inference (SBI) method for performing flexible point estimation; we call this method Neural Amortization of Bayesian Point Estimation (NBPE). Firstly, using neural networks, we can achieve amortized inference so that most of the computation cost is spent on training the neural network while performing inference only costs a few milliseconds. In this thesis, we utilize an encoder-decoder architecture; we use an encoder as a summary network to extract informative features from raw data and then feed them to a decoder as an inference network to output point estimates. Moreover, with a novel training method, the use of a variable $\alpha$ in the loss function $|\theta_i - \theta_{\text{pred}}|^\alpha$ enables the prediction of different statistics (mean, median, mode) of the posterior distribution. Thus, with our method, we can get a fast point estimate at inference time, and to get a different statistic of the posterior we only have to specify the power $\alpha$ of the loss. When $\alpha = 2$, the result is the mean; when $\alpha = 1$, the result is the median; and as $\alpha$ approaches 0, the result approaches the mode. We conducted comprehensive experiments on both toy and simulator models to demonstrate these features. In the first part of the analysis, we focused on testing the accuracy and efficiency of our method, NBPE. We compared it to the established method called Neural Posterior Estimation (NPE) in the BayesFlow SBI software. NBPE performs with competitive accuracy compared to NPE and can perform faster inference than NPE. In the second part of the analysis, we concentrated on the flexible point estimation capabilities of NBPE. We conducted experiments on three conjugate models, since the posterior mean, median, and mode of most of these models have analytical expressions, which makes the analysis more straightforward.
The results show that, at inference time, the choice of $\alpha$ determines which statistic is output, and the results align with our expectations. In summary, in this thesis, we propose a new neural SBI method, NBPE, that can perform fast, accurate, and flexible point estimation, broadening the application of SBI in downstream tasks of Bayesian inference.
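The relationship between the loss exponent and the resulting statistic, as described in the abstract above, can be checked numerically without any neural network. The sketch below takes a small set of samples standing in for posterior draws and finds, by brute-force grid search, the point $c$ that minimizes $\sum_i |x_i - c|^\alpha$. This is only an illustration of the loss itself under assumed toy data; it is not the NBPE training procedure, and the function name, grid search, and sample values are hypothetical.

```python
def l_alpha_minimizer(samples, alpha, grid_size=1001):
    """Return the grid point c minimizing sum(|x - c| ** alpha) over samples.

    The grid spans [min(samples), max(samples)] with grid_size points.
    """
    lo, hi = min(samples), max(samples)
    best_c, best_loss = lo, float("inf")
    for k in range(grid_size):
        c = lo + (hi - lo) * k / (grid_size - 1)
        loss = sum(abs(x - c) ** alpha for x in samples)
        if loss < best_loss:
            best_c, best_loss = c, loss
    return best_c

# Toy "posterior draws": mean 15/7, median 1.0, mode 0.0.
draws = [0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 10.0]

estimate_mean = l_alpha_minimizer(draws, alpha=2.0)    # close to the sample mean
estimate_median = l_alpha_minimizer(draws, alpha=1.0)  # the sample median
estimate_mode = l_alpha_minimizer(draws, alpha=0.1)    # the most frequent value
```

With a finite set of draws, a small exponent favors the value that occurs most often, which mirrors the claim that the estimate approaches the mode as $\alpha$ approaches 0.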
  • Pyykölä, Sara (2022)
    This thesis concerns non-Lambertian surfaces and the challenges they pose, the solutions proposed, and their study in computer vision. The physical theory for understanding the phenomenon is built first, using the Lambertian reflectance model, which defines Lambertian surfaces as ideally diffuse surfaces whose luminance is isotropic and whose luminous intensity obeys Lambert's cosine law. Of these two assumptions, non-Lambertian surfaces violate at least the cosine law and are consequently specularly reflecting surfaces, whose perceived brightness depends on the viewpoint. Thus non-Lambertian surfaces also violate brightness and colour constancy, which assume that the brightness and colour of the same real-world points stay constant across images. These assumptions are used, for example, in tracking and feature matching, and thus non-Lambertian surfaces pose complications for object reconstruction and navigation, among other tasks in the field of computer vision. After formulating the theoretical foundation of the necessary physics and a more general reflectance model called the bi-directional reflectance distribution function, a comprehensive literature review of significant studies regarding non-Lambertian surfaces is conducted. The primary topics of the survey include photometric stereo and navigation systems, while other potential fields, such as fusion methods and illumination invariance, are also considered. The goal of the survey is to formulate a detailed and in-depth answer to what methods can be used to solve the challenges posed by non-Lambertian surfaces, what these methods' strengths and weaknesses are, what datasets are used, and what remains to be answered by further research. After the survey, a dataset is collected and presented, and an outline of another dataset to be published in an upcoming paper is presented. Then a general discussion of the survey and the study is undertaken, and conclusions along with proposed future steps are introduced.
  • Siilasjoki, Niila Johan (2024)
    Machine learning operations (MLOps) is a paradigm at the intersection of machine learning (ML), software engineering, and data engineering. It focuses on the development and operations of software engineering by providing principles, components, and workflows that form the MLOps operational support system (OSS) platform. The growing use of ML, together with increasing data size and model complexity, has created a challenge: MLOps OSS platforms require cloud and high-performance computing environments to achieve flexible and efficient scalability for different workflows. Unfortunately, there are not many open-source solutions that are user-friendly or viable enough to be utilized by an MLOps OSS platform, which is why this thesis proposes a bridge solution utilized by a pipeline to address the problem. We used Design Science Methodology to define the problem, set objectives, design the implementation, demonstrate the implementation, and evaluate the solution. The resulting solutions are an environment bridge called the HTC-HPC bridge and a pipeline called the Cloud-HPC pipeline that uses it. We defined a general model for Cloud-HPC MLOps pipelines to implement the used functions in a use-case-suitable infrastructure ecosystem and MLOps OSS platform using open-source, provided, and self-implemented software. The demonstration and evaluation showed that the HTC-HPC bridge and Cloud-HPC pipeline provide workflow automation that is easy to set up and use, customizable, and scalable, and that can be used for typical ML research workflows. However, they also showed that the bridge needed an improved multi-tenancy design and that the pipeline required templates for a better user experience. These aspects, alongside testing use case potential and finding real-world use cases, are part of future work.
  • Polis, Arturs (2019)
    Recently, a neural network based approach to automatic generation of image descriptions has become popular. Originally introduced as neural image captioning, it refers to a family of models where several neural network components are connected end-to-end to infer the most likely caption given an input image. Neural image captioning models usually comprise a Convolutional Neural Network (CNN) based image encoder and a Recurrent Neural Network (RNN) language model for generating image captions based on the output of the CNN. Generating long image captions – commonly referred to as paragraph captions – is more challenging than producing shorter, sentence-length captions. When generating paragraph captions, the model has more degrees of freedom, due to a larger total number of combinations of possible sentences that can be produced. In this thesis, we describe a combination of two approaches to improve paragraph captioning: using a hierarchical RNN model that adds a top-level RNN to keep track of the sentence context, and using richer visual features obtained from dense captioning networks. In addition to the standard MS-COCO Captions dataset used for image captioning, we also utilize the Stanford-Paragraph dataset specifically designed for paragraph captioning. This thesis describes experiments performed on three variants of RNNs for generating paragraph captions. The flat model uses a non-hierarchical RNN, the hierarchical model implements a two-level, hierarchical RNN, and the hierarchical-coherent model improves the hierarchical model by optimizing the coherence between sentences. In the experiments, the flat model outperforms the published non-hierarchical baseline and reaches similar results to our hierarchical model. The hierarchical model performs similarly to the corresponding published model, thus validating it. 
The hierarchical-coherent model gives us inconclusive results: it outperforms our hierarchical model but does not reach the same scores as the corresponding published model. With our flat model implementation, we have shown that with minor improvements to a simple image captioning model, one can obtain much higher scores on standard metrics than previously reported. However, it is still unclear whether a hierarchical RNN is required to model paragraph captions, or whether a single RNN layer on its own can be powerful enough. Our initial human evaluation indicates that the captions produced by a hierarchical RNN may in fact be more fluent; however, the standard automatic evaluation metrics do not capture this.
  • Niemi, Mikko Olavi (2020)
    Standard machine learning procedures are based on the assumption that training and testing data are sampled independently from identical distributions. Comparative data on traits in biological species break this assumption: data instances are related by ancestry relationships, that is, phylogeny. In this study, new machine learning procedures are presented that take phylogenetic information into account when fitting predictive models. Phylogenetic statistics for classification accuracy and error are proposed based on the concept of effective sample size. Versions of perceptron training and KNN classification are built on these metrics. Procedures for regularised PGLS regression, phylogenetic KNN regression, neural network regression and regression trees are presented. Properties of phylogenetic perceptron training and KNN regression are studied with synthetic data. Experiments demonstrate that phylogenetic perceptron training improves robustness when the phylogeny is unbalanced. Regularised PGLS and KNN regression are applied to mammal dental traits and environments, both to test the algorithms and to gain insights into the relationship between mammal teeth and the environment.