
Browsing by Subject "machine learning"


  • Hommy, Antwan (2024)
    Machine learning (ML) is becoming increasingly important in the telecommunications industry. The purpose of machine learning models in telecommunications is to outperform a classical receiver’s performance by fine-tuning parameters. Since ML models have the advantage of being more concise, their performance is easier to evaluate, in contrast to a classical receiver’s multiple blocks, each with its own small errors. Evaluating these models, however, is challenging, and identifying the correct parameters is not trivial. To address this issue, a coherent and reliable hyperparameter optimization method needs to be introduced. This thesis investigates how a hyperparameter optimization method can be implemented and which one is best suited for the problem. It looks into the value such a method provides, the metrics displayed for each hyperparameter set during training and inference, and the challenges of realising such a system, in addition to various other qualities needed for an efficient training stage. The framework aims to provide valuable insight into model accuracy, validation loss, computing cost, signal-to-noise ratio improvement, and available resources when using hyperparameter tuning. The framework uses grid search optimization, Bayesian optimization, and genetic algorithm optimization to determine which performs best, and compares the results between them. Grid search acts as a reference baseline for the performance of the other two algorithms. The thesis is split into two parts: Phase One implements the system in a sandbox-like manner, essentially acting as a testing platform to assess implementation compatibility, while Phase Two inspects a more realistic scenario suited to a 5G physical layer environment. The proposed framework uses modern, widely used orchestration and development tools, including ResNet, PyTorch, and scikit-learn.
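Of the three tuning strategies this abstract names, grid search, the reference baseline, is the simplest to illustrate. The sketch below is a generic, minimal grid search in plain Python, not code from the thesis; the `toy_loss` objective and its parameters (`lr`, `width`) are invented stand-ins for the validation loss of an ML receiver model.

```python
import itertools

def grid_search(objective, param_grid):
    """Exhaustively evaluate every parameter combination; return the best one."""
    best_params, best_score = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        score = objective(params)  # e.g. validation loss from a training run
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Invented stand-in for the validation loss of an ML receiver model.
def toy_loss(p):
    return (p["lr"] - 0.01) ** 2 + (p["width"] - 64) ** 2 / 1e4

grid = {"lr": [0.001, 0.01, 0.1], "width": [32, 64, 128]}
best, loss = grid_search(toy_loss, grid)
```

Bayesian and genetic-algorithm optimization replace the exhaustive loop with guided sampling, which is why grid search serves as the baseline: it is unbiased but scales exponentially in the number of hyperparameters.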
  • Korhonen, Teo Ilmari (2022)
    Flares are short, high-energy magnetic events on stars, including the Sun. Observations of young stars and red dwarfs regularly show the occurrence of flare events multiple orders of magnitude more energetic than even the fiercest solar storms ever recorded. As our technology remains vulnerable to disruptions due to space weather, the study of flares and other stellar magnetic activity is crucial. Until recently, the detection of extrasolar flares has required much manual work and observation resources. This work presents a mostly automatic pipeline to detect and estimate the energies of extrasolar flare events from optical light curves. To model and remove the star's background radiation in spite of complex periodicity, short windows of nonlinear support vector regression are used to form a multi-model consensus. Outliers above the background are flagged as likely flare events, and a template model is fitted to the flux residual to estimate the energy. This approach is tested on light curves collected from the stars AB Doradus and EK Draconis by the Transiting Exoplanet Survey Satellite, and dozens of flare events are found. The results are consistent with recent literature, and the method is generalizable for further observations with different telescopes and different stars. Challenges remain regarding edge cases, uncertainties, and reliance on user input.
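The detection idea described here, estimating the local background and flagging points that rise far above it, can be sketched generically. The thesis fits windowed support-vector regression models to form a consensus background; the illustration below substitutes a simple rolling median and an n-sigma rule, and the light-curve values are invented.

```python
import statistics

def flag_flares(flux, window=5, n_sigma=3.0):
    """Flag indices whose flux sits well above a local background estimate.

    A rolling median stands in for the windowed SVR consensus used in the
    thesis; points more than n_sigma local deviations above it are flagged.
    """
    flagged = []
    for i in range(len(flux)):
        lo, hi = max(0, i - window), min(len(flux), i + window + 1)
        neighbourhood = flux[lo:i] + flux[i + 1:hi]
        background = statistics.median(neighbourhood)
        spread = statistics.pstdev(neighbourhood) or 1e-12
        if flux[i] - background > n_sigma * spread:
            flagged.append(i)
    return flagged

quiet = [1.0, 1.01, 0.99, 1.0, 1.02, 1.0, 0.98, 1.0]
curve = quiet[:4] + [1.8] + quiet[4:]  # inject one flare-like spike
```

In the real pipeline a template flare model is then fitted to the flux residual at each flagged index to estimate the event energy.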
  • Kurki, Lauri (2021)
    Atomic force microscopy (AFM) is a widely utilized characterization method capable of capturing atomic-level detail in individual organic molecules. However, an AFM image contains relatively little information about the deeper atoms in a molecule, and thus the interpretation of AFM images of non-planar molecules poses significant challenges for human experts. An end-to-end solution starting from an AFM imaging system and ending in an automated image interpreter would be a valuable asset for all research utilizing AFM. Machine learning has become a ubiquitous tool in all areas of science. Artificial neural networks (ANNs), a specific machine learning tool, have also arisen as a popular method in many fields, including medical imaging, self-driving cars, and facial recognition systems. In recent years, progress towards interpreting AFM images of more complicated samples has been made utilizing ANNs. In this thesis, we aim to predict sample structures from AFM images by modeling the molecule as a graph and using a generative model to build the molecular structure atom-by-atom and bond-by-bond. The generative model uses two types of ANNs: a convolutional attention mechanism to process the AFM images and a graph neural network to process the generated molecule. The model is trained and tested using simulated AFM images. The results of the thesis show that the model has the capability to learn even slight details from complicated AFM images, especially when the model only adds a single atom to the molecule. However, there are challenges to overcome in the generative model before it can become part of a fully capable end-to-end AFM process.
  • Gu, Chunhao (2021)
    Along with the rapid scale-up of biological knowledge bases, mechanistic models, especially metabolic network models, are becoming more accurate. At the same time, machine learning has been widely applied in biomedical research as large amounts of omics data have become available in recent years. It is therefore worthwhile to study the integration of metabolic network models and machine learning, as the combination may lead to biological discoveries. In 2019, MIT researchers proposed an approach called 'White-Box Machine Learning': they used fluxomics data derived from in silico simulation of a genome-scale metabolic (GEM) model together with experimental antibiotic lethality measurements (IC50 values) of E. coli under hundreds of screening conditions to train a linear regression-based machine learning model, and they extracted the model's coefficients to uncover metabolic mechanisms involved in antibiotic lethality. In this thesis, we propose a new approach based on the 'White-Box Machine Learning' framework. We replace the GEM model with another state-of-the-art metabolic network model, the expression and thermodynamics flux (ETFL) formulation, and the linear regression-based machine learning model with a novel nonlinear regression model, the multi-task elastic net multilayer perceptron (MTENMLP). We apply the approach to the same experimental antibiotic lethality measurements (IC50 values) of E. coli from the 'White-Box Machine Learning' study. Finally, we validate their conclusions and make some new discoveries. In particular, our results show that ppGpp metabolism is active under antibiotic stress, which is supported by the literature. This implies that our approach has the potential to make biological discoveries even when a possible conclusion is not known in advance.
  • Zhao, Zhanghu (2024)
    Atmospheric new-particle formation (NPF) plays a crucial role in generating climate-influencing aerosol particles. Direct observation of NPF is achievable by tracking the evolution of aerosol particle size distributions in the environment. Such analysis allows researchers to determine the occurrence of NPF on specific days. Currently, the most dependable method for categorizing days into NPF event (class Ia, class Ib, class II) or non-event categories relies on manual visual analysis. However, this manual process is labor-intensive and subjective, particularly with long-term data series. These issues underscore the need for an automated classification system to classify these days more objectively. This thesis introduces feature-engineering-based machine learning classifiers to discern NPF event and non-event days at the SMEAR II station in Hyytiälä, Finland. The classification utilizes a suite of informative features derived from the multi-modal log-normal distribution fitted to the aerosol particle concentration data and from time series analysis at various scales. The proposed machine learning classifiers achieve an accuracy of more than 90% in identifying NPF event and non-event days, and an accuracy of around 80% in further categorizing days into the detailed subcategories class Ia, class Ib, class II, and non-event. Notably, the classifiers reliably predict all event Ia days, for which particle growth and formation rates are confidently measurable. A comparative analysis is also conducted between feature-engineering machine learning methods and image-based deep learning in terms of time efficiency and overall performance. The conclusion drawn is that, through reasonable feature engineering, machine learning methods can match or even surpass deep learning approaches, particularly in scenarios where time efficiency is paramount. The results of this study strongly support further investigation into this area to improve our knowledge and proficiency in automating NPF event detection.
  • Grönroos, Sonja (2021)
    Several nuclear power plants in the European Union are approaching the ends of their originally planned lifetimes. Extensions to the lifetimes are made to secure the supply of nuclear power in the coming decades. To ensure the safe long-term operation of a nuclear power plant, the neutron-induced embrittlement of the reactor pressure vessel (RPV) must be assessed periodically. The embrittlement of RPV steel alloys is determined by measuring the ductile-to-brittle transition temperature (DBTT) and upper-shelf energy (USE) of the material. Traditionally, a destructive Charpy impact test is used to determine the DBTT and USE. This thesis contributes to the NOMAD project. The goal of the NOMAD project is to develop a tool that uses nondestructively measured parameters to estimate the DBTT and USE of RPV steel alloys. The NOMAD Database combines data measured using six nondestructive methods with destructively measured DBTT and USE data. Several non-irradiated and irradiated samples made out of four different steel alloys have been measured. As nondestructively measured parameters do not directly describe material embrittlement, their relationship with the DBTT and USE needs to be determined. A machine learning regression algorithm can be used to build a model that describes the relationship. In this thesis, six models are built using six different algorithms, and their use is studied in predicting the DBTT and USE based on the nondestructively measured parameters in the NOMAD Database. The models estimate the embrittlement with sufficient accuracy. All models predict the DBTT and USE based on unseen input data with mean absolute errors of approximately 20 °C and 10 J, respectively. Two of the models can be used to evaluate the importance of the nondestructively measured parameters. In the future, machine learning algorithms could be used to build a tool that uses nondestructively measured parameters to estimate the neutron-induced embrittlement of RPVs on site.
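The core regression task here, mapping a nondestructively measured parameter to the DBTT, can be illustrated with the simplest possible model. The sketch below fits an ordinary least-squares line and reports the mean absolute error metric the abstract quotes; the parameter values and DBTT figures are synthetic, not NOMAD data, and a single input variable stands in for the six-method measurement suite.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ a*x + b for one input parameter."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mean_absolute_error(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

# Synthetic example: a hypothetical nondestructive parameter vs. DBTT (deg C).
param = [10.0, 20.0, 30.0, 40.0]
dbtt = [-80.0, -60.0, -40.0, -20.0]
a, b = fit_line(param, dbtt)
mae = mean_absolute_error(dbtt, [a * x + b for x in param])
```

The six algorithms compared in the thesis generalise this idea to nonlinear, multi-variable regressors, but the evaluation logic, MAE on held-out data, is the same.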
  • Makkonen, Eetu Petter (2024)
    Vacuum breakdown is a limiting factor in the design of powerful and cost-efficient particle accelerators. Modern models have suggested that the rate of breakdowns is driven by dislocation dynamics in the electrode materials suffering from breakdowns. In order to understand why specifically the copper-2wt%beryllium alloy outperforms other electrode materials in vacuum breakdown rate and maximum electric fields in breakdown experiments at CERN, a new machine-learning interatomic potential (ML-IAP) for the CuBe alloy was developed. Density functional theory (DFT) was used to calculate a dataset of atomic forces, energies, and virials for a set of CuBe structures. A Gaussian process regression fit was performed on this dataset, producing an IAP with close-to-DFT accuracy in its intended use cases. With the developed IAP, the interactions between single interstitial beryllium atoms and edge dislocations in a face-centered cubic (FCC) copper matrix were studied with molecular dynamics (MD). It was found that beryllium atoms bind to the edge dislocations, inhibiting their mobility under shear stress. Furthermore, beryllium atoms were found to increase the intrinsic stacking fault energy of FCC copper, possibly leading to an increase in dislocation mobility. These two findings suggest that beryllium atoms could increase copper's resistance to vacuum breakdown mainly by trapping dislocations. Future studies could look at how precipitates of beryllium, or other alloys of copper, play a role in dislocation dynamics.
  • Lampinen, Sebastian (2022)
    Modeling customer engagement assists a business in identifying high-risk and high-potential customers. One way to define high-risk and high-potential customers in a Software-as-a-Service (SaaS) business is as customers with a high potential to churn or upgrade. Identifying these customers in time can help the business retain and grow revenue. This thesis uses churn and upgrade prediction classifiers to define a customer engagement score for a SaaS business. The classifiers used and compared in the research were logistic regression, random forest, and XGBoost. The classifiers were trained using data from the case company containing customer data such as user count and feature usage. To tackle class imbalance, the models were also trained with oversampled training data. The hyperparameters of each classifier were optimised using grid search. After training the models, the performance of the classifiers on test data was evaluated. In the end, the XGBoost classifiers outperformed the other classifiers in churn prediction, while in predicting customer upgrades the results were more mixed. Feature importances were also calculated, and the results showed that the importances differ for churn and upgrade prediction.
  • Laakso, Joosua (2023)
    Semantic segmentation is a computer vision problem of partitioning an image based on what type of object each part represents, with pixel-level precision. Producing labeled datasets to train deep learning models for semantic segmentation can be laborious due to the demand for pixel-level precision. On the other hand, a deep learning model trained on one dataset might have inferior performance when applied to another dataset, depending on how different those datasets are. Unsupervised domain adaptation attempts to narrow this performance gap by adapting the model to the other dataset, even if ground-truth labels for that dataset are not available. In this work, we review some of the pre-existing methods for unsupervised domain adaptation in semantic segmentation. We then present our own efforts to develop novel methods for the problem. These include a new type of loss function for unsupervised output shaping, unsupervised training of the model backbone based on feature statistics, and a method for unsupervised adaptation of the model backbone using an auxiliary network that attempts to mimic the gradients of supervised training. We present empirical results of the performance of these methods. We additionally present our findings on the effects of changes in the statistics of the batch normalization layers on domain adaptation performance.
  • Niemi, Mikko Olavi (2020)
    Standard machine learning procedures are based on the assumption that training and testing data are sampled independently from identical distributions. Comparative data on traits in biological species breaks this assumption: data instances are related by ancestry relationships, that is, by phylogeny. In this study, new machine learning procedures are presented that take phylogenetic information into account when fitting predictive models. Phylogenetic statistics for classification accuracy and error are proposed based on the concept of effective sample size, and versions of perceptron training and KNN classification are built on these metrics. Procedures for regularised PGLS regression, phylogenetic KNN regression, neural network regression, and regression trees are presented. The properties of phylogenetic perceptron training and KNN regression are studied with synthetic data. Experiments demonstrate that phylogenetic perceptron training improves robustness when the phylogeny is unbalanced. Regularised PGLS and KNN regression are applied to mammal dental traits and environments, both to test the algorithms and to gain insight into the relationship between mammal teeth and the environment.
  • Keningi, Eino (2022)
    In little over a decade, cryptocurrencies have become a highly speculative asset class in global financial markets, with Bitcoin leading the way. Throughout its relatively brief history, the price of bitcoin has gone through multiple cycles of growth and decline. As a consequence, Bitcoin has become a widely discussed – and polarizing – topic on Twitter. This work studies whether the sentiment of popular Bitcoin-related tweets can be used to predict the future price movements of bitcoin. In total, seven different algorithms are evaluated: Vector Autoregression, Vector Autoregression Moving-Average, Random Forest, XGBoost, LightGBM, Long Short-Term Memory, and Gated Recurrent Unit. By applying lexicon-based sentiment analysis and heuristic filtering of tweets, it was discovered that sentiment-based features of popular tweets improve the prediction accuracy over baseline features (open-high-low-close data) in five of the seven algorithms tested. The tree-based algorithms (Random Forest, XGBoost, LightGBM) generally had the lowest prediction errors, while the neural network algorithms (Long Short-Term Memory and Gated Recurrent Unit) had the poorest performance. The findings suggest that the sentiment of popular Bitcoin-related tweets can be an important feature in predicting the future price movements of bitcoin.
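The lexicon-based sentiment step can be illustrated with a minimal example. The word lists below are invented stand-ins, not the lexicon used in the thesis; real lexicons are far larger and typically assign graded weights rather than binary membership.

```python
# Invented toy lexicons; real sentiment lexicons are weighted and much larger.
BULLISH = {"moon", "bullish", "buy", "rally", "surge"}
BEARISH = {"crash", "bearish", "sell", "dump", "fear"}

def tweet_sentiment(text):
    """Crude lexicon score in [-1, 1]: (pos - neg) / total matched words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in BULLISH for w in words)
    neg = sum(w in BEARISH for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)
```

Scores like these, aggregated over popular tweets per time window, become the sentiment features fed alongside open-high-low-close data into the prediction models.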
  • Tyree, Juniper (2023)
    Response Surface Models (RSM) are cheap, reduced-complexity, and usually statistical models that are fit to the response of more complex models to approximate their outputs with higher computational efficiency. In atmospheric science, there has been a continuous push to reduce the amount of training data required to fit an RSM. With this reduction in costly data gathering, RSMs can be used more ad hoc and quickly adapted to new applications. However, with the decrease in diverse training data, the risk increases that the RSM is eventually used on inputs on which it cannot make a prediction. If there is no indication from the model that its outputs can no longer be trusted, trust in the entire RSM decreases. We present a framework for building prudent RSMs that always output predictions with confidence and uncertainty estimates. We show how confidence and uncertainty can be propagated through downstream analysis such that even predictions on inputs outside the training domain or in areas of high variance can be integrated. Specifically, we introduce the Icarus RSM architecture, which combines an out-of-distribution detector, a prediction model, and an uncertainty quantifier. Icarus-produced predictions and their uncertainties are conditioned on the confidence that the inputs come from the same distribution that the RSM was trained on. We put particular focus on exploring out-of-distribution detection, for which we conduct a broad literature review, design an intuitive evaluation procedure with three easily visualisable toy examples, and suggest two methodological improvements. We also explore and evaluate popular prediction models and uncertainty quantifiers. We use the one-dimensional atmospheric chemistry transport model SOSAA as an example of a complex model for this thesis. We produce a dataset of model inputs and outputs from simulations of the atmospheric conditions along air parcel trajectories that arrived at the SMEAR II measurement station in Hyytiälä, Finland, in May 2018. We evaluate several prediction models and uncertainty quantification methods on this dataset and construct a proof-of-concept SOSAA RSM using the Icarus RSM architecture. The SOSAA RSM is built on pairwise-difference regression using random forests and an auto-associative out-of-distribution detector with a confidence scorer, which is trained with both the original training inputs and new synthetic out-of-distribution samples. We also design a graphical user interface to configure the SOSAA model and trial the SOSAA RSM. We provide recommendations for out-of-distribution detection, prediction models, and uncertainty quantification based on our exploration of these three systems. We also stress-test the proof-of-concept SOSAA RSM implementation to reveal its limitations for predicting model perturbation outputs and show directions for valuable future research. Finally, our experiments affirm the importance of reporting predictions alongside well-calibrated confidence scores and uncertainty levels so that the predictions can be used with confidence and certainty in scientific research applications.
  • Nikkari, Eeva (2017)
    The sentence segmentation task is the task of segmenting a text corpus into sentences. Segmenting well-structured and fully punctuated data into sentences is not a very difficult problem. However, when the data is poorly structured or missing punctuation, the task is more difficult. This thesis looks into this problem using probabilistic language modeling, with special emphasis on the n-gram model. We present theory related to language models and their evaluation, as well as empirical results achieved on documents provided by AlphaSense Oy and on the freely available Reuters-21578 corpus. The experiments on n-gram models focused on the following questions. How do the smoothing and the order of the n-gram affect the model? How well does a model trained on one type of data adapt to another type of text? How does retaining more or fewer symbols and punctuation affect the performance? And how much training data is enough for the model? The n-gram models performed rather well on the same type of data they were trained on. However, the performance was significantly worse when moving to another document type, and in the absence of punctuation the performance of the model was also rather poor. The conclusion is that the n-gram model seems inadequate for recovering sentence boundaries in difficult settings, such as separating an unpunctuated title from the body of the text.
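The bigram case of the n-gram model can be sketched directly. Below, a sentence boundary is treated as an ordinary token `<s>`, and add-one (Laplace) smoothing stands in for the more refined smoothing schemes such a study would compare; the toy corpus is invented.

```python
from collections import Counter

def train_bigram(tokens):
    """Bigram model over a token stream where '<s>' marks sentence boundaries."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)

    def prob(prev, word):
        # Add-one (Laplace) smoothing: the simplest smoothing scheme.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

    return prob

corpus = "the cat sat <s> the dog sat <s> the cat ran".split()
prob = train_bigram(corpus)
# Segmentation then hypothesises a boundary after tokens where
# P('<s>' | token) is high relative to the alternatives.
```

With this corpus, `prob("sat", "<s>")` exceeds `prob("cat", "<s>")`, so the model would place boundaries after "sat" rather than after "cat", which is exactly the decision rule an n-gram segmenter applies at every position.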
  • Rosenberg, Otto (2023)
    Bayesian networks (BN) are models that map the mutual dependencies and independencies between a set of variables. The structure of the model can be represented as a directed acyclic graph (DAG), a graph where the nodes represent variables and the directed edges between variables represent dependencies. BNs can be either constructed using knowledge of the system or derived computationally from observational data. Traditionally, BN structure discovery from observational data has been done through heuristic algorithms, but advances in deep learning have made it possible to train neural networks for this task in a supervised manner. This thesis provides an overview of BN structure discovery and discusses the strengths and weaknesses of the emerging supervised paradigm. One supervised method, the EQ-model, which uses equivariant neural networks for structure discovery, is explored in further detail with empirical tests. Through hyperparameter optimisation and a move to online training, the performance of the EQ-model is increased. The EQ-model is still observed to underperform in comparison to a competing score-based method, NOTEARS, but offers convenient features, such as a dramatically faster runtime, that compensate for the reduced performance. Several interesting lines of further study that could improve the performance of the EQ-model are also identified.
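A structural constraint shared by all of these structure-discovery methods is that the learned graph must be acyclic. A minimal acyclicity check, using Kahn's topological-sort algorithm over an adjacency matrix, might look like this (a generic illustration, not code from the thesis):

```python
def is_dag(adj):
    """Kahn's algorithm: a directed graph is acyclic iff every node can be
    removed in topological order (adj[i][j] == 1 means an edge i -> j)."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while queue:
        u = queue.pop()
        removed += 1
        for v in range(n):
            if adj[u][v]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
    return removed == n

chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # A -> B -> C: a valid BN structure
cycle = [[0, 1], [1, 0]]                   # A <-> B: not a DAG
```

Score-based methods such as NOTEARS famously replace this discrete check with a smooth, differentiable acyclicity penalty so the search can use continuous optimisation.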
  • Aino, Kaltiainen (2024)
    The planetary boundary layer (PBL) is the layer of the atmosphere directly influenced by the presence of Earth's surface. In addition to its importance to the weather and climate systems, it plays a significant role in controlling air pollution levels and low-level heat conditions, thereby directly influencing general well-being. While the modification of boundary layer conditions by varying atmospheric forcings has been widely studied and discussed, the dominant states of PBL variation in response to this modification remain unknown. In this study, the dominant daytime and nighttime boundary layer types are examined. To understand the factors contributing to the development of these layers, weather regimes in the northern Atlantic-European region are considered. Machine learning techniques are utilized to study both the boundary layer and the large-scale flow classes, with an emphasis on unsupervised learning methods. It was found that the boundary layers in Helsinki, Finland, can be categorized into four daytime and three nighttime boundary layer types, each characterized by the dominant turbulence production mechanism or the absence thereof. During the daytime, layers driven by both mechanical and buoyant turbulence are observed in summer, autumn, and spring, while purely buoyancy-driven layers occur in summer and winter, and purely mechanically driven layers emerge in autumn, winter, and spring. Additionally, a layer characterized by overall reduced turbulence production is present throughout all seasons. During the nighttime, all three boundary layer types (buoyancy-driven, mechanically driven, and stable) are observed in all seasons. Each boundary layer type exhibits season-specific variations, whereas daytime and nighttime boundary layers driven by the same mechanisms reflect the diurnal cycle of their relative intensities. The analysis revealed that the weather regimes producing cyclonic and anticyclonic flow anomalies over southern Finland collectively influence the boundary layer conditions, whereas the impact of individual weather regimes remains relatively small. Large-scale flow variation is associated with changes in boundary layer dynamics through alterations in the surface radiation budget (cloudiness) and wind conditions, thereby influencing the relative intensities of mechanical and buoyant turbulence production. However, inconsistencies in the analysis suggest that additional mechanisms, such as mesoscale phenomena, must also contribute to the development of the observed boundary layer types.
  • Kivimäki, Juhani (2022)
    In this thesis, we give an overview of current methodology in the field of uncertainty estimation in machine learning, with a focus on confidence scores and their calibration. We also present a case study in which we propose a novel method to improve the uncertainty estimates of an in-production machine learning model operating in an industrial setting with real-life data. This model is used by the Finnish company Basware to extract information from invoices in the form of machine-readable PDFs. The solution we propose is shown to produce confidence estimates that outperform the legacy estimates on several relevant metrics, increasing the coverage of automated invoices from 65.6% to 73.2% with no increase in error rate.
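The coverage-versus-error trade-off quoted at the end can be made concrete: choosing a confidence threshold determines which predictions are automated and what error rate is incurred among them. The sketch below is a generic illustration with invented confidence scores, not Basware's pipeline.

```python
def coverage_at_threshold(confidences, correct, threshold):
    """Automate only predictions whose confidence clears the threshold;
    return (fraction automated, error rate among automated predictions)."""
    taken = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(taken) / len(correct)
    error = 0.0 if not taken else 1 - sum(taken) / len(taken)
    return coverage, error

# Invented scores: 1 marks a correct extraction, 0 an incorrect one.
conf = [0.9, 0.8, 0.6, 0.4]
ok = [1, 1, 0, 1]
```

Better-calibrated confidence scores shift this trade-off: at the same error budget, more predictions clear the threshold, which is exactly the coverage gain the case study reports.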
  • Kari, Daniel (2020)
    Estimating the effect of random chance (’luck’) has long been a question of particular interest in various team sports. In this thesis, we aim to determine the role of luck in a single ice hockey game by building a model to predict the outcome based on the course of events in the game. The obtained prediction accuracy should also, to some extent, reveal the effect of random chance. Using the course of events from over 10,000 games, we train feedforward and convolutional neural networks to predict the outcome and the final goal differential, which has been proposed as a more informative proxy for the outcome. Interestingly, we are not able to obtain distinctively higher accuracy than previous studies, which have focused on predicting the outcome with information available before the game. The results suggest that there might exist an upper bound for prediction accuracy even if we knew ’everything’ that went on in a game. This further implies that random chance could affect the outcome of a game, although assessing this is difficult, as we do not have a good quantitative metric for luck in the case of single ice hockey game prediction.
  • Niinikoski, Eerik (2020)
    The aim of this thesis is to predict the total career racing performance of Finnish trotter horses by using the trotters' early career racing performance and other early career variables. The thesis presents a brief introduction to harness racing and the horses used in Finnish trotting sport. The data is presented and modified for prediction, with descriptive statistics in tables and visuals. The machine learning method of random forests for regression is introduced and used in the predictions. After training the model, the thesis presents the prediction accuracy and the variables of importance for the predictions of total career racing performance for both the Finnhorse trotter and the Finnish Standardbred trotter populations. Finally, the writer discusses the shortcomings and possible improvements for future research. The data for this thesis was provided by the Finnish trotting and breeding association (Suomen Hippos ry) and included all information on harness races raced in Finland from 1984 to the end of 2019. From almost three million rows, the data was summarised into a table of 46,704 trotters that started their careers in one of the three earliest allowed age groups. A total of 37 independent variables were used to predict three outcomes (total career earnings, total number of career starts, and total number of career first placings) as separate models. The predictors are derived from other studies that estimate the environmental and genetic factors of the racing performance of a trotter. The three models performed poorly to moderately, with total earnings having the highest prediction accuracy. The model predicted larger amounts of earnings quite well, but was prone to predicting some earnings when there were in fact none. The prediction accuracy for the total number of starts was poor, especially when the true number of starts was low. The model that predicted the total number of career first placings performed the worst. This can partially be explained by the fact that winning is a rare event for a trotter in general. The models fit better for Finnish Standardbred trotters than for Finnhorse trotters. This thesis works as a good basis for future similar research in which massive amounts of data and machine learning are used to predict a trotter's career, racing performance, or other factors. The results show that predicting total career racing performance as a classification problem could be a better fit than regression. Adequate classes, as well as possibly better predictors and suitable imputation of missing values, should be determined in consultation with an audience with superior knowledge of harness racing.