Browsing by study line "ei opintosuuntaa"

Now showing items 1-20 of 233

Adversarial Robustness of Hybrid Machine Learning Architecture for Malware Classification

Trizna, Dmitrijs (2022)

The detection heuristic in contemporary machine learning Windows malware classifiers is typically based on the static properties of the sample. In contrast, simultaneous utilization of static and behavioral telemetry is vaguely explored. We propose a hybrid model that employs dynamic malware analysis techniques, contextual information as an executable filesystem path on the system, and static representations used in modern state-of-the-art detectors. It does not require an operating system virtualization platform. Instead, it relies on kernel emulation for dynamic analysis. Our model reports enhanced detection heuristic and identify malicious samples, even if none of the separate models express high confidence in categorizing the file as malevolent. For instance, given the $0.05\%$ false positive rate, individual static, dynamic, and contextual model detection rates are $18.04\%$, $37.20\%$, and $15.66\%$. However, we show that composite processing of all three achieves a detection rate of $96.54\%$, above the cumulative performance of individual components. Moreover, simultaneous use of distinct malware analysis techniques address independent unit weaknesses, minimizing false positives and increasing adversarial robustness. Our experiments show a decrease in contemporary adversarial attack evasion rates from $26.06\%$ to $0.35\%$ when behavioral and contextual representations of sample are employed in detection heuristic.
Ajankohtaista biogeenisten amiinien määrittämisestä viineistä 2000-luvulla

Sachan, Sinivuokko (2021)

At low concentrations, biogenic amines (BA) promote natural physiological activity, but at higher concentrations they can cause a wide variety of health hazards, especially for more sensitive individuals. The BA determination in wine is challenging due to the variation in physicochemical properties and the potential matrix effects of other compounds in the sample. It is important to develop efficient sample purification methods to minimize matrix interference. Derivatization is required for most biogenic amines due to the absence of chromophores. The conditions that promote the origin or formation of biogenic amines in wines are not yet fully understood, as many factors contribute to their formation. The main sources or stages of BA formation during wine-making should be identified in order to reduce BA levels by corrective measures. Currently, the analytical community is striving for more environmentally friendly methods. The literature review examines methods for determination of biogenic amines in wines from 2005 until 2020. The methods are high-performance liquid chromatography, ultra-high-performance liquid chromatography, high-temperature liquid chromatography, nano-liquid chromatography, micellar liquid chromatography, capillary electrophoresis, micromachined capillary electrophoresis, gas chromatography, immunoassay, sensor, colorimetric method, thin-layer chromatography and ion chromatography. The health disadvantages of biogenic amines and the problem areas associated with their determination from a complex wine matrix, such as matrix effect and derivatization, are also surveyed. In addition, changes in the BA profile during different stages of winemaking and storage, as well as the effect of the grape variety and lactic acid bacterial strain on the BA profile, are surveyed. Validation determines the suitability of a method for its intended use. In the methods for determining the literature review, measurement uncertainty - possibly the most important validation parameter - had not been determined in any of the validations. The aim of the research project was to obtain a functional and validated method for the determination of biogenic amines in wines for the Alcohol Control Laboratory at Alko Inc. In the method tested, histamine, tyramine, putrescine, cadaverine, phenylethylamine and isoamylamine derivatized with diethyl ethoxymethylene malonate were determined by high-performance liquid chromatography and diode array detector. The method was not sufficiently reliable, so a competitive enzyme-linked immunosorbent assay for the determination of histamine in wines was introduced, which provided a useful method for the Alcohol Control Laboratory. The validation determined specificity/selectivity, recovery, repeatability, systematic error, estimation of random error, measurement uncertainty, expanded measurement uncertainty, limit of detection and limit of quantification. The European Food Safety Authority has confirmed histamine and tyramine as the most toxic amines. The International Organization of Vine and Wine has not set legal limits for BA levels, but some European countries have had recommended maximum levels for histamine. Many wine importers in the European Union require a BA analysis even in the absence of regulations. Based on the literature review, high BA levels were found in the wines under study, including levels of histamine, tyramine, and phenylethylamine that exceeded the toxicity limits. Some wines had biogenic amines below the detection limit, so the production of low-amine wines is possible. In addition, certain strains of lactic acid bacteria were found to significantly reduce the BA levels in wine. High-performance liquid chromatography is the most widely used determination method. An increasing trend is to develop simpler methods such as the portable sensor-based method.
A machine learning approach for customer segmentation

Shappo, Viacheslav (2022)

The primary concern of the companies working with many customers is proper customer segmentation, i.e., division of the customers into different groups based on their common characteristics. Customer segmentation helps marketing specialists to adjust their offers and reach potential customer groups interested in a specific type of product or service. In addition, knowing such customer segments may help search for new look-alike customers sharing similar characteristics. The first and most crucial segmentation is splitting the customers into B2B (business to business) and B2C (business to consumers). The next step is to analyze these groups properly and create more through product-specific groups. Nowadays, machine learning plays a vital role in customer segmentation. This is because various classification algorithms can see more patterns in customer characteristics and create more tailored customer segmentations than a human can. Therefore, utilizing machine learning approaches in customer segmentation may help companies save their costs on marketing campaigns and increase their sales by targeting the correct customers. This thesis aims to analyze B2B customers potentially interested in renewable diesel "Neste MY" and create a classification model for such segmentation. The first part of the thesis is focused on the theoretical background of customer segmentation and its use in marketing. Firstly, the thesis introduces general information about Neste as a company and discusses the marketing stages that involve the customer segmentation approach. Secondly, the data features used in the study are presented. Then the methodological part of the thesis is introduced, and the performance of three selected algorithms is evaluated on the test data. Finally, the study's findings and future means of improvement are discussed. The significant finding of the study is that finely selected features may significantly improve model performance while saving computational power. Several important features are selected as the most crucial customer characteristics that the marketing department afterward uses for future customer segmentations.
A method for estimating regression errors with application to virtual concept drift detection

Tiittanen, Henri (2019)

Estimating the error level of models is an important task in machine learning. If the data used is independent and identically distributed, as is usually assumed, there exist standard methods to estimate the error level. However, if the data distribution changes, i.e., a phenomenon known as concept drift occurs, those methods may not work properly anymore. Most existing methods for detecting concept drift focus on the case in which the ground truth values are immediately known. In practice, that is often not the case. Even when the ground truth is unknown, a certain type of concept drift called virtual concept drift can be detected. In this thesis we present a method called drifter for estimating the error level of arbitrary regres- sion functions when the ground truth is not known. Concept drift detection is a straightforward application of error level estimation. Error level based concept drift detection can be more useful than traditional approaches based on direct distribution comparison, since only changes that affect the error level are detected. In this work we describe the drifter algorithm in detail, including its theoretical basis, and present an experimental evaluation of its performance in virtual concept drift detection on multiple datasets consisting of both synthetic and real-world datasets and multiple regression functions. Our experi- ments show that the drifter algorithm can be used to detect virtual concept drift with a reasonable accuracy.
Amortized Bayesian inference of Gaussian process hyperparameters

Rehn, Aki (2022)

The application of Gaussian processes (GPs) is limited by the rather slow process of optimizing the hyperparameters of a GP kernel which causes problems especially in applications -- such as Bayesian optimization -- that involve repeated optimization of the kernel hyperparameters. Recently, the issue was addressed by a method that "amortizes" the inference of the hyperparameters using a hierarchical neural network architecture to predict the GP hyperparameters from data; the model is trained on a synthetic GP dataset and in general does not require retraining for unseen data. We asked if we can understand the method well enough to replicate it with a squared exponential kernel with automatic relevance determination (SE-ARD). We also asked if it is feasible to extend the system to predict posterior approximations instead of point-estimates to support fully Bayesian GPs. We introduce the theory behind Bayesian inference; gradient-based optimization; Gaussian process regression; variational inference; neural networks and the transformer architecture; the method that predicts point-estimates of the hyperparameters; and finally our proposed architecture to extend the method to a variational inference framework. We were able to successfully replicate the method from scratch with an SE-ARD kernel. In our experiments, we show that our replicated version of the method works and gives good results. We also implemented the proposed extension of the method to a variational inference framework. In our experiments, we do not find concrete reasons that would prevent the model from functioning, but observe that the model is very difficult to train. The final model that we were able to train predicted good means for (Gaussian) posterior approximations, but the variances that the model predicted were abnormally large. We analyze possible causes and suggest future work.
Analysing Controversy on Twitter via Graph Embeddings

Comănescu, Andrei-Daniel (2020)

Social networks represent a public forum of discussion for various topics, some of them controversial. Twitter is such a social network; it acts as a public space where discourse occurs. In recent years the role of social networks in information spreading has increased. As have the fears regarding the increasingly polarised discourse on social networks, caused by the tendency of users to avoid exposure to opposing opinions, while increasingly interacting with only like-minded individuals. This work looks at controversial topics on Twitter, over a long period of time, through the prism of political polarisation. We use the daily interactions, and the underlying structure of the whole conversation, to create daily graphs that are then used to obtain daily graph embeddings. We estimate the political ideologies of the users that are represented in the graph embeddings. By using the political ideologies of users and the daily graph embeddings, we offer a series of methods that allow us to detect and analyse changes in the political polarisation of the conversation. This enables us to conclude that, during our analysed time period, the overall polarisation levels for our examined controversial topics have stagnated. We also explore the effects of topic-related controversial events on the conversation, thus revealing their short-term effect on the conversation as a whole. Additionally, the linkage between increased interest in a topic and the increase of political polarisation is explored. Our findings reveal that as the interest in the controversial topic increases, so does the political polarisation.
Analysis of B12 vitamers in food samples by liquid chromatography-mass spectrometry

Grönfors, Helle (2023)

The literature review focused on liquid chromatographic-mass spectrometric (LC-MS) methods used to quantify B12 vitamers in food matrices. Various MS methods have been used for the detection of B12, offering more specificity than other commonly used analysis techniques. This thesis aimed to develop a method for quantifying the native forms of B12 in different food matrices and avoiding the commonly used conversion to cyanocobalamin during extraction. In the experimental study, an ultra-high-performance LC-tandem MS (UHPLC-MS/MS) method was developed and validated for selectivity, specificity, recovery, repeatability, reproducibility, trueness, and measurement uncertainty to determine B12 vitamers in fermented plant-based foods and microbial cell supernatants. The development was initiated by setting up mass spectrometer conditions and selecting transitions for multiple reaction monitoring (MRM) to achieve selective and sensitive detection method for individual B12 vitamers. This was followed by developing the UHPLC method utilizing a reversed-phased C18 column and gradient elution with 0.5% formic acid and 0.5% FA in methanol. The vitamers were ionized using electrospray ionization in a positive ion mode and detected in an MRM mode using hydroxocobalamin, cyanocobalamin, adenosylcobalamin, and methylcobalamin. All B12 vitamers were detected and separated with the developed and optimized UHPLC-MS/MS method. The internal standard calibration method was necessary to overcome matrix effects when analyzing food samples. The calibration curve content range was 0.2–200 pg/µL, and the results showed good linearity. The instrumental method was selective, precise, repeatable, and reproducible with detection and quantitation limits of 0.03–0.4 pg/µL and 0.2–2 pg/µL, respectively. The measurement uncertainty of the instrumental method varied between 10% and 20%. For the entire method, recoveries for the B12 vitamers ranged from 40% to 200%, and measurement uncertainties from 40% to 60%. Results for the total B12 content in food samples deviated from those determined using a conventional UHPLC-PDA method: Recovery for tempeh was over 90%, but for fortified bread only 20%. These results indicate the need for further development of sample pretreatment. The instrumental method was successfully validated and separated matrix compounds from B12 vitamers in food samples to some extent. The developed sample pretreatment method is a good starting point for developing more effective sample pretreatment methods in the future.
Analysis of volatile organic compounds emitted from indoor building materials

Heiskanen, Ilmari (2021)

Interest towards indoor air quality has increased for several decades from human health perspective. In order to evaluate the quality of indoor air in terms of volatile organic compound (VOC) levels, robust analytical procedures and techniques must be used for indoor air VOC measurements. Since indoor building materials are the greatest source of indoor VOC emissions, same kind of procedures must be used for analysis of emission rates from building materials and their surfaces. Theory part of this thesis reviews background of VOCs and human health, legislation and guideline values, common building materials with emissions and used sampling techniques/approaches for indoor air sampling and surface material emission rate sampling & analysis. Discussed sampling techniques include, for example, material emission test chambers, field and laboratory test emission cells, solid phase microextraction (SPME) fibre applications and Radiello passive samplers. Also new innovative approaches are discussed. Used common analysis instruments are Gas Chromatography (GC) with Mass Spectrometer (MS) or Flame Ionization Detector (FID) for VOCs and High-Performance Liquid Chromatography-Ultraviolet/Visible light detector (HPLC-UV/VIS) for carbonyl VOCs (e.g. formaldehyde) after suitable derivatization. Analytical procedures remain highly ISO 16000 standard series orientated even in recent studies. In addition, potential usage of new modern miniaturized sample collection devices SPME Arrow and In-tube extraction (ITEX) used in experimental part of this thesis are discussed as an addition to indoor air and VOC emission studies. The aim of the experimental part of this thesis was to develop calibrations for selected organic nitrogen compounds with SPME Arrow and ITEX sampling techniques and test the calibration with indoor and outdoor samples. A calibration was successfully carried out with SPME Arrow (MCM-41 sorbent), ITEX (MCM-TP sorbent) and ITEX (Polyacrylonitrile (PAN) 10 % sorbent) with permeation system combined with GC-MS for the following selected organic nitrogen compounds: triethylamine, pyridine, isobutyl amine, allylamine, trimethylamine, ethylenediamine, dipropyl amine, hexylamine, 1,3-diaminopropane, 1-methyl-imidazole, N, N-dimethylformamide, 1,2-diaminocyclohexane, 1-nitropropane and formamide. The overall quality of the calibration curves was evaluated, and the calibrations were compared in terms of linear range, relative standard deviation (RSD) % for accepted calibration levels and obtained Limits of Detection (LOD) values. Also, ways to improve the calibrations were discussed. The calibration curves were tested with real indoor and outdoor samples and quantitative, as well as semi-quantitative, results were obtained.
Analysis of volatile organic compounds from environmental samples with solid phase microextraction Arrows and gas chromatography mass-spectrometer

Kivinen, Anssi (2020)

The analysis of volatile organics is growing by the year and there is a great interest in fast and simple sample preparation techniques. With solid phase micro-extraction, samples can be extracted non-destructively without a need for solvents. This is both cost effective and ecological, because even most eco-friendly solvents still cause strain on the environment. This thesis focused on studying the effect of extraction conditions on the extraction efficiency. The effect of different sorptive phase materials was tested as well. New single-step sample extraction and preparation method was developed for gas chromatographic mass spectrometric analysis. Three different sorptive phase materials were compared and the extraction conditions were optimized for each. The method developed was used to extract, analyze and determine unknown compounds from a butterfly specimen. Multiple extractions were performed from both headspace and with direct immersion. By progressively changing the extraction conditions, properties of the compounds such as volatility and polarity could be determined by their presence alone. Analysis was performed using with gas chromatography mass-spectrometer using electron ionization quadrupole mass detector in full scan mode.
An empirical study on feature data management practices and challenges

Louhi, Jarkko (2023)

The rapid growth of artificial intelligence (AI) and machine learning (ML) solutions has created a need to develop, deploy and maintain AI/ML those to production reliably and efficiently. MLOps (Machine Learning Operations) framework is a collection of tools and practices that aims to address this challenge. Within the MLOps framework, a concept called the feature store is introduced, serving as a central repository responsible for storing, managing, and facilitating the sharing and reuse of extracted features derived from raw data. This study gives first an overview of the MLOps framework and delves deeper into feature engineering and feature data management, and explores the challenges related to these processes. Further, feature stores are presented, what they exactly are and what benefits do they introduce to organizations and companies developing ML solutions. The study also reviews some of the currently popular feature store tools. The primary goal of this study is to provide recommendations for organizations to leverage feature stores as a solution to the challenges they encounter in managing feature data currently. Through an analysis of the current state-of-the-art and a comprehensive study of organizations' practices and challenges, this research presents key insights into the benefits of feature stores in the context of MLOps. Overall, the thesis highlights the potential of feature stores as a valuable tool for organizations seeking to optimize their ML practices and achieve a competitive advantage in today's data-driven landscape. The research aims to explore and gather practitioners' experiences and opinions on the aforementioned topics through interviews conducted with experts from Finnish organizations.
APM Requirements Analysis and Comparison for Veikkaus Oy

Törnroos, Topi (2021)

Application Performance Management (APM) is a growing field, and APM tools on the market tend to be complex enterprise solutions with features ranging from traffic analysis and error reporting to real- user monitoring and business transaction management. This thesis is a study done on behalf of Veikkaus Oy, a Finnish government-owned game company and betting agency. It serves as a look into the current state-of-the-art field of leading APM tools as well as a requirements analysis done from the perspective of the company’s IT personnel. A list of requirements was gathered and scored based on perceived importance, and four APM tools on the market—Datadog APM, Dynatrace, New Relic and AppDynamics—were each compared to each other and scored based on the gathered requirements. In addition, open-source alternatives were considered and investigated. Our results suggest that the leading APM vendors have products very similar to each other with marginal differences between them, feature-wise. In general, APMs were deemed useful and valuable to the company, able to assist in the work of a wide variety of IT personnel, as well as able to replace many tools currently in use by Veikkaus Oy and simplify their application ecosystem.
Applying fluctuations to simulations of early universe bubble collisions in O(N) scalar field theory

Sukuvaara, Satumaaria (2023)

Many beyond the Standard Model theories include a first order phase transition in the early universe. A phase transition of this kind is presumed to be able to source gravitational waves that might be be observed with future detectors, such as the Laser Interferometer Space Antenna. A first order phase transition from a symmetric (metastable) minimum to the broken (stable) one causes the nucleation of broken phase bubbles. These bubbles expand and then collide. It is of importance to examine how the bubbles collide in depth, as the events during the collision affect the gravitational wave spectrum. We assume the field to interact very weakly or not at all with the particle fluid in the early universe. The universe also experiences fluctuations due to thermal or quantum effects. We look into how these background fluctuations affect the field evolution and bubble collisions during the phase transition in O(N) scalar field theory. Specifically, we numerically simulate two colliding bubbles nucleated on top of the background fluctuations, with the field being a N-dimensional vector in the O(N) group. Due to the symmetries present, the system can be examined in cylindrical coordinates, lowering the number of simulated spatial dimensions. In this thesis, we perform the calculation of initial state fluctuations and simulate them and two bubbles numerically. We present results of the simulation of the field, concentrating on the effects of fluctuations on the O(N) scalar field theory.
Architectures of Leading Image Captioning Systems

Kotola, Mikko Markus (2021)

Image captioning is the task of generating a natural language description of an image. The task requires techniques from two research areas, computer vision and natural language generation. This thesis investigates the architectures of leading image captioning systems. The research question is: What components and architectures are used in state-of-the-art image captioning systems and how could image captioning systems be further improved by utilizing improved components and architectures? Five openly reported leading image captioning systems are investigated in detail: Attention on Attention, the Meshed-Memory Transformer, the X-Linear Attention Network, the Show, Edit and Tell method, and Prophet Attention. The investigated leading image captioners all rely on the same object detector, the Faster R-CNN based Bottom-Up object detection network. Four out of five also rely on the same backbone convolutional neural network, ResNet-101. Both the backbone and the object detector could be improved by using newer approaches. Best choice in CNN-based object detectors is the EfficientDet with an EfficientNet backbone. A completely transformer-based approach with a Vision Transformer backbone and a Detection Transformer object detector is a fast-developing alternative. The main area of variation between the leading image captioners is in the types of attention blocks used in the high-level image encoder, the type of natural language decoder and the connections between these components. The best architectures and attention approaches to implement these components are currently the Meshed-Memory Transformer and the bilinear pooling approach of the X-Linear Attention Network. Implementing the Prophet Attention approach of using the future words available in the supervised training phase to guide the decoder attention further improves performance. Pretraining the backbone using large image datasets is essential to reach semantically correct object detections and object features. The feature richness and dense annotation of data is equally important in training the object detector.
A Reinforcement Learning Application for Portfolio Optimization in the Stock Market

Huertas, Andres (2020)

Investment funds are continuously looking for new technologies and ideas to enhance their results. Lately, with the success observed in other fields, wealth managers are taking a closes look at machine learning methods. Even if the use of ML is not entirely new in finance, leveraging new techniques has proved to be challenging and few funds succeed in doing so. The present work explores de usage of reinforcement learning algorithms for portfolio management for the stock market. It is well known the stochastic nature of stock and aiming to predict the market is unrealistic; nevertheless, the question of how to use machine learning to find useful patterns in the data that enable small market edges, remains open. Based on the ideas of reinforcement learning, a portfolio optimization approach is proposed. RL agents are trained to trade in a stock exchange, using portfolio returns as rewards for their RL optimization problem, thus seeking optimal resource allocation. For this purpose, a set of 68 stock tickers in the Frankfurt exchange market was selected, and two RL methods applied, namely Advantage Actor-Critic(A2C) and Proximal Policy Optimization (PPO). Their performance was compared against three commonly traded ETFs (exchange-traded funds) to asses the algorithm's ability to generate returns compared to real-life investments. Both algorithms were able to achieve positive returns in a year of testing( 5.4\% and 9.3\% for A2C and PPO respectively, a European ETF (VGK, Vanguard FTSE Europe Index Fund) for the same period, reported 9.0\% returns) as well as healthy risk-to-returns ratios. The results do not aim to be financial advice or trading strategies, but rather explore the potential of RL for studying small to medium size stock portfolios.
A Review of Proposals for Improvements in Evaluation of Natural Language Generation

Moilanen, Jouni Petteri (2023)

In recent years, a concern has grown within the NLG community about the comparability of systems and reproducibility of research results. This concern has mainly been focused on the evaluation of NLG systems. Problems with automated metrics, crowd-sourced human evaluations, sloppy experimental design and error reporting, etc. have been widely discussed in the literature. A lot of proposals for best practices, metrics, frameworks and benchmarks for NLG evaluation have lately been issued to address these problems. In this thesis we examine the current state of NLG evaluation – focusing on data-to-text evaluation – in terms of proposed best practices, benchmarks, etc., and their adoption in practice. Academic publications concerning NLG evaluation indexed in the Scopus database published in 2018-2022 were examined. After manual inspection 141 of those I deemed to contain some kind of concrete proposal for improvements in evaluation practices. The adoption (use in practice) of those was again examined by inspecting papers citing them. There seems to be a willingness in the academic community to adopt these proposals, especially ”best practices” and metrics. As for datasets, benchmarks, evaluation platforms, etc., the results are inconclusive.
Asteroid Classification Using Gaia Spectroscopy and Photometry

Uvarova, Elizaveta (2024)

Asteroids within our Solar System attract considerable attention for their potential impact on Earth and their role in elucidating the Solar System's formation and evolution. Understanding asteroids' composition is crucial for determining their origin and history, making spectral classification a cornerstone of asteroid categorization. Spectral classes, determined by asteroids' reflectance spectrum, offer insights into their surface composition. Early attempts at classification, predating 1973, utilized photometric observations in ultraviolet and visible wavelengths. The Chapman-McCord-Johnson classification system of 1973 marked the beginning of formal asteroid taxonomy, employing reflectance spectrum slopes for classification. Subsequent developments included machine learning techniques, such as principal component analysis and artificial neural networks, for improved classification accuracy. Gaia mission's Data Release 3 has significantly expanded asteroid datasets, allowing more extensive analyses. In this study, I examine the relationship between asteroid photometric slopes, spectra, and taxonomy using a feed-forward neural network trained on known spectral types to classify asteroids of unknown types. Our classification gained the mean accuracy of 80.4 ± 2.0 % over 100 iterations and separated successfully three asteroid taxonomic groups (C, S, and X) and the asteroid class D.
A Supervised Learning Approach to Predicting Loan Applicant’s Expenditures in Loan Origination

Paavola, Jaakko (2024)

Lenders assess the credit risk of loan applicants from both affordability and indebtedness perspective. The affordability perspective involves assessing the applicant’s disposable income after accounting for regular household expenditures and existing credit commitments, a measure called money-at-disposal or MaD. Having an estimate of the applicant’s expenditures is crucial, but simply asking applicants for their expenditures could lead to inaccuracies. Thus, lenders must produce their own estimates based on statistical or survey data about household expenditures, which are then passed to the MaD framework as input parameters or used as control limits to ascertain expenditure information reported by the applicant is truthful or at least adequately conservative. More accurate expenditure estimates in the loan origination would enable lenders to quantify mortgage credit risk more precisely, tailor loan terms more aptly, and protect customers against over-indebtedness better. Consequently, this would facilitate the lenders to be more profitable in their lending business as well as serve their customers better. But there is also a need for interpretability of the estimates stemming from compliance and trustworthiness motives. In this study, we examine the accuracy and interpretability of expenditure predictions of supervised models fitted to a microdataset of household consumption expenditures. To our knowledge, this is the first study to use such a granular and broad dataset to create predictive models of loan applicants’ expenditures. The virtually uninterpretable "black box" models we used, aiming at maximizing predictive power, rarely did better accuracy-wise than interpretable linear regression ones. Even when they did, the gain was marginal or in predicting minor expenditure categories that contributed only a low share of the total expenditures. Thus, ordinary linear regression is what we suggest generally provides the best combination of predictive power and interpretability. After careful feature selection, the best predictive power was attained with 20-54 predictor variables, the number depending on the expenditure category. If a very simple interpretation is needed, we suggest either a linear regression model of three predictor variables representing the number of household members, or a model based on the means within 12 "common sense groups" that we divided the households in. An alternative solution with a predictive power somewhere between the full linear regression model and the two simpler models is to use decision trees providing easy interpretation in the form of a set of rules.
A survival analysis of benign prostatic hyperplasia procedure complications

Ulkuniemi, Uula (2022)

This thesis presents a complication risk comparison of the most used surgical interventions for benign prostatic hyperplasia (BPH). The investigated complications are the development of either a post-surgery BPH recurrence (reoperation), an urethral stricture or stress incontinence severe enough to require a surgical procedure for their treatment. The analysis is conducted with survival analysis methods on a data set of urological patients sourced from the Finnish Institute for Health and Welfare. The complication risk development is estimated with the Aalen-Johansen estimator and the effects of certain covariates on the complication risks is estimated with the Cox PH regression model. One of the regression covariates is the Charlson Comorbidity Index score, which attempts to quantify a disease load of a patient at a certain point in time as a single number. A novel Spark algorithm was designed to facilitate the eﬀicient calculation of the Charlson Comorbidity Index score on a data set of the same size as the one used in the analyses here. The algorithm achieved at least similar performance to the previously available ones and scaled better on larger data sets and with stricter computing resource constraints. Both the urethral stricture and urinary incontinence endpoints suffered from a lower number of samples, which made the associated results less accurate. The estimated complication probabilities in both endpoint types were also so low that the BPH procedures couldn’t be reliably differentiated. In contrast, BPH reoperation risk analyses yielded noticeable differences among the initial BPH procedures. Regression analysis results suggested that the Charlson Comoborbidity Index score isn’t a particularly good predictor in any of the endpoints. However, certain cancer types that are included in the Charlson Comorbidity Index score did predict the endpoints well when used as separate covariates. An increase in the patient’s age was associated with a higher complication risk, but less so than expected. In the urethral stricture and urinary incontinence endpoints the number of preceding BPH operations was usually associated with a notable complication risk increase.
A Theoretical Introduction to Stimulated Resonant Inelastic X-ray Scattering up to the Quadrupole Order

Rasola, Miika (2020)

Resonant inelastic X-ray scattering (RIXS) is one of the most powerful synchrotron based methods for attaining information of the electronic structure of materials. Novel ultra-brilliant X-ray sources, X-ray free electron lasers (XFEL), offer new intriguing possibilities beyond the traditional synchrotron based techniques facilitating the transition of X-ray spectroscopic methods to the nonlinear intensity regime. Such nonlinear phenomena are well known in the optical energy range, less so in X-ray energies. The transition of RIXS to the nonlinear region could have significant impact on X-ray based materials research by enabling more accurate measurements of previously observed transitions, allowing the detection of weakly coupled transitions on dilute samples and possibly uncovering completely unforeseen information or working as a platform for novel intricate methods of the future. The nonlinear RIXS or stimulated RIXS (SRIXS) on XFEL has already been demonstrated in the simplest possible proof of concept case. In this work a comprehensive introduction to SRIXS is presented from a theoretical point of view starting from the very beginning, thus making it suitable for anyone with the basic understanding of quantum mechanics and spectroscopy. To start off, the principles of many body quantum mechanics are revised and the configuration interactions method for representing molecular states is introduced. No previous familiarity with X-ray matter interaction or RIXS is required as the molecular and interaction Hamiltonians are carefully derived, based on which a thorough analysis of the traditional RIXS theory is presented. In order to stay in touch with the real world, the basic experimental facts are recapped before moving on to SRIXS. First, an intuitive picture of the nonlinear process is presented shedding some light onto the term \textit{stimulated} while introducing basic terminology and some X-ray pulse schemes along with futuristic theoretical examples of SRIXS experiments. After this, a careful derivation of the Maxwell-Liouville-von Neumann theory up to quadrupole order is presented for the first time ever. Finally, the chapter is concluded with a short analysis of the experimental status quo on XFELs and some speculation on possible transition metal samples where SRIXS in its current state could be applied to observe quadrupole transitions advancing the field remarkably.
Auditing TikTok’s Recommender System with Sock Puppets

Aarne, Onni (2022)

The content we see is increasingly determined by ever more advanced recommender systems, and popular social media platform TikTok represents the forefront of this development (See Chapter 1). There has been much speculation about the workings of these recommender systems, but precious little systematic, controlled study (See Chapter 2). To improve our understanding of these systems, I developed sock puppet bots that consume content on TikTok as a normal user would (See Chapter 3). This allowed me to run controlled experiments to see how the TikTok recommender system would respond to sock puppets exhibiting different behaviors and preferences in a Finnish context, and how this would differ from the results obtained by earlier investigations (See Chapter 4). This research was done as part of a journalistic investigation in collaboration with Long Play. I found that TikTok appears to have adjusted their recommender system to personalize content seen by users to a much lesser degree, likely in response to a previous investigation by the WSJ. However, I came to the conclusion that, while sock puppet audits can be useful, they are not a sufficiently scalable solution to algorithm governance, and other types of audits with more internal access are needed (See Chapter 5).

Now showing items 1-20 of 233

Browsing by study line "ei opintosuuntaa"

Yhteystiedot

HELSINGIN YLIOPISTO