Browsing by discipline "Algorithms and Machine Learning"
Now showing items 1-16 of 16
-
(2020)Heart rate (HR) monitoring has been the foundation of many researches and applications in the field of health care, sports and fitness, and physiology. With the development of affordable non- invasive optical heart rate monitoring technology, continuous monitoring of heart rate and related physiological parameters is increasingly possible. While this allows continuous access to heart rate information, its potential is severely constrained by the inaccuracy of the optical sensor that provides the signal for deriving heart rate information. Among all the factors influencing the sensor performance, hand motion is a particularly significant source of error. In this thesis, we first quantify the robustness and accuracy of the wearable heart rate monitor under everyday scenario, demonstrating its vulnerability to different kinds of motions. Consequently, we developed DeepHR, a deep learning based calibration technique, to improve the quality of heart rate measurements on smart wearables. DeepHR associates the motion features captured by accelerometer and gyroscope on the wearable with a reference sensor, such as a chest-worn HR monitor. Once pre-trained, DeepHR can be deployed on smart wearables to correct the errors caused by motion. Through rigorous and extensive benchmarks, we demonstrate that DeepHR significantly improves the accuracy and robustness of HR measurements on smart wearables, being superior to standard fully connected deep neural network models. In our evaluation, DeepHR is capable of generalizing across different activities and users, demonstrating that having a general pre-trained and pre-deployed model for various individual users is possible.
-
(2020)Traditional flat classification methods (e.g., binary, multiclass, and multi-label classification) seek to associate each example with a single class label or a set of labels without any structural dependence among them. Although, there are some problems in which classes can be divided or grouped into subclasses or superclasses respectively. Such a scenario demands the application of methods prepared to deal with hierarchical classification. An algorithm for hierarchical classification uses the information related to structure present in the class hierarchy and then improves the predictive performance . The freedom to perform a more generic classification, but with higher reliability, gives the process a greater versatility. Several studies have shown that, in solving a hierarchical classification problem, flat models are mostly overcome by hierarchical ones, regardless of the approach – local (including its derivations) or global – chosen. This thesis aims to compare the most popular hierarchical classification methods (local and global) empirically, reporting their performance – measured using hierarchical evaluation indexes. To do so, we had to adapt the global hierarchical models to conduct single path predictions, starting from the root class and moving towards a leaf class within the hierarchical structure. Further, we applied hierarchical classification on data streams by detecting concept drift. We first study data streams, various types of concept drifts, and state-of-the-art concept drift detection methods. Then we implement Global-Model Hierarchical Classification Naive Bayes (GNB) with three concept drift detectors: (i) Kolmogorov-Smirnov test, (ii) Wilcoxon test, and (iii) Drift Detection Method (DDM). A fixed-size sliding window was used to estimate the performance of GNB online. Finally, we must highlight that this thesis contributes to the task of automatic insect recognition.
-
(2020)We present Active Faceted Search, a technique which allows a user to iteratively refine search results by selecting from a set of facets that is dynamically refined with each iteration. The facets presented are selected using a contextual multi armed bandit model. We first describe the computational model of a system which implements Active Faceted Search. We also create a web application to demonstrate an example of a system that can use an active faceted search component along with more traditional search elements such as a typed query and sidebar component. We perform simulations to compare the performance of the system under different parameters. Finally, we present a user experiment in which users are instructed to perform tasks in order to compare Active Faceted Search to traditional search techniques.
-
(2019)Data compression is one way to gain better performance from a database. Compression is typically achieved with a compression algorithm, an encoding or both. Effective compression directly lowers the physical storage requirements translating to reduced storage costs. Additionally, in case of a data transfer bottleneck where CPU is data starved, compression can yield improved query performance through increased transfer bandwidth and better CPU utilization. However, obtaining better query performance is not trivial since many factors affect the viability of compression. Compression has been found especially successful in column oriented databases where similar data is stored closely in physical media. This thesis studies the effect of compression on a columnar storage format Apache Parquet through a micro benchmark that is based on the TPC-H benchmark. Compression is found to have positive effects on simple queries. However, with complex queries, where data scanning is relatively small portion of the query, no performance gains were observed. Furthermore, this thesis examines the decoding performance of the encoding layer that belongs to a case database, Fastorm. The goal is to determine its efficiency among other encodings and whether it could be improved upon. Fastorm's encoding is compared against various encodings of Apache Parquet in a setting where data is from a real world business. Fastorm's encoding is deemed to perform well enough coupled with strong evidence to consider adding delta encoding to its repertoire of encoding techniques.
-
(2020)A recent machine learning technique called federated learning (Konecny, McMahan, et. al., 2016) offers a new paradigm for distributed learning. It consists of performing machine learning on multiple edge devices and simultaneously optimizing a global model for all of them, without transmitting user data. The goal for this thesis was to prove the benefits of applying federated learning to forecasting telecom key performance indicator (KPI) values from radio network cells. After performing experiments with different data sources' aggregations and comparing against a centralized learning model, the results revealed that a federated model can shorten the training time for modelling new radio cells. Moreover, the amount of transferred data to a central server is minimized drastically while keeping equivalent performance to a traditional centralized model. These experiments were performed with multi-layer perceptron as model architecture after comparing its performance against LSTM. Both, input and output data were sequences of KPI values.
-
(2020)Interpretability in machine learning aims to provide explanations on the behaviors of complex predictive models, widely refer as black-boxes. Generally, interpretability means the understanding of how the models work internally, whereas, explanations are the one way to make machine learning models interpretable, e.g., using transparent and simple models. Numerous approaches have been proposed as explanation methods which strive to interpret black-box models. These explanation methods mainly try to approximate the local behavior of a model, and then explain it in a human-understandable way. The primary reason to explain the local-behavior is that explaining the global behavior of a black-box is difficult, and it remains an unsolved challenge. Moreover, there is another challenge which argues on the quality and stability of the generated explanations. One way to evaluate the quality of explanations is by using robustness as a property. In this work, we define the explanation evaluation framework, which attempts to measure the robustness of explanations. The framework consists of two distance-based measures stability and separability. We explore and use stability measure from existing literature and introduce our new separability measure, which goes along with stability measure in order to quantify the robustness of explanations. We examine model-agnostic (LIME, SHAP) and model-dependent (DeepExplain) explanation methods to interpret the predictions for various supervised predictive models, especially classifiers. We build classifiers by using UCI classification benchmark datasets and MNIST handwritten digits dataset. Our results illustrate that current model-agnostic and model-dependent explanation methods do not perform adequately with respect to our explanation evaluation framework. Our results show that these explanation methods are not robust to variations in features values and often produce different explanations for similar values and similar explanations for different values, which leads to unstable explanations. Our results and outcomes demonstrate that the developed explanation evaluation framework is useful to assess the robustness of explanations and inspire further exploration and work.
-
(2020)Locally checkable labeling problems in the LOCAL model of distributed computation are known to have only three distinct complexity classes when the attention is restricted to problems on toroidic grids only: trivial with time complexity Θ(1), local with time complexity Θ(log∗ n) and global with time complexity Θ(n). Prior work shows that problems belonging to the trivial class are easy to recognize, but that local and global labeling problems are undecidable to separate. However, a method called algorithm synthesis exists for creating an asymptotically optimal normal-form algorithm for any locally checkable labeling problem with a time complexity of Θ(log∗ n). This process, when automated, can be used to process vast amounts of suitably encoded labeling problems in bulk, creating a more defined boundary for the undecidable class of local problems. As a proof-of-concept of this method, this work presents a new asymptotically optimal algorithm for a relaxed form of 3-coloring as well as methods for more general search of local problems.
-
(2020)Anomaly detection is an important task in many domains such as maritime where it is used to detect, for example, unsafe, unexpected or criminal behaviour. This thesis studies the use of deep autoencoders for anomaly detection on high dimensional data in an unsupervised manner. The study is performed on a benchmark data set and a real-life AIS (Automatic Tracking System) data set containing actual ship trajectories. The ships’ trajectories in the AIS data set are a form of time-series data, and therefore recurrent layers are used in an autoencoder to allow the model to capture temporal dependencies in the data. An autoencoder is a neural network architecture where an encoder network produces an encoding and decoder network takes the encoding intending to produce the original input. An encoding is a compressed fixed-sized vector presentation of the original input. Since the encoding is used by the decoder to construct the original input, the model learns during the training process to store essential information of the input sequence to the encoding. Autoencoders can be used to detect anomalies using reconstruction error by assuming that a trained autoencoder is able to reconstruct non-anomalius data points more accurately than anomalous data points, and therefore data points with high reconstruction error can be considered anomalies. In addition to reconstruction error, the autoencoders produce encodings. The research of this thesis studies the possibility of calculating an outlier score for the encodings and combining the score with resconstruction error to form a combined outlier score. OPTICS-OF (Ordering Points to Identify the ClusteringStructure with Outlier Factors) is a density based anomaly detection technique which can be used to calculate outlier scores for the encodings. The outlier score of OPTICS-OF for a data point is based on how isolated it is within its neighbourhood. The proposed method is evaluated on a benchmark Musk data set for which anomalies are known. A data set with labelled anomalies provides a setting for analyzing the performance of the method and its properties. The method is then put to the test on the AIS data set where it is used to find new anomalies in the data set from two derived distinct feature sets. The AIS data set contains one known anomaly which is presented both as an example of a maritime anomaly and for which more detailed analysis of the produced outlier scores are presented. The results of the study show potential for the proposed combined score method, and the analysis identifies multiple areas for further research. Deep autoencoders are successfully used to find new anomalies from the AIS data set which show actual behaviour deviating from normal ship movement.
-
(2020)In the late 2010’s classical games of Go, Chess and Shogi have been considered ’solved’ by deep reinforcement learning AI agents. Competitive online video games may offer a new, more challenging environment for deep reinforcement learning and serve as a stepping stone in a path to real world applications. This thesis aims to give a short introduction to the concepts of reinforcement learning, deep networks and deep reinforcement learning. Then the thesis proceeds to look into few popular competitive online video games and to the general problems of AI development in these types of games. Deep reinforcement learning algorithms, techniques and architectures used in the development of highly competitive AI agents in Starcraft 2, Dota 2 and Quake 3 are overviewed. Finally, the results are looked into and discussed.
-
(2019)Time series analysis has been a popular research topic in the last few decades. In this thesis, we develop time series models to investigate short time series of count data. We first begin with Poisson autoregressive model and extend it to capture day effects explicitly. Then we propose hierarchical Poisson tensor factorization model as an alternative to the traditional count time series models. Furthermore, we suggest acontext-based model as an improvement over hierarchical Poisson tensor factorization model. We implement the models in an open-source probabilistic programming framework Edward. This tool enables us to express the models in form of executable program code and allows us to rapidly prototype models without the need of derivation of model specificupdaterules. We also explore strategies for selecting the best model out of alternatives. We study the proposed models on a dataset containing media consumption data. Our experimental findings demonstrate that the hierarchical Poisson tensor factorization model significantly outperforms the Poisson autoregressive models in predicting event counts. We also visualize the key results of our exploratory data analysis.
-
(2020)Enormous datasets are a common occurence today and compressing them is often beneficial. Fast direct access to any element in the compressed data is a requirement in the field of compressed data structures, which is not easily supported with traditional compression methods. Variable-byte encoding is a method for compressing integers of different byte lengths. It removes unused leading bytes and adds an additional continuation bit to each byte to denote whether the compressed integer continues to the next byte or not. An existing solution using a rank data structure performs well in this given task. This thesis introduces an alternative solution using a select data structure and compares the two implementations. An experimentation is also done on retrieving a subarray from the compressed data structure. The rank implementation performs better on data containing mostly small integers. The select implementation benefits on larger integers. The select implementation has significant advantages on subarray fetching due to how the data is compressed.
-
(2019)Spam detection techniques have made our lives easier by unclogging our inboxes and keeping unsafe messages from being opened. With the automation of text messaging solutions and the increase in telecommunication companies and message providers, the volume of text messages has been on the rise. With this growth came along malicious traffic which users had little control over. In this thesis, we present an implementation of a spam detection system in a real-world text messaging platform. Using well-established machine learning algorithms, we make an in-depth analysis on the performance of the models using two different datasets: one publicly available (N=5,574) and the other gathered from actual traffic of the platform (N=1,477). Making use of the empirical results, we outline the models and hyperparameters which can be used in the platform and in which scenarios they produce optimal performance. The results indicate that our dataset poses a great challenge at accurate classification, most likely due to the small sample size and unbalanced dataset, along with nuances in the dataset. Nevertheless, there were models that were found to have a good all-around performance and they can be trained and used in the platform.
-
(2019)Plagiarism is an act of copying where one doesn’t rightfully credit the original source. The motivations behind plagiarism can vary from completing academic courses to even gaining economical advantage. Plagiarism exists in various domains, where people want to take credit from something they have worked on. These areas can include e.g. literature, art or software, which all have a meaning for an authorship. In this thesis we conduct a systematic literature review from the topic of source code plagiarism detection methods, then based on the literature propose a new approach to detect plagiarism which combines both similarity detection and authorship identification, introduce our tokenization method for the source code, and lastly evaluate the model by using real life data sets. The goal for our model is to point out possible plagiarism from a collection of documents, which in this thesis is specified as a collection of source code files written by various authors. Our data, which we will use to our statistical methods, consists of three datasets: (1) collection of documents belonging to University of Helsinki’s first programming course, (2) collection of documents belonging to University of Helsinki’s advanced programming course and (3) submissions for source code re-use competition. Statistical methods in this thesis are inspired by the theory of search engines, which are related to data mining when detecting similarity between documents and machine learning when classifying document with the most likely author in authorship identification. Results show that our similarity detection model can be used successfully to retrieve documents for further plagiarism inspection, but false positives are quickly introduced even when using a high threshold that controls the minimum allowed level of similarity between documents. We were unable to use the results of authorship identification in our study, as the results with our machine learning model were not high enough to be used sensibly. This was possibly caused by the high similarity between documents, which is due to the restricted tasks and the course setting that teaches a specific programming style during the timespan of the course.
-
(2020)Most online fraud involves identity thief, especially in financial services such as banking, commercial services, or home security. Passwords have always been one of the most reliable and common way to protect user identities. However, passwords can be guessed or breached. Biometric authentications have emerged to be a compliment way to improve the security. Nevertheless, biometric factors such as fingerprint or face recognition can also be spoofed. Additionally, those factors require either user interaction (touch to unlock) or additional hardware (surveillance camera). Therefore, the next level of security with lower risk of attack and less user friction is essentially needed. gait authentication is one of the viable solutions since gait is the signature of the way humans walk, and the analysis can be done passively without any user interactions. Several breakthroughs in terms of model accuracy and efficiency were reported across several state-of-the-art papers. For example, DeepSense reported the accuracy of 0.942±0.032 in Human Activity Recognition and 0.997±0.001 in User Identification. Although there have been research focusing on gait-analysis recently, there has not been a stan- dardized way to define proper testing workflow and techniques that are required to ensure the correctness and efficiency of gait application system, especially when it is done in production scale. This thesis will present a general workflow of Machine Learning (ML) system testing in gait au- thentication using V-model, as well as identifying the areas and components that requires testing, including data testing and performance testing in each ML-related components. This thesis will also suggest some adversarial cases that the model can fail to predict. Traditional testing technique such as differential testing will also be introduced as a testing candidate for gait segmentation. In addition, several metrics and testing ideas will also be suggested and experimented. At last, some interesting findings will be reported in the experimental results section, and some areas for further future work will also be mentioned.
-
(2020)Ubiquitous surveillance has become part of our lives to increase security and safety. Despite the wide application of surveillance systems, their efficiency is limited by human factors, such as boredom and fatigue; because most of the time, nothing unusual happens. In safety-critical applications, time is essential and it is vital to act fast to prevent costly incidents. This thesis proposes a two-stage abnormal crowd event detection framework based on k-means clustering in the first stage, and sparse representation based methods in the second stage, to alleviate the laborious task of video monitoring. We conduct a literature review of 18 studies, where we specifically focus on sparse representation based methods. Accordingly, we choose the spatio-temporal gradient feature due to its simplicity, efficiency, and effectiveness in motion representation. After extracting features only from normal events, k-means clustering is applied to separate different motion feature clusters. Then, clusters with smaller samples, which are deemed to contain mostly abnormal features, are removed according to a threshold. In the second stage, we learn a dictionary for each remaining cluster using the approximate K-SVD algorithm. In testing, the reconstruction error of a feature against a learned dictionary and its sparse representation is used to determine an abnormality. We conduct extensive experiments on a standard dataset to evaluate the detection performance of the method. Furthermore, the effect of hyper-parameters in our method is investigated. We also compare our method with different methods to examine its effectiveness. Results indicate that our abnormal event detection framework can successfully understand abnormal events in a scene while running in real-time at 161 frames per second. With a few exceptions, no significant advantage of the two-stage sparse representation approach over a single large dictionary was found. We speculate that these results may be influenced by a small sample size. Nevertheless, our approach, due to its unsupervised nature, can be adapted to different contexts without additional annotation effort and using only normal events from videos. Therefore it motivates us for further development.
-
(2020)Tropes are storytelling devices or conventions that can be found in storytelling media, for example in movies. DBTropes is an RDF-dataset with media-trope connections extracted from Tv Tropes, which is a community-edited wiki listing tropes for various creative works. This study investigates whether the tropes of films can be used to cluster similar movies. First, we extracted film-trope connections from the DBTropes dataset. We then took four samples from the dataset, three for clustering films, and one for clustering tropes. We used the film–trope connections to calculate euclidean and cosine distances between movies and for the last sample between tropes. Then we clustered the samples with hierarchical clustering using complete and ward linkage. Our results show that hierarchical clustering can group similar films together using this dataset. For calculating distances the cosine distance method works significantly better than euclidean distance. Both hierarchical clustering methods complete and ward work well. It depends on the chosen sample, which of them results in a clearer and more interpretable output. We conclude that the data works well for clustering similar films or similar tropes.
Now showing items 1-16 of 16