Browsing by study line "Algoritmit"
Now showing items 2140 of 75

(2021)Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pangenomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pangenomics research. Therefore, sequencetograph alignment becomes an important and challenging problem. For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, colinear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequencetograph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pangenome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequencetograph aligner GraphChainer. The aligner is evaluated and compared to an existing stateoftheart aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at https://github.com/algbio/GraphChainer.

(2024)Lossy image compression algorithms are used everywhere, ranging from general photography to natively running applications and the web. The most important question regarding a lossy image compression algorithm is arguably its ratedistortion performance. However, recent research has emphasized the fact that subjective experience of image quality can come at the expense of added or more pronounced distortions for low bit rates. The main focus of this thesis is to form a comprehensive view into standard lossy image compression, and to compare both classical and neural lossy image compression methods in order to inform their selection and use. Moreover, we focus on image quality in addition to traditional ratedistortion considerations. Various lossy image compression algorithms are discussed in some detail for a comprehensive view. Furthermore, we compare them using the most widely used objective IQA measures, the PSNR, the SSIM, the MSSSIM, as well as a crowdsourced subjective image quality comparison questionnaire. The results indicate that neural methods easily dominate with respect to both objective and subjective scores. However, rank ordering reveals some discrepancies, indicating that high objective measurement scores do not always align with subjective experience. The thesis gives detailed explanations as well as objective and subjective scores for eight different classical or neural lossy image compression algorithms: JPEG, JPEG2000, WebP, BPG, VVC intra coding, HiFiC, the Coarsetofine hyperprior model, and iWave++.

(2024)Graph drawing in two dimensions is an established area of research with many proposed drawing algorithms and drawing quality measures. Most research has remained in two dimensions because of the limitations of paper and traditional monitors. However, with the advancements in headmounted VR and AR headsets, it is possible to visualise graphs in true three dimensions which allows the viewer to comprehend larger graphs. To provide a useful graph viewing experience, the graph needs to be laid out in a way that highlights important features and structures. In this paper, we provide a collection of measures that measure the quality of 3D graph layouts and analyse their usefulness. We also measure the quality and performance of eleven graph layout algorithms on synthetic and realworld graphs and give suggestions for choosing an algorithm.

(2020)Demand forecasts are required for optimizing multiple challenges in the retail industry, and they can be used to reduce spoilage and excess inventory sizes. The classical forecasting methods provide point forecasts and do not quantify the uncertainty of the process. We evaluate multiple predictive posterior approximation methods with a Bayesian generalized linear model that captures weekly and yearly seasonality, changing trends and promotional effects. The model uses negative binomial as the sampling distribution because of the ability to scale the variance as a quadratic function of the mean. The forecasting methods provide highest posterior density intervals in different credible levels ranging from 50% to 95%. They are evaluated with proper scoring function and calculation of hit rates. We also measure the duration of the calculations as an important result due to the scalability requirements of the retail industry. The forecasting methods are Laplace approximation, Monte Carlo Markov Chain method, Automatic Differentiation Variational Inference, and maximum a posteriori inference. Our results show that the Markov Chain Monte Carlo method is too slow for practical use, while the rest of the approximation methods can be considered for practical use. We found out that Laplace approximation and Automatic Differentiation Variational Inference have results closer to the method with best analytical quarantees, the Markov Chain Monte Carlo method, suggesting that they were better approximations of the model. The model faced difficulties with highly promotional, slow selling, and intermittent data. Best fit was provided with high selling SKUs, for which the model provided intervals with hit rates that matched the levels of the credible intervals.

(2021)Arranging products in stores according to planograms, optimized product arrangement maps, is important for keeping up with the highly competitive modern retail market. The planograms are realized into product arrangements by humans, a process which is prone to mistakes. Therefore, for optimal merchandising performance, the planogram compliance of the arrangements needs to be evaluated from time to time. We investigate utilizing a computer vision problem setting – retail product detection – to automate planogram compliance evaluation. We introduce the relevant problems, the stateof theart approaches for solving them and background information necessary for understanding them. We then propose a computer vision based planogram compliance evaluation pipeline based on the current state of the art. We build our proposed models and algorithms using PyTorch, and run tests against public datasets and an internal dataset collected from a large Nordic retailer. We find that while the retail product detection performance of our proposed approach is quite good, the planogram compliance evaluation performance of our whole pipeline leaves a lot of room for improvement. Still, our approach seems promising, and we propose multiple ways for improving the performance enough to enable possible real world utility. The code used for our experiments and the weights for our models are available at https://github.com/laitalaj/cvpce

(2022)Improving the monitoring of health and wellbeing of dairy cows through the use of computer vision based systems is a topic of ongoing research. A reliable and lowcost method for identifying cow individuals would enable automatic detection of stress, sickness or injury, and the daily observation of the animals would be made easier. Neural networks have been used successfully in the identification of cow individuals, but methods are needed that do not require incessant annotation work to generate training datasets when there are changes within a group. Methods for person reidentification and tracking have been researched extensively, with the aim of generalizing beyond the training set. These methods have been found suitable also for reidentifying and tracking previously unseen dairy cows in video frames. In this thesis, a metriclearning based reidentification model pretrained on an existing cow dataset is compared to a similar model that has been trained on new video data recorded at Luke Maaninka research farm in Spring 2021, which contains 24 individually labelled cow individuals. The models are evaluated in tracking context as appearance descriptors in Kalman filter based tracking algorithm. The test data is video footage from a separate enclosure in Maaninka and a group of 24 previously unseen cow individuals. In addition, a simple procedure is proposed for the automatic labeling of cow identities in images based on RFID data collected from cow ear tags and feeding stations, and the known feeding station locations.

(2020)In this work, we seek robust methods for designing affirmative action policies for university admissions. Specifically, we study university admissions under a real centralized system that uses grades and standardized test scores to match applicants to university programs. For the purposes of affirmative action, we consider policies that assign bonus points to applicants from underrepresented groups with the goal of preventing large gaps in admission rates across groups, while ensuring that the admitted students are for the most part those with the highest scores. Since such policies have to be announced before the start of the application period, there is uncertainty about which students will apply to which programs. This poses a difficult challenge for policymakers. Hence, we introduce a strategy to design policies for the upcoming round of applications that can either address a single or multiple demographic groups. Our strategy is based on application data from previous years and a predictive model trained on this data. By comparing this predictive strategy to simpler strategies based only on application data from, e.g., the previous year, we show that the predictive strategy is generally more conservative in its policy suggestions. As a result, policies suggested by the predictive strategy lead to more robust effects and fewer cases where the gap in admission rates is inadvertently increased through the suggested policy intervention. Our findings imply that universities can employ predictive methods to increase the reliability of the effects expected from the implementation of an affirmative action policy.

(2020)This thesis presents a scalable method for identifying anomalous periods of nonactivity in short periodic event sequences. The method is tested with real world pointofsale (POS) data from grocery retail setting. However, the method can be applied also to other problem domains which produce similar sequential data. The proposed method models the underlying event sequence as a nonhomogeneous Poisson process with a piecewise constant rate function. The rate function for the piecewise homogeneous Poisson process can be estimated with a change point detection algorithm that minimises a cost function consisting of the negative Poisson loglikelihood and a penalty term that is linear to the number of change points. The resulting model can be queried for anomalously long periods of time with no events, i.e., waiting times, by defining a threshold below which the waiting time observations are deemed anomalies. The first experimental part of the thesis focuses on model selection, i.e., in finding a penalty value that results in the change point detection algorithm detecting the true changes in the intensity of the arrivals of the events while not reacting to random fluctuations in the data. In the second experimental part the performance of the anomaly detection methodology is measured against stockout data, which gives an approximate ground truth for the termination of a POS event sequence. The performance of the anomaly detector is found to be subpar in terms of precision and recall, i.e., the true positive rate and the positive predictive value. The number of false positives remains high even with small threshold values. This needs to be taken into account when considering applying the anomaly detection procedure in practice. Nevertheless, the methodology may have practical value in the retail setting, e.g., in guiding the store personnel where to focus their resources in ensuring the availability of the products.

(2021)Discourse dynamics is one of the important fields in digital humanities research. Over time, the perspectives and concerns of society on particular topics or events might change. Based on the changing in popularity of a certain theme different patterns are formed, increasing or decreasing the prominence of the theme in news. Tracking these changes is a challenging task. In a large text collection discourse themes are intertwined and uncategorized, which makes it hard to analyse them manually. The thesis tackles a novel task of automatic extraction of discourse trends from large text corpora. The main motivation for this work lies in the need in digital humanities to track discourse dynamics in diachronic corpora. Machine learning is a potential method to automate this task by learning patterns from the data. However, in many real usecases ground truth is not available and annotating discourses on a corpuslevel is incredibly difficult and timeconsuming. This study proposes a novel procedure to generate synthetic datasets for this task, a quantitative evaluation method and a set of benchmarking models. Largescale experiments are run using these synthetic datasets. The thesis demonstrates that a neural network model trained on such datasets can obtain meaningful results when applied to a real dataset, without any adjustments of the model.

(2023)The maximum constraint satisfaction problem (MaxCSP) is a combinatorial optimization problem in which the set of feasible solutions is expressed using decision variables and constraints on how the variables can be assigned. It can be used to represent a wide range of other combinatorial optimization problems. The maximum satisfiability problem (MaxSAT) is a restricted variant of the maximum constraint satisfaction problem with the additional restrictions that all variables must be Boolean variables, and all constraints must be logical Boolean formulas. Because of this, expressing problems using MaxSAT can be unintuitive. The known solving methods for the MaxSAT problem are more efficient than the known solving methods for MaxCSP. Therefore, it is desirable to express problems using MaxSAT. However, every MaxCSP instance that only has finitedomain variables can be encoded into an equivalent MaxSAT instance. Encoding a MaxCSP instance to a MaxSAT instance allows users to combine the strengths of both approaches by expressing problems using the more intuitive MaxCSPs but solving them using the more efficient MaxSAT solving methods. In this thesis, we overview three common MaxCSP to MaxSAT encodings, the sparse, log, and order encodings, that differ in how they encode an integer variable into a set of Boolean variables. We use correlation clustering as a practical example for comparing the encodings. We first represent correlation clustering problems using MaxCSPs, and then encode them into MaxSATs instances. Stateoftheart MaxSAT solvers are then used to solve the MaxSAT instances. We compare the encodings by measuring the time it takes to encode a MaxCSP instance into a MaxSAT instance and the time it takes to solve the MaxSAT instance. The scope of our experiments is too small to draw general conclusions but in our experiments, the log encoding was the best overall choice.

(2022)The traditional way of computing the BurrowsWheeler transform (BWT) has been to first build a suffix array, and then use this newly formed array to obtain the BWT. While this approach runs in linear time, the space requirement is far from optimal. When the length of the input string increases, the required working space quickly becomes too large for normal computers to handle. To overcome this issue, researchers have proposed many different types of algorithms for building the BWT. In 2009, Daisuke Okanohara and Kunihiko Sadakane presented a new linear time algorithm for BWT construction. This algorithm is relatively fast and requires far less working space than the traditional way of computing the BWT. It is based on a technique called induced sorting and can be seen as a stateoftheart approach for internal memory BWT construction. However, a proper exploration of how to implement the algorithm efficiently has not been undertaken. One 32bit implementation of the algorithm is known to exist, but due to the limitations of 32bit programs, it can only be used for input strings under the size of 4 GB. This thesis introduces the algorithm from Okanohara and Sadakane and implements a 64bit version of it. The implemented algorithm can in theory support input strings that are thousands of gigabytes in size. In addition to the explanation of the algorithm, the time and space requirements of the 64bit implementation are compared to some other fast BWT algorithms.

(2022)In independent component analysis the data is decomposed into its statistically independent components. In recent years, statistical models have been developed that solve a nonlinear version of the independent component analysis. This thesis focuses on the estimation methods of a particular nonlinear independent component analysis model called iVAE. It is shown on simulated data that the generative adversarial networks can significantly improve the iVAE model estimation compared with the previously used default iVAE estimation method. The improved model estimation might enable new applications for the iVAE model.

(2022)In independent component analysis the data is decomposed into its statistically independent components. In recent years, statistical models have been developed that solve a nonlinear version of the independent component analysis. This thesis focuses on the estimation methods of a particular nonlinear independent component analysis model called iVAE. It is shown on simulated data that the generative adversarial networks can significantly improve the iVAE model estimation compared with the previously used default iVAE estimation method. The improved model estimation might enable new applications for the iVAE model.

(2019)Air pollution is considered to be one of the biggest environmental risks to health, causing symptoms from headache to lung diseases, cardiovascular diseases and cancer. To improve awareness of pollutants, air quality needs to be measured more densely. Lowcost air quality sensors offer one solution to increase the number of air quality monitors. However, they suffer from low accuracy of measurements compared to professionalgrade monitoring stations. This thesis applies machine learning techniques to calibrate the values of a lowcost air quality sensor against a reference monitoring station. The calibrated values are then compared to a reference station’s values to compute error after calibration. In the past, the evaluation phase has been carried out very lightly. A novel method of selecting data is presented in this thesis to ensure diverse conditions in training and evaluation data, that would yield a more realistic impression about the capabilities of a calibration model. To better understand the level of performance, selected calibration models were trained with data corresponding to different levels of air pollution and meteorological conditions. Regarding pollution level, using homogeneous training and evaluation data, the error of a calibration model was found to be even 85% lower than when using diverse training and evaluation pollution environment. Also, using diverse meteorological training data instead of more homogeneous data was shown to reduce the size of the error and provide stability on the behavior of calibration models.

(2024)Combinatorial optimization problems arise in many applications. Finding solutions that are as good as possible, ideally optimal, respect to given criteria is important. Additionally, many realworld combinatorial optimization problems are NPhard. The socalled declarative approach to solving combinatorial optimization problems has proven to be successful in practice. In this work we focus on the the implicit hitting setbased (IHS) maximum satisfiability (MaxSAT) paradigm to solving combinatorial optimization problems declaratively. In the MaxSAT paradigm the problem at hand is formulated as a linear objective function to minimize subject to a set of constraints expressed in the language of propositional logic. In the IHS approach the problem is solved by alternating calls to two subroutines. An optimizer procedure computes optimal solutions over the variables in the objective function without the constraints available and a feasibility oracle verifies the solutions in terms of the constraints. In this work we study alternative divisions of constraints of a given problem formulation between the optimizer and the oracle. We allow the optimizer to compute solutions over any variables of the problem instance, thus extending the hitting set formulations of the IHSbased MaxSAT. We focus on two specific combinatorial optimization problems and existing MaxSAT encodings of these problems. The problems focus on are computing the treewidth of a graph and finding an optimal kundercover Boolean matrix factorization. We have also extended a stateoftheart IHSbased MaxSAT solver to support extended divisions of encodings and provide the implementation as open source.

(2022)The objective of the shortest common superstring problem is to find a string of minimum length that contains all keywords in the given input as substrings. Shortest common superstrings have many applications in the fields of data compression and bioinformatics. For example, a common superstring can be seen as a compressed form of the keywords it is generated from. Since the shortest common superstring problem is NPhard, we focus on the approximation algorithms that implement a socalled greed heuristic. It turns out that the actual shortest common superstring is not always needed. Instead, it is often enough to find an approximate solution of sufficient quality. We provide an implementation of the Ukkonen's linear time algorithm for the greedy heuristic. The practical performance of this implementation is measured by comparing it to another implementation of the same heuristic. We also hypothesize that shortest common superstrings can be potentially used to improve the compression ratio of the Relative LempelZiv data compression algorithm. This hypothesis is examined and shown to be valid.

(2020)Kmer counting is the process of building a histogram of all substrings of length k for an input string S. The problem itself is quite simple, but counting kmers efficiently for a very large input string is a difficult task that has been researched extensively. In recent years the performance of kmer counting algorithms have improved significantly, and there have been efforts to use graphics processing units (GPUs) in kmer counting. The goal for this thesis was to design, implement and benchmark a GPU accelerated kmer counting algorithm SNCGPU. The results showed that SNCGPU compares reasonably well to the Gerbil kmer counting algorithm on a midrange desktop computer, but does not utilize the resources of a highend computing platform as efficiently. The implementation of SNCGPU is available as opensource software.

(2023)Magnetic reconnection is a phenomenon occurring in plasma and related magnetic fields when magnetic field lines break and rejoin, leading to the release of energy. Magnetic reconnections take place, for example, in the Earth’s magnetosphere, where they can affect the space weather and even damage systems and technology on and around the Earth. Another site of interest is in fusion reactors, where the energy released from reconnection events can cause instability in the fusion process. So far, 2D magnetic reconnection has been widely studied and is relatively wellunderstood, whereas the 3D case remains more challenging to characterize. However, in realworld situations, reconnection occurs in three dimensions, which makes it essential to be able to detect and analyse 3D magnetic reconnection, as well. In this thesis, we examine what potential signs of 3D magnetic reconnection can be identified from the topological elements of a magnetic vector field. To compute the topological elements, we use the Visualization Toolkit (VTK) Python package. The topology characterizes the behaviour of the vector field, and it may reveal potential reconnection sites, where the topological elements can change as a result of magnetic field lines reconnecting. The magnetic field data used in this thesis is from a simulation of the nightside magnetosphere produced using Vlasiator. The contributions of this thesis include analysis of the topological features of 3D magnetic reconnection and topological representations of nightside reconnection conditions to use in potential future machine learning approaches. In addition, a modified version of the VTK function for computing the critical points of the topology is created with the purpose of gearing it more towards magnetic vector fields instead of vector fields in general.

(2023)In this thesis, we discuss the Relative LempelZiv (RLZ) lossless compression algorithm, our implementation of it, and the performance of RLZ in comparison to more traditional lossless compression programs such as gzip. Like the LZ77 compression algorithm, the RLZ algorithm compresses its input by parsing it into a series of phrases, which are then encoded as a position+length number pair describing the location of the phrase within the text. Unlike ordinary LZ77, where these pairs refer to earlier points in the same text and thus decompression must happen sequentially, in RLZ the pairs point to an external text called the dictionary. The benefit of this approach is faster random access to the original input given its compressed form: with RLZ, we can rapidly (in linear time with respect to the compressed length of the text) begin decompression from anywhere. With nonrepetitive data, such as the text of a single book, website, or one version of a program's source code, RLZ tends to perform poorer than traditional compression methods, both in terms of compression ratio and in terms of runtime. However, with very similar or highly repetitive data, such as the entire version history of a Wikipedia article or many versions of a genome sequence assembly, RLZ can compress data better than gzip and approximately as well as xz. Dictionary selection requires care, though, as compression performance relies entirely on it.

(2024)There are two primary types of quantum computers: quantum annealers and circuit model computers. Quantum annealers are specifically designed to tackle particular problems, as opposed to circuit model computers, which can be viewed as universal quantum computers. Substantial efforts are underway to develop quantumbased algorithms for various classical computational problems. The objective of this thesis is to implement algorithms for solving graph problems using quantum annealer computers and analyse these implementations. The aim is to contribute to the ongoing development of algorithms tailored for this type of machine. Three distinct types of graph problems were selected: all pairs shortest path, graph isomorphism, and community detection. These problems were chosen to represent varying levels of computational complexity. The algorithms were tested using the DWave quantum annealer Advantage system 4.1, equipped with 5760 qubits. DWave provides a cloud platform called Leap and a Python library, Ocean tools, through which quantum algorithms can be designed and run using local simulators or real quantum computers in the cloud. Formulating graph problems to be solved on quantum annealers was relatively straightforward, as significant literature already contains implementations of these problems. However, running these algorithms on existing quantum annealer machines proved to be challenging. Even though quantum annealers currently boast thousands of qubits, algorithms performed satisfactorily only on small graphs. The bottleneck was not the number of qubits but rather the limitations imposed by topology and noise. DWave also provides hybrid solvers that utilise both the Quantum Processing Unit (QPU) and CPU to solve algorithms, which proved to be much more reliable than using a pure quantum solver.
Now showing items 2140 of 75