
Browsing by study line "Algorithms"

  • Utoslahti, Aki (2022)
    Lempel-Ziv factorization of a string is a fundamental tool used by a myriad of data compressors. Despite its optimality with respect to the number of produced factors, it is rarely used without modification because of its computational cost. In recent years, Lempel-Ziv factorization has been an active research topic, and the state of the art has changed completely. In this thesis, I explore the properties of the latest suffix array-based Lempel-Ziv factorization algorithms while experimenting with turning them into an efficient general-purpose data compressor. The setting of this thesis is purely exploratory, guided by reliable and repeatable benchmarking. I explore all aspects of a suffix array-based Lempel-Ziv data compressor. I describe how the chosen factorization method affects the development of the encoding and other components of a functional data compressor. I show how the chosen factorization technique, together with the capabilities of modern hardware, allows the length of the longest common prefix of two strings to be determined over 80% faster than with the baseline approach. I also present a novel approach to optimizing the encoding cost of the Lempel-Ziv factorization of a string, i.e., bit-optimality, using a dynamic programming formulation of the single-source shortest path problem. I observed that, in its current state, suffix array construction is a major computational bottleneck in suffix array-based Lempel-Ziv factorization. Additionally, using a suffix array to produce a Lempel-Ziv factorization leads to optimality in the number of factors, which does not necessarily correspond to bit-optimality. Finally, a comparison with common third-party data compressors revealed that relying exclusively on Lempel-Ziv factorization prevents reaching the highest compression efficiency. For these reasons, I conclude that current suffix array-based Lempel-Ziv factorization is unsuitable for general-purpose data compression.
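To illustrate the bit-optimality idea described in the abstract above, the sketch below casts the choice of factors as a single-source shortest path computation over text positions, solved by dynamic programming over a DAG. The helpers `candidate_factors` and `cost_bits` are hypothetical placeholders for whatever factor enumeration and encoding-cost model a compressor actually uses; this is not the thesis's implementation.

```python
def cheapest_parsing(n, candidate_factors, cost_bits):
    """Single-source shortest path over text positions 0..n.

    candidate_factors(i) yields (length, factor) pairs for factors that may
    start at position i (assumed to always include a single-character literal,
    so every position is reachable); cost_bits(factor) is a hypothetical
    encoding-cost model in bits.
    """
    INF = float("inf")
    dist = [INF] * (n + 1)   # dist[i] = cheapest encoding of the prefix of length i
    back = [None] * (n + 1)  # back[i] = (previous position, factor) on a cheapest path
    dist[0] = 0
    for i in range(n):                       # edges only go forward, so one pass suffices
        if dist[i] == INF:
            continue
        for length, factor in candidate_factors(i):
            j = i + length
            c = dist[i] + cost_bits(factor)
            if c < dist[j]:
                dist[j] = c
                back[j] = (i, factor)
    # Reconstruct the bit-optimal parsing by walking back from position n.
    parsing = []
    pos = n
    while pos > 0:
        prev, factor = back[pos]
        parsing.append(factor)
        pos = prev
    parsing.reverse()
    return dist[n], parsing
```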
  • Kailamäki, Kalle (2022)
    This thesis explores predicting current prices of individual agricultural fields in Finland based on historical data. The task is to predict field prices accurately with the data we have available while keeping model predictions interpretable and well explainable. The research question is which of the several models we try out is best suited to the task. The motivation behind this research is the growing agricultural land market and the lack of publicly available field valuation services that could assist market participants in determining and identifying reasonable asking prices. Previous studies on the topic have used standard statistics to establish relevant factors that affect field prices. Rather than creating a model whose predictions can be used on their own in every case, the primary purpose of previous works has been to identify information that should be considered in manual field valuation. We, on the other hand, focus on the predictive ability of models that do not require any manual labor. Our modelling approaches focus mainly, but not exclusively, on algorithms based on Markov chain Monte Carlo. We create a nearest neighbors model and four hierarchical linear models of varying complexity. Performance comparisons lead us to recommend a nearest-neighbor-type model for this task.
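As a minimal illustration of the nearest-neighbour approach recommended above, the sketch below fits a distance-weighted k-NN regressor with scikit-learn; the feature set (coordinates, field area, a soil score) and the toy data are hypothetical, not the thesis's actual variables.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [easting, northing, field_area_ha, soil_score]
X_train = np.array([
    [385_000, 6_672_000, 3.2, 0.7],
    [386_500, 6_673_500, 5.1, 0.5],
    [390_200, 6_669_800, 2.4, 0.9],
    [384_100, 6_671_200, 7.8, 0.6],
])
y_train = np.array([7800, 6200, 9100, 5900])  # observed sale prices, EUR/ha

# Scale features so the coordinates do not dominate the distance metric,
# then predict as a distance-weighted average of the nearest observed sales.
model = make_pipeline(StandardScaler(),
                      KNeighborsRegressor(n_neighbors=3, weights="distance"))
model.fit(X_train, y_train)

new_field = np.array([[386_000, 6_672_800, 4.0, 0.8]])
print(model.predict(new_field))  # estimated price, EUR/ha
```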
  • Puro, Touko (2023)
    GPUs have become an important part of large-scale and high-performance physics simulations due to their superior performance [11] and energy efficiency [23] compared to CPUs. This thesis examines how to accelerate with GPUs an existing CPU stencil code that is originally parallelized through message passing. Our first research question is how to utilize the CPU cores alongside the GPUs when the bulk of the computation happens on the GPUs. Secondly, we investigate how to address the performance bottleneck of data movement between CPU and GPU when there is a need to perform computational tasks originally intended to be executed on CPUs. Lastly, we investigate how the performance bottleneck of communication between processes can be alleviated to make better use of the available compute resources. In this thesis we approach these problems by building a preprocessor designed to make an existing CPU codebase suitable for GPU acceleration, and the communication bottleneck is alleviated by extending Astaroth, an existing GPU-oriented library. We improve its task scheduling system and extend its domain-specific language (DSL) for stencil computations. Our solutions are demonstrated by making an existing CPU-based astrophysics simulation code, Pencil Code [4], suitable for GPU acceleration with the use of our preprocessor and the Astaroth library. Our results show that we are able to utilize CPU cores to perform useful work alongside the GPUs. We also show that we are able to circumvent the CPU-GPU data movement bottleneck by making code suitable for offloading through OpenMP offloading and translation to GPU code. Lastly, we show that in certain cases Astaroth's communication performance is increased by around 21% through smaller message sizes, with the added benefit of 14% lower memory usage, which corresponds to an around 18% improvement in overall performance. Furthermore, we show the benefits of the improved tasking and identify a memory-performance trade-off.
  • Smirnov, Pavel (2022)
    There are many computationally difficult problems where the task is to find a solution with the lowest cost possible that fulfills a given set of constraints. Such problems are often NP-hard and are encountered in a variety of real-world problem domains, including planning and scheduling. NP-hard problems are often solved using a declarative approach: the problem is encoded in a declarative constraint language and the encoding is solved with a generic algorithm for that language. In this thesis we focus on pseudo-Boolean optimization (PBO), a special class of integer programs (IP) whose variables take only the values 0 and 1. We propose a novel approach to PBO based on the implicit hitting set (IHS) paradigm, which uses two separate components. An IP solver is used to find an optimal solution under an incomplete set of constraints. A pseudo-Boolean satisfiability solver is used either to validate the feasibility of the solution or to extract further constraints for the integer program. The IHS-based PBO algorithm iteratively invokes the two solvers until an optimal solution to a given PBO instance is found. In this thesis we lay out the IHS-based PBO solving approach in detail. We implement the algorithm as the PBO-IHS solver, making use of recent advances in reasoning techniques for pseudo-Boolean constraints. Through extensive empirical evaluation we show that our PBO-IHS solver outperforms other available specialized PBO solvers and has complementary performance compared to classical integer programming techniques.
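A minimal sketch of the IHS loop described above, assuming two hypothetical oracles: `ip_solve` stands in for the integer-programming solver and `pb_check` for the pseudo-Boolean satisfiability solver. It shows the alternation only, not the PBO-IHS implementation.

```python
def ihs_pbo(objective, pb_constraints, ip_solve, pb_check):
    """Iterate an IP solver and a PB oracle until an optimal solution is found.

    ip_solve(objective, accumulated) -> candidate 0/1 assignment that is optimal
        with respect to the constraints accumulated so far (hypothetical oracle).
    pb_check(candidate, pb_constraints) -> (True, None) if the candidate satisfies
        all original PB constraints, else (False, new_constraint) where
        new_constraint rules the candidate out (hypothetical oracle).
    """
    accumulated = []                      # constraints extracted so far
    while True:
        candidate = ip_solve(objective, accumulated)
        feasible, new_constraint = pb_check(candidate, pb_constraints)
        if feasible:
            # Optimal under a relaxation and feasible for the full instance,
            # hence optimal for the original PBO instance.
            return candidate
        accumulated.append(new_constraint)
```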
  • Lehtonen, Tuomo (2019)
    Formal argumentation is a vibrant research area within artificial intelligence, in particular in knowledge representation and reasoning. Computational models of argumentation are divided into abstract and structured formalisms. Since its introduction in 1995, abstract argumentation, where the structure of arguments is abstracted away, has been much studied and applied. Structured argumentation formalisms, on the other hand, contain the explicit derivation of arguments. This is motivated by the importance of argument construction in the application of argumentation formalisms, but it also makes structured formalisms conceptually, and often computationally, more complex than abstract argumentation. The focus of this work is on assumption-based argumentation (ABA), a major structured formalism. Specifically, we address the relative lack of efficient computational tools for reasoning in ABA compared to abstract argumentation: the computational efficiency of ABA reasoning systems has been markedly lower than that of systems for abstract argumentation. In this thesis we introduce a declarative approach to reasoning in ABA via answer set programming (ASP), drawing inspiration from existing tools for abstract argumentation. In addition, we consider ABA+, a generalization of ABA that incorporates preferences into the formalism. The complexity of reasoning in ABA+ is higher than in ABA for most problems. We are able to extend our declarative approach to some ABA+ reasoning problems. We show empirically that our approach vastly outperforms previous reasoning systems for ABA and ABA+.
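As background for the ASP-based approach, the sketch below shows the classic guess-and-check answer set program for stable extensions of plain abstract argumentation frameworks (the setting that inspired the thesis, not its ABA or ABA+ encodings), solved through the clingo Python interface; the three-argument framework is made up.

```python
import clingo  # pip install clingo

# Guess-and-check ASP encoding of stable extensions of an abstract AF:
# every argument is in or out, in-arguments are conflict-free,
# and every out-argument must be attacked by some in-argument.
ENCODING = """
in(X)  :- arg(X), not out(X).
out(X) :- arg(X), not in(X).
:- in(X), in(Y), att(X,Y).
defeated(X) :- in(Y), att(Y,X).
:- out(X), not defeated(X).
#show in/1.
"""

# A made-up three-argument framework: a attacks b, b attacks c.
FACTS = "arg(a). arg(b). arg(c). att(a,b). att(b,c)."

ctl = clingo.Control(["0"])            # "0": enumerate all answer sets
ctl.add("base", [], ENCODING + FACTS)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("stable extension:",
                                   [str(atom) for atom in m.symbols(shown=True)]))
```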
  • Ihalainen, Hannes (2022)
    The so-called declarative approach has proven to be a viable paradigm for solving various real-world NP-hard optimization problems in practice. In the declarative approach, the problem at hand is encoded using a mathematical constraint language, and an algorithm for that language is employed to obtain optimal solutions to an instance of the problem. One of the most viable declarative optimization paradigms of recent years is maximum satisfiability (MaxSAT), with propositional logic as the constraint language. So-called core-guided MaxSAT algorithms constitute arguably one of the most effective MaxSAT-solving paradigms in practice today. Core-guided algorithms iteratively detect and rule out (relax) sources of inconsistency (so-called unsatisfiable cores) in the instance being solved. Especially effective are recent algorithmic variants of the core-guided approach which employ so-called soft cardinality constraints for ruling out inconsistencies. In this thesis, we present a structure-sharing technique for the cardinality-based core relaxation steps performed by core-guided MaxSAT solvers. The technique aims at reducing the inherent growth in the size of the propositional formula resulting from the core relaxation steps. Additionally, it enables more efficient reasoning over the relationships between different cores. We empirically evaluate the proposed technique on two different core-guided algorithms and provide open-source implementations of our solvers employing the technique. Our results show that the proposed structure sharing can improve the performance of the algorithms both in theory and in practice.
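A simplified sketch of a core-guided loop that relaxes unsatisfiable cores with soft cardinality constraints, in the spirit of the algorithms discussed above; the helpers `new_var`, `sat_solve`, and `encode_totalizer` are hypothetical placeholders, and the sketch omits the structure sharing that is the thesis's contribution.

```python
def core_guided_maxsat(hard_clauses, soft_clauses, new_var, sat_solve, encode_totalizer):
    """Core-guided MaxSAT sketch using soft cardinality constraints.

    Literals are DIMACS-style integers (negation is -x). The helpers are
    hypothetical placeholders:
      new_var()                   -> a fresh Boolean variable
      sat_solve(clauses, assumps) -> (True, model) or (False, core), the core
                                     being a subset of the failed assumptions
      encode_totalizer(lits)      -> (clauses, outputs) with outputs[j] true
                                     iff at least j+1 of lits are true
    """
    clauses = list(hard_clauses)
    assumptions = []
    for clause in soft_clauses:          # relax each soft clause with a blocking variable
        b = new_var()
        clauses.append(clause + [b])
        assumptions.append(-b)           # assuming -b forces the soft clause to hold
    lower_bound = 0
    while True:
        sat, result = sat_solve(clauses, assumptions)
        if sat:
            return lower_bound, result   # cost = number of cores relaxed so far
        core = result
        lower_bound += 1
        assumptions = [a for a in assumptions if a not in core]
        card_clauses, outputs = encode_totalizer([-a for a in core])
        clauses += card_clauses
        # Allow one violated literal from this core; further violations stay soft.
        assumptions += [-o for o in outputs[1:]]
```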
  • Karanko, Lauri (2022)
    Determining the optimal rental price of an apartment typically requires a real estate agent to gauge the external and internal features of the apartment as well as similar apartments in its vicinity. Hedonic pricing models that rely on regression are commonplace, but those that employ state-of-the-art machine learning methods are still not widespread. The purpose of this thesis is to investigate an optimal machine learning method for predicting rental prices of apartments in the Greater Helsinki area. The project was carried out at the behest of a client in the real estate investing business. We review which external and inherent apartment features are the most suitable for making predictions, and engineer additional features that yield predictions with the least error within the Greater Helsinki area. Combining public demographic data from Tilastokeskus (Statistics Finland) with data from the online broker Oikotie Oy gives rise to a model that is comparable to contemporary commercial solutions offered in Finland. Using inverse distance weighting to interpolate a price for the coordinates of a new apartment was also found to be crucial in developing a performant model. A review of the candidate models showed that the gradient boosting algorithm XGBoost fared best for this regression task.
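The inverse distance weighting step mentioned above can be sketched as follows; the power parameter, coordinates, and prices are illustrative placeholders rather than the thesis's actual configuration.

```python
import numpy as np

def idw_price(known_xy, known_prices, query_xy, power=2, eps=1e-9):
    """Interpolate a price at query_xy as a weighted average of known prices,
    with weights proportional to 1 / distance**power (inverse distance weighting)."""
    d = np.linalg.norm(known_xy - query_xy, axis=1)
    w = 1.0 / (d ** power + eps)          # eps avoids division by zero at exact matches
    return float(np.sum(w * known_prices) / np.sum(w))

# Hypothetical nearby rental observations: (x, y) coordinates and EUR/month prices.
known_xy = np.array([[25.01, 60.17], [25.03, 60.18], [24.98, 60.16]])
known_prices = np.array([950.0, 1100.0, 870.0])
print(idw_price(known_xy, known_prices, np.array([25.00, 60.17])))
```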
  • Ingervo, Eliel (2024)
    The problem of safety in a general graph is the problem of finding walks in the graph that are subwalks of a walk in any possible solution within a given model (Tomescu and Medvedev, 2017). The problem of safety has proven to be valuable for bioinformatics in the context of genome assembly and other assembly problems. When working with perfect data, safe walks in a graph correspond to correct sequences of the genome. When working with real (erroneous) data, safe walks correspond to close-to-correct sequences, where the errors correspond only to errors already present in the data. Safety has previously been considered in two different models: a model where the graph holds only the information of graph topology, and a model where the graph holds the information of graph topology and network flow. Subpath constraints are paths in a graph that restrict the solution set. Restricting a solution set increases the lengths of safe walks, so using a model that holds the information of subpath constraints is advantageous. However, subpath constraints have never been considered together with the problem of safety. In this work, we introduce subwalk constraints, which are subpath constraints that can have cycles and are not limited to directed acyclic graphs (DAGs). We then present a new model where the graph holds the information of graph topology, network flow, and subwalk constraints. In the new model, we present three methodologies for computing safe walks that become possible because of subwalk constraints, and an algorithm that computes safe walks in polynomial time in the new model by utilizing the three new methodologies. Due to the new model, all safe walks produced by other algorithms are subwalks of the safe walks generated by our new algorithm.
  • Friberg, Théo (2024)
    This thesis presents a new way to improve machine translation when the correct target-language equivalents of the specialized terminology in the text to be translated are known. The new method is applicable to neural machine translation with existing translation models, so it does not require retraining a translation model, which is computationally, energetically, and financially expensive. Unlike known methods that exploit existing models, it is also applicable to translation into and from Finnish and, more generally, suits richly inflecting languages. The thesis reviews earlier research on terminology translation, describes the new method in detail, and compares its performance against several baseline methods in two specialized domains. The comparison makes use of terminologies compiled by professional terminologists at two partner organizations. The evaluation employs both automatically computed metrics and human evaluation. The new method scores better than the baselines on the BLEU, chrF, and TER metrics, and clearly better than the baselines in the number of terms found in the translations. On the COMET metric, the results are better than the baselines on one dataset and slightly worse on the other. In human evaluation, the new method produces the most acceptable sentences on one dataset but generally performs worse than the baselines on the other. The reasons for this are discussed.
  • Tenhunen, Jouni (2024)
    Compression methods are widely used in modern computing. With the amount of data stored and transferred by database systems constantly increasing, the implementation of compression methods in database systems has been studied from different angles during the past four decades. This thesis studies the scientific methods used in relational database compression research. The goal of the thesis is to evaluate the methods employed and to gain an understanding of how research on the subject should be conducted. A literature review is conducted: 14 papers are identified for review, and their methodology is described and analysed. The papers reviewed are used to answer four research questions and are classified according to insights gained during the review process. There are similarities between the methods of the different papers that can serve as a starting point for conducting research in the field of database compression.
  • Dönges, Saska (2021)
    Bit vectors have many applications, in succinct data structures, compression, and bioinformatics among others. Any improvement in bit vector performance translates into improvements in these applications. In this thesis we focus on dynamic bit vector performance. Fully dynamic succinct bit vectors enable other dynamic succinct data structures, for example dynamic compressed strings. We briefly discuss the theory of bit vectors and the current state of research on static and dynamic bit vectors. The main focus of the thesis is on our research into improving the dynamic bit vector implementation in the DYNAMIC C++ library (Prezza, 2017). Our main contribution is the addition of buffering to speed up insertions and deletions without negatively impacting non-modifying operations. In addition, we optimized some of the code in the DYNAMIC library and experimented with vectorizing some of the access operations. Our code optimizations yield a substantial improvement in insertion and deletion performance. Our buffering implementation speeds up insertions and deletions significantly, with negligible impact on other operations and space efficiency. Our implementation acts as a proof of concept for buffering and suggests that future research into more advanced buffering is likely to increase performance further. Finally, our testing indicates that using vectorized instructions from the AVX2 and AVX-512 instruction set extensions is beneficial in at least some cases and should be researched further. Our implementation, available at https://github.com/saskeli/DYNAMIC, should only be considered a proof of concept, as there are known bugs in some of the operations that are not extensively tested.
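A toy illustration of the buffering idea, assuming a simple list-backed leaf rather than the actual DYNAMIC data structures: insertions are staged in a small buffer and merged into the backing storage only when the buffer fills, so the expensive shifting is amortised, while accesses account for pending insertions on the fly.

```python
class BufferedBitLeaf:
    """Toy buffered leaf of a dynamic bit vector (illustrative sketch only)."""

    def __init__(self, bits, buffer_size=8):
        self.bits = list(bits)
        self.buffer = []                       # pending (logical position, bit) insertions
        self.buffer_size = buffer_size

    def insert(self, pos, bit):
        # Buffered bits at or after pos move one step right in the logical vector.
        self.buffer = [(p + 1 if p >= pos else p, b) for p, b in self.buffer]
        self.buffer.append((pos, bit))
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def access(self, pos):
        for p, b in self.buffer:               # is there a pending insertion at pos?
            if p == pos:
                return b
        offset = sum(1 for p, _ in self.buffer if p < pos)
        return self.bits[pos - offset]         # skip over pending bits before pos

    def _flush(self):
        for p, b in sorted(self.buffer):       # apply in ascending position order
            self.bits.insert(p, b)
        self.buffer = []
```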
  • Häkkinen, Iira (2024)
    Foundation models have the potential to reduce the level of supervision required for medical image segmentation tasks. Currently, the medical image segmentation field still largely relies on supervised, task-specific models. The aim of this thesis is to investigate whether a foundation model, the Segment Anything Model (SAM), can be used to reduce the level of supervision needed for medical image segmentation. The main goal of this thesis is to see whether the annotation workload required to generate labeled medical segmentation datasets can be significantly reduced with the help of the Segment Anything Model. The second goal is to validate the zero-shot performance of the Segment Anything Model on a medical segmentation dataset. A UNet model is used as a baseline. The results of this thesis give positive feedback on SAM's suitability as a tool for medical image annotation. During the experiments, it was found that, especially for homogeneous, clearly outlined targets such as organs, training a UNet model on "pseudo labels" generated by SAM resulted in accuracy comparable to training a UNet model on human-annotated labels. Furthermore, the results show that zero-shot SAM has somewhat comparable performance to UNet, and even beats UNet in two of the experimented tasks. For one complexly structured task, both SAM and the UNet trained on pseudo labels generated from SAM's masks fail to produce accurate results. It is notable that some of the tasks have small training dataset sizes, which limits the test accuracy of UNet. The results are in accordance with recent literature showing that zero-shot SAM can have performance comparable to state-of-the-art models for large and distinct objects, but for small, complex structures SAM does not match the accuracy of state-of-the-art medical segmentation models.
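A sketch of how such pseudo labels can be generated with the publicly released segment_anything package, assuming a point-prompt workflow; the checkpoint path and prompt coordinates are placeholders, and the thesis's actual annotation pipeline may differ.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry  # pip install segment-anything

# Load a pretrained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def pseudo_label(image_rgb, prompt_xy):
    """Return a binary mask for the object at prompt_xy (one foreground click)."""
    predictor.set_image(image_rgb)                    # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([prompt_xy]),           # one prompt point (x, y)
        point_labels=np.array([1]),                   # 1 = foreground click
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]              # keep the highest-scoring mask

# The resulting masks can then serve as training targets for a UNet
# in place of human-annotated labels.
```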
  • Kropotov, Ivan (2020)
    Reinforcement learning (RL) is a basic machine learning method which has recently gained in popularity. As the field matures, RL methods are being applied to progressively more complex problems. This leads to a need to design increasingly complicated models, which are difficult to train and apply in practice. This thesis explores one potential way of addressing the problem of large and slow RL models: using a modular approach to build them. The idea behind this approach is to decompose the main task into smaller subtasks and to have separate modules, each of which concentrates on solving a single subtask. In more detail, the proposed agent is built using the Q-decomposition algorithm, which provides a simple and robust way of building modular RL agents. The problem we use to demonstrate the usefulness of the modular approach is a simplified version of the video game Doom, for which we design an RL agent that learns to play it. The empirical results indicate that the proposed model is able to learn to play the simplified version of Doom at a reasonable level, but not perfectly. Additionally, we show that the proposed model may suffer from the use of overly simple models for solving the subtasks. Nevertheless, taken as a whole, the results and the experience of designing the agent show that the modular approach to RL is a promising way forward and warrants further exploration.
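The arbitration step of Q-decomposition can be sketched briefly: each module maintains its own Q-values for its sub-reward, and the agent takes the action maximising their sum. The two random tabular modules below are a made-up illustration, not the Doom agent.

```python
import numpy as np

def select_action(state, modules):
    """Q-decomposition arbitrator: each module reports its Q-values for the
    current state, and the agent takes the action maximising the summed values."""
    q_sum = sum(module.q_values(state) for module in modules)
    return int(np.argmax(q_sum))

class TabularModule:
    """Made-up module with its own Q-table, which in a real agent would be
    learned from the module's local reward signal; random here for illustration."""
    def __init__(self, n_states, n_actions, rng):
        self.q = rng.normal(size=(n_states, n_actions))

    def q_values(self, state):
        return self.q[state]

rng = np.random.default_rng(0)
modules = [TabularModule(10, 4, rng), TabularModule(10, 4, rng)]  # e.g. two subtasks
print(select_action(state=3, modules=modules))
```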
  • Wang, Ruilin (2023)
    This thesis is an integral part of an interdisciplinary research endeavor that provides computer science-driven approaches and deep learning methodologies that integrate seamlessly into the broader research conducted by scholars in digital humanities and history. Utilizing deep learning techniques, this research investigates the printers and publishers of 18th-century books, with a specific focus on the prominent family-based printing dynasty called Tonson. By identifying common visual elements, the thesis facilitates a comparative analysis of associations between different printers, providing valuable insights into the historical context and artistic characteristics of the Tonson dynasty. The thesis begins by discussing various deep learning models trained on an expert-annotated dataset, enabling the extraction of five main categories and sixteen subcategories of visual elements. Notably, the Mask R-CNN model demonstrates superior performance, particularly in detecting headpieces and initials. The study then delves into the grouping of initials and headpieces within the dataset. The SimCLR model is employed with data augmentation techniques that simulate the noise inherent in the dataset. This enables the generation of distinct embeddings for each initial and headpiece. Various unsupervised learning methods are applied, with hierarchical clustering emerging as the most effective technique. Higher similarity scores for headpieces compared to initials indicate that similar headpieces are easier to identify. We further discuss potential applications from a historical perspective, including book pricing, future avenues for accurately identifying related printers, and temporal research concerning the Tonson dynasty. In conclusion, this thesis presents a novel integration of computer science and deep learning methodologies into the fields of digital humanities and historical studies. By focusing on the Tonson dynasty, it provides a comprehensive analysis of printers and publishers of 18th-century books, ultimately contributing to a deeper understanding of this historical period.
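A sketch of the grouping step described above, assuming embeddings of the kind a SimCLR model produces: the vectors are normalised and clustered hierarchically with scikit-learn. The random embeddings and the distance threshold are placeholders rather than the thesis's settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

# Placeholder embeddings standing in for SimCLR outputs (one row per ornament image).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))

# Normalise so that Euclidean distance behaves like cosine distance,
# then cluster hierarchically without fixing the number of groups in advance.
clusterer = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=1.0,
                                    linkage="average")
labels = clusterer.fit_predict(normalize(embeddings))

print("number of groups:", labels.max() + 1)
```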