Skip to main content
Login | Suomeksi | På svenska | In English

Browsing by Subject "data compression"

Sort by: Order: Results:

  • Vidjeskog, Martin (2022)
    The traditional way of computing the Burrows-Wheeler transform (BWT) has been to first build a suffix array, and then use this newly formed array to obtain the BWT. While this approach runs in linear time, the space requirement is far from optimal. When the length of the input string increases, the required working space quickly becomes too large for normal computers to handle. To overcome this issue, researchers have proposed many different types of algorithms for building the BWT. In 2009, Daisuke Okanohara and Kunihiko Sadakane presented a new linear time algorithm for BWT construction. This algorithm is relatively fast and requires far less working space than the traditional way of computing the BWT. It is based on a technique called induced sorting and can be seen as a state-of-the-art approach for internal memory BWT construction. However, a proper exploration of how to implement the algorithm efficiently has not been undertaken. One 32-bit implementation of the algorithm is known to exist, but due to the limitations of 32-bit programs, it can only be used for input strings under the size of 4 GB. This thesis introduces the algorithm from Okanohara and Sadakane and implements a 64-bit version of it. The implemented algorithm can in theory support input strings that are thousands of gigabytes in size. In addition to the explanation of the algorithm, the time and space requirements of the 64-bit implementation are compared to some other fast BWT algorithms.
  • Siipola, Ossi (2022)
    Inverted indices is a core index structure for different low-level structures, like search engines and databases. It stores a mapping from terms, numbers etc. to list of location in document, set of documents, database, table etc. and allows efficient full-text searches on indexed structure. Mapping location in the inverted indicies is usually called a postings list. In real life applications, scale of the inverted indicies size can grow huge. Therefore efficient representation of it is needed, but at the same time, efficient queries must be supported. This thesis explores ways to represent postings lists efficiently, while allowing efficient nextGEQ queries on the set. Efficient nextGEQ queries is needed to implement inverted indicies. First we convert postings lists into one bitvector, which concatenates each postings list's characteristic bitvector. Then representing an integer set efficiently converts to representing this bitvector efficiently, which is expected to have long runs of 0s and 1s. Run-length encoding of bitvector have recently led to promising results. Therefore in this thesis we experiment two encoding methods (Top-k Hybrid coder, RLZ) that encode postings lists via run-length encodes of the bitvector. We also investigate another new bitvector compression method (Zombit-vector), which encodes bitvectors by finding redundancies of runs of 0/1s. We compare all encoding to current state-of-the-art Partitioned Elisa-Fano (PEF) coding. Compression results on all encodings were more efficient than the current state-of-the-art PEF encoding. Zombit-vector nextGEQ query results were slighty more efficient than PEF's, which make it more attractive with bitvectors that have long runs of 0s and 1s. More work is needed with Top-k Hybrid coder and RLZ, so that those encodings nextGEQ can be compared to Zombit-vector and PEF.