Skip to main content
Login | Suomeksi | På svenska | In English

Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm

Show simple item record

dc.date.accessioned 2016-09-26T18:35:41Z und
dc.date.accessioned 2017-10-24T12:24:18Z
dc.date.available 2016-09-26T18:35:41Z und
dc.date.available 2017-10-24T12:24:18Z
dc.date.issued 2016-09-26T18:35:41Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/5747 und
dc.identifier.uri http://hdl.handle.net/10138.1/5747
dc.title Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm en
ethesis.discipline Computer science en
ethesis.discipline Tietojenkäsittelytiede fi
ethesis.discipline Datavetenskap sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Norri, Tuukka
dct.issued 2016
dct.language.ISO639-2 eng
dct.abstract Approximate string matching considers finding a given text pattern in another text while allowing some number of differences. In the offline version of the problem, the text is known beforehand and is processed to generate an indexing data structure. While the problem has received a lot of attention and it has many practical uses in bioinformatics, the common tools often do not make use of the algorithms the time and space complexities of which are the best ones known. Hence it is interesting to compare the performance of an efficient algorithm to tools that make use of heuristics. In this work, a pattern matching algorithm by T.-W. Lam, W.-K. Sung and S.-S. Wong is described. An implementation of the algorithm is provided and tested against two other tools, namely Erne 2 and readaligner 2012. The algorithm by Lam, Sung and Wong searches the text for the pattern while allowing one mismatch or difference, that is, also allowing character insertion and deletion. It makes use of certain types of compressed suffix array and compressed suffix tree that provide fast operations. Additionally, to restrict the search to relevant parts of the suffix tree, a sample is taken from the suffix array and the sampled indices are stored into a data structure that provides double logarithmic worst case range queries. To find the pattern in the text while allowing k errors, the algorithm is combined with a dynamic programming algorithm. The latter is used to find partial matches with k - 1 errors. The candidate occurrences are located from the suffix tree and these branches are used with Lam, Sung and Wong's algorithm. For a text of length n and a pattern of length m drawn from an alphabet of size σ, the time complexity of the algorithm is O(σ^k m^k (k + log log n) + occ) using an O(n (log n)^(1/2) log σ)-bit indexing data structure, where occ is the number of occurrences in the text, given that σ is O(2^((log n)^(1/2))). For short patterns, this is the best known time complexity with an indexing data structure of the given size. The results indicate that in practice relying on heuristics yields better results in terms of time and memory use. While such an outcome is not remarkable, some important data structures were implemented in the process. An implementation of S. S. Rao's compressed suffix array already existed but it was rewritten to allow using different supporting data structures for e.g. rank and select support. The inverse suffix array described by Lam, Sung and Wong was implemented. Also, while implementations of X and Y-fast were available, to the author's knowledge a publicly available implementation of these combined with perfect hashing had not been produced before. Moreover, no implementation of Lam, Sung and Wong's algorithm was known to exist before. en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112251764
dc.type.dcmitype Text

Files in this item

Files Size Format View
gradu.pdf 363.5Kb PDF

This item appears in the following Collection(s)

Show simple item record