Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm

Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm

dc.date.accessioned	2016-09-26T18:35:41Z	und
dc.date.accessioned	2017-10-24T12:24:18Z
dc.date.available	2016-09-26T18:35:41Z	und
dc.date.available	2017-10-24T12:24:18Z
dc.date.issued	2016-09-26T18:35:41Z
dc.identifier.uri	http://radr.hulib.helsinki.fi/handle/10138.1/5747	und
dc.identifier.uri	http://hdl.handle.net/10138.1/5747
dc.title	Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm	en
ethesis.discipline	Computer science	en
ethesis.discipline	Tietojenkäsittelytiede	fi
ethesis.discipline	Datavetenskap	sv
ethesis.discipline.URI	http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI	http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department	Institutionen för datavetenskap	sv
ethesis.department	Department of Computer Science	en
ethesis.department	Tietojenkäsittelytieteen laitos	fi
ethesis.faculty	Matematisk-naturvetenskapliga fakulteten	sv
ethesis.faculty	Matemaattis-luonnontieteellinen tiedekunta	fi
ethesis.faculty	Faculty of Science	en
ethesis.faculty.URI	http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI	http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university	Helsingfors universitet	sv
ethesis.university	University of Helsinki	en
ethesis.university	Helsingin yliopisto	fi
dct.creator	Norri, Tuukka
dct.issued	2016
dct.language.ISO639-2	eng
dct.abstract	Approximate string matching considers finding a given text pattern in another text while allowing some number of differences. In the offline version of the problem, the text is known beforehand and is processed to generate an indexing data structure. While the problem has received a lot of attention and it has many practical uses in bioinformatics, the common tools often do not make use of the algorithms the time and space complexities of which are the best ones known. Hence it is interesting to compare the performance of an efficient algorithm to tools that make use of heuristics. In this work, a pattern matching algorithm by T.-W. Lam, W.-K. Sung and S.-S. Wong is described. An implementation of the algorithm is provided and tested against two other tools, namely Erne 2 and readaligner 2012. The algorithm by Lam, Sung and Wong searches the text for the pattern while allowing one mismatch or difference, that is, also allowing character insertion and deletion. It makes use of certain types of compressed suffix array and compressed suffix tree that provide fast operations. Additionally, to restrict the search to relevant parts of the suffix tree, a sample is taken from the suffix array and the sampled indices are stored into a data structure that provides double logarithmic worst case range queries. To find the pattern in the text while allowing k errors, the algorithm is combined with a dynamic programming algorithm. The latter is used to find partial matches with k - 1 errors. The candidate occurrences are located from the suffix tree and these branches are used with Lam, Sung and Wong's algorithm. For a text of length n and a pattern of length m drawn from an alphabet of size σ, the time complexity of the algorithm is O(σ^k m^k (k + log log n) + occ) using an O(n (log n)^(1/2) log σ)-bit indexing data structure, where occ is the number of occurrences in the text, given that σ is O(2^((log n)^(1/2))). For short patterns, this is the best known time complexity with an indexing data structure of the given size. The results indicate that in practice relying on heuristics yields better results in terms of time and memory use. While such an outcome is not remarkable, some important data structures were implemented in the process. An implementation of S. S. Rao's compressed suffix array already existed but it was rewritten to allow using different supporting data structures for e.g. rank and select support. The inverse suffix array described by Lam, Sung and Wong was implemented. Also, while implementations of X and Y-fast were available, to the author's knowledge a publicly available implementation of these combined with perfect hashing had not been produced before. Moreover, no implementation of Lam, Sung and Wong's algorithm was known to exist before.	en
dct.language	en
ethesis.language.URI	http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language	English	en
ethesis.language	englanti	fi
ethesis.language	engelska	sv
ethesis.thesistype	pro gradu-avhandlingar	sv
ethesis.thesistype	pro gradu -tutkielmat	fi
ethesis.thesistype	master's thesis	en
ethesis.thesistype.URI	http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn	URN:NBN:fi-fe2017112251764
dc.type.dcmitype	Text

Files in this item

Files	Size	Format	View
gradu.pdf	363.5Kb	PDF

This item appears in the following Collection(s)

Faculty of Science [4203]

Show simple item record

Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm

Files in this item

This item appears in the following Collection(s)

Yhteystiedot

HELSINGIN YLIOPISTO