Skip to main content
Login | Suomeksi | På svenska | In English

Analysis and experimental evaluation of an approximation algorithm for the length of an optimal Lempel-Ziv parsing

Show simple item record

dc.date.accessioned 2019-08-13T12:38:08Z
dc.date.available 2019-08-13T12:38:08Z
dc.date.issued 2019-08-13
dc.identifier.uri http://hdl.handle.net/123456789/25091
dc.title Analysis and experimental evaluation of an approximation algorithm for the length of an optimal Lempel-Ziv parsing en
ethesis.discipline Tietojenkäsittelytiede und
ethesis.department Tietojenkäsittelytieteen osasto und
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingin yliopisto fi
ethesis.university University of Helsinki en
ethesis.university Helsingfors universitet sv
dct.creator Nietosvaara, Joonas
dct.issued 2019
dct.language.ISO639-2 eng
dct.abstract We examine a previously known sublinear-time algorithm for approximating the length of a string’s optimal (i.e. shortest) Lempel-Ziv parsing (a.k.a. LZ77 factorization). This length is a measure of compressibility under the LZ77 compression algorithm, so the algorithm also estimates a string’s compressibility. The algorithm’s approximation approach is based on a connection between optimal Lempel-Ziv parsing length and the number of distinct substrings of different lengths in a string. Some aspects of the algorithm are described more explicitly than in earlier work, including the constraints on its input and how to distinguish between strings with short vs. long optimal parsings in sublinear time; several proofs (and pseudocode listings) are also more detailed than in earlier work. An implementation of the algorithm is provided. We experimentally investigate the algorithm’s practical usefulness for estimating the compressibility of large collections of data. The algorithm is run on real-world data under a wide range of approximation parameter settings. The accuracy of the resulting estimates is evaluated. The estimates turn out to be consistently highly inaccurate, albeit always inside the stated probabilistic error bounds. We conclude that the algorithm is not promising as a practical tool for estimating compressibility. We also examine the empirical connection between optimal parsing length and the number of distinct substrings of different lengths. The latter turns out to be a suprisingly accurate predictor of the former within our test data, which suggests avenues for future work. en
dct.language en
ethesis.isPublicationLicenseAccepted true
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language englanti fi
ethesis.language English en
ethesis.language engelska sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.ethesis E-thesisID:8dac80a2-9dad-4191-ad46-ed3c41f0e52d
dct.identifier.urn URN:NBN:fi:hulib-201908133201
dc.type.dcmitype Text
ethesis.facultystudyline.URI none
ethesis.mastersdegreeprogram.URI none

Files in this item

Files Size Format View
Nietosvaara_Joonas_Pro_gradu_2019.pdf 760.2Kb PDF

This item appears in the following Collection(s)

Show simple item record