Browsing by Author "Haapasalmi, Risto"

  • Haapasalmi, Risto (2020)
    In recent years, highly compact succinct text indexes developed in bioinformatics have spread to the domain of natural language processing, in particular to n-gram indexing. One line of research has been to use compressed suffix trees as both the text index and the language model. Compressed suffix trees have several favourable properties for compressing n-gram strings and their associated satellite data while allowing both fast access and fast computation of the language model probabilities over the text. For count-based n-gram language models, and especially for low-order n-gram models, the Kneser-Ney language model has long been the de facto industry standard. Shareghi et al. showed how to use a compressed suffix tree to build a highly compact index that is competitive in space with state-of-the-art language models. In addition, they showed that the index can serve as a language model, allowing modified Kneser-Ney probabilities to be computed directly from the data structure. This thesis analyzes and extends the work of Shareghi et al. on building a modified Kneser-Ney language model based on a compressed suffix tree. We explain their solution and present three attempts to improve the approach. Of the three experiments, one performed far worse than the original approach, while two showed minor gains in time with no real loss in space.
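
To make concrete which statistics a Kneser-Ney model needs from its index, here is a minimal sketch of an interpolated Kneser-Ney bigram model. It is an illustration only, under several assumptions not taken from the thesis: it uses the simpler single-discount variant rather than the modified (multi-discount) one, it is limited to bigrams, and it keeps counts in plain Python dictionaries instead of a compressed suffix tree. The function name `train_bigram_kn` and the default `discount=0.75` are hypothetical choices.

```python
from collections import Counter, defaultdict


def train_bigram_kn(tokens, discount=0.75):
    """Return prob(word, prev): interpolated Kneser-Ney bigram probability.

    Single-discount variant, assumed here for illustration only.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()       # c(prev, *): how often prev starts a bigram
    followers = defaultdict(set)     # distinct words observed after prev
    predecessors = defaultdict(set)  # distinct words observed before word
    for (prev, word), count in bigrams.items():
        context_totals[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    distinct_bigrams = len(bigrams)

    def prob(word, prev):
        # Continuation probability: share of distinct bigram types ending in word.
        p_cont = len(predecessors.get(word, ())) / distinct_bigrams
        total = context_totals[prev]
        if total == 0:
            # Unseen context: fall back to the continuation distribution.
            return p_cont
        discounted = max(bigrams[(prev, word)] - discount, 0.0) / total
        # Mass removed by discounting, redistributed via the lower-order model.
        backoff_weight = discount * len(followers.get(prev, ())) / total
        return discounted + backoff_weight * p_cont

    return prob
```

The quantities used here (occurrence counts, and the number of distinct left and right neighbours of a pattern) are the kind of statistics that an index-based language model has to answer for arbitrary n-grams; the dictionaries above merely stand in for such an index.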