Sentence segmentation on poorly structured text using language models

dc.date.accessioned 2017-05-11T11:19:08Z und
dc.date.accessioned 2017-10-24T12:24:25Z
dc.date.available 2017-05-11T11:19:08Z und
dc.date.available 2017-10-24T12:24:25Z
dc.date.issued 2017-05-11T11:19:08Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/6015 und
dc.identifier.uri http://hdl.handle.net/10138.1/6015
dc.title Sentence segmentation on poorly structured text using language models en
ethesis.discipline Computer science en
ethesis.discipline Tietojenkäsittelytiede fi
ethesis.discipline Datavetenskap sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Nikkari, Eeva
dct.issued 2017
dct.language.ISO639-2 eng
dct.abstract The sentence segmentation task is the task of segmenting a text corpus into sentences. Segmenting well-structured and fully punctuated data into sentences is not a very difficult problem. However, when the data is poorly structured or missing punctuation, the task is more difficult. This thesis will look into this problem by using probabilistic language modeling, with special emphasis on the n-gram model. We will present theory related to language models and evaluating them, as well as empirical results achieved on documents provided by AlphaSense Oy and the freely available Reuters-21578 corpus. The experiments on n-gram models focused on the following questions. How do the smoothing and order of the n-gram affect the model? How well does a model trained on one type of data adapt to another type of text? How does retaining more or fewer symbols and punctuation marks affect the performance? And how much training data is enough for the model? The n-gram models performed rather well on the same type of data they were trained on. However, the performance was significantly worse when moving to another document type. In the absence of punctuation, the performance of the model was also rather poor. The conclusion is that the n-gram model seems inadequate for recovering the sentence boundaries in difficult settings such as separating the unpunctuated title from the body of the text. en
dct.subject sentence boundary disambiguation en
dct.subject SBD en
dct.subject sentence segmentation en
dct.subject machine learning en
dct.subject language modeling en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112251276
dc.type.dcmitype Text

Files in this item

Files Size Format
Nikkari_Eeva_Progradu_2017.pdf 560.0Kb PDF
