Sentence segmentation on poorly structured text using language models

dc.date.accessioned	2017-05-11T11:19:08Z	und
dc.date.accessioned	2017-10-24T12:24:25Z
dc.date.available	2017-05-11T11:19:08Z	und
dc.date.available	2017-10-24T12:24:25Z
dc.date.issued	2017-05-11T11:19:08Z
dc.identifier.uri	http://radr.hulib.helsinki.fi/handle/10138.1/6015	und
dc.identifier.uri	http://hdl.handle.net/10138.1/6015
dc.title	Sentence segmentation on poorly structured text using language models	en
ethesis.discipline	Computer science	en
ethesis.discipline	Tietojenkäsittelytiede	fi
ethesis.discipline	Datavetenskap	sv
ethesis.discipline.URI	http://data.hulib.helsinki.fi/id/1dcabbeb-f422-4eec-aaff-bb11d7501348
ethesis.department.URI	http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14
ethesis.department	Institutionen för datavetenskap	sv
ethesis.department	Department of Computer Science	en
ethesis.department	Tietojenkäsittelytieteen laitos	fi
ethesis.faculty	Matematisk-naturvetenskapliga fakulteten	sv
ethesis.faculty	Matemaattis-luonnontieteellinen tiedekunta	fi
ethesis.faculty	Faculty of Science	en
ethesis.faculty.URI	http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI	http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university	Helsingfors universitet	sv
ethesis.university	University of Helsinki	en
ethesis.university	Helsingin yliopisto	fi
dct.creator	Nikkari, Eeva
dct.issued	2017
dct.language.ISO639-2	eng
dct.abstract	Sentence segmentation is the task of dividing a text corpus into sentences. Segmenting well-structured and fully punctuated data into sentences is not a very difficult problem. However, when the data is poorly structured or missing punctuation, the task is more difficult. This thesis looks into this problem using probabilistic language modeling, with special emphasis on the n-gram model. We present theory related to language models and their evaluation, as well as empirical results achieved on documents provided by AlphaSense Oy and on the freely available Reuters-21578 corpus. The experiments on n-gram models focused on the following questions. How do the smoothing and the order of the n-gram affect the model? How well does a model trained on one type of data adapt to another type of text? How does retaining more or fewer symbols and punctuation marks affect the performance? And how much training data is enough for the model? The n-gram models performed rather well on the same type of data they were trained on. However, the performance was significantly worse when moving to another document type. In the absence of punctuation, the performance of the model was also rather poor. The conclusion is that the n-gram model seems inadequate for recovering sentence boundaries in difficult settings, such as separating an unpunctuated title from the body of the text.	en
dct.subject	sentence boundary disambiguation	en
dct.subject	SBD	en
dct.subject	sentence segmentation	en
dct.subject	machine learning	en
dct.subject	language modeling	en
dct.language	en
ethesis.language.URI	http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language	English	en
ethesis.language	englanti	fi
ethesis.language	engelska	sv
ethesis.thesistype	pro gradu-avhandlingar	sv
ethesis.thesistype	pro gradu -tutkielmat	fi
ethesis.thesistype	master's thesis	en
ethesis.thesistype.URI	http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn	URN:NBN:fi-fe2017112251276
dc.type.dcmitype	Text

Files in this item

Files	Size	Format	View
Nikkari_Eeva_Progradu_2017.pdf	560.0Kb	PDF

This item appears in the following Collection(s)

Faculty of Science [4203]
