Sentence segmentation on poorly structured text using language models

Title: Sentence segmentation on poorly structured text using language models
Author(s): Nikkari, Eeva
Contributor: University of Helsinki, Faculty of Science, Department of Computer Science
Discipline: Computer science
Language: English
Acceptance year: 2017
Abstract:
Sentence segmentation is the task of dividing a text corpus into sentences. Segmenting well-structured, fully punctuated data is not a very difficult problem. However, when the data is poorly structured or missing punctuation, the task is more difficult. This thesis looks into this problem using probabilistic language modeling, with special emphasis on the n-gram model. We present theory related to language models and their evaluation, as well as empirical results achieved on documents provided by AlphaSense Oy and on the freely available Reuters-21578 corpus. The experiments on n-gram models focused on the following questions. How do the smoothing and the order of the n-gram affect the model? How well does a model trained on one type of data adapt to another type of text? How does retaining more or fewer symbols and punctuation marks affect the performance? And how much training data is enough for the model? The n-gram models performed rather well on the same type of data they were trained on. However, the performance was significantly worse when moving to another document type. In the absence of punctuation, the performance of the model was also rather poor. The conclusion is that the n-gram model seems inadequate for recovering sentence boundaries in difficult settings, such as separating an unpunctuated title from the body of the text.
Keyword(s): sentence boundary disambiguation; SBD; sentence segmentation; machine learning; language modeling


Files in this item

File: Nikkari_Eeva_Progradu_2017.pdf
Size: 560.0 Kb
Format: PDF