A Method for Wavelet-Based Time Series Analysis of Historical Newspapers

Title: A Method for Wavelet-Based Time Series Analysis of Historical Newspapers
Author(s): Avikainen, Jari
Contributor: University of Helsinki, Faculty of Science
Discipline: none
Degree program: Master's Programme in Computer Science
Specialisation: Algorithms
Language: English
Acceptance year: 2019
Abstract:
This thesis presents a wavelet-based method for detecting moments of fast change in the textual contents of historical newspapers. The method generates time series of the relative frequencies of individual words in the newspaper contents and calculates their wavelet transforms. A wavelet transform is essentially a group of transformations describing the changes in the original time series at different time scales, and can therefore be used to pinpoint moments of fast change in the data. Fast changes in word frequencies are then detected by examining products of multiple scales of the transform. The properties of the wavelet transform and the related multi-scale product are evaluated with respect to detecting various kinds of steps and spikes in different noise environments. The suitability of the method for analysing historical newspaper archives is examined using an example corpus consisting of 487 issues of Uusi Suometar from 1869–1918 and 250 issues of Wiipuri from 1893–1918. Two problematic features of the newspaper data are identified: noise caused by OCR (optical character recognition) errors and the uneven temporal distribution of the data. Their effects on the results of the presented method are evaluated using synthetic data. Finally, the method is tested on the example corpus, and the results are examined briefly. The method is found to be adversely affected especially by the uneven temporal distribution of the newspaper data. Without additional processing or an improvement in the quality of the examined data, a significant number of the detected steps are due to noise in the data. Various ways of alleviating this effect are proposed, along with other suggested improvements to the system.
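The pipeline described in the abstract (relative word frequencies, a multi-scale transform, and a product across scales) can be illustrated with a minimal sketch. This is not the implementation from the thesis: the Haar-style difference of local means, the scales 1, 2 and 4, the function names, and the toy data are all assumptions made for illustration, and the sketch assumes evenly spaced issues, which the abstract notes is not true of the real corpus.

```python
import numpy as np

def relative_frequencies(word_counts, total_counts):
    """Relative frequency of one word per issue (its count / all words in the issue)."""
    return np.asarray(word_counts, dtype=float) / np.asarray(total_counts, dtype=float)

def haar_detail(series, scale):
    """Undecimated Haar-style detail coefficients at one scale (illustrative choice).

    detail[t] is the mean of the `scale` samples starting at t minus the mean
    of the `scale` samples just before t. Positions too close to either end
    of the series are left at zero.
    """
    x = np.asarray(series, dtype=float)
    detail = np.zeros(len(x))
    for t in range(scale, len(x) - scale + 1):
        detail[t] = x[t:t + scale].mean() - x[t - scale:t].mean()
    return detail

def multiscale_product(series, scales=(1, 2, 4)):
    """Pointwise product of the detail coefficients over several scales."""
    product = np.ones(len(series))
    for scale in scales:
        product *= haar_detail(series, scale)
    return product

# Hypothetical data: 40 issues of 1000 words each; the word's count jumps at issue 20.
word_counts = [5] * 20 + [25] * 20
total_counts = [1000] * 40

freq = relative_frequencies(word_counts, total_counts)
product = multiscale_product(freq)

# The position with the largest absolute product is the candidate step.
step_index = int(np.argmax(np.abs(product)))
print(step_index)
```

A genuine step produces a large coefficient at the same position on every scale, so the product stays large there, while noise that appears on only one scale is suppressed by the multiplication. In this noise-free toy series the product is non-zero only at the position of the jump.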


Files in this item

File: Avikainen_Jari_Pro_gradu_2019.pdf
Size: 1.264 MB
Format: PDF
