
Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters


Title: Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters
Author(s): Gallegos Gutierrez, Angel Manuel
Contributor: University of Helsinki, Faculty of Science, Department of Computer Science
Language: English
Acceptance year: 2018
Abstract:
Statistical bureaus are responsible for producing meaningful statistical publications. The reliability of these publications depends on the quality of the source dataset, so a significant amount of resources is allocated to detecting and correcting inconsistencies before any statistical output is produced. In particular, Statistics Finland (Tilastokeskus) is developing a pilot project based on the selective data editing methodology, aiming to preserve high standards in the quality of its datasets while reducing manual interventions. Label noise is presumably common in real-world datasets, and the current development does not include a module capable of handling such inconsistencies. Moreover, the labels characterizing the instances are defined over a class hierarchy with a tree structure. This thesis is therefore an initial assessment for including a preprocessing module for explicit label noise recognition in two of Statistics Finland's survey datasets. Although automatic label noise corrections cannot be performed if high data quality standards are to be preserved, plausible replacements could be used as a tool to assist the manual interventions. Based on these motivations, the thesis focused on explicitly recognizing hierarchical label inconsistencies and on the impact of label noise on hierarchical classification performance. The performance of several hierarchical classification techniques was assessed under different levels of artificial label noise. Only mandatory leaf node predictions were considered during the evaluations. Two promising noise filtering techniques were evaluated for their ability to uncover the artificially created label noise. Given that the labels are structured over a class hierarchy, the best-performing hierarchical methods were selected to serve as the base noise filters.
Although the results were not conclusive, some hierarchical classification methods showed a degree of robustness against label noise, and their performance was competitive with that of conventional methods. The noise filtering techniques, in turn, were effective against hierarchical noise completely at random. Hierarchical adaptations of the noise filters remained competitive and showed signs of handling rare cases better.
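The evaluation setup described in the abstract (inject artificial label noise, run an ensemble filter, check how much of the injected noise it uncovers) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the function names, the dotted leaf labels such as "A.1", the uniform flip to another leaf as the completely-at-random noise model, and the majority-vote rule for the ensemble.

```python
import random


def inject_ncar_noise(labels, leaf_labels, noise_rate, seed=0):
    """Flip each label, with probability noise_rate, to a different leaf.

    Assumed noise model: the replacement leaf is drawn uniformly and
    independently of the instance (noise completely at random).
    Returns the noisy labels and the set of indices that were flipped.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    flipped = set()
    for i, y in enumerate(labels):
        if rng.random() < noise_rate:
            alternatives = [leaf for leaf in leaf_labels if leaf != y]
            noisy[i] = rng.choice(alternatives)
            flipped.add(i)
    return noisy, flipped


def majority_vote_filter(noisy_labels, predictions_per_model):
    """Flag an instance when more than half of the models disagree with its label.

    predictions_per_model holds one list of predicted leaf labels per base
    classifier (hypothetical outputs; any hierarchical classifier could
    produce them).
    """
    n_models = len(predictions_per_model)
    flagged = set()
    for i, y in enumerate(noisy_labels):
        disagreements = sum(1 for preds in predictions_per_model if preds[i] != y)
        if 2 * disagreements > n_models:
            flagged.add(i)
    return flagged


def filter_precision_recall(flagged, flipped):
    """Score the filter against the known set of artificially flipped indices."""
    true_positives = len(flagged & flipped)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(flipped) if flipped else 0.0
    return precision, recall
```

Because the noise is injected artificially, the flipped indices are known, which is what makes it possible to score a filter by precision and recall in this kind of experiment.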


Files in this item: computer_science_gallegos_gutierrez.pdf (PDF, 840.0Kb)
