
Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters

dc.date.accessioned 2018-10-22T08:54:48Z
dc.date.available 2018-10-22T08:54:48Z
dc.date.issued 2018-10-22
dc.identifier.uri http://hdl.handle.net/123456789/21275
dc.title Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters en
ethesis.department Institutionen för datavetenskap sv
ethesis.department Department of Computer Science en
ethesis.department Tietojenkäsittelytieteen laitos fi
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingin yliopisto fi
ethesis.university University of Helsinki en
ethesis.university Helsingfors universitet sv
dct.creator Gallegos Gutierrez, Angel Manuel
dct.issued 2018
dct.language.ISO639-2 eng
dct.abstract Statistical bureaus are responsible for producing meaningful statistical publications. The reliability of those publications depends on the quality of the source dataset, and consequently a significant amount of resources is allocated to detecting and correcting inconsistencies before any statistical output is produced. In particular, Statistics Finland (Tilastokeskus) is developing a pilot project based on the selective data editing methodology, aiming to preserve high standards in the quality of its datasets while reducing manual interventions. Label noise is presumably common in real-world datasets, and the current development does not include a module capable of handling such inconsistencies. Moreover, the labels characterizing the instances are defined over a class hierarchy following a tree structure. Therefore, this thesis is an initial assessment of including a preprocessing module for explicit label noise recognition in two of Statistics Finland's survey datasets. Although automatic label noise corrections cannot be performed if high data quality standards are to be preserved, plausible replacements could be used as a tool to assist the manual interventions. Based on these motivations, this thesis focuses on explicitly recognizing hierarchical label inconsistencies and on the impact of label noise on hierarchical classification performance. The performance of several hierarchical classification techniques was assessed under different levels of artificial label noise. Only mandatory leaf node predictions were considered during the evaluations. Two promising noise filtering techniques were evaluated on their capability to uncover the artificially created label noise. Given that the labels are structured over a class hierarchy, the best performing hierarchical methods were selected to work as the base noise filters.
Although the results are not conclusive, some hierarchical classification methods showed a degree of robustness against label noise, and their performance is competitive with the conventional methods. The noise filtering techniques, in turn, were effective against hierarchical noise completely at random. Hierarchical adaptations of the noise filters remain competitive and might show signs of handling rare cases better. en
dct.language en
ethesis.isPublicationLicenseAccepted false
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language engelska sv
ethesis.language englanti fi
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.ethesis E-thesisID:ae748fa2-3454-454e-86c3-af1896e2c93e
ethesis.degreeprogram Algorithms and Machine Learning en
dct.identifier.urn URN:NBN:fi-fe201804208653
dc.type.dcmitype Text

Files in this item

File Size Format
computer_science_gallegos_gutierrez.pdf 840.0Kb PDF
