Cost-Optimal Correlation Clustering via Partial Maximum Satisfiability

dc.date.accessioned 2014-05-27T09:41:35Z und
dc.date.accessioned 2017-10-24T12:21:30Z
dc.date.available 2014-05-27T09:41:35Z und
dc.date.available 2017-10-24T12:21:30Z
dc.date.issued 2014-05-27T09:41:35Z
dc.identifier.uri http://radr.hulib.helsinki.fi/handle/10138.1/3736 und
dc.identifier.uri http://hdl.handle.net/10138.1/3736
dc.title Cost-Optimal Correlation Clustering via Partial Maximum Satisfiability en
ethesis.discipline Applied Mathematics en
ethesis.discipline Soveltava matematiikka fi
ethesis.discipline Tillämpad matematik sv
ethesis.discipline.URI http://data.hulib.helsinki.fi/id/2646f59d-c072-44e7-b1c1-4e4b8b798323
ethesis.department.URI http://data.hulib.helsinki.fi/id/61364eb4-647a-40e2-8539-11c5c0af8dc2
ethesis.department Institutionen för matematik och statistik sv
ethesis.department Department of Mathematics and Statistics en
ethesis.department Matematiikan ja tilastotieteen laitos fi
ethesis.faculty Matematisk-naturvetenskapliga fakulteten sv
ethesis.faculty Matemaattis-luonnontieteellinen tiedekunta fi
ethesis.faculty Faculty of Science en
ethesis.faculty.URI http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university Helsingfors universitet sv
ethesis.university University of Helsinki en
ethesis.university Helsingin yliopisto fi
dct.creator Berg, Jeremias
dct.issued 2014
dct.language.ISO639-2 eng
dct.abstract Clustering is one of the core problems of unsupervised machine learning. In a clustering problem we are given a set of data points and asked to partition them into smaller subgroups, known as clusters, such that each point is assigned to exactly one cluster. The quality of the obtained partitioning (clustering) is then evaluated according to some objective measure dependent on the specific clustering paradigm. The traditional approach to solving clustering problems within the machine learning community has focused on approximate local search algorithms that in general cannot provide optimality guarantees for the clusterings produced. However, recent advances in the field of constraint optimization have allowed for an alternative view on clustering and many other data analysis problems. The alternative view is based on stating the problem at hand in some declarative language and then using generic solvers for that language to solve the problem optimally. This thesis contributes to this approach to clustering by providing a first study on the applicability of state-of-the-art Boolean optimization procedures to cost-optimal correlation clustering under constraints in a general similarity-based setting. The correlation clustering paradigm is geared towards classifying data based on qualitative, as opposed to quantitative, similarity information on pairs of data points. Furthermore, correlation clustering does not require the number of clusters as input. This makes it especially well suited to problem domains in which the true number of clusters is unknown. In this thesis we formulate correlation clustering within the language of propositional logic. As is often done within computational logic, we focus only on formulas in conjunctive normal form (CNF), a restriction that can be made without loss of generality.
When encoded as a CNF formula, the correlation clustering problem becomes an instance of partial Maximum Satisfiability (MaxSAT), the optimization version of the Boolean satisfiability (SAT) problem. We present three different encodings of correlation clustering into CNF formulas and provide proofs of the correctness of each encoding. We also evaluate them experimentally by applying a state-of-the-art MaxSAT solver to the resulting MaxSAT instances. The experiments demonstrate both the scalability of our method and the quality of the clusterings obtained. As a more theoretical result, we prove that the input graph can be assumed to be undirected without loss of generality. This justifies the applicability of our encodings to all variants of correlation clustering known to us. This thesis also addresses another clustering paradigm, namely constrained correlation clustering. In constrained correlation clustering, additional constraints are used to restrict the acceptable solutions to the correlation clustering problem, for example according to domain-specific knowledge provided by an expert. We demonstrate how our MaxSAT-based approach to correlation clustering extends naturally to constrained correlation clustering. Furthermore, we show experimentally that added user knowledge allows clustering larger datasets, decreases the running time of our approach, and quickly steers the obtained clusterings towards a predefined ground-truth clustering. en
dct.language en
ethesis.language.URI http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language English en
ethesis.language englanti fi
ethesis.language engelska sv
ethesis.thesistype pro gradu-avhandlingar sv
ethesis.thesistype pro gradu -tutkielmat fi
ethesis.thesistype master's thesis en
ethesis.thesistype.URI http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn URN:NBN:fi-fe2017112251337
dc.type.dcmitype Text
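
Illustrative note (not part of the original record): the abstract describes casting correlation clustering as partial MaxSAT, with hard clauses that must hold and weighted soft clauses whose violations are minimized. The sketch below shows the general idea on a toy instance. It assumes a simple transitivity-style encoding with one Boolean variable per pair of points, which may or may not correspond to any of the three encodings in the thesis. The names `encode` and `solve`, the signed-weight input format, and the brute-force search standing in for a real MaxSAT solver are all hypothetical choices made for illustration.

```python
from itertools import combinations, permutations, product


def encode(n, weights):
    """Build hard and soft clauses for n points.

    weights maps a pair (i, j) with i < j to a signed similarity:
    positive means "similar", negative means "dissimilar" (assumed format).
    """
    pairs = list(combinations(range(n), 2))
    # Variable k is true exactly when the k-th pair shares a cluster.
    var = {p: k + 1 for k, p in enumerate(pairs)}

    def same(a, b):
        return var[(min(a, b), max(a, b))]

    hard = []
    for i, j, k in permutations(range(n), 3):
        if i < k:  # skip mirrored duplicates of the same clause
            # Transitivity: same(i, j) and same(j, k) imply same(i, k).
            hard.append([-same(i, j), -same(j, k), same(i, k)])

    soft = []
    for p, w in weights.items():
        if w > 0:
            soft.append((w, [var[p]]))    # prefer similar pairs together
        elif w < 0:
            soft.append((-w, [-var[p]]))  # prefer dissimilar pairs apart
    return var, hard, soft


def solve(n, weights):
    """Exhaustive search over assignments; only feasible for tiny n."""
    var, hard, soft = encode(n, weights)

    def holds(clause, bits):
        return any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)

    best = None
    for bits in product([False, True], repeat=len(var)):
        if not all(holds(c, bits) for c in hard):
            continue  # violates a hard clause, so not a valid clustering
        cost = sum(w for w, c in soft if not holds(c, bits))
        if best is None or cost < best[0]:
            best = (cost, bits)

    cost, bits = best
    # Decode the assignment into clusters using each cluster's first member.
    clusters = []
    for v in range(n):
        for c in clusters:
            if bits[var[(c[0], v)] - 1]:
                c.append(v)
                break
        else:
            clusters.append([v])
    return cost, clusters


toy = {(0, 1): 2, (1, 2): 1, (2, 3): 3, (0, 2): -1, (0, 3): -2, (1, 3): -2}
print(solve(4, toy))
```

Note that the number of clusters is never supplied: it falls out of the optimal assignment, in line with the abstract's remark that correlation clustering does not require it as input. In practice the clause lists would be handed to a MaxSAT solver instead of being enumerated exhaustively.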

Files in this item

Files Size Format View
paper.pdf 592.9Kb PDF