Cost-Optimal Correlation Clustering via Partial Maximum Satisfiability

dc.date.accessioned	2014-05-27T09:41:35Z	und
dc.date.accessioned	2017-10-24T12:21:30Z
dc.date.available	2014-05-27T09:41:35Z	und
dc.date.available	2017-10-24T12:21:30Z
dc.date.issued	2014-05-27T09:41:35Z
dc.identifier.uri	http://radr.hulib.helsinki.fi/handle/10138.1/3736	und
dc.identifier.uri	http://hdl.handle.net/10138.1/3736
dc.title	Cost-Optimal Correlation Clustering via Partial Maximum Satisfiability	en
ethesis.discipline	Applied Mathematics	en
ethesis.discipline	Soveltava matematiikka	fi
ethesis.discipline	Tillämpad matematik	sv
ethesis.discipline.URI	http://data.hulib.helsinki.fi/id/2646f59d-c072-44e7-b1c1-4e4b8b798323
ethesis.department.URI	http://data.hulib.helsinki.fi/id/61364eb4-647a-40e2-8539-11c5c0af8dc2
ethesis.department	Institutionen för matematik och statistik	sv
ethesis.department	Department of Mathematics and Statistics	en
ethesis.department	Matematiikan ja tilastotieteen laitos	fi
ethesis.faculty	Matematisk-naturvetenskapliga fakulteten	sv
ethesis.faculty	Matemaattis-luonnontieteellinen tiedekunta	fi
ethesis.faculty	Faculty of Science	en
ethesis.faculty.URI	http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca
ethesis.university.URI	http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97
ethesis.university	Helsingfors universitet	sv
ethesis.university	University of Helsinki	en
ethesis.university	Helsingin yliopisto	fi
dct.creator	Berg, Jeremias
dct.issued	2014
dct.language.ISO639-2	eng
dct.abstract	Clustering is one of the core problems of unsupervised machine learning. In a clustering problem we are given a set of data points and asked to partition them into smaller subgroups, known as clusters, such that each point is assigned to exactly one cluster. The quality of the obtained partitioning (clustering) is then evaluated according to some objective measure dependent on the specific clustering paradigm. The traditional approach within the machine learning community to solving clustering problems has focused on approximate, local search algorithms that in general cannot provide optimality guarantees for the clusterings produced. However, recent advances in the field of constraint optimization have allowed for an alternative view on clustering and many other data analysis problems. The alternative view is based on stating the problem at hand in some declarative language and then using generic solvers for that language in order to solve the problem optimally. This thesis contributes to this approach to clustering by providing a first study on the applicability of state-of-the-art Boolean optimization procedures to cost-optimal correlation clustering under constraints in a general similarity-based setting. The correlation clustering paradigm is geared towards classifying data based on qualitative, as opposed to quantitative, similarity information about pairs of data points. Furthermore, correlation clustering does not require the number of clusters as input. This makes it especially well suited to problem domains in which the true number of clusters is unknown. In this thesis we formulate correlation clustering within the language of propositional logic. As is often done within computational logic, we focus only on formulas in conjunctive normal form (CNF), a restriction that can be made without loss of generality.
When encoded as a CNF formula, the correlation clustering problem becomes an instance of partial Maximum Satisfiability (MaxSAT), the optimization version of the Boolean satisfiability (SAT) problem. We present three different encodings of correlation clustering into CNF formulas and provide proofs of the correctness of each encoding. We also experimentally evaluate them by applying a state-of-the-art MaxSAT solver to the resulting MaxSAT instances. The experiments demonstrate both the scalability of our method and the quality of the clusterings obtained. As a more theoretical result, we prove that the assumption of the input graph being undirected can be made without loss of generality; this justifies our encodings being applicable to all variants of correlation clustering known to us. This thesis also addresses another clustering paradigm, namely constrained correlation clustering. In constrained correlation clustering, additional constraints are used to restrict the acceptable solutions to the correlation clustering problem, for example according to domain-specific knowledge provided by an expert. We demonstrate how our MaxSAT-based approach to correlation clustering naturally extends to constrained correlation clustering. Furthermore, we show experimentally that added user knowledge allows clustering larger datasets, decreases the running time of our approach, and steers the obtained clusterings quickly towards a predefined ground-truth clustering.	en
dct.language	en
ethesis.language.URI	http://data.hulib.helsinki.fi/id/languages/eng
ethesis.language	English	en
ethesis.language	englanti	fi
ethesis.language	engelska	sv
ethesis.thesistype	pro gradu-avhandlingar	sv
ethesis.thesistype	pro gradu -tutkielmat	fi
ethesis.thesistype	master's thesis	en
ethesis.thesistype.URI	http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis
dct.identifier.urn	URN:NBN:fi-fe2017112251337
dc.type.dcmitype	Text

Files in this item

Files	Size	Format	View
paper.pdf	592.9Kb	PDF

This item appears in the following Collection(s)

Faculty of Science [4203]
