Skip to main content
Login | Suomeksi | På svenska | In English

Cost-Optimal Correlation Clustering via Partial Maximum Satisfiability

Show full item record

Title: Cost-Optimal Correlation Clustering via Partial Maximum Satisfiability
Author(s): Berg, Jeremias
Contributor: University of Helsinki, Faculty of Science, Department of Mathematics and Statistics
Discipline: Applied Mathematics
Language: English
Acceptance year: 2014
Abstract:
Clustering is one of the core problems of unsupervised machine learning. In a clustering problem we are given a set of data points and asked to partition them into smaller subgroups, known as clusters, such that each point is assigned to exactly one cluster. The quality of the obtained partitioning (clustering) is then evaluated according to some objective measure dependent on the specific clustering paradigm. A traditional approach within the machine learning community to solving clustering problems has been focused on approximative, local search algorithms that in general can not provide optimality guarantees of the clusterings produced. However, recent advances in the field of constraint optimization has allowed for an alternative view on clustering, and many other data analysis problems. The alternative view is based on stating the problem at hand in some declarative language and then using generic solvers for that language in order to solve the problem optimally. This thesis contributes to this approach to clustering by providing a first study on the applicability of state-of-the-art Boolean optimization procedures to cost-optimal correlation clustering under constraints in a general similarity-based setting. The correlation clustering paradigm is geared towards classifying data based on qualitative--- as opposed to quantitative similarity information of pairs of data points. Furthermore, correlation clustering does not require the number of clusters as input. This makes it especially well suited to problem domains in which the true number of clusters is unknown. In this thesis we formulate correlation clustering within the language of propositional logic. As is often done within computational logic, we focus only on formulas in conjunctive normal form (CNF), a limitation which can be done without loss of generality. When encoded as a CNF-formula the correlation clustering problem becomes an instance of partial Maximum Satisfiability (MaxSAT), the optimization version of the Boolean satisfiability (SAT) problem. We present three different encodings of correlation clustering into CNF-formulas and provide proofs of the correctness of each encoding. We also experimentally evaluate them by applying a state-of-the-art MaxSAT solver for solving the resulting MaxSAT instances. The experiments demonstrate both the scalability of our method and the quality of the clusterings obtained. As a more theoretical result we prove that the assumption of the input graph being undirected can be done without loss of generality, this justifies our encodings being applicable to all variants of correlation clustering known to us. This thesis also addresses another clustering paradigm, namely constrained correlation clustering. In constrained correlation clustering additional constraints are used in order to restrict the acceptable solutions to the correlation clustering problem, for example according to some domain specific knowledge provided by an expert. We demonstrate how our MaxSAT-based approach to correlation clustering naturally extends to constrained correlation clustering. Furthermore we show experimentally that added user knowledge allows clustering larger datasets, decreases the running time of our approach, and steers the obtained clusterings fast towards a predefined ground-truth clustering.


Files in this item

Files Size Format View
paper.pdf 592.9Kb PDF

This item appears in the following Collection(s)

Show full item record