Constrained or semi-supervised clustering uses constraints to steer the “pure” clustering process, i.e. one that only seeks to maximize the similarity of instances placed in the same cluster. The earliest work
focused on instance-level constraints, specifying that two instances must or must not be in the same cluster. Later work added further constraints, e.g. on the minimal size or the diameter of
clusters. The last 15 years have shown the appeal of declarative paradigms (Constraint Programming, SAT, Integer Linear Programming) for solving data mining problems (pattern mining, clustering) under constraints. Such frameworks make it possible to integrate various types of constraints when modeling a data mining problem. For instance, it has been shown that, given some “semantic” object descriptors, meaningful constraints can be defined for experts. Nevertheless, a trade-off has to be found between the expressiveness of the constraints and the efficiency of the solver.
As with pattern mining, we will explore several directions: the first aims to work on a single clustering and refine it based on user feedback; the second will develop approaches to merge several clusterings obtained either via different constrained clustering methods or via different samplings. Additionally, we have to decide which feedback to exploit: a cluster being split, some instances being reassigned, some instances not to be put together, or, more interestingly, constraints on the patterns that describe the clusters. To gather feedback, we intend to rely on visualization tools
such as GrouseFlocks, developed by the LABRI partner of this project.
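As an illustration of how one kind of feedback could be consumed downstream (the function name and the exact encoding are our own assumptions, not a project API), a cluster split requested by the user can be translated into pairwise constraints for the next constrained-clustering run:

```python
from itertools import combinations, product

def split_to_constraints(groups):
    """The user asks to split a cluster into the given instance groups:
    pairs within a group become must-link, pairs across groups cannot-link."""
    must_link, cannot_link = [], []
    for group in groups:
        must_link.extend(combinations(group, 2))
    for g1, g2 in combinations(groups, 2):
        cannot_link.extend(product(g1, g2))
    return must_link, cannot_link

# The user splits a cluster {3, 7, 9} into {3, 7} and {9}.
must, cannot = split_to_constraints([[3, 7], [9]])
print(must)    # [(3, 7)]
print(cannot)  # [(3, 9), (7, 9)]
```

Reassignments and "not together" feedback map to must-link and cannot-link pairs in the same spirit; pattern-level feedback would need the richer semantic constraints discussed below.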
- Designing new algorithms integrating semantic constraints. In the current state of the art, constraints mainly concern the structure of the clustering and, except for must-link and
cannot-link constraints, are usually imposed on all clusters, whereas feedback is often local to specific clusters. Moreover, a strength of this project is its tight integration of pattern mining and clustering, which makes semantic feedback possible. There have been some attempts to integrate semantic knowledge, but they are limited to imposing the same semantic constraint on all clusters. Feedback will allow constraints on specific clusters, as well as constraints on the whole clustering (thus involving several clusters).
New approaches must thus be defined. They could be based on fully declarative frameworks such as CP, but new hybrid approaches must also be developed, integrating declarative approaches with, for instance, deep learning: the former ensures the satisfaction of constraints, while the latter allows representation changes that amplify the effect of the constraints. Deep constrained clustering approaches have emerged, but they are limited to certain types of constraints and do not guarantee their satisfaction. Given feedback, a direct method is to run the clustering algorithm again with the new constraints, which may lead to a very different clustering. A final approach is to systematically explore the neighborhoods of clustering solutions until the required changes have been achieved.
- Integrating feedback on several clusterings. Instead of a single clustering that the user gives feedback on, one can offer her several choices and let her pick one or merge several that
show desired characteristics in some of their parts. To do this, we will need to develop two aspects. 1) Measures and explanations to compare/rank clusterings. There already exist
several criteria to compare two clusterings, for instance the Adjusted Rand Index (ARI) or the diversity criterion. These criteria are, however, ill-suited to use by experts, because they return a single global similarity score that the user must interpret. To make them useful, we will develop approaches that highlight the differences between clusterings and provide interpretations to the expert, the level of explanation depending on the structural or semantic information available. 2) Merging different clusterings under constraints given by the expert. There might not be a single preferred clustering but several that should be merged into a consensus while satisfying certain constraints. Two kinds of methods will be considered: purely declarative ones, which guarantee to find a merged clustering satisfying all the constraints at the cost of efficiency, and more efficient methods, which may not satisfy all the constraints.
- Use constraints as goodness criteria for clusterings. In the context of user interaction, when dealing with several clusterings, tools must be developed to select the most promising ones. Constraints can be used to maintain a set of clusterings to be proposed to the user. Recent work has proposed generating a (potentially large) collection of candidate clusterings, with or without taking constraints into account, and keeping only those that best satisfy the constraints. The advantage of such solutions is that they allow much more flexibility in the clustering techniques used and yield a diversity of solutions; the drawback is that constraint satisfaction is not guaranteed, which might not be acceptable to a user. This could also be used to measure how difficult the constraints are to satisfy, and thus offer an additional feedback option to users. New constraints close to
the initial ones could be generated to test the robustness of the clusterings.