Integrating pattern mining and clustering methods, developing the interface

Clustering, pattern mining, and visualization are tightly coupled in our project to support interpretability/explainability and feedback options, as well as to improve mining. Integrating clustering, pattern mining, and visualization solutions, and taking expert input w.r.t. visualizations and feedback options into account while doing so will require to tackle several objectives:

  1. Interpretability A cluster (or several) can/should be represented in terms of discrete descriptors of contained instances, which helps with the dimensionality (and interpretability) problem: instead of projecting a cloud of instances onto a small (<=3 to be visually presentable) set of dimensions, the user is presented with a conjunction of attribute-value combinations, a sequence, subtree or subgraph, or disjunctions (or sets) thereof. For structured data, currently such descriptors, i.e. patterns, are mined beforehand and a subset is selected based on the clustering, which entails a trade-off – too few descriptors: the description will not give insights, especially if clusters change according to user feedback, too many : the clustering process will take long and suffer the curse of dimensionality, i.e. in high-dimensional spaces all instances seem equally (dis)similar. In InvolvD, we instead cluster on the original presentation, derive patterns from formed clusters, and adjust or remine patterns when clusters change due to user feedback. A pattern has to be presented in the context of other patterns and the data it was mined from to allow for meaningful interpretation and feedback. Compared to the paragraph section, the roles are therefore reversed: where the cluster was the result the user is mainly interested in and (sets of) patterns the descriptors that help interpret it, the user now considers the patterns and clusters (both in terms of the contained instances and other patterns derived from the clusters) offer the supporting information.
  2. Feedback. We will develop new feedback options in InvolvD but can discuss existing forms of feedback and how they should impact on the representations here. At the cluster level, a user can designate instances that should not be part of a cluster or that should be assigned to an existing cluster. Not only will this restart the clustering process (or refine the clustering), but also affect representative patterns whose support and discriminatory power will change. We will develop methods that take such additions/subtractions into account in real time, update patterns’ statistics (and potentially their position in a ranking), and restart the pattern mining process if statistics change strongly. The last point will be of major interest: how to decide automatically when the mining process should be restarted, with such decisions based on work in statistically sound pattern discovery. A generalization of the above is to demand that a cluster be split into subclusters or to actively designate a subset of a cluster that should be formed and separated from the rest. The number of clusters, or the cluster membership of certain instances, can be changed indirectly by changing constraint settings on the diameter, intra-cluster similarity, or minimum size of clusters. This will also require restarting or refining the clustering process but effects on patterns are not straight-forward. We will develop theoretical guarantees on the effect such changes have on pattern statistics and develop algorithms exploiting them. At the pattern level, users can indicate that they do not want certain elements in the pattern to occur (causing a reassessment of pattern elements and of covered instances), a rejection of a complete pattern (different from rejecting the instances that support it), or an increase/decrease of a pattern’s rank in a list w.r.t. other patterns, which translates into feedback about the measures for ranking, and therefore the mining process itself. Such feedback, i.e. limiting the vocabulary of the pattern language or establishing relative preferences, will be used as input to learning algorithms that indicate how to change constraints. Users can also constrain more general measures, such as the size of the pattern, which affects the mining process because it limits how far the search space can be explored.
  3. Mining process. Cluster descriptions depend on the description of the instances placed into them – if a pattern is not part of the description, we cannot describe a cluster in terms of it. The flip side is that changing available patterns changes what clusterings can possibly be expressed. Changing patterns therefore means that only subsets of the data need to be reclustered, easing the computational load. Members of the consortium have modelled descriptive clustering as a bi-objective optimization problem, namely via a requirement for compact clusters w.r.t. a given distance, and interpretable ones w.r.t. tags associated with objects. Three optimization criteria have been defined depending on the density of the tags. The proposed algorithm maximizes cluster coherence and description clarity at the same time, finding the necessary balance. Any feedback that excludes certain instances or parts of instance descriptions changes the search space for this algorithm and allows new solutions to be discovered. That work only uses unstructured descriptors, however, and we will exploit ways of adapting it to the structured case. In structured data, excluding a single atom is equivalent to excluding a single attribute-value combination (AV) in unstructured data but excluding a bond involving two atoms excludes more than two AVs. The effect of excluding structured instances is analogous. Patterns mined in an unsupervised manner need criteria about their unexpectedness, statistical significance etc. As soon as at least partial information w.r.t. a partition of the data is available, one can use measures related to discriminating different groups in the data, e.g. a particular cluster against the rest, a subcluster against the rest of the parent cluster, or describing predefined subsets.
  4. Interface. The interface that we are going to develop will support all the existing feedback options for pattern mining and clustering. We will also add so far unexplored options such as selecting a cluster and requiring that it be split or merged with an (potentially unspecified) other, selecting a pattern and demanding a specialization of it to be included in the result set, selecting a subset of instances and demanding that they form a cluster etc. The interface will be based on the visual building blocks as well as on the interaction options described in the page on visualization. By integrating automatic exploration/evaluation, we will go through several iterations of the interface, which will also be cross-evaluated by members of the consortium (both computer scientists and chemo-informaticians), but also outsiders for instance in the form of demo presentations at international conferences.