The overall goal of InvolvD is to develop a robust approach to interactive DM, and to deliver a prototype that allows users to launch pattern mining or clustering algorithms, visualize the results, give feedback, and rerun mining operations that take the given feedback into account. Our research hypothesis can be expressed bluntly: putting the user back into the loop will lead to better-understood results, and to mining processes that actually run faster than in the non-interactive case.
To achieve this, we have to tackle a number of scientific challenges:
- Interpretability: The visualization of pattern mining results often suffers from pattern explosion. In addition, patterns on their own, without an idea of related patterns and/or the data context (such as instance coverage and [partial] labels), are hard to process since they are local by definition. Visualizations of clusterings of data that have more than three dimensions pose obvious difficulties: an entire clustering might hide interesting underlying phenomena, but zooming into a clustering is not a straightforward process.
- Feedback: There are a variety of feedback options, and it is not clear whether some are preferable to others. For pattern mining (in increasing order of difficulty and informativeness), one can imagine accepting or rejecting a pattern, judging its interestingness on a scale, deciding which of two (or more) patterns is preferable, or actively suggesting that constituent elements be changed. Feedback on clusterings could range from rejecting a clustering outright, to rejecting individual clusters, to changing (parts of) the description, to rearranging instance assignments.
- Translation into constraints: Whatever feedback is returned is not easy to translate into constraints. Rejection can take both the form of hard filtering of patterns and of down-weighting. Reassigning instances can be translated into must-link/cannot-link constraints or into more general interpretations.
- Enforcing constraints: Most pattern mining approaches depend on their ability to use constraint information to prune the search tree, but quite a number of expressive constraints do not have characteristics that are well suited for pruning. In clustering, enforcing constraints takes the form either of using powerful, but relatively slow, constraint-programming techniques, or of exploiting constraints directly to influence instance assignment or the calculation of quality measures.
- Efficiency: The techniques most capable of enforcing a wide variety of constraints are also often those that are least efficient: the more flexible, the less efficient. Additionally, while supervised approaches can often exploit convex formulations of objective functions, which allow for efficient optimization procedures, this is far from a given for unsupervised or semi-supervised problem settings. One way of addressing this problem consists of developing hybrid solutions, combining dedicated mining and clustering techniques with more general constraint satisfaction/propagation methods that narrow the search space. Alternatively, sampling (both of instances and solutions) and parallelization techniques can be employed, yet while sampling is well established for vectorial or itemset data, the problem becomes much more complex for structured data representations such as sequences, trees and graphs.
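As an illustration of the translation step for clustering, the following minimal sketch turns a single "reassign this instance" action into pairwise must-link/cannot-link constraints and counts how many of them a given clustering violates. The function names and the dictionary-based data layout are hypothetical, chosen only for this example, and the interpretation shown (link to every member of the target cluster, separate from every member of the source cluster) is one deliberately strong reading among the more general interpretations mentioned above.

```python
def reassignment_to_constraints(moved, target_members, source_members):
    """Translate 'move `moved` into the target cluster' into pairwise constraints.

    Assumption for this sketch: the user wants `moved` grouped with every
    current member of the target cluster and separated from every member
    of the cluster it came from.
    """
    must_link = {frozenset((moved, m)) for m in target_members if m != moved}
    cannot_link = {frozenset((moved, m)) for m in source_members if m != moved}
    return must_link, cannot_link


def count_violations(labels, must_link, cannot_link):
    """Count violated constraints for a clustering given as {instance: cluster_id}."""
    violated = 0
    for pair in must_link:
        a, b = tuple(pair)
        if labels[a] != labels[b]:   # linked instances ended up apart
            violated += 1
    for pair in cannot_link:
        a, b = tuple(pair)
        if labels[a] == labels[b]:   # separated instances ended up together
            violated += 1
    return violated


# Hypothetical toy clustering: the user drags "b" from cluster 0 into cluster 1.
labels_before = {"a": 0, "b": 0, "c": 1, "d": 1}
ml, cl = reassignment_to_constraints("b", ["c", "d"], ["a", "b"])
labels_after = dict(labels_before, b=1)
print(count_violations(labels_before, ml, cl))
print(count_violations(labels_after, ml, cl))
```

A constrained clustering algorithm could treat the violation count either as a hard feasibility test or as a penalty term in its quality measure, which mirrors the two enforcement styles described in the list above.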
To this end, we will define feedback options for unsupervised tasks, develop mechanisms to translate feedback into constraints for pattern mining and constrained clustering (i.e. learn constraints and constraint settings), and develop the scalability solutions (e.g. parallelization and sampling) needed for responsive tools.
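For pattern mining, the two readings of a rejection discussed earlier (hard filtering versus down-weighting) can be sketched as follows. This is only an illustrative sketch under stated assumptions: patterns are represented as frozensets of items mapped to an interestingness score, and the function name `apply_rejections` and the multiplicative `penalty` parameter are hypothetical, not part of any existing tool.

```python
def apply_rejections(patterns, rejected, mode="filter", penalty=0.5):
    """Apply user rejections to a scored pattern set.

    patterns: dict mapping frozenset (the pattern) -> float (its score)
    rejected: set of frozensets the user has rejected
    mode:     "filter" removes rejected patterns (hard constraint);
              "downweight" multiplies their score by `penalty` (soft constraint)
    """
    if mode == "filter":
        return {p: s for p, s in patterns.items() if p not in rejected}
    if mode == "downweight":
        return {p: (s * penalty if p in rejected else s)
                for p, s in patterns.items()}
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical toy result set with one rejected pattern.
patterns = {frozenset("ab"): 0.8, frozenset("bc"): 0.6}
rejected = {frozenset("ab")}

hard = apply_rejections(patterns, rejected, mode="filter")
soft = apply_rejections(patterns, rejected, mode="downweight")
print(sorted(hard.values()))
print(sorted(soft.values()))
```

The hard variant guarantees that a rejected pattern never reappears, whereas the soft variant keeps it available should later feedback contradict the rejection; which behaviour is preferable is exactly the kind of question the feedback mechanisms are meant to settle.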
Interactive DM will have applications in many different fields, but in InvolvD we will focus on biological and chemical applications, where expert knowledge is difficult to formulate, hard to express by existing constraints, or changes depending on the context. The success of the process is strongly related to the capability of a chemoinformatician to parametrize and influence the clustering. Our main goal is to allow researchers to sharpen their knowledge of studied therapeutic targets, not to have an automatic prediction tool. This applicative focus also informs our choice of DM instead of ML, even if ML has made great strides in recent years in the form of Deep Learning.