Early research on machine learning (ML) and data mining (DM) tried to fully automate knowledge discovery processes and reduce human intervention, for good reasons: humans cannot cope with large amounts of (high-dimensional) data, tend to see patterns everywhere, and technical progress should relieve us of time-consuming tasks. This also motivates current work on automatic parameter tuning (cf. the Dagstuhl seminar). Full automation can work in supervised settings where many labels are available and the goal is a well-performing black box model, but this automation-first consensus is increasingly being questioned in DM today.
In ML, logic-based systems have used background knowledge for more than 30 years. Constrained or semi-supervised clustering, which uses additional knowledge to guide the clustering process, was proposed about fifteen years ago. Yet methods using subjective interestingness in pattern mining, i.e. interestingness measures that involve users’ assumptions, are relatively recent. Also recent is research on interactive data mining methods, which allow the user to give feedback during the mining process – not just before and after – to change exploration strategies, narrow or widen search spaces, etc. The reasons for this shift have been manifold:
- In unsupervised settings, e.g. clustering and pattern mining, labels are by definition absent and using labels to automate the process is therefore impossible. Yet even in many real-life “supervised” problem settings, a large proportion of data might be unlabeled, or existing labels might be unreliable, and a user finds herself at best in a semi-supervised setting.
- In an unsupervised or semi-supervised setting, it is almost impossible for users to specify their assumptions, expectations, and goals a priori. Even if they manage, translating these into the somewhat limited constraint languages available is difficult. The current paradigm, in which users set parameters before mining and then sift through and interpret the output, wastes time (and money) and runs counter to how we process information. Users can, however, react to (partial) results and indicate whether those agree with their intuition, appear interesting, etc.
- Often, experts need to understand why algorithms produce the results they do: when large amounts of money or resources are in play, e.g. in drug development or infrastructure deployment, or even when lives are at stake, e.g. in medicine or disaster preparedness. Or they want to understand them because unsupervised DM can act as a hypothesis generator: observing the results of a pattern mining operation or a produced clustering triggers new insights and informs new research directions – the final step in the “knowledge discovery” process.
- Legal frameworks, e.g. the EU General Data Protection Regulation or the US Fair Credit Reporting Act, require interpretability and explainability of algorithmic decisions. Complying with those frameworks and bringing the user back into the loop requires: 1) symbolic, i.e. human-interpretable, forms of knowledge discovery, and 2) a fully interactive framework for unsupervised or semi-supervised DM, supporting the whole data analysis process: a user specifies constraints, examines the result, adds new constraints, and so on (a toy sketch of this loop follows below). By rethinking what analysts need, and how they navigate different hypotheses and formulate their feedback, we aim to provide such a framework, coupled with tools for DM.
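The following deliberately small, self-contained sketch illustrates such a loop. The transaction data, the naive frequent-itemset miner, and the simulated user feedback are purely illustrative placeholders, not components of the envisioned framework; the point is only that mining alternates with feedback, and each round of feedback is pushed back into the miner as a constraint.

```python
# Illustrative only: a tiny interactive mining loop in which (simulated) user
# feedback after each round becomes a constraint for the next round.
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

def frequent_itemsets(db, min_support, banned_items):
    """Naively enumerate itemsets that meet min_support and avoid banned items."""
    items = sorted(set().union(*db) - banned_items)
    found = []
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            support = sum(set(itemset) <= t for t in db)
            if support >= min_support:
                found.append((itemset, support))
    return found

def simulated_feedback(patterns):
    """Stand-in for a real analyst: flag the item 'c' as uninteresting, if it appears."""
    return {"c"} if any("c" in itemset for itemset, _ in patterns) else set()

banned = set()
for round_no in range(3):
    patterns = frequent_itemsets(transactions, min_support=2, banned_items=banned)
    print(f"round {round_no}: {patterns}")
    new_constraint = simulated_feedback(patterns)
    if not new_constraint:
        break  # the analyst is satisfied with the current results
    banned |= new_constraint  # feedback is pushed into the next mining round
```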
The reaction to understandability requirements, legal and otherwise, from what we will refer to as the “black box” community (mainly researchers working on Deep Learning techniques, but also Kernel Methods or large/extreme Random Forests) has been to first learn a black box model and then build a symbolic one (e.g. a decision tree) based on its predictions. This is different from what InvolvD will do: such approaches are mostly supervised, and users’ ability to interact with the process is rather limited; identifying a problematic branch in a decision tree cannot easily be fed back into training the neural network, since the understandable model is not the learned model. Also, changes to the learning process of black box models typically have to take the form of weighting or removing attributes or data points, instead of constraints that can be pushed directly into DM processes.
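For concreteness, a minimal sketch of this post-hoc surrogate approach is shown below, assuming a scikit-learn setup with synthetic data; the specific models and parameters are illustrative, not those of any particular system.

```python
# Minimal post-hoc surrogate sketch: learn a "black box" model, then fit a
# shallow decision tree on the black box's predictions rather than on the
# true labels. Data and model choices here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# The opaque model that actually gets deployed.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The symbolic surrogate mimics the black box's outputs, not the ground truth.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how closely the interpretable tree reproduces the black box.
fidelity = surrogate.score(X, black_box.predict(X))
print(f"surrogate fidelity to the black box: {fidelity:.2f}")
print(export_text(surrogate))
```

Note that editing the resulting tree, e.g. pruning a branch an expert disagrees with, does not change the underlying random forest at all, which is precisely the limitation on interaction described above.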