For several reasons, we will focus on a chemoinformatics use case. First, explorative data mining in chemoinformatics is the prototypical unsupervised/semi-supervised mining task. Second, molecular data can be represented in different, often challenging, ways other than attribute-value ones, which allows us to develop and test new approaches. Third, large sets of well-curated data can be obtained in this setting (if not always easily).
One of our objectives is to further the identification of associations between structural and chemical properties on the one hand, and biochemical behavior on the other. To do so successfully, a chemoinformatician will need to be able to easily parametrize a clustering. The integrated system will therefore be continuously employed for explorative data mining on molecular data, with feedback given w.r.t. ease of use, plausibility of delivered results, and intuitiveness of the system’s behavior when reacting to the solution. In addition, expert knowledge will be a vital source of constraint formulations, as detailed below.
A user would need to be able to choose a clustering algorithm and a similarity measure, and possibly to select attributes or to weight them (in the attribute-value case). This means that, instead of optimizing the resulting system for a particular method or similarity measure that works well with the constraints we intend to enforce, we will include several algorithms (e.g. k-Means, DBSCAN) and similarity measures. Next, the user should be able to express constraints based on graph matching.
The expert could, e.g., express that instances in a cluster must share common structural characteristics (so-called chemical scaffolds or patterns). This facilitates the comparison of structures, matched molecular pair (MMP) analysis, and the study of activity cliffs. Some examples of constraints:
- The presence of a common chemical substructure of a particular size range (called a core in MMP) is necessary for belonging to the same cluster. By manipulating the size of the core, one could observe the formation of clusters for which chemical variation is associated with the presence/absence of fragments around the core. For an initial clustering carried out on data encoded in terms of predefined molecular fingerprints, i.e. not patterns derived from such a clustering, this constraint would change the importance of some fingerprints, e.g. by assessing how much a fingerprint deviates from the size constraint, or lead to others being ignored outright.
- The notions of pharmacophores and toxicophores are among the major ways for an expert to understand the biological behavior (phenotype) of the chemicals. Requiring the presence of specific pharmacophoric patterns is equivalent to imposing syntactic constraints on patterns. Alternatively or additionally, the user can specify constraints on the diversity of pharmacophores, e.g. no more than 10 different ones or no more than 3 different families present.
- Certain chemical functions or pharmacophoric properties should not be present in the same cluster because of their pharmacophoric/toxicophoric properties (PAINS, hERG, …). An expert could, e.g., express that phenyl thioethers should not be in the same cluster as phenyl ethers due to their different chemical properties. Such a constraint can be viewed as a form of cannot-link constraint that is defined not by explicitly indicating instances but indirectly, via a chemical concept.
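The concept-based cannot-link constraint from the last item can be sketched as follows. This is only an illustration under stated assumptions: each molecule is assumed to have already been annotated with the chemical concepts it contains (e.g. by substructure matching, which is not shown), and the function names `derive_cannot_links` and `violated_links`, the molecule identifiers, and the annotations are all hypothetical.

```python
from itertools import combinations


def derive_cannot_links(annotations, forbidden_concept_pairs):
    """Turn concept-level exclusions into instance-level cannot-link pairs.

    annotations: dict mapping a molecule id to the set of concept labels
        found in that molecule (hypothetical input format).
    forbidden_concept_pairs: iterable of (concept_a, concept_b) that must
        not co-occur within one cluster.
    Returns a set of (id_1, id_2) tuples with id_1 < id_2.
    """
    links = set()
    for id_1, id_2 in combinations(sorted(annotations), 2):
        concepts_1, concepts_2 = annotations[id_1], annotations[id_2]
        for a, b in forbidden_concept_pairs:
            # The pair is forbidden if one molecule carries a and the other b.
            if (a in concepts_1 and b in concepts_2) or (
                b in concepts_1 and a in concepts_2
            ):
                links.add((id_1, id_2))
                break
    return links


def violated_links(assignment, cannot_links):
    """Return the cannot-link pairs that a clustering places together."""
    return sorted(
        (i, j) for i, j in cannot_links if assignment[i] == assignment[j]
    )


# Illustrative data only: four molecules and one expert-defined exclusion.
annotations = {
    "m1": {"phenyl ether"},
    "m2": {"phenyl thioether"},
    "m3": {"phenyl ether", "amide"},
    "m4": {"amide"},
}
forbidden = [("phenyl ether", "phenyl thioether")]

cannot_links = derive_cannot_links(annotations, forbidden)
clustering = {"m1": 0, "m2": 0, "m3": 1, "m4": 1}
violations = violated_links(clustering, cannot_links)
```

The expert states a single rule at the level of chemical concepts, and the instance pairs follow from the annotations, so the same rule carries over unchanged when molecules are added to the data set.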
Manipulating constraints and algorithmic parameters allows the user to test assumptions by observing the resulting changes in the clustering:
- The graph edit distance (GED) is one of the approaches currently explored at CERMN to calculate the distance between two pharmacophoric graphs. Observing how the clustering evolves as a function of a constraint on this distance, and understanding this evolution, will give insight into the viability of this research direction and the form such distances should take.
- Clusters showing particular activity profiles as well as particular substructure patterns will allow experts to form new hypotheses or validate existing ones.
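To make the notion of a graph edit distance between pharmacophoric graphs concrete, the following minimal sketch computes it by brute force. It rests on simplifying assumptions that are ours and not necessarily those of the approach explored at CERMN: every node insertion, deletion, or relabeling and every edge insertion or deletion costs 1, edges are unlabeled, and the graphs are small enough to enumerate all node mappings. The function name, the graph encoding, and the feature labels ("A", "D", "R") are hypothetical.

```python
from itertools import permutations


def graph_edit_distance(labels_1, edges_1, labels_2, edges_2):
    """Brute-force graph edit distance with unit costs (tiny graphs only).

    labels_x: list of node labels, the node id being the list index.
    edges_x: iterable of (i, j) pairs describing undirected edges.
    """
    n = max(len(labels_1), len(labels_2))
    # Pad the smaller graph with dummy nodes (None); mapping a real node
    # to a dummy node models a node deletion or insertion.
    padded_1 = list(labels_1) + [None] * (n - len(labels_1))
    padded_2 = list(labels_2) + [None] * (n - len(labels_2))
    edge_set_1 = {frozenset(e) for e in edges_1}
    edge_set_2 = {frozenset(e) for e in edges_2}

    best = None
    for mapping in permutations(range(n)):
        # Node cost: one unit per label mismatch (relabel, insert, delete).
        cost = sum(1 for i in range(n) if padded_1[i] != padded_2[mapping[i]])
        # Edge cost: edges present on only one side after mapping.
        mapped = {frozenset(mapping[i] for i in e) for e in edge_set_1}
        cost += len(mapped ^ edge_set_2)
        if best is None or cost < best:
            best = cost
    return best


# Illustrative pharmacophoric graphs: A = acceptor, D = donor, R = ring.
graph_a = (["A", "D", "R"], [(0, 1), (1, 2)])
graph_b = (["A", "R"], [(0, 1)])

distance = graph_edit_distance(*graph_a, *graph_b)
```

The enumeration grows factorially with the number of nodes, so this only serves to fix ideas; a distance threshold on such a value is one possible form for the constraint mentioned above.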
We will use data from ChemblDB [21], which allows us to explore the relationship between chemical compounds and biological phenotypes. For instance, for the protein kinase family, which comprises more than 500 kinases, ChemblDB (October 2018) contains 849547 biological data points for 167762 chemical compounds (around 5 data points per compound). Restricting the data to the species Homo sapiens leaves 434 kinases with 612565 biological data points for 144708 compounds. The kinome is classically divided into 8 major groups (via sequence comparison of their catalytic domains) and 134 kinase families (30 families for the tyrosine kinase group, for instance). We have carried out an initial analysis on a set of 1475 molecules (with ABL as the kinase) using a supervised approach (analysis of pharmacophoric item sets that are frequent in active chemicals compared to inactive ones), which shows the potential importance of exploring the role of pharmacophoric patterns for understanding the results of cluster analysis. Organizing the data will involve several steps:
- Defining molecular fingerprints associated with each chemical as an initial representation, as well as narrowing the choice of algorithms and similarity measures.
- Potentially defining additional chemical substructures for comparison (molecular fragments associated with matched molecular pair techniques, for instance).
- Defining graph pharmacophoric patterns associated with the chemicals (using the Norns program developed in collaboration by CERMN and Greyc).
- Defining and integrating metrics to compare these pharmacophores.
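As an illustration of the first step, a fingerprint-based similarity measure with user-adjustable weights could look like the sketch below. It assumes that a fingerprint is represented as the set of its "on" bit indices and uses a weighted variant of the Tanimoto coefficient; the function name `weighted_tanimoto`, the fingerprints, and the weights are hypothetical, and this is one possible choice among the similarity measures to be considered.

```python
def weighted_tanimoto(fp_a, fp_b, weights=None):
    """Weighted Tanimoto similarity between two fingerprints.

    fp_a, fp_b: sets of "on" bit indices (assumed encoding).
    weights: optional dict mapping a bit index to a non-negative weight;
        bits without an entry get weight 1.0, and a weight of 0.0
        removes a bit from the comparison entirely.
    """
    union = fp_a | fp_b
    if not union:
        # Convention chosen here: two empty fingerprints are identical.
        return 1.0
    weights = weights or {}
    shared = sum(weights.get(bit, 1.0) for bit in fp_a & fp_b)
    total = sum(weights.get(bit, 1.0) for bit in union)
    return shared / total if total > 0 else 0.0


# Illustrative fingerprints only.
fp_1 = {1, 2, 3, 4}
fp_2 = {3, 4, 5}

plain = weighted_tanimoto(fp_1, fp_2)
# Ignoring bits 1 and 5, e.g. because they violate a size constraint.
reweighted = weighted_tanimoto(fp_1, fp_2, weights={1: 0.0, 5: 0.0})
```

Keeping the weights as an explicit argument lets a constraint change the importance of individual fingerprint bits, or switch some off, without touching the clustering algorithm itself.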