In product development, compounds with specific properties must be identified as early as possible to minimize unsystematic trial-and-error approaches and thereby reduce development costs. Consumers want new products, such as cosmetics, to exhibit desirable, characteristic hedonic properties, e.g., particular odors, but also to be harmless to their health and the environment. The situation is similar for food products and their ingredients, where it is imperative to identify substances that are considered hazardous to health early in the development cycle so that they can be avoided.
In particular, as part of the European Commission's European Green Deal [1], the chemicals strategy aims to ban chemicals that are harmful to consumers or the environment. A generalized, automated system that helps to identify such substances is therefore key to meeting this challenge. Regulatory bodies such as the European Chemicals Agency (ECHA) and the European Food Safety Authority (EFSA) have been established to monitor and maintain lists of substances that may be used for various use-cases [2, 3]. Deciding whether a substance is permitted can be framed as a binary classification task, which is well suited to an artificial neural network (ANN).
ANNs have previously exploited molecular structure relationships to classify substances as carcinogenic [4–6] or to predict molecular properties [7, 8]. In cheminformatics, molecular structures are often represented using specific notations, such as InChI [9] (International Chemical Identifier) or SMARTS [10] (SMILES ARbitrary Target Specification), or, more popularly, SMILES [11] (Simplified Molecular Input Line Entry Specification) representations, which are a subset of SMARTS. Another way of encoding molecular structure and features is the SELFIES [12] (Self-Referencing Embedded Strings) notation, which is used in machine learning tasks to predict molecular properties or to generate new structures. These rule-based methods have the benefit of being rather straightforward to generate and easy for chemists to understand. Such representations have therefore been used extensively with machine learning in previous works, for example to generate new molecular structures [13–17]. Additionally, other encoding schemes, such as molecular graphs [15, 18–20], have been widely combined with machine learning to predict molecular properties, such as toxicity [21, 22], medical activity in drug discovery [23–26], or the odor of molecules [27–31], among others.
We assert that such a 2D representation of molecules is insufficient to model the true spatial domain of the molecule and, as such, can at best roughly approximate properties rooted in the 3D structure of a molecule. The SMILES notation, for example, has several drawbacks: there is no standard handling of aromaticity and no standard method for generating canonical representations, and existing implementations differ in their details. On the one hand, this can yield multiple SMILES notations for a single structure [32, 33]. On the other hand, there are molecules that cannot easily be defined by graph models, such as those having delocalized bonds, for example, metal carbonyl complexes [32]. The same holds for molecules whose atomic arrangements are not fixed in 3D space, which makes meaningful graph representations difficult to generate.
This can be seen in Fig. 1, where the SMILES strings do not convey the complex structures of molecules as their 3D structures do. Recent works have used 3D representations of molecules, such as projecting a 3D molecular graph from its 2D structure [34], using 3D molecular conformations [35–37], or representing molecules in 3D coordinate space [38–42], and our method takes inspiration from these. To overcome the aforementioned limitations, we develop a machine learning pipeline that aims to learn molecular features that are as closely related to the true physics of a molecule as possible, without depending on intermediate representations such as graphs or molecular fingerprints. For this purpose, we use 3D electron densities as training data for a deep artificial neural network (DNN) pipeline, which allows it to capture the spatial features of molecules that are rooted in quantum physics. We base our method on the core hypothesis of Density Functional Theory (DFT), which postulates that knowing the electron density of a molecule allows direct derivation of various other molecular properties, such as electrostatic potentials, energies, or dipole structures [43–45]. In our pipeline, we use segmented local electronegativity maps of chemical compounds, which identify sites of high and low electronegativity based on thresholds derived from the percentiles of the electronegativity values. Voxels with electronegativity values higher than the 90th percentile are marked as class 2, i.e., high-strength electronegative sites; those lower than the 10th percentile are marked as class 1, i.e., low-strength electronegative sites; and the remaining voxels are assigned to class 0, i.e., medium-strength electronegative sites. The sites of high and low electronegativity are commonly considered active sites where reactions would take place [45–49].
Thus, together with the electron density distributions, these maps can be used to classify compounds into substances that are allowed, i.e., not hazardous, and substances that are health hazards and hence prohibited for the two use-cases. Examples of the 3D electron density representation of molecules and their corresponding ternary electronegativity maps are shown in Fig. 1.
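The percentile-based segmentation into three classes described above can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the function name `segment_voxels`, the flattened-list input, and the choice of linear percentile interpolation are assumptions made for illustration, and only the Python standard library is used.

```python
from collections import Counter
from statistics import quantiles


def segment_voxels(values, low_pct=10, high_pct=90):
    """Label each voxel as 0 (medium), 1 (low), or 2 (high).

    `values` is assumed to be a flat list of local electronegativity
    values (a 3D grid would be flattened first; this is a hypothetical
    input format, not taken from the original pipeline).
    """
    # quantiles(..., n=100) returns the 1st to 99th percentile cut points;
    # method="inclusive" interpolates linearly between data points.
    cuts = quantiles(values, n=100, method="inclusive")
    low_thr = cuts[low_pct - 1]
    high_thr = cuts[high_pct - 1]

    labels = []
    for v in values:
        if v > high_thr:
            labels.append(2)  # high-strength electronegative site
        elif v < low_thr:
            labels.append(1)  # low-strength electronegative site
        else:
            labels.append(0)  # medium-strength electronegative site
    return labels


# Toy example: 101 synthetic "voxel" values 0.0, 1.0, ..., 100.0.
toy_values = [float(v) for v in range(101)]
toy_labels = segment_voxels(toy_values)
class_counts = Counter(toy_labels)
print(class_counts)
```

For this toy input, the 10th and 90th percentiles are 10.0 and 90.0, so 10 voxels fall into class 1, 10 into class 2, and the remaining 81 into class 0. Values exactly equal to a threshold are assigned to class 0 here because the comparisons are strict, which matches the wording "higher than" and "less than" but is otherwise an assumption.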