Classification of substances by health hazard using deep neural networks and molecular electron densities

doi:10.21203/rs.3.rs-3719479/v1

Research Article

This work is licensed under a CC BY 4.0 License

Abstract

In this paper we present a method that leverages 3D electron density information to train a deep neural network pipeline that segments regions of high, medium and low electronegativity and classifies substances as health hazardous or non-hazardous. We show that this can be applied to use-cases such as cosmetics and food products. For this purpose, we first generate 3D electron density cubes using semiempirical molecular calculations for a custom European Chemicals Agency (ECHA) subset consisting of substances labelled as hazardous and non-hazardous for cosmetic usage. Using these electron density cubes together with their 3-class electronegativity maps, we train a modified 3D-UNet to segment reactive sites in molecules and classify substances with an accuracy of 78.1%. We perform the same process on a custom food dataset (CompFood) consisting of hazardous and non-hazardous substances compiled from the European Food Safety Authority (EFSA) OpenFoodTox, Food and Drug Administration (FDA) Generally Recognized as Safe (GRAS) and FooDB datasets, and achieve a classification accuracy of 64.1%. Our results show that 3D electron densities, and particularly masked electron densities denoting regions of high and low reactivity, can be used to classify molecules for different use-cases and thus serve not only to guide safe-by-design product development but also to aid in regulatory decisions.

electron density

machine learning

computational chemistry

health hazard

3D-UNet

Introduction

In the field of product development, it is necessary to identify compounds with specific properties as early as possible to minimize non-methodical trial-and-error approaches and consequently reduce development costs. Consumers want new products, like cosmetics, to exhibit desirable, characteristic hedonic properties, e.g., particular odors, but also to be harmless to their health and the environment. The situation is similar for food products and their ingredients, where it is imperative to identify substances that are considered hazardous to health early in the development cycle so that these can be avoided.

As part of the European Green Deal of the EU Commission [1], the chemicals strategy aims to ban chemicals that are harmful to consumers or the environment. Thus, having a generalized automated system that can help in identifying such substances is key to overcoming this challenge. For this purpose, regulatory bodies such as the European Chemicals Agency (ECHA) and the European Food Safety Authority (EFSA) have been established to monitor and maintain lists of substances that can be utilized for various use-cases [2, 3]. This problem can be framed as a binary classification task, which is well suited to an artificial neural network (ANN).

ANNs have previously used molecular structure relationships to classify substances as carcinogenic [4–6] or to predict molecular properties [7, 8]. In cheminformatics, molecular structures are often represented using specific notations, such as their InChI [9] (International Chemical Identifier) or SMARTS [10] (SMILES ARbitrary Target Specification), or, more popularly, with SMILES [11] (Simplified Molecular Input Line Entry Specification) representations, which are a subset of SMARTS. Another notation for encoding molecular structure and features is SELFIES [12] (Self-Referencing Embedded Strings), which is used in machine learning tasks to predict molecular properties or to generate new structures. These rule-based methods have the benefit of being rather straightforward to generate and easy for chemists to understand. Such representations have therefore been used extensively with machine learning in previous works, for example to generate new molecular structures [13–17]. Additionally, other encoding schemes, such as molecular graphs [15, 18–20], have been widely combined with machine learning to predict molecular properties, like toxicity [21, 22], medical activity in drug discovery [23–26], or even the odor of molecules [27–31], among other applications.

We assert that such a 2D representation of molecules is insufficient to model the true spatial domain of the molecule and, as such, can at best roughly approximate properties rooted in the 3D structure of a molecule. The SMILES notation, for example, has several drawbacks, such as the lack of standard aromaticity handling and the lack of a standard method for generating canonical representations, so that implementations differ from one another. On the one hand, this can yield multiple SMILES notations for a single structure [32, 33]. On the other hand, there are molecules that cannot easily be defined by graph models, such as those having delocalized bonds, for example, in metal carbonyl complexes [32]. This is also the case for molecules whose atomic arrangements are not fixed in 3D space, making meaningful graph representations difficult to generate.

This can be seen in Fig. 1, where the SMILES strings do not convey the complex structures of molecules compared to their 3D structures. Recent works have used 3D representations of molecules, like projecting a 3D molecular graph from its 2D structure [34], using 3D molecular conformations [35–37], or representing molecules in 3D coordinate space [38–42], and our method takes inspiration from these. In order to overcome the aforementioned limitations, we aim to develop a machine learning pipeline that learns molecular features which are as closely related to the true physics of a molecule as possible without depending on intermediate representations, such as graphs or molecular fingerprints. For this purpose, we use 3D electron densities as training data for a deep artificial neural network (DNN) pipeline to capture the spatial features of molecules, which are rooted in quantum physics. We base our method on the core hypothesis of Density Functional Theory (DFT), which postulates that knowing the electron density of a molecule allows direct derivation of various other molecular properties, such as electrostatic potentials, energies, or dipole structures [43–45].

In our pipeline, we use segmented local electronegativity maps of chemical compounds to identify sites of high and low electronegativity based on thresholds derived from their percentile values. Voxels with electronegativity values higher than the 90th percentile were marked as class 2, i.e., high strength electronegative sites, while those lower than the 10th percentile were marked as class 1, i.e., low strength electronegative sites, and the remaining voxels were denoted as belonging to class 0, i.e., medium strength electronegative sites. The high and low strength sites are commonly considered as active sites where reactions would take place [45–49]. Thus, together with electron density distributions, they can be used to classify compounds into chemical substances that are allowed, i.e., not hazardous, and those that are health hazards and hence prohibited for the two use-cases. Examples of the 3D electron density representation of molecules and their corresponding ternary electronegativity maps are shown in Fig. 1.

Results and Discussion

The pipeline for classifying molecules into two categories is depicted in Fig. 2. The neural network is supplied with two labels. The primary class information is delivered via a CSV file, where a label value of 1 indicates the "allowed/non-hazardous" category and 0 represents the "prohibited/hazardous" category for both datasets. Concurrently, thresholded electronegativity cube files serve as a secondary label, used to train the 3D-UNet architecture to segment the labeled regions. Electron density cubes, alongside their thresholded electronegativity maps, are first input into the 3D-UNet, producing an intermediate segmentation result, as displayed in Fig. 2. This segmented output is then used to mask the input electron density by multiplying the input densities with the intermediate result. This emphasizes electron density regions that align with regions identified as class 2 or 1 in the intermediate output, denoting high and low reactivity, respectively. Subsequently, a 1x1 convolution block reduces the channel count, followed by batch normalization and adaptive max pooling layers. Finally, two fully connected layers generate the class probabilities that determine the sample's classification.
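The masking step can be illustrated with a minimal sketch. It assumes that the intermediate result is a per-voxel predicted class index and that masking keeps the density wherever that index is 1 or 2 and zeroes it elsewhere; the text does not specify whether class indices, probabilities or logits are multiplied, so the function `mask_density` and the toy values are hypothetical and not the authors' implementation.

```python
from typing import List, Sequence


def mask_density(density: Sequence[float], seg_class: Sequence[int]) -> List[float]:
    """Keep density values only at voxels segmented as class 1 or 2.

    Assumption: seg_class holds one predicted class index (0, 1 or 2)
    per voxel, flattened in the same order as the density cube.
    """
    if len(density) != len(seg_class):
        raise ValueError("density and segmentation must have the same length")
    return [d if c in (1, 2) else 0.0 for d, c in zip(density, seg_class)]


# Toy flattened cube with four voxels (illustrative values only).
density = [0.02, 0.40, 0.31, 0.05]
segmentation = [0, 2, 1, 0]
masked = mask_density(density, segmentation)
```

In this toy case only the two voxels segmented as class 2 and class 1 keep their density, which is the emphasis effect the paragraph describes.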

Classification on ECHA Dataset

Table 1 shows the results achieved for classification of molecules into allowed/non-hazardous and prohibited/hazardous classes. Figure 3 shows the confusion matrix, and examples of the segmented electronegativity regions randomly sampled from the test set are shown in Fig. 4. Furthermore, we performed 5-fold cross validation on this dataset to ensure that the performance metrics are not due to a favorable train-test split; these results are shown in Table S2. The averaged Dice coefficient values for the model on the test set are shown in Table 2 along with the Dice coefficients for CompFood. The low Dice coefficient values for classes 1 and 2 are somewhat expected given the smaller number of voxels that are assigned to those classes compared to the dominant class. Overall, however, the network appears able not only to handle the imbalance between the allowed and prohibited classes but also to reach a classification accuracy of 78.1% for this use-case, against a chance level of 62.8%.

Table 1

Results of the classification of molecules into prohibited, i.e., hazardous (class 0), and allowed, i.e., non-hazardous (class 1), categories for the ECHA cosmetics dataset (higher is better). Chance prediction would be 62.8% for the ECHA dataset.
Class	Precision	Recall	F1 Score	Support
Class 0	0.73	0.65	0.69	68
Class 1	0.80	0.86	0.83	115
Accuracy			0.78	183
Macro avg.	0.77	0.75	0.76	183
Weighted avg.	0.78	0.78	0.78	183
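As a quick consistency check on Table 1, the sketch below recomputes the F1 scores and the chance level from the reported values. It reads the first numeric column as precision and the second as recall, which is an assumption that the recomputed F1 values support, and it takes "chance" to mean always predicting the majority class.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from Table 1.
table1 = {"class 0": (0.73, 0.65, 68), "class 1": (0.80, 0.86, 115)}

f1 = {name: f1_score(p, r) for name, (p, r, _) in table1.items()}
total_support = sum(s for _, _, s in table1.values())

# Majority-class baseline: always predict the larger class.
chance_level = max(s for _, _, s in table1.values()) / total_support

print({name: round(v, 2) for name, v in f1.items()})
print(round(100 * chance_level, 1))
```

The recomputed values agree with the F1 column and with the 62.8% chance level stated in the caption.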

Table 2

Average generalized dice scores on the test sets for the ECHA and CompFood datasets (higher is better).
	ECHA Dataset	CompFood Dataset
Class 0	0.8305	0.8473
Class 1	0.1960	0.1802
Class 2	0.2537	0.3991

Classification on CompFood Dataset

The results of classification on the CompFood dataset are shown in Table 3. The confusion matrix for the results is shown in Fig. 5. The classification report indicates that while the overall classification accuracy of 64.1% is above the chance level of 55.4%, there is still scope for significant improvement of the model. Examples from the segmentation of electronegativity files sampled randomly from the test set molecules are shown in Fig. 6 and the average Dice coefficient values across the test set are shown in Table 2. As in the previous case, the Dice coefficients for the two under-represented classes (classes 1 and 2) are lower than that of the majority class (class 0), which is somewhat expected.

Table 3

Results of the classification of molecules into prohibited, i.e., hazardous (class 0), and allowed, i.e., non-hazardous (class 1), categories for the CompFood dataset (higher is better). Chance prediction would be 55.4% for this case.
Class	Precision	Recall	F1 Score	Support
Class 0	0.58	0.72	0.64	373
Class 1	0.72	0.58	0.64	463
Accuracy			0.64	836
Macro avg.	0.65	0.65	0.64	836
Weighted avg.	0.66	0.64	0.64	836

Overall, we show that our model is able to achieve up to 78.1% binary class accuracy for the ECHA dataset and 64.1% accuracy for the CompFood dataset. Using thresholded electronegativity maps as reactive sites of the molecules and thus as weights for the electron densities allows specific spatial regions within the molecule to be highlighted, which would not be possible with a 2D representation. This enables the network to use only these electron densities for making a decision on the molecules being in the hazardous/non-hazardous class. Our prototype pipeline thus allows the molecular properties to be established directly based on the physics of the molecule without depending on intermediate steps, such as lossy fingerprint translation. This approach opens up various other future possibilities, such as molecular structure replacement by identifying sites that contribute to the reactive nature of the molecule and testing if the replacement structure leads to a change in the hazardous/non-hazardous class assignment. Among others, our future work will focus on improving the performance of the network and transferring this to other properties, such as the logP (partition coefficient) of the underlying molecule that can be verified in a laboratory setting.

Conclusions

In this paper we demonstrate a machine learning pipeline that uses 3D electron density and electronegativity information to segment regions of high, medium, and low electronegativity and classify substances as health hazardous or non-hazardous with considerably higher than chance accuracy. For this purpose, we first created a custom dataset of cube files by performing semi-empirical molecular calculations for all molecules present in the ECHA dataset, which consists of molecules that are considered health hazardous and hence prohibited, or non-hazardous and thus allowed, for cosmetic use. These cube files were used to train a modified 3D-UNet to segment 3-class electronegativity maps, derived by setting an upper and a lower threshold on the electronegativities, and the segmentation was then used for classification of the given molecules.

Moreover, we show that this kind of approach can be applied to different use-cases, for example cosmetics or food products, by performing the same data generation, pre-processing, and training steps on the CompFood dataset. This dataset poses a binary classification problem over substances considered carcinogenic or safe, compiled from the OpenFoodTox, GRAS and FooDB datasets. With our work, we aim to demonstrate that a prototype pipeline that uses electron densities and deep neural networks can serve as an early predictor in the product development cycle to reduce future trial and error, as well as aid in regulatory decisions.

Methods

Data Generation

Initially, a list of substances prohibited for use in cosmetic products under EU Cosmetic Products Regulation was retrieved from the European Chemicals Agency’s database for Information on Chemicals [2]. The prohibited substances are those chemicals that are classified as carcinogenic, mutagenic, or toxic for reproduction by the European Union and hence considered a health hazard. Additionally, a second list of allowed substances was created that do not belong to this list, i.e., those not restricted by ECHA. For this purpose, we sampled a disjoint set of molecules with molecular weight < 400 Da from the ZINC [50] database. In this work, we use ‘hazardous’ and ‘prohibited’ interchangeably and similarly, ‘non-hazardous’ and ‘allowed’ are used interchangeably.

For creating the training dataset, in a first step CAS numbers were used to query PubChem via their REST API [51] to retrieve the isomeric SMILES representations of the substances in our prohibited and allowed categories. Using RDKit [52], these SMILES strings were converted to 3D structures, optimized using the Merck Molecular Force Field (MMFF) [53] and exported to mol2 files, from which we generated input files for the EMPIRE [54] software. We used EMPIRE and the AM1S [55] Hamiltonian to perform geometry optimizations and to generate an electronic wave-function for each molecule. The wave-function was then used to generate electron density, electronegativity and electron-affinity cube files using the eh5cube software from Cepos [54]. The final dataset consisted of 3D-electronegativity and the 3D electron-affinity cube files for each of the 1841 molecules, of which 1158 are allowed and 683 are prohibited, and this was then divided randomly into stratified train, validation, and test sets in the ratio of 70:20:10.
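The PubChem lookup in the first step can be sketched as follows. The URL template is the one given in reference [51]; the helper names, the hand-written sample response and the assumption that the SMILES string is returned under the key `IsomericSMILES` inside `PropertyTable.Properties` are illustrative, and the key returned by the live service may differ. No network request is made here.

```python
import json
from typing import Optional
from urllib.parse import quote

# URL template as listed in reference [51].
PUG_TEMPLATE = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    "{cas_num}/property/IsomericSMILES/JSON"
)


def build_pubchem_url(cas_num: str) -> str:
    """Fill the CAS number into the PUG REST URL, percent-encoding it."""
    return PUG_TEMPLATE.format(cas_num=quote(cas_num, safe=""))


def extract_smiles(payload: str, key: str = "IsomericSMILES") -> Optional[str]:
    """Return the first SMILES string in a JSON response, or None."""
    data = json.loads(payload)
    for entry in data.get("PropertyTable", {}).get("Properties", []):
        if key in entry:
            return entry[key]
    return None


# Hand-written sample response for benzene (CAS 71-43-2), for illustration.
sample_response = (
    '{"PropertyTable": {"Properties": '
    '[{"CID": 241, "IsomericSMILES": "C1=CC=CC=C1"}]}}'
)

url = build_pubchem_url("71-43-2")
smiles = extract_smiles(sample_response)
```

Returning `None` for responses without a property table lets a batch job skip CAS numbers that PubChem cannot resolve instead of failing.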

The labels for classifying molecules into “allowed” or “prohibited” classes were one-hot encoded with allowed = 1 and prohibited = 0. To map physical properties onto the feature space, a local property map of electronegativity was used as a secondary label, as follows. Since high (local) electronegativity is correlated to high (local) reactivity [46], we derive a ternary reactivity mask from the electronegativity cube files for regions of high, medium and low local reactivity. To categorize reactivity, voxels with electronegativity values above the 90th percentile were labeled as class 2, signifying high reactive sites. Conversely, those below the 10th percentile were labeled as class 1 for low reactive sites. All remaining voxels were designated as class 0, indicating medium reactivity. Examples of the 3D electron density representation of molecules, and their corresponding ternary electronegativity maps, are shown in Fig. 1.
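The percentile thresholding can be sketched in a few lines. This is a minimal illustration that assumes the percentiles are computed per cube over all voxels, with linear interpolation between ranks; the text does not state either detail, and the helper names and toy values are hypothetical.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def ternary_mask(values: Sequence[float],
                 low_q: float = 10.0,
                 high_q: float = 90.0) -> List[int]:
    """Label voxels: 2 above the high percentile, 1 below the low one, else 0."""
    low = percentile(values, low_q)
    high = percentile(values, high_q)
    return [2 if v > high else 1 if v < low else 0 for v in values]


# Toy flattened electronegativity cube with 20 voxels (values 1..20).
voxels = [float(v) for v in range(1, 21)]
mask = ternary_mask(voxels)
print(mask.count(0), mask.count(1), mask.count(2))
```

By construction roughly 80% of the voxels fall into class 0, which is the class imbalance that the segmentation loss has to handle.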

For the generation of the compiled food dataset, three independent datasets were combined. Firstly, the OpenFoodTox [3] database from EFSA, containing 4201 substances with their CAS numbers, was downloaded. Of these substances found in food products, 3409 were labelled as ‘Positive’, denoting a carcinogenic compound; 375 as ‘No Data’, i.e., either no carcinogenicity assessment was made or no studies are available; 209 as ‘Negative’; 51 as ‘Not Determined’, i.e., no clear conclusion could be made; 32 as ‘Other’; 37 as ‘Ambiguous’; and 88 as ‘Not applicable’. Thus, from this dataset, the 3409 ‘Positive’ substances were selected for the prohibited class. For the allowed class, only the 209 substances labelled ‘Negative’ were chosen. To balance the class distribution, additional substances were added to the allowed class: the GRAS database [56] provided 381 compounds that are generally recognized as safe for consumption, for example as food additives, and a further 3167 “non-hazardous” substances were randomly sampled from the FooDB dataset [57], a comprehensive collection of food compounds and their associated chemical compositions, nutrients, flavors, etc. Using our data preparation steps we generated a total of 5572 cube files, with 2599 compounds belonging to the prohibited class and 2973 to the allowed class. This data was then subdivided into train, test and validation sets in the ratio of 75:15:10 with the aim of training an ANN pipeline for this binary classification problem using the 3D electron density and electronegativity representations of these substances.
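A random 75:15:10 split of the 5572 cube files can be sketched as below. The seed, the rounding convention and the function name `split_indices` are assumptions made for illustration; the text does not say how the split was drawn.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(
    n: int,
    ratios: Sequence[float] = (0.75, 0.15, 0.10),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 and cut them into train, test and validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # remainder goes to validation
    return train, test, val


train, test, val = split_indices(5572)
print(len(train), len(test), len(val))
```

Under this rounding the test partition holds 836 samples, which matches the total support reported for the CompFood test set in Table 3.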

Loss functions, evaluation and hyperparameter search

For classification of molecules into allowed/prohibited classes, the overall loss was the sum of the cross entropy (CE) loss between ground-truth and predicted class labels and the Dice loss between the original and predicted electronegativity maps, denoted as \({L}_{ovr}={L}_{ce}+{L}_{gen\_dice}\). Since the thresholded electronegativity voxels lead to a very imbalanced distribution of classes, we used the generalized Dice loss instead of the simple Dice loss [58]. This introduces a weighting scheme that compensates for underrepresented classes. For our implementation, we used the generalized Dice coefficient implementation from the MONAI library [59]. This loss function is defined as \({L}_{gen\_dice}=1-\frac{1}{{y}^{2}+\epsilon }\cdot \frac{2\,y\,\widehat{y}+\epsilon }{y+\widehat{y}+\epsilon }\). The CE loss used for training is defined as \({L}_{ce}=-\sum _{i=1}^{2}{w}_{i}{y}_{i}\text{log}\left(\widehat{{y}_{i}}\right)\). Here, \(y\) corresponds to the ground truth labels, \(\widehat{y}\) to the predicted labels, and \(\epsilon\) is a small constant that avoids division by zero; \({w}_{i}\) are the weights for class \(i\) shown in Supp. Table 3. Moreover, to account for class imbalance, especially on the ECHA dataset, the CE loss was given class weights that were optimized along with the other hyperparameters. For the CompFood dataset, however, the optimized CE class weights were almost equal, which is plausible since the classes are sufficiently balanced for the classification task. The performance of the models was determined by calculating the accuracy on the test set, along with their confusion matrices. The models were trained using the PyTorch [60] library (version 2.0.0) for Python on a cluster of 4 Nvidia Quadro 8000 GPUs. The hyperparameters for both trainings were selected by performing a hyperparameter search using Optuna [61] with a Tree-structured Parzen Estimator. The final parameters are listed in Table S3.
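To make the combined objective concrete, the following pure-Python sketch computes a generalized Dice loss and a weighted cross entropy on toy data and adds them. It follows the class-weighted formulation of the generalized Dice loss from reference [58] with an added smoothing term, not the internals of the MONAI implementation used in the paper; the function names, the class weights and the toy inputs are illustrative assumptions.

```python
import math
from typing import List, Sequence


def generalized_dice_loss(target: Sequence[int],
                          probs: Sequence[Sequence[float]],
                          n_classes: int = 3,
                          eps: float = 1e-6) -> float:
    """Generalized Dice loss with weights 1 / (class volume)^2.

    target: true class index per voxel; probs: class probabilities per voxel.
    """
    numerator = 0.0
    denominator = 0.0
    for c in range(n_classes):
        ref = [1.0 if t == c else 0.0 for t in target]
        pred = [p[c] for p in probs]
        weight = 1.0 / (sum(ref) ** 2 + eps)  # rare classes weigh more
        numerator += weight * sum(r * p for r, p in zip(ref, pred))
        denominator += weight * sum(r + p for r, p in zip(ref, pred))
    return 1.0 - 2.0 * numerator / (denominator + eps)


def weighted_cross_entropy(label: int,
                           probs: Sequence[float],
                           class_weights: Sequence[float]) -> float:
    """Weighted CE for one sample with a one-hot ground truth label."""
    return -class_weights[label] * math.log(probs[label])


def one_hot(index: int, n: int = 3) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(n)]


# Toy segmentation target with five voxels, dominated by class 0.
seg_target = [0, 0, 0, 1, 2]
perfect = [one_hot(t) for t in seg_target]
all_zero = [one_hot(0) for _ in seg_target]

dice_perfect = generalized_dice_loss(seg_target, perfect)
dice_all_zero = generalized_dice_loss(seg_target, all_zero)

# Toy classification output: true label 1 (allowed), predicted p = 0.8.
ce = weighted_cross_entropy(1, [0.2, 0.8], class_weights=[1.0, 1.0])
total_loss = ce + dice_all_zero
```

Predicting only the majority class leaves the generalized Dice loss high even though most voxels are labelled correctly, which is the behaviour that motivates the class weighting.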

Declarations

Availability of data and materials

The data used for training is accessible at https://osf.io/qd9ry/. The algorithm and all further preprocessing steps are described in detail in the Methods section.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was financially supported by the “Campus of the Senses” Initiative from the Bavarian Ministry of Economic Affairs, Regional Development and Energy (StMWi) and Fraunhofer (Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.).

Authors’ contributions

Satnam Singh: Conceptualization, Formal Analysis, Data Curation, Investigation, Methodology, Writing – original draft, Writing – review and editing. Gina Zeh: Methodology, Writing – review and editing. Jessica Freiherr: Supervision, Writing – review and editing. Thilo Bauer: Conceptualization, Data Curation, Methodology, Supervision, Writing – review and editing. Işik Türkmen: Conceptualization, Methodology, Writing – review and editing. Andreas T. Grasskamp: Conceptualization, Methodology, Supervision, Writing – review and editing.

Acknowledgements

The authors are grateful to Paul Martini, Katharina Bauer and Tobias Kopyto for helpful discussions and critical comments and to Sally Arnhardt for help with generating plots and figures using biorender.com.

References

European Commission. Directive 2003/71/EC of the European Parliament 21.6.2017 https://environment.ec.europa.eu/strategy/chemicals-strategy_en 2018;:48–119.
European Union. Prohibited Substances: Annex II, Regulation 1223/2009/EC on Cosmetic Products https://echa.europa.eu/cosmetics-prohibited-substances. 2009. Accessed 10 November 2023.
Kovarich S, Ciacci A, Baldin R, Roncaglioni A, Mostrag A, Tarkhov A, et al. OpenFoodTox: EFSA’s chemical hazards database. 2022.
Chen Z, Zhang L, Sun J, Meng R, Yin S, Zhao Q. DCAMCP : A deep learning model based on capsule network and attention mechanism for molecular carcinogenicity prediction. J Cellular Molecular Medi. 2023;:jcmm.17889.
Limbu S, Dakshanamurthy S. Predicting Chemical Carcinogens Using a Hybrid Neural Network Deep Learning Method. Sensors. 2022;22:8185.
Wang Y-W, Huang L, Jiang S-W, Li K, Zou J, Yang S-Y. CapsCarcino: A novel sparse data deep learning tool for predicting carcinogens. Food and Chemical Toxicology. 2020;135:110921.
Walters WP, Barzilay R. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction. Accounts of Chemical Research. 2021;54:263–70.
Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics. 2018;19:526.
Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D. InChI, the IUPAC International Chemical Identifier. J Cheminform. 2015;7:23.
Daylight. Daylight Theory: SMARTS - A Language for Describing Molecular Patterns https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. 2012.
Anderson E, Veith GD, Weininger D, editors. SMILES: a line notation and computerized interpreter for chemical structures. In: J. Chem. Inf. Comput. Sci. 1987.
Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A. Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation. 2019. https://doi.org/10.48550/ARXIV.1905.13741.
Jin W, Barzilay R, Jaakkola T. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2018;:17.
Takeda S, Hama T, Hsu H-H, Yamane T, Masuda K, Piunova VA, et al. AI-driven Inverse Design System for Organic Molecules. 2020. https://doi.org/10.48550/arXiv.2001.09038.
Cao N, Kipf T. MolGAN: An implicit generative model for small molecular graphs. 2018. https://doi.org/10.48550/arXiv.1805.11973.
Arús-Pous J, Patronov A, Bjerrum EJ, Tyrchan C, Reymond JL, Chen H, et al. SMILES-based deep generative scaffold decorator for de-novo drug design. Journal of Cheminformatics. 2020;12:1–32.
Wang L, Bai R, Shi X, Zhang W, Cui Y, Wang X, et al. A pocket-based 3D molecule generative model fueled by experimental electron density. Scientific Reports. 2022;12:15100.
You J, Ying R, Ren X, Hamilton WL, Leskovec J. GraphRNN: Generating realistic graphs with deep auto-regressive models. 35th International Conference on Machine Learning, ICML 2018. 2018;13:9072–81.
Ma T, Chen J, Xiao C. Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders. Advances in Neural Information Processing Systems. 2018;:14.
Li Y, Vinyals O, Dyer C, Pascanu R, Battaglia P. Learning Deep Generative Models of Graphs. 2018. https://doi.org/10.48550/arXiv.1803.03324.
Mayr A, Klambauer G, Unterthiner T, Hochreiter S. DeepTox: Toxicity prediction using deep learning. Frontiers in Environmental Science. 2016;3 FEB.
Suzuki T, Katouda M. Predicting toxicity by quantum machine learning. Journal of Physics Communications. 2020;4:1–30.
Cangea C, Grauslys A, Liò P, Falciani F. Structure-Based Networks for Drug Validation. Workshop at NeurIPS. 2018;:1–5.
Sakai M, Nagayasu K, Shibui N, Andoh C, Takayama K, Shirakawa H, et al. Prediction of pharmacological activities from chemical structures with graph convolutional neural networks. Scientific Reports. 2021;11:525.
Gaudelet T, Day B, Jamasb AR, Soman J, Regep C, Liu G, et al. Utilizing graph machine learning within drug discovery and development. Briefings in Bioinformatics. 2021;22:bbab159–bbab159.
Jiang D, Wu Z, Hsieh C-Y, Chen G, Liao B, Wang Z, et al. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. Journal of Cheminformatics. 2021;13:12.
Sanchez-Lengeling B, Wei JN, Lee BK, Gerkin RC, Aspuru-Guzik A, Wiltschko AB. Machine Learning for Scent: Learning Generalizable Perceptual Representations of Small Molecules. 2019. https://doi.org/10.48550/arXiv.1910.10685.
Keller A, Gerkin RC, Guan Y, Dhurandhar A, Turu G, Szalai B, et al. Predicting human olfactory perception from chemical features of odor molecules. Science (New York, NY). 2017;355:820–6.
Lötsch J, Kringel D, Hummel T. Machine Learning in Human Olfactory Research. Chemical Senses. 2019;44:11–22.
Genva M, Kemene TK, Deleu M, Lins L, Fauconnier ML. Is it possible to predict the odor of a molecule on the basis of its structure? International Journal of Molecular Sciences. 2019;20.
Schicker D, Singh S, Freiherr J, Grasskamp AT. OWSum: algorithmic odor prediction and insight into structure-odor relationships. J Cheminform. 2023;15:51.
David L, Thakkar A, Mercado R, Engkvist O. Molecular representations in AI-driven drug discovery: a review and practical guide. Journal of Cheminformatics. 2020;12:1–22.
O’Boyle NM. Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI. J Cheminform. 2012;4:22.
Liu S, Wang H, Liu W, Lasenby J, Guo H, Tang J. Pre-training Molecular Graph Representation with 3D Geometry-Rethinking Self-Supervised Learning on Structured Data. https://arxiv.org/abs/211007728. 2021;:1–19.
Teredesai A, Kumar V, Li Y, Rosales R, Terzi E, Karypis G, editors. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. In: 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage AK USA: ACM; 2019.
Eickenberg M, Exarchakis G, Hirn M, Mallat S. Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3D electronic densities. Advances in Neural Information Processing Systems. 2017;2017-Decem Nips 2017:6541–50.
Xu M, Wang W, Luo S, Shi C, Bengio Y, Gomez-Bombarelli R, et al. An End-to-End Framework for Molecular Conformation Generation via Bilevel Programming. 2021. https://doi.org/10.48550/2105.07246.
Elton DC, Boukouvalas Z, Fuge MD, Chung PW. Deep learning for molecular design - A review of the state of the art. Molecular Systems Design and Engineering. 2019;4:828–49.
Joshi RP, Gebauer NWA, Bontha M, Khazaieli M, James RM, Brown JB, et al. 3D-Scaffold: A Deep Learning Framework to Generate 3D Coordinates of Drug-like Molecules with Desired Scaffolds. The Journal of Physical Chemistry B. 2021;125:12166–76.
Gebauer NWA, Gastegger M, Hessmann SSP, Müller K-R, Schütt KT. Inverse design of 3d molecular structures with conditional generative neural networks. Nature Communications. 2022;13:973.
Simm GNC, Pinsler R, Hernández-Lobato JM. Reinforcement learning for molecular design guided by quantum mechanics. 37th International Conference on Machine Learning, ICML 2020. 2020;PartF16814:8906–16.
Nesterov V, Wieser M, Roth V. 3DMolNet: A Generative Network for Molecular Structures. 2020. https://doi.org/10.48550/2010.06477.
Lewis AM, Grisafi A, Ceriotti M, Rossi M. Learning Electron Densities in the Condensed Phase. Journal of chemical theory and computation. 2021;17:7203–14.
Parr RG. Density Functional Theory of Atoms and Molecules BT - Horizons of Quantum Chemistry. Horizons of Quantum Chemistry. 1980;:5–15.
Geerlings P, De Proft F, Langenaeker W. Conceptual Density Functional Theory. Chem Rev. 2003;103:1793–874.
Nordholm S. From Electronegativity towards Reactivity-Searching for a Measure of Atomic Reactivity. Molecules (Basel, Switzerland). 2021;26.
Franco-Pérez M, Gázquez JL. Electronegativities of Pauling and Mulliken in Density Functional Theory. J Phys Chem A. 2019;123:10065–71.
Baekelandt BG, Mortier WJ, Lievens JL, Schoonheydt RA. Probing the reactivity of different sites within a molecule or solid by direct computation of molecular sensitivities via an extension of the electronegativity equalization method. J Am Chem Soc. 1991;113:6730–4.
Jesús Sánchez-Márquez. Electronegativity equalization principle: new approaches and models for the study of chemical reactivity. In: Chemical Reactivity Volume 2: Approaches and Applications. Elsevier; 2023. p. 227–42.
John J. Irwin and Brian K. Shoichet. ZINC – A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Model. 2005;45(1):177–82.
National Library of Medicine P. PubChem Rest API. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{cas_num}/property/IsomericSMILES/JSON
Landrum G. RDKit: Open-source cheminformatics. 2010. http://www.rdkit.org. Accessed 10 November 2023.
Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94.
Empire & EH5cube, Cepos InSilico. 2020.
Dewar MJS, Zoebisch EG, Healy EF, Stewart JJP. Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model. Journal of the American Chemical Society. 1985;107:3902–9.
Food and Drug Administration. Select Committee on GRAS Substances, https://www.cfsanappsexternal.fda.gov/scripts/fdcc/?set=SCOGS 1980. Accessed 10 November 2023.
Wishart D.S., "FooDB". https://www.foodb.ca. Accessed 10 November 2023.
Sudre CH, Li W, Vercauteren T, Ourselin S, Jorge Cardoso M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2017;10553 LNCS:240–8.
MONAI Consortium. MONAI: Medical Open Network for AI. 2023.
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32 NeurIPS.
Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna. In: Teredesai A, Kumar V, Li Y, Rosales R, Terzi E, Karypis G, editors. Anchorage AK USA: ACM; 2019. p. 2623–31.

071223BMCJourCheminfeDenSuppFinal.docx
Additional material Supplementary Material (.docx): Additional 5-fold cross validation results performed on the ECHA are shown in Supplementary Material in Table S1 and the hyperparameters chosen for both datasets shown in Table S2.

Editorial decision: Revision requested
17 Dec, 2023
Submission checks completed at journal
08 Dec, 2023
Editor assigned by journal
08 Dec, 2023
First submitted to journal
07 Dec, 2023