Next Article in Journal
Predicting Potential Endocrine Disrupting Chemicals Binding to Estrogen Receptor α (ERα) Using a Pipeline Combining Structure-Based and Ligand-Based in Silico Methods
Next Article in Special Issue
AKT Inhibitors: The Road Ahead to Computational Modeling-Guided Discovery
Previous Article in Journal
Microscopic Interactions of Melatonin, Serotonin and Tryptophan with Zwitterionic Phospholipid Membranes
Previous Article in Special Issue
Molecular Docking and QSAR Studies as Computational Tools Exploring the Rescue Ability of F508del CFTR Correctors
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming

1
Department of Applied Mathematics and Physics, Kyoto University, Kyoto 606-8501, Japan
2
Graduate School of Advanced Integrated Studies in Human Survivability (Shishu-Kan), Kyoto University, Kyoto 606-8306, Japan
3
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Int. J. Mol. Sci. 2021, 22(6), 2847; https://0-doi-org.brum.beds.ac.uk/10.3390/ijms22062847
Submission received: 19 February 2021 / Revised: 5 March 2021 / Accepted: 7 March 2021 / Published: 11 March 2021
(This article belongs to the Special Issue QSAR and Chemoinformatics in Molecular Modeling and Drug Design 2.0)

Abstract

:
A novel framework for inverse quantitative structure–activity relationships (inverse QSAR) has recently been proposed and developed using both artificial neural networks and mixed integer linear programming. However, classes of chemical graphs treated by the framework are limited. In order to deal with an arbitrary graph in the framework, we introduce a new model, called a two-layered model, and develop a corresponding method. In this model, each chemical graph is regarded as two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with bounded height, the interior is the connected subgraph obtained by ignoring the exterior, and the feature vector consists of the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Our method is more flexible than the existing method in the sense that any type of graphs can be inferred. We compared the proposed method with an existing method using several data sets obtained from PubChem database. The new method could infer more general chemical graphs with up to 50 non-hydrogen atoms. The proposed inverse QSAR method can be applied to the inference of more general chemical graphs than before.

1. Introduction

Computer-aided design of chemical structures is one of the key topics in chemoinformatics. In particular, extensive studies have been done on inverse quantitative structure–activity relationships (inverse QSAR), which seek chemical structures having desired chemical activities under some constraints. In this framework, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors in chemoinformatics and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for inverse QSAR [1,2,3]. In many of such methods, inference or enumeration of graph structures from a given set of descriptors is a crucial subtask. Although various methods have been developed for that purpose [4,5,6,7], enumeration still remains a challenging task because the number of possible chemical graphs is huge, for example, chemical graphs with up to 30 atoms (vertices) C, N, O, and S, may exceed 10 60 [8]. Furthermore, even inference is a challenging task because it is NP-hard (computationally difficult) except for some simple cases [9]. Due to this inherent difficulty, most existing methods for inverse QSAR do not guarantee optimal or exact solutions.
On the other hand, the design of novel graph structures has recently become a hot topic in artificial neural network (ANN) studies, and thus extensive studies have been done for inverse QSAR using ANNs, especially with graph convolutional networks [10]. For example, variational autoencoders [11], recurrent neural networks [12,13], grammar variational autoencoders [14], generative adversarial networks [15], and invertible flow models [16,17] have been applied. Note that QSAR using three-dimensional structures of chemical compounds (3D-QSAR) has also been studied [18]. Particularly, comparative molecular field analysis (CoMFA) has been extensively studied and applied to various molecular design problems [19,20]. In CoMFA, electrostatic potential interaction energies across superimposed molecular structures are used as descriptors and then regression is performed by using the partial least squares (PLS) fitting. Recently, deep neural networks have been applied to 3D-QSAR by combining potential interaction energies with convolutional neural networks [21]. However, in order to apply 3D-QSAR, we need to calculate accurate three-dimensional structures of chemical compounds, which is not a straightforward task.
A novel framework for inferring chemical graphs has recently been developed [22,23] based on ANNs and mixed integer linear programming (MILP), as illustrated in Figure 1. It constructs a prediction function in the first phase and infers a chemical graph in the second phase. The first phase of the framework consists of three stages. In Stage 1, we choose a chemical property π and a class G of graphs, where a property function a is defined so that a ( G ) is the value of π in G G , and collect a data set D π of chemical graphs in G such that a ( G ) is available. In Stage 2, we introduce a feature function f : G R K for a positive integer K. In Stage 3, we construct a prediction function η N with an ANN N that, given a vector x R K , returns a value y = η N ( x ) R so that η N ( f ( G ) ) serves as a predicted value to a ( G ) for each G D π . Given a target chemical value y * , the second phase infers chemical graphs G * with η N ( f ( G * ) ) = y * in the next two stages. In Stage 4, we formulate an MILP that simulates the construction of f ( G ) from G and the computation process in the ANN so that given a target value, y * , and solve the MILP to infer a chemical graph G and a feature vector x * such that f ( G ) = x * and η N ( x * ) = y * . In Stage 5, we generate other chemical graphs G * such that η N ( f ( G * ) ) = y * based on the output chemical graph G .
MILP formulations required in Stage 4 have been designed for chemical compounds with cycle index 0 (i.e., acyclic) [23,24], cycle index 1 [25], and cycle index 2 [26]. In particular, Azam et al. [24] introduced a restricted class of acyclic graphs that is characterize by an integer ρ , called a “branch-parameter” such that the restricted class still covers most of the acyclic chemical compounds in the database.
Recently, Akutsu and Nagamochi [27] extended the idea to define a restricted class of cyclic graphs, called “ ρ -lean cyclic graphs”, that covers most of the cyclic chemical compounds in the database. Based on this, they also defined a set of rules for specifying several topological substructures of a target chemical graph in a flexible way in Stage 4 before we solve an MILP. The method has been implemented by Zhu et al. [28], and computational results showed that chemical graphs with around up to 50 non-hydrogen atoms can be inferred. Although the method can infer the class of ρ -lean cyclic graphs and specify topological structures of the cyclic part, we still need to introduce a new model to deal with an arbitrary graph and to include a prescribed structure in the acyclic part of a target chemical graph.
In this paper, we introduce a new model, called a two-layered model, for representing the feature of a chemical graph in order to deal with an arbitrary graph in the framework. In the two-layered model, a chemical graph G with a parameter ρ 1 is regarded as two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with height at most ρ and the interior is the connected subgraph obtained by ignoring the exterior. We define a feature vector f ( G ) of a chemical graph G to be the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Figure 2 illustrates an example of a chemical graph G. For a branch-parameter ρ = 2 , the interior of the chemical graph G in Figure 2 is obtained by removing the set of vertices with degree 1 ρ = 2 times, i.e., first remove the set V 1 = { w 1 , w 2 , , w 14 } of vertices of degree 1 in G, and then remove the set V 2 = { w 15 , w 16 , , w 19 } of vertices of degree 1 in G V 1 , where the removed vertices become the exterior-vertices of G and there are eight rooted trees T 1 , T 2 , , T 8 in the exterior of G.
We also introduce a new set of rules for specifying topological substructures of a target chemical graph G to be inferred so that a prescribed structure can be included in both of the acyclic and cyclic parts of G. The set of rules contains (i) a seed graph G C as an abstract form of a target chemical graph G; (ii) a set F of chemical rooted trees as candidates for trees in the exterior of G; and (iii) lower and upper bounds on the number of components in a target chemical graph such as chemical elements, double/triple bounds and the interior-vertices in G. Figure 3a,b illustrates examples of a seed graph G C and a set F of chemical rooted trees, respectively. Given a seed graph G C , the interior of a target chemical graph G is constructed from G C by replacing some edges a = u v with paths P a between the end-vertices u and v, and by attaching new paths Q v to some vertices v. For example, the chemical graph G in Figure 2 is constructed from the seed graph G C in Figure 3a as follows. First replace five edges a 1 = u 1 u 2 , a 2 = u 1 u 3 , a 3 = u 4 u 7 , a 4 = u 10 u 11 and a 5 = u 11 u 12 in G C with new paths P a 1 = ( u 1 , u 13 , u 2 ) , P a 2 = ( u 1 , u 14 , u 3 ) , P a 3 = ( u 4 , u 15 , u 16 , u 7 ) , P a 4 = ( u 10 , u 17 , u 18 , u 19 , u 11 ) and P a 5 = ( u 11 , u 20 , u 21 , u 22 , u 12 ) , respectively, to obtain the subgraph G 1 of G that consists of vertices depicted with squares. Next, attach to this graph G 1 three new paths, Q u 5 = ( u 5 , u 24 ) , Q u 18 = ( u 18 , u 25 , u 26 , u 27 ) , and Q u 22 = ( u 22 , u 28 ) , to obtain the interior of G in Figure 2. Finally, the chemical graph G in Figure 2 is obtained by attaching eight trees T 1 , T 2 , , T 8 selected from the set F and assigning chemical elements and bond-multiplicities in the interior. The frequency of chemical elements and the graph size are controlled with lower and upper bounds on the components in a target chemical graph G. See Section 2.2 for more details on the specification.
We implemented the two-layered model and the results of computational experiments suggest that the proposed method can infer chemical graphs with around up to 50 non-hydrogen atoms.
The paper is organized as follows. Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors. Section 2.2 introduces a method of specifying topological substructures of target chemical graphs in Stage 4. Section 3 reports the results on some computational experiments conducted for chemical properties such as octanol/water partition coefficient, boiling point, melting point, flash point, lipophylicity, and solubility. Section 4 makes some concluding remarks. An MILP formulation used in Stage 4 and a review of the dynamic programming algorithm for generating isomers in Stage 5 can be found in Supplementary Materials. The proposed method/system is available at GitHub https://github.com/ku-dml/mol-infer.

2. Materials and Methods

This section presents mathematical details of our developed method. Readers not interested in mathematical details can skip this section.

2.1. Preliminary

This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.
Let R , Z and Z + denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let [ a , b ] denote the set of integers i with a i b .
Graphs. Given a graph G, let V ( G ) and E ( G ) denote the sets of vertices and edges, respectively. For a subset V V ( G ) (resp., E E ( G ) ) of a graph G, let G V (resp., G E ) denote the graph obtained from G by removing the vertices in V (resp., the edges in E ), where we remove all edges incident to a vertex in V in G V . The rank r ( G ) of a graph G is defined to be the minimum | F | of an edge subset F E ( G ) such that G F contains no cycle. A path with two end-vertices u and v is called a u , v -path. An edge e = u 1 u 2 in a connected graph G is called a bridge if the graph G e obtained from G by removing edge e is not connected, i.e., G e consists of two connected graphs G i containing vertex u i , i = 1 , 2 . For a cyclic graph G, an edge e is called a core-edge if it is in a cycle of G or is a bridge e = u 1 u 2 such that each of the connected graphs G i , i = 1 , 2 of G e contains a cycle. A vertex incident to a core-edge is called a core-vertex of G.
A vertex designated in a graph G is called a root. In this paper, we designated at most two vertices as roots, and denote by Rt ( G ) the set of roots of G. We call a graph G rooted (resp., bi-rooted) if | Rt ( G ) | = 1 (resp., | Rt ( G ) | = 2 ), where we call Gunrooted if Rt ( G ) = .
For a graph G, possibly with roots, a leaf-vertex is defined to be a non-root vertex v V ( G ) \ Rt ( G ) with degree 1, call the edge u v incident to a leaf vertex v a leaf-edge, and denote V leaf ( G ) and E leaf ( G ) the sets of leaf-vertices and leaf-edges in G, respectively. For a graph or a rooted graph G, we define graphs G i , i Z + obtained from G by removing the set of leaf-vertices i times so that
G 0 : = G ; G i + 1 : = G i V leaf ( G i ) ,
where we call a vertex v V leaf ( G k ) a leaf k-branch and we say that a vertex v V leaf ( G k ) has height height ht( v ) = k in G. The height ht( T ) of a rooted tree T is defined to be the maximum of ht( v ) of a vertex v V ( T ) . For an integer k 0 , we call a rooted tree T k-lean if T has at most one leaf k-branch. For an unrooted cyclic graph G, we regard the set of non-core-edges in G induces a collection T of trees each of which is rooted at a core-vertex, where we call G k-lean if each of the rooted trees in T is k-lean. Nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean [24].
Two-layered Model. Let G be an unrooted graph. For an integer ρ 0 , which we call a branch-parameter, a two-layered model of G is a partition of G into an “interior” and an “exterior” in the following way. We call a vertex v V ( G ) (resp., an edge e E ( G ) ) of G an exterior-vertex (resp., exterior-edge) if ht( v ) < ρ (resp., e is incident to an exterior-vertex) and denote the sets of exterior-vertices and exterior-edges by V ex ( G ) and E ex ( G ) , respectively and denote V int ( G ) = V ( G ) \ V ex ( G ) and E int ( G ) = E ( G ) \ E ex ( G ) , respectively. We call a vertex in V int ( G ) (resp., an edge in E int ( G ) ) an interior-vertex (resp., interior-edge). The set E ex ( G ) of exterior-edges forms a collection of connected graphs each of which is regarded as a rooted tree T rooted at the vertex v V ( T ) with the maximum ht( v ) , where we call T a ρ-fringe-tree (or a fringe-tree). Let T ex ( G ) denote the set of fringe-trees in G. The interior of G is defined to be the subgraph ( V int ( G ) , E int ( G ) ) of G. Note that every core-vertex (resp., core-edge) in G is an interior-vertex (resp., interior-edge) of G. Figure 2 illustrates an example of a graph G, such that V int = { u 1 , u 2 , , u 28 } , V ex = { w 1 , w 2 , , w 19 } and T ex ( G ) = { T 1 , T 2 , , T 8 } for a branch-parameter ρ = 2 .

2.1.1. Modeling of Chemical Compounds

To represent a chemical compound, we assume that each chemical element a has a unique valence val ( a ) [ 1 , 4 ] and we use a hydrogen-suppressed model, because hydrogen atoms can be added at the final stage under the assumption. In the hydrogen-suppressed model, a chemical compound C is represented by a tuple G = ( H , α , β ) of a simple, connected undirected graph H and functions α : V ( H ) Λ and β : E ( H ) [ 1 , 3 ] , where Λ is a set of non-hydrogen chemical elements such as C (carbon), O (oxygen), N (nitrogen), and so on. The set of atoms and the set of bonds in the compound C are represented by the vertex set V ( H ) and the edge set E ( H ) , respectively. The chemical element assigned to a vertex v V ( H ) is represented by α ( v ) and the bond-multiplicity between two adjacent vertices u , v V ( H ) is represented by β ( e ) of the edge e = u v E ( H ) . We say that two tuples ( H i , α i , β i ) , i = 1 , 2 are isomorphic if they admit an isomorphism ϕ , i.e., a bijection ϕ : V ( H 1 ) V ( H 2 ) such that u v E ( H 1 ) , α 1 ( u ) = a , α 1 ( v ) = b , β 1 ( u v ) = m ϕ ( u ) ϕ ( v ) E ( H 2 ) , α 2 ( ϕ ( u ) ) = a , α 2 ( ϕ ( v ) ) = b , β 2 ( ϕ ( u ) ϕ ( v ) ) = m . When H i is rooted at a vertex r i , i = 1 , 2 , ( H i , α i , β i ) , i = 1 , 2 are rooted-isomorphic (r-isomorphic) if they admit an isomorphism ϕ such that ϕ ( r 1 ) = r 2 . Chemical rooted trees T 1 and T 5 in Figure 2 are r-isomorphic.
Associated with the two functions α and β in a tuple G = ( H , α , β ) , we introduce the following functions: β G : V ( H ) [ 0 , 12 ] , ac : V ( E ) Λ × Λ × [ 1 , 3 ] , cs : V ( E ) Λ × [ 1 , 4 ] , and ec : V ( E ) ( Λ × [ 1 , 4 ] ) × ( Λ × [ 1 , 4 ] ) × [ 1 , 3 ] .
For a notational convenience, we use a function β G : V ( H ) [ 0 , 4 ] such that β G ( u ) means the sum of bond-multiplicities of edges incident to a vertex u, i.e.,
β G ( u ) u v E ( H ) β ( u v )   for   each   vertex   u V ( H ) .
A chemical graph G is defined to be a tuple ( H , α , β ) such that the valence condition at each vertex v V ( H ) is satisfied, i.e.,
β G ( v ) val ( α ( v ) ) ,
where we define the hydro-degree deg hyd ( v ) of a vertex v to be val ( α ( v ) ) β G ( v ) .
Figure 2 illustrates an example of a chemical graph G = ( H , α , β ) .
To represent a feature of an edge e = u v E ( H ) such that α ( u ) = a , α ( v ) = b and β ( e ) = m in a chemical graph G = ( H , α , β ) , we use a tuple ( a , b , m ) Λ × Λ × [ 1 , 3 ] , which we call the adjacency-configuration ac ( e ) of the edge e. We introduce a total order < over the elements in Λ to distinguish with ( a , b , m ) and ( b , a , m ) ( a b ) notationally. For a tuple ν = ( a , b , m ) , let ν ¯ denote the tuple ( b , a , m ) .
To represent a feature of a vertex v V ( H ) with α ( v ) = a that has d atoms in its neighbor in a chemical graph G = ( H , α , β ) , we use a pair ( a , d ) Λ × [ 1 , 4 ] , which we call the chemical symbol cs ( v ) of the vertex v. We treat ( a , d ) as a single symbol a d , and define Λ dg to be the set of all chemical symbols μ = a d Λ × [ 1 , 4 ] .
To represent a feature of an edge e = u v E ( H ) such that cs ( u ) = μ , cs ( v ) = ξ and β ( e ) = m in a chemical graph G = ( H , α , β ) , we use a tuple ( μ , ξ , m ) Λ dg × Λ dg × [ 1 , 3 ] , which we call the edge-configuration ec ( e ) of the edge e. We introduce a total order < over the elements in Λ dg to distinguish with ( μ , ξ , m ) and ( ξ , μ , m ) ( μ ξ ) notationally. For a tuple γ = ( μ , ξ , m ) , let γ ¯ denote the tuple ( ξ , μ , m ) .
To represent a feature of the exterior of a chemical graph G = ( H , α , β ) , a ρ -fringe-tree in T ex ( G ) is called a fringe-configuration in the exterior.

2.1.2. Introducing Descriptors of Feature Vectors

This section introduces descriptors to define our feature vectors. Let π be a chemical property for which we will construct a prediction function η N from a feature vector f ( G ) of a chemical graph to a predicted value y R for the chemical property of G.
We first choose a set Λ of non-hydrogen chemical elements and then collect a data set D π of chemical compounds C whose chemical elements belong to Λ { H } , where we regard D π as a set of chemical graphs that represent the chemical compounds C in D π . To define the interior/exterior of chemical graphs G D π , we next choose a branch-parameter ρ , where we recommend ρ = 2 .
Let Λ int ( D π ) (resp., Λ ex ( D π ) ) denote the set of chemical elements used in the set of interior-vertices (resp., exterior-vertices) over all chemical graphs G D π , and Γ int ( D π ) denote the set of edge-configurations used in the set of interior-edges over all chemical graphs G D π . Let F ( D π ) denote the set of chemical rooted trees ψ r-isomorphic to a ρ -fringe-tree T T ex ( G ) over all chemical graphs G D π .
We define an integer encoding of a finite set A of elements to be a bijection σ : A [ 1 , | A | ] , where we denote by [ A ] the set [ 1 , | A | ] of integers. Introduce an integer coding of each of the sets Λ int ( D π ) , Λ ex ( D π ) , Γ int ( D π ) and F ( D π ) . Let [ a ] int (resp., [ a ] ex ) denote the coded integer of an element a Λ int ( D π ) (resp., a Λ ex ( D π ) ), [ γ ] denote the coded integer of an element γ in Γ int ( D π ) and [ ψ ] denote an element ψ in F ( D π ) .
For each chemical element a Λ , let mass ( a ) and val ( a ) denote the mass and valence of a , respectively. In our model, we use integers mass * ( a ) = 10 · mass ( a ) , a Λ .
We define the feature vector f ( G ) of a chemical graph G = ( H , α , β ) D π to be a vector that consists of the following non-negative integer descriptors dcp i ( G ) , i [ 1 , K ] , where K = 17 + | Λ int ( D π ) | + | Λ ex ( D π ) | + | Γ int ( D π ) | + | F ( D π ) | .
  • dcp 1 ( G ) : the number n ( G ) = | V ( G ) | of vertices in G.
  • dcp 2 ( G ) : the number | V int ( G ) | of interior-vertices in G.
  • dcp 3 ( G ) : the average ms ¯ ( G ) of mass * over all non-hydrogen atoms in G, i.e., ms ¯ ( G ) v V ( G ) mass * ( α ( v ) ) / n ( G ) .
  • dcp i ( G ) , i = 3 + d , d [ 1 , 4 ] : the number dg d ( G ) of interior-vertices of degree d in G.
  • dcp i ( G ) , i = 7 + d , d [ 1 , 4 ] : the number dg d int ( G ) of interior-vertices of interior-degree deg ( V int , E int ) ( v ) = d in the interior ( V int , E int ) of G.
  • dcp i ( G ) , i = 11 + d , d [ 0 , 3 ] : the number hydg d ( G ) of vertices in G of hydro-degree deg hyd ( v ) = d .
  • dcp i ( G ) , i = 15 + m , m [ 2 , 3 ] : the number bd m int ( G ) of interior-edges with bond multiplicity m in G, i.e., bd m int ( G ) { e E int β ( e ) = m } .
  • dcp i ( G ) , i = 17 + [ a ] int , a Λ int ( D π ) : the frequency na a int ( G ) of chemical element a in the set of interior-vertices in G.
  • dcp i ( G ) , i = 17 + | Λ int ( D π ) | + [ a ] ex , a Λ ex ( D π ) : the frequency na a ex ( G ) of chemical element a in the set of exterior-vertices in G.
  • dcp i ( G ) , i = 17 + | Λ int ( D π ) | + | Λ ex ( D π ) | + [ γ ] , γ Γ int ( D π ) : the frequency ec γ ( G ) of edge-configuration γ in the set of interior-edges e E int in G.
  • dcp i ( G ) , i = 17 + | Λ int ( D π ) | + | Λ ex ( D π ) | + | Γ int ( D π ) | + [ ψ ] , ψ F ( D π ) : the frequency fc ψ ( G ) of fringe-configuration ψ in the set of ρ -fringe-trees in G.

2.2. Specifying Target Chemical Graphs

Given a prediction function η N and a target value y * R , we call a chemical graph G * such that η N ( x * ) = y * for the feature vector x * = f ( G * ) a target chemical graph. This section presents a set of rules for specifying topological substructure of a target chemical graph in a flexible way in Stage 4.
We first describe how to reduce a chemical graph G = ( H , α , β ) into an abstract form based on which our specification rules will be defined. To illustrate the reduction process, we use the chemical graph G = ( H , α , β ) in Figure 2.
R1
Removal of all ρ -fringe-trees: The interior H int = ( V int ( H ) , E int ( H ) ) of G is obtained by removing the non-root vertices of each ρ -fringe-trees T T ex ( G ) . Figure 4 illustrates the interior H int of chemical graph G with ρ = 2 in Figure 2.
R2
Removal of some leaf paths: We call a u , v -path Q in H int a leaf path if vertex v is a leaf-vertex of H int and the degree of each internal vertex of Q in H int is 2, where we regard that Q is rooted at vertex u. A connected subgraph S of the interior H int of G is called a cyclical-base if S is obtained from H by removing the vertices in V ( Q u ) \ { u } , u X for a subset X of interior-vertices and a set { Q u u X } of leaf u , v -paths Q u such that no two paths Q u and Q u share a vertex. Figure 5a illustrates a cyclical-base S = H int u X ( V ( Q u ) \ { u } ) of the interior H int for a set { Q u 5 = ( u 5 , u 24 ) , Q u 18 = ( u 18 , u 25 , u 26 , u 27 ) , Q u 22 = ( u 22 , u 28 ) } of leaf paths in Figure 4.
R3
Contraction of some pure paths: A path in S is called pure if each internal vertex of the path is of degree 2. Choose a set P of several pure paths in S so that no two paths share vertices except for their end-vertices. A graph S is called a contraction of a graph S (with respect to P ) if S is obtained from S by replacing each pure u , v -path with a single edge a = u v , where S may contain multiple edges between the same pair of adjacent vertices. Figure 5b illustrates a contraction S obtained from the chemical graph S by contracting each u v -path P a P into a new edge a = u v , where a 1 = u 1 u 2 , a 2 = u 1 u 3 , a 3 = u 4 u 7 , a 4 = u 10 u 11 , and a 5 = u 11 u 12 , and P = { P a 1 = ( u 1 , u 13 , u 2 ) , P a 2 = ( u 1 , u 14 , u 3 ) , P a 3 = ( u 4 , u 15 , u 16 , u 7 ) , P a 4 = ( u 10 , u 17 , u 18 , u 19 , u 11 ) , P a 5 = ( u 11 , u 20 , u 21 , u 22 , u 12 ) } of pure paths in Figure 5a.
We will define a set of rules so that a chemical graph can be obtained from a graph (called a seed graph in the next section) by applying processes R3 to R1 in a reverse way. We specify topological substructures of a target chemical graph with a tuple ( G C , σ int , σ ce ) called a target specification defined under the set of the following rules.
Seed Graphs
A seed graph G C = ( V C , E C ) is defined to be a graph (possibly with multiple edges) such that the edge set E C consists of four sets E ( 2 ) , E ( 1 ) , E ( 0 / 1 ) , and E ( = 1 ) , where each of them can be empty. A seed graph plays a role of the most abstract form S in R3. Figure 3a illustrates an example of a seed graph, where V C = { u 1 , u 2 , , u 12 } , E ( 2 ) = { a 1 , a 2 , , a 5 } , E ( 1 ) = { a 6 } , E ( 0 / 1 ) = { a 7 } , and E ( = 1 ) = { a 8 , a 9 , , a 17 } .
A subdivision S of G C is a graph constructed from a seed graph G C according to the following rules:
-
Each edge e = u v E ( 2 ) is replaced with a u , v -path P e of length at least 2;
-
Each edge e = u v E ( 1 ) is replaced with a u , v -path P e of length at least 1 (equivalently e is directly used or replaced with a u , v -path P e of length at least 2);
-
Each edge e E ( 0 / 1 ) is either used or discarded; and
-
Each edge e E ( = 1 ) is always used directly.
We allow a possible elimination of edges in E ( 0 / 1 ) as an optional rule in constructing a target chemical graph from a seed graph, even though such an operation has not been included in the process R3. A subdivision S plays a role of a cyclical-base in R2. A target chemical graph G = ( H , α , β ) will contain S as a subgraph of the interior H int of G.
Interior-Specification
A graph H * that serves as the interior H int of a target chemical graph G will be constructed as follows. First, construct a subdivision S of a seed graph G C by replacing each edge edge e = u u E ( 2 ) E ( 1 ) with a pure u , u -path P e . Next, construct a supergraph H * of S by attaching a leaf path Q v at each vertex v V C or at an internal vertex v V ( P e ) \ { u , u } of each pure u , u -path P e for some edge e = u u E ( 2 ) E ( 1 ) , where possibly Q v = v , E ( Q v ) = (i.e., we do not attach any new edges to v). We introduce the following rules for specifying the size of H * , the length | E ( P e ) | of a pure path P e , the length | E ( Q v ) | of a leaf path Q v , the number of leaf paths Q v , and a bond-multiplicity of each interior-edge, where we call the set of prescribed constants an interior-specification σ int :
-
Lower and upper bounds n LB int , n UB int Z + on the number of interior-vertices of a target chemical graph G.
-
For each edge e = u u E ( 2 ) E ( 1 ) ,
  • a lower bound LB ( e ) and an upper bound UB ( e ) on the length | E ( P e ) | of a pure u , u -path P e . (For a notational convenience, set LB ( e ) : = 0 , UB ( e ) : = 1 , e E ( 0 / 1 ) and LB ( e ) : = 1 , UB ( e ) : = 1 , e E ( = 1 ) . )
  • a lower bound bl LB ( e ) and an upper bound bl UB ( e ) on the number of leaf paths Q v attached to at internal vertices v of a pure u , u -path P e .
  • a lower bound ch LB ( e ) and an upper bound ch UB ( e ) on the maximum length | E ( Q v ) | of a leaf path Q v attached at an internal vertex v V ( P e ) \ { u , u } of a pure u , u -path P e .
-
For each vertex v V C ,
  • a lower bound ch LB ( e ) and an upper bound ch UB ( e ) on the number of leaf paths Q v attached to v, where 0 ch LB ( e ) ch UB ( e ) 1 .
  • a lower bound ch LB ( v ) and an upper bound ch UB ( v ) on the length | E ( Q v ) | of a leaf path Q v attached to v.
-
For each edge e = u u E C , a lower bound bd m , LB ( e ) and an upper bound bd m , UB ( e ) on the number of edges with bond-multiplicity m [ 2 , 3 ] in u , u -path P e , where we regard P e , e E ( 0 / 1 ) E ( = 1 ) as single edge e.
We call a graph H * that satisfies an interior-specification σ int a σ int -extension of G C , where the bond-multiplicity of each edge has been determined.
Table 1 shows an example of an interior-specification σ int to the seed graph G C in Figure 3.
Figure 6 illustrates an example of an σ int -extension H * of seed graph G C in Figure 3 under the interior-specification σ int in Table 1.
Chemical-Specification
Let H * be a graph that serves as the interior H int of a target chemical graph G, where the bond-multiplicity of each edge in H * has be determined. Finally, we introduce a set of rules for constructing a target chemical graph G from H * by choosing a chemical element a Λ and assigning a ρ -fringe-tree ψ to each interior-vertex v V int . We introduce the following rules for specifying the size of G, a set of chemical rooted trees that are allowed to use as ρ -fringe-trees and lower and upper bounds on the frequency of a chemical element, a chemical symbol, and an edge-configuration, where we call the set of prescribed constants a chemical specification σ ce :
-
Lower and upper bounds n LB , n * Z + on the number of vertices in G, where n LB int n LB n * .
-
Subsets F ( v ) F ( D π ) , v V C and F E F ( D π ) of chemical rooted trees with height at most ρ , where we require that every ρ -fringe-tree T v rooted at a vertex v V C (resp., at an internal vertex v not in V C ) in G belongs to F ( v ) (resp., F E ). Let F * : = F E v V C F ( v ) and Λ ex denote the set of chemical elements assigned to non-root vertices over all chemical rooted trees in F * .
-
A subset Λ int Λ int ( D π ) , where we require that every chemical element α ( v ) assigned to an interior-vertex v in G belongs to Λ int . Let Λ : = Λ int Λ ex and na a ( G ) (resp., na a int ( G ) and na a ex ( G ) ) denote the number of vertices (resp., interior-vertices and exterior-vertices) v such that α ( v ) = a in G.
-
A set Λ dg int Λ × [ 1 , 4 ] of chemical symbols and a set Γ int Γ int ( D π ) of edge-configurations ( μ , ξ , m ) with μ ξ , where we require that the edge-configuration ec ( e ) of an interior-edge e in G belongs to Γ int . We do not distinguish ( μ , ξ , m ) and ( ξ , μ , m ) .
-
Define Γ ac int to be the set of adjacency-configurations such that Γ ac int : = { ( a , b , m ) ( a d , b d , m ) Γ int } . Let ac ν int ( G ) , ν Γ ac int denote the number of interior-edges e such that ac ( e ) = ν in G.
-
Subsets Λ * ( v ) { a Λ int val ( a ) 2 } , v V C , we require that every chemical element α ( v ) assigned to a vertex v V C in the seed graph belongs to Λ * ( v ) .
-
Lower and upper bound functions na LB , na UB : Λ [ 1 , n * ] and na LB int , na UB int : Λ t [ 1 , n * ] on the number of interior-vertices v such that α ( v ) = a in G.
-
Lower and upper bound functions ns LB int , ns UB int : Λ dg int [ 1 , n * ] on the number of interior-vertices v such that cs ( v ) = μ in G.
-
Lower and upper bound functions ac LB int , ac UB int : Γ ac int Z + on the number of interior-edges e such that ac ( e ) = ν in G.
-
Lower and upper bound functions ec LB int , ec UB int : Γ int Z + on the number of interior-edges e such that ec ( e ) = γ in G.
We call a chemical graph G that satisfies a chemical specification σ ce a ( σ int , σ ce ) -extension of G C , and denote by G ( G C , σ int , σ ce ) the set of all ( σ int , σ ce ) -extensions of G C .
Table 2 shows an example of a chemical-specification σ ce to the seed graph G C in Figure 3.
Figure 2 illustrates an example of a ( σ int , σ ce ) -extension of G C obtained from the σ int -extension H * in Figure 6 under the chemical-specification σ ce in Table 2.
Our specification of topological substructures is similar to that proposed by Akutsu and Nagamochi [27], wherein a target chemical graph is restricted to ρ -lean cyclic graphs and prescribed substructures cannot be specified in the acyclic part. In our new method, a chemical graph with any structure can be handled and substructures in the acyclic part can be fixed.

2.3. Examples of Specification

We here present some cases where a target specification ( G C , σ int , σ ce ) can be chosen based on a set G * of given chemical graphs with a similar structure so that G * becomes a subset of G ( G C , σ int , σ ce ) . In such a case, every target chemical graph in G ( G C , σ int , σ ce ) possesses a common structure over the given set G * .
Figure 7 illustrates a set G * of four flavonoids and a seed graph G C for ρ = 2 so that G * G ( G C , σ int , σ ce ) for a choice of an interior-specification σ int and a chemical-specification σ ce . Let Λ : = { C , O } . In the seed graph G C = ( V C , E C ) , we set E ( 1 ) : = { a 1 , a 2 } , E ( 0 / 1 ) : = { a 3 } , and E ( = 1 ) : = E C \ { a 1 , a 2 , a 3 } and predetermine the chemical element α ( u ) for each vertex u V C and the bond-multiplicity β ( e ) for each edge e E ( = 1 ) as in Figure 7e, i.e., Λ * ( u ) : = { a } for a = α ( u ) and bd m , LB ( e ) : = 1 for m = β ( e ) . Figure 7f illustrates a set F * of chemical rooted trees for the 2-fringe-trees in a target chemical graph. For vertices in G C , we set ch UB ( u ) : = 0 , u V C , F ( u i ) : = { ψ 3 } , i [ 1 , 3 ] , F ( u 4 ) : = { ψ 1 , ψ 3 } , F ( u 5 ) : = { ψ 4 } , F ( u 6 ) : = { ψ 2 } , and F ( u ) : = { ψ 1 } , u V C \ { u 1 , u 2 , , u 6 } . For edges a i E ( 1 ) , i = 1 , 2 , we set UB ( a i ) : = 2 , ch UB ( a i ) : = 0 and F E : = { ψ 1 , ψ 2 } , where a pure path P a i may be introduced in a target chemical graph. We see that every given chemical graph G i G * belongs to G ( G C , σ int , σ ce ) by setting the other specification in σ int and σ ce adequately.
Figure 8 illustrates a set G * of three dibenzodiazepine atypical antipsychotics, and a seed graph G C for ρ = 2 so that G * G ( G C , σ int , σ ce ) for a choice of an interior-specification σ int and a chemical-specification σ ce . Let Λ : = { C , O , N , S , Cl } . In the seed graph G C = ( V C , E C ) , we set E ( 2 ) : = { a 1 } and E ( = 1 ) : = E C \ { a 1 } and predetermine the chemical element α ( u ) for each vertex u V C and the bond-multiplicity β ( e ) for each edge e E ( = 1 ) as in Figure 8d. Figure 8e illustrates a set F * of chemical rooted trees for the 2-fringe-trees in a target chemical graph. For vertices in G C , we set ch UB ( u ) : = 0 , u V C \ { u 2 } , ch LB ( u 2 ) : = 0 , ch UB ( u 2 ) : = 4 , F ( u 1 ) : = { ψ 3 , ψ 7 } , F ( u 2 ) : = { ψ 1 , ψ 6 } , F ( u i ) : = { ψ 3 } , i [ 3 , 5 ] , and F ( u ) : = { ψ 1 } , u V C \ { u 1 , u 2 , , u 6 } , where a leaf path Q u 2 may be introduced in a target chemical graph. For edge a 1 E ( 2 ) , we set UB ( a 1 ) : = 3 , ch UB ( a 1 ) : = 0 and F E : = { ψ 1 , ψ 2 , ψ 4 , ψ 8 } . We see that every given chemical graph G i G * belongs to G ( G C , σ int , σ ce ) by setting the other specification in σ int and σ ce adequately.

3. Results

We implemented our method of Stages 1 to 5 for inferring chemical graphs under a given target specification and conducted experiments to evaluate the computational efficiency. We executed the experiments on a PC with Processor: 3.0 GHz Core i7-9700 (3.0 GHz) Memory: 16 GB RAM DDR4. We used ChemDoodle version 10.2.0 for constructing 2D drawings of chemical graphs.
To conduct experiments for Stages 1 to 5, we selected six chemical properties π : octanol/water partition coefficient (KOW), boiling point (BP), melting point (MP), flash point (closed cup) (FP), lipophylicity (LP), solubility (SL) provided by HSDB from PubChem [29] for KOW, BP, MP, and FP, figshare [30] for LP and MoleculeNet [31] for SL.
Results on Phase 1.
We implemented Stages 1, 2, and 3 in Phase 1 as follows.
Stage 1. We set a graph class G to be the set of all chemical graphs with any graph structure, and set a branch-parameter ρ to be 2. For each property π     { KOW, BP, MP, FP, LP, SL}, we first select a set Λ of chemical elements and then collect a data set D π on chemical graphs over the set Λ of chemical elements. To construct the data set D π , we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as N + or an element a Λ whose valence is different from our setting of valence function val .
Table 3 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:
  • Λ : the set of selected chemical elements (hydrogen atoms are added at the final stage);
  • | D π | : the size of data set D π over Λ for property π ;
  • | Γ int ( D π ) | : the number of different edge-configurations of interior-edges over the compounds in D π ;
  • | F ( D π ) | : the number of non-isomorphic chemical rooted trees in the set of all 2-fringe-trees in the compounds in D π ;
  • [ n ̲ , n ¯ ] : the minimum and maximum values of n ( G ) over the compounds G in D π ; and
  • [ a ̲ , a ¯ ] : the minimum and maximum values of a ( G ) in π over compounds G in D π .
Stage 2. We used the new feature function that consists of the descriptors such as fringe-configuration defined in Section 2.1 and let f fc denote the feature function.
Stage 3. Let η : R K R be a prediction function to a property function a : D R with a feature function f : D R K for a data set D of chemical graphs. We define the coefficient of determination R 2 ( f , η , D ) of a prediction function η over a data set D to be
R 2 ( f , η , D ) 1 G D ( a ( G ) η ( f ( G ) ) ) 2 G D ( a ( G ) a ˜ ) 2 f o r a ˜ = 1 | D | G D a ( G ) .
To conduct an experiment in Stage 3, we first constructed ten architectures A j , j [ 1 , 10 ] with one or two hidden layers. For each pair ( π , A j ) of a property π     { KOW, BP, MP, FP, LP, SL}, and an architecture A j , j [ 1 , 10 ] , we constructed five prediction functions in order to evaluate the performance with cross-validation as follows. Partition data set D π into five subsets D π ( i ) , i [ 1 , 5 ] randomly and for each set D π \ D π ( i ) construct an ANN N ( j , i ) and its prediction function η N ( j , i ) using the feature function f fc . We used scikit-learn version 0.23.2 with Python 3.8.5, MLPRegressor and ReLU activation function to construct each ANN N ( j , i ) . We evaluated the resulting prediction function η N ( j , i ) with the coefficient R 2 ( f fc , η N ( j , i ) , D π ( i ) ) of determination for the test set D π ( i ) . For each property π , let t- R cv 2 ( j ) denote the average of R 2 ( f fc , η N ( j , i ) , D π ( i ) ) over all i [ 1 , 5 ] in the cross-validation to an architecture A j .
Table 4 shows the results on Stages 2 and 3, where we denote the following.
-
Λ : the set of selected chemical elements (hydrogen atoms are added at the final stage);
-
L-time: the average time (s) to construct an ANN over all 10 × 5 = 50 ANNs;
-
t- R cv 2 (best): the best value of t- R cv 2 ( j ) over all architectures A j , j [ 1 , 10 ] ;
-
t- R max 2 : the maximum of R 2 ( f fc , η N ( j , i ) , D π ( i ) ) over all j [ 1 , 10 ] , i [ 1 , 5 ] ; and
-
Arch.: The architecture A j , j [ 1 , 10 ] that attains t- R max 2 . An architecture ( K , p , 1 ) (resp., ( K , p 1 , p 2 , 1 ) ) consists of an input layer with K nodes, a hidden layer with p nodes (resp., two hidden layers with p 1 and p 2 nodes, respectively), and an output layer with a single node, where K is equal to the number of descriptors in the feature vector.
From Table 4, we see that the execution of Stage 3 was considerably successful, where most of t- R max 2 are around 0.85 to 0.95 for all six chemical properties.
An Additional Experiment in Stage 3. We conducted an additional experiment to compare our new feature function f fc with the feature function f ec based edge-configuration in the previous method [27] designed with the same framework. Note that the previous feature vector f ec ( G ) can be defined only for a cyclic graph G, whereas our feature vector f fc ( G ) is defined for an arbitrary graph G. For each property π     { KOW, BP, MP, FP, LP, SL}, we set a set Λ of chemical elements to be { C , O , N , S , Cl } and then collect a data set D ˜ π of chemical cyclic graphs from the data set D π of all chemical graphs over the set Λ of chemical elements in the previous experiment. For each of the feature functions f ec and f fc , we constructed five prediction functions with the same set of ten architectures A j , j [ 1 , 10 ] and the data set D ˜ π of chemical cyclic graphs in the same manner of the previous experiment.
Table 5 shows the results of this experiment, where the table also includes the result of prediction functions by f fc in the set D π of all chemical graphs. In the table, we denote the following:
-
| D ˜ π | , | D π | : the size of data set D ˜ π of cyclic graphs (resp., D π of all chemical graphs) for property π ;
-
t- R cv 2 (ave.): the average of R 2 ( f , η N ( j , i ) , D ( i ) ) over all j [ 1 , 10 ] , i [ 1 , 5 ] for f = f ec , f fc and D = D ˜ π , D π ; and
-
t- R cv 2 (best): max j [ 1 , 10 ] { the average of R 2 ( f fc , η N ( j , i ) , D π ( i ) ) over all i [ 1 , 5 ] } .
From Table 5, we see that the score of R 2 of the prediction function by f fc in chemical cyclic graphs (resp., in all chemical graphs) is improved from that by f ec for properties MP and FP (resp., BP, MP, and FP). Recall that our new feature function f fc can be defined for arbitrary graphs and we can select a larger data set than that by f ec in a learning stage. This advantage is observed in the experiment. We guess that the better prediction function for BP (resp., FP) is obtained by using f fc because the size of data set becomes considerably larger from | D ˜ π | = 224 to | D π | = 425 (resp., from | D ˜ π | = 218 to | D π | = 399 ).
Results on Phase 2.
We prepared the following instances (a–d) for conducting experiments of Stages 4 and 5 in Phase 2.
(a)
I a = ( G C , σ int , σ ce ) : The instance used in Section 2.2 to explain the target specification.
(b)
I b , i = ( G C i , σ int i , σ ce i ) , i = 1 , 2 , 3 , 4 : An instance for inferring chemical graphs with rank at most 2. In the four instances I b , i , i = 1 , 2 , 3 , 4 , the following specifications in ( σ int , σ ce ) are common.
  • Set Λ : = { C , N , O } , set Λ dg int to be the set of all possible symbols in Λ × [ 1 , 4 ] , and set Γ int to be the set of all possible edge-configurations. Set Λ * ( v ) : = Λ , v V C .
  • The lower bounds LB , bl LB , ch LB , bd 2 , LB , bd 3 , LB , na LB , na LB int , ns LB int , ac LB int , ec LB int are all set to be 0.
  • The upper bounds UB , bl UB , ch UB , bd 2 , UB , bd 3 , UB , na UB , na UB int , ns UB int , ac UB int , ec UB int are all set to be an upper bound n * on n ( G * ) .
  • For each property π , let F ( D π ) denote the set of 2-fringe-trees in the compounds in D π , and select a subset F π i F ( D π ) with | F π i | = 45 5 i , i [ 1 , 5 ] . For each instance I b , i , set F E : = F ( v ) : = F π i , v V C .
Instance I b , 1 is given by the rank-1 seed graph G C 1 in Figure 9a and Instances I b , i , i = 2 , 3 , 4 are given by the rank-2 seed graph G C i , i = 2 , 3 , 4 in Figure 9b–d.
(i)
For instance I b , 1 , select as a seed graph the monocyclic graph G C 1 = ( V C , E C = E ( 2 ) E ( 1 ) ) in Figure 9a, where V C = { u 1 , u 2 } , E ( 2 ) = { a 1 } and E ( 1 ) = { a 2 } . Set n LB int : = 0 , n UB int : = 12 and n LB : = n * : = 38 . We include a linear constraint ( a 1 ) ( a 2 ) as part of the side constraint.
(ii)
For instance I b , 2 , select as a seed graph the graph G C 2 = ( V C , E C = E ( 2 ) E ( 1 ) E ( = 1 ) ) in Figure 9b, where V C = { u 1 , u 2 , u 3 , u 4 } , E ( 2 ) = { a 1 , a 2 } , E ( 1 ) = { a 3 } and E ( = 1 ) = { a 4 , a 5 } . Set n LB int : = n UB int : = 30 and n LB : = n * : = 50 . We include a linear constraint ( a 1 ) ( a 2 ) .
(iii)
For instance I b , 3 , select as a seed graph the graph G C 3 = ( V C , E C = E ( 2 ) E ( 1 ) E ( = 1 ) ) in Figure 9c, where V C = { u 1 , u 2 , u 3 , u 4 } , E ( 2 ) = { a 1 } , E ( 1 ) = { a 2 , a 3 } and E ( = 1 ) = { a 4 , a 5 } . Set n LB int : = n UB int : = 30 and n LB : = n * : = 50 . We include linear constraints ( a 1 ) ( a 2 ) + ( a 3 ) and ( a 2 ) ( a 3 ) .
(iv)
For instance I b , 4 , select as a seed graph the graph G C 4 = ( V C , E C = E ( 2 ) E ( 1 ) E ( = 1 ) ) in Figure 9d, where V C = { u 1 , u 2 , u 3 , u 4 } , E ( 1 ) = { a 1 , a 2 , a 3 } and E ( = 1 ) = { a 4 , a 5 } . Set n LB int : = n UB int : = 30 and n LB : = n * : = 50 . We include linear constraints ( a 2 ) ( a 1 ) + 1 , ( a 2 ) ( a 3 ) + 1 and ( a 1 ) ( a 3 ) .
We define instances in (c) and (d) in order to find chemical graphs that have an intermediate structure of given two chemical cyclic graphs G A = ( H A = ( V A , E A ) , α A , β A ) and G B = ( H B = ( V B , E B ) , α B , β B ) . Let Λ A int and Λ dg , A int denote the sets of chemical elements and chemical symbols of the interior-vertices in G A , Γ A int denote the sets of edge-configurations of the interior-edges in G A , and F A denote the set of 2-fringe-trees in G A . Analogously define sets Λ B int , Λ dg , B int , Γ B int , and F B in G B .
(c)
I c = ( G C , σ int , σ ce ) : An instance aimed to infer a chemical graph G such that the core of G is equal to the core of G A and the frequency of each edge-configuration in the non-core of G is equal to that of G B . We use chemical compounds CID 24822711 and CID 59170444 in Figure 10a,b for G A and G B , respectively.
Set a seed graph G C = ( V C , E C = E ( = 1 ) ) to be the core of G A .
Set Λ : = { C , N , O } , and set Λ dg int to be the set of all possible chemical symbols in Λ × [ 1 , 4 ] .
Set Γ int : = Γ A int Γ B int and Λ * ( v ) : = { α A ( v ) } , v V C .
Set n LB int : = min { n int ( G A ) , n int ( G B ) } , n UB int : = max { n int ( G A ) , n int ( G B ) } ,
n LB : = min { n ( G A ) , n ( G B ) } 10 and n * : = max { n ( G A ) , n ( G B ) } + 5 .
Set lower bounds LB , bl LB , ch LB , bd 2 , LB , bd 3 , LB , na LB , na LB int , ns LB int and ac LB int to be 0.
Set upper bounds UB , bl UB , ch UB , bd 2 , UB , bd 3 , UB , na UB , na UB int , ns UB int and ac UB int to be n * .
Set ec LB int ( γ ) to be the number of core-edges in G A with γ Γ int and ec UB int ( γ ) to be the number interior-edges in G A and G B with edge-configuration γ .
Let F B ( p ) , p [ 1 , 2 ] denote the set of chemical rooted trees r-isomorphic p-fringe-trees in G B .
Set F E : = F ( v ) : = F B ( 1 ) F B ( 2 ) , v V C .
(d)
I d = ( G C 1 , σ int , σ ce ) : An instance aimed to infer a chemical monocyclic graph G such that the frequency vector of edge-configurations in G is a vector obtained by merging those of G A and G B . We use chemical monocyclic compounds CID 10076784 and CID 44340250 in Figure 10c,d for G A and G B , respectively. Set a seed graph to be the monocyclic seed graph G C 1 = ( V C , E C = E ( 2 ) E ( 1 ) ) with V C = { u 1 , u 2 } , E ( 2 ) = { a 1 } and E ( 1 ) = { a 2 } in Figure 9a.
Set Λ : = { C , N , O } , Λ dg int : = Λ dg , A int Λ dg , B int and Γ int : = Γ A int Γ B int .
Set n LB int : = min { n int ( G A ) , n int ( G B ) } , n UB int : = max { n int ( G A ) , n int ( G B ) } ,
n LB : = min { n ( G A ) , n ( G B ) } and n * : = max { n ( G A ) , n ( G B ) } .
Set lower bounds LB , bl LB , ch LB , bd 2 , LB , bd 3 , LB , na LB , na LB int , ns LB int and ac LB int to be 0.
Set upper bounds UB , bl UB , ch UB , bd 2 , UB , bd 3 , UB , na UB , na UB int , ns UB int and ac UB int to be n * .
For each edge-configuration γ Γ int , let x A * ( γ int ) (resp., x B * ( γ int ) ) denote the number of interior-edges with γ in G A (resp., G B ), γ Γ int and set
x min * ( γ ) : = min { x A * ( γ ) , x B * ( γ ) } , x max * ( γ ) : = max { x A * ( γ ) , x B * ( γ ) } ,
ec LB int ( γ ) : = ( 3 / 4 ) x min * ( γ ) + ( 1 / 4 ) x max * ( γ ) and
ec UB int ( γ ) : = ( 1 / 4 ) x min * ( γ ) + ( 3 / 4 ) x max * ( γ ) .
Set F E : = F ( v ) : = F A F B , v V C .
In Stage 5, before we formulate an MILP for inferring a target chemical graph G for each instance I, we reduce the input layer of an ANN N constructed in Stage 3 so that the input layer consists of input nodes that correspond to the descriptors actually used in the specification ( G C , σ int , σ ce ) of the instance I, i.e., we remove any input nodes in N that represent the frequency of edge-configurations in Γ int ( D π ) and chemical rooted trees ψ F ( D π ) not contained in the specification ( G C , σ int , σ ce ) of I. For example, there are | F ( D π ) | = 109 chemical rooted trees in the set of 2-fringe-trees in the data set D π with π = KOW in Table 3, and an ANN N constructed in Stage 3 contains 109 input nodes that correspond to the descriptors for the fringe-configuration. However, the set of input nodes for the fringe-configuration is reduced to a set of | F * | = 40 input nodes when we formulate an MILP for solving instance I b , 1 , saving the number of integer variables.
Table 6 shows the features of the seven test instances, where we denote the following:
-
Λ : the set of non-hydrogen chemical elements for inferring a target graph;
-
| Γ int | : the number of different edge-configurations of interior-edges for inferring a target graph;
-
| F * | : the number of different chemical rooted trees in the set F * = F E v V C F ( v ) ; and
-
[ n LB int , n UB int ] , [ n LB , n * ] : the lower and upper bounds on n int ( G ) and n ( G ) for inferring a target graph G .
Stage 4. To solve an MILP in Stage 4, we used CPLEX version 12.10. Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12 show the results on Stages 4 and 5, where we denote the following:
-
[ a ̲ , a ¯ ] : the minimum and maximum values of a ( G ) in π over compounds G in D π in Table 3;
-
[ y ̲ , y ¯ ] : y ̲ (resp., y ¯ ) denotes the minimum (resp., maximum) target value y with a ̲ y a ¯ such that the MILP instance for the target value y * = y becomes feasible (i.e., admits a target chemical graph G ). To determine the minimum and minimum target values y ̲ and y ¯ , we solved many numbers of MILP instances. Note that the MILP instance may become infeasible for some value y within the range [ y ̲ , y ¯ ] ;
-
y * : a target value in [ y ̲ , y ¯ ] for a property π ;
-
#v: the number of variables in the MILP in Stage 4;
-
#c: the number of constraints in the MILP in Stage 4;
-
IP-time: the time (sec.) to solve the MILP in Stage 4;
-
n: the number n ( G ) of non-hydrogen atoms in the chemical graph G inferred in Stage 4; and
-
n int : the number n int ( G ) of interior-vertices in the chemical graph G inferred in Stage 4.
Figure 11a illustrates the chemical graph G inferred from instance I c with y * = 3.0 of KOW in Table 7.
Figure 11b illustrates the chemical graph G inferred from instance I d with y * = 1.6 of LP in Table 11.
The topological specification of instances I a , I c and I d is more restricted than that of the other instances, and thereby the feasible target range [ y ̲ , y ¯ ] of I a , I c and I d is rather narrower than the original range [ a ̲ , a ¯ ] for some property π . We see that the running time for solving an MILP instance with n = 50 is 8.5 to 122 (s), which is much smaller than the running time of 61 to 12058 (s) to solve a similar set of MILP instances with n = 50 in the experimental result for the previous method [28].
Stage 5. We computed chemical isomers G * of each target chemical graph G inferred in Stage 4. We execute the algorithm for generating chemical isomers of G up to 100 when the number of all chemical isomers exceeds 100. The algorithm can evaluate a lower bound on the total number of all chemical isomers G without generating all of them.
Table 13 and Table 14 show the computational results of the experiment, where we denote the following:
-
DP-time: the running time (s) to execute the dynamic programming algorithm in Stage 5 to compute a lower bound on the number of all chemical isomers G * of G and generate all (or up to 100) chemical isomers G * ;
-
G-LB: a lower bound on the number of all chemical isomers G * of G ; and
-
#G: the number of all (or up to 100) chemical isomers G * of G generated in Stage 5.
From Table 13 and Table 14, we observe that the running time for generating up to 100 target chemical graphs in Stage 5 is not considerably larger than that in Stage 4.

4. Discussions and Conclusions

The framework of designing chemical graphs using ANNs and MILP has been proposed [23] as a basis of a total system of the QSAR and the inverse of QSAR, where the inverse of a prediction function produced by an ANN is solved by an MILP. The merit of the framework is that the inverse problem can be treated exactly as a mathematical problem, and an MILP instance with a moderate size can be efficiently solved with a fast MILP solver. On the other hand, the main technical concern in applying the framework is in defining a feature vector of a chemical graph in terms of graph theoretical descriptors so that the computation of a feature vector can be simulated with a set of linear constraints in an MILP. So far, the framework has been applied to the design of new methods of inferring several restricted classes of chemical graphs such as the graphs with rank at most 2 and the ρ -lean cyclic graphs [26,28].
Herein, we examine some technical issues in the previous method before we observe some new features of our method in this paper.
In the feature vector of the previous models [26,28], the structure of subgraphs used as descriptors is only a pair of adjacent vertices, called adjacency-configuration or edge-configuration, which is significantly limited from a variety of subgraphs used in a more sophisticated construction of a feature vector such as the fingerprint. However, including the occurrence of a certain subgraph with only a few vertices as part of a feature vector may require realizing a mechanism of the subgraph isomorphism in an MILP that simulates the computation of such an occurrence and can easily make the resulting MILP very complicated and hard to solve. Furthermore, the feature vector can be defined only for cyclic graphs and we need to eliminate any acyclic graphs from the original data set before we construct a prediction function. This may reduce a data set to an unnecessarily small size or reduce the chances of capturing important information on QSAR over all types of graphs.
A branch-parameter ρ was originally introduced as a new measure to the “agglomeration degree” of trees [24] and then used to define restricted classes of acyclic and cyclic graphs [24,27]. In fact, such a restriction on the structure of target chemical graphs was rather necessary to reduce the size of an MILP formulation that simulates a selection process of a target chemical graph from a supergraph (called a scheme graph), where the number of variables and constraints required to infer a chemical graph with n * non-hydrogen atoms is O ( n * ) when some other parameters such as ρ are regarded as constants.
Although nearly 97% of cyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem are 2-lean [24], the way of specifying the topological structure of a target chemical graph in the previous method [26,28] was based on the core and the non-core of a chemical graph, and we could not include a fixed substructure in the non-core of a target chemical graph.
Compared with the previous models, the two-layered model proposed in this paper is rather simple, where a chemical graph is regarded as a combination of the interior and the exterior. The new model can deal with chemical compounds with any graph structure and include a prescribed structure in both of the acyclic and cyclic parts of a target chemical graph as long as the requirement on target chemical graphs is described under the set of specification rules introduced in this paper. This considerably improves the availability of the framework in a practical application.
The feature vector of our two-layered model can be defined for arbitrary graphs. In the new feature vector, the exterior of a chemical graph is encoded into fringe-configurations, i.e., the occurrence of each chemical rooted tree with height at most ρ , where we may regard that the set of such a chemical rooted trees plays a similar role of some types of functional groups. In our method, we include as part of the descriptors of a feature vector the occurrence of each of such chemical rooted trees and the descriptors of our feature vector on the exterior of a chemical graph may have an analogous effect with the fingerprint.
Our specification of target chemical graphs can specify a candidate set F of chemical rooted trees that are allowed to be used as chemical rooted trees in the exterior of a target chemical graph. This allows us to control the chemical property of target chemical graphs in a more meaningful way since chemical properties of some rooted trees in F are known as functional groups and some kinds of rooted trees can be prohibited in a target chemical graph, if necessary, just by excluding from the candidate set F . Although the number | F ( D π ) | of different kinds of such chemical trees in a data set D π from PubChem is approximately up to 300 for ρ = 2 in many cases and the number of input nodes in an ANN N becomes over | F ( D π ) | , we derived an MILP formulation for inferring a chemical graph with with n * non-hydrogen atoms and a candidate set F of chemical rooted trees by using O ( n * + | F | ) variables and O ( n * | F | ) constraints when the number of interior-vertices is constant, where | F | can be quite small compared with | F ( D π ) | .
We have implemented the proposed method for inferring chemical compounds with a prescribed topological substructure setting ρ = 2 . The results of computational experiments using some chemical properties such as octanol/water partition coefficient, boiling point, melting point, flash point, lipophylicity, and solubility suggest that the proposed system can infer chemical graphs with 50 non-hydrogen atoms.
For a larger branch-parameter, say ρ = 3 , 4 , we obtain a more variety of chemical rooted trees which provides new descriptors in a feature vector and new candidates for fringe-trees in the exterior in a target chemical graph, whereas the number of different chemical rooted trees in F ( D π ) may increase rapidly.
It is left as a future work to use other learning methods such as decision tree, random forest, graph convolution, and an ensemble method in Stages 3 and 4 in the framework.

Supplementary Materials

Author Contributions

Conceptualization, H.N. and T.A.; methodology, H.N.; software, N.A.A., J.Z., Y.S., K.H., and L.Z.; validation, N.A.A., J.Z., and H.N.; formal analysis, H.N.; data resources, L.Z., K.H., H.N., and T.A.; writing—original draft preparation, H.N.; writing—review and editing, N.A.A. and T.A.; project administration, H.N.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Data Availability Statement

Source code of the implementation of our algorithm is freely available from https://github.com/ku-dml/mol-infer.

Acknowledgments

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ANNartificial neural network
MILPmixed integer linear programming

References

  1. Miyao, T.; Kaneko, H.; Funatsu, K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 2016, 56, 286–299. [Google Scholar] [CrossRef]
  2. Ikebata, H.; Hongo, K.; Isomura, T.; Maezono, R.; Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017, 31, 379–391. [Google Scholar] [CrossRef] [Green Version]
  3. Rupakheti, C.; Virshup, A.; Yang, W.; Beratan, D.N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 2015, 55, 529–537. [Google Scholar] [CrossRef] [PubMed]
  4. Fujiwara, H.; Wang, J.; Zhao, L.; Nagamochi, H.; Akutsu, T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 2008, 48, 1345–1357. [Google Scholar] [CrossRef]
  5. Kerber, A.; Laue, R.; Grüner, T.; Meringer, M. MOLGEN 4.0. MATCH Commun. Math. Comput. Chem. 1998, 37, 205–208. [Google Scholar]
  6. Li, J.; Nagamochi, H.; Akutsu, T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 15, 633–646. [Google Scholar] [CrossRef]
  7. Reymond, J.L. The chemical space project. Acc. Chem. Res. 2015, 48, 722–730. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Bohacek, R.S.; McMartin, C.; Guida, W.C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996, 16, 3–50. [Google Scholar] [CrossRef]
  9. Akutsu, T.; Fukagawa, D.; Jansson, J.; Sadakane, K. Inferring a graph from path frequency. Discrete Appl. Math. 2012, 160, 1416–1428. [Google Scholar] [CrossRef] [Green Version]
  10. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  11. Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276. [Google Scholar] [CrossRef] [PubMed]
  12. Segler, M.H.S.; Kogej, T.; Tyrchan, C.; Waller, M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2017, 4, 120–131. [Google Scholar] [CrossRef] [Green Version]
  13. Yang, X.; Zhang, J.; Yoshizoe, K.; Terayama, K.; Tsuda, K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017, 18, 972–976. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Kusner, M.J.; Paige, B.; Hernández-Lobato, J.M. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1945–1954. [Google Scholar]
  15. De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv 2018, arXiv:1805.11973. [Google Scholar]
  16. Madhawa, K.; Ishiguro, K.; Nakago, K.; Abe, M. GraphNVP: An invertible flow model for generating molecular graphs. arXiv 2019, arXiv:1905.11600. [Google Scholar]
  17. Shi, C.; Xu, M.; Zhu, Z.; Zhang, W.; Zhang, M.; Tang, J. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv 2020, arXiv:2001.09382. [Google Scholar]
  18. Cherkasov, A.; Muratov, E.M.N.; Fourches, D.; Varnek, A.; Baskin, I.I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y.C.; Todeschini, R.; et al. QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 2014, 57, 4977–5010. [Google Scholar] [CrossRef] [Green Version]
  19. Cramer, R.D., III; Patterson, D.E.; Bunce, J.D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988, 110, 5959–5967. [Google Scholar] [CrossRef]
  20. Cramer, R.D. Template CoMFA generates single 3D-QSAR models that, for twelve of twelve biological targets, predict all ChEMBL-tabulated affinities. PLoS ONE 2015, 10, e0129307. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  21. Moriwaki, H.; Tian, Y.-S.; Kawashita, N.; Takagi, T. Three-dimensional classification structure–activity relationship analysis using convolutional neural network. Chem. Pharm. Bull. 2019, 67, 426–432. [Google Scholar] [CrossRef]
  22. Azam, N.A.; Chiewvanichakorn, R.; Zhang, F.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies, Valletta, Malta, 24–26 February 2020; Volume 3, pp. 101–108. [Google Scholar]
  23. Zhang, F.; Zhu, J.; Chiewvanichakorn, R.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees. In Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan, 22–25 September 2020; pp. 433–444. [Google Scholar]
  24. Azam, N.A.; Zhu, J.; Sun, Y.; Shi, Y.; Shurbevski, A.; Zhao, L.; Nagamochi, H.; Akutsu, T. A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. 2020. submitted. [Google Scholar]
  25. Ito, R.; Azam, N.A.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming. In Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP 2020), Las Vegas, NV, USA, 27–30 July 2020. [Google Scholar]
  26. Zhu, J.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms 2020, 13, 124. [Google Scholar] [CrossRef]
  27. Akutsu, T.; Nagamochi, H. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming. arXiv 2020, arXiv:2010.09203. [Google Scholar]
  28. Zhu, J.; Azam, N.A.; Zhang, F.; Shurbevski, A.; Haraguchi, K.; Zhao, L.; Nagamochi, H.; Akutsu, T. A novel method for inferring of chemical compounds with prescribed topological substructures based on integer programming. 2020. submitted. [Google Scholar]
  29. PubChem. Available online: https://pubchem.ncbi.nlm.nih.gov/ (accessed on 13 May 2020).
  30. Figshare. Available online: https://figshare.com/articles/dataset/Lipophilicity_Dataset_-_logD7_4_of_1_130_Compounds/5596750/1 (accessed on 13 May 2020).
  31. A Benchmark for Molecular Machine Learning. Available online: http://moleculenet.ai/datasets-1 (accessed on 13 May 2020).
Figure 1. An illustration of a framework for inferring a set of chemical graphs G * .
Figure 1. An illustration of a framework for inferring a set of chemical graphs G * .
Ijms 22 02847 g001
Figure 2. An illustration of a chemical graph G, where for ρ = 2 , the exterior-vertices are w 1 , w 2 , , w 19 and the interior-vertices are u 1 , u 2 , , u 28 .
Figure 2. An illustration of a chemical graph G, where for ρ = 2 , the exterior-vertices are w 1 , w 2 , , w 19 and the interior-vertices are u 1 , u 2 , , u 28 .
Ijms 22 02847 g002
Figure 3. (a) An illustration of a seed graph G C where the vertices in V C are depicted with gray squares, the edges in E ( 2 ) are depicted with dotted lines, the edges in E ( 1 ) are depicted with dashed lines, the edges in E ( 0 / 1 ) are depicted with gray bold lines, and the edges in E ( = 1 ) are depicted with black solid lines. (b) A set F = { ψ 1 , ψ 2 , , ψ 11 } F ( D π ) of 11 chemical rooted trees ψ i , i [ 1 , 11 ] , where the root of each tree is depicted with a black circle.
Figure 3. (a) An illustration of a seed graph G C where the vertices in V C are depicted with gray squares, the edges in E ( 2 ) are depicted with dotted lines, the edges in E ( 1 ) are depicted with dashed lines, the edges in E ( 0 / 1 ) are depicted with gray bold lines, and the edges in E ( = 1 ) are depicted with black solid lines. (b) A set F = { ψ 1 , ψ 2 , , ψ 11 } F ( D π ) of 11 chemical rooted trees ψ i , i [ 1 , 11 ] , where the root of each tree is depicted with a black circle.
Ijms 22 02847 g003
Figure 4. The interior H int of chemical graph G with ρ = 2 in Figure 2.
Figure 4. The interior H int of chemical graph G with ρ = 2 in Figure 2.
Ijms 22 02847 g004
Figure 5. (a) A cyclical-base S = H int u { u 5 , u 18 , u 22 } ( V ( Q u ) \ { u } ) of the interior H int in Figure 4; (b) A contraction S of S for a pure path set P = { P a 1 , P a 2 , , P a 5 } in (a), where a new edge obtained by contracting a pure path is depicted with a thick line.
Figure 5. (a) A cyclical-base S = H int u { u 5 , u 18 , u 22 } ( V ( Q u ) \ { u } ) of the interior H int in Figure 4; (b) A contraction S of S for a pure path set P = { P a 1 , P a 2 , , P a 5 } in (a), where a new edge obtained by contracting a pure path is depicted with a thick line.
Ijms 22 02847 g005
Figure 6. An illustration of a graph H * that is obtained from the seed graph G C in Figure 3 under the interior-specification σ int in Table 1, where the vertices newly introduced by pure paths P a i and leaf paths Q v i are depicted with white squares and circles, respectively.
Figure 6. An illustration of a graph H * that is obtained from the seed graph G C in Figure 3 under the interior-specification σ int in Table 1, where the vertices newly introduced by pure paths P a i and leaf paths Q v i are depicted with white squares and circles, respectively.
Ijms 22 02847 g006
Figure 7. Illustration of a set G * = { G 1 , G 2 , G 3 , G 4 } of four flavonoids, a seed graph G C , and a set F * = { ψ 1 , ψ 2 , ψ 3 , ψ 4 } of chemical rooted trees for ρ = 2 : (a) fisetin G 1 ; (b) ruteorinn G 2 ; (c) aurone G 3 ; (d) chalcone G 4 ; (e) G C = ( V C , E C ) ; (f) F * = F E v V C F ( v ) .
Figure 7. Illustration of a set G * = { G 1 , G 2 , G 3 , G 4 } of four flavonoids, a seed graph G C , and a set F * = { ψ 1 , ψ 2 , ψ 3 , ψ 4 } of chemical rooted trees for ρ = 2 : (a) fisetin G 1 ; (b) ruteorinn G 2 ; (c) aurone G 3 ; (d) chalcone G 4 ; (e) G C = ( V C , E C ) ; (f) F * = F E v V C F ( v ) .
Ijms 22 02847 g007
Figure 8. Illustration of a set G * = { G 1 , G 2 , G 3 } of three dibenzodiazepine atypical antipsychotics, a seed graph G C and a set F * = { ψ 1 , ψ 2 , , ψ 8 } of chemical rooted trees for ρ = 2 : (a) clozabine G 1 ; (b) quetiapine G 2 ; (c) olanzapine G 3 ; (d) G C = ( V C , E C ) ; (e) F * = F E v V C F ( v ) .
Figure 8. Illustration of a set G * = { G 1 , G 2 , G 3 } of three dibenzodiazepine atypical antipsychotics, a seed graph G C and a set F * = { ψ 1 , ψ 2 , , ψ 8 } of chemical rooted trees for ρ = 2 : (a) clozabine G 1 ; (b) quetiapine G 2 ; (c) olanzapine G 3 ; (d) G C = ( V C , E C ) ; (e) F * = F E v V C F ( v ) .
Ijms 22 02847 g008
Figure 9. An illustration of seed graphs: (a) A monocyclic graph G C 1 ; (b) A rank-2 cyclic graph G C 2 with two vertex-disjoint cycles; (c) A rank-2 cyclic graph G C 3 with two disjoint cycles sharing a vertex; (d) A rank-2 cyclic graph G C 4 with three cycles.
Figure 9. An illustration of seed graphs: (a) A monocyclic graph G C 1 ; (b) A rank-2 cyclic graph G C 2 with two vertex-disjoint cycles; (c) A rank-2 cyclic graph G C 3 with two disjoint cycles sharing a vertex; (d) A rank-2 cyclic graph G C 4 with three cycles.
Ijms 22 02847 g009
Figure 10. An illustration of chemical compounds for instances I c and I d : (a) G A : CID 24822711; (b) G B : CID 59170444; (c) G A : CID 10076784; (d) G B : CID 44340250.
Figure 10. An illustration of chemical compounds for instances I c and I d : (a) G A : CID 24822711; (b) G B : CID 59170444; (c) G A : CID 10076784; (d) G B : CID 44340250.
Ijms 22 02847 g010
Figure 11. (a) G inferred from I c with y * = 3.0 of KOW; (b) G inferred from I d with y * = 1.6 of LP.
Figure 11. (a) G inferred from I c with y * = 3.0 of KOW; (b) G inferred from I d with y * = 1.6 of LP.
Ijms 22 02847 g011
Table 1. Example 1 of an interior-specification σ int .
Table 1. Example 1 of an interior-specification σ int .
Ijms 22 02847 i001
Table 2. Example 2 of a chemical-specification σ ce .
Table 2. Example 2 of a chemical-specification σ ce .
Ijms 22 02847 i002
Table 3. Data sets for stage 1 in phase 1.
Table 3. Data sets for stage 1 in phase 1.
π Λ | D π | | Γ int ( D π ) | | F ( D π ) | [ n ̲ , n ¯ ] [ a ̲ , a ¯ ]
KOW C,O,N 64424109[4, 58][−7.53, 13.45]
KOW C,O,N,S,Cl 83731142[4, 73][−7.53, 13.45]
BP C,O,N 3582191[4, 30][−11.70, 470.0]
BP C,O,N,S,Cl 42523114[4, 30][−11.70, 470.0]
MP C,O,N 4482294[4, 122][−185.3, 300.0]
MP C,O,N,S,Cl 54826118[4, 122][−185.3, 300.0]
FP C,O,N 3482085[4, 66][−82.99, 300.0]
FP C,O,N,S,Cl 39924107[4, 66][−82.99, 300.0]
LP C,O,N 5922771[6, 60][−3.62, 6.84]
LP C,O,N,S,Cl 7793278[6, 74][−3.62, 6.84]
SL C,O,N 64025111[4, 55][−9.33, 1.11]
SL C,O,N,S,Cl 84731144[4, 55][−11.60, 1.11]
Table 4. Results of Stages 2 and 3 in Phase 1.
Table 4. Results of Stages 2 and 3 in Phase 1.
π Λ L-Timet- R cv 2 (Best)t- R max 2 Arch.
KOWC,O,N 0.70.9590.983(156,10,10,1)
KOWC,O,N,S,Cl 0.70.9470.968(199,20,10,1)
BPC,O,N 3.50.8580.923(135,30,20,1)
BPC,O,N,S,Cl 3.30.8210.899(163,10,1)
MPC,O,N 3.80.7840.893(139,40,1)
MPC,O,N,S,Cl 4.10.7960.880(170,10,10,1)
FPC,O,N 1.10.7500.874(128,40,1)
FPC,O,N,S,Cl 1.80.7070.853(157,10,10,1)
LPC,O,N 0.50.8680.908(121,30,1)
LPC,O,N,S,Cl 0.70.8610.892(137,20,10,1)
SLC,O,N 0.70.8700.913(159,30,1)
SLC,O,N,S,Cl 0.90.8700.903(201,30,20,1)
Table 5. Results of prediction functions by f ec and f fc in data set D ˜ π of cyclic graphs and f fc in data set D π of all graphs.
Table 5. Results of prediction functions by f ec and f fc in data set D ˜ π of cyclic graphs and f fc in data set D π of all graphs.
f = f ec , D = D ˜ π f = f fc , D = D ˜ π f = f fc , D = D π
π | D ˜ π | t- R cv 2 (ave.)t- R cv 2 (Best)t- R cv 2 (ave.)t- R cv 2 (Best) | D π | t- R cv 2 (ave.)t- R cv 2 (Best)
KOW5800.9520.9590.9500.9548370.9440.947
BP2240.6880.7180.6800.6934250.8090.821
MP3480.6680.6940.7120.7365480.7760.796
FP2180.4350.4760.5740.6233990.6880.707
LP7760.8320.8420.8530.8617790.8540.861
SL6380.8510.8630.8530.8618470.8600.870
Table 6. Features of test instances.
Table 6. Features of test instances.
Instance Λ | Γ int | | F * | [ n LB int , n UB int ] [ n LB , n * ]
I a C,O,N1011[30,50][20,28]
I b , 1 C,O,N2840[38,38][6,6]
I b , 2 C,O,N2835[50,50][30,30]
I b , 3 C,O,N2830[50,50][30,30]
I b , 4 C,O,N2825[50,50][30,30]
I c C,O,N812[46,46][24,24]
I d C,O,N78[40,45][18,18]
Table 7. Results of Stage 4 for KOW.
Table 7. Results of Stage 4 for KOW.
Instance [ a ̲ , a ¯ ] [ y ̲ , y ¯ ] y * #v#c IP-Timen n int
I a [−7.53, 13.45][−7.0, 13.4]3.2766391623.93524
I b , 1 [−7.53, 13.45][−7.5, 13.4]3.09894662617.5387
I b , 2 [−7.53, 13.45][−7.5, 13.4]3.011,514893414.05030
I b , 3 [−7.53, 13.45][−7.5, 13.4]3.011,318892624.65030
I b , 4 [−7.53, 13.45][−7.5, 13.4]3.011,122891822.05030
I c [−7.53, 13.45][−7.5, 13.4]3.0786786302.14932
I d [−7.53, 13.45][−7.5, 13.4]3.0539568995.24523
Table 8. Results of Stage 4 for BP.
Table 8. Results of Stage 4 for BP.
Instance [ a ̲ , a ¯ ] [ y ̲ , y ¯ ] y * #v#c IP-Timen n int
I a [−11.70, 470.0][352, 470]411758389822.74225
I b , 1 [−11.70, 470.0][−11, 470]229981664492.7387
I b , 2 [−11.70, 470.0][−11, 470]22911,43687579.15030
I b , 3 [−11.70, 470.0][−11, 470]22911,240874911.05030
I b , 4 [−11.70, 470.0][−11, 470]22911,044874124.05030
I c [−11.70, 470.0][170, 470]3207575845025.94933
I d [−11.70, 470.0][151, 470]310531567194.44323
Table 9. Results of Stage 4 for MP.
Table 9. Results of Stage 4 for MP.
Instance [ a ̲ , a ¯ ] [ y ̲ , y ¯ ] y * #v#c IP-Timen n int
I a [−185.3, 300.0][55, 300]177.57602902316.14124
I b , 1 [−185.3, 300.0][−180, 300]60983364872.3389
I b , 2 [−185.3, 300.0][−185, 300]57.411,453879544.75030
I b , 3 [−185.3, 300.0][−185, 300]57.411,257878710.55030
I b , 4 [−185.3, 300.0][−185, 300]57.411,061877993.95030
I c [−185.3, 300.0][253, 300]260.07580617224.04133
I d [−185.3, 300.0][−75, 299]5851104050104.64523
Table 10. Results of Stage 4 for FP.
Table 10. Results of Stage 4 for FP.
Instance [ a ̲ , a ¯ ] [ y ̲ , y ¯ ] y * #v#c IP-Timen n int
I a [−82.99, 300.0][98, 300]199745986961.63522
I b , 1 [−82.99, 300.0][−82, 300]109969461661.4388
I b , 2 [−82.99, 300.0][−82, 300]10911,31484748.75030
I b , 3 [−82.99, 300.0][−82, 300]10911,118846625.85030
I b , 4 [−82.99, 300.0][−82, 300]10910,92284588.55030
I c [−82.99, 300.0][250, 300]2757667817060.94734
I d [−82.99, 300.0][54, 300]177519364362.04523
Table 11. Results of Stage 4 for LP.
Table 11. Results of Stage 4 for LP.
Instance [ a ̲ , a ¯ ] [ y ̲ , y ¯ ] y * #v#c IP-Timen n int
I a [−3.6, 6.84][−3.6, 6.8]1.6759790081.93923
I b , 1 [−3.6, 6.84][−3.6, 6.8]1.6983664812.9388
I b , 2 [−3.6, 6.84][−3.6, 6.8]1.611,456878921.15030
I b , 3 [−3.6, 6.84][−3.6, 6.8]1.611,260878120.45030
I b , 4 [−3.6, 6.84][−3.6, 6.8]1.611,064877324.25030
I c [−3.6, 6.84][−3.6, 6.8]1.6780184761.14732
I d [−3.6, 6.84][−3.6, 6.8]1.6533567544.34523
Table 12. Results of Stage 4 for SL.
Table 12. Results of Stage 4 for SL.
Instance [ a ̲ , a ¯ ] [ y ̲ , y ¯ ] y * #v#c IP-Timen n int
I a [−9.33, 1.11][−9.3, −2.0]−5.6767491862.44123
I b , 1 [−9.33, 1.11][−9.3, −2.0]−5.69906665022.33812
I b , 2 [−9.33, 1.11][−9.3, −2.0]−5.611,526895815.25030
I b , 3 [−9.33, 1.11][−9.3, −2.0]−5.611,330895016.25030
I b , 4 [−9.33, 1.11][−9.3, −2.0]−5.611,1348942122.75030
I c [−9.33, 1.11][−9.3, −2.0]−5.6787486481.25433
I d [−9.33, 1.11][−9.3, −3.0]−6.1540269178.14323
Table 13. Results of Stage 5 for KOW, LP, and BP.
Table 13. Results of Stage 5 for KOW, LP, and BP.
Kow Lp Bp
InstanceDP-TimeG-LB #GDP-TimeG-LB #GDP-timeG-LB #G
I a 0.03116160.1641281000.164 1.4 × 10 5 100
I b 1 0.149 2.8 × 10 5 1000.148 2.0 × 10 10 1000.162 4.4 × 10 5 100
I b 2 44.1 3.9 × 10 10 10011890010017166
I b 3 27.2202080.26628.677
I b 4 0.166600010073121214255
I c 0.16660001000.1682881000.168 4.0 × 10 5 100
I d 22.3 8.3 × 10 10 1001.44 3.2 × 10 8 1001.7 9.7 × 10 9 100
Table 14. Results of Stage 5 for FP, MP, and SL.
Table 14. Results of Stage 5 for FP, MP, and SL.
FP MP SL
InstanceDP-TimeG-LB #GDP-TimeG-LB #GDP-TimeG-LB #G
I a 0.05732320.1652561000.1651024100
I b 1 0.164 3.1 × 10 6 1000.166 1.4 × 10 6 1000.163 4.5 × 10 5 100
I b 2 28.87201008.26 2.4 × 10 10 1001.07 5.6 × 10 9 100
I b 3 72.2272751.91146.51680100
I b 4 40.32020125 6.1 × 10 7 1007.01 1.1 × 10 8 100
I c 0.169 1.1 × 10 5 1000.17360481000.168120100
I d 0.05732320.17 4.2 × 10 8 1000.1651024100
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Shi, Y.; Zhu, J.; Azam, N.A.; Haraguchi, K.; Zhao, L.; Nagamochi, H.; Akutsu, T. An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming. Int. J. Mol. Sci. 2021, 22, 2847. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms22062847

AMA Style

Shi Y, Zhu J, Azam NA, Haraguchi K, Zhao L, Nagamochi H, Akutsu T. An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming. International Journal of Molecular Sciences. 2021; 22(6):2847. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms22062847

Chicago/Turabian Style

Shi, Yu, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Hiroshi Nagamochi, and Tatsuya Akutsu. 2021. "An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming" International Journal of Molecular Sciences 22, no. 6: 2847. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms22062847

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop