Article

Robust Learning with Implicit Residual Networks

by
Viktor Reshniak
1,*,† and
Clayton G. Webster
2,3,†
1
Data Analysis and Machine Learning, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
2
Department of Mathematics, University of Tennessee at Knoxville, Knoxville, TN 37996, USA
3
Behavioral Reinforcement Learning Lab (BReLL), Lirio LLC, Knoxville, TN 37923, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mach. Learn. Knowl. Extr. 2021, 3(1), 34-55; https://0-doi-org.brum.beds.ac.uk/10.3390/make3010003
Submission received: 16 November 2020 / Revised: 22 December 2020 / Accepted: 25 December 2020 / Published: 31 December 2020
(This article belongs to the Special Issue Explainable Machine Learning)

Abstract

In this effort, we propose a new deep architecture utilizing residual blocks inspired by implicit discretization schemes. As opposed to the standard feed-forward networks, the outputs of the proposed implicit residual blocks are defined as the fixed points of appropriately chosen nonlinear transformations. We show that this choice leads to improved stability of both the forward and backward propagations, has a favorable impact on the generalization power, and allows for controlling the robustness of the network with only a few hyperparameters. In addition, the proposed reformulation of ResNet does not introduce new parameters and can potentially lead to a reduction in the number of required layers due to improved forward stability. Finally, we derive a memory-efficient training algorithm, propose a stochastic regularization technique, and provide numerical results in support of our findings.
Keywords:
ResNet; stability; robust

1. Introduction and Related Works

A large volume of empirical results has been collected in recent years illustrating the striking success of deep neural networks (DNNs) in approximating complicated maps by a mere composition of relatively simple functions [1]. The universal approximation property of DNNs with a relatively small number of parameters has also been shown for a large class of functions [2,3]. The training of deep networks nevertheless remains a notoriously difficult task due to the issues of exploding and vanishing gradients, which become more apparent and noticeable with increasing depth [4]. These issues accelerated the efforts of the research community to explain this behavior and gain new insights into the design of better architectures and faster algorithms. A promising approach in this direction was obtained by casting the evolution of the hidden states $y_t \in Y_t$ of a DNN as a dynamical system [5], i.e.,
$y_{t+1} = \Phi_t(\gamma_t, y_t), \qquad t = 0, \dots, T-1,$
where for each layer t, $\Phi_t : \Gamma_t \times Y_t \to Y_{t+1}$ is a nonlinear transformation parameterized by the weights $\gamma_t \in \Gamma_t$, and $Y_t$, $\Gamma_t$ are appropriately chosen spaces. In the case of a very deep network, when $T \to \infty$, it is convenient to consider the continuous-time limit of the above expression such that
$y(t) = \Phi\big(\gamma(t), x\big), \qquad t > 0,$
where the parametric evolution function $\Phi : \Gamma \times Y \to Y$ defines a continuous flow through the input data $y(0) = x \in Y$. Parameter estimation for such continuous evolution can be viewed as an optimal control problem [6], given by
$\min_{\gamma(t)} \; \mathbb{E}_{\mu}\Big[ L\big(y(T), f(x)\big) + \int_0^T R\big(\gamma(t), y(t)\big)\, dt \Big],$
subject to $y(t) = \Phi\big(\gamma(t), x\big),$
where $L\big(y(T), f(x)\big)$ is a terminal loss function, $R\big(\gamma(t), y(t)\big)$ is a regularizer, and $\mu$ is a probability distribution of the input–target data pairs $(x, f(x))$. More general models additionally consider continuity in the spatial dimension as well by using differential [7] or integral formulations [8]. A continuous-time formulation based on ordinary differential equations (ODEs) was proposed in [9] with the state Equation (2) of the form
$\dot y(t) = \Phi\big(\gamma(t), y(t)\big).$
In the work [9], the authors relied on the black-box ODE solvers and used adjoint sensitivity analysis to derive equations for the backpropagation of errors through the continuous system.
The authors of [10] concentrated on the well-posedness of the learning problem for ODE-constrained control and emphasized the importance of stability in the design of deep architectures. For instance, the solution of a homogeneous linear ODE with constant coefficients
$\dot y(t) = A\, y(t)$
is given by
$y(t) = Q\, e^{\Lambda t} Q^T x,$
where $A = Q \Lambda Q^T$ is the eigendecomposition of the matrix A, and $\Lambda$ is the diagonal matrix with the corresponding eigenvalues. A similar equation holds for the backpropagation of gradients. To guarantee the efficient propagation of information through the network, one must ensure that the elements of $e^{\Lambda t}$ have magnitudes close to one. This condition, of course, is satisfied when all eigenvalues of the matrix A are close to purely imaginary, i.e., have real parts close to zero. In order to preserve this property, the authors of [10] proposed several time-continuous architectures of the form
$\dot y(t) = \Phi_1\big(\gamma_1(t), y(t), z(t)\big), \qquad \dot z(t) = \Phi_2\big(\gamma_2(t), y(t), z(t)\big).$
When $\Phi_1(y, z) = \partial H(y, z)/\partial z$ and $\Phi_2(y, z) = -\partial H(y, z)/\partial y$, the equations above provide an example of a conservative Hamiltonian system with the total energy H.
In the discrete setting of ordinary feed-forward networks, the necessary conditions for the optimal solution of (1) and (2) recover the well-known equations for the forward propagation (state Equation (2)), the backward gradient propagation (co-state equation), and the optimality condition used to compute the weights (gradient descent algorithm); see, e.g., [11]. The continuous setting offers additional flexibility in the construction of discrete networks with the desired properties and efficient learning algorithms. The classical feed-forward network (Figure 1, left) is just the simplest particular example of such a discretization and is prone to all the issues of deep learning. In order to facilitate the training process, a skip-connection is often added to the network (Figure 1, middle), yielding
$y = x + h \cdot \Phi(\gamma, x),$
where h is a positive hyperparameter. Equation (5) can be viewed as a forward Euler scheme to solve the ODE in (3) numerically on a time grid with step size h. While it was shown that such residual layers help to mitigate the problem of vanishing gradients and speed up the training process [12], the scheme has very restrictive stability properties [13]. This can result in the uncontrolled accumulation of errors at the inference stage, reducing the generalization ability of the trained network. Moreover, the Euler scheme is not capable of preserving the geometric structure of conservative flows and is thus a bad choice for the long-time integration of such ODEs [14].
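For concreteness, the explicit residual update in (5) translates directly into code; the following is a minimal PyTorch sketch in which the vector field F and the step size h are illustrative placeholders rather than the architecture used later in this paper.

import torch
import torch.nn as nn

class ExplicitResidualBlock(nn.Module):
    """Forward Euler step y = x + h * F(x), cf. Equation (5)."""
    def __init__(self, dim, h=1.0):
        super().__init__()
        self.h = h
        # placeholder trainable vector field F; any map preserving the state dimension works
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.h * self.F(x)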
Memory-efficient explicit reversible architectures can be obtained by considering time discretization of the partitioned system of ODEs in (4). The reversibility property allows for recovering the internal states of the system by propagating through the network in both directions and thus does not require one to cache these values for the evaluation of the gradients. The first such architecture (RevNet) was proposed in [15] without using a connection to discrete solutions of ODEs; it has the form
$y_1 = x_1 + \Phi_1(\gamma, x_2), \quad y_2 = x_2 + \Phi_2(\gamma, y_1), \qquad x_2 = y_2 - \Phi_2(\gamma, y_1), \quad x_1 = y_1 - \Phi_1(\gamma, x_2).$
It was later recognized as the Verlet method applied to the particular form of the system in (4), see [10,16]. The leapfrog and midpoint networks are two other examples of reversible architectures proposed in [16].
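As a sketch of how such a reversible block operates in practice (with the two residual functions taken to be arbitrary placeholder networks), the forward pass and its exact inverse can be written as follows; the inverse recovers the inputs without caching any intermediate activations.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style block: forward map and its exact inverse (no stored activations)."""
    def __init__(self, dim):
        super().__init__()
        # placeholder residual functions Phi_1 and Phi_2
        self.phi1 = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.phi2 = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, x1, x2):
        y1 = x1 + self.phi1(x2)
        y2 = x2 + self.phi2(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.phi2(y1)
        x1 = y1 - self.phi1(x2)
        return x1, x2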
Other residual architectures can also be found in the literature, including ResNet in ResNet (RiR) [17], the Dense Convolutional Network (DenseNet) [18], and the linearly implicit network (IMEXnet) [19]. For some problems, all of these networks show a substantial improvement over the classical ResNet but still have an explicit structure, which has limited robustness to perturbations of the input data and parameters of the network. Instead, in this effort, we propose a new fully implicit residual architecture which, unlike the above-mentioned examples, is unconditionally stable and robust.
As opposed to the standard feed-forward networks, the outputs of the proposed implicit residual blocks are defined as the fixed points of the appropriately chosen nonlinear transformations as follows:
$y = x + \Phi(\gamma, x, y).$
The right part of Figure 1 provides a graphical illustration of the proposed layer. One can immediately recognize the feedback loop which is typical for recurrent neural networks (RNNs). The standard approach to train RNNs is backpropagation through time, an algorithm which requires substantial memory resources for deep unrolled recurrent networks. The authors of [20,21] utilized an idea of implicit layers related to ours to cope with this issue by directly learning the equilibrium points of "infinite"-depth recurrent models with constant memory complexity. The main difference of our approach is that we design the feedback loop with the specific goal of driving the output of the implicit layer to a stable fixed point. We will discuss the choice of the nonlinear transformation $\Phi$ in (6) and the design of the training algorithm in the next section. It is also worth noting that the idea of learning fixed points is not new and goes back to Hopfield networks and content-addressable ("associative") memory [22,23].
After the first version of this manuscript appeared, another work proposed a similar idea of implicit Euler skip connections to enhance the adversarial robustness of residual networks [24]. The authors of [24] modified the original residual block by complementing it with a fixed number of steps of the gradient descent algorithm and applied adversarial training to the modified architecture. While both efforts are inspired by implicit numerical schemes for integrating ODEs, our approach is more general as we ensure the convergence of the proposed implicit layer to a stable fixed point. As discussed above and unlike the work in [24], in addition to the enhanced stability properties, this approach does not increase the memory complexity of the original residual networks. We also propose initialization and regularization strategies which allow for training the network efficiently and admit a simple interpretation.
The preliminary results for the work proposed in the current manuscript have been presented at the Second Symposium on Machine Learning and Dynamical Systems at the Fields Institute [25]. Here, we provide a completely revised version of this work including an in-depth description of the method, a new regularization approach, and substantially extended numerical results.

2. Description of the Method

We first motivate the necessity of our new method by letting the continuous model of a network be given by the ordinary differential equations in (4), that is,
$\dot y(t) = \Phi_1\big(\gamma_1(t), y(t), z(t)\big), \qquad \dot z(t) = \Phi_2\big(\gamma_2(t), y(t), z(t)\big).$
An s-stage Runge–Kutta (RK) method for the approximate solution of the above equations is given by
$k_i = \Phi_1\Big(\gamma_1(t_0 + c_i h),\ y_0 + h \textstyle\sum_{j=1}^{s} a_{ij} k_j,\ z_0 + h \sum_{j=1}^{s} \hat a_{ij} l_j\Big),$
$l_i = \Phi_2\Big(\gamma_2(t_0 + \hat c_i h),\ y_0 + h \textstyle\sum_{j=1}^{s} a_{ij} k_j,\ z_0 + h \sum_{j=1}^{s} \hat a_{ij} l_j\Big), \qquad i = 1, \dots, s,$
$y_1 = y_0 + h \textstyle\sum_{i=1}^{s} b_i k_i, \qquad z_1 = z_0 + h \sum_{i=1}^{s} \hat b_i l_i.$
The order conditions for the coefficients $a_{ij}$, $b_i$, $c_i$, $\hat a_{ij}$, $\hat b_i$, and $\hat c_i$, which guarantee convergence of the numerical solution, are well known and can be found in any topical text; see, e.g., [13]. Note that, when $a_{ij} \neq 0$ or $\hat a_{ij} \neq 0$ for at least some $j \geq i$, the scheme is implicit, and a system of nonlinear equations has to be solved at each iteration, which obviously increases the complexity of the solver. Nevertheless, the following example illustrates the benefits of using implicit approximations.

2.1. Linear Stability Analysis

Consider the following linear differential system:
$\dot y(t) = -\omega^2 z(t), \qquad \dot z(t) = y(t)$
and four simple discretization schemes [13]:
Forward Euler: $\quad y_1 = y_0 - h \omega^2 z_0, \quad z_1 = z_0 + h y_0,$
Backward Euler: $\quad y_1 = y_0 - h \omega^2 z_1, \quad z_1 = z_0 + h y_1,$
Trapezoidal: $\quad y_1 = y_0 - \frac{h \omega^2}{2}\big(z_0 + z_1\big), \quad z_1 = z_0 + \frac{h}{2}\big(y_0 + y_1\big),$
Verlet: $\quad y_{0.5} = y_0 - \frac{h \omega^2}{2} z_0, \quad z_1 = z_0 + h y_{0.5}, \quad y_1 = y_{0.5} - \frac{h \omega^2}{2} z_1.$
Due to linearity of the system in (8), the numerical solution after n steps can be written as
$\begin{pmatrix} y_n \\ z_n \end{pmatrix} = R^n(h\omega) \begin{pmatrix} y_0 \\ z_0 \end{pmatrix}.$
The long-time behavior of the discrete dynamics is hence determined by the spectral radius of the matrix $R(h\omega)$ (called the stability matrix), which needs to be less than or equal to one for the sake of stability. For example, we have $\lambda_{1,2} = 1 \pm i h \omega$ for the forward Euler scheme, and the method is unconditionally unstable. The backward Euler scheme gives $\lambda_{1,2} = (1 \pm i h \omega)^{-1}$, and the method is unconditionally stable. The corresponding eigenvalues of the trapezoidal scheme have magnitude equal to one for all $\omega$ and h. Finally, the characteristic polynomial of the matrix of the Verlet scheme is given by $\lambda^2 - (2 - h^2 \omega^2)\lambda + 1$, i.e., the method is only conditionally stable when $|h\omega| \leq 2$.
Figure 2 illustrates this behavior for the particular case of $\omega = 50$. Notice that the flows of the forward and backward Euler schemes are strictly expanding and contracting, respectively; if one had to fit the exact flow of the system in (8) using either of these methods, this would result in an inherently ill-posed problem. On the contrary, the implicit trapezoidal and explicit Verlet schemes seem to reproduce the original flow very well, but the latter is conditional on the size of the step h. Another nice property of the trapezoidal and Verlet schemes is their symmetry with respect to exchanging $y_n \leftrightarrow y_{n-1}$ and $z_n \leftrightarrow z_{n-1}$. Such methods play a central role in the geometric integration of reversible differential flows and are handy in the construction of memory-efficient reversible network architectures. Conditions for the reversibility of general Runge–Kutta schemes can be found in [14].
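The spectral radii quoted above are easy to verify numerically. The snippet below builds the stability matrices of the four schemes for the test system in (8) and prints their spectral radii; the step size h is an arbitrary assumption made only for this check and is not part of the original experiments.

import numpy as np

def spectral_radius(R):
    return np.abs(np.linalg.eigvals(R)).max()

w, h = 50.0, 0.03                                   # omega = 50 as in Figure 2; h is assumed
A = np.array([[0.0, -w**2], [1.0, 0.0]])            # u' = A u with u = (y, z), cf. (8)
I = np.eye(2)

schemes = {
    "forward Euler":  I + h * A,
    "backward Euler": np.linalg.inv(I - h * A),
    "trapezoidal":    np.linalg.inv(I - 0.5 * h * A) @ (I + 0.5 * h * A),
    "Verlet":         np.array([[1 - (h * w)**2 / 2, -h * w**2 + h**3 * w**4 / 4],
                                [h,                   1 - (h * w)**2 / 2]]),
}
for name, R in schemes.items():
    print(f"{name:15s} rho(R) = {spectral_radius(R):.4f}")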
The discussion above highlights the importance of the appropriate choice for both the structure of a dynamical system (DS) and the corresponding time integrator. The stability of a general Runge–Kutta method is often studied in application to the simpler scalar test equation
$\dot y(t) = \lambda y,$
and by analogy with (9) its stability function is given by
$R(z) = \frac{\det\big(I - zA + z \mathbb{1} b^T\big)}{\det\big(I - zA\big)}, \qquad z = h\lambda,$
where $\mathbb{1} = (1, \dots, 1)^T$ and $A = (a_{ij})_{i,j=1}^{s}$, $b = (b_j)_{j=1}^{s}$ are the parameters of the RK method in (7), with the remaining parameters set to zero for the scalar test equation. From this expression, one can see that the stability function of any explicit Runge–Kutta method is a polynomial and hence is bounded only for z in a finite region of the complex plane. The set of all such z with $|R(z)| \leq 1$ is called the stability region of the method.

2.2. Implicit ResNet

Motivated by the discussion above, we propose an implicit residual layer in (6) with the nonlinear map $\Phi(\gamma, x, y)$ given by
$\Phi(\gamma, x, y) := (1-\theta)\, F(\gamma, x) + \theta\, F(\gamma, y), \qquad \theta \in [0, 1],$
or $\Phi(\gamma, x, y) := F\big(\gamma, (1-\theta) x + \theta y\big),$
where x, y, γ are the input, output, and parameters of the layer, and F ( γ , x ) is a vector field to be estimated. Table 1 shows the derivatives of the nonlinear maps in (11) and (12) with respect to their arguments.
The stability function of the layers in (11) and (12) is given by
$R(z) = \frac{1 + (1-\theta) z}{1 - \theta z}.$
The corresponding stability regions are illustrated in Figure 3, indicating the improved stability of implicit layers for increasing $\theta$.
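The stability regions shown in Figure 3 can be reproduced directly from (13); the short sketch below simply evaluates |R(z)| on a grid of complex points for several values of θ (plotting is omitted, and the sampled square is an arbitrary choice).

import numpy as np

def stability_function(z, theta):
    """R(z) = (1 + (1 - theta) z) / (1 - theta z), cf. Equation (13)."""
    return (1.0 + (1.0 - theta) * z) / (1.0 - theta * z)

xx, yy = np.meshgrid(np.linspace(-5, 5, 401), np.linspace(-5, 5, 401))
z = xx + 1j * yy
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    with np.errstate(divide="ignore", invalid="ignore"):
        stable = np.abs(stability_function(z, theta)) <= 1.0    # points inside the stability region
    print(f"theta = {theta:4.2f}: stable fraction of the sampled square = {stable.mean():.2f}")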
Additionally, instead of a single layer in (6), one might consider a block of implicit layers on a given "time interval" $t \in [0, T]$:
$y_t = y_{t-1} + \Phi(\gamma_t, y_{t-1}, y_t), \qquad t = 1, \dots, T, \qquad y_0 = x.$
Note that, by viewing (14) as a numerical ODE integrator and to get the above expressions, we have assumed that the ODE coefficients $\gamma(t)$ are piecewise constant càdlàg functions; see Figure 4.
We will use (12) as the definition of implicit layers in the rest of this work.

2.2.1. Forward Propagation

Assume that the fixed point in (6) exists and can be computed. To solve the corresponding nonlinear equation, consider the equivalent minimization problem
$\min_y \|r(y)\|^2, \qquad r(y) := y - x - \Phi(\gamma, x, y).$
One way to construct the required solution is by applying the descent algorithm
$y^{k+1} \leftarrow y^k - \lambda_k s_k, \qquad k = 0, 1, 2, \dots,$
where s k is the descent direction and λ k is the corresponding step size. Several common choices are summarized in Table 2.
The required gradient of the residual norm can be computed as
$\nabla_y \|r(y)\|^2 = 2 \cdot \left(\frac{\partial r(y)}{\partial y}\right)^T \cdot r(y),$
where the Jacobian of the residual vector is given by (cf. Table 1)
$\frac{\partial r(y)}{\partial y} = I - \frac{\partial \Phi(\gamma, x, y)}{\partial y}.$
The expression in (16) can be efficiently computed with the automatic differentiation capabilities of any standard deep learning framework, making it possible to interface existing iterative solvers. Moreover, optimized implementations of gradient descent algorithms are readily provided by any such framework. Hence, the forward propagation for implicit layers can be easily implemented using built-in tools native to each framework and without interfacing external solvers. Listing 1 shows the PyTorch pseudocode of the implicit layer, and the left part of Figure 5 illustrates its corresponding computational graph.
Listing 1. PyTorch pseudocode of the implicit residual block.
import torch

class fpmap(torch.autograd.Function):
    # differentiable wrapper around the fixed point y = x + Φ(γ, x, y)

    @staticmethod
    def forward(ctx, γ, x, ŷ):
        # ŷ is the fixed point computed by the nonlinear solver (outside the graph)
        y = x + Φ(γ, x, ŷ)
        ctx.save_for_backward(y, ŷ)
        return y

    @staticmethod
    def backward(ctx, ∇yL):
        y, ŷ = ctx.saved_tensors
        # linear solve (17): (I − ∂Φ/∂ŷ)ᵀ ∇yL̄ = ∇yL, e.g. with matrix-free GMRES
        ∇yL̄ = linsolve((I − ∂y/∂ŷ).T, ∇yL)
        return ∇yL̄

def ImplicitResidualBlock(γ, x):
    # solve the nonlinear equation (15) with the inputs detached from the graph
    nsolve = lambda γ, x: argmin_z ‖z − fpmap(γ, x, z)‖²
    ŷ = nsolve(γ.detach(), x.detach())
    # re-evaluate the layer through fpmap so that the gradients with respect
    # to γ and x are assembled automatically by the framework
    return fpmap.apply(γ, x, ŷ)

2.2.2. Backpropagation

We now show that, even though the nonlinearity in (6) adds to the complexity of the forward propagation, direct backpropagation through the nonlinear solver is not required. First, using the chain rule, we can easily find the Jacobian matrices of the implicit residual layer as follows:
$\frac{\partial y}{\partial x} = I + \frac{\partial \Phi(\gamma, x, y)}{\partial x} + \frac{\partial \Phi(\gamma, x, y)}{\partial y} \frac{\partial y}{\partial x} \quad\Rightarrow\quad \frac{\partial y}{\partial x} = \left(I - \frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^{-1} \left(I + \frac{\partial \Phi(\gamma, x, y)}{\partial x}\right),$
and
$\frac{\partial y}{\partial \gamma} = \frac{\partial \Phi(\gamma, x, y)}{\partial \gamma} + \frac{\partial \Phi(\gamma, x, y)}{\partial y} \frac{\partial y}{\partial \gamma} \quad\Rightarrow\quad \frac{\partial y}{\partial \gamma} = \left(I - \frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^{-1} \frac{\partial \Phi(\gamma, x, y)}{\partial \gamma}.$
The backpropagation formulas then follow immediately:
$\nabla_x L = \left(I + \frac{\partial \Phi(\gamma, x, y)}{\partial x}\right)^T \overline{\nabla_y L} \qquad\text{and}\qquad \nabla_\gamma L = \left(\frac{\partial \Phi(\gamma, x, y)}{\partial \gamma}\right)^T \overline{\nabla_y L},$
where $\overline{\nabla_y L}$ is the solution of the linear system
$\left(I - \frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^T \overline{\nabla_y L} = \nabla_y L.$
Note that the custom backpropagation for the fpmap function in Listing 1 is responsible for the linear solve in (17). The gradients of the loss with respect to the parameters and the input are then computed automatically by the deep learning framework. Finally, DL frameworks allow for the cheap computation of the vector-Jacobian products in (17) and hence for the efficient implementation of iterative linear solvers. For example, we utilize restarted GMRES as a linear solver in our implementation.
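A minimal sketch of such a matrix-free solve is given below: the transposed vector-Jacobian product is obtained from torch.autograd.grad, wrapped as a SciPy LinearOperator, and passed to restarted GMRES. The function and argument names are illustrative assumptions, and the actual implementation in the accompanying repository may differ.

import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, gmres

def solve_adjoint(Phi, gamma, x, y, grad_y_L):
    """Solve (I - dPhi/dy)^T grad = grad_y_L, cf. (17), without forming the Jacobian."""
    y = y.detach().requires_grad_(True)
    out = x + Phi(gamma, x, y)                       # re-evaluate the layer to build a graph
    n = y.numel()

    def matvec(v):
        v_t = torch.as_tensor(v, dtype=y.dtype).reshape(y.shape)
        vjp, = torch.autograd.grad(out, y, grad_outputs=v_t, retain_graph=True)
        return (v_t - vjp).reshape(-1).numpy()       # (I - dPhi/dy)^T v via one VJP

    A = LinearOperator((n, n), matvec=matvec)
    sol, _ = gmres(A, grad_y_L.detach().reshape(-1).numpy(), restart=10)
    return torch.as_tensor(sol, dtype=y.dtype).reshape(y.shape)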

2.3. fpResNet

Sophisticated solvers are not required for the nonlinear and linear systems in (15) and (17) when Φ ( γ , x , y ) is a contractive mapping, i.e., when
$\mathrm{Lip}(\Phi) := \sup_y \left\| \frac{\partial \Phi(\gamma, x, y)}{\partial y} \right\|_2 = \theta \sup_y \left\| \frac{\partial F(\gamma, y)}{\partial y} \right\|_2 < 1,$
where $\|\cdot\|_2$ is the spectral norm of a matrix, equal to its largest singular value. The red color in Figure 3 is used to highlight the part of the complex plane $\{ z : |z| < \theta^{-1} \}$ where the above condition is satisfied for the Dahlquist test equation in (10). In this case, the Banach fixed-point theorem ensures the convergence of the recurrence relation
$y^{k+1} = x + \Phi(\gamma, x, y^k), \qquad k = 0, 1, \dots$
during the forward propagation. The same condition guarantees the validity of the Neumann series expansion for the matrix inverse required for the backpropagation; one gets
$\overline{\nabla_y L} = \left(I - \frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^{-T} \nabla_y L = \sum_{i=0}^{\infty} \left(\left(\frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^T\right)^i \nabla_y L = \sum_{i=0}^{\infty} \nabla_y^i L,$
with
$\nabla_y^i L = \left(\frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^T \nabla_y^{i-1} L, \qquad \nabla_y^0 L = \nabla_y L.$
Similarly to (17), each $\nabla_y^i L$ has the form of a vector-Jacobian product, which can be efficiently evaluated with any deep learning framework. In practice, however, such simple iterations converge linearly with a rate proportional to the Lipschitz constant $\mathrm{Lip}(\Phi)$, which can be rather inefficient when $\mathrm{Lip}(\Phi) \approx 1$.
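When Lip(Φ) < 1, both solves therefore reduce to plain iterations; the sketch below (a simplification with hypothetical helper names, not the actual training code) implements the fixed-point forward recurrence above and a truncated Neumann-series accumulation of the backward gradient.

import torch

def forward_fixed_point(Phi, gamma, x, max_iter=50, tol=1e-6):
    """Iterate y_{k+1} = x + Phi(gamma, x, y_k); converges when Lip(Phi) < 1."""
    y = x.clone()
    for _ in range(max_iter):
        y_new = x + Phi(gamma, x, y)
        if (y_new - y).norm() <= tol * max(y.norm().item(), 1.0):
            return y_new
        y = y_new
    return y

def backward_neumann(Phi, gamma, x, y, grad_y_L, n_terms=20):
    """Accumulate the truncated Neumann series sum_i ((dPhi/dy)^T)^i grad_y_L via repeated VJPs."""
    y = y.detach().requires_grad_(True)
    out = Phi(gamma, x, y)
    total, term = grad_y_L.clone(), grad_y_L.clone()
    for _ in range(n_terms):
        term, = torch.autograd.grad(out, y, grad_outputs=term, retain_graph=True)
        total = total + term
    return total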

2.4. Regularization

One way to ensure the stability of a neural network, when viewed as a nonlinear DS, is by imposing a hard constraint on its parameters to make it globally dissipative or conservative. This approach has been utilized, for instance, in [7,10] and applied to several explicit, and hence only conditionally stable, residual network architectures. Another approach, which we employ here, is to impose this structure by regularization. In this section, we review some common regularization techniques and propose a new one that suits the presented implicit architecture.

2.4.1. Lipschitz Continuous Architectures

Enforcing Lipschitz continuity of neural networks has been recognized as an important component in many applications. For instance, explicit bounds on the Lipschitz constants of the loss function have been utilized to improve the robustness and establish the generalization error of large margin classifiers and GAN discriminators in [27,28,29,30], respectively. Lipschitz continuity has also been considered implicitly in [31,32] to improve the adversarial robustness of DNNs and in [33] for the design of contractive auto-encoders. In all these works, bounding the Lipschitz constants was implemented by penalizing the norm of the Jacobian matrix of the network on the training dataset.
Spectral normalization of weight matrices can be used to provide a uniform bound on the Lipschitz constant of the network. It has been applied to improve the stability of deep networks in [34,35,36] and to construct invertible normalizing flows of probability distributions in [37]. To justify the method, consider the common choice of $F(\gamma, x)$ in (11) as a composition of affine maps and contractive nonlinear activations, i.e.,
$F(\gamma, x) = \phi_n \circ \phi_{n-1} \circ \dots \circ \phi_1(x) \qquad\text{with}\qquad \phi_i(x) = \sigma(\gamma_i x + b_i) \quad\text{s.t.}\quad |\sigma'| \leq 1.$
The Lipschitz constant of F ( γ , x ) as a function of x can be bounded from above by
$\mathrm{Lip}(F) := \sup_x \left\| \frac{\partial F(\gamma, x)}{\partial x} \right\|_2 \leq \prod_{i=1}^{n} \|\gamma_i\|_2 = \prod_{i=1}^{n} \sqrt{\rho(\gamma_i^T \gamma_i)},$
where $\|\cdot\|_2$ is the spectral norm and $\rho(A)$ denotes the spectral radius of a linear map A. Hence, to ensure that $\mathrm{Lip}(F) \leq \alpha$, it is enough to take
$\tilde\gamma_i = \frac{\gamma_i}{\max\big(1, \|\gamma_i\|_2 / \alpha_i\big)} \qquad\text{s.t.}\qquad \prod_{i=1}^{n} \alpha_i \leq \alpha.$
The exact calculation of the operator norm is expensive and one usually appeals to approximate techniques such as the power iteration method [38]. According to this method, the dominant singular vector v and the singular value μ of a linear operator A are estimated iteratively as
$v^{k+1} = \frac{A^T A v^k}{\|A^T A v^k\|}, \qquad \mu^k = \sqrt{(v^k)^T A^T A v^k} = \|A v^k\|.$
In practice, it is enough to take a fixed number (usually 1) of iterations at each weight evaluation during the training stage since the parameters are not expected to change much close to the convergence of the training loop. By observing that
$A^T A v^k = \frac{1}{2} \frac{\partial \|A v^k\|^2}{\partial v^k} = \frac{1}{2} \frac{\partial (\mu^k)^2}{\partial v^k},$
Algorithm 1 provides a simple implementation of this approach.
Algorithm 1 Power iteration method
  • Input: linear map A, max iterations $k_{max}$
  • Initialize $v^0$
  • $v^0 \leftarrow v^0 / \|v^0\|$
  • for $k = 0, \dots, k_{max}-1$ do
    • $\mu \leftarrow \|A v^k\|$
    • $v^{k+1} \leftarrow 0.5 \cdot \partial \mu^2 / \partial v^k$
    • $v^{k+1} \leftarrow v^{k+1} / \|v^{k+1}\|$
  • end for
  • Output: $\mu$
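For reference, a PyTorch sketch of Algorithm 1 applied to a weight matrix is shown below; the persistent vector v and the single inner iteration per call are assumptions following the remark above about taking one iteration per weight evaluation.

import torch

@torch.no_grad()
def power_iteration_step(weight, v):
    """One step of Algorithm 1: refine the dominant right singular vector v of `weight`."""
    u = weight @ v                       # A v
    v_new = weight.t() @ u               # A^T A v (formed directly here instead of via autodiff)
    v_new = v_new / v_new.norm()
    sigma = (weight @ v_new).norm()      # estimate of the largest singular value
    return sigma, v_new

def spectrally_normalize(weight, v, alpha=1.0):
    """Rescale the weight so that its estimated spectral norm does not exceed alpha."""
    sigma, v = power_iteration_step(weight, v)
    scale = torch.clamp(sigma / alpha, min=1.0)
    return weight / scale, v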
More generally, spectral normalization allows for controlling the spread of the Jacobian matrix $\frac{\partial F(\gamma, x)}{\partial x}$, i.e., the largest distance between its eigenvalues
$s\left(\frac{\partial F(\gamma, x)}{\partial x}\right) := \max_{i,j} |\lambda_i - \lambda_j|.$
By definition of the spectral radius, one has
$\rho\left(\frac{\partial F(\tilde\gamma, x)}{\partial x}\right) \leq \left\|\frac{\partial F(\tilde\gamma, x)}{\partial x}\right\|_2 \leq 1 \quad\Rightarrow\quad \mathrm{Re}\,\lambda_i \in [-1, 1] \;\; \forall i.$
Hence, by denoting $F_{-1,1}(\gamma, x) := F(\tilde\gamma, x)$, we obtain
$F_{\alpha,\beta}(\gamma, x) = \frac{\alpha + \beta}{2}\, x + \frac{\beta - \alpha}{2}\, F_{-1,1}(\gamma, x) \quad\Rightarrow\quad \mathrm{Re}\,\lambda_i \in [\alpha, \beta] \;\; \forall i,$
so that all eigenvalues of the Jacobian $\frac{\partial F_{\alpha,\beta}(\gamma, x)}{\partial x}$ are located in the disc of radius $(\beta - \alpha)/2$ centered at $(\alpha + \beta)/2$. In practice, we use the more flexible form of this definition
$F_{\alpha,\beta}(\gamma, x) = \frac{\alpha + \beta}{2}\, x + \frac{\beta - \alpha}{2}\, S(\vartheta) \odot F_{-1,1}(\gamma, x),$
where $\odot$ is the Hadamard product, $\vartheta$ are additional learnable parameters, $S(\vartheta)$ has the same dimensionality as $F_{-1,1}(\gamma, x)$, and each $S_i(\vartheta_i) \in (0, 1)$ is a sigmoid function.
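A thin wrapper realizing the shifted and gated field in (19) might look as follows; F_norm stands for an assumed spectrally normalized module whose Jacobian spectrum lies in the unit disc.

import torch
import torch.nn as nn

class ShiftedField(nn.Module):
    """F_{alpha,beta}(x) = (alpha+beta)/2 * x + (beta-alpha)/2 * S(theta) * F_norm(x), cf. (19)."""
    def __init__(self, F_norm, dim, alpha, beta):
        super().__init__()
        self.F_norm, self.alpha, self.beta = F_norm, alpha, beta
        self.theta = nn.Parameter(torch.zeros(dim))          # additional learnable parameters

    def forward(self, x):
        gate = torch.sigmoid(self.theta)                     # S(theta) with entries in (0, 1)
        return (0.5 * (self.alpha + self.beta) * x
                + 0.5 * (self.beta - self.alpha) * gate * self.F_norm(x))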

2.4.2. Trajectory Regularization

The Lipschitz constant of the proposed residual block in (6) is defined as
$\mathrm{Lip}(y) := \sup_x \left\| \left(I - \frac{\partial \Phi(\gamma, x, y)}{\partial y}\right)^{-1} \left(I + \frac{\partial \Phi(\gamma, x, y)}{\partial x}\right) \right\|_2,$
and, for the linear scalar test equation in (10), it is given by the stability function (13) of the method. The hatched circle in Figure 3 contains the spectrum of the 1-Lipschitz function $F(\tilde\gamma, x)$ and shows that the Lipschitz constant of a residual block can fall outside of its stability region. Moreover, the pole of (13) is located at $z = \theta^{-1}$, and the stability of implicit layers might actually degrade with increasing $\theta \in [0, 1]$ if no additional precautions are taken to isolate the spectrum of the layers from its vicinity. This can be achieved, for instance, by setting $F(\gamma, x) := F_{\alpha, \theta^{-1}}(\gamma, x)$ for some $\alpha < \theta^{-1}$.
Previous works have focused on improving the efficiency of residual and neural ODE architectures by regularizing their vector fields in a way that leads to a simpler dynamical behavior. For example, the authors of [39] considered the following regularizer
$R(\gamma) := \alpha_K \int_0^T \big\| F(\gamma(t), y(t)) \big\|^2 \, dt + \alpha_J \int_0^T \left\| \frac{\partial F(\gamma(t), y(t))}{\partial y} \right\|_F^2 dt.$
The first term encourages the trajectories with origin in the training dataset to follow straight lines, while the second term reduces overfitting by restricting the vector field to be nearly constant in the vicinity of each trajectory. A more general approach has been taken in [40] by directly penalizing the K-th order total derivative of the vector field along the solution trajectories. It has been shown that, by matching K to the order of numerical integrator, it is possible to significantly reduce the cost of solving the learned dynamics without sacrificing the resulting accuracy.
Instead, we consider “discrete-time” regularization of the form
$R(\gamma) := \frac{1}{T} {\sum_{t=0}^{T}}' \left[ \frac{\alpha_{div}}{d} \left(\frac{t}{T}\right)^{p} \nabla \cdot F(\gamma_t, y_t) + \frac{\alpha_{jac}}{d^2} \left\| \frac{\partial F(\gamma_t, y_t)}{\partial y_t} \right\|_F^2 \right] + \alpha_{TV}\, T \sum_{t=1}^{T} \|\gamma_t - \gamma_{t-1}\|^2,$
where the primed sum denotes the trapezoidal quadrature rule (i.e., the first and last terms are halved), d is the dimension of the vector field $F : \Gamma \times \mathbb{R}^d \to \mathbb{R}^d$, $p \geq 0$ is some fixed number, and $\nabla \cdot F(\gamma_t, y_t)$ is the divergence of $F(\gamma_t, y_t)$.
The total variation (TV) like term in (20) is responsible for the temporal regularity of the vector field F ( γ ( t ) , y ( t ) ) . Similarly to the approach of [39,40], the second term penalizes the Jacobian matrix of the layer producing vector fields which tend to be constant in the vicinity of the learned trajectories; this results in simpler dynamics and faster convergence. Finally, the first term in (20) promotes the negative divergence of F ( γ t , y t ) along the trajectories. By definition of the divergence of the vector field, one has
$\nabla \cdot F(\gamma_t, y_t) = \sum_{i=1}^{d} \left[\frac{\partial F(\gamma_t, y_t)}{\partial y_t}\right]_{ii} = \mathrm{Tr}\left(\frac{\partial F(\gamma_t, y_t)}{\partial y_t}\right) = \sum_{i=1}^{d} \lambda_i\!\left(\frac{\partial F(\gamma_t, y_t)}{\partial y_t}\right),$
i.e., it is equal to the sum of eigenvalues of the Jacobian matrix. By minimizing this term, we attempt to push the spectrum of the Jacobian matrix to the negative part of the complex plane so that we can take advantage of the enhanced stability of implicit layers in this region. In addition, note that, by definition, the squared Frobenius norm of a matrix is equal to the sum of its squared singular values:
$\left\| \frac{\partial F(\gamma_t, y_t)}{\partial y_t} \right\|_F^2 = \sum_{i=1}^{d}\sum_{j=1}^{d} \left[\frac{\partial F(\gamma_t, y_t)}{\partial y_t}\right]_{ij}^2 = \mathrm{Tr}\left(\frac{\partial F^T(\gamma_t, y_t)}{\partial y_t}\, \frac{\partial F(\gamma_t, y_t)}{\partial y_t}\right) = \sum_{i=1}^{d} \sigma_i^2\!\left(\frac{\partial F(\gamma_t, y_t)}{\partial y_t}\right).$
Hence, the first two terms in (20) compete with each other, and the coefficient $(t/T)^p$ is used to balance these two components by increasing the level of dissipation along the trajectories. The divergence term, however, does not impact the off-diagonal part of the Jacobian matrices, and, for large enough $\alpha_{div}$, this promotes their diagonal dominance.
According to the Gershgorin circle theorem, every eigenvalue of a matrix A lies within at least one of the Gershgorin discs $D(a_{ii}, r_i)$ with $r_i = \sum_{j \neq i} |a_{ij}|$. This means that the optimal solution of the optimization problem with the proposed regularizer in (20) will result in dynamics that are strongly dissipative in the directions irrelevant for the accurate representation of the training data, effectively reducing the dimension of the state space. The behavior of the dynamical system in the remaining directions along the so-obtained low-dimensional state manifold can potentially be arbitrary and, with a proper balance between the loss and regularization, should not decrease the expressive power of the network.
For the efficient evaluation of the divergence and Jacobian regularizers in (20), we utilize the unbiased stochastic Hutchinson trace estimator [41] to obtain
$\nabla \cdot F(\gamma_t, y_t) = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[ z^T \frac{\partial F(\gamma_t, y_t)}{\partial y_t}\, z \right], \qquad \left\| \frac{\partial F(\gamma_t, y_t)}{\partial y_t} \right\|_F^2 = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[ \left\| z^T \frac{\partial F(\gamma_t, y_t)}{\partial y_t} \right\|^2 \right].$
This approach has also been applied in [39,42]. Algorithm 2 provides a more general variant of Hutchinson algorithm which allows for estimating the diagonal of a matrix [43]. This can be convenient when one needs to control the magnitude of the diagonal elements rather than just their sum.
Algorithm 2 Stochastic diagonal estimator
  • Input: linear map A, max iterations $k_{max}$
  • Initialize $t^0 = 0$, $q^0 = 0$
  • for $k = 1, \dots, k_{max}$ do
    • $v^k \sim \mathcal{N}(0, 1)$
    • $t^k \leftarrow t^{k-1} + A(v^k) \odot v^k$
    • $q^k \leftarrow q^{k-1} + v^k \odot v^k$
    • $D^k \leftarrow t^k \oslash q^k$
  • end for
  • Output: approximate diagonal $D^{k_{max}}$ of A
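With automatic differentiation, the stochastic estimators above take only a few lines; the sketch below uses a single Gaussian probe per call, which is a simplification of the actual implementation.

import torch

def hutchinson_div_and_jac(F, gamma, y):
    """Single-sample Hutchinson estimates of div F and of ||dF/dy||_F^2 at the point y."""
    y = y.detach().requires_grad_(True)
    out = F(gamma, y)
    z = torch.randn_like(y)                                   # Gaussian probe vector
    vjp, = torch.autograd.grad(out, y, grad_outputs=z, create_graph=True)
    div_estimate = (vjp * z).sum()                            # E[z^T (dF/dy) z] = Tr(dF/dy)
    frob_estimate = (vjp ** 2).sum()                          # E[||z^T dF/dy||^2] = ||dF/dy||_F^2
    return div_estimate, frob_estimate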

3. Results

The source code used to generate all examples below can be found at https://github.com/vreshniak/ImplicitResNet.

3.1. Example 1. (Regression)

For the first example, we consider a simple problem that can be easily visualized. The goal is to approximate the one-dimensional sine function in Figure 6 given N = 20 data points evenly distributed on the interval $x \in [-5, 5]$. Following the approach of [44], we augment the original one-dimensional data with an additional dimension initialized with zero. The resulting two-dimensional vector field $F(\gamma, y) : \Gamma \times \mathbb{R}^2 \to \mathbb{R}^2$ is approximated by a multilayer perceptron with three hidden layers of width 10 and the GELU activation function, i.e.,
$F(\gamma, y) = \gamma_{out}\, \sigma\Big(\gamma_3\, \sigma\big(\gamma_2\, \sigma(\gamma_1\, \sigma(\gamma_{in}\, y + b_0) + b_1) + b_2\big) + b_3\Big),$
where $\gamma_{in} \in \mathbb{R}^{10 \times 2}$, $\gamma_{out} \in \mathbb{R}^{2 \times 10}$, $\gamma_i \in \mathbb{R}^{10 \times 10}$, and $b_i \in \mathbb{R}^{10 \times 1}$.
We used T = 5 residual layers with shared parameters initialized with a Xavier uniform initializer and trained the network for 3000 epochs using Adam optimizer and the regularized loss given by
$L(\gamma) := \frac{1}{N} \sum_{i=1}^{N} \big\| g(y_T^i) - f(x^i) \big\|^2 + \frac{1}{T} {\sum_{t=0}^{T}}' \left[ \frac{\alpha_{div}}{d} \left(\frac{t}{T}\right)^{2} \nabla \cdot F(\gamma, y_t^i) + \frac{0.1}{d^2} \left\| \frac{\partial F(\gamma, y_t^i)}{\partial y_t^i} \right\|_F^2 \right],$
where d = 2 is the dimension of the hidden state space and $g(y_T)$ gives the last component of $y_T$. The initial learning rate was set to $10^{-3}$ and reduced dynamically using the ReduceLROnPlateau PyTorch scheduler with the patience and cooldown parameters set to 50 epochs.
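For illustration, the vector field and the shared-parameter residual stack described above might be assembled as follows; the implicit solver itself is abstracted into a hypothetical implicit_step routine and is not shown.

import torch
import torch.nn as nn

class VectorField(nn.Module):
    """MLP vector field F: R^2 -> R^2 with three hidden layers of width 10 and GELU."""
    def __init__(self, dim=2, width=10):
        super().__init__()
        layers = [nn.Linear(dim, width), nn.GELU()]
        for _ in range(3):
            layers += [nn.Linear(width, width), nn.GELU()]
        layers += [nn.Linear(width, dim)]
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)

    def forward(self, y):
        return self.net(y)

# N = 20 one-dimensional samples on [-5, 5], augmented with a zero channel as in [44]
x = torch.linspace(-5.0, 5.0, 20).unsqueeze(1)
state = torch.cat([x, torch.zeros_like(x)], dim=1)
F = VectorField()
# T = 5 implicit residual steps sharing the same parameters:
# for t in range(5):
#     state = implicit_step(F, state, theta=0.5)   # hypothetical solver of the fixed point (6)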
Figure 7 shows the learned vector fields and the eigenvalues of the Jacobian $\frac{\partial F(\gamma, y)}{\partial y}$ along the learned trajectories for several values of $\theta$ and $\alpha_{div}$. Additionally, Figure 8 depicts the evolution of the loss components in (21) (top row) and the number of nonlinear iterations of the trained network (bottom row) for the selected parameter values. One can see that, with the Jacobian regularization alone ($\alpha_{div} = 0$, $\alpha_{jac} = 0.1$), implicit methods demonstrate similar dynamical behavior for all considered values of $\theta$: (1) the learned vector fields take full advantage of both available dimensions and tend to be expansive (note the increasing divergence in Figure 8), (2) the learned trajectories mostly follow straight lines, and (3) aside from the fully explicit scheme ($\theta = 0$), the costs of all the methods are nearly identical, see the bottom row of Figure 8. By increasing $\alpha_{div}$, one starts observing the formation of a lower-dimensional invariant manifold with increasingly dissipative orthogonal dynamics, indicated by the negative part of the spectrum of $\frac{\partial F(\gamma, y)}{\partial y}$ and the evolution of the filled regions in Figure 7. As the resulting dynamics becomes more restricted and less trivial, the following observations can be made: (1) the model tends to be less flexible and has more difficulty fitting the data, (2) the cost of the model increases with $\alpha_{div}$ and $\theta$, and (3) the explicit method demonstrates unstable oscillatory behavior when the level of dissipation exceeds its stability threshold.

3.2. Example 2. (Stiff ODE)

In our second example, we aim to fit the model to the given stiff ODE
$\dot z = -20\big(z - \cos(t)\big), \qquad t \in [0, 2].$
The training data in Figure 6 consist of 100 trajectories with randomly sampled initial conditions and the given number of uniformly distributed points along each trajectory as shown in Figure 9.
Since the exact Lipschitz constant of the forcing term is equal to 20, we use $F_{-25,-15}(\gamma, x)$ in (19) with the one-dimensional vector field $F(\gamma, x) : \Gamma \times \mathbb{R}^1 \to \mathbb{R}^1$ given by a multilayer perceptron with two hidden layers of width 4 and the ReLU activation function, i.e.,
$F(\gamma, y) = \gamma_{out}\, \sigma\big(\gamma_2\, \sigma(\gamma_1\, \sigma(\gamma_{in}\, y + b_0) + b_1) + b_2\big),$
where $\gamma_{in} \in \mathbb{R}^{4 \times 1}$, $\gamma_{out} \in \mathbb{R}^{1 \times 4}$, $\gamma_i \in \mathbb{R}^{4 \times 4}$, and $b_i \in \mathbb{R}^{4 \times 1}$.
We took T = 20 residual layers initialized with Xavier uniform initializer and trained the network for 50 epochs using Adam optimizer, batch of size 1, and the regularized loss given by
$L(\gamma) := \frac{1}{N \cdot \{2,4,10\}} \sum_{i=1}^{N} \sum_{j=1}^{\{2,4,10\}} \big| y_j^i - z^i\big(t_0^i + j \cdot \{1, 0.5, 0.2\}\big) \big|^2 + 0.1\, T \sum_{t=1}^{T} \|\gamma_t - \gamma_{t-1}\|^2,$
where the values in braces correspond to the three training configurations with 2, 4, and 10 points per trajectory.
The top row of Figure 9 illustrates the evolution of a single trajectory generated by three residual models with $\theta = 0.0$, 0.5, and 1.0. Each model was trained using the loss function in (23) with 2, 4, and 10 points taken along the trajectories of the training dataset. One can see that the explicit ResNet is unstable, as was expected for the stiff system in (22), and the generated solution becomes increasingly oscillatory along the transient part of the trajectory as we increase the number of training points from 2 to 10. This is due to the model attempting to be increasingly expressive for data it cannot possibly fit. Once the dynamical system relaxes to the slow manifold, the accuracy of the model improves slightly but remains susceptible to orthogonal perturbations. In order to stabilize the model, more layers would need to be taken, which would increase the memory footprint and computational complexity. At the same time, both implicit residual networks with $\theta = 0.5$ and 1.0 are unconditionally stable and improve their accuracy with an increasing number of training points, as expected.
The bottom row of Figure 9 shows the continuous dynamics generated by the vector fields learned by the three considered models. In this case, the discrete-time stability is not an issue anymore. However, the corruption caused by the instability of the explicit method transfers to inaccurate behavior of the continuous system as well. On the contrary, both implicit methods lead to a satisfactory approximation of the original vector field, with the midpoint scheme ($\theta = 0.5$) being observably more accurate, likely due to its higher order of convergence. This suggests the proposed implicit networks as a means for learning stiff continuous-time ODE systems.
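Assuming the right-hand side in (22), the training trajectories for this example can be generated with a standard stiff integrator; the sketch below samples 100 random initial conditions on [0, 2] (the sampling range is an assumption, not taken from the original setup).

import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 2.0, 11)                    # uniformly distributed points along [0, 2]

trajectories = []
for _ in range(100):
    z0 = rng.uniform(-2.0, 2.0)                       # assumed range of initial conditions
    sol = solve_ivp(lambda t, z: -20.0 * (z - np.cos(t)), (0.0, 2.0), [z0],
                    t_eval=t_eval, method="Radau")    # an implicit method copes with the stiffness
    trajectories.append(sol.y[0])
trajectories = np.stack(trajectories)                 # array of shape (100, 11)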

3.3. Example 3. (Periodic ODE)

In this example, we consider the Lotka–Volterra system given by
$\dot z_1 = \alpha z_1 - \beta z_1 z_2, \qquad \dot z_2 = \delta z_1 z_2 - \gamma z_2$
with $\alpha = 2/3$, $\beta = 4/3$, and $\delta = \gamma = 1$. These equations are used to model the time evolution of biological systems with two interacting species, one being the prey and the other the predator. The model has two equilibrium points at which neither of the two interacting populations changes. The first equilibrium is at $z_1 = z_2 = 0$, and the non-trivial one is at $z_1 = \gamma/\delta = 1$, $z_2 = \alpha/\beta = 1/2$. Other solutions are periodic and lie on closed curves in the phase space. Figure 6 shows five such curves; each curve contains 51 points which are distributed uniformly on the time interval $t \in [0, 10]$ and used as the training data for our example.
To fit this data, we utilized implicit residual networks with T = 50 layers and the two-dimensional vector field approximated by the multilayer perceptron with four hidden layers of width 20 and ReLU activation functions, i.e.,
$F(\gamma, y) = \gamma_{out}\, \sigma\Big(\gamma_4\, \sigma\big(\gamma_3\, \sigma(\gamma_2\, \sigma(\gamma_1\, \sigma(\gamma_{in}\, y + b_0) + b_1) + b_2) + b_3\big) + b_4\Big),$
where $\gamma_{in} \in \mathbb{R}^{20 \times 2}$, $\gamma_{out} \in \mathbb{R}^{2 \times 20}$, $\gamma_i \in \mathbb{R}^{20 \times 20}$, and $b_i \in \mathbb{R}^{20 \times 1}$. We initialized the network with the Xavier uniform initializer and trained it for 3000 epochs using the Adam optimizer, full batch size, and the loss given by
$L(\gamma) := \frac{1}{50 \cdot N} \sum_{i=1}^{N} \sum_{j=1}^{50} \big\| y_j^i - z^i(0.2\, j) \big\|^2.$
Note that we did not use any form of weight normalization or regularization.
Figure 10 shows the learned trajectories and the eigenvalues of the vector field along these trajectories for three implicit networks with $\theta = 0.0$, 0.5, and 1.0 on the time interval $t \in [0, 10]$. At first glance, all three networks learn very similar vector fields and can accurately fit the data on the time interval of the training dataset. Moreover, the top row of Figure 11 shows that all three methods are also successful at extrapolating the dynamics to the time interval $t \in [0, 200]$. However, the bottom row of the same figure shows that the midpoint scheme ($\theta = 0.5$) is the only one which produces a vector field accurate for the long-time integration of the continuous dynamics. This behavior is indeed expected since the method is a geometric integrator for this type of system; recall Figure 2 and the discussion in Section 2. The explicit residual network ($\theta = 0.0$) is strictly expanding for such conservative systems, and the learned vector field tends to compensate for this behavior, resulting in dissipative continuous dynamics. The situation is reversed for the backward Euler integrator ($\theta = 1.0$): since the method is strictly dissipative, the learned vector field is overly expanding.

3.4. Example 4. (MNIST Classification)

For the last example, consider the problem of classifying images of handwritten digits from a subset of the MNIST dataset of size 1000. For this purpose, we adopt the standard preactivation ResNet-18 architecture with eight initial channels and train the network using the Adam optimizer for 100 epochs with a learning rate of $10^{-2}$. We used the cross-entropy loss function
$L(\gamma) := \frac{1}{N} \sum_{i=1}^{N} \left[ -y_{label}^i + \log \sum_{j=0}^{9} \exp(y_j^i) \right]$
and normalized the forcing terms $F(\gamma, x)$ of all residual layers as $F_{-3,1}(\gamma, x)$ using (19). Additionally, the divergence regularization in (20) with $\alpha_{div} = 0.01$ was applied to each residual layer.
To test the robustness of the trained network, we corrupted the original dataset with Gaussian noise of varying standard deviation. Table 3 reports the classification accuracy for different levels of noise intensity and five different implicit residual networks. The results show that implicit architectures with a proper regularization can significantly improve the robustness of trained networks.

4. Conclusions and Future Work

In this work, we presented a novel implicit residual layer and provided a memory-efficient algorithm to evaluate and train deep neural networks composed of such layers. We also proposed a regularization technique to control the spectral properties of the presented layer and showed that it leads to improved stability and robustness of the trained networks. The obtained numerical results support our findings.
We see several opportunities for potential improvements to the presented architecture. For example, implementations of the forward and backward propagation algorithms can be further optimized to account for the repetitive nature of the training process. This includes better estimation of initial guesses for nonlinear solvers, preconditioners for linear solvers, and other ways to reuse available information from previous runs. It is also interesting to study other, possibly multistep, types of implicit residual layers and their combinations. In this regard, and in addition to the provided examples, we plan to identify the best use cases and applications for deep residual networks containing implicit layers. In particular, we are interested in exploring the impact of adversarial training and the proposed spectral regularization on the properties of the trained implicit networks. Other applications of interest include the identification of physical systems with known properties such as energy dissipation/conservation, long-horizon time series forecasting, applications with corrupted and noisy data, etc. Finally, as proper regularization is essential for the good performance of implicit layers, new regularization approaches tailored to specific applications should also be analyzed. We intend to study these questions in our future work.

Author Contributions

Conceptualization, V.R. and C.G.W.; methodology, V.R. and C.G.W.; software, V.R.; writing, V.R. and C.G.W.; funding acquisition, C.G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the U.S. Department of Energy, Office of Science, Early Career Research Program under award number ERKJ314; U.S. Department of Energy, Office of Advanced Scientific Computing Research under award numbers ERKJ331 and ERKJ345; the National Science Foundation, Division of Mathematical Sciences, Computational Mathematics program under contract number DMS1620280; and the Behavioral Reinforcement Learning Lab at Lirio LLC.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  2. Hanin, B. Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics 2019, 7, 992. [Google Scholar] [CrossRef] [Green Version]
  3. Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; Wang, L. The Expressive Power of Neural Networks: A View from the Width. In Advances in Neural Information Processing Systems 30: 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 6231–6239. [Google Scholar]
  4. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
  5. Weinan, E. A Proposal on Machine Learning via Dynamical Systems. Commun. Math. Stat. 2017, 5, 1–11. [Google Scholar] [CrossRef]
  6. Weinan, E.; Han, J.; Li, Q. A mean-field optimal control formulation of deep learning. Res. Math. Sci. 2018, 6, 10. [Google Scholar] [CrossRef] [Green Version]
  7. Ruthotto, L.; Haber, E. Deep Neural Networks Motivated by Partial Differential Equations. J. Math. Imaging Vis. 2020, 62, 352–364. [Google Scholar] [CrossRef] [Green Version]
  8. Sonoda, S.; Murata, N. Double continuum limit of deep neural networks. In Proceedings of the ICML Workshop on Principled Approaches to Deep Learning (ICML 2017), Sydney, Australia, 10 August 2017. [Google Scholar]
  9. Chen, T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems 31: 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 6571–6583. [Google Scholar]
  10. Haber, E.; Ruthotto, L. Stable architectures for deep neural networks. Inverse Probl. 2017, 34, 014004. [Google Scholar] [CrossRef] [Green Version]
  11. LeCun, Y. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School; Touretzky, D., Hinton, G., Sejnowsky, T., Eds.; Morgan Kaufmann, CMU: Pittsburgh, PA, USA, 1988; pp. 21–28. [Google Scholar]
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science. Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9908, pp. 630–645. [Google Scholar] [CrossRef] [Green Version]
  13. Hairer, E.; Nørsett, S.P.; Wanner, G. Solving Ordinary Differential Equations I, Nonstiff Problems; Springer Series in Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 1993; Volume 8. [Google Scholar]
  14. Hairer, E.; Lubich, C.; Wanner, G. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations; Springer Series in Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 2006; Volume 31. [Google Scholar]
  15. Gomez, A.N.; Ren, M.; Urtasun, R.; Grosse, R.B. The Reversible Residual Network: Backpropagation Without Storing Activations. In Advances in Neural Information Processing Systems 30: 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 2214–2224. [Google Scholar]
  16. Chang, B.; Meng, L.; Haber, E.; Ruthotto, L.; Begert, D.; Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Palo Alto, CA, USA, 2018. [Google Scholar]
  17. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar]
  18. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Los Alamitos, CA, USA, 2017; pp. 2261–2269. [Google Scholar] [CrossRef] [Green Version]
  19. Haber, E.; Lensink, K.; Treister, E.; Ruthotto, L. IMEXnet A Forward Stable Deep Neural Network. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; Volume 97, pp. 2525–2534. [Google Scholar]
  20. El Ghaoui, L.; Gu, F.; Travacca, B.; Askari, A. Implicit deep learning. arXiv 2019, arXiv:1908.06315. [Google Scholar]
  21. Bai, S.; Kolter, J.Z.; Koltun, V. Deep Equilibrium Models. In Advances in Neural Information Processing Systems 32: 32nd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 690–701. [Google Scholar]
  22. Pineda, F.J. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 1987, 59, 2229. [Google Scholar] [CrossRef]
  23. Hunt, K.J.; Sbarbaro, D.; Żbikowski, R.; Gawthrop, P.J. Neural networks for control systems—A survey. Automatica 1992, 28, 1083–1112. [Google Scholar] [CrossRef]
  24. Li, M.; He, L.; Lin, Z. Implicit Euler Skip Connections: Enhancing Adversarial Robustness via Numerical Stability. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria, 12–18 July 2020; Volume 119, pp. 5874–5883. [Google Scholar]
  25. Reshniak, V.; Webster, C. Robust Learning with Implicit Residual Networks. Unpublished work. 2020. [Google Scholar]
  26. Nocedal, J.; Wright, S. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006. [Google Scholar]
  27. Sokolić, J.; Giryes, R.; Sapiro, G.; Rodrigues, M.R. Robust large margin deep neural networks. IEEE Trans. Signal Process. 2017, 65, 4265–4280. [Google Scholar] [CrossRef]
  28. Tsuzuku, Y.; Sato, I.; Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems 31: 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 6541–6550. [Google Scholar]
  29. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30: 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 5767–5777. [Google Scholar]
  30. Qi, G.J. Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities. Int. J. Comput. Vis. 2020, 128, 1118–1140. [Google Scholar] [CrossRef] [Green Version]
  31. Jakubovitz, D.; Giryes, R. Improving DNN robustness to adversarial attacks using Jacobian regularization. In Proceedings of the 15th European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science. Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 514–529. [Google Scholar] [CrossRef] [Green Version]
  32. Hoffman, J.; Roberts, D.A.; Yaida, S. Robust learning with Jacobian regularization. arXiv 2019, arXiv:1908.02729. [Google Scholar]
  33. Alain, G.; Bengio, Y. What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res. 2014, 15, 3563–3593. [Google Scholar]
  34. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. In Proceedings of the Sixth International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  35. Gouk, H.; Frank, E.; Pfahringer, B.; Cree, M. Regularisation of neural networks by enforcing Lipschitz continuity. Mach. Learn. 2020. [Google Scholar] [CrossRef]
  36. Yoshida, Y.; Miyato, T. Spectral norm regularization for improving the generalizability of deep learning. arXiv 2017, arXiv:1705.10941. [Google Scholar]
  37. Behrmann, J.; Grathwohl, W.; Chen, R.T.; Duvenaud, D.; Jacobsen, J.H. Invertible Residual Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 573–582. [Google Scholar]
  38. Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; The Johns Hopkins University Press: Baltimore, MD, USA, 2013. [Google Scholar]
  39. Finlay, C.; Jacobsen, J.H.; Nurbekyan, L.; Oberman, A.M. How to train your neural ODE: The world of Jacobian and kinetic regularization. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria, 12–18 July 2020; Volume 119, pp. 3154–3164. [Google Scholar]
  40. Kelly, J.; Bettencourt, J.; Johnson, M.J.; Duvenaud, D. Learning Differential Equations that are Easy to Solve. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H.T., Eds.; [Google Scholar]
  41. Hutchinson, M. A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines. Commun. Stat. Simul. Comput. 1989, 18, 1059–1076. [Google Scholar] [CrossRef]
  42. Grathwohl, W.; Chen, R.T.; Bettencourt, J.; Sutskever, I.; Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  43. Bekas, C.; Kokiopoulou, E.; Saad, Y. An estimator for the diagonal of a matrix. Appl. Numer. Math. 2007, 57, 1214–1229. [Google Scholar] [CrossRef] [Green Version]
  44. Dupont, E.; Doucet, A.; Teh, Y.W. Augmented neural ODEs. In Advances in Neural Information Processing Systems 32: 32nd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 3140–3150. [Google Scholar]
Figure 1. From left to right: feed-forward layer, residual layer, proposed implicit residual layer.
Figure 2. Phase diagrams of different numerical solutions of the system in (8).
Figure 3. Stability regions (grey) and the contours of the stability function of implicit residual layers; the corresponding regions where layers are contractive are shown in red, while the hatched circles contain the spectrum of the spectrally normalized 1-Lipschitz function $F(\gamma, x)$.
Figure 4. Càdlàg function.
Figure 5. Forward and backward computational graphs of the implicit layer.
Figure 6. Training data in Examples 1–3.
Figure 7. (Top) Learned vector fields $F(\gamma, y)$ in Example 1. The blue line is the initial state, the red curve is the final state after T = 5 steps, and the solid black lines are the trajectories of the training data. (Bottom) Eigenvalues of $\frac{\partial F(\gamma, y)}{\partial y}$ evaluated along the learned trajectories at times $t = 0, \dots, T$. Red and blue dots are used for the train and test datasets, respectively. Stability regions of implicit layers are highlighted in grey, and the contours depict the values of the stability function.
Figure 8. (Top) Evolution of the training loss components in (21) for Example 1. (Bottom) Nonlinear iterations per residual layer of the trained network.
Figure 9. (Top) A single trajectory generated by three trained implicit residual networks for the problem in Example 2. (Bottom) Continuous-time trajectory generated by the learned vector fields of these residual networks.
Figure 10. (Top) Learned vector fields and trajectories for the system in Example 3 on the time interval $t \in [0, 10]$. (Bottom) Eigenvalues of the vector fields along these trajectories.
Figure 11. (Top) A single trajectory generated by three trained implicit residual networks for the problem in Example 3 on the time interval $t \in [0, 200]$. (Bottom) Continuous-time trajectory generated by the learned vector fields of these residual networks on the same time interval.
Table 1. Derivatives of the nonlinear maps in (11) and (12), where $z = (1-\theta)x + \theta y$.

|  | Map (11) | Map (12) |
| $\Phi(\gamma, x, y)$ | $(1-\theta)\, F(\gamma, x) + \theta\, F(\gamma, y)$ | $F(\gamma, z)$ |
| $\partial \Phi(\gamma, x, y)/\partial x$ | $(1-\theta)\, \partial F(\gamma, x)/\partial x$ | $(1-\theta)\, \partial F(\gamma, z)/\partial z$ |
| $\partial \Phi(\gamma, x, y)/\partial y$ | $\theta\, \partial F(\gamma, y)/\partial y$ | $\theta\, \partial F(\gamma, z)/\partial z$ |
| $\partial \Phi(\gamma, x, y)/\partial \gamma$ | $(1-\theta)\, \partial F(\gamma, x)/\partial \gamma + \theta\, \partial F(\gamma, y)/\partial \gamma$ | $\partial F(\gamma, z)/\partial \gamma$ |
Table 2. Common descent algorithms [26].

| Method | $s_k$ | $\lambda_k$ | Parameters |
| gradient descent | $\Lambda_k \cdot \nabla_y \|r(y^k)\|^2 + m^k/\lambda_k$ | Wolfe conditions | scaling matrix $\Lambda_k$, momentum vector $m^k$ |
| quasi-Newton | $(H^k)^{-1} \cdot \nabla_y \|r(y^k)\|^2$ | Wolfe conditions | approximate Hessian $H^k$ |
| conjugate gradient | $\nabla_y \|r(y^k)\|^2 - \beta_k s^{k-1}$ | $\arg\min_\lambda \|r(y^k + \lambda s^k)\|^2$ | conjugate direction parameters $\beta_k$ |
Table 3. Classification accuracy in Example 4 for different levels of Gaussian noise corruption. Columns give Top-1 and Top-2 accuracy for $\theta = 0$, 0.25, 0.50, 0.75, 1.00.

| Noise Intensity | Top-1: θ = 0 | 0.25 | 0.50 | 0.75 | 1.00 | Top-2: θ = 0 | 0.25 | 0.50 | 0.75 | 1.00 |
| 0.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| 0.1 | 98.2 | 99.6 | 99.9 | 99.9 | 99.9 | 99.7 | 100.0 | 100.0 | 100.0 | 100.0 |
| 0.2 | 89.2 | 95.7 | 96.3 | 98.3 | 98.4 | 96.7 | 99.3 | 99.8 | 100.0 | 100.0 |
| 0.3 | 74.7 | 86.0 | 89.3 | 93.2 | 93.6 | 89.0 | 95.2 | 97.8 | 99.0 | 98.8 |
| 0.4 | 59.3 | 74.1 | 77.2 | 81.8 | 84.7 | 75.5 | 89.1 | 91.1 | 94.4 | 95.0 |
| 0.5 | 47.2 | 60.4 | 64.7 | 69.8 | 73.0 | 65.7 | 79.6 | 83.4 | 87.1 | 87.9 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
