A Unifying Generator Loss Function for Generative Adversarial Networks

Veiner, Justin; Alajaji, Fady; Gharesifard, Bahman

doi:10.3390/e26040290

by Justin Veiner ¹, Fady Alajaji ¹,* and Bahman Gharesifard ²

¹ Department of Mathematics and Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada

² Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095, USA

* Author to whom correspondence should be addressed.

Entropy 2024, 26(4), 290; https://0-doi-org.brum.beds.ac.uk/10.3390/e26040290

Submission received: 23 February 2024 / Revised: 18 March 2024 / Accepted: 22 March 2024 / Published: 27 March 2024

(This article belongs to the Special Issue Information-Theoretic Methods in Deep Learning: Theory and Applications)

Abstract: A unifying $α$-parametrized generator loss function is introduced for a dual-objective generative adversarial network (GAN) that uses a canonical (or classical) discriminator loss function such as the one in the original GAN (VanillaGAN) system. The generator loss function is based on a symmetric class probability estimation type function, $L_{α}$, and the resulting GAN system is termed $L_{α}$-GAN. Under an optimal discriminator, it is shown that the generator’s optimization problem consists of minimizing a Jensen-$f_{α}$-divergence, a natural generalization of the Jensen-Shannon divergence, where $f_{α}$ is a convex function expressed in terms of the loss function $L_{α}$. It is also demonstrated that this $L_{α}$-GAN problem recovers as special cases a number of GAN problems in the literature, including VanillaGAN, least squares GAN (LSGAN), least kth-order GAN (LkGAN), and the recently introduced $(α_{D}, α_{G})$-GAN with $α_{D} = 1$. Finally, experimental results are provided for three datasets (MNIST, CIFAR-10, and Stacked MNIST) to illustrate the performance of various examples of the $L_{α}$-GAN system.

Keywords: generative adversarial networks; deep learning; parameterized loss functions; f-divergence; Jensen-f-divergence

1. Introduction

Generative adversarial networks (GANs), first introduced by Goodfellow et al. in 2014 [1], have a variety of applications in media generation [2], image restoration [3], and data privacy [4]. GANs aim to generate synthetic data that closely resemble the original real data with (unknown) underlying distribution $P_{x}$. The GAN is trained such that the distribution of the generated data, $P_{g}$, approximates $P_{x}$ well. More specifically, low-dimensional random noise is fed to a generator neural network G to produce synthetic data. Real data and the generated data are then given to a discriminator neural network D that scores the data between 0 and 1, with a score close to 1 meaning that the discriminator thinks the data belong to the real dataset. The discriminator and generator play a minimax game, where the aim is to minimize the generator’s loss and maximize the discriminator’s loss.

Since its initial introduction, several variants of GAN have been proposed. Deep convolutional GAN (DCGAN) [5] utilizes the same loss functions as VanillaGAN (the original GAN) while combining GANs with convolutional neural networks, which are helpful when applying GANs to image data as they extract visual features from the data. DCGANs are more stable than the baseline model but can suffer from mode collapse, which occurs when the generator learns that a select number of images can easily fool the discriminator, resulting in the generator only generating those images. Another notable issue with VanillaGAN is the tendency for the generator network’s gradients to vanish. In the early stages of training, the discriminator lacks confidence and assigns generated data values close to zero. Therefore, the objective function tends to zero, resulting in small gradients and a lack of learning. To mitigate this issue, a non-saturating generator loss function was proposed in [1] so that gradients do not vanish early on in training.

In the original (VanillaGAN) problem setup, the objective function, expressed as a negative sum of two Shannon cross-entropies, is to be minimized by the generator and maximized by the discriminator. It is demonstrated that if the discriminator is fixed to be optimal (i.e., as a maximizer of the objective function), the GAN’s minimax game can be reduced to minimizing the Jensen-Shannon divergence (JSD) between the real and generated data’s probability distributions [1]. An analogous result was proven in [6] for RényiGANs, a dual-objective GAN using distinct discriminator and generator loss functions. More specifically, under a canonical discriminator loss function (as in [1]) and a generator loss function expressed in terms of two Rényi cross-entropies, it is shown that the RényiGAN optimization problem reduces to minimizing the Jensen-Rényi divergence, hence extending VanillaGAN’s results.

Nowozin et al. generalized VanillaGAN by formulating a class of loss functions in [7] parametrized by a lower semicontinuous convex function f, devising f-GAN. More specifically, the f-GAN problem consists of minimizing an f-divergence between the true data distribution and the generator distribution via a minimax optimization of a Fenchel conjugate representation of the f-divergence, where the VanillaGAN discriminator’s role (as a binary classifier) is replaced by a variational function estimating the ratio of the true data and generator distributions. The f-GAN loss function may be tedious to derive, as it requires computation of the Fenchel conjugate of f. It can be shown that f-GAN can interpolate between VanillaGAN and HellingerGAN, among others [7].

More recently, $α$-GAN was presented in [8], for which the aim is to derive a class of loss functions parameterized by $α > 0$ and expressed in terms of a class probability estimation (CPE) loss between a real label $y \in \{0, 1\}$ and predicted label $\hat{y} \in [0, 1]$ [8]. The ability to control $α$ as a hyperparameter is beneficial because it allows one system to be applied to multiple datasets, as two datasets may be optimal under different $α$ values. This work was further analyzed in [9] and expanded in [10] by introducing the dual-objective $(α_{D}, α_{G})$-GAN, which allowed for the generator and discriminator loss functions to have distinct $α$ parameters with the aim of improving training stability. When $α_{D} = α_{G}$, the $α$-GAN optimization reduces to minimizing an Arimoto divergence, as originally derived in [8]. Note that $α$-GAN can recover several f-GANs, such as HellingerGAN, VanillaGAN, WassersteinGAN, and total variation GAN [8]. Furthermore, in their more recent work [11] that unifies [8,9,10], the authors establish, under some conditions, a one-to-one correspondence between CPE-loss-based GANs (such as $α$-GANs) and f-GANs that use a symmetric f-divergence (see Theorems 4–5 and Corollary 1 in [11]). They also prove various generalization and estimation error bounds for $(α_{D}, α_{G})$-GANs and illustrate their ability to mitigate training instability for synthetic Gaussian data as well as the Celeb-A and LSUN Classroom image datasets. The various $(α_{D}, α_{G})$-GAN equilibrium results do not provide an analogous result to JSD and Jensen-Rényi divergence minimization for the VanillaGAN [1] and RényiGAN [6] problems, respectively, as they do not involve a Jensen-type divergence. More specifically, given a divergence measure $D (p ∥ q)$ between distributions p and q (i.e., a positive-definite bivariate function: $D (p ∥ q) \geq 0$ with equality if and only if (iff) $p = q$ almost everywhere (a.e.)), a Jensen-type divergence of $D$ is given by

\frac{1}{2} D (p ∥ \frac{p + q}{2}) + \frac{1}{2} D (q ∥ \frac{p + q}{2});

i.e., it is the arithmetic average of two $D$-divergences: one between p and the mixture $(p + q) / 2$ and the other between q and $(p + q) / 2$.

The main objective of our work is to present a unifying approach that provides an axiomatic framework to encompass several existing GAN generator loss functions so that GAN optimization can be simplified in terms of a Jensen-type divergence. In particular, our framework classifies the set of $α$-parameterized CPE-based loss functions $L_{α}$, generalizing the $α$-loss function in [8,9,10,11]. We then propose $L_{α}$-GAN: a dual-objective GAN that uses a function from this class for the generator and uses any canonical discriminator loss function that admits the same optimizer as VanillaGAN [1]. We show that under some regularity (convexity/concavity) conditions on $L_{α}$, the minimax game played with these two loss functions is equivalent to the minimization of a Jensen-$f_{α}$-divergence: a Jensen-type divergence and another natural extension of the Jensen-Shannon divergence (in addition to the Jensen-Rényi divergence [6]), where the generating function $f_{α}$ of the divergence is directly computed from the CPE loss function $L_{α}$. This result recovers various prior dual-objective GAN equilibrium results, thus unifying them under one parameterized generator loss function. The newly obtained Jensen-$f_{α}$-divergence, which is noted to belong to the class of symmetric f-divergences with different generating functions (see Remark 1), is a useful measure of dissimilarity between distributions as it requires a convex function f with a restricted domain given by the interval $[0, 2]$ (see Remark 2) in addition to its symmetry and finiteness properties.

The rest of the paper is organized as follows. In Section 2, we review f-divergence measures and introduce the Jensen-f-divergence as an extension of the Jensen-Shannon divergence. In Section 3, we establish our main result regarding the optimization of our unifying generator loss function (Theorem 1) and show that it can be applied to a large class of known GANs (Lemmas 2–4). We conduct experiments in Section 4 by implementing different manifestations of $L_{α}$-GAN on three datasets: MNIST, CIFAR-10, and Stacked MNIST. Finally, we conclude the paper in Section 5.

2. Preliminaries

We begin by presenting key information measures used throughout the paper. Let $f : [0, \infty) \to (- \infty, \infty]$ be a convex continuous function that is strictly convex at 1 (i.e., $f (λ u_{1} + (1 - λ) u_{2}) < λ f (u_{1}) + (1 - λ) f (u_{2})$ for all $u_{1}, u_{2} \geq 0$, $u_{1} \neq u_{2}$, and $λ \in (0, 1)$ such that $λ u_{1} + (1 - λ) u_{2} = 1$) and satisfying $f (1) = 0$. Note that the convexity of f already implies its continuity on $(0, \infty)$. Here, the continuity of f at 0 is extended by setting $f (0) = \lim_{u ↓ 0} f (u)$, which may be infinite. Otherwise, $f (u)$ is assumed to be finite for $u > 0$.

Definition 1 ([12,13,14]). The f-divergence between two probability densities p and q with common support $\mathcal{R} \subseteq \mathbb{R}^{d}$ on the Lebesgue measurable space $(\mathcal{R}, \mathcal{B} (\mathcal{R}), μ)$ is denoted by $D_{f} (p ∥ q)$ and given by

\begin{matrix} D_{f} (p ∥ q) = \int_{\mathcal{R}} q f (\frac{p}{q}) d μ, \end{matrix}

(1)

where we have used the shorthand $\int_{\mathcal{R}} g d μ := \int_{\mathcal{R}} g (x) d μ (x)$ for a measurable function g; we follow this convention from now on. Here, f is referred to as the generating function of $D_{f} (p ∥ q)$.

For simplicity, we consider throughout densities with common supports. A comprehensive definition of f-divergence for arbitrary distributions can be found in Section III of [15]. We require that f is strictly convex around 1 and that it satisfies the normalization condition $f (1) = 0$ to ensure positive-definiteness of the f-divergence, i.e., $D_{f} (p ∥ q) \geq 0$ with equality holding iff $p = q$ (a.e.). We present examples of f-divergences under various choices of their generating function f in Table 1. We will be invoking these divergence measures in different parts of the paper.

The Rényi divergence of order $α$ ($α > 0$, $α \neq 1$) between densities p and q with common support $\mathcal{R}$ is used in [6] in the RényiGAN problem; it is given by [23,24]

\begin{matrix} D_{α} (p ∥ q) = \frac{1}{α - 1} \log (\int_{\mathcal{R}} p^{α} q^{1 - α} d μ) . \end{matrix}

(2)

Note that the Rényi divergence is not an f-divergence; however, it can be expressed as a transformation of the Hellinger divergence $H_{α}$ (which is itself an f-divergence):

\begin{matrix} D_{α} (p ∥ q) = \frac{1}{α - 1} \log (1 + (α - 1) H_{α} (p ∥ q)) . \end{matrix}

(3)
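As a numerical illustration of (2) and (3), the following Python sketch evaluates both sides for discrete distributions, where the integral becomes a sum. It assumes that the Hellinger divergence of order $α$ is the f-divergence generated by $f (u) = (u^{α} - 1) / (α - 1)$; this form, the example distributions, and all function names are assumptions made for illustration only.

```python
import math

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha, eq. (2), for discrete p and q."""
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def hellinger_divergence(p, q, alpha):
    """Assumed form: f-divergence with f(u) = (u**alpha - 1) / (alpha - 1)."""
    return sum(qi * ((pi / qi)**alpha - 1) / (alpha - 1) for pi, qi in zip(p, q))

def renyi_from_hellinger(p, q, alpha):
    """Right-hand side of eq. (3)."""
    h = hellinger_divergence(p, q, alpha)
    return math.log(1 + (alpha - 1) * h) / (alpha - 1)

# Two arbitrary strictly positive distributions on three points.
p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
for alpha in (0.5, 2.0, 3.0):
    print(alpha, renyi_divergence(p, q, alpha), renyi_from_hellinger(p, q, alpha))
```

Under the stated assumption, the two printed values agree for each order up to floating-point error.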

We now introduce a new measure, the Jensen-f-divergence, which is analogous to the Jensen-Shannon and Jensen-Rényi divergences.

Definition 2. The Jensen-f-divergence between two probability distributions p and q with common support $\mathcal{R} \subseteq \mathbb{R}^{d}$ on the Lebesgue measurable space $(\mathcal{R}, \mathcal{B} (\mathcal{R}), μ)$ is denoted by ${JD}_{f} (p ∥ q)$ and given by

\begin{matrix} {JD}_{f} (p ∥ q) = \frac{1}{2} D_{f} (p ∥ \frac{p + q}{2}) + \frac{1}{2} D_{f} (q ∥ \frac{p + q}{2}), \end{matrix}

(4)

where $D_{f} (\cdot ∥ \cdot)$ is the f-divergence.

We next verify that the Jensen-Shannon divergence is a Jensen-f-divergence.

Lemma 1. Let p and q be two densities with common support $\mathcal{R} \subseteq \mathbb{R}^{d}$, and consider the function $f : [0, \infty) \to (- \infty, \infty]$ given by $f (u) = u \log u$. Then we have that

\begin{matrix} {JD}_{f} (p ∥ q) = JSD (p ∥ q) . \end{matrix}

(5)

Proof. As f is convex (and continuous) on its domain with $f (1) = 0$, we have that

\begin{matrix} JSD (p ∥ q) & = \frac{1}{2} KL (p ∥ \frac{p + q}{2}) + \frac{1}{2} KL (q ∥ \frac{p + q}{2}) \\ & = \frac{1}{2} \int_{\mathcal{R}} p \log (\frac{2 p}{p + q}) d μ + \frac{1}{2} \int_{\mathcal{R}} q \log (\frac{2 q}{p + q}) d μ \\ & = \frac{1}{2} \int_{\mathcal{R}} \frac{p + q}{2} (\frac{2 p}{p + q} \log (\frac{2 p}{p + q})) d μ \\ & \quad + \frac{1}{2} \int_{\mathcal{R}} \frac{p + q}{2} (\frac{2 q}{p + q} \log (\frac{2 q}{p + q})) d μ \\ & = {JD}_{f} (p ∥ q) . \end{matrix}

□
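Lemma 1 can be checked numerically on discrete distributions, where the integrals in (1) and (4) become finite sums. The sketch below is only an illustration: the function names and the two example distributions are assumptions, and strictly positive probabilities are assumed so that all ratios are well defined.

```python
import math

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)), the discrete form of eq. (1)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def jensen_f_divergence(f, p, q):
    """JD_f(p || q) of eq. (4): average of D_f(p || m) and D_f(q || m), m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * f_divergence(f, p, m) + 0.5 * f_divergence(f, q, m)

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jsd(p, q):
    """Jensen-Shannon divergence computed directly from its definition."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def u_log_u(u):
    """Generating function f(u) = u log u, with f(0) = 0 by continuity."""
    return u * math.log(u) if u > 0 else 0.0

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(jensen_f_divergence(u_log_u, p, q), jsd(p, q))
```

The two printed values coincide up to floating-point error, and swapping `p` and `q` leaves both unchanged.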

Remark 1 (Jensen-f-divergence is a symmetric f-divergence). Note that ${JD}_{f} (p ∥ q)$ is itself a symmetric f-divergence (with a modified generating function). Indeed, given the continuous convex function f that is strictly convex around 1 with $f (1) = 0$, consider the functions

f_{1} (u) := \frac{u + 1}{2} f (\frac{2 u}{u + 1}), \quad u \geq 0,

and

f_{2} (u) := \frac{u + 1}{2} f (\frac{2}{u + 1}), \quad u \geq 0,

which are both continuous convex, strictly convex around 1, and satisfy $f_{1} (1) = f_{2} (1) = 0$. Now, direct calculations yield that

D_{f} (p ∥ \frac{p + q}{2}) = D_{f_{1}} (p ∥ q)

and

D_{f} (q ∥ \frac{p + q}{2}) = D_{f_{2}} (p ∥ q) .

Thus,

\begin{matrix} {JD}_{f} (p ∥ q) & = \frac{1}{2} D_{f_{1}} (p ∥ q) + \frac{1}{2} D_{f_{2}} (p ∥ q) = D_{\bar{f}} (p ∥ q), \end{matrix}

where $\bar{f} := \frac{1}{2} (f_{1} + f_{2})$, i.e.,

\bar{f} (u) = \frac{u + 1}{4} (f (\frac{2 u}{u + 1}) + f (\frac{2}{u + 1})), \quad u \geq 0,

(6)

is also continuous convex, strictly convex around 1, and satisfies $\bar{f} (1) = 0$. Since by (4), ${JD}_{f} (p ∥ q) = {JD}_{f} (q ∥ p)$, we conclude that the Jensen-f-divergence is a symmetric $\bar{f}$-divergence. An equivalent argument is to note that $\bar{f} = {\bar{f}}^{★}$, where ${\bar{f}}^{★} (u) := u \bar{f} (\frac{1}{u})$, $u \geq 0$ (with ${\bar{f}}^{★} (0) = \lim_{t \to \infty} \bar{f} (t) / t$), which is a necessary and sufficient condition for the $\bar{f}$-divergence to be symmetric (see p. 4399 in [15]).
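The two claims of Remark 1 can be illustrated numerically for the particular choice $f (u) = u \log u$. The Python sketch below builds $\bar{f}$ from (6), compares $D_{\bar{f}} (p ∥ q)$ with ${JD}_{f} (p ∥ q)$ on discrete distributions, and evaluates the symmetry condition $\bar{f} (u) = u \bar{f} (1 / u)$ at a few points. The function names and test values are assumptions made for illustration.

```python
import math

def f(u):
    """Example generating function f(u) = u log u (f(0) = 0 by continuity)."""
    return u * math.log(u) if u > 0 else 0.0

def f_bar(u):
    """Modified generating function of eq. (6)."""
    return (u + 1) / 4 * (f(2 * u / (u + 1)) + f(2 / (u + 1)))

def f_divergence(g, p, q):
    """Discrete f-divergence: sum_x q(x) g(p(x) / q(x))."""
    return sum(qi * g(pi / qi) for pi, qi in zip(p, q))

def jensen_f_divergence(g, p, q):
    """Discrete Jensen-f-divergence of eq. (4)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * f_divergence(g, p, m) + 0.5 * f_divergence(g, q, m)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
jd_value = jensen_f_divergence(f, p, q)   # left-hand side: JD_f(p || q)
fbar_value = f_divergence(f_bar, p, q)    # right-hand side: D_{f_bar}(p || q)
print(jd_value, fbar_value)

# Symmetry condition f_bar(u) = u * f_bar(1 / u) at a few sample points.
for u in (0.2, 0.5, 3.0):
    print(u, f_bar(u), u * f_bar(1 / u))
```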

Remark 2 (Domain of f). Examining (4), we note that the Jensen-f-divergence between p and q involves the f-divergences between either p or q and their mixture $(p + q) / 2$. In other words, to determine ${JD}_{f} (p ∥ q)$, we only need $f (\frac{2 p}{p + q})$ and $f (\frac{2 q}{p + q})$ when taking the expectations in (1). Thus, it is sufficient to restrict the domain of the convex function f to the interval $[0, 2]$.

3. Main Results

We now present our main theorem that unifies various generator loss functions under a CPE-based loss function $L_{α}$ for a dual-objective GAN, $L_{α}$-GAN, with a canonical discriminator loss function that is optimized as in [1]. Under some regularity conditions on the loss function $L_{α}$, we show that under the optimal discriminator, our generator loss becomes a Jensen-f-divergence.

Let $(\mathcal{X}, \mathcal{B} (\mathcal{X}), μ)$ be the measure space of $n \times n \times m$ images (where $m = 1$ for black and white images and $m = 3$ for RGB images), and let $(\mathcal{Z}, \mathcal{B} (\mathcal{Z}), μ)$ be a measure space such that $\mathcal{Z} \subseteq \mathbb{R}^{d}$. The discriminator neural network is given by $D : \mathcal{X} \to [0, 1]$, and the generator neural network is given by $G : \mathcal{Z} \to \mathcal{X}$. The generator’s noise input is sampled from a multivariate Gaussian distribution $\mathbb{P}_{z} : \mathcal{Z} \to [0, 1]$. We denote the probability distribution of the real data by $\mathbb{P}_{x} : \mathcal{X} \to [0, 1]$ and the probability distribution of the generated data by $\mathbb{P}_{g} : \mathcal{X} \to [0, 1]$. We also set $P_{x}$ and $P_{g}$ as the densities corresponding to $\mathbb{P}_{x}$ and $\mathbb{P}_{g}$, respectively. We begin by introducing the $L_{α}$-GAN system.

Definition 3. Fix $α \in \mathcal{A} \subseteq \mathbb{R}$ and let $L_{α} : \{0, 1\} \times [0, 1] \to [0, \infty)$ be a loss function such that $\hat{y} L_{α} (1, \frac{\hat{y}}{2})$ is a continuous function that is either convex or concave in $\hat{y} \in [0, 2]$ with strict convexity (respectively, strict concavity) around $\hat{y} = 1$ and such that $L_{α}$ is symmetric in the sense that

\begin{matrix} L_{α} (1, \hat{y}) = L_{α} (0, 1 - \hat{y}), \quad \hat{y} \in [0, 1] . \end{matrix}

(7)

Then the $L_{α}$-GAN system is defined by $(V_{D}, V_{L_{α}, G})$, where $V_{D} : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ is the discriminator loss function, and $V_{L_{α}, G} : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ is the generator loss function, which is given by

\begin{matrix} V_{L_{α}, G} (D, G) & = \mathbb{E}_{A \sim P_{x}} [- L_{α} (1, D (A))] + \mathbb{E}_{B \sim P_{g}} [- L_{α} (0, D (B))] . \end{matrix}

(8)

Moreover, the $L_{α}$-GAN problem is defined by

\begin{matrix} \sup_{D} V_{D} (D, G) \end{matrix}

(9)

\begin{matrix} \inf_{G} V_{L_{α}, G} (D, G) . \end{matrix}

(10)
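To make Definition 3 concrete, the following Python sketch estimates the generator loss (8) from discriminator scores by replacing each expectation with a sample mean and tests the symmetry condition (7) on a grid. The binary cross-entropy is used purely as an example of a symmetric CPE loss; the function names, the sample scores, and the grid are assumptions for illustration and are not taken from the paper's implementation.

```python
import math

def log_loss(y, y_hat):
    """Example CPE loss: binary cross-entropy for a label y in {0, 1}."""
    return -math.log(y_hat) if y == 1 else -math.log(1 - y_hat)

def generator_loss(loss, d_real, d_fake):
    """Sample-mean estimate of eq. (8).

    d_real: discriminator scores D(A) on real samples A ~ P_x
    d_fake: discriminator scores D(B) on generated samples B ~ P_g
    """
    real_term = sum(-loss(1, d) for d in d_real) / len(d_real)
    fake_term = sum(-loss(0, d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def is_symmetric(loss, grid, tol=1e-12):
    """Check eq. (7): loss(1, y_hat) == loss(0, 1 - y_hat) on a grid of predictions."""
    return all(abs(loss(1, y) - loss(0, 1 - y)) < tol for y in grid)

# Hypothetical discriminator scores for a small batch.
d_real = [0.9, 0.8, 0.7]
d_fake = [0.2, 0.4]
print(generator_loss(log_loss, d_real, d_fake))
print(is_symmetric(log_loss, [0.1, 0.25, 0.5, 0.9]))
```

When every score equals 1/2, which is the output of the discriminator (11) when the two densities coincide, the estimate equals $- 2 \log 2$ for this particular loss.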

We now present our main result about the $L_{α}$-GAN optimization problem.

Theorem 1. For a fixed $α \in \mathcal{A} \subseteq \mathbb{R}$ and $L_{α} : \{0, 1\} \times [0, 1] \to [0, \infty)$, let $(V_{D}, V_{L_{α}, G})$ be the loss functions of $L_{α}$-GAN and consider joint optimization in (9)–(10). If $V_{D}$ is a canonical loss function in the sense that it is maximized at $D = D^{*}$, where

\begin{matrix} D^{*} = \frac{P_{x}}{P_{x} + P_{g}}, \end{matrix}

(11)

then (10) reduces to

\begin{matrix} \inf_{G} V_{L_{α}, G} (D^{*}, G) = \inf_{G} 2 a {JD}_{f_{α}} (P_{x} ∥ P_{g}) - 2 a b, \end{matrix}

(12)

where ${JD}_{f_{α}} (\cdot ∥ \cdot)$ is the Jensen-$f_{α}$-divergence, and $f_{α} : [0, 2] \to \mathbb{R}$ is a continuous convex function that is strictly convex around 1 and is given by

\begin{matrix} f_{α} (u) = - u (\frac{1}{a} L_{α} (1, \frac{u}{2}) - b), \end{matrix}

(13)

where a and b are real constants chosen so that $f_{α} (1) = 0$ with $a < 0$ (respectively, $a > 0$) if $u L_{α} (1, \frac{u}{2})$ is convex (respectively, concave). Finally, (12) is minimized when $P_{x} = P_{g}$ (a.e.).

Proof. Under the assumption that $V_{D}$ is maximized at $D^{*} = \frac{P_{x}}{P_{x} + P_{g}}$, we have that

\begin{matrix} V_{L_{α}, G} (D^{*}, G) & = \mathbb{E}_{A \sim P_{x}} [- L_{α} (1, D^{*} (A))] + \mathbb{E}_{B \sim P_{g}} [- L_{α} (0, D^{*} (B))] \\ & = - \int_{\mathcal{X}} P_{x} L_{α} (1, D^{*}) d μ - \int_{\mathcal{X}} P_{g} L_{α} (0, D^{*}) d μ \\ & = - \int_{\mathcal{X}} P_{x} L_{α} (1, \frac{P_{x}}{P_{x} + P_{g}}) d μ - \int_{\mathcal{X}} P_{g} L_{α} (0, \frac{P_{x}}{P_{x} + P_{g}}) d μ \\ & = - 2 \int_{\mathcal{X}} (\frac{P_{x} + P_{g}}{2}) \frac{P_{x}}{P_{x} + P_{g}} L_{α} (1, \frac{P_{x}}{P_{x} + P_{g}}) d μ \\ & \quad - 2 \int_{\mathcal{X}} (\frac{P_{x} + P_{g}}{2}) \frac{P_{g}}{P_{x} + P_{g}} L_{α} (0, \frac{P_{x}}{P_{x} + P_{g}}) d μ \\ & \overset{(a)}{=} - 2 \int_{\mathcal{X}} (\frac{P_{x} + P_{g}}{2}) \frac{P_{x}}{P_{x} + P_{g}} L_{α} (1, \frac{P_{x}}{P_{x} + P_{g}}) d μ \\ & \quad - 2 \int_{\mathcal{X}} (\frac{P_{x} + P_{g}}{2}) \frac{P_{g}}{P_{x} + P_{g}} L_{α} (1, \frac{P_{g}}{P_{x} + P_{g}}) d μ \\ & \overset{(b)}{=} - 2 \int_{\mathcal{X}} (\frac{P_{x} + P_{g}}{2}) \frac{P_{x}}{P_{x} + P_{g}} (\frac{- a f_{α} (\frac{2 P_{x}}{P_{x} + P_{g}})}{\frac{2 P_{x}}{P_{x} + P_{g}}} + a b) d μ \\ & \quad - 2 \int_{\mathcal{X}} (\frac{P_{x} + P_{g}}{2}) \frac{P_{g}}{P_{x} + P_{g}} (\frac{- a f_{α} (\frac{2 P_{g}}{P_{x} + P_{g}})}{\frac{2 P_{g}}{P_{x} + P_{g}}} + a b) d μ \\ & = 2 a (\frac{1}{2} \int_{\mathcal{X}} \frac{P_{x} + P_{g}}{2} f_{α} (\frac{2 P_{x}}{P_{x} + P_{g}}) d μ \\ & \quad + \frac{1}{2} \int_{\mathcal{X}} \frac{P_{x} + P_{g}}{2} f_{α} (\frac{2 P_{g}}{P_{x} + P_{g}}) d μ) - 2 a b \\ & = 2 a {JD}_{f_{α}} (P_{x} ∥ P_{g}) - 2 a b, \end{matrix}

where:

(a) holds since $L_{α} (0, 1 - u) = L_{α} (1, u)$ by (7), where $u = \frac{P_{g}}{P_{x} + P_{g}}$, so that $1 - u = \frac{P_{x}}{P_{x} + P_{g}}$.
(b) holds by solving for $L_{α} (1, u)$ in terms of $f_{α} (2 u)$ in (13), where $u = \frac{P_{x}}{P_{x} + P_{g}}$ in the first term and $u = \frac{P_{g}}{P_{x} + P_{g}}$ in the second term.

The constants a and b are chosen so that $f_{α} (1) = 0$. Finally, the continuity and convexity of $f_{α}$ (as well as its strict convexity around 1) directly follow from the corresponding assumptions imposed on the loss function $L_{α}$ in Definition 3 and from the condition imposed on the sign of a in the theorem’s statement. □
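The identity (12) can be checked numerically on a finite alphabet, where the integrals in the proof become sums. The Python sketch below does this for the binary cross-entropy loss with the constants $a = 1$ and $b = \log 2$ (the choice later used in Lemma 2). The function names and the two example distributions are assumptions made for illustration, and strictly positive probabilities are assumed throughout.

```python
import math

def log_loss(y, y_hat):
    """Binary cross-entropy, used here as an example of a symmetric CPE loss."""
    return -math.log(y_hat) if y == 1 else -math.log(1 - y_hat)

def make_f(loss, a, b):
    """Build f(u) = -u * (loss(1, u / 2) / a - b) as in eq. (13), for 0 < u < 2."""
    return lambda u: -u * (loss(1, u / 2) / a - b)

def jensen_f_divergence(f, p, q):
    """Discrete version of eq. (4) for strictly positive p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = (pi + qi) / 2
        total += 0.5 * m * f(pi / m) + 0.5 * m * f(qi / m)
    return total

def generator_loss_at_optimum(loss, p_x, p_g):
    """Eq. (8) evaluated at D* = p_x / (p_x + p_g), with sums replacing integrals."""
    total = 0.0
    for px, pg in zip(p_x, p_g):
        d_star = px / (px + pg)
        total += -px * loss(1, d_star) - pg * loss(0, d_star)
    return total

p_x = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]
a, b = 1.0, math.log(2)  # constants that give f(1) = 0 for the log-loss
f = make_f(log_loss, a, b)
lhs = generator_loss_at_optimum(log_loss, p_x, p_g)
rhs = 2 * a * jensen_f_divergence(f, p_x, p_g) - 2 * a * b
print(lhs, rhs)
```

The two printed values agree up to floating-point error. Other symmetric losses can be tried by passing a different `loss` together with constants `a` and `b` for which `f(1)` vanishes.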

Remark 3. Note that $D^{*}$ given in (11) is not only an optimal discriminator of the (original) VanillaGAN discriminator loss function, but it also optimizes the LSGAN/LkGAN discriminator loss functions when their discriminators’ labels for fake and real data, γ and β, respectively, satisfy $γ = 1$ and $β = 0$ (see Section 3.3).

We next show that the $L_{α}$-GAN of Theorem 1 recovers as special cases a number of well-known GAN generator loss functions and their equilibrium points (under an optimal classical discriminator $D^{*}$).

3.1. VanillaGAN

VanillaGAN [1] uses the same loss function $V_{VG}$ for both the generator and the discriminator, which is

\begin{matrix} V_{VG} (D, G) = \mathbb{E}_{A \sim P_{x}} [\log D (A)] + \mathbb{E}_{B \sim P_{g}} [\log (1 - D (B))] \end{matrix}

(14)

and can be cast as a saddle point optimization problem:

\begin{matrix} \inf_{G} \sup_{D} V_{VG} (D, G) . \end{matrix}

(15)

It is shown in [1] that the optimal discriminator for (15) is given by $D^{*} = \frac{P_{x}}{P_{x} + P_{g}}$, as in (11). When $D = D^{*}$, the optimization reduces to minimizing the Jensen-Shannon divergence:

\begin{matrix} \inf_{G} V_{VG} (D^{*}, G) = \inf_{G} 2 JSD (P_{x} ∥ P_{g}) - 2 \log 2 . \end{matrix}

(16)

We next show that (16) can be obtained from Theorem 1.

Lemma 2. Consider the optimization of VanillaGAN given in (15). Then we have that

\begin{matrix} V_{VG} (D^{*}, G) = 2 JSD (P_{x} ∥ P_{g}) - 2 \log 2 = V_{L_{α}, G} (D^{*}, G), \end{matrix}

where $L_{α} (y, \hat{y}) = - y \log (\hat{y}) - (1 - y) \log (1 - \hat{y})$ for all $α \in \mathcal{A} = \mathbb{R}$.

Proof. For any fixed $α \in \mathbb{R}$, let the function $L_{α}$ in (8) be as defined in the statement:

\begin{matrix} L_{α} (y, \hat{y}) = - y \log (\hat{y}) - (1 - y) \log (1 - \hat{y}) . \end{matrix}

Note that $L_{α}$ is symmetric, since for $\hat{y} \in [0, 1]$, we have that

\begin{matrix} L_{α} (1, \hat{y}) = - \log (\hat{y}) = L_{α} (0, 1 - \hat{y}) . \end{matrix}

Instead of showing the continuity and convexity/concavity conditions imposed on $\hat{y} L_{α} (1, \frac{\hat{y}}{2})$ in Definition 3, we implicitly verify them by directly deriving $f_{α}$ from $L_{α}$ using (13) and showing that it is continuous convex and strictly convex around 1. Setting $a = 1$ and $b = \log 2$, we have that

\begin{matrix} f_{α} (u) & = - u (\frac{1}{a} L_{α} (1, \frac{u}{2}) - b) \\ & = - u (- \log \frac{u}{2} - \log 2) = u \log u . \end{matrix}

Clearly, $f_{α}$ is convex (actually strictly convex on $(0, \infty)$ and hence strictly convex around 1) and continuous on its domain (where $f_{α} (0) = \lim_{u \to 0} u \log (u) = 0$). It also satisfies $f_{α} (1) = 0$. By Lemma 1, we know that under the generating function $f (u) = u \log (u)$, the Jensen-f-divergence reduces to the Jensen-Shannon divergence. Therefore, by Theorem 1, we have that

\begin{matrix} V_{L_{α}, G} (D^{*}, G) & = 2 a {JD}_{f_{α}} (P_{x} ∥ P_{g}) - 2 a b \\ & = 2 JSD (P_{x} ∥ P_{g}) - 2 \log 2 \\ & = V_{VG} (D^{*}, G), \end{matrix}

which finishes the proof. □

3.2. $α$-GAN

The notion of $α$-GANs is introduced in [8] as a way to unify several existing GANs using a parameterized loss function. We describe $α$-GANs next.

Definition 4 ([8]). Let $y \in \{0, 1\}$ be a binary label, $\hat{y} \in [0, 1]$, and fix $α > 0$. The $α$-loss between y and $\hat{y}$ is the map $ℓ_{α} : \{0, 1\} \times [0, 1] \to [0, \infty)$ given by

ℓ_{α} (y, \hat{y}) = \begin{cases} \frac{α}{α - 1} (1 - y {\hat{y}}^{\frac{α - 1}{α}} - (1 - y) {(1 - \hat{y})}^{\frac{α - 1}{α}}), & α \in (0, 1) \cup (1, \infty) \\ - y \log \hat{y} - (1 - y) \log (1 - \hat{y}), & α = 1 . \end{cases}

(17)
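The $α$-loss is straightforward to evaluate numerically. The Python sketch below implements (17) with both label terms subtracted, which is the form consistent with the symmetry $ℓ_{α} (1, \hat{y}) = ℓ_{α} (0, 1 - \hat{y})$ used in the proof of Lemma 3, and probes two limiting regimes: $α$ close to 1 (log-loss) and large $α$ (where $ℓ_{α} (1, \hat{y})$ approaches $1 - \hat{y}$). The function name and the test values are assumptions for illustration.

```python
import math

def alpha_loss(y, y_hat, alpha):
    """alpha-loss of eq. (17) for a label y in {0, 1} and a prediction y_hat in (0, 1)."""
    if alpha == 1:
        return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    e = (alpha - 1) / alpha
    return alpha / (alpha - 1) * (1 - y * y_hat**e - (1 - y) * (1 - y_hat)**e)

y_hat = 0.3
# Symmetry: the loss on label 1 at y_hat equals the loss on label 0 at 1 - y_hat.
print(alpha_loss(1, y_hat, 2.0), alpha_loss(0, 1 - y_hat, 2.0))
# alpha close to 1 approaches the log-loss -log(y_hat).
print(alpha_loss(1, y_hat, 1 + 1e-6), -math.log(y_hat))
# Large alpha approaches 1 - y_hat.
print(alpha_loss(1, y_hat, 1e6), 1 - y_hat)
```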

Definition 5 ([8]). For $α > 0$, the $α$-GAN loss function is given by

\begin{matrix} V_{α} (D, G) = \mathbb{E}_{A \sim P_{x}} [- ℓ_{α} (1, D (A))] + \mathbb{E}_{B \sim P_{g}} [- ℓ_{α} (0, D (B))] . \end{matrix}

(18)

Joint optimization of the $α$-GAN problem is given by

\begin{matrix} \inf_{G} \sup_{D} V_{α} (D, G) . \end{matrix}

(19)

It is known that $α$-GAN recovers several well-known GANs by varying the $α$ parameter: notably, VanillaGAN ($α = 1$) [1] and HellingerGAN ($α = \frac{1}{2}$) [7]. Furthermore, as $α \to \infty$, $V_{α}$ recovers a translated version of the WassersteinGAN loss function [25]. We now present the solution to the joint optimization problem presented in (19).

Proposition 1 ([8]). Let $α > 0$ and consider joint optimization of the $α$-GAN presented in (19). The discriminator $D^{*}$ that maximizes the loss function is given by

\begin{matrix} D^{*} & = \frac{{P_{x}}^{α}}{{P_{x}}^{α} + {P_{g}}^{α}} . \end{matrix}

(20)

Furthermore, when $D = D^{*}$ is fixed, the problem in (19) reduces to minimizing an Arimoto divergence (as defined in Table 1) when $α \neq 1$:

\begin{matrix} \inf_{G} V_{α} (D^{*}, G) = \inf_{G} A_{α} (P_{x} ∥ P_{g}) + \frac{α}{α - 1} (2^{\frac{1}{α}} - 2) \end{matrix}

(21)

and a Jensen-Shannon divergence when $α = 1$:

\begin{matrix} \inf_{G} V_{1} (D^{*}, G) = \inf_{G} 2 JSD (P_{x} ∥ P_{g}) - 2 \log 2, \end{matrix}

(22)

where (21) and (22) achieve their minima iff $P_{x} = P_{g}$ (a.e.).
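The form of the optimal discriminator in (20) can be illustrated pointwise: at a point where the two densities take the values $p_{x}$ and $p_{g}$, the integrand of $V_{α} (D, G)$ is a concave function of the discriminator output d, and its maximizer can be located by a grid search. The Python sketch below is an illustration under these assumptions. The helper names, the density values, and the grid resolution are hypothetical, and the $α$-loss is written with both label terms subtracted.

```python
import math

def alpha_loss(y, y_hat, alpha):
    """alpha-loss for a label y in {0, 1} and a prediction y_hat in (0, 1)."""
    if alpha == 1:
        return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    e = (alpha - 1) / alpha
    return alpha / (alpha - 1) * (1 - y * y_hat**e - (1 - y) * (1 - y_hat)**e)

def pointwise_objective(d, px, pg, alpha):
    """Integrand of V_alpha(D, G) at a point with density values px and pg."""
    return -px * alpha_loss(1, d, alpha) - pg * alpha_loss(0, d, alpha)

def optimal_discriminator(px, pg, alpha):
    """Closed form of eq. (20) at a single point."""
    return px**alpha / (px**alpha + pg**alpha)

px, pg, alpha = 0.6, 0.2, 2.0
d_star = optimal_discriminator(px, pg, alpha)
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda d: pointwise_objective(d, px, pg, alpha))
print(d_star, best)
```

For $α = 1$ the closed form reduces to $p_{x} / (p_{x} + p_{g})$, the VanillaGAN discriminator in (11).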

Recently, $α$-GAN was generalized in [10] to implement a dual-objective GAN, which we describe next.

Definition 6 ([10]). For $α_{D} > 0$ and $α_{G} > 0$, the $(α_{D}, α_{G})$-GAN’s optimization is given by

\begin{matrix} \sup_{D} V_{α_{D}} (D, G) \end{matrix}

(23)

\begin{matrix} \inf_{G} V_{α_{G}} (D, G) \end{matrix}

(24)

where $V_{α_{D}}$ and $V_{α_{G}}$ are defined in (18), with $α$ replaced by $α_{D}$ and $α_{G}$, respectively.

Proposition 2 ([10]). Consider the joint optimization in (23) and (24). Let parameters $α_{D}$, $α_{G} > 0$ satisfy

\begin{matrix} (α_{D} \leq 1, \; α_{G} > \frac{α_{D}}{α_{D} + 1}) \quad \text{or} \quad (α_{D} > 1, \; \frac{α_{D}}{2} < α_{G} \leq α_{D}) . \end{matrix}

(25)

The discriminator $D^{*}$ that maximizes $V_{α_{D}}$ is given by

\begin{matrix} D^{*} & = \frac{{P_{x}}^{α_{D}}}{{P_{x}}^{α_{D}} + {P_{g}}^{α_{D}}} . \end{matrix}

(26)

Furthermore, when $D = D^{*}$ is fixed, the minimization of $V_{α_{G}}$ in (24) is equivalent to the following f-divergence minimization:

\begin{matrix} \inf_{G} V_{α_{G}} (D^{*}, G) = \inf_{G} D_{f_{α_{D}, α_{G}}} (P_{x} ∥ P_{g}) + \frac{α_{G}}{α_{G} - 1} (2^{\frac{1}{α_{G}}} - 2), \end{matrix}

(27)

where $f_{α_{D}, α_{G}} : [0, \infty) \to \mathbb{R}$ is given by

\begin{matrix} f_{α_{D}, α_{G}} (u) & = \frac{α_{G}}{α_{G} - 1} (\frac{u^{α_{D} (1 - \frac{1}{α_{G}}) + 1} + 1}{{(u^{α_{D}} + 1)}^{1 - \frac{1}{α_{G}}}} - 2^{\frac{1}{α_{G}}}) . \end{matrix}

(28)

We now apply the $(α_{D}, α_{G})$-GAN to our main result in Theorem 1 by showing that (12) can recover (27) when $α_{D} = 1$ (which corresponds to a VanillaGAN discriminator loss function).

Lemma 3. Consider the $(α_{D}, α_{G})$-GAN given in Definition 6. Let $α_{D} = 1$ and $α_{G} = α > \frac{1}{2}$. Then, the solution to (24) presented in Proposition 2 is equivalent to minimizing a Jensen-$f_{α}$-divergence: specifically, if $D^{*}$ is the optimal discriminator given by (26), which is equivalent to (11) when $α_{D} = 1$, then $V_{α, G} (D^{*}, G)$ in (27) satisfies

\begin{matrix} V_{α, G} (D^{*}, G) & = 2^{\frac{1}{α}} {JD}_{f_{α}} (P_{x} ∥ P_{g}) + \frac{α}{α - 1} (2^{\frac{1}{α}} - 2) = V_{L_{α}, G} (D^{*}, G), \end{matrix}

(29)

where $L_{α} (y, \hat{y}) = ℓ_{α} (y, \hat{y})$, and

\begin{matrix} f_{α} (u) = \frac{α}{α - 1} (u^{2 - \frac{1}{α}} - u), \quad u \geq 0 . \end{matrix}

(30)

Proof. We show that Theorem 1 recovers Proposition 2 by setting $L_{α} (y, \hat{y}) = ℓ_{α} (y, \hat{y})$. Note that $ℓ_{α}$ is symmetric since

\begin{matrix} ℓ_{α} (1, \hat{y}) = \frac{α}{α - 1} (1 - {\hat{y}}^{1 - \frac{1}{α}}) = ℓ_{α} (0, 1 - \hat{y}) . \end{matrix}

As in the proof of Lemma 2, instead of proving the conditions imposed on $\hat{y} L_{α} (1, \frac{\hat{y}}{2})$ in Definition 3, we derive $f_{α}$ directly from $L_{α}$ using (13) and show that it is continuous convex and strictly convex around 1. From Lemma 2, we know that when $α = 1$, $f_{α} (u) = u \log u$ (which is strictly convex and continuous). For $α \in (0, 1) \cup (1, \infty)$, setting $a = 2^{\frac{1}{α} - 1}$ and $b = \frac{α}{α - 1} (2^{1 - \frac{1}{α}} - 1)$ in (13), we have that

\begin{matrix} f_{α} (u) & = - u (\frac{1}{a} L_{α} (1, \frac{u}{2}) - b) \\ & = - u (2^{1 - \frac{1}{α}} \frac{α}{α - 1} (1 - {(\frac{u}{2})}^{1 - \frac{1}{α}}) - \frac{α}{α - 1} (2^{1 - \frac{1}{α}} - 1)) \\ & = \frac{α}{α - 1} (- u) [2^{1 - \frac{1}{α}} - u^{1 - \frac{1}{α}} - (2^{1 - \frac{1}{α}} - 1)] \\ & = \frac{α}{α - 1} (u^{2 - \frac{1}{α}} - u) . \end{matrix}

Clearly, $f_{α} (1) = 0$. Furthermore, for $α \neq 1$, we have that

\begin{matrix} f_{α}^{″} (u) & = \frac{(2 α - 1) u^{- \frac{1}{α}}}{α}, \quad u > 0, \end{matrix}

which is positive for $α > \frac{1}{2}$, and hence $f_{α}$ is convex for $α > \frac{1}{2}$ (as well as continuous on its domain and strictly convex around 1). Thus, by Theorem 1, we have that

\begin{matrix} V_{L_{α}, G} (D^{*}, G) & = 2 a {JD}_{f_{α}} (P_{x} ∥ P_{g}) - 2 a b \\ & = 2 \cdot 2^{\frac{1}{α} - 1} {JD}_{f_{α}} (P_{x} ∥ P_{g}) - 2 \frac{α}{α - 1} 2^{\frac{1}{α} - 1} (2^{1 - \frac{1}{α}} - 1) \\ & = 2^{\frac{1}{α}} {JD}_{f_{α}} (P_{x} ∥ P_{g}) + \frac{α}{α - 1} (2^{\frac{1}{α}} - 2) . \end{matrix}

We now show that the above Jensen-$f_{α}$-divergence is equal to the $f_{1, α}$-divergence originally derived for the $(1, α)$-GAN problem of Proposition 2 (note from Proposition 2 that if $α_{D} = 1$, then $α_{G} = α > \frac{1}{2}$, so the range of $α$ concurs with the range required above for the convexity of $f_{α}$). For any two distributions p and q with common support $\mathcal{X}$, we have that

\begin{matrix} D_{f_{1, α}} (p ∥ q) & = \frac{α}{α - 1} \int_{\mathcal{X}} q \frac{{(\frac{p}{q})}^{2 - \frac{1}{α}} + 1}{{(\frac{p}{q} + 1)}^{1 - \frac{1}{α}}} d μ - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = \frac{α}{α - 1} \int_{\mathcal{X}} q \frac{{(\frac{p}{q})}^{2 - \frac{1}{α}} + 1}{{(\frac{p + q}{q})}^{1 - \frac{1}{α}}} d μ - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = \frac{α}{α - 1} \int_{\mathcal{X}} ((p + q) {(\frac{p}{p + q})}^{2 - \frac{1}{α}} + (p + q) {(\frac{q}{p + q})}^{2 - \frac{1}{α}}) d μ \\ & \quad - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = \frac{α}{α - 1} \frac{2}{2^{2 - \frac{1}{α}}} \int_{\mathcal{X}} (\frac{p + q}{2} {(\frac{2 p}{p + q})}^{2 - \frac{1}{α}} + \frac{p + q}{2} {(\frac{2 q}{p + q})}^{2 - \frac{1}{α}}) d μ \\ & \quad - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = \frac{α}{α - 1} 2^{\frac{1}{α} - 1} \int_{\mathcal{X}} (\frac{p + q}{2} ({(\frac{2 p}{p + q})}^{2 - \frac{1}{α}} - \frac{2 p}{p + q}) + p) d μ \\ & \quad + \frac{α}{α - 1} 2^{\frac{1}{α} - 1} \int_{\mathcal{X}} (\frac{p + q}{2} ({(\frac{2 q}{p + q})}^{2 - \frac{1}{α}} - \frac{2 q}{p + q}) + q) d μ \\ & \quad - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = \frac{α}{α - 1} 2^{\frac{1}{α}} \frac{1}{2} (\int_{\mathcal{X}} \frac{p + q}{2} ({(\frac{2 p}{p + q})}^{2 - \frac{1}{α}} - \frac{2 p}{p + q}) d μ + 1) \\ & \quad + \frac{α}{α - 1} 2^{\frac{1}{α}} \frac{1}{2} (\int_{\mathcal{X}} \frac{p + q}{2} ({(\frac{2 q}{p + q})}^{2 - \frac{1}{α}} - \frac{2 q}{p + q}) d μ + 1) \\ & \quad - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = 2^{\frac{1}{α}} {JD}_{f_{α}} (p ∥ q) + \frac{α}{α - 1} 2^{\frac{1}{α} - 1} (2) - \frac{α}{α - 1} 2^{\frac{1}{α}} \\ & = 2^{\frac{1}{α}} {JD}_{f_{α}} (p ∥ q) . \end{matrix}

Therefore, $V_{L_{α}, G} (D^{*}, G) = V_{α} (D^{*}, G)$. □

Note that this lemma generalizes Lemma 2; VanillaGAN is a special case of

(1, α)

-GAN for

α = 1

.
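The identity established in the proof above can be spot-checked numerically on discrete distributions. The sketch below is illustrative only: it assumes, reading off the last steps of the derivation, that f_α(u) = α/(α − 1) (u^{2 − 1/α} − u), and that the Jensen-f-divergence is the equal-weight average of the f-divergences of p and q from the mixture (p + q)/2, with D_f(p ∥ q) taken as the sum of q f(p/q). The function names are hypothetical.

```python
def d_f1_alpha(p, q, alpha):
    """First line of the derivation, with the integral replaced by a finite sum."""
    c = alpha / (alpha - 1.0)
    e = 2.0 - 1.0 / alpha
    total = sum(
        qi * ((pi / qi) ** e + 1.0) / (pi / qi + 1.0) ** (e - 1.0)
        for pi, qi in zip(p, q)
    )
    return c * (total - 2.0 ** (1.0 / alpha))


def jensen_f_alpha(p, q, alpha):
    """Jensen-f_alpha-divergence under the assumed form of f_alpha (see lead-in)."""
    c = alpha / (alpha - 1.0)
    e = 2.0 - 1.0 / alpha
    f = lambda u: c * (u ** e - u)
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    d_p = sum(mi * f(pi / mi) for pi, mi in zip(p, mix))
    d_q = sum(mi * f(qi / mi) for qi, mi in zip(q, mix))
    return 0.5 * d_p + 0.5 * d_q


if __name__ == "__main__":
    p = [0.2, 0.5, 0.3]
    q = [0.4, 0.4, 0.2]
    for alpha in (0.75, 3.0, 10.0):
        lhs = d_f1_alpha(p, q, alpha)
        rhs = 2.0 ** (1.0 / alpha) * jensen_f_alpha(p, q, alpha)
        print(alpha, lhs, rhs)
```

For each tested α, the two printed values should agree up to floating-point error, mirroring the chain of equalities in the proof.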

3.3. Shifted LkGANs and LSGANs

Least squares GAN (LSGAN) was proposed in [26] to mitigate the vanishing gradient problem with VanillaGAN and to stabilize training performance. LSGAN’s loss function is derived from the squared error distortion measure, whereby we aim to minimize the distortion between the data samples and a target value we want the discriminator to assign the samples to. LSGAN was generalized with LkGAN in [6] by replacing the squared error distortion measure with an absolute error distortion measure of order

k \geq 1

, therefore introducing an additional degree of freedom to the generator’s loss function. We first state the general LkGAN problem. We then apply the result of Theorem 1 to the loss functions of LSGAN and LkGAN.

Definition 7

([6]). Let γ, β,

c \in [0, 1]

, and let

k \geq 1

. LkGAN’s loss functions, denoted by

V_{LSGAN, D}

and

V_{k, G}

, are given by

\begin{matrix} V_{LSGAN, D} (D, G) & = - \frac{1}{2} E_{A \sim P_{x}} [{(D (A) - β)}^{2}] - \frac{1}{2} E_{B \sim P_{g}} [{(D (B) - γ)}^{2}] \end{matrix}

(31)

\begin{matrix} V_{k, G} (D, G) & = E_{A \sim P_{x}} [{| D (A) - c |}^{k}] + E_{B \sim P_{g}} [{| D (B) - c |}^{k}] . \end{matrix}

(32)

The LkGAN problem is the joint optimization

\begin{matrix} sup_{D} V_{LSGAN, D} (D, G) \end{matrix}

(33)

\begin{matrix} inf_{G} V_{k, G} (D, G) . \end{matrix}

(34)

We next recall the solution to (33), which is a minimization of the Pearson–Vajda divergence

{| χ |}^{k} (\cdot ∥ \cdot)

of order k (as defined in Table 1).

Proposition 3

([6]). Consider the joint optimization for LkGAN presented in (33). Then the optimal discriminator

D^{*}

that maximizes

V_{LSGAN, D}

in (31) is given by

\begin{matrix} D^{*} & = \frac{γ P_{x} + β P_{g}}{P_{x} + P_{g}} . \end{matrix}

(35)

Furthermore, if

D = D^{*}

and

γ - β = 2 (c - β)

, the minimization of

V_{k, G}

in (32) reduces to

\begin{matrix} inf_{G} V_{k, G} (D, G) & = inf_{G} {| c - β |}^{k} {| χ |}^{k} (P_{x} + P_{g} ∥ 2 P_{g}) . \end{matrix}

(36)

Note that LSGAN [26] is a special case of LkGAN, as we recover LSGAN when

k = 2

[6].

By scrutinizing Proposition 3 and Theorem 1, we observe that the former cannot be recovered from the latter. However, we can use Theorem 1 by slightly modifying the LkGAN generator’s loss function. First, for the dual-objective GAN proposed in Theorem 1, we need

D^{*} = \frac{P_{x}}{P_{x} + P_{g}}

. By (35), this is achieved for

γ = 1

and

β = 0

. Then, we define the intermediate loss function

\begin{matrix} {\tilde{V}}_{k, G} (D, G) & = E_{A \sim P_{x}} [| D (A) - c_{1} |^{k}] + E_{B \sim P_{g}} [| D (B) - c_{2} |^{k}] . \end{matrix}

(37)

Comparing the above loss function with (8), we note that setting

c_{1} = 0

and

c_{2} = 1

in (37) satisfies the symmetry property of

L_{α}

. Finally, to ensure the generating function

f_{α}

satisfies

f_{α} (1) = 0

, we shift each term in (37) by 1. Putting these changes together, we propose a revised generator loss function denoted by

{\hat{V}}_{k, G}

and given by

\begin{matrix} {\hat{V}}_{k, G} (D, G) = E_{A \sim P_{x}} [{| D (A) |}^{k} - 1] + E_{B \sim P_{g}} [{| 1 - D (B) |}^{k} - 1] . \end{matrix}

(38)

We call a system that uses (38) as a generator loss function a Shifted LkGAN (SLkGAN). If

k = 2

, we have a shifted version of the LSGAN generator loss function, which we call Shifted LSGAN (SLSGAN). Note that none of these modifications alter the gradients of

V_{k, G}

in (32), since the first term is independent of G, the choice of

c_{1}

is irrelevant, and translating a function by a constant does not change its gradients. However, from Proposition 3, for

γ = 0

,

β = 1

, and

c = 1

, we do not have that

γ - β = 2 (c - β)

, and as a result, this modified problem does not reduce to minimizing a Pearson–Vajda divergence. Consequently, we can relax the condition on k in Definition 7 to just

k > 0

. We now show how Theorem 1 can be applied to

L_{α}

-GAN using (38).
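As a concrete reading of (38), the sketch below evaluates an empirical version of the shifted generator loss from batches of discriminator outputs, and compares its generated-sample term with the unshifted term of (37) for c_2 = 1. The function names and the use of plain batch averages in place of expectations are assumptions made for illustration.

```python
def slk_generator_loss(d_real, d_fake, k):
    """Empirical version of (38) from discriminator outputs in [0, 1]."""
    real_term = sum(abs(d) ** k - 1.0 for d in d_real) / len(d_real)
    fake_term = sum(abs(1.0 - d) ** k - 1.0 for d in d_fake) / len(d_fake)
    return real_term + fake_term


def lk_fake_term(d_fake, k, c2=1.0):
    """Unshifted generated-sample term of (37), kept for comparison."""
    return sum(abs(d - c2) ** k for d in d_fake) / len(d_fake)


if __name__ == "__main__":
    d_real = [0.8, 0.6]
    batch_a = [0.2, 0.9]
    batch_b = [0.6, 0.1]
    k = 2.0
    # The shift only subtracts a constant, so differences between two
    # generated batches are identical for the shifted and unshifted terms.
    shifted_gap = slk_generator_loss(d_real, batch_a, k) - slk_generator_loss(d_real, batch_b, k)
    unshifted_gap = lk_fake_term(batch_a, k) - lk_fake_term(batch_b, k)
    print(shifted_gap, unshifted_gap)
```

Because the two gaps coincide, any gradient with respect to the generator's parameters is unaffected by the shift, which is the point made in the paragraph above.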

Lemma 4.

Let

k > 0

. Let

V_{D}

be a discriminator loss function, and let

{\hat{V}}_{k, G}

be the generator’s loss function defined in (38). Consider the joint optimization

\begin{matrix} sup_{D} V_{D} (D, G) \end{matrix}

(39)

\begin{matrix} inf_{G} {\hat{V}}_{k, G} (D, G) \end{matrix}

(40)

If

V_{D}

is optimized at

D^{*} = \frac{P_{x}}{P_{x} + P_{g}}

(i.e.,

V_{D}

is canonical), then we have that

\begin{matrix} {\hat{V}}_{k, G} (D^{*}, G) = \frac{1}{2^{k - 1}} {JD}_{f_{k}} (P_{x} ∥ P_{g}) + \frac{1}{2^{k - 1}} - 2, \end{matrix}

where

f_{k}

is given by

\begin{matrix} f_{k} (u) = u (u^{k} - 1), u \geq 0 . \end{matrix}

Examples of

V_{D} (D, G)

that satisfy the requirements of Lemma 4 include the LkGAN discriminator loss function given by (31) with

γ = 1

and

β = 0

and the VanillaGAN discriminator loss function given by (14).

Proof.

Let

k > 0

. We can restate SLkGAN’s generator loss function in (38) in terms of

V_{L_{α}, G}

in (8): we have that

V_{L_{α}, G} (D^{*}, G) = {\hat{V}}_{k, G} (D^{*}, G)

, where

α = k

, and

L_{k} : \{0, 1\} \times [0, 1] \to [0, \infty)

is given by

\begin{matrix} L_{k} (y, \hat{y}) = - (y ({\hat{y}}^{k} - 1) + (1 - y) ({(1 - \hat{y})}^{k} - 1)) . \end{matrix}

(41)

We have that

L_{k}

is symmetric, since

\begin{matrix} L_{k} (1, \hat{y}) & = - ({\hat{y}}^{k} - 1) = L_{k} (0, 1 - \hat{y}) . \end{matrix}

We derive

f_{α}

from

L_{α}

via (13) and directly check that it is continuous, convex, and strictly convex around 1. Setting

a = \frac{1}{2^{k}}

and

b = 2^{k} - 1

in (13), we have that

\begin{matrix} f_{k} (u) & = - u (\frac{1}{a} L_{k} (1, \frac{u}{2}) - b) \\ = - u (2^{k} (1 - {(\frac{u}{2})}^{k}) - (2^{k} - 1)) \\ = - u (2^{k} - u^{k} - 2^{k} + 1) \\ = u (u^{k} - 1) . \end{matrix}

We clearly have that

f_{k} (1) = 0

and that

f_{k}

is continuous. Furthermore, we have that

f_{k}'' (u) = k (k + 1) u^{k - 1}

, which is non-negative for

u \geq 0

. Therefore,

f_{k}

is convex (as well as strictly convex around 1). As a result, by Theorem 1, we have that

\begin{matrix} {\hat{V}}_{k, G} (D^{*}, G) & = \frac{1}{2^{k - 1}} {JD}_{f_{k}} (P_{x} ∥ P_{g}) - \frac{1}{2^{k - 1}} (2^{k} - 1) \\ & = \frac{1}{2^{k - 1}} {JD}_{f_{k}} (P_{x} ∥ P_{g}) + \frac{1}{2^{k - 1}} - 2 . \end{matrix}

□
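The relation used in the proof, namely that the shifted generator loss at the optimal discriminator equals 2a times the Jensen-f_k-divergence minus 2ab with a = 1/2^k and b = 2^k − 1, can be checked numerically on discrete distributions. The sketch below is a minimal illustration; it assumes the Jensen-f-divergence is the equal-weight average of the f-divergences of each distribution from their mixture, and the function name is hypothetical.

```python
def lemma4_sides(p_x, p_g, k):
    """Return (generator loss at D*, 2a*JD - 2ab) for discrete p_x, p_g."""
    d_star = [px / (px + pg) for px, pg in zip(p_x, p_g)]
    # Left-hand side: (38) evaluated at the optimal discriminator.
    v_hat = sum(px * (d ** k - 1.0) for px, d in zip(p_x, d_star)) + sum(
        pg * ((1.0 - d) ** k - 1.0) for pg, d in zip(p_g, d_star)
    )
    # Right-hand side: Jensen-f_k-divergence with f_k(u) = u (u^k - 1).
    f_k = lambda u: u * (u ** k - 1.0)
    mix = [(px + pg) / 2.0 for px, pg in zip(p_x, p_g)]
    jd = 0.5 * sum(m * f_k(px / m) for px, m in zip(p_x, mix)) + 0.5 * sum(
        m * f_k(pg / m) for pg, m in zip(p_g, mix)
    )
    rhs = (jd - (2.0 ** k - 1.0)) / 2.0 ** (k - 1.0)
    return v_hat, rhs


if __name__ == "__main__":
    p_x = [0.2, 0.5, 0.3]
    p_g = [0.4, 0.4, 0.2]
    for k in (0.25, 2.0, 7.5):
        print(k, lemma4_sides(p_x, p_g, k))
```

The two returned values should agree up to rounding for every k > 0 tried, including values of k below 1.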

We conclude this section by emphasizing that Theorem 1 serves as a unifying result that recovers existing loss functions in the literature and, moreover, provides a way to devise new ones. Our aim in the next section is to demonstrate the versatility of this result experimentally.

4. Experiments

We perform two experiments on three different image datasets that we describe below.

Experiment 1: In the first experiment, we compare

(α, α)

-GAN with

(1, α)

-GAN while controlling the value of

α

. Recall that

α_{D} = 1

corresponds to the canonical VanillaGAN (or DCGAN) discriminator. We aim to verify whether or not replacing an

α

-GAN discriminator with a VanillaGAN discriminator stabilizes or improves the system’s performance depending on the value of

α

. Note that the result of Theorem 1 only applies to the

(α_{D}, α_{G})

-GAN for

α_{D} = 1

. We herein confine the comparison of

(1, α)

-GAN with

(α, α)

-GAN only so that both systems have the same tunable free parameter

α

. The results obtained in [10] for the Stacked MNIST dataset show that

(α_{D}, α_{G})

-GAN provides consistently robust performance when

α_{D} = α_{G}

. Other experiments illustrating the performance of

(α_{D}, α_{G})

-GAN with

α_{D} \neq 1

are carried out for the Celeb-A and LSUN Classroom image datasets in [11] and show improved training stability for

α_{D} < 1

values.

Experiment 2: We train two variants of SLkGAN with the generator loss function as described in (38) and parameterized by

k > 0

. We then utilize two different canonical discriminator loss functions to align with Theorem 1. The first is the VanillaGAN discriminator loss given by (14); we call the resulting dual-objective GAN Vanilla-SLkGAN. The second is the LkGAN discriminator loss given by (31), where we set

γ = 1

and

β = 0

such that the optimal discriminator is given by (11). We call this system Lk-SLkGAN. We compare the two variants to analyze how the value of k and the choice of discriminator loss impact the system’s performance.

4.1. Experimental Setup

We run both experiments on three image datasets: MNIST [27], CIFAR-10 [28], and Stacked MNIST [29]. The MNIST dataset is a dataset of black and white handwritten digits between 0 and 9 and with a size of

28 \times 28 \times 1

. The CIFAR-10 dataset is an RGB dataset of small images of common animals and modes of transportation with a size of

32 \times 32 \times 3

. The Stacked MNIST dataset is an RGB dataset derived from the MNIST dataset and constructed by taking three MNIST images, assigning each to one of the three color channels, and stacking the images on top of each other. The resulting images are then padded so that each one of them has a size of

32 \times 32 \times 3

.
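The construction of a Stacked MNIST image described above can be sketched as follows. The text only states that the stacked images are padded to 32 × 32 × 3; the symmetric two-pixel zero padding, the nested-list image representation, and the function name are assumptions made for this illustration.

```python
def stack_and_pad(digits, pad=2, fill=0):
    """Stack three H x W grayscale images into channels and pad the border.

    digits: sequence of three images, each a list of rows (lists of pixels).
    Returns an (H + 2*pad) x (W + 2*pad) x 3 nested list.
    """
    red, green, blue = digits
    height, width = len(red), len(red[0])
    out = [
        [[fill, fill, fill] for _ in range(width + 2 * pad)]
        for _ in range(height + 2 * pad)
    ]
    for i in range(height):
        for j in range(width):
            # Each source image becomes one color channel of the output pixel.
            out[i + pad][j + pad] = [red[i][j], green[i][j], blue[i][j]]
    return out


if __name__ == "__main__":
    # Three constant 28 x 28 placeholders standing in for MNIST digits.
    images = [[[value] * 28 for _ in range(28)] for value in (1, 2, 3)]
    stacked = stack_and_pad(images)
    print(len(stacked), len(stacked[0]), len(stacked[0][0]))
```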

For Experiment 1, we use

α

values of 0.5, 5.0, 10.0, and 20.0. For each value of

α

, we train (

α

,

α

)-GAN and

(1, α)

-GAN. We additionally train DCGAN, which corresponds to

(1, 1)

-GAN. For Experiment 2, we use k values of 0.25, 1.0, 2.0, 7.5, and 15.0. Note that when

k = 2

, we recover LSGAN. For the MNIST dataset, we run 10 trials with the random seeds 123, 500, 1600, 199621, 60677, 20435, 15859, 33764, 79878, and 36123 and train each GAN for 250 epochs. For the RGB datasets (CIFAR-10 and Stacked MNIST), we run five trials with the random seeds 123, 1600, 60677, 15859, and 79878 and train each GAN for 500 epochs. All experiments utilize an Adam optimizer for the stochastic gradient descent algorithm with a learning rate of

2 \times 10^{- 4}

and parameters

β_{1} = 0.5

,

β_{2} = 0.999

, and

ϵ = 10^{- 7}

[30]. We also experiment with the addition of a gradient penalty (GP); we add a penalty term to the discriminator’s loss function to encourage the discriminator’s gradient to have a unit norm [31].
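To make the optimizer settings concrete, the sketch below applies one bias-corrected Adam update to a scalar parameter using the hyperparameters quoted above. It follows the textbook form of the update from [30]; deep learning frameworks may place ϵ slightly differently, and the function name and scalar setting are assumptions for illustration rather than the authors' training code.

```python
import math


def adam_step(theta, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


if __name__ == "__main__":
    theta, m, v = 1.0, 0.0, 0.0
    for step in range(1, 4):
        theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=step)
        print(step, theta)
```

With a constant gradient, each step moves the parameter by roughly the learning rate, since the ratio of the bias-corrected moments is close to one.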

The MNIST experiments were run on one 6130 2.1 GHz 1xV100 GPU, 8 CPUs, and 16 GB of memory. The CIFAR-10 and Stacked MNIST experiments were run on one Epyc 7443 2.8 GHz GPU, 8 CPUs, and 16 GB of memory. For each experiment, we report the best overall Fréchet inception distance (FID) score [32], the best average FID score amongst all trials and its variance, and the average epoch at which the best FID score occurs and its variance. The FID score for each epoch was computed over 10,000 images. For each metric, the lowest numerical value corresponds to the model with the best metric (indicated in bold in the tables). We also report how many trials we include in our summary statistics, as it is possible for a trial to collapse and not train for the full number of epochs. The neural network architectures used in our experiments are presented in Appendix A. The training algorithms are presented in Appendix B.
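One possible reading of the reported summary statistics is sketched below. It assumes that each completed trial contributes its minimum FID and the epoch at which that minimum occurs, that epochs are counted from 1, and that the variance is the population variance; none of these conventions is stated in the text, and the function name is hypothetical.

```python
import statistics


def summarize_fid(trials):
    """trials: list of per-epoch FID lists, one list per completed trial."""
    best_scores = [min(t) for t in trials]
    best_epochs = [t.index(min(t)) + 1 for t in trials]  # epochs counted from 1
    return {
        "best_fid": min(best_scores),
        "avg_best_fid": statistics.mean(best_scores),
        "var_best_fid": statistics.pvariance(best_scores),
        "avg_best_epoch": statistics.mean(best_epochs),
        "var_best_epoch": statistics.pvariance(best_epochs),
        "num_trials": len(trials),
    }


if __name__ == "__main__":
    # Made-up FID trajectories for two trials, for illustration only.
    print(summarize_fid([[5.0, 3.0, 4.0], [6.0, 2.0, 7.0]]))
```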

4.2. Experimental Results

We report the FID metrics for Experiment 1 in Table 2, Table 3 and Table 4 and for Experiment 2 in Table 5, Table 6 and Table 7. We report only on those experiments that produced meaningful results. Models that utilize a simplified gradient penalty have the suffix “-GP”. For

(α_{D}, α_{G})

-GANs, we display the output of the best-performing systems in Figure 1 and plot the trajectories of the FID scores throughout the training epochs in Figure 2. Similarly, for SLkGANs, outputs of the best-performing systems and FID score vs. epoch trajectories are provided in Figure 3 and Figure 4, respectively.

4.3. Discussion

4.3.1. Experiment 1

From Table 2, we note that 37 of the 90 trials collapse before 250 epochs have passed without a gradient penalty. The (5,5)-GAN collapses for all five trials, and hence, it is not displayed in Table 2. This behavior is expected, as (

α

,

α

)-GAN is more sensitive to exploding gradients when

α

does not tend to 0 or

+ \infty

[8]. The addition of a gradient penalty could mitigate the discriminator’s gradients diverging in the (5,5)-GAN by encouraging gradients to have a unit norm. Using a VanillaGAN discriminator with an

α

-GAN generator (i.e., (1,

α

)-GAN) produces better quality images for all tested values of

α

compared to when both networks utilize an

α

-GAN loss function. The (1,10)-GAN achieves excellent stability, converging in all 10 trials, and also achieves the lowest average FID score. The (1,5)-GAN achieves the lowest FID score overall, marginally outperforming DCGAN. Note that when the average best FID score is very close to the best FID score, the resulting best FID score variance is quite small (of the order of

10^{- 3}

), indicating little statistical variability over the trials.

Likewise, for the CIFAR-10 and Stacked MNIST datasets, (1,

α

)-GAN produces lower FID scores than

(α, α)

-GAN (see Table 3 and Table 4). However, both models are more stable with the CIFAR-10 dataset. With the exception of DCGAN, no model converged to its best FID score for all five trials with the Stacked MNIST dataset. Comparing the trials that did converge, both

(α, α)

-GAN and

(1, α)

-GAN performed better on the Stacked MNIST dataset than the CIFAR-10 dataset. For CIFAR-10, the (1,10)- and (1,20)-GANs produced the best overall FID score and the best average FID score, respectively. On the other hand, the (1,0.5)-GAN produced the best overall FID score and the best average FID score for the Stacked MNIST dataset. We also observe a tradeoff between speed and performance for the CIFAR-10 and Stacked MNIST datasets: the

(1, α)

-GANs arrive at their lowest FID scores later than their respective

(α, α)

-GANs but achieve lower FID scores overall.

Comparing Figure 2c and Figure 2d, we observe that

(α, α)

-GAN-GP provides more stability than

(1, α)

-GAN for lower values of

α

(i.e.,

α = 0.5

), while

(1, α)

-GAN-GP exhibits more stability for higher

α

values (

α = 10

and

α = 20

). Figure 2e,f show that the two

α

-GANs trained on the Stacked MNIST dataset exhibit unstable behavior earlier into training when

α = 0.5

or

α = 20

. However, both systems stabilize and converge to their lowest FID scores as training progresses. The (0.5,0.5)-GAN-GP system in particular exhibits wildly erratic behavior for the first 200 epochs then finishes training with a stable trajectory that outperforms DCGAN-GP.

A future direction is to explore how the complexity of an image dataset influences the best choice of

α

. For example, the Stacked MNIST dataset might be considered to be less complex than CIFAR-10, as images in the Stacked MNIST dataset only contain four unique colors (black, red, green, and blue), while the CIFAR-10 dataset utilizes significantly more colors.

4.3.2. Experiment 2

We see from Table 5 that all Lk-SLkGANs and Vanilla-SLkGANs have FID scores comparable to the DCGAN. When

k = 15

, Vanilla-SLkGAN and Lk-SLkGAN arrive at their lowest FID scores slightly earlier than DCGAN and other SLkGANs.

The addition of a simplified gradient penalty is necessary for Lk-SLkGAN to achieve overall good performance on the CIFAR-10 dataset (see Table 6). Interestingly, Vanilla-SLkGAN achieves lower FID scores without a gradient penalty for lower k values (

k = 1, 2

) and with a gradient penalty for higher k values (

k = 7.5, 15

). When

k = 0.25

, both SLkGANs collapsed for all five trials without a gradient penalty.

Table 7 shows that Vanilla-SLkGANs achieve better FID scores than their respective Lk-SLkGAN counterparts. However, Lk-SLkGANs are more stable, as no single trial collapsed, while 10 of the 25 Vanilla-SLkGAN trials collapsed before 500 epochs had passed. While all Vanilla-SLkGANs outperform the DCGAN with a gradient penalty, Lk-SLkGAN-GP only outperforms DCGAN-GP when

k = 15

. Except for when

k = 7.5

, we observe that the Lk-SLkGAN system takes fewer epochs to arrive at its lowest FID score. Comparing Figure 4e and Figure 4f, we observe that Lk-SLkGANs exhibit more stable FID score trajectories than their respective Vanilla-SLkGANs. This makes sense, as the LkGAN loss function aims to increase the GAN’s stability compared to DCGAN [6].

5. Conclusions

We introduced a parameterized CPE-based generator loss function for a dual-objective GAN termed

L_{α}

-GAN that, when used in tandem with a canonical discriminator loss function that achieves its optimum in (11), minimizes a Jensen-

f_{α}

-divergence. We showed that this system can recover VanillaGAN,

(1, α)

-GAN, and LkGAN as special cases. We conducted experiments with the three aforementioned

L_{α}

-GANs on three image datasets. The experiments indicate that

(1, α)

-GAN exhibits better performance than

(α, α)

-GAN with

α > 1

. They also show that the devised SLkGAN system achieves lower FID scores with a VanillaGAN discriminator compared with an LkGAN discriminator.

Future work consists of unveiling more examples of existing GANs that fall under our result as well as applying

L_{α}

-GAN to novel, judiciously designed CPE losses

L_{α}

and evaluating the performance (in terms of both quality and diversity of generated samples) and the computational efficiency of the resulting models. Another interesting and related direction is to study

L_{α}

-GAN within the context of f-GANs, given that the Jensen-f-divergence is itself an f-divergence (see Remark 1), by systematically analyzing different Jensen-f-divergences and the role they play in improving GAN performance and stability. Other worthwhile directions include incorporating the proposed

L_{α}

loss into state-of-the-art GAN models, such as, among others, BigGAN [33], StyleGAN [34], and CycleGAN [35], for high-resolution data generation and image-to-image translation applications and conducting a meticulous analysis of the sensitivity of the models’ performance to different values of the

α

parameter and providing guidelines on how best to tune

α

for different types of datasets.

Author Contributions

Conceptualization, investigation and manuscript preparation, all authors; formal analysis, all authors; software development and simulation, J.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Data Availability Statement

All code used in our experiments can be found at https://github.com/justin-veiner/MASc (accessed on 20 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Neural Network Architectures

We outline the architectures used for the generator and discriminator. For the MNIST dataset, we use the architectures of [6]. For the CIFAR-10 and Stacked MNIST datasets, we base the architectures on [5]. We summarize some aliases for the architectures in Table A1. For all models, we use a batch size of 100 and a noise size of 784 for the generator input.

Table A1. Summary of aliases used to describe neural network architectures.

Alias	Definition
FC	Fully Connected
UpConv2D	Deconvolutional Layer
Conv2D	Convolutional Layer
BN	Batch Normalization
LeakyReLU	Leaky Rectified Linear Unit

We omit the bias in the convolutional and deconvolutional layers to decrease the number of parameters being trained, which in turn decreases computation times. We initialize our kernels using a normal distribution with zero mean and variance 0.01. We present the MNIST architectures in Table A2 and Table A3 and the CIFAR-10 and Stacked MNIST architectures in Table A4 and Table A5.

Table A2. Discriminator architecture for the MNIST dataset.

Layer	Output Size	Kernel	Stride	BN	Activation
Input	$28 \times 28 \times 1$			No
Conv2D	$14 \times 14 \times 64$	$5 \times 5$	2	No	LeakyReLU (0.3)
Dropout (0.3)				No
Conv2D	$7 \times 7 \times 128$	$5 \times 5$	2	No	LeakyReLU (0.3)
Dropout (0.3)				No
FC	1			No	Sigmoid

Table A3. Generator architecture for the MNIST dataset.

Layer	Output Size	Kernel	Stride	BN	Activation
Input	784
FC	$7 \times 7 \times 256$
UpConv2D	$7 \times 7 \times 128$	$5 \times 5$	1	Yes	LeakyReLU (0.3)
UpConv2D	$14 \times 14 \times 64$	$5 \times 5$	2	Yes	LeakyReLU (0.3)
UpConv2D	$28 \times 28 \times 1$	$5 \times 5$	2	No	Tanh

Table A4. Discriminator architecture for the CIFAR-10 and Stacked MNIST datasets.

Layer	Output Size	Kernel	Stride	BN	Activation
Input	$32 \times 32 \times 3$
Conv2D	$16 \times 16 \times 128$	$3 \times 3$	2	No	LeakyReLU (0.2)
Conv2D	$8 \times 8 \times 128$	$3 \times 3$	2	No	LeakyReLU (0.2)
Conv2D	$4 \times 4 \times 256$	$3 \times 3$	2	No	LeakyReLU (0.2)
Dropout (0.4)				No
FC	1				Sigmoid

Table A5. Generator architecture for the CIFAR-10 and Stacked MNIST datasets.

Layer	Output Size	Kernel	Stride	BN	Activation
Input	784
FC	$4 \times 4 \times 256$
UpConv2D	$8 \times 8 \times 128$	$4 \times 4$	2	Yes	LeakyReLU (0.2)
UpConv2D	$16 \times 16 \times 128$	$4 \times 4$	2	Yes	LeakyReLU (0.2)
UpConv2D	$32 \times 32 \times 128$	$4 \times 4$	2	Yes	LeakyReLU (0.2)
Conv2D	$32 \times 32 \times 3$	$3 \times 3$	1	No	Tanh
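The spatial sizes listed in Tables A2 to A5 are consistent with "same" padding, under which a strided convolution maps a side length n to ceil(n / stride) and a transposed convolution maps n to n × stride. The padding mode is not stated in the text, so it is an assumption here, and the helper names are hypothetical. The sketch traces the side length through each architecture.

```python
import math


def conv_out(size, stride):
    """Side length after a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def upconv_out(size, stride):
    """Side length after a transposed convolution with 'same' padding (assumed)."""
    return size * stride


def trace(size, layers):
    """Return the side length before and after each (kind, stride) layer."""
    sizes = [size]
    for kind, stride in layers:
        size = conv_out(size, stride) if kind == "conv" else upconv_out(size, stride)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print("A2:", trace(28, [("conv", 2), ("conv", 2)]))
    print("A3:", trace(7, [("up", 1), ("up", 2), ("up", 2)]))
    print("A4:", trace(32, [("conv", 2)] * 3))
    print("A5:", trace(4, [("up", 2)] * 3 + [("conv", 1)]))
```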

Appendix B. Algorithms

We outline the algorithms used to train our models in Algorithms A1–A3.

Algorithm A1 Overview of (

α_{D}

,

α_{G}

)-GAN training

Require

α_{D}

,

α_{G}

, number of epochs

n_{e}

, batch size B, learning rate

η

Initialize generator G with parameters

θ_{G}

, discriminator D with parameters

θ_{D}

.
for

i = 1

to

n_{e}

do
Sample batch of real data

x = {x_{1}, \dots, x_{B}}

from dataset
Sample batch of Gaussian noise vectors

z = {z_{1}, \dots, z_{B}} \sim N (0, I)

Update the discriminator’s parameters using an Adam optimizer with learning rate

η

by descending the gradient:

\begin{matrix} \nabla_{θ_{D}} (- \frac{1}{B} \sum_{i = 1}^{B} (- ℓ_{α} (1, D (x_{i})) - ℓ_{α} (0, D (G (z_{i}))))) \end{matrix}

or update the discriminator’s parameters with a simplified GP:

\begin{matrix} \nabla_{θ_{D}} (- \frac{1}{B} \sum_{i = 1}^{B} (- ℓ_{α} (1, D (x_{i})) - ℓ_{α} (0, D (G (z_{i})))) \\ + 5 (\sum_{i = 1}^{B} | | \nabla_{x} log (\frac{D (x)}{1 - D (x)}) | |_{2}^{2})) \end{matrix}

Update the generator’s parameters using an Adam optimizer with learning rate

η

and descending the gradient:

\begin{matrix} \nabla_{θ_{G}} (\frac{1}{B} \sum_{i = 1}^{B} ℓ_{α} (0, D (G (z_{i})))) \end{matrix}

end for
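The simplified GP term in Algorithm A1 penalizes the squared norm of the input gradient of log(D(x)/(1 − D(x))), which for a sigmoid-output discriminator is the pre-sigmoid logit. The sketch below illustrates this on a toy logistic discriminator using central finite differences. In practice an automatic differentiation framework would compute the gradient; the toy discriminator, its weights, and the function names are hypothetical.

```python
import math


def logit_of(discriminator, x):
    """log(D(x) / (1 - D(x))) for a discriminator with outputs in (0, 1)."""
    d = discriminator(x)
    return math.log(d / (1.0 - d))


def simplified_gp(discriminator, batch, weight=5.0, h=1e-5):
    """weight * sum over the batch of the squared input-gradient norm of the logit."""
    penalty = 0.0
    for x in batch:
        sq_norm = 0.0
        for j in range(len(x)):
            up = list(x)
            down = list(x)
            up[j] += h
            down[j] -= h
            # Central finite difference along coordinate j.
            g = (logit_of(discriminator, up) - logit_of(discriminator, down)) / (2.0 * h)
            sq_norm += g * g
        penalty += sq_norm
    return weight * penalty


def toy_discriminator(x, w=(3.0, 4.0), b=-0.5):
    """Logistic discriminator; its logit is w.x + b, so the input gradient is w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    batch = [[0.1, 0.2], [0.0, -0.1]]
    print(simplified_gp(toy_discriminator, batch))
```

For this toy model the squared gradient norm is 3² + 4² = 25 at every input, so the penalty for a batch of two samples is close to 5 × 50 = 250.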

Algorithm A2 Overview of Lk-SLkGAN training

Require k, number of epochs

n_{e}

, batch size B, learning rate

η

Initialize generator G with parameters

θ_{G}

, discriminator D with parameters

θ_{D}

.
for

i = 1

to

n_{e}

do
Sample batch of real data

x = {x_{1}, \dots, x_{B}}

from dataset
Sample batch of Gaussian noise vectors

z = {z_{1}, \dots, z_{B}} \sim N (0, I)

Update the discriminator’s parameters using an Adam optimizer with learning rate

η

by descending the gradient:

\begin{matrix} \nabla_{θ_{D}} (\frac{1}{B} \sum_{i = 1}^{B} (\frac{1}{2} {(D (x_{i}) - 1)}^{2} + \frac{1}{2} {(D (G (z_{i})))}^{2})) \end{matrix}

or update the discriminator’s parameters with a simplified GP:

\begin{matrix} \nabla_{θ_{D}} (\frac{1}{B} \sum_{i = 1}^{B} (\frac{1}{2} {(D (x_{i}) - 1)}^{2} + \frac{1}{2} {(D (G (z_{i})))}^{2}) \\ + 5 (\sum_{i = 1}^{B} | | \nabla_{x} log (\frac{D (x)}{1 - D (x)}) | |_{2}^{2})) \end{matrix}

Update the generator’s parameters using an Adam optimizer with learning rate

η

and descending the gradient:

\begin{matrix} \nabla_{θ_{G}} (\frac{1}{B} \sum_{i = 1}^{B} \frac{1}{2} (| 1 - D (G (z_{i})) |^{k} - 1)) \end{matrix}

end for

Algorithm A3 Overview of Vanilla-SLkGAN training

Require k, number of epochs

n_{e}

, batch size B, learning rate

η

Initialize generator G with parameters

θ_{G}

, discriminator D with parameters

θ_{D}

.
for

i = 1

to

n_{e}

do
Sample batch of real data

x = {x_{1}, \dots, x_{B}}

from dataset
Sample batch of noise vectors

z = {z_{1}, \dots, z_{B}} \sim N (0, I)

Update the discriminator’s parameters using an Adam optimizer with learning rate

η

by descending the gradient:

\begin{matrix} \nabla_{θ_{D}} (- \frac{1}{B} \sum_{i = 1}^{B} (log (D (x_{i})) + log (1 - D (G (z_{i}))))) \end{matrix}

or update the discriminator’s parameters with a simplified GP:

\begin{matrix} \nabla_{θ_{D}} (- \frac{1}{B} \sum_{i = 1}^{B} (log (D (x_{i})) + log (1 - D (G (z_{i})))) \\ + 5 (\sum_{i = 1}^{B} | | \nabla_{x} log (\frac{D (x)}{1 - D (x)}) | |_{2}^{2})) \end{matrix}

Update the generator’s parameters using an Adam optimizer with learning rate

η

and descending the gradient:

\begin{matrix} \nabla_{θ_{G}} (\frac{1}{B} \sum_{i = 1}^{B} \frac{1}{2} (| 1 - D (G (z_{i})) |^{k} - 1)) \end{matrix}

end for
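The two batch quantities whose gradients are descended in Algorithm A3 (without the GP term) can be written out directly. The sketch below computes them from lists of discriminator outputs on real and generated samples; the function names are hypothetical, and the parameter updates themselves, which require a differentiable model, are omitted.

```python
import math


def vanilla_discriminator_loss(d_real, d_fake):
    """Negative batch average of log D(x_i) + log(1 - D(G(z_i)))."""
    batch_size = len(d_real)
    total = sum(math.log(dr) + math.log(1.0 - df) for dr, df in zip(d_real, d_fake))
    return -total / batch_size


def slk_generator_objective(d_fake, k):
    """Batch average of (1/2) (|1 - D(G(z_i))|^k - 1)."""
    return sum(0.5 * (abs(1.0 - df) ** k - 1.0) for df in d_fake) / len(d_fake)


if __name__ == "__main__":
    d_real = [0.9, 0.7]
    d_fake = [0.2, 0.4]
    print(vanilla_discriminator_loss(d_real, d_fake))
    print(slk_generator_objective(d_fake, k=2.0))
```

The generator objective decreases as the discriminator outputs on generated samples approach 1, for any k > 0.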

References

1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27, pp. 2672–2680.
2. Kwon, Y.H.; Park, M.G. Predicting future frames using retrospective cycle GAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
3. Pan, X.; Zhan, X.; Dai, B.; Lin, D.; Loy, C.C.; Luo, P. Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7474–7489.
4. Jordon, J.; Yoon, J.; Van Der Schaar, M. PATE-GAN: Generating synthetic data with differential privacy guarantees. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
5. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 9th International Conference on Image and Graphics, Shanghai, China, 13–15 September 2017; pp. 97–108.
6. Bhatia, H.; Paul, W.; Alajaji, F.; Gharesifard, B.; Burlina, P. Least kth-order and Rényi generative adversarial networks. Neural Comput. 2021, 33, 2473–2510.
7. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; Volume 29.
8. Kurri, G.R.; Sypherd, T.; Sankar, L. Realizing GANs via a tunable loss function. In Proceedings of the IEEE Information Theory Workshop (ITW), Virtual, 17–21 October 2021; pp. 1–6.
9. Kurri, G.R.; Welfert, M.; Sypherd, T.; Sankar, L. α-GAN: Convergence and estimation guarantees. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022; pp. 276–281.
10. Welfert, M.; Otstot, K.; Kurri, G.R.; Sankar, L. (α_D,α_G)-GANs: Addressing GAN training instabilities via dual objectives. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Taipei, Taiwan, 25–30 June 2023.
11. Welfert, M.; Kurri, G.R.; Otstot, K.; Sankar, L. Addressing GAN training instabilities via tunable classification losses. arXiv 2023, arXiv:2310.18291.
12. Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. Sci. Ser. A 1963, 8, 85–108.
13. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hung. 1967, 2, 299–318.
14. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142.
15. Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412.
16. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
17. Nielsen, F. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy 2020, 22, 221.
18. Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process. Lett. 2013, 21, 10–13.
19. Arimoto, S. Information-theoretical considerations on estimation problems. Inf. Control 1971, 19, 181–194.
20. Österreicher, F. On a class of perimeter-type distances of probability distributions. Kybernetika 1996, 32, 389–393.
21. Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 1909, 1909, 210–271.
22. Sason, I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy 2018, 20, 383.
23. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; Volume 4, pp. 547–562.
24. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
25. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 214–223.
26. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
27. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142.
28. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 22 February 2024).
29. Lin, Z.; Khetan, A.; Fanti, G.; Oh, S. PacGAN: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31, pp. 1–10.
30. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
31. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30, pp. 1–11.
32. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30, pp. 6626–6637.
33. Brock, A.; Donahue, J.; Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv 2018, arXiv:1809.11096.
34. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
35. Almahairi, A.; Rajeshwar, S.; Sordoni, A.; Bachman, P.; Courville, A. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 195–204.

Figure 1. Generated images for the best-performing $(α_{D}, α_{G})$-GANs. (a) $(α_{D}, α_{G})$-GAN for MNIST, $α_{D} = 1.0$, $α_{G} = 5.0$, FID = 1.125. (b) $(α_{D}, α_{G})$-GAN-GP for CIFAR-10, $α_{D} = 1.0$, $α_{G} = 20.0$, FID = 8.466. (c) $(α_{D}, α_{G})$-GAN-GP for Stacked MNIST, $α_{D} = 1.0$, $α_{G} = 0.5$, FID = 4.833.

Figure 2. Average FID scores vs. epochs for various $(α_{D}, α_{G})$-GANs.

Figure 3. Generated images for best-performing SLkGANs. (a) Vanilla-SLkGAN-0.25 for MNIST, FID = 1.112. (b) Vanilla-SLkGAN-2.0 for CIFAR-10, FID = 4.58. (c) Vanilla-SLkGAN-15.0-GP for Stacked MNIST, FID = 3.836.

Figure 4. FID scores vs. epochs for various SLkGANs.

Table 1. Examples of f-divergences.

f-Divergence	Symbol	Formula	$f(u)$
Kullback–Leibler [16]	KL	$\int_{R} p \log \frac{p}{q} \, dμ$	$u \log u$
Jensen-Shannon [17]	JSD	$\frac{1}{2} KL(p \| \frac{p + q}{2}) + \frac{1}{2} KL(q \| \frac{p + q}{2})$	$\frac{1}{2} (u \log u - (u + 1) \log \frac{u + 1}{2})$
Pearson $χ^{2}$ [18]	$χ^{2}$	$\int_{R} \frac{(q - p)^{2}}{p} \, dμ$	$(\sqrt{u} - \frac{1}{\sqrt{u}})^{2}$
Pearson–Vajda ($k > 1$) [18]	$|χ|^{k}$	$\int_{R} \frac{|q - p|^{k}}{p^{k - 1}} \, dμ$	$u^{1 - k} |1 - u|^{k}$
Arimoto ($α > 0$, $α \neq 1$) [15,19,20]	$A_{α}$	$\frac{α}{α - 1} (\int_{R} (p^{α} + q^{α})^{\frac{1}{α}} \, dμ - 2^{\frac{1}{α}})$	$\frac{α}{α - 1} ((1 + u^{α})^{\frac{1}{α}} - (1 + u) - 2^{\frac{1}{α}} + 2)$
Hellinger ($α > 0$, $α \neq 1$) [15,21,22]	$H_{α}$	$\frac{1}{α - 1} (\int_{R} p^{α} q^{1 - α} \, dμ - 1)$	$\frac{u^{α} - 1}{α - 1}$
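
The two right-hand columns of Table 1 can be checked against each other numerically. The sketch below assumes the convention $D_{f}(p \| q) = \int_{R} q \, f(p/q) \, dμ$, which is the one under which $f(u) = u \log u$ yields the KL formula in the table, and it restricts to strictly positive probability mass functions on a finite set. All function names and the example distributions are hypothetical and chosen only for illustration; only the KL, Jensen-Shannon, Pearson $χ^{2}$ and Hellinger rows are covered.

```python
import math

def f_divergence(p, q, f):
    """Discrete D_f(p || q) = sum_x q(x) * f(p(x) / q(x)).

    Assumes p and q are strictly positive pmfs on the same finite set.
    """
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# Generator functions f(u), taken from the last column of Table 1.
def f_kl(u):
    return u * math.log(u)

def f_jsd(u):
    return 0.5 * (u * math.log(u) - (u + 1) * math.log((u + 1) / 2))

def f_pearson(u):
    return (math.sqrt(u) - 1 / math.sqrt(u)) ** 2

def make_f_hellinger(alpha):
    # Valid for alpha > 0, alpha != 1.
    return lambda u: (u ** alpha - 1) / (alpha - 1)

# Direct evaluation of the "Formula" column for the same divergences.
def kl_direct(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jsd_direct(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_direct(p, m) + 0.5 * kl_direct(q, m)

def pearson_direct(p, q):
    return sum((qi - pi) ** 2 / pi for pi, qi in zip(p, q))

def hellinger_direct(p, q, alpha):
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (s - 1) / (alpha - 1)

# Illustrative pmfs (not taken from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
alpha = 2.0

checks = {
    "KL": (f_divergence(p, q, f_kl), kl_direct(p, q)),
    "JSD": (f_divergence(p, q, f_jsd), jsd_direct(p, q)),
    "Pearson": (f_divergence(p, q, f_pearson), pearson_direct(p, q)),
    "Hellinger": (f_divergence(p, q, make_f_hellinger(alpha)),
                  hellinger_direct(p, q, alpha)),
}

for name, (via_f, direct) in checks.items():
    print(f"{name}: via f = {via_f:.6f}, direct = {direct:.6f}")
```

Each pair of printed values should agree up to floating-point error, and every divergence evaluates to zero when both arguments are the same distribution, since each $f$ above satisfies $f(1) = 0$.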

Table 2. $(α_{D}, α_{G})$-GAN results for MNIST.

$(α_{D}, α_{G})$-GAN	Best FID Score	Average Best FID Score	Best FID Score Variance	Average Epochs	Epoch Variance	Number of Successful Trials (/10)
(1,0.5)-GAN	$1.264$	$1.288$	$2.979 \times 10^{- 4}$	$227.25$	$420.25$	4
(0.5,0.5)-GAN	$1.209$	$1.265$	$0.001$	$234.5$	$156.7$	6
(1,5)-GAN	$1.125$	$1.17$	$8.195 \times 10^{- 4}$	$230.3$	$617.344$	10
(1,10)-GAN	$1.147$	$1.165$	$7.984 \times 10^{- 4}$	$225.6$	$253.156$	10
(10,10)-GAN	$36.506$	$39.361$	$16.312$	$1.5$	$0.5$	2
(1,20)-GAN	$1.135$	$1.174$	$0.001$	$237.5$	$274.278$	10
(20,20)-GAN	$33.23$	$33.23$	$0.0$	$1.0$	$0.0$	1
DCGAN	$1.154$	$1.208$	$0.001$	$231.3$	$357.122$	10

Table 3. $(α_{D}, α_{G})$-GAN results for CIFAR-10.

$(α_{D}, α_{G})$-GAN	Best FID Score	Average Best FID Score	Best FID Score Variance	Average Epochs	Epoch Variance	Number of Successful Trials (/5)
(1,0.5)-GAN-GP	$10.551$	$14.938$	$12.272$	$326.2$	$1808.7$	5
(0.5,0.5)-GAN-GP	$13.734$	$14.93$	$0.517$	$223.6$	11,378.3	5
(1,5)-GAN-GP	$10.772$	$11.635$	$0.381$	$132.0$	$1233.5$	5
(5,5)-GAN-GP	$20.79$	$21.72$	$0.771$	$84.8$	$1527.2$	5
(1,10)-GAN-GP	$9.465$	$10.187$	$0.199$	$182.6$	$1096.3$	5
(10,10)-GAN-GP	$19.99$	$21.095$	$0.434$	$131.8$	13,374.7	5
(1,20)-GAN-GP	$8.466$	$10.217$	$1.479$	$216.2$	$6479.7$	5
(20,20)-GAN-GP	$19.378$	$21.216$	$2.315$	$138.2$	29,824.2	5
DCGAN-GP	$25.731$	$28.378$	$3.398$	$158.0$	$2510.5$	5

Table 4. $(α_{D}, α_{G})$-GAN results for Stacked MNIST.

$(α_{D}, α_{G})$-GAN	Best FID Score	Average Best FID Score	Best FID Score Variance	Average Epochs	Epoch Variance	Number of Successful Trials (/5)
(1,0.5)-GAN-GP	$4.833$	$4.997$	$0.054$	$311.5$	23,112.5	2
(0.5,0.5)-GAN-GP	$6.418$	$6.418$	$0.0$	$479.0$	$0.0$	1
(1,5)-GAN-GP	$7.98$	$7.988$	$1.357 \times 10^{- 4}$	$379.5$	11,704.5	2
(5,5)-GAN-GP	$12.236$	$12.836$	$0.301$	$91.5$	$387.0$	4
(1,10)-GAN-GP	$7.502$	$7.528$	$0.001$	$326.5$	14,280.5	2
(10,10)-GAN-GP	$14.22$	$14.573$	$0.249$	$95.0$	$450.0$	2
(1,20)-GAN-GP	$8.379$	$8.379$	$0.0$	$427.0$	$0.0$	1
(20,20)-GAN-GP	$16.584$	$16.584$	$0.0$	$94.0$	$0.0$	1
DCGAN-GP	$7.507$	$7.774$	$0.064$	$303.4$	11,870.8	5

Table 5. SLkGAN results for MNIST.

Variant-SLkGAN-k	Best FID Score	Average Best FID Score	Best FID Score Variance	Average Epochs	Epoch Variance	Number of Successful Trials (/10)
Lk-SLkGAN-0.25	$1.15$	$1.174$	$6.298 \times 10^{- 4}$	$224.3$	$940.9$	10
Vanilla-SLkGAN-0.25	$1.112$	$1.162$	$0.001$	$237.0$	$124.0$	10
Lk-SLkGAN-1.0	$1.122$	$1.167$	$8.857 \times 10^{- 4}$	$233.0$	$124.0$	10
Vanilla-SLkGAN-1.0	$1.126$	$1.17$	$9.218 \times 10^{- 4}$	$226.2$	$1182.844$	10
Lk-SLkGAN-2.0	$1.148$	$1.198$	$5.248 \times 10^{- 4}$	$237.2$	$288.4$	10
Vanilla-SLkGAN-2.0	$1.124$	$1.184$	$8.933 \times 10^{- 4}$	$237.8$	$138.4$	10
Lk-SLkGAN-7.5	$1.455$	$1.498$	$4.422 \times 10^{- 4}$	$229.0$	$322.222$	10
Vanilla-SLkGAN-7.5	$1.439$	$1.511$	$0.001$	$212.2$	$1995.067$	10
Lk-SLkGAN-15.0	$1.733$	$1.872$	$0.005$	$198.8$	$1885.733$	10
Vanilla-SLkGAN-15.0	$1.773$	$1.876$	$0.005$	$171.6$	$3122.267$	10
DCGAN	$1.154$	$1.208$	$0.001$	$231.3$	$357.122$	10

Table 6. SLkGAN results for CIFAR-10.

Variant-SLkGAN-k	Best FID Score	Average Best FID Score	Best FID Score Variance	Average Epochs	Epoch Variance	Number of Successful Trials (/5)
Lk-SLkGAN-1.0	$4.727$	$118.242$	10,914.643	$60.8$	$1897.2$	5
Vanilla-SLkGAN-1.0	$4.821$	$5.159$	$0.092$	$88.0$	$506.5$	5
Lk-SLkGAN-2.0	$4.723$	$145.565$	$7492.26$	$73.2$	$3904.2$	5
Vanilla-SLkGAN-2.0	$4.58$	$5.1$	$0.261$	$105.4$	$740.8$	5
Lk-SLkGAN-7.5	$6.556$	$155.497$	$7116.521$	$254.6$	18,605.3	5
Vanilla-SLkGAN-7.5	$6.384$	$48.905$	$8698.195$	$72.2$	$1711.7$	5
Lk-SLkGAN-15.0	$8.576$	$145.774$	$5945.097$	$263.0$	36,463.0	5
Vanilla-SLkGAN-15.0	$7.431$	$50.868$	$8753.002$	$82.6$	$3106.8$	5
DCGAN	$4.753$	$5.194$	$0.117$	$88.6$	$462.8$	5
Lk-SLkGAN-0.25-GP	$17.366$	$18.974$	$2.627$	$87.8$	$1897.2$	5
Vanilla-SLkGAN-0.25-GP	$16.013$	$17.912$	$1.961$	$189.0$	$9487.5$	5
Lk-SLkGAN-1.0-GP	$10.771$	$12.567$	$1.083$	$77.8$	$239.2$	5
Vanilla-SLkGAN-1.0-GP	$8.569$	$9.588$	$0.749$	$197.6$	$2690.3$	5
Lk-SLkGAN-2.0-GP	$23.11$	$25.013$	$1.924$	$75.4$	$658.8$	5
Vanilla-SLkGAN-2.0-GP	$28.215$	$29.69$	$1.242$	$232.0$	20,438.5	5
Lk-SLkGAN-7.5-GP	$33.304$	$41.48$	$49.187$	$82.8$	$1081.2$	5
Vanilla-SLkGAN-7.5-GP	$33.085$	$34.799$	$1.597$	$290.8$	12,714.7	5
Lk-SLkGAN-15.0-GP	$9.157$	$12.504$	$3.839$	$310.4$	$6976.8$	5
Vanilla-SLkGAN-15.0-GP	$7.283$	$8.568$	$1.535$	$185.6$	$5978.3$	5
DCGAN-GP	$25.731$	$28.378$	$3.398$	$158.0$	$2510.5$	5

Table 7. SLkGAN results for Stacked MNIST.

Variant-SLkGAN-k	Best FID Score	Average Best FID Score	Best FID Score Variance	Average Epochs	Epoch Variance	Number of Successful Trials (/5)
Lk-SLkGAN-0.25-GP	$10.541$	$11.824$	$0.678$	$113.6$	$356.3$	5
Vanilla-SLkGAN-0.25-GP	$5.197$	$5.197$	$0.0$	$496.0$	$0.0$	1
Lk-SLkGAN-1.0-GP	$11.545$	$12.046$	$0.291$	$89.0$	$238.5$	5
Vanilla-SLkGAN-1.0-GP	$7.475$	$7.626$	$0.045$	$177.0$	$3528.0$	2
Lk-SLkGAN-2.0-GP	$10.682$	$12.782$	$2.12$	$180.2$	28,484.7	5
Vanilla-SLkGAN-2.0-GP	$6.023$	$7.096$	$0.991$	$416.667$	12,244.333	3
Lk-SLkGAN-7.5-GP	$8.912$	$9.906$	$0.577$	$239.0$	35,663.5	5
Vanilla-SLkGAN-7.5-GP	$6.074$	$6.43$	$0.164$	$238.0$	21,729.5	5
Lk-SLkGAN-15.0-GP	$4.458$	$4.74$	$0.029$	$253.4$	11,512.3	5
Vanilla-SLkGAN-15.0-GP	$3.836$	$3.873$	$0.002$	$485.0$	$354.667$	4
DCGAN-GP	$7.507$	$7.774$	$0.064$	$303.4$	11,870.8	5

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
