Proceeding Paper

Variational Bayesian Approximation (VBA): A Comparison between Three Optimization Algorithms †

by Seyedeh Azadeh Fallah Mortezanejad 1 and Ali Mohammad-Djafari 2,3,*
1 School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
2 International Science Consulting and Training (ISCT), 91440 Bures sur Yvette, France
3 Shanfeng Company, Shaoxing 312352, China
* Author to whom correspondence should be addressed.
Presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris, France, 18–22 July 2022.
Published: 8 February 2023

Abstract

In many Bayesian computations, we first obtain the expression of the joint distribution of all the unknown variables given the observed data. In general, this expression is not separable in those variables. Thus, obtaining the marginals for each variable and computing the expectations is difficult and costly. This problem becomes even more difficult in high-dimensional settings, which is an important issue in inverse problems. We may then try to propose a surrogate expression with which we can carry out approximate computations. Often, a separable approximate expression can be useful enough. The variational Bayesian approximation (VBA) is a technique that approximates the joint distribution $p$ with an easier, for example separable, distribution $q$ by minimizing the Kullback–Leibler divergence $KL(q|p)$. When $q$ is separable in all the variables, the approximation is also called the mean field approximation (MFA), and $q$ is then the product of the approximated marginals. A first standard and general algorithm is the alternate optimization of $KL(q|p)$ with respect to the $q_i$. A second general approach is its optimization in the Riemannian manifold. However, in this paper, for practical reasons, we consider the case where $p$ is in the exponential family and so is $q$. In this case, $KL(q|p)$ becomes a function of the parameters $\theta$ of the exponential family, and we can then use any other optimization algorithm to obtain those parameters. In this paper, we compare three optimization algorithms, namely a standard alternate optimization, a gradient-based algorithm and a natural gradient algorithm, and study their relative performances on three examples.

1. Introduction

In many applications, with direct or indirect observations, Bayesian computation starts with obtaining the expression of the joint distribution of all the unknown variables given the observed data, which we must then use for inference. In general, this expression is not separable in all the variables of the problem, so the computations become hard and costly. For example, obtaining the marginals for each variable and computing the expectations are difficult and costly. This problem becomes even more crucial in high-dimensional settings, which is an important issue in inverse problems. We may then need to propose a surrogate expression with which we can carry out approximate computations.
The variational Bayesian approximation (VBA) is a technique that approximates the joint distribution $p$ with an easier, for example separable, distribution $q$ by minimizing the Kullback–Leibler divergence $KL(q|p)$, which makes the marginal computations much easier. For example, in the case of two variables, $p(x,y)$ is approximated by $q(x,y) = q_1(x)\, q_2(y)$ via minimizing $KL(q_1 q_2|p)$. When $q$ is separable in all the variables of $p$, the approximation is also called the mean field approximation (MFA).
To obtain the approximate marginals $q_1$ and $q_2$, we have to minimize $KL(q_1 q_2|p)$. The first standard and general algorithm is the alternate optimization of $KL(q_1 q_2|p)$ with respect to $q_1$ and $q_2$. By finding the expressions of the functional derivatives of $KL(q_1 q_2|p)$ with respect to $q_1$ and $q_2$ and then equating them to zero alternately, we obtain an iterative optimization algorithm. A second general approach is its optimization in the Riemannian manifold. However, in this paper, for practical reasons, we consider the case where $p$ is in the exponential family and so are $q_1$ and $q_2$. For this case, $KL(q_1 q_2|p)$ becomes a function of the parameters $\theta$ of the exponential family. Then, we can use any other optimization algorithm to obtain those parameters.
In this paper, we compare three optimization algorithms: a standard alternate optimization (Algorithm 1), a gradient-based algorithm [1,2] (Algorithm 2) and a natural gradient algorithm [3,4,5] (Algorithm 3). The aim of this paper is to consider the first algorithm as the VBA method and compare it with the two other algorithms.
Among the main advantages of the VBA for inference problems, such as inverse problems and machine learning, we can mention the following:
- First, VBA builds a sufficient model according to the prior information and the final posterior distribution. In particular, in the mean field approximation (MFA), the result has an explicit form for each unknown component when conjugate priors are used, and it works well for small sample sizes [6,7,8].
- Second, in machine learning for example, it is a robust way to perform classification based on the predictive posterior distribution, and it reduces over-fitting of the parameters [7].
- Third, the target structure retains its uncertainty throughout the VBA recursive process. This feature prevents further error propagation and increases the robustness of VBA [9].
Besides all these advantages, the VBA has some weaknesses, such as the difficulty of solving the integrals and expectations needed to obtain the posterior distribution, and there is no guarantee of finding the exact posterior [6]. Its most significant drawback arises when there are strong dependencies between the unknown parameters, since the VBA ignores them. The estimates computed from this approximation may then be very far from the exact values. However, it works well when the number of dependencies is low [8].
In this article, we examine three different algorithms for estimating the unknown parameters of a model given prior information. The first iterative algorithm is a standard alternate optimization based on VBA, which begins at certain initial points. Sometimes, these points are estimated from an available dataset, but most of the time, we do not have enough data on the parameters to make such pre-estimations. To overcome this obstacle, we can start the algorithm from any desired points; then, by repeating the process, they approach the true values through the posterior distribution. The other two algorithms are the gradient-based and natural gradient algorithms, whose objective function is the Kullback–Leibler divergence. First, the gradient of the Kullback–Leibler divergence with respect to all the unknown parameters must be found; then the iterations start from points either estimated from the data or chosen by the user, and are repeated until convergence. If we denote the unknown parameter vector by $\theta$, the recursive formula is $\tilde{\theta}^{(k+1)} = \tilde{\theta}^{(k)} - \gamma\,\nabla KL(\tilde{\theta}^{(k)})$ for the gradient-based and natural gradient algorithms, with different values of $\gamma$.
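To make this recursion concrete, the following minimal sketch implements the generic update $\tilde{\theta}^{(k+1)} = \tilde{\theta}^{(k)} - \gamma\,\nabla KL(\tilde{\theta}^{(k)})$ in Python; the function name, the stopping rule and the default step size are our own choices, and for the natural gradient variant the supplied gradient would already have to be premultiplied by the appropriate metric (e.g., the inverse Fisher information).

```python
import numpy as np

def gradient_type_vba(grad_kl, theta0, gamma=1e-2, n_iter=1000, tol=1e-8):
    """Generic gradient-type iteration: theta <- theta - gamma * grad KL(theta).

    grad_kl : callable returning the gradient of KL(q_theta | p) at theta;
              for a natural gradient version, pass a callable that returns the
              gradient rescaled by the inverse Fisher information.
    theta0  : initial parameter vector (moment estimates or any desired point).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta_new = theta - gamma * grad_kl(theta)
        if np.linalg.norm(theta_new - theta) < tol:  # stop at an approximate fixed point
            return theta_new
        theta = theta_new
    return theta
```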
Additionally, we consider three examples, namely a normal-inverse-gamma model, a multivariate normal model and a linear inverse problem, to check the performance and convergence speed of the algorithms.
We propose the following organization of this paper: In Section 2, we present a brief explanation of the basic analytical aspects of VBA. In Section 3, we present our first example, related to the normal-inverse-gamma distribution, analytically, and in Section 4 we report the numerical outcomes of the three algorithms when estimating its unknown parameters. In Section 5, we study a more complex example of a multivariate normal distribution whose mean and variance–covariance matrix are unknown and have a normal-inverse-Wishart distribution. The aim of this section is to obtain the marginal distributions of $\tilde{\mu}$ and $\tilde{\Sigma}$ from a set of multivariate normal observations, using their sample mean and covariance. In Section 6, the example is closer to realistic situations and is a linear inverse problem. In Section 7, we present a summary of the work carried out in the article and compare the three recursive algorithms through the three examples.

2. Variational Bayesian Approach (VBA)

As mentioned previously, VBA uses the Kullback–Leibler divergence. The Kullback–Leibler divergence [10], $KL(q|p)$, is an information measure of the discrepancy between two probability functions, defined as follows. Let $p(x)$ and $q(x)$ be two density functions of a continuous random variable $x$ with support $S_X$. The $KL(q|p)$ function is defined as:
$$KL(q|p) = \int_{x \in S_X} q(x) \ln\frac{q(x)}{p(x)}\, dx. \qquad (1)$$
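As a quick numerical illustration of this definition (an example we add here, not part of the original text), the sketch below evaluates $KL(q|p)$ by one-dimensional quadrature for two arbitrary densities; the choice of $q$ as a standard normal and $p$ as a Student-t is purely illustrative.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# KL(q|p) = ∫ q(x) ln(q(x)/p(x)) dx, evaluated by numerical quadrature.
q = stats.norm(loc=0.0, scale=1.0)      # q(x): standard normal (example choice)
p = stats.t(df=3, loc=0.0, scale=1.2)   # p(x): heavier-tailed Student-t (example choice)

def integrand(x):
    qx = q.pdf(x)
    # the integrand vanishes where q(x) = 0
    return qx * (q.logpdf(x) - p.logpdf(x)) if qx > 0 else 0.0

kl_qp, _ = quad(integrand, -np.inf, np.inf)
print(f"KL(q|p) ≈ {kl_qp:.4f}")  # non-negative, and zero only if q = p almost everywhere
```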
For simplicity, we consider the bivariate case of a distribution $p(x,y)$ that we want to approximate via VBA; therefore, we have:
$$KL(q|p) = -H(q_1) - H(q_2) - \left\langle \ln p(x,y) \right\rangle_{q_1 q_2}, \qquad (2)$$
where
$$H(q_1) = -\int_{x \in S_X} q_1(x) \ln q_1(x)\, dx \quad \text{and} \quad H(q_2) = -\int_{y \in S_Y} q_2(y) \ln q_2(y)\, dy$$
are, respectively, the Shannon entropies of x and of y, and
$$\left\langle \ln p(x,y) \right\rangle_{q_1 q_2} = \int\!\!\int_{(x,y) \in S_{XY}} q_1(x)\, q_2(y) \ln p(x,y)\, dx\, dy.$$
Now, differentiating Equation (2) with respect to $q_1$ and then with respect to $q_2$, and equating the derivatives to zero, we obtain:
$$q_1(x) \propto \exp\left[\left\langle \ln p(x,y) \right\rangle_{q_2(y)}\right] \quad \text{and} \quad q_2(y) \propto \exp\left[\left\langle \ln p(x,y) \right\rangle_{q_1(x)}\right]. \qquad (3)$$
These results can easily be extended to more dimensions [11]. They do not have any closed form because they depend on the expression of $p(x,y)$ and those of $q_1$ and $q_2$. An interesting case is that of exponential families and conjugate priors, where writing
$$p(x,y) = p(x|y)\, p(y), \quad \text{and} \quad p(y|x) = \frac{p(x,y)}{p(x)} = \frac{p(x|y)\, p(y)}{p(x)},$$
we can consider $p(y)$ as the prior, $p(x|y)$ as the likelihood, and $p(y|x)$ as the posterior distribution. Then, if $p(y)$ is a conjugate prior for the likelihood $p(x|y)$, the posterior $p(y|x)$ will be in the same family as the prior $p(y)$. To illustrate all these properties, we provide the details of these expressions for a first simple example, the normal-inverse-gamma model $p(x,y) = N(x|\mu, y)\, IG(y|\alpha, \beta)$ with $q_1(x) = N(x|\mu, v)$ and $q_2(y) = IG(y|\alpha, \beta)$. For this simple case, we first give the expression of $KL(q|p)$ with $q_1(x) = N(x|\tilde{\mu}, \tilde{v})$ and $q_2(y) = IG(y|\tilde{\alpha}, \tilde{\beta})$ as a function of the parameters $\theta = (\tilde{\mu}, \tilde{v}, \tilde{\alpha}, \tilde{\beta})$, and then the expressions of the three above-mentioned algorithms; after which, we study their convergence.

3. Normal-Inverse-Gamma Distribution Example

The purpose of this section is to explain in detail the process of performing the calculations in VBA. For this, we consider a simple case for which we have all the necessary expressions. The objective here is to compare the three different algorithms mentioned above. Additionally, its practical application can be explained as follows:
We have a sensor which measures a quantity $X$, giving $N$ values $x_1, \ldots, x_N$. We want to model these data. In a first step, we model them as $N(x|\mu, v)$ with fixed $\mu$ and $v$. Then, it is easy to estimate the parameters $(\mu, v)$ either by maximum likelihood or by a Bayesian strategy. If we assume that the model is Gaussian with unknown variance, call this variance $y$ and assign an $IG$ prior to it, then we have an $NIG$ model for $p(x,y)$. $NIG$ priors have been applied in the wavelet context with correlated structures because they are conjugate with normal likelihoods [12]. We chose the normal-inverse-gamma distribution because of this conjugacy property and its ease of handling.
We will show that the marginals are $St$ and $IG$. Working directly with $St$ is difficult, so we want to approximate it with a Gaussian $q_1(x)$. This is equivalent to approximating $p(x,y)$ with $q_1(x)\, q_2(y)$. We then want to find the parameters $\mu$, $v$, $\alpha$ and $\beta$ which minimize $KL(q_1 q_2|p)$. This process is called VBA. Then, we want to compare three algorithms to obtain the parameters which minimize $KL(\cdot|\cdot)$. $KL(\cdot|\cdot)$ is convex with respect to $q_1$ if $q_2$ is fixed, and convex with respect to $q_2$ if $q_1$ is fixed, so we hope that the iterative algorithm converges. However, $KL(\cdot|\cdot)$ may not be convex in the space of parameters, so we have to study the shape of this criterion with respect to the parameters $\tilde{v}$, $\tilde{\alpha}$ and $\tilde{\beta}$.
The practical problem considered here is the following: a sensor delivers a few samples $x = \{x_1, x_2, \ldots, x_N\}$ of a physical quantity $X$. We want to find $p(x)$. For this, we assume a simple Gaussian model but with unknown variance $y$. Thus, the forward model can be written as $p(x,y) = N(x|\mu, y)\, IG(y|\alpha, \beta)$. In this simple example, we know that $p(x)$ is a Student's t-distribution obtained by:
$$St(x|\mu, \alpha, \beta) = \int N(x|\mu, y)\, IG(y|\alpha, \beta)\, dy.$$
Our objective is to find the three parameters $\theta = (\mu, \alpha, \beta)$ from the data $x$ and an approximate marginal $q(x)$ for $p(x)$.
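As a sanity check of this marginalization (an illustration we add here, not part of the original text), one can verify numerically that integrating $N(x|\mu, y)\, IG(y|\alpha, \beta)$ over $y$ reproduces a Student's t density with $2\alpha$ degrees of freedom, location $\mu$ and scale $\sqrt{\beta/\alpha}$; the parameter values below are only an example.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, alpha, beta = 0.0, 3.0, 1.0   # illustrative values

def marginal_pdf(x):
    # p(x) = ∫ N(x | mu, y) IG(y | alpha, beta) dy, integrated numerically over y
    f = lambda y: stats.norm.pdf(x, loc=mu, scale=np.sqrt(y)) * \
                  stats.invgamma.pdf(y, a=alpha, scale=beta)
    value, _ = quad(f, 0.0, np.inf)
    return value

# closed form: Student-t with 2*alpha degrees of freedom, location mu, scale sqrt(beta/alpha)
student = stats.t(df=2 * alpha, loc=mu, scale=np.sqrt(beta / alpha))

for x in (-2.0, 0.0, 1.5):
    print(x, marginal_pdf(x), student.pdf(x))   # the last two columns should agree
```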
The main idea is to find $q_1(x)\, q_2(y)$ as an approximation of $p(x,y)$. Here, we show the VBA step by step. For this, we start by choosing the conjugate families $q_1(x) = N(x|\tilde{\mu}, \tilde{v})$ and $q_2(y) = IG(y|\tilde{\alpha}, \tilde{\beta})$.
In the first step, we have to calculate $\ln p(x,y)$:
$$\ln p(x,y) = c - \frac{1}{2}\ln y - \frac{1}{2y}(x - \tilde{\mu})^2 - \left(\tilde{\alpha} + \frac{1}{2}\right)\ln y - \frac{\tilde{\beta}}{y},$$
where $c$ is a constant term independent of $x$ and $y$. First of all, to use the iterative algorithm given in (3), starting from $q_1 = N(x|\mu, v)$, we have to find $q_2(y)$. The expectation of $\ln p(x,y)$ is taken with respect to $q_1(x)$:
$$\left\langle \ln p(x,y) \right\rangle_{q_1} = c - \frac{1}{2y}\left\langle (x - \tilde{\mu})^2 \right\rangle_{q_1} - (\tilde{\alpha} + 1)\ln y - \frac{\tilde{\beta}}{y}.$$
If the mean of $x$ is the same in the prior and posterior distributions, i.e., the mean $\tilde{\mu}'$ of $q_1$ equals $\tilde{\mu}$, then $\left\langle (x - \tilde{\mu})^2 \right\rangle_{q_1} = \tilde{v}$; otherwise $\left\langle (x - \tilde{\mu})^2 \right\rangle_{q_1} = \tilde{v} + (\tilde{\mu}' - \tilde{\mu})^2$. Thus, when $\tilde{\mu}' = \tilde{\mu}$,
$$q_2(y) \propto \exp\left[-(\tilde{\alpha} + 1)\ln y - \left(\frac{\tilde{v}}{2} + \tilde{\beta}\right)\frac{1}{y}\right].$$
Thus, $q_2(y)$ is an inverse-gamma distribution $IG\!\left(\tilde{\alpha}, \frac{\tilde{v}}{2} + \tilde{\beta}\right)$. Similarly, $q_2(y)$ is $IG\!\left(\tilde{\alpha}, \frac{\tilde{v} + (\tilde{\mu}' - \tilde{\mu})^2}{2} + \tilde{\beta}\right)$ when $\tilde{\mu}' \neq \tilde{\mu}$. We then have to take the expectation of $\ln p(x,y)$ over $q_2$ to find $q_1$:
$$\left\langle \ln p(x,y) \right\rangle_{q_2} = c - (\tilde{\alpha} + 1)\left\langle \ln y \right\rangle_{q_2} - \left(\tilde{\beta} + \frac{1}{2}(x - \tilde{\mu})^2\right)\left\langle \frac{1}{y} \right\rangle_{q_2}.$$
Note that the first term does not depend on $x$ and $\left\langle \frac{1}{y} \right\rangle_{q_2} = \frac{2\tilde{\alpha}}{2\tilde{\beta} + \tilde{v}}$, so
$$q_1(x) \propto \exp\left[-\frac{2\tilde{\alpha}}{2\tilde{\beta} + \tilde{v}}\left(\tilde{\beta} + \frac{1}{2}(x - \tilde{\mu})^2\right)\right] \propto \exp\left[-\frac{(x - \tilde{\mu})^2}{2\,\dfrac{2\tilde{\beta} + \tilde{v}}{2\tilde{\alpha}}}\right].$$
We see that $q_1$ is again a normal distribution but with updated parameters, $N\!\left(\tilde{\mu}, \frac{2\tilde{\beta} + \tilde{v}}{2\tilde{\alpha}}\right)$, so the updated variance is $\tilde{v}' = \frac{2\tilde{\beta} + \tilde{v}}{2\tilde{\alpha}}$. Note that we obtained the conjugacy property: if $p(x|y) = N(x|\mu, y)$ and $p(y) = IG(y|\alpha, \beta)$, then $p(y|x) = IG(y|\alpha', \beta')$, where $\mu' = \mu$, $\alpha' = \alpha$ and $\beta' = \beta + \frac{2\beta + v}{2\alpha}$. In this case, we also know that $p(x|\alpha, \beta) = St(x|\mu, \alpha, \beta)$.
In the standard alternate optimization based on VBA (Algorithm 1), there is no need for an iterative process for $\tilde{\mu}$ and $\tilde{\alpha}$, which are set to $\tilde{\mu} = \mu_0$ and $\tilde{\alpha} = \alpha_0$, respectively. The situation for $\tilde{\beta}$ and $\tilde{v}$ is different because there are circular dependencies between them. So, the approximation needs an iterative process, starting from $\tilde{\mu}^{(1)} = \mu_0$, $\tilde{v}^{(1)} = v_0$, $\tilde{\alpha}^{(1)} = \alpha_0$ and $\tilde{\beta}^{(1)} = \beta_0$. The conclusion for this case is that the values of $\tilde{\alpha}$ and $\tilde{\mu}$ do not change during the iterations and thus depend only on the initial values, whereas the values of $\tilde{\beta}$ and $\tilde{v}$ are interdependent and change during the iterations. This algorithm is summarized below.
Alternate optimization algorithm based on VBA (Algorithm 1):
$$\tilde{\alpha}^{(k+1)} = \tilde{\alpha}^{(k)}, \quad \tilde{\beta}^{(k+1)} = \tilde{\beta}^{(k)} + \frac{\tilde{v}^{(k)}}{2}, \quad \tilde{\mu}^{(k+1)} = \tilde{\mu}^{(k)}, \quad \tilde{v}^{(k+1)} = \frac{2\tilde{\beta}^{(k)} + \tilde{v}^{(k)}}{2\tilde{\alpha}^{(k)}}.$$
This algorithm converges to $\tilde{v} = \frac{2\tilde{\beta} + \tilde{v}}{2\tilde{\alpha}}$, which gives $\tilde{v} = \frac{2\tilde{\beta}}{2\tilde{\alpha} - 1}$, and $\tilde{\beta} = 0$, so $\tilde{v} = 0$, which is a degenerate solution.
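The sketch below is a direct transcription of these Algorithm 1 updates into Python (the function name, the fixed iteration count and the returned history are our own choices):

```python
def vba_alternate(alpha0, beta0, mu0, v0, n_iter=50):
    """Algorithm 1: alternate VBA updates for the normal-inverse-gamma example.

    alpha and mu stay at their initial values; beta and v are coupled and
    change jointly at every iteration, as in the update equations above.
    """
    alpha, beta, mu, v = alpha0, beta0, mu0, v0
    history = [(alpha, beta, mu, v)]
    for _ in range(n_iter):
        beta_new = beta + v / 2.0                   # beta^(k+1) = beta^(k) + v^(k)/2
        v_new = (2.0 * beta + v) / (2.0 * alpha)    # v^(k+1) = (2 beta^(k) + v^(k)) / (2 alpha^(k))
        beta, v = beta_new, v_new                   # alpha and mu are left unchanged
        history.append((alpha, beta, mu, v))
    return history
```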
The two other algorithms, gradient-based and natural gradient, require finding the expression of $KL(q_1 q_2|p)$ as a function of the parameters $\tilde{\theta} = (\tilde{\alpha}, \tilde{\beta}, \tilde{\mu}, \tilde{v})$:
$$KL(\tilde{\theta}) = 2\ln\Gamma\!\left(\tilde{\alpha} - \tfrac{1}{2}\right) - \left(2\tilde{\alpha} + \tfrac{3}{2}\right)\psi_0\!\left(\tilde{\alpha} - \tfrac{1}{2}\right) + \frac{(2\tilde{\alpha} - 1)(\tilde{v} + 2\tilde{\beta})}{4\tilde{\beta}} + \tilde{\alpha} + \tfrac{5}{2}\ln\tilde{\beta} - \tfrac{1}{2}\ln\tilde{v} - 1.$$
We also need the expression of the gradient of $KL(\tilde{\theta})$ with respect to $\tilde{\theta}$:
$$\nabla KL(\tilde{\theta}) = \left[\; \frac{\tilde{v}}{2\tilde{\beta}} - \left(2\tilde{\alpha} + \tfrac{3}{2}\right)\psi_1\!\left(\tilde{\alpha} - \tfrac{1}{2}\right) + 2, \;\; -\frac{(2\tilde{\alpha} - 1)\tilde{v}}{4\tilde{\beta}^2} + \frac{5}{2\tilde{\beta}}, \;\; 0, \;\; \frac{2\tilde{\alpha} - 1}{4\tilde{\beta}} - \frac{1}{2\tilde{v}} \;\right].$$
As we can see, these expressions do not depend on $\tilde{\mu}$, so their derivatives with respect to $\tilde{\mu}$ are zero. The means are preserved.
Gradient and natural gradient algorithms (Algorithms 2 and 3):
$$\begin{aligned} \tilde{\alpha}^{(k+1)} &= \tilde{\alpha}^{(k)} - \gamma\left[\frac{\tilde{v}^{(k)}}{2\tilde{\beta}^{(k)}} - \left(2\tilde{\alpha}^{(k)} + \tfrac{3}{2}\right)\psi_1\!\left(\tilde{\alpha}^{(k)} - \tfrac{1}{2}\right) + 2\right], \\ \tilde{\beta}^{(k+1)} &= \tilde{\beta}^{(k)} - \gamma\left[-\frac{(2\tilde{\alpha}^{(k)} - 1)\tilde{v}^{(k)}}{4\,[\tilde{\beta}^{(k)}]^2} + \frac{5}{2\tilde{\beta}^{(k)}}\right], \\ \tilde{\mu}^{(k+1)} &= \tilde{\mu}^{(k)}, \\ \tilde{v}^{(k+1)} &= \tilde{v}^{(k)} - \gamma\left[\frac{2\tilde{\alpha}^{(k)} - 1}{4\tilde{\beta}^{(k)}} - \frac{1}{2\tilde{v}^{(k)}}\right]. \end{aligned}$$
Here, $\gamma$ is fixed for the gradient algorithm and is proportional to $1/KL$ for the natural gradient algorithm.
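A minimal Python sketch of this gradient iteration for the normal-inverse-gamma example is given below; it simply transcribes the gradient expression above (with the trigamma function $\psi_1$ from SciPy), while the starting point, step size and iteration count are arbitrary choices and may need tuning for stability.

```python
import numpy as np
from scipy.special import polygamma

def kl_gradient(theta):
    """Gradient of KL(theta) for the normal-inverse-gamma example,
    with theta = (alpha, beta, mu, v); transcribed from the expression above
    (the mu component is identically zero, so the mean is preserved)."""
    alpha, beta, mu, v = theta
    g_alpha = v / (2 * beta) - (2 * alpha + 1.5) * polygamma(1, alpha - 0.5) + 2
    g_beta = -(2 * alpha - 1) * v / (4 * beta**2) + 5 / (2 * beta)
    g_v = (2 * alpha - 1) / (4 * beta) - 1 / (2 * v)
    return np.array([g_alpha, g_beta, 0.0, g_v])

def gradient_algorithm(theta0, gamma=1e-3, n_iter=500):
    """Algorithm 2 (plain gradient descent on KL); Algorithm 3 would instead
    rescale the step at each iteration rather than keeping gamma fixed."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - gamma * kl_gradient(theta)
    return theta

# illustrative starting point (alpha, beta, mu, v); the values are arbitrary
theta_hat = gradient_algorithm([3.0, 1.0, 0.0, 1.0])
```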

4. Numerical Experimentations

To show the relative performances of these algorithms, we generate $n = 100$ samples from the model $p(x,y) = N(x|1, y)\, IG(y|3, 1)$ for the numerical computations. It should be noted that we therefore know the exact values of the unknown parameters, which can be used to assess the performances of the proposed algorithms. The following results $\tilde{\theta}_1$, $\tilde{\theta}_2$ and $\tilde{\theta}_3$ are the parameters estimated using Algorithms 1, 2 and 3, respectively. The contour plots of the corresponding probability density functions are shown in Figure 1, compared with the original model.
$$\begin{aligned} \tilde{\theta}_1 &= (\tilde{\mu} = 0.044724,\; \tilde{v} = 0.590894,\; \tilde{\alpha} = 3.792337,\; \tilde{\beta} = 1.765735), \\ \tilde{\theta}_2 &= (\tilde{\mu} = 0.044724,\; \tilde{v} = 7.910504,\; \tilde{\alpha} = 4.991621,\; \tilde{\beta} = 3.002594), \\ \tilde{\theta}_3 &= (\tilde{\mu} = 0.044724,\; \tilde{v} = 0.423761,\; \tilde{\alpha} = 4.415239,\; \tilde{\beta} = 0.706049). \end{aligned}$$
All three algorithms minimize the same criterion, so the objective is the same, but the number of steps needed to reach the minimum of $KL(\cdot)$ may differ. In this simple normal-inverse-gamma example, the numbers of iterations until convergence for the VBA, gradient-based and natural gradient algorithms are 1, 2 and 1, respectively, using moment-based initializations. The overall performance of the standard alternate optimization (VBA) is the most precise; the poorest estimation comes from the gradient-based algorithm. So, all the algorithms are able to approximate the joint density function with a separable one, but with different accuracies. In the following section, we tackle a more complex model.
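For completeness, a minimal sketch of how such an experiment can be set up is given below; the sampling routine, the random seed and the moment-based initialization are our own illustrative choices (the exact dataset behind the figures is not reproduced here).

```python
import numpy as np
from scipy import stats

def simulate_nig(mu, alpha, beta, n, seed=0):
    """Draw n pairs (x, y) from p(x, y) = N(x | mu, y) IG(y | alpha, beta)."""
    rng = np.random.default_rng(seed)
    y = stats.invgamma(a=alpha, scale=beta).rvs(size=n, random_state=rng)
    x = rng.normal(loc=mu, scale=np.sqrt(y))
    return x, y

# n = 100 samples from the model used in this section
x, y = simulate_nig(mu=1.0, alpha=3.0, beta=1.0, n=100)

# moment-based initialization: start the iterations from simple empirical
# estimates of mu and v (one possible choice of initializer)
mu0, v0 = x.mean(), x.var(ddof=1)
```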

5. Multivariate Normal-Inverse-Wishart Example

In the previous section, we explained how to perform VBA in order to approximate a complicated joint distribution function by a tractable product of marginals, using a simple case study. In this section, a multivariate normal case $p(x) = N(x|\tilde{\mu}, \tilde{\Sigma})$ is considered, which is approximated by $q(x) = \prod_i N(x_i|\tilde{\mu}_i, \tilde{v}_i)$ for different shapes of the covariance matrix $\tilde{\Sigma}$.
We assume that the underlying structure of an available dataset is multivariate normal with unknown mean vector $\tilde{\mu}$ and variance–covariance matrix $\tilde{\Sigma}$. Their joint prior distribution is a normal-inverse-Wishart distribution $NIW(\tilde{\mu}, \tilde{\Sigma} | \tilde{\mu}_0, \tilde{\kappa}, \tilde{\Psi}, \tilde{\nu})$, which is the generalized form of the classical $NIG$. The posterior is multivariate normal for the mean vector and inverse-Wishart for the variance–covariance matrix. Since the normal-inverse-Wishart distribution is a conjugate prior for the multivariate normal, the posterior distribution of $\tilde{\mu}$ and $\tilde{\Sigma}$ again belongs to the same family, and the corresponding margins are
$$MN\!\left(\frac{\tilde{\kappa}\tilde{\mu}_0 + n\bar{x}}{\tilde{\kappa} + n},\; \frac{1}{\tilde{\kappa} + n}\tilde{\Lambda}\right), \qquad IW\!\left(\tilde{\Lambda} + \tilde{\Psi} + \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T,\; \tilde{\nu} + n\right),$$
where $n$ is the sample size. To assess the performance of the three algorithms, we examined a dataset generated from $x \sim MN(x|\mu, \Sigma)$, whose parameters have the following low-density structure:
$$\mu \sim MN\!\left(\mu \,\middle|\, \begin{bmatrix} 2 \\ 1 \end{bmatrix},\; \frac{1}{2}\begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}\right), \qquad \Sigma \sim IW\!\left(\Sigma \,\middle|\, \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix},\; 6\right).$$
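A minimal sketch of how such a dataset can be generated (the sample size, random seed and helper names are our own choices), together with the empirical quantities that enter the posterior margins above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw (mu, Sigma) from the prior stated above, then draw multivariate normal
# observations given (mu, Sigma); only the data x are used in the estimation.
mu = rng.multivariate_normal(mean=[2.0, 1.0],
                             cov=0.5 * np.array([[3.0, 1.0], [1.0, 1.0]]))
Sigma = stats.invwishart(df=6, scale=np.array([[3.0, 1.0], [1.0, 1.0]])).rvs(random_state=rng)
x = rng.multivariate_normal(mu, Sigma, size=100)   # n = 100 is our own choice

# empirical quantities entering the posterior margins given above
n, xbar = x.shape[0], x.mean(axis=0)
scatter = (x - xbar).T @ (x - xbar)                # sum_i (x_i - xbar)(x_i - xbar)^T
```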
We used only the data $x$ in the estimation process. The results of the algorithms are presented in Figure 2, along with the true contour plot of the model. The VBA estimate is the most separable distribution, compared with the gradient and natural gradient methods. The next best case is the natural gradient algorithm, whose weakness is that it transfers the dependency slightly into the approximation. The gradient-based algorithm retains the dependency completely, showing its inability to obtain a separable model.

6. Simple Linear Inverse Problem

Finally, the third example is the case of linear inverse problems, $g = Hf + \epsilon$, with priors $p(\epsilon) = N(\epsilon|0, v_\epsilon I)$, $p(f) = N(f|0, \mathrm{diag}[v])$ and $p(v|\alpha, \beta) = \prod_j IG(v_j|\alpha, \beta)$, where $f = [f_1, f_2, \ldots, f_N]$ and $v = [v_{f_1}, v_{f_2}, \ldots, v_{f_N}]$. Using these priors, we get $p(f, v|g) \propto p(g|f, v_\epsilon)\, p(f|v)\, p(v)$ with $p(g|f, v_\epsilon) = N(g|Hf, v_\epsilon I)$. See [13] for details.
Thus, the joint distribution of $g$, $f$ and $v$,
$$p(g, f, v) \propto p(g|f, v_\epsilon)\, p(f|v)\, p(v),$$
is approximated by $q(g, \tilde{f}, \tilde{v}) = q_1(g|\tilde{f})\, q_2(\tilde{f})\, q_3(\tilde{v})$ using the VBA method. Even if the main interest is the estimation of $q_2(\tilde{f})$, in the recursive process $q_1(g|\tilde{f})$ and $q_3(\tilde{v})$ are also updated. For simplicity, we suppose that the transfer matrix $H$ is the identity matrix $I$. The final outputs are as follows:
$$\tilde{f} \sim MN\!\left(\frac{\tilde{\mu}_{\tilde{f}}}{1 + \dfrac{2\tilde{v}_\epsilon\tilde{\alpha}}{n(\tilde{v} + \tilde{\mu}_{\tilde{f}}^2 + 2\tilde{\beta})}},\; \mathrm{diag}\!\left[\frac{\tilde{v}_\epsilon(\tilde{v} + \tilde{\mu}_{\tilde{f}}^2 + 2\tilde{\beta})}{n(\tilde{v} + \tilde{\mu}_{\tilde{f}}^2) + 2n\tilde{\beta} + 2\tilde{v}_\epsilon\tilde{\alpha}}\right]\right), \quad g \sim N(\tilde{\mu}_{\tilde{f}}, \tilde{v}_\epsilon I), \quad \text{and} \quad \tilde{v}_j \sim IG\!\left(\tilde{\alpha}_j,\; \frac{\tilde{v}_j + \tilde{\mu}_{\tilde{f}_j}^2}{2} + \tilde{\beta}_j\right), \quad j = 1, \ldots, N.$$
For the simulations, we chose a model to examine the performance of these margins and to compare them with the gradient and natural gradient algorithms. The selected model is $g = Hf + \epsilon$ with the following assumptions:
$$H = I, \quad f \sim MN(f|0, \mathrm{diag}[v_1, v_2]), \quad v_1 \sim IG(v_1|3, 2), \quad v_2 \sim IG(v_2|4, 3), \quad \epsilon \sim MN(\epsilon|0, I).$$
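A minimal sketch of one way to simulate data from this model is given below; whether $f$ is drawn once or redrawn per observation is not specified above, so drawing it once and repeating the noisy observation is our own assumption, as are the sample size and the seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulate g = H f + eps with H = I and the priors stated above.
n_rep = 100                                    # number of noisy observations (our choice)
v1 = stats.invgamma(a=3, scale=2).rvs(random_state=rng)
v2 = stats.invgamma(a=4, scale=3).rvs(random_state=rng)
f = rng.multivariate_normal([0.0, 0.0], np.diag([v1, v2]))     # one realization of f
eps = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n_rep)
g = f + eps                                    # with H = I, the data are simply f + noise

# only g is passed to the three estimation algorithms
```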
In the assessment procedure, we do not give this information to the algorithms; only the data $g$ are used. The outputs of the algorithms are shown in Figure 3, together with the actual contour plot. In this example, the best result is obtained by the natural gradient algorithm. The VBA approximation is separable by construction and therefore cannot coincide with the original.

7. Conclusions

This paper presented VBA, an approximation method for the unknown density functions of hidden variables, and compared it with gradient and natural gradient algorithms. We considered three examples: normal-inverse-gamma, normal-inverse-Wishart and a linear inverse problem. We provided full details for the first model and presented the two other examples more briefly throughout the paper. In all three models, the parameters are unknown and need to be estimated by recursive algorithms. We approximated the complex joint distribution with a simpler product of marginal factors, treating the variables as independent. The VBA and natural gradient algorithms converged fairly quickly. The major discrepancy between the algorithms lies in the accuracy of their results when estimating the intricate joint distribution with separable ones. Overall, the VBA demonstrated the best performance here.

Author Contributions

Both authors contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Acerbi, L. Variational Bayesian Monte Carlo. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, QC, Canada, 3–8 December 2018; pp. 8223–8233.
  2. Kuusela, M.; Raiko, T.; Honkela, A.; Karhunen, J. A gradient-based algorithm competitive with variational Bayesian EM for mixture of Gaussians. In Proceedings of the 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 1688–1695.
  3. Gharsalli, L.; Duchêne, B.; Mohammad-Djafari, A.; Ayasso, H. A gradient-like variational Bayesian approach: Application to microwave imaging for breast tumor detection. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 1708–1712.
  4. Zhang, G.; Sun, S.; Duvenaud, D.; Grosse, R. Noisy natural gradient as variational inference. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5852–5861.
  5. Lin, W.; Khan, M.E.; Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3992–4002.
  6. Ghahramani, Z.; Beal, M. Variational inference for Bayesian mixtures of factor analysers. Adv. Neural Inf. Process. Syst. 1999, 12, 449–455.
  7. Watanabe, S.; Minami, Y.; Nakamura, A.; Ueda, N. Variational Bayesian estimation and clustering for speech recognition. IEEE Trans. Speech Audio Process. 2004, 12, 365–381.
  8. Montesinos-López, O.A.; Montesinos-López, A.; Crossa, J.; Montesinos-López, J.C.; Luna-Vázquez, F.J.; Salinas-Ruiz, J.; Herrera-Morales, J.R.; Buenrostro-Mariscal, R. A variational Bayes genomic-enabled prediction model with genotype × environment interaction. G3 Genes Genomes Genet. 2017, 7, 1833–1853.
  9. Babacan, S.D.; Molina, R.; Katsaggelos, A.K. Variational Bayesian super resolution. IEEE Trans. Image Process. 2010, 20, 984–999.
  10. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  11. Fox, C.W.; Roberts, S.J. A tutorial on variational Bayesian inference. Artif. Intell. Rev. 2012, 38, 85–95.
  12. De Canditiis, D.; Vidakovic, B. Wavelet Bayesian block shrinkage via mixtures of normal-inverse gamma priors. J. Comput. Graph. Stat. 2004, 13, 383–398.
  13. Ayasso, H.; Mohammad-Djafari, A. Joint image restoration and segmentation using Gauss-Markov-Potts prior models and variational Bayesian computation. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 1297–1300.
Figure 1. The true model is $N(x|0, y)\, IG(y|3, 1)$. The numbers of iterations until convergence are different between the algorithms.
Figure 2. In this example, the numbers of iterations to obtain convergence are different between the three algorithms. These numbers are, respectively, 2, 280 and 4. The VBA and natural gradient algorithms estimate the distribution with separable functions in fewer steps than the gradient algorithm, which seems not to have converged even after 280 iterations.
Figure 3. The true model is almost separable. The natural gradient algorithm works better in this example, and the poorest approximation is that of the gradient-based algorithm.