Article

Approximately Optimal Control of Nonlinear Dynamic Stochastic Problems with Learning: The OPTCON Algorithm

by Dmitri Blueschke *, Viktoria Blueschke-Nikolaeva and Reinhard Neck
Department of Economics, University of Klagenfurt, 9020 Klagenfurt, Austria
* Author to whom correspondence should be addressed.
Submission received: 31 March 2021 / Revised: 31 May 2021 / Accepted: 4 June 2021 / Published: 8 June 2021
(This article belongs to the Special Issue Stochastic Algorithms and Their Applications)

Abstract: OPTCON is an algorithm for the optimal control of nonlinear stochastic systems which is particularly applicable to economic models. It delivers approximate numerical solutions to optimum control (dynamic optimization) problems with a quadratic objective function for nonlinear economic models with additive and multiplicative (parameter) uncertainties. The algorithm was first programmed in C# and then in MATLAB. It allows for deterministic and stochastic control, the latter with open loop (OPTCON1), passive learning (open-loop feedback, OPTCON2), and active learning (closed-loop, dual, or adaptive control, OPTCON3) information patterns. The mathematical aspects of the algorithm with open-loop feedback and closed-loop information patterns are presented in more detail in this paper.

1. Introduction

Optimal control of stochastic processes is a topic which occurs in many contexts of applied mathematics such as engineering, biology, chemistry, economics, and management science. These techniques are used to steer the behavior of a plant, natural, human or social system over a certain (finite or infinite) time period where the system is driven by differential or difference equations whose time path can be influenced by an external controller. The controller (decision maker, policy maker) aims at achieving the best trajectory of his/her controls according to some criterion (performance measure, objective function) over that time horizon.
Unfortunately, only a few special cases can be solved explicitly, in particular the linear-quadratic-Gaussian problem, where the dynamic system is linear, the objective function is quadratic, and the stochastics are confined to additive normally distributed noise affecting the system dynamics (e.g., [1]). In this case, it is possible to separate the estimation of the state and the optimization of its behavior due to the property of certainty equivalence (the separation theorem). For most practical applications, however, especially those where the parameters of the system are not known precisely, an exact solution to the optimal stochastic control problem is not available and some kind of approximation is required. One of the reasons is the curse of dimensionality, which prevents the solving of dynamic functional equations for multivariable systems [2,3]. This is particularly complicated by the so-called dual effect of controls preventing the separation of estimation and control in adaptive processes; it was discovered by [4] that “the control must have a reciprocal, ‘dual’ character: to a certain extent they must be of a probing nature, but also controlling to a known degree” ([4], p. 31). The dynamic functional equations for general problems of this kind are known (see, e.g., [5]) but cannot be solved except in very special cases; hence approximations are required (see [6,7,8]).
In this paper, we present the OPTCON algorithm for calculating numerically (approximately) optimal control solutions to nonlinear dynamic stochastic optimization problems without and with learning. The algorithm was developed in three stages (OPTCON1, OPTCON2, and OPTCON3) over a long period. OPTCON1 determines optimal controls for deterministic problems and stochastic problems based on the open-loop information pattern [9]. OPTCON2 solves problems with passive learning about the unknown parameters of the system model [10], so-called open-loop feedback policies (see [11] for the terminology). Finally, OPTCON3 also includes active learning (dual, adaptive control) according to an approximation procedure initiated by [12,13,14], and adapted to linear economic systems by [15,16], cf. [17]. The present paper provides mathematical details for OPTCON3, which is the most sophisticated version of the OPTCON algorithm; in [18] we concentrate on computational aspects and applications of OPTCON3.
The OPTCON algorithm is applicable for stochastic control problems with the following properties: The model of the process is multivariable, formulated in discrete time, and described by a system of nonlinear difference equations with known functional form but additive noise and (possibly) unknown parameters. The state is stochastically observable and controllable. No inequality constraints on states or controls are given. Open-loop, open-loop feedback, and closed-loop information structures can be considered. The objective function is quadratic and perfectly known; it is formulated in tracking form but can easily be transformed to the general quadratic form. Quadratic functions can be interpreted as second-order Taylor approximations to more general functional forms. There is only one decision maker; for decentralized problems, see the OPTGAME algorithm [19].
The paper has the following structure. In Section 2 the relevant basic background is provided, namely the class of problems to be solved by the algorithm as well as some basic information about the linear-quadratic framework and nonlinear system solving algorithms. Section 3 presents the OPTCON algorithm stepwise, starting with the basic version (OPTCON1) and then introducing the more advanced versions OPTCON2 and OPTCON3 for handling stochastic components in the optimal control process. In Section 4 some details on computational time and accuracy of the OPTCON algorithm are presented. Section 5 concludes.

2. Theoretical Background

2.1. Optimal Control Problem

We consider optimal control problems with a quadratic objective function and a nonlinear multivariate discrete-time dynamic system under additive and parameter uncertainties. The basis for an optimal control problem is a deterministic dynamic system to be controlled (plant, firm, economy, etc.) in discrete time in the form:
$x_t = f(x_{t-1}, x_t, u_t, z_t), \quad t = 1, \ldots, T, \qquad (1)$
where:
- $x_t \in \mathbb{R}^n$ is a vector of state variables that describes the state of the system at any point in time t,
- $u_t \in \mathbb{R}^m$ is a vector of control variables; we assume that the decision maker determines the values of the control variables exactly (without randomization) according to the approximately optimal solution of the problem,
- $z_t \in \mathbb{R}^l$ denotes a vector of non-controlled deterministic exogenous variables, whose values are known to the decision maker at time t,
- T denotes the terminal time period of the finite planning horizon.
The function f is assumed to be twice continuously differentiable with respect to all of its arguments.
We assume that there is some true law of motion given by Equation (1) in the background which is at least partially unknown to the policy maker while the function f is known to him/her. The policy maker faces two sources of uncertainty, additive and parameter uncertainties where:
- $\theta_t \in \mathbb{R}^p$ denotes a vector of stochastic system parameters (parameter uncertainty),
- $\bar{\theta} \in \mathbb{R}^p$ is a vector of true parameters whose values are assumed to be constant but unknown to the policy maker,
- $\varepsilon_t \in \mathbb{R}^n$ is a vector of additive stochastic disturbances (system error).

$\theta_t$ and $\varepsilon_t$ are assumed to be independent Gaussian random vectors with expectations $\hat{\theta}$ and $O_n$ respectively and covariance matrices $\Sigma^{\theta\theta}$ and $\Sigma^{\varepsilon\varepsilon}$ respectively.
Including uncertainty, the system (1) can be written as:
$x_t = f(x_{t-1}, x_t, u_t, \theta_t, z_t) + \varepsilon_t, \quad t = 1, \ldots, T. \qquad (2)$
The optimal control problem to be solved approximately assumes that there is a modeler who wishes to control the system (2). This means that the controller wishes to bring the state variables, using the control variables, as close as possible to some pre-defined desired time path, according to a given optimality criterion. The OPTCON algorithm allows for the optimal control of the system (2) using a quadratic objective function. To this end the modeler needs to define the following variables:
- $\tilde{x}_t \in \mathbb{R}^n$ are given target ('ideal') values (for $t = 1, \ldots, T$) of the state variables,
- $\tilde{u}_t \in \mathbb{R}^m$ are given target ('ideal') values (for $t = 1, \ldots, T$) of the control variables,
- $W_t$ is an $((n+m) \times (n+m))$ symmetric matrix defining the relative weights of the state and control variables in the objective function. In a frequent special case, $W_t$ includes a discount factor $\alpha$ with $W_t = \alpha^{t-1} W$.
The resulting intertemporal objective function is given in quadratic tracking form:
$J = E\left[\sum_{t=1}^{T} L_t(x_t, u_t)\right], \qquad (3)$
with
$L_t(x_t, u_t) = \frac{1}{2}\begin{pmatrix} x_t - \tilde{x}_t \\ u_t - \tilde{u}_t \end{pmatrix}' W_t \begin{pmatrix} x_t - \tilde{x}_t \\ u_t - \tilde{u}_t \end{pmatrix}. \qquad (4)$
The OPTCON algorithm allows us to find approximate optimal control solutions which minimize the functions (3) and (4) and satisfy the system dynamics (2).
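To make the tracking form concrete, the following minimal Python sketch evaluates the objective (3)-(4) along given state and control paths. It is not part of the OPTCON code (which is written in MATLAB); the function name and the 0-based indexing convention are our illustrative assumptions.

```python
import numpy as np

def tracking_objective(x, u, x_tilde, u_tilde, W, alpha=1.0):
    """Evaluate J = sum_t L_t for given paths (Eqs. (3)-(4)).

    x, u:             arrays of shape (T, n) and (T, m) with realized paths
    x_tilde, u_tilde: target ('ideal') paths of the same shapes
    W:                symmetric ((n+m) x (n+m)) weight matrix
    alpha:            discount factor, W_t = alpha**(t-1) * W
    """
    T = x.shape[0]
    J = 0.0
    for t in range(T):
        v = np.concatenate([x[t] - x_tilde[t], u[t] - u_tilde[t]])
        # alpha**t here equals alpha**(t-1) in the paper's 1-based indexing
        J += 0.5 * alpha**t * v @ W @ v
    return J
```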

2.2. Linear-Quadratic Optimal Control (LQ) Framework

The novel feature of the OPTCON algorithm is the combination of a nonlinear dynamic system and the above-mentioned stochastics when solving optimal control problems. Before we can discuss this technique in detail, a few words should be said about the underlying linear-quadratic optimal control framework. The dynamic system is given in linear form as
$x_t = A x_{t-1} + B u_t + c_t. \qquad (5)$
The corresponding optimal control objective function can be written in the same form as in (3) and (4). Using well-known LQ optimization techniques, the rules for defining control variables can be found iteratively backwards in time using dynamic Riccati equations. We start from the familiar LQ framework (e.g., [20]), which was adapted to economic models by [15,21,22]. The same standard technique is applied in the OPTCON algorithm whenever an optimal control for a linearized system should be obtained.
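To illustrate the standard backward recursion referred to here, the sketch below computes linear feedback rules $u_t = G_t x_{t-1} + g_t$ for the system (5) under a general quadratic objective. It is a minimal sketch under our own simplifying assumptions (no cross term $W^{xu}$, no discounting, time-invariant A and B) rather than the OPTCON implementation.

```python
import numpy as np

def lq_feedback(A, B, c, Wxx, Wuu, wx, wu, T):
    """Backward Riccati recursion for x_t = A x_{t-1} + B u_t + c_t.

    Returns feedback rules u_t = G[t] @ x_{t-1} + g[t]; wx, wu are
    per-period linear cost terms, c is the list of intercepts c_t.
    """
    n, m = B.shape
    H, h = np.zeros((n, n)), np.zeros(n)       # terminal condition
    G, g = [None] * T, [None] * T
    for t in reversed(range(T)):
        K = Wxx + H                            # K_t = W^{xx} + H_{t+1}
        k = wx[t] + h                          # linear term k_t
        Luu = B.T @ K @ B + Wuu                # Lambda^{uu}
        G[t] = -np.linalg.solve(Luu, B.T @ K @ A)
        g[t] = -np.linalg.solve(Luu, B.T @ (K @ c[t] + k) + wu[t])
        # value-function update (envelope theorem)
        H = A.T @ K @ (A + B @ G[t])
        h = A.T @ (K @ (B @ g[t] + c[t]) + k)
    return G, g
```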
Furthermore, for computational reasons it is useful to transform the objective function from the quadratic tracking form as used in Equation (4) into the general quadratic form:
$L_t(x_t, u_t) = \frac{1}{2}\begin{pmatrix} x_t \\ u_t \end{pmatrix}' W_t \begin{pmatrix} x_t \\ u_t \end{pmatrix} + \begin{pmatrix} x_t \\ u_t \end{pmatrix}'\begin{pmatrix} w_t^x \\ w_t^u \end{pmatrix} + w_t^c, \qquad (6)$
where
$\begin{pmatrix} w_t^x \\ w_t^u \end{pmatrix} = -W_t \begin{pmatrix} \tilde{x}_t \\ \tilde{u}_t \end{pmatrix} \quad \text{and} \quad w_t^c = \frac{1}{2}\begin{pmatrix} \tilde{x}_t \\ \tilde{u}_t \end{pmatrix}' W_t \begin{pmatrix} \tilde{x}_t \\ \tilde{u}_t \end{pmatrix}. \qquad (7)$
The equivalence between the quadratic tracking form and the general quadratic form is shown, for instance, in [23].
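The conversion in (7) is a one-line computation; the following sketch (our own helper, assuming the partitioning of $W_t$ used above) performs it:

```python
import numpy as np

def tracking_to_general(W_t, x_tilde_t, u_tilde_t):
    """Convert the tracking form (4) into the general quadratic form (6)-(7).

    Expanding (1/2)(v - v_tilde)' W (v - v_tilde) with v = (x', u')' gives
    the linear term w_t = -W_t v_tilde and the constant (1/2) v_tilde' W_t v_tilde.
    """
    v_tilde = np.concatenate([x_tilde_t, u_tilde_t])
    w_t = -W_t @ v_tilde                  # stacked (w_t^x', w_t^u')'
    w_c = 0.5 * v_tilde @ W_t @ v_tilde   # scalar constant w_t^c
    n = len(x_tilde_t)
    return w_t[:n], w_t[n:], w_c          # w_t^x, w_t^u, w_t^c
```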

2.3. Solving Nonlinear Dynamic Systems

The OPTCON algorithm allows us to find approximately optimal control solutions for nonlinear stochastic systems. In the solution process, the nonlinear dynamic system (2) has to be solved a large number of times. For this reason the choice of an appropriate solver is important to guarantee reliable solutions while taking the computational costs into account. The OPTCON algorithm therefore includes several solution methods, which weigh the trade-off between reliability and computational speed differently. At the moment the following methods can be used: Levenberg–Marquardt [24], trust region [25], Newton–Raphson [26], or Gauss–Seidel [27].
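To illustrate what this inner task looks like, the following Python sketch solves the implicit system equation for $x_t$ with a standard root finder. This is not the authors' MATLAB code; the function name and the choice of SciPy's Levenberg–Marquardt option are our illustrative assumptions.

```python
import numpy as np
from scipy.optimize import root

def solve_state(f, x_prev, u_t, theta, z_t, x_guess):
    """Solve the implicit system x_t = f(x_{t-1}, x_t, u_t, theta, z_t)
    for x_t (Equation (2) without the additive noise term)."""
    residual = lambda x: x - f(x_prev, x, u_t, theta, z_t)
    sol = root(residual, x_guess, method='lm')  # Levenberg-Marquardt variant
    if not sol.success:
        raise RuntimeError(sol.message)
    return sol.x
```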

3. The OPTCON Algorithm

3.1. The OPTCON1 Algorithm

The first version of the OPTCON algorithm, the OPTCON1 algorithm, was developed by [9]. It allows us to calculate an open-loop (OL) solution to a nonlinear stochastic dynamic optimal control problem with a quadratic objective function under additive and parameter uncertainties. Open-loop controls either do not take account of the effect of uncertainties in the system or assume the stochastics (expectation and covariance matrices of additive and multiplicative disturbances) to be given for all time periods at the beginning of the planning horizon. The basic idea behind this algorithm is that it extends linear-quadratic stochastic optimal control techniques (see, e.g., [1]) to nonlinear problems using an iterative method of linear approximations.
The OPTCON algorithm solves the nonlinear optimal control problem iteratively by a sequence of linear approximations where the next approximation is derived using the optimal control solution of the previous one. The (optimal) solutions of the intermediate linear approximations are iterated from one time path to another until the algorithm converges. The criterion for convergence is that the difference in the values of the state and control variables of the current and previous iterations is smaller than a pre-specified number or the maximal number of iterations is reached. In each iteration the same procedure is conducted, namely linearizing and optimizing the linearized system. The system is linearized around the previous iteration's result and the problem is solved for the resulting time-varying linearized system. The solution of the linearized model is obtained via Bellman's principle of optimality and is used as the tentative path for the next iteration, starting off the procedure all over again. If the OPTCON algorithm converges, the solution of the last iteration is taken as the approximately optimal solution to the nonlinear problem and the value of the objective function is calculated. Figure 1 shows the structure of the OPTCON1 algorithm.
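The loop just described can be summarized schematically in Python. The helpers simulate, linearize, and solve_lq are hypothetical placeholders for the corresponding OPTCON1 building blocks (the actual implementation is in MATLAB); only the control flow is meant to mirror the text.

```python
from numpy.linalg import norm

def optcon1(f, x0, u_init, theta_hat, z, targets, tol=1e-6, max_iter=100):
    """Schematic OPTCON1 iteration: linearize around the previous path,
    solve the time-varying LQ problem, repeat until the paths settle."""
    x_path, u_path = simulate(f, x0, u_init, theta_hat, z)    # tentative path
    for _ in range(max_iter):
        A, B, c = linearize(f, x_path, u_path, theta_hat, z)  # time-varying matrices
        x_new, u_new = solve_lq(A, B, c, targets)             # Riccati/Bellman step
        converged = max(norm(x_new - x_path), norm(u_new - u_path)) < tol
        x_path, u_path = x_new, u_new
        if converged:
            break
    return x_path, u_path
```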

3.2. The OPTCON2 Algorithm

The second version of the algorithm (called OPTCON2, see [10,23] for more details) extends the first version with a passive learning strategy (open-loop feedback (OLF)). OLF uses the idea of re-estimating the model at the end of each time period t = 1 , , T . For that purpose, the model builder (policy maker) observes the current situation and uses the new information, namely the current values of the state variables, to improve his/her knowledge of the system.
Recall that the stochastic system includes two kinds of uncertainties, namely additive (random system errors) and multiplicative ('structural' errors in parameters). We assume that there is some true law of motion in the background which is unknown to the policy maker. The passive learning strategy deals with the 'true' parameters $\bar{\theta}$ which generate the model. However, the policy maker does not know these true parameters $\bar{\theta}$ and works with the 'wrong' estimated parameters $\hat{\theta}$. The idea of the OLF strategy is to observe the outcome of the model in each time period and use this information to bring the estimated parameters $\hat{\theta}$ closer to the true values $\bar{\theta}$.
The OLF strategy of the OPTCON2 algorithm consists of the following (rough) steps running in a forward loop.
In each time period S ($S = 1, \ldots, T$) do the following:
  • Find an (approximately) optimal open-loop solution for the remaining subproblem (for the time periods from S to T) using the OPTCON1 algorithm.
  • Fix the predicted solution for time period S, namely $x_S^*$ and $u_S^*$.
  • Observe the realized state variables $x_S^a$ which result from the chosen control variables ($u_S^*$), the true law of motion with parameter vector $\bar{\theta}$, and the realization of the random disturbances $\varepsilon_S$. The main difference between $x_S^a$ and $x_S^*$ is that the former is driven by the true dynamic process with parameter vector $\bar{\theta}$ and the latter by the estimated dynamic system with $\hat{\theta}$.
  • Update the parameter estimate $\hat{\theta}$ (via the Kalman filter and using the difference between $x_S^a$ and $x_S^*$) and use it in the next iteration step $S+1$.
The last step is carried out using the idea of the Kalman filter (see, e.g., [28] or [29]). The updating procedure according to the Kalman filter consists of two distinct phases, prediction and correction. First (the prediction phase), the predicted values of the state variables $\hat{x}_{S/S-1}$, the vector of parameters $\hat{\theta}_{S/S-1}$, and the covariances of the parameters $\Sigma^{\theta\theta}_{S/S-1}$ are calculated using the estimates from the previous time period. Next (the correction phase), these 'a priori' values of the state variables, the vector of the parameters, and the values of the covariances are corrected using the current observations of the state variables.
The phase of calculating the predicted values of $\hat{x}_{S/S-1}$ and $\hat{\theta}_{S/S-1}$ is integrated in the previous steps of the OPTCON2 algorithm. At the end of time period S the predicted values $\hat{x}_{S/S-1} = x_S^*$ are calculated and $\hat{\theta}_{S/S-1} = \hat{\theta}_{S-1}$ is known.
The correction phase is reduced to the calculation of the parameter estimates $\hat{\theta}_{S/S}$ and the parameter covariances $\Sigma^{\theta\theta}_{S/S}$ because the 'corrected' values of the state variables $x_S^a = f(x_{S-1}^a, x_S^a, u_S^*, \bar{\theta}, z_S) + \varepsilon_S$ are calculated in the previous step of the algorithm or simply observed. As explained before, the updating procedure is based on the difference between $x_S^* = f(x_{S-1}^a, x_S^*, u_S^*, \hat{\theta}, z_S)$ and $x_S^a$ at the end of each time period.
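A compact Python sketch of this prediction-correction step follows; it uses the covariance recursions written out in Steps II and III of Section 3.3 below, and all names are ours (F_theta stands for the derivative of the system function with respect to $\theta$).

```python
import numpy as np

def kalman_parameter_update(theta_hat, Sigma_tt, F_theta, Sigma_eps,
                            x_obs, x_pred):
    """One OLF updating step: predict the covariances, then correct the
    parameter estimate with the observed-minus-predicted states."""
    # prediction phase
    Sigma_xx = F_theta @ Sigma_tt @ F_theta.T + Sigma_eps  # Sigma^{xx}_{S/S-1}
    Sigma_xtheta = F_theta @ Sigma_tt                      # Sigma^{x theta}_{S/S-1}
    # correction phase: gain = Sigma^{theta x} (Sigma^{xx})^{-1}
    gain = np.linalg.solve(Sigma_xx, Sigma_xtheta).T
    theta_new = theta_hat + gain @ (x_obs - x_pred)
    Sigma_new = Sigma_tt - gain @ Sigma_xtheta             # Sigma^{theta theta}_{S/S}
    return theta_new, Sigma_new
```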
The OPTCON2 algorithm allows us to find an approximate solution of the optimal control problem using a passive learning strategy. The policy maker uses real observations in each time period to update his/her knowledge about the stochastic dynamic system. In many cases it allows him/her to obtain more reliable results (see [30] for a performance study on optimal control with a passive learning strategy).

3.3. The OPTCON3 Algorithm

In this section we present the detailed description of the OPTCON3 algorithm. As mentioned above, the latest version of the OPTCON algorithm includes an active learning strategy. In the literature this strategy is also called closed-loop, adaptive, or dual control. The passive learning method in the OPTCON2 algorithm uses current observations to update the information about the dynamic system. The active learning strategy in OPTCON3 actively uses system perturbations to improve the optimal control performance by reducing the uncertainty in the future.
The procedure for finding the closed-loop solution in the OPTCON3 algorithm is based on the linear active learning control method in [15]. Similar to the iterative structure in the OPTCON2 algorithm, the optimization runs in a forward loop ($S = 1, \ldots, T$). However, in each time period, more information about future measurements is used. To this end the objective function J (as defined in Equations (3) and (4)) is extended by stochastic components. We define in each time period $J_d$ as the approximate total cost-to-go with $T - S$ periods remaining. The approximate cost-to-go is broken down into three terms: $J_d = J_D + J_C + J_P$. $J_D$ is called the deterministic term and incorporates only non-stochastic components. $J_C$ is the cautionary factor that includes the stochastic elements in the current period. $J_P$ is called the probing term and captures the effect of dual learning on the uncertainty in the future periods. $J_C$ and $J_P$ constitute a separate quadratic minimization problem constrained by the nonlinear system. The original system needs to be expanded to the perturbation form $\delta x_t$. The optimization problem has to be solved for the perturbed system, where $\Delta J_t$ is the perturbed objective function. Using a Taylor expansion of the nonlinear system and Bellman's optimization technique, we obtain the solution $\Delta J_t$ as a quadratic function of $\delta x_{t-1}$. The original $J_t$ can be derived from $\Delta J_t$. To take uncertain parameters into account, all the terms and formulas need to be adjusted to the augmented system $\Psi_t = (x_t', \theta_t')'$ (see Section 3.5).
Next, a schematic structure of the OPTCON3 algorithm is presented. This structure is visualized in Figure 2.
The optimization is carried out in a forward loop from 1 to T, conducting the following procedure in each time period S ($S = 1, \ldots, T$). The subproblem from S to T is solved via the open-loop (OL) strategy (see Figure 1 in Section 3.1). The OL solution $(x_S^*, u_S^*)$ for the time period S is fixed. After that the core part of the dual control strategy starts, where the system is actively perturbed to learn about it in order to minimize the uncertainty and the objective function values in the future. In the OPTCON3 algorithm a grid search method is used (instead of grid search many other methods can be applied, such as gradient optimization or a more advanced heuristic approach). In the so-called “$\pi$-loop” we create a grid of possible solutions around the existing path $(x_S^*, u_S^*)$. In each iteration ($\pi = 1, \ldots, \Pi$), i.e., in each grid point, the optimal control solution is obtained and evaluated using the objective function. Inside the $\pi$-loop (for each $\pi$), the following steps are carried out, as also sketched in code after this paragraph.
An optimal open-loop solution for the subproblem from $S+1$ to T is determined. Then the OL solution $(x_{S+1}^\pi, u_{S+1}^\pi)$ for the time period $S+1$ is fixed. Next, after some auxiliary calculations (including, among others, the Riccati matrices), the deterministic $J_D$, cautionary $J_C$, and probing $J_P$ terms of the cost-to-go are determined in a forward loop from $S+1$ to T. Once the iterations of the $\pi$-loop have been completed, the total approximate objective function $J_d = J_D + J_C + J_P$ can be obtained. When the search is completed, i.e., the approximately optimal path with $\min J_d$ has been found, the obtained optimal control in time period $S+1$ is carried out by the policy maker. As a result of these controls and the true dynamic system the actual values of the state variables can be obtained. This new information is used by the policy maker to update and to adjust the parameter estimate $\hat{\theta}$, whereby the Kalman filter is used. Using the updated stochastic parameters, the same active learning procedure is applied for the remaining time periods from $S+2$ to T.
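The $\pi$-loop can be sketched as follows (a scalar control is assumed purely for readability; open_loop and cost_to_go are hypothetical stand-ins for Steps I-2A to I-2E detailed below, not functions of the OPTCON code):

```python
import numpy as np

def active_learning_step(S, u_S, grid_width, n_points, problem):
    """Schematic pi-loop of OPTCON3 for one time period S."""
    best_u, best_Jd = u_S, np.inf
    for u_pi in np.linspace(u_S - grid_width, u_S + grid_width, n_points):
        x_path, u_path = open_loop(problem, S + 1, u_pi)           # Step I-2A
        JD, JC, JP = cost_to_go(problem, S, u_pi, x_path, u_path)  # Steps I-2B..I-2E
        if JD + JC + JP < best_Jd:
            best_u, best_Jd = u_pi, JD + JC + JP
    return best_u, best_Jd                                         # Step I-[3]
```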
The OPTCON3 algorithm basically uses the technique presented in [13,15] but is augmented by approximating, in each step, the nonlinear system by a series of linear systems (the iterative procedure described in Section 3.1).
The following steps (I–IV) of the OPTCON3 algorithm describe how to obtain an approximately optimal dual control solution to a stochastic problem.
The algorithm needs the following input values (a legitimate question would be how to obtain the required (tentative) inputs; it obviously depends on the research question: in the case of some numerical experiments, one can resort to Monte Carlo simulations, and in the case of a real-world application, one can estimate system parameters by an appropriate method or simply guess some values):
- f: system function
- $x_0$: initial values of the state variables
- $(u_t)_{t=1}^T$: tentative path of the control variables
- $\hat{\theta}_0 = \hat{\theta}$: expected values of the system parameters
- $\Sigma^{\theta\theta}_0 = \Sigma^{\theta\theta}$: covariance matrix of the system parameters
- $\Sigma^{\varepsilon\varepsilon}$: covariance matrix of the system noise
- $(\varepsilon_t)_{t=1}^T$: system noises
- $(z_t)_{t=1}^T$: path of the exogenous variables
- $(\tilde{x}_t)_{t=1}^T$: target path for the state variables
- $(\tilde{u}_t)_{t=1}^T$: target path for the control variables
- $W^{xx}, W^{ux}, W^{uu}$: weighting matrices of the objective function
- $w_t^u, w_t^x$: weighting vectors of the objective function
- $\alpha$: discount factor of the objective function.
As a result of the algorithm, an approximately optimal solution $(x_t^a)_{t=1}^T$, $(u_t^*)_{t=1}^T$ and the corresponding value of the objective function $J^*$ are obtained.
Step I: Do the following search steps [1], [2] and [3] for each $S = 1, \ldots, T$:
Step I-[1]: Find the OL solution for the subproblem $(S, \ldots, T)$: apply the procedure already used in OPTCON1. Fix $(x_S^*, u_S^*)$.
Step I-[2]: Run a grid search of size $\Pi$ around $(x_S^*, u_S^*)$, i.e., perform steps (A)–(G) for each $\pi = 1, \ldots, \Pi$:
Step I-2A: Find the OL solution $(x_t^\pi, u_t^\pi)_{S+1}^T$ for the subproblem $(S+1, \ldots, T)$.
  • The nonlinearity loop is run until the stop criterion is fulfilled (the stop criterion is fulfilled when the difference between the values of the current and the previous iteration is smaller than a pre-specified number or the maximum number of iterations is achieved). As a result the approximately optimal solution $(x_t^\pi, u_t^\pi)_{S+1}^T$ has been found. Then go to the next step I-2B.
Notes:
- After several runs of the nonlinearity loop, only the solution $(x_{S+1}^\pi, u_{S+1}^\pi)$ for the time period $S+1$ will be taken as the optimal (nominal) solution. The calculations of the pairs $(x_t^\pi, u_t^\pi)$ for the other periods ($t > S+1$) have to be done again, taking into account the re-estimated parameters for all periods.
- After linearization the parameter matrices of the linearized system are obtained: $A_t = (I - F^x_{x_t})^{-1} F^x_{x_{t-1}}$ and $B_t = (I - F^x_{x_t})^{-1} F^x_{u_t}$, where $F^x_{x_t}$, $F^x_{x_{t-1}}$, and $F^x_{u_t}$ are the derivatives of the system function f with respect to $x_t$, $x_{t-1}$, and $u_t$ respectively (a finite-difference sketch of this computation follows below).
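The following Python sketch shows one (illustrative, not the authors') way to obtain $A_t$ and $B_t$ numerically via forward-difference Jacobians:

```python
import numpy as np

def linearized_matrices(f, x_prev, x_t, u_t, theta, z_t, eps=1e-6):
    """Compute A_t and B_t from finite-difference Jacobians of f."""
    def jac(g, v):  # forward-difference Jacobian of g at v
        g0, n = g(v), len(v)
        J = np.zeros((len(g0), n))
        for i in range(n):
            dv = np.zeros(n); dv[i] = eps
            J[:, i] = (g(v + dv) - g0) / eps
        return J
    Fx  = jac(lambda x: f(x_prev, x, u_t, theta, z_t), x_t)   # F^x_{x_t}
    Fxm = jac(lambda x: f(x, x_t, u_t, theta, z_t), x_prev)   # F^x_{x_{t-1}}
    Fu  = jac(lambda u: f(x_prev, x_t, u, theta, z_t), u_t)   # F^x_{u_t}
    M = np.linalg.inv(np.eye(len(x_t)) - Fx)
    return M @ Fxm, M @ Fu                                    # A_t, B_t
```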
Step I-2B: Do a backward recursion to obtain auxiliary matrices for calculating the components of the objective function.
Initialize the following auxiliary matrices for the backward recursion:
$H^{xx}_{T+1} = O_{n \times n}, \quad h^x_{T+1} = O_n, \quad H^{x\theta}_{T+1} = O_{n \times p}, \quad H^{\theta\theta}_{T+1} = O_{p \times p}.$
Calculate the Riccati matrices $K_t^{xx}$, $K_t^{\theta x}$, and $K_t^{\theta\theta}$ and the auxiliary matrices $\Lambda_t^{xx}$, $\Lambda_t^{xu}$, $\Lambda_t^{uu}$, $\lambda_t^x$, $\lambda_t^u$, $H_t^{xx}$, $H_t^{\theta x}$, $H_t^{\theta\theta}$, and $h_t^x$ backwards in time from T to $S+1$:
$K_t^{xx} = W_t^{xx} + H_{t+1}^{xx}, \quad K_t^{\theta x} = H_{t+1}^{\theta x}, \quad K_t^{\theta\theta} = H_{t+1}^{\theta\theta}, \quad k_t^x = h_{t+1}^x + x_t' W_t^{xx} + u_t' W_t^{ux} + (w_t^x)',$
$\Lambda_t^{xx} = A_t' K_t^{xx} A_t, \quad \Lambda_t^{ux} = B_t' K_t^{xx} A_t + W_t^{ux} A_t, \quad \Lambda_t^{xu} = (\Lambda_t^{ux})', \quad \Lambda_t^{uu} = B_t' K_t^{xx} B_t + 2 B_t' W_t^{xu} + W_t^{uu},$
$\lambda_t^x = k_t^x A_t, \quad \lambda_t^u = k_t^x B_t + x_t' W_t^{xu} + u_t' W_t^{uu} + (w_t^u)',$
$H_t^{xx} = \Lambda_t^{xx} - \Lambda_t^{xu} (\Lambda_t^{uu})^{-1} \Lambda_t^{ux},$
$H_t^{\theta x} = [D' K_t^{xx} + K_t^{\theta x}] A_t - [[D' K_t^{xx} + K_t^{\theta x}] B_t + D' W^{xu}] (\Lambda_t^{uu})^{-1} \Lambda_t^{ux},$
$H_t^{\theta\theta} = D'(K_t^{xx} D + K_t^{x\theta}) + K_t^{\theta x} D + K_t^{\theta\theta} - [[D' K_t^{xx} + K_t^{\theta x}] B_t + D' W^{xu}] (\Lambda_t^{uu})^{-1} [B_t'[K_t^{xx} D + K_t^{x\theta}] + W^{ux} D],$
$h_t^x = \lambda_t^x - \Lambda_t^{xu} (\Lambda_t^{uu})^{-1} \lambda_t^u,$
where $D = (I - F^x_{x_t})^{-1} F^x_\theta$.
$F^x_\theta$ is the derivative of the system function with respect to $\theta$.
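As a concrete illustration, the sketch below implements the state-state part of this recursion only (the $\theta$-blocks follow the same pattern with the matrix D); time-invariant A, B and weights are assumed purely for brevity, and the code is ours rather than the OPTCON source:

```python
import numpy as np

def backward_recursion_xx(A, B, Wxx, Wxu, Wuu, n_periods):
    """State-state part of Step I-2B, run backwards from T to S+1."""
    n = A.shape[0]
    Hxx = np.zeros((n, n))                             # H^{xx}_{T+1} = 0
    K_list = []
    for _ in range(n_periods):
        Kxx = Wxx + Hxx                                # K_t^{xx}
        Luu = B.T @ Kxx @ B + 2 * B.T @ Wxu + Wuu      # Lambda_t^{uu}
        Lux = B.T @ Kxx @ A + Wxu.T @ A                # Lambda_t^{ux} (W^{ux} = W^{xu'})
        Lxx = A.T @ Kxx @ A                            # Lambda_t^{xx}
        Hxx = Lxx - Lux.T @ np.linalg.solve(Luu, Lux)  # H_t^{xx}
        K_list.append(Kxx)
    return K_list, Hxx
```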
Step I-2C: Calculate the deterministic component $J^{D,S}$ of the approximate objective function and the cautionary component $J^{C,S}$:
$J^{D,S} = \frac{1}{2}[x_S^\pi - \tilde{x}_S]' W^{xx} [x_S^\pi - \tilde{x}_S] + [x_S^\pi - \tilde{x}_S]' W^{xu} [u_S^\pi - \tilde{u}_S] + \frac{1}{2}[u_S^\pi - \tilde{u}_S]' W^{uu} [u_S^\pi - \tilde{u}_S]$
and
$J^{C,S} = \frac{1}{2}\operatorname{tr}(H^{xx}_{S+1} \Sigma^{xx}_{S+1/S}) + \operatorname{tr}(H^{\theta x}_{S+1} \Sigma^{x\theta}_{S+1/S}) + \frac{1}{2}\operatorname{tr}(H^{\theta\theta}_{S+1} \Sigma^{\theta\theta}_{S+1/S}).$
Step I-2D: Repeat steps [a]–[c] for each $j = S+1, \ldots, T$:
[a]: Calculate the components $J^{D,j}$ and $J^{C,j}$:
$J^{D,j} = \frac{1}{2}[x_j^\pi - \tilde{x}_j]' W^{xx} [x_j^\pi - \tilde{x}_j] + [x_j^\pi - \tilde{x}_j]' W^{xu} [u_j^\pi - \tilde{u}_j] + \frac{1}{2}[u_j^\pi - \tilde{u}_j]' W^{uu} [u_j^\pi - \tilde{u}_j], \quad J^{C,j} = \frac{1}{2}\operatorname{tr}(K_j^{xx} \Sigma^{\xi\xi}).$
[b]: Calculate the matrix $\Sigma^{\theta\theta}_{j/j}$:
$\Sigma^{\theta\theta}_{j/j} = \Sigma^{\theta\theta}_{j/j-1} - \Sigma^{\theta x}_{j/j-1} (\Sigma^{xx}_{j/j-1})^{-1} \Sigma^{x\theta}_{j/j-1}$
with
$\Sigma^{xx}_{j/j-1} = F^x_\theta \Sigma^{\theta\theta}_{j-1/j-1} (F^x_\theta)' + \Sigma^{\varepsilon\varepsilon}_j, \quad \Sigma^{x\theta}_{j/j-1} = (\Sigma^{\theta x}_{j/j-1})' = F^x_\theta \Sigma^{\theta\theta}_{j-1/j-1}, \quad \Sigma^{\theta\theta}_{j/j-1} = \Sigma^{\theta\theta}_{j-1/j-1}.$
[c]: Compute the probing component $J^{P_j}$:
$J^{P_j} = \frac{1}{2}\operatorname{tr}[\Lambda^{xu} (\Lambda^{uu})^{-1} \Lambda^{ux} \Sigma^{xx}_{j/j}] + \operatorname{tr}[\Lambda^{xu} (\Lambda^{uu})^{-1} (B'(K^{xx} D + K^{x\theta}) + W^{ux} D) \Sigma^{\theta x}_{j/j}] + \frac{1}{2}\operatorname{tr}[((K^{\theta x} + D' K^{xx}) B + D' W^{xu}) (\Lambda^{uu})^{-1} (B'(K^{xx} D + K^{x\theta}) + W^{ux} D) \Sigma^{\theta\theta}_{j/j}].$
Step I-2E: Calculate the sum of the deterministic, cautionary, and probing terms over the periods $S, \ldots, T$:
$J_d = \left(J^{D,S} + \sum_{j=S+1}^T J^{D,j}\right) + \left(J^{C,S} + \sum_{j=S+1}^T J^{C,j}\right) + \sum_{j=S+1}^T J^{P_j}.$
Step I-2F: Take a new control $u_S = u_S^{\pi+1}$ (a new point in the grid search) and go to step I-2A.
Step I-[3]: Choose an optimal $u_S^*$ with $\min J = J_d(u_S^*)$. End of the grid search. Fix the corresponding $x_S^*$.
Step II: Calculate the following (a) and (b) for only one time period S:
(a)
$\Sigma^{xx}_{S/S-1} = F^x_\theta \Sigma^{\theta\theta}_{S-1/S-1} (F^x_\theta)' + \Sigma^{\varepsilon\varepsilon}_S, \quad \Sigma^{x\theta}_{S/S-1} = (\Sigma^{\theta x}_{S/S-1})' = F^x_\theta \Sigma^{\theta\theta}_{S-1/S-1}, \quad \Sigma^{\theta\theta}_{S/S-1} = \Sigma^{\theta\theta}_{S-1/S-1}.$
(b)
$x_S^a = f(x_{S-1}^a, x_S^a, u_S^*, \bar{\theta}, z_S) + \varepsilon_S.$
Step III: Update the parameter estimates $\hat{\theta}$ and $\Sigma^{\theta\theta}_{S/S}$:
$\hat{\theta}_{S/S} = \hat{\theta}_{S/S-1} + \Sigma^{\theta x}_{S/S-1} (\Sigma^{xx}_{S/S-1})^{-1} [x_S^a - x_S^*] \quad \text{and} \quad \hat{x}_{S/S} = x_S^a,$
$\Sigma^{\theta\theta}_{S/S} = \Sigma^{\theta\theta}_{S/S-1} - \Sigma^{\theta x}_{S/S-1} (\Sigma^{xx}_{S/S-1})^{-1} \Sigma^{x\theta}_{S/S-1}.$
Step IV: Set $\hat{\theta} = \hat{\theta}_{S/S}$ and $\Sigma^{\theta\theta} = \Sigma^{\theta\theta}_{S/S}$, go to Step I and run the procedure for the time period $S+1$.
The OPTCON3 algorithm is finished when S = T and the approximately optimal dual control and state variables have been found for all periods.

3.4. Computational Details for the AL Procedure

This subsection includes the technical computations for the AL procedure of the OPTCON3 algorithm.
The basic tasks are applying Bellman's principle of optimality, linearizing the nonlinear system in perturbation form, minimizing the objective function, and inserting the obtained feedback rule for the controls back into the objective function. In this way the (temporary) components of the optimal objective function can be derived. After that, due to the stochastic nature of the parameters ($\theta$), all the components have to be adjusted to the extended system $\Psi_t = (x_t', \theta_t')'$.
Start with Bellman’s principle of optimality:
$J^*_{T-t} = \min_{u_t} E\{L_t(x_t, u_t) + J^*_{T-(t+1)}(x_t)\}. \qquad (19)$
The last term of (19) is
$J^*_{T-(t+1)}(x_t) = \min_{u_{t+1}} E\{\cdots \min_{u_T} E\{\sum_{j=t+1}^T L_j(x_j, u_j) \mid P_{T-1}\} \cdots \mid P_{t+1}\}, \qquad (20)$
where $P_t = (\hat{x}_{t/t}, \Sigma_{t/t})$.
For $\sum_{j=t+1}^T L_j(x_j, u_j)$ use the (Taylor) series expansion:
$\sum_{j=t+1}^T L_j(x_j, u_j) = \sum_{j=t+1}^T L_j(x_{oj}, u_{oj}) + \sum_{j=t+1}^T \left( L_j^{x\prime} \delta x_j + \frac{1}{2}\delta x_j' L_j^{xx} \delta x_j + \delta x_j' L_j^{xu} \delta u_j + L_j^{u\prime} \delta u_j + \frac{1}{2}\delta u_j' L_j^{uu} \delta u_j \right),$
where $L_j^x$, $L_j^u$ are the gradients and $L_j^{xx}$, $L_j^{xu}$, and $L_j^{uu}$ are the Hessian matrices; $\delta x_j = x_j - x_{oj}$, $\delta u_j = u_j - u_{oj}$, and $(x_{oj}, u_{oj})$ is a nominal path.
Thus,
$J^*_{T-(t+1)}(x_t) = J_{o,T-(t+1)}(x_t) + \Delta J^*_{T-(t+1)}(x_t)$
and
$\Delta J^*_{T-(t+1)}(x_t) = \min_{u_{t+1}} E\{\cdots \min_{u_T} E\{\sum_{j=t+1}^T (L_j^{x\prime}\delta x_j + \frac{1}{2}\delta x_j' L_j^{xx}\delta x_j + \delta x_j' L_j^{xu}\delta u_j + L_j^{u\prime}\delta u_j + \frac{1}{2}\delta u_j' L_j^{uu}\delta u_j) \mid P_{T-1}\} \cdots \mid P_{t+1}\}. \qquad (21)$
Next, apply the Taylor expansion to the nonlinear system, i.e., linearize the system function f in (2) around the reference values:
$x_t \approx f^t(\hat{x}_{t-1/t-1}, u_t^o, x_t^o) + F_{x_{t-1}}(x_{t-1} - \hat{x}_{t-1/t-1}) + F_u(u_t - u_t^o) + F_{x_t}(x_t - \hat{x}_{t/t}) + \varepsilon_t.$
This can be rewritten in perturbation form:
$\delta x_t = F_{x_{t-1}} \delta x_{t-1} + F_u \delta u_t + F_{x_t} \delta x_t + \varepsilon_t.$
After transformation:
$\delta x_t = (I - F_{x_t})^{-1} F_{x_{t-1}} \delta x_{t-1} + (I - F_{x_t})^{-1} F_{u_t} \delta u_t + (I - F_{x_t})^{-1} \varepsilon_t. \qquad (22)$
Thus, (21) and (22) constitute the problem in perturbation form.
Next, we have to prove that the term (21) can be expressed in quadratic form as a function of $\delta x_t$ only.
Assume
$\Delta J^*_{T-(t+1)}(x_t) = g_{t+1} + E\{h_{t+1}^{x\prime} \delta x_t + \frac{1}{2}\delta x_t' H_{t+1} \delta x_t \mid P_{t+1}\} \qquad (23)$
and apply the method of induction. By doing so the parameters g, $h^x$ and H are determined.
The next step is to prove that rule (23) is also true for $\Delta J^*_{T-j}$ for all j.
Start with
$J^*_{T-j} = \min_{u_j} E\{L_j(x_j, u_j) + J^*_{T-(j+1)}(x_j)\}.$
By second order expansion we get:
$J^*_{T-j} = \min_{u_j} E\{L_j(x_{oj}, u_{oj}) + L^{x\prime}\delta x_j + \frac{1}{2}\delta x_j' L^{xx}\delta x_j + \delta x_j' L^{xu}\delta u_j + L^{u\prime}\delta u_j + \frac{1}{2}\delta u_j' L^{uu}\delta u_j + (J_{o,T-(j+1)} + \Delta J^*_{T-(j+1)})\},$
$J^*_{T-j} - L_j(x_{oj}, u_{oj}) - J_{o,T-(j+1)} = \min_{u_j} E\{L^{x\prime}\delta x_j + \frac{1}{2}\delta x_j' L^{xx}\delta x_j + \delta x_j' L^{xu}\delta u_j + L^{u\prime}\delta u_j + \frac{1}{2}\delta u_j' L^{uu}\delta u_j + \Delta J^*_{T-(j+1)}\}.$
Use $L_j(x_{oj}, u_{oj}) + J_{o,T-(j+1)} = J_{o,T-j}$ and assumption (23):
$J^*_{T-j} - J_{o,T-j} = \min_{u_j} E\{L^{x\prime}\delta x_j + \frac{1}{2}\delta x_j' L^{xx}\delta x_j + \delta x_j' L^{xu}\delta u_j + L^{u\prime}\delta u_j + \frac{1}{2}\delta u_j' L^{uu}\delta u_j + g_{j+1} + E\{h_{j+1}^{x\prime}\delta x_j + \frac{1}{2}\delta x_j' H_{j+1}\delta x_j \mid P_{j+1}\} \mid P_j\}.$
Taking expectations over $P_{j+1}$ yields
$\Delta J^*_{T-j} := J^*_{T-j} - J_{o,T-j} = \min_{u_j} E\{(L^x + h_{j+1}^x)'\delta x_j + \frac{1}{2}\delta x_j'(L^{xx} + H_{j+1})\delta x_j + \delta x_j' L^{xu}\delta u_j + L^{u\prime}\delta u_j + \frac{1}{2}\delta u_j' L^{uu}\delta u_j + g_{j+1} \mid P_j\}. \qquad (24)$
Put (22) into (24) and set
$L^{x\prime} + h_{j+1}^{x\prime} =: k_j^{x\prime}, \quad L^{xx} + H_{j+1} =: K_j, \quad H_{T+1} = 0. \qquad (25)$
Then by taking expectations over $P_j$:
$\Delta J^*_{T-j} = \min_{u_j} E\{k_j^{x\prime}[(I - F_{x_j})^{-1}F_{x_{j-1}}\delta x_{j-1} + (I - F_{x_j})^{-1}F_u\delta u_j + (I - F_{x_j})^{-1}\varepsilon_j] + \frac{1}{2}[(I - F_{x_j})^{-1}F_{x_{j-1}}\delta x_{j-1} + (I - F_{x_j})^{-1}F_u\delta u_j + (I - F_{x_j})^{-1}\varepsilon_j]' K_j [(I - F_{x_j})^{-1}F_{x_{j-1}}\delta x_{j-1} + (I - F_{x_j})^{-1}F_u\delta u_j + (I - F_{x_j})^{-1}\varepsilon_j] + [(I - F_{x_j})^{-1}F_{x_{j-1}}\delta x_{j-1} + (I - F_{x_j})^{-1}F_u\delta u_j + (I - F_{x_j})^{-1}\varepsilon_j]' L^{xu}\delta u_j + L^{u\prime}\delta u_j + \frac{1}{2}\delta u_j' L^{uu}\delta u_j + g_{j+1} \mid P_{j-1}\},$
$\Delta J^*_{T-j} = \min_{u_j} E\{k_j^{x\prime}(I - F_{x_j})^{-1}F_{x_{j-1}}\delta x_{j-1} + [k_j^{x\prime}(I - F_{x_j})^{-1}F_u + L^{u\prime}]\delta u_j + \frac{1}{2}\delta x_{j-1}'[F_{x_{j-1}}'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}F_{x_{j-1}}]\delta x_{j-1} + \delta x_{j-1}'[F_{x_{j-1}}'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}F_u + F_{x_{j-1}}'((I - F_{x_j})^{-1})' L^{xu}]\delta u_j + \frac{1}{2}\delta u_j'[F_u'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}F_u + 2F_u'((I - F_{x_j})^{-1})' L^{xu} + L^{uu}]\delta u_j + \frac{1}{2}\varepsilon_j'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}\varepsilon_j + g_{j+1} \mid P_{j-1}\}.$
The terms which are higher than second order are dropped here.
Then use the following notations:
$\lambda^{x\prime} := k_j^{x\prime}(I - F_{x_j})^{-1}F_{x_{j-1}},$
$\lambda^{u\prime} := k_j^{x\prime}(I - F_{x_j})^{-1}F_u + L^{u\prime},$
$\Lambda^{xx} := F_{x_{j-1}}'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}F_{x_{j-1}},$
$\Lambda^{xu} := F_{x_{j-1}}'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}F_u + F_{x_{j-1}}'((I - F_{x_j})^{-1})' L^{xu},$
$\Lambda^{uu} := F_u'((I - F_{x_j})^{-1})' K_j (I - F_{x_j})^{-1}F_u + 2F_u'((I - F_{x_j})^{-1})' L^{xu} + L^{uu},$
$\xi_j := (I - F_{x_j})^{-1}\varepsilon_j \qquad (26)$
and get
$\Delta J^*_{T-j} = \min_{u_j} E\{\lambda^{x\prime}\delta x_{j-1} + \lambda^{u\prime}\delta u_j + \frac{1}{2}\delta x_{j-1}'\Lambda^{xx}\delta x_{j-1} + \delta x_{j-1}'\Lambda^{xu}\delta u_j + \frac{1}{2}\delta u_j'\Lambda^{uu}\delta u_j + \frac{1}{2}\xi_j' K_j \xi_j + g_{j+1} \mid P_{j-1}\}.$
Apply the rule $E(x'Ax) = \hat{x}'A\hat{x} + \operatorname{tr}(A\Sigma^{xx})$ and take expectations over $P_{j-1}$:
$\Delta J^*_{T-j} = \min_{u_j}\{\lambda^{x\prime}\delta\hat{x}_{j-1/j-1} + \lambda^{u\prime}\delta u_j + g_{j+1} + \frac{1}{2}\delta\hat{x}_{j-1/j-1}'\Lambda^{xx}\delta\hat{x}_{j-1/j-1} + \delta\hat{x}_{j-1/j-1}'\Lambda^{xu}\delta u_j + \frac{1}{2}\delta u_j'\Lambda^{uu}\delta u_j + \frac{1}{2}\operatorname{tr}(\Lambda^{xx}\Sigma^{xx}_{j/j-1}) + \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi})\}, \qquad (27)$
where $E(\xi) = 0$.
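For completeness, the expectation rule used here follows directly from splitting x into its conditional mean and a zero-mean deviation:

```latex
\begin{aligned}
E(x'Ax) &= E\big[(\hat{x}+e)'A(\hat{x}+e)\big], \qquad e := x - \hat{x},\quad E(e)=0,\quad E(ee') = \Sigma^{xx},\\
        &= \hat{x}'A\hat{x} + \hat{x}'A\,E(e) + E(e)'A\hat{x} + E(e'Ae)
         = \hat{x}'A\hat{x} + E\big[\operatorname{tr}(Aee')\big]
         = \hat{x}'A\hat{x} + \operatorname{tr}(A\Sigma^{xx}).
\end{aligned}
```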
Next, minimize the objective function $\Delta J^*_{T-j}$ with respect to $\delta u_j$:
$\frac{\partial \Delta J^*_{T-j}}{\partial \delta u_j} = \lambda^{u\prime} + \delta\hat{x}_{j-1/j-1}'\Lambda^{xu} + \delta u_j'\Lambda^{uu} = 0 \qquad (28)$
$\Rightarrow \quad \delta u_j = -(\Lambda^{uu})^{-1}(\lambda^u + \Lambda^{ux}\delta\hat{x}_{j-1/j-1}). \qquad (29)$
Put rule (29) into (27):
$\Delta J^*_{T-j} = E\{\lambda^{x\prime}\delta x_{j-1} - \lambda^{u\prime}(\Lambda^{uu})^{-1}(\lambda^u + \Lambda^{ux}\delta\hat{x}_{j-1/j-1}) + \frac{1}{2}\delta x_{j-1}'\Lambda^{xx}\delta x_{j-1} - \delta x_{j-1}'\Lambda^{xu}(\Lambda^{uu})^{-1}(\lambda^u + \Lambda^{ux}\delta\hat{x}_{j-1/j-1}) + \frac{1}{2}[(\Lambda^{uu})^{-1}(\lambda^u + \Lambda^{ux}\delta\hat{x}_{j-1/j-1})]'\Lambda^{uu}[(\Lambda^{uu})^{-1}(\lambda^u + \Lambda^{ux}\delta\hat{x}_{j-1/j-1})] + \frac{1}{2}\xi' K_j \xi + g_{j+1}\},$
$\Delta J^*_{T-j} = E\{g_{j+1} + (\lambda^x - \Lambda^{xu}(\Lambda^{uu})^{-1}\lambda^u)'\delta x_{j-1} - \frac{1}{2}\lambda^{u\prime}(\Lambda^{uu})^{-1}\lambda^u + \frac{1}{2}\delta x_{j-1}'\Lambda^{xx}\delta x_{j-1} + \frac{1}{2}\delta\hat{x}_{j-1/j-1}'\Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux}\delta\hat{x}_{j-1/j-1} - \delta x_{j-1}'\Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux}\delta\hat{x}_{j-1/j-1} + \frac{1}{2}\xi' K_j \xi\}.$
Define
$\Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux} =: \Omega^{xx}. \qquad (30)$
Note that $\frac{1}{2}E\{\xi' K_j \xi\} = \frac{1}{2}[E(\xi)' K_j E(\xi) + \operatorname{tr}(K_j\Sigma^{\xi\xi})] = \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi})$ because $E(\xi) = 0$. Moreover $E\{\delta x_{j-1}'\Omega^{xx}\delta\hat{x}_{j-1/j-1}\} = \delta\hat{x}_{j-1/j-1}'\Omega^{xx}\delta\hat{x}_{j-1/j-1}$. Thus,
$\Delta J^*_{T-j} = g_{j+1} - \frac{1}{2}\lambda^{u\prime}(\Lambda^{uu})^{-1}\lambda^u + \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi}) + E\{(\lambda^x - \Lambda^{xu}(\Lambda^{uu})^{-1}\lambda^u)'\delta x_{j-1} + \frac{1}{2}\delta x_{j-1}'\Lambda^{xx}\delta x_{j-1}\} - \frac{1}{2}\delta\hat{x}_{j-1/j-1}'\Omega^{xx}\delta\hat{x}_{j-1/j-1}.$
Use the following rule:
$E\{\delta x_{j-1}'\Omega^{xx}\delta x_{j-1}\} = \delta\hat{x}_{j-1/j-1}'\Omega^{xx}\delta\hat{x}_{j-1/j-1} + \operatorname{tr}(\Omega^{xx}\Sigma^{xx}) \;\Rightarrow\; \delta\hat{x}_{j-1/j-1}'\Omega^{xx}\delta\hat{x}_{j-1/j-1} = E\{\delta x_{j-1}'\Omega^{xx}\delta x_{j-1}\} - \operatorname{tr}(\Omega^{xx}\Sigma^{xx})$
and get
$\Delta J^*_{T-j} = g_{j+1} - \frac{1}{2}\lambda^{u\prime}(\Lambda^{uu})^{-1}\lambda^u + \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi}) + E\{(\lambda^x - \Lambda^{xu}(\Lambda^{uu})^{-1}\lambda^u)'\delta x_{j-1} + \frac{1}{2}\delta x_{j-1}'\Lambda^{xx}\delta x_{j-1}\} - \frac{1}{2}\left(E\{\delta x_{j-1}'\Omega^{xx}\delta x_{j-1}\} - \operatorname{tr}(\Omega^{xx}\Sigma^{xx})\right)$
$= g_{j+1} - \frac{1}{2}\lambda^{u\prime}(\Lambda^{uu})^{-1}\lambda^u + \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi}) + \frac{1}{2}\operatorname{tr}(\Omega^{xx}\Sigma^{xx}) + E\{(\lambda^x - \Lambda^{xu}(\Lambda^{uu})^{-1}\lambda^u)'\delta x_{j-1} + \frac{1}{2}\delta x_{j-1}'(\Lambda^{xx} - \Omega^{xx})\delta x_{j-1}\}.$
Defining
$h_j^x := \lambda^x - \Lambda^{xu}(\Lambda^{uu})^{-1}\lambda^u, \quad H_j := \Lambda^{xx} - \Omega^{xx} = \Lambda^{xx} - \Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux}$
and
$g_j = g_{j+1} - \frac{1}{2}\lambda_j^{u\prime}(\Lambda^{uu})^{-1}\lambda_j^u + \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi}_{j/j} + \Omega^{xx}\Sigma^{xx}_j),$
we get
$\Delta J^*_{T-j} = g_j + E\{h_j^{x\prime}\delta x_{j-1} + \frac{1}{2}\delta x_{j-1}' H_j \delta x_{j-1}\}. \qquad (32)$
This shows that $\Delta J^*_{T-j}$ is a quadratic function of $\delta x_{j-1}$.
 
Next, using the results of [31] we find the solution of the g recursion:
$g_j = g_{j+1} - \frac{1}{2}\lambda_j^{u\prime}(\Lambda^{uu})^{-1}\lambda_j^u + \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi}_{j/j} + \Omega^{xx}\Sigma^{xx}_j).$
Let us define a deterministic term $M_j = \frac{1}{2}\lambda_j^{u\prime}(\Lambda^{uu})^{-1}\lambda_j^u$ and a stochastic term $N_j = \frac{1}{2}\operatorname{tr}(K_j\Sigma^{\xi\xi}_{j/j} + \Omega^{xx}\Sigma^{xx}_j)$, so that $g_j = g_{j+1} - M_j + N_j$. By working backwards from period T the following can be shown:
$g_{T-j} = g_{T+1} - \sum_{i=T-j}^T M_i + \sum_{i=T-j}^T N_i \quad \text{and} \quad g_{T+1} = 0. \qquad (33)$
Next, solve it partially for the stochastic term $N_j$. For that define $\gamma_j = \gamma_{j+1} - M_j$ with $\gamma_{T+1} = 0$.
Using the backward recursion again we get the equation:
$\gamma_{T-j} = \gamma_{T+1} - \sum_{i=T-j}^T M_i = -\sum_{i=T-j}^T M_i.$
Put it into (33) and get
$g_{T-j} = \gamma_{T-j} + \sum_{i=T-j}^T N_i$
or in general
$g_{j+1} = \gamma_{j+1} + \sum_{i=j+1}^T N_i. \qquad (34)$
Thus
$g_{j+1} = \gamma_{j+1} + \frac{1}{2}\sum_{i=j+1}^T \operatorname{tr}(K_i\Sigma^{\xi\xi}_{i-1/i-1} + \Omega^{xx}\Sigma^{xx}_{i-1})$
and
$\gamma_j = \gamma_{j+1} - \frac{1}{2}\lambda_{j-1}^{u\prime}(\Lambda^{uu})^{-1}\lambda_{j-1}^u.$
Use the fact that $\Delta J^*_{T-(t+1)}$ and $\Delta J^*_{t+1}$ are equivalent; thus put (34) into (32) for $t+1$:
$\Delta J^*_{t+1} = \gamma_{t+1} + \frac{1}{2}\sum_{i=t+1}^T \operatorname{tr}(K_i\Sigma^{\xi\xi} + \Omega^{xx}\Sigma^{xx}) + E\{h_{t+1}^{x\prime}\delta x_t + \frac{1}{2}\delta x_t' H_{t+1}\delta x_t \mid P_t\}.$
Then
$J^*_{t+1}(x_t) = J_{o,t+1} + \Delta J^*_{t+1} = C_{o,t+1} + \gamma_{t+1} + \frac{1}{2}\left(\sum_{i=t+1}^T \operatorname{tr}(K_i\Sigma^{\xi\xi} + \Omega^{xx}\Sigma^{xx})\right) + E\{h_{t+1}^{x\prime}\delta x_t + \frac{1}{2}\delta x_t' H_{t+1}\delta x_t \mid P_t\},$
where $C_{o,t+1} = \sum_{i=t+1}^T L_{t+1}(x_{o,t+1}, u_{o,t+1})$.
$J^*_t = \min_{u_t} E_{t-1}\{L_t(x_t, u_t) + J^*_{t+1}(x_t)\} = \min_{u_t} E\{L_t(x_t, u_t) + C_{o,t+1} + \gamma_{t+1} + \frac{1}{2}\left(\sum_{i=t+1}^T \operatorname{tr}(K_i\Sigma^{\xi\xi} + \Omega^{xx}\Sigma^{xx})\right) + E\{h_{t+1}^{x\prime}\delta x_t + \frac{1}{2}\delta x_t' H_{t+1}\delta x_t \mid P_t\} \mid P_{t-1}\}.$
Here $E\{h_{t+1}^{x\prime}\delta x_t \mid P_{t-1}\} = 0$ and
$E\{\frac{1}{2}\delta x_t' H_{t+1}\delta x_t \mid P_{t-1}\} = \frac{1}{2}\delta\hat{x}_{t/t-1}' H_{t+1}\delta\hat{x}_{t/t-1} + \frac{1}{2}\operatorname{tr}(H_{t+1}\Sigma^{xx}_{t/t-1}) = \frac{1}{2}\operatorname{tr}(H_{t+1}\Sigma^{xx}_{t/t-1}).$ Then
$J^*_t =: J_{t,d} = \min_{u_t} E_{t-1}\{L_t(x_t, u_t) + C_{o,t+1} + \gamma_{t+1} + \frac{1}{2}\left(\sum_{i=t+1}^T \operatorname{tr}(K_i\Sigma^{\xi\xi}_{i/i} + \Omega^{xx}\Sigma^{xx}_{i/i})\right) + \frac{1}{2}\operatorname{tr}(H_{t+1}\Sigma^{xx}_{t/t-1})\}.$
This objective function can be split into three parts:
deterministic:
$J_{D,t} = L_t(x_t, u_t) + C_{o,t+1} + \gamma_{t+1},$
cautionary:
$J_{c,t} = \frac{1}{2}\sum_{i=t}^T \operatorname{tr}(K_i\Sigma^{\xi\xi}_{i/i}) + \frac{1}{2}\operatorname{tr}(H_{t+1}\Sigma^{xx}_{t/t-1}), \qquad (40)$
probing:
$J_{p,t} = \frac{1}{2}\sum_{i=t+1}^T \operatorname{tr}(\Omega^{xx}\Sigma^{xx}_{i/i}). \qquad (41)$
In this way the optimal control problem is solved and the components of the objective function are found. However, these are not the final results. In the next step, the stochastic parameters have to be taken into account, i.e., the extension of the system has to be considered.

3.5. Extension of the System

Recall that the algorithm deals with a quadratic criterion function when the parameters of the system equations are unknown. One technique to incorporate the uncertain parameters is to treat the random parameters as additional state variables.
Thus, apply the equations of the objective function to the system
$\Psi_t = \begin{pmatrix} x_t \\ \theta_t \end{pmatrix}.$
Then
$L_t(\Psi_t, u_t) = \frac{1}{2}\begin{pmatrix} x_t - \tilde{x}_t \\ 0 \end{pmatrix}'\begin{pmatrix} W^{xx} & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} x_t - \tilde{x}_t \\ 0 \end{pmatrix} + \begin{pmatrix} x_t - \tilde{x}_t \\ 0 \end{pmatrix}'\begin{pmatrix} W^{xu} \\ 0 \end{pmatrix}[u_t - \tilde{u}_t] + \frac{1}{2}[u_t - \tilde{u}_t]' W^{uu} [u_t - \tilde{u}_t]$
and
$f^t(\Psi_t, u_t) = \begin{pmatrix} f^x(\Psi_t, u_t) \\ f^\theta(\Psi_t, u_t) \end{pmatrix} = \begin{pmatrix} f(x_t, x_{t-1}, u_t, \theta_t) \\ I\theta_{t-1} \end{pmatrix}.$
In order to calculate the adjusted components of the objective function (especially (40) and (41)) we need the terms H and Ω .
First, we calculate the adapted $\Omega^{\Psi\Psi}$ using Definition (30):
$\Omega^{\Psi\Psi} = \Lambda^{\Psi u}(\Lambda^{uu})^{-1}\Lambda^{u\Psi} = [F_{\Psi_{t-1}}'((I - F_{\Psi_t})^{-1})' K_t (I - F_{\Psi_t})^{-1} F_u + F_{\Psi_{t-1}}'((I - F_{\Psi_t})^{-1})' L^{\Psi u}](\Lambda^{uu})^{-1}[F_u'((I - F_{\Psi_t})^{-1})' K_t (I - F_{\Psi_t})^{-1} F_{\Psi_{t-1}} + L^{u\Psi}(I - F_{\Psi_t})^{-1} F_{\Psi_{t-1}}],$
where the blocks are partitioned conformably with $\Psi_t = (x_t', \theta_t')'$:
$F_{\Psi_{t-1}} = \begin{pmatrix} F^x_{x_{t-1}} & F^x_{\theta_{t-1}} \\ F^\theta_{x_{t-1}} & F^\theta_{\theta_{t-1}} \end{pmatrix}, \quad F_u = \begin{pmatrix} F^x_{u_t} \\ F^\theta_{u_t} \end{pmatrix}, \quad K_t = \begin{pmatrix} K^{xx} & K^{x\theta} \\ K^{\theta x} & K^{\theta\theta} \end{pmatrix}, \quad L^{\Psi u} = \begin{pmatrix} L^{xu} \\ 0 \end{pmatrix},$
and where
$I - F_{\Psi_t} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} - \begin{pmatrix} F^x_{x_t} & F^x_{\theta_t} \\ F^\theta_{x_t} & F^\theta_{\theta_t} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} - \begin{pmatrix} F^x_{x_t} & F^x_{\theta_t} \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} I - F^x_{x_t} & -F^x_{\theta_t} \\ 0 & I \end{pmatrix}$
and
$(I - F_{\Psi_t})^{-1} = \begin{pmatrix} (I - F^x_{x_t})^{-1} & (I - F^x_{x_t})^{-1} F^x_{\theta_t} \\ 0 & I \end{pmatrix}.$
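This block inverse is easy to verify numerically; the following small Python check (with arbitrary random matrices of our choosing) confirms it:

```python
import numpy as np

# Sanity check of the block inverse of (I - F_Psi) derived above,
# with a random contraction F_x and an arbitrary F_theta block.
n, p = 3, 2
Fx = 0.3 * np.random.rand(n, n)        # spectral radius < 1, so I - Fx invertible
Ftheta = np.random.rand(n, p)
I_minus = np.block([[np.eye(n) - Fx, -Ftheta],
                    [np.zeros((p, n)), np.eye(p)]])
Minv = np.linalg.inv(np.eye(n) - Fx)
claimed = np.block([[Minv, Minv @ Ftheta],
                    [np.zeros((p, n)), np.eye(p)]])
assert np.allclose(np.linalg.inv(I_minus), claimed)  # matches the formula above
```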
We know that $F^\theta_{x_{t-1}} = 0$, $F^x_{\theta_{t-1}} = 0$, $F^\theta_{\theta_{t-1}} = I$, and $L^{xu} = W^{xu}$. Then we get
$\Omega^{\Psi\Psi} = \left(\begin{pmatrix} F^{x\prime}_{x_{t-1}}((I - F^x_{x_t})^{-1})' & 0 \\ ((I - F^x_{x_t})^{-1} F^x_{\theta_t})' & I \end{pmatrix}\begin{pmatrix} K^{xx} & K^{x\theta} \\ K^{\theta x} & K^{\theta\theta} \end{pmatrix}\begin{pmatrix} (I - F^x_{x_t})^{-1} F^x_{u_t} \\ 0 \end{pmatrix} + \begin{pmatrix} F^{x\prime}_{x_{t-1}}((I - F^x_{x_t})^{-1})' & 0 \\ ((I - F^x_{x_t})^{-1} F^x_{\theta_t})' & I \end{pmatrix}\begin{pmatrix} W^{xu} \\ 0 \end{pmatrix}\right)(\Lambda^{uu})^{-1} \times \left(\begin{pmatrix} F^{x\prime}_{u_t}((I - F^x_{x_t})^{-1})' & 0 \end{pmatrix}\begin{pmatrix} K^{xx} & K^{x\theta} \\ K^{\theta x} & K^{\theta\theta} \end{pmatrix}\begin{pmatrix} (I - F^x_{x_t})^{-1} F^x_{x_{t-1}} & (I - F^x_{x_t})^{-1} F^x_{\theta_t} \\ 0 & I \end{pmatrix} + \begin{pmatrix} W^{ux} & 0 \end{pmatrix}\begin{pmatrix} (I - F^x_{x_t})^{-1} F^x_{x_{t-1}} & (I - F^x_{x_t})^{-1} F^x_{\theta_t} \\ 0 & I \end{pmatrix}\right).$
Define
$D = (I - F^x_{x_t})^{-1} F^x_{\theta_t}.$
Then using $A = (I - F^x_{x_t})^{-1} F^x_{x_{t-1}}$ and $B = (I - F^x_{x_t})^{-1} F^x_{u_t}$ we get the following:
$\Omega^{\Psi\Psi} = \begin{pmatrix} A'K^{xx}B + A'W^{xu} \\ D'K^{xx}B + K^{\theta x}B + D'W^{xu} \end{pmatrix}(\Lambda^{uu})^{-1}\begin{pmatrix} B'K^{xx}A + W^{ux}A & B'K^{xx}D + B'K^{x\theta} + W^{ux}D \end{pmatrix} =: \begin{pmatrix} \psi_1 & \psi_2 \\ \psi_3 & \psi_4 \end{pmatrix}.$
Use the definition $\Lambda^{xu} = A'K^{xx}B + A'W^{xu}$ and get the following:
$\psi_1 = \Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux},$
$\psi_2 = \Lambda^{xu}(\Lambda^{uu})^{-1}(B'K^{xx}D + B'K^{x\theta} + W^{ux}D),$
$\psi_3 = (D'K^{xx}B + K^{\theta x}B + D'W^{xu})(\Lambda^{uu})^{-1}\Lambda^{ux},$
$\psi_4 = (D'K^{xx}B + K^{\theta x}B + D'W^{xu})(\Lambda^{uu})^{-1}(B'K^{xx}D + B'K^{x\theta} + W^{ux}D).$
For the calculation of H we need the term $\Lambda^{\Psi\Psi}$:
$\Lambda^{\Psi\Psi} = F_{\Psi_{t-1}}'((I - F_{\Psi_t})^{-1})' K_t (I - F_{\Psi_t})^{-1} F_{\Psi_{t-1}} = \begin{pmatrix} A' & 0 \\ D' & I \end{pmatrix}\begin{pmatrix} K^{xx} & K^{x\theta} \\ K^{\theta x} & K^{\theta\theta} \end{pmatrix}\begin{pmatrix} A & D \\ 0 & I \end{pmatrix} = \begin{pmatrix} A'K^{xx}A & A'(K^{xx}D + K^{x\theta}) \\ (D'K^{xx} + K^{\theta x})A & D'(K^{xx}D + K^{x\theta}) + (K^{\theta x}D + K^{\theta\theta}) \end{pmatrix}.$
Now we can calculate the term H for the extended system:
$H^{\Psi\Psi} = \Lambda^{\Psi\Psi} - \Lambda^{\Psi u}(\Lambda^{uu})^{-1}\Lambda^{u\Psi} = \Lambda^{\Psi\Psi} - \Omega^{\Psi\Psi},$
$H_t^{\Psi\Psi} = \begin{pmatrix} H^{xx} & H^{x\theta} \\ H^{\theta x} & H^{\theta\theta} \end{pmatrix}_t.$
Thus,
$H^{xx} = \Lambda^{xx} - \Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux},$
$H^{x\theta} = A'(K^{xx}D + K^{x\theta}) - \Lambda^{xu}(\Lambda^{uu})^{-1}(B'K^{xx}D + B'K^{x\theta} + W^{ux}D),$
$H^{\theta\theta} = D'(K^{xx}D + K^{x\theta}) + (K^{\theta x}D + K^{\theta\theta}) - (D'K^{xx}B + K^{\theta x}B + D'W^{xu})(\Lambda^{uu})^{-1}(B'K^{xx}D + B'K^{x\theta} + W^{ux}D).$
After that we go back to the components of the objective function. According to (41) we need to calculate the term $\Omega^{\Psi\Psi}\Sigma^{\Psi\Psi}_{t/t}$:
$\Omega^{\Psi\Psi}\Sigma^{\Psi\Psi}_{t/t} = \begin{pmatrix} \psi_1 & \psi_2 \\ \psi_3 & \psi_4 \end{pmatrix}\begin{pmatrix} \Sigma^{xx} & \Sigma^{x\theta} \\ \Sigma^{\theta x} & \Sigma^{\theta\theta} \end{pmatrix}_{t/t} = \begin{pmatrix} \psi_1\Sigma^{xx}_{t/t} + \psi_2\Sigma^{\theta x}_{t/t} & \psi_1\Sigma^{x\theta}_{t/t} + \psi_2\Sigma^{\theta\theta}_{t/t} \\ \psi_3\Sigma^{xx}_{t/t} + \psi_4\Sigma^{\theta x}_{t/t} & \psi_3\Sigma^{x\theta}_{t/t} + \psi_4\Sigma^{\theta\theta}_{t/t} \end{pmatrix}.$
Thus,
$J_{p,t} = \frac{1}{2}\sum_{j=t+1}^T \operatorname{tr}(\Omega^{\Psi\Psi}\Sigma^{\Psi\Psi}_{j/j}) = \frac{1}{2}\sum_{j=t+1}^T \operatorname{tr}[\Lambda^{xu}(\Lambda^{uu})^{-1}\Lambda^{ux}\Sigma^{xx}_{j/j}] + \sum_{j=t+1}^T \operatorname{tr}[\Lambda^{xu}(\Lambda^{uu})^{-1}(B'(K^{xx}D + K^{x\theta}) + W^{ux}D)\Sigma^{\theta x}_{j/j}] + \frac{1}{2}\sum_{j=t+1}^T \operatorname{tr}[((K^{\theta x} + D'K^{xx})B + D'W^{xu})(\Lambda^{uu})^{-1}(B'(K^{xx}D + K^{x\theta}) + W^{ux}D)\Sigma^{\theta\theta}_{j/j}].$
Next we calculate $J_{c,t}$. According to (40) we need to calculate $\operatorname{tr}(K_j^{\Psi\Psi}\Sigma^{\xi\xi}_{j/j})$ and $\operatorname{tr}(H_{t+1}^{\Psi\Psi}\Sigma^{\Psi\Psi}_{t/t-1})$:
$\operatorname{tr}(K_j^{\Psi\Psi}\Sigma^{\xi\xi}_{j/j}) = \operatorname{tr}\left[\begin{pmatrix} K^{xx} & K^{x\theta} \\ K^{\theta x} & K^{\theta\theta} \end{pmatrix}_j \begin{pmatrix} \Sigma^{\xi\xi} & 0 \\ 0 & 0 \end{pmatrix}_{j/j}\right] = \operatorname{tr}(K_j^{xx}\Sigma^{\xi\xi}_{j/j}),$
$\operatorname{tr}(H_{t+1}^{\Psi\Psi}\Sigma^{\Psi\Psi}_{t/t-1}) = \operatorname{tr}\left[\begin{pmatrix} H^{xx} & H^{x\theta} \\ H^{\theta x} & H^{\theta\theta} \end{pmatrix}_{t+1}\begin{pmatrix} \Sigma^{xx} & \Sigma^{x\theta} \\ \Sigma^{\theta x} & \Sigma^{\theta\theta} \end{pmatrix}_{t/t-1}\right] = \operatorname{tr}(H_{t+1}^{xx}\Sigma^{xx}_{t/t-1} + H_{t+1}^{x\theta}\Sigma^{\theta x}_{t/t-1}) + \operatorname{tr}(H_{t+1}^{\theta x}\Sigma^{x\theta}_{t/t-1} + H_{t+1}^{\theta\theta}\Sigma^{\theta\theta}_{t/t-1}) = \operatorname{tr}(H_{t+1}^{xx}\Sigma^{xx}_{t/t-1}) + 2\operatorname{tr}(H_{t+1}^{\theta x}\Sigma^{x\theta}_{t/t-1}) + \operatorname{tr}(H_{t+1}^{\theta\theta}\Sigma^{\theta\theta}_{t/t-1}).$
Thus,
$J_{c,t} = \frac{1}{2}\left[\sum_{i=t}^T \operatorname{tr}(K_i^{xx}\Sigma^{\xi\xi}_{i/i})\right] + \frac{1}{2}\operatorname{tr}(H_{t+1}^{xx}\Sigma^{xx}_{t/t-1}) + \operatorname{tr}(H_{t+1}^{\theta x}\Sigma^{x\theta}_{t/t-1}) + \frac{1}{2}\operatorname{tr}(H_{t+1}^{\theta\theta}\Sigma^{\theta\theta}_{t/t-1}).$
Under the assumption $\gamma = 0$ (Kendrick (1981) [15], Appendix J) the deterministic term is
$J_{D,t} = L_t(\Psi_t, u_t) + C_{o,t+1} + \gamma_{t+1} = L_t(x_t, u_t) + \sum_{j=t+1}^T L_j(x_{oj}, u_{oj}) = \frac{1}{2}(x_t - \tilde{x}_t)'W^{xx}(x_t - \tilde{x}_t) + (x_t - \tilde{x}_t)'W^{xu}(u_t - \tilde{u}_t) + \frac{1}{2}(u_t - \tilde{u}_t)'W^{uu}(u_t - \tilde{u}_t) + \sum_{j=t+1}^T \left[\frac{1}{2}(x_{oj} - \tilde{x}_j)'W^{xx}(x_{oj} - \tilde{x}_j) + (x_{oj} - \tilde{x}_j)'W^{xu}(u_{oj} - \tilde{u}_j) + \frac{1}{2}(u_{oj} - \tilde{u}_j)'W^{uu}(u_{oj} - \tilde{u}_j)\right],$
where $(x_{oj}, u_{oj}) = (x_j^*, u_j^*)$ is a nominal path.
In order to calculate the components of the objective function we need to determine one more term, $K_t^{\Psi\Psi}$.
Use $K_t^{\Psi\Psi} = L^{\Psi\Psi} + H_{t+1}^{\Psi\Psi}$:
$K_t^{\Psi\Psi} = \begin{pmatrix} K^{xx} & K^{x\theta} \\ K^{\theta x} & K^{\theta\theta} \end{pmatrix}_t = \begin{pmatrix} L^{xx} & L^{x\theta} \\ L^{\theta x} & L^{\theta\theta} \end{pmatrix}_t + \begin{pmatrix} H^{xx} & H^{x\theta} \\ H^{\theta x} & H^{\theta\theta} \end{pmatrix}_{t+1} = \begin{pmatrix} W^{xx} & 0 \\ 0 & 0 \end{pmatrix}_t + \begin{pmatrix} H^{xx} & H^{x\theta} \\ H^{\theta x} & H^{\theta\theta} \end{pmatrix}_{t+1},$
$K_t^{xx} = W_t^{xx} + H_{t+1}^{xx}, \quad K_t^{x\theta} = H_{t+1}^{x\theta}, \quad K_t^{\theta x} = H_{t+1}^{\theta x}, \quad K_t^{\theta\theta} = H_{t+1}^{\theta\theta},$
where $H_{T+1}^{xx} = O_{n \times n}$, $H_{T+1}^{x\theta} = O_{n \times p}$, $H_{T+1}^{\theta x} = O_{p \times n}$, $H_{T+1}^{\theta\theta} = O_{p \times p}$.
Use the linearized system, i.e., apply:
$A = (I - F^x_{x_t})^{-1} F^x_{x_{t-1}}, \quad B = (I - F^x_{x_t})^{-1} F^x_{u_t}, \quad L^{u\prime} = x_t'W^{xu} + u_t'W^{uu} + (w^u)',$
$L^{uu} = W_t^{uu} \quad \text{and} \quad L^{xu} = W_t^{xu}$
to (26) and obtain the variables as used in Step I-2B in Section 3.3:
$\Lambda_t^{xx} = A_t'K_t^{xx}A_t, \quad \Lambda_t^{ux} = B_t'K_t^{xx}A_t + W_t^{ux}A_t, \quad \Lambda_t^{xu} = (\Lambda_t^{ux})', \quad \Lambda_t^{uu} = B_t'K_t^{xx}B_t + 2B_t'W_t^{xu} + W_t^{uu},$
$\lambda_t^x = k_t^xA_t, \quad \lambda_t^u = k_t^xB_t + x_t'W_t^{xu} + u_t'W_t^{uu} + (w^u)'.$

4. Computational Aspects

In this paper we concentrate on mathematical details of the OPTCON algorithm. However, it is important to mention the computational characteristics of this algorithm. In order to show computational time and accuracy we used two different models: the MacRae model and the ATOPT model. As it would go beyond the scope of the present paper, we do not discuss these models and their economic interpretation (see detailed descriptions in [18]) but provide computational details of applying different versions of the OPTCON algorithm. We compared four different optimization strategies: deterministic solution (det), stochastic open-loop solution (OL) as described in Section 3.1, stochastic passive learning solution (OLF) as described in Section 3.2, and stochastic active learning solution (AL) as described in Section 3.3. In the case of the active learning strategy, a grid search with 100 grid points for each control variable was applied (see parameter Π in Step-I-[2] in Section 3.3). All calculations were performed on a Windows 10 computer with 16 GB RAM. The OPTCON algorithm is programmed in MATLAB (MATLAB R2020a); the MATLAB computer code is available on request.
The MacRae model is a simple linear model with one state and one control variable. One parameter is stochastic and the optimal control was calculated for two periods only. Table 1 summarizes the computational performance of the OPTCON algorithm for the MacRae model. In the case of a linear-quadratic deterministic optimal control framework convergence can be shown analytically. This is true for the MacRae model, at least for the deterministic and the OL case. Moreover, as a check for the special case of a linear system, we compared the results of OPTCON with those of the same model calculated by available software and found them to be the same apart from round-off errors.
The ATOPT model is a small nonlinear model consisting of three equations, one control variable, and two stochastic parameters. The optimal control is calculated for five periods. Table 2 summarizes the computational performance of the OPTCON algorithm for the ATOPT model.

5. Concluding Remarks

In this paper, we described the OPTCON algorithm for the approximately optimal control of stochastic processes under a quadratic objective function. For systems with complete information, either deterministic or stochastic with known statistical characteristics of the disturbances, the open-loop version of OPTCON1 is suitable. OPTCON2 assumes partial information, in particular uncertain parameters of the system, with passive storage of information accruing during the control horizon, i.e., not used for control purposes. If active storage of information is possible, the dual control policy of OPTCON3 is appropriate. All three versions of OPTCON were programmed first in C# and then in MATLAB. We have applied OPTCON1 to various economic policy problems ([32,33], among others); Monte Carlo experiments with OPTCON2 and OPTCON3 have also yielded satisfactory results in terms of computing time. The results of approximately optimal policies were also as expected from the point of view of the economic problems under consideration. Due to the numerical nature of the algorithm, we cannot prove convergence in general; however, no problems of non-convergence have occurred so far. For future research, more applications are desirable to gain more insight into the different policies resulting from the various assumptions about the information structure of adaptive stochastic policy problems.

Author Contributions

Conceptualization, R.N., D.B. and V.B.-N.; methodology, V.B.-N., R.N. and D.B.; software, D.B.; formal analysis, V.B.-N.; writing—original draft preparation, V.B.-N., D.B. and R.N.; writing—review and editing, R.N., D.B. and V.B.-N.; visualization, D.B. and V.B.-N.; supervision, R.N.; funding acquisition, V.B.-N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Austrian Science Fund FWF (grant T 1012-GBL).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Athans, M. The role and use of the stochastic linear-quadratic-Gaussian problem in control system design. IEEE Trans. Autom. Control 1971, 16, 529–552. [Google Scholar] [CrossRef]
  2. Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
  3. Bellman, R. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, NJ, USA, 1961. [Google Scholar]
  4. Fel’dbaum, A.A. Optimal Control Systems; New York Academic Press: New York, NY, USA, 1965. [Google Scholar]
  5. Aoki, M. Optimization of Stochastic Systems: Topics in Discrete-Time Systems; Academic Press: Cambridge, MA, USA, 1967; Volume 32. [Google Scholar]
  6. Bertsekas, D.P. Dynamic Programming and Optimal Control; Athena Scientific: Belmont, MA, USA, 2005; Volume 1, 2. [Google Scholar]
  7. Powell, W. Approximate Dynamic Programming: Solving the Curses of Dimensionality; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
  8. Si, J.; Barto, A.; Powell, W.; Wunsch, D. Handbook of Learning and Approximate Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 2. [Google Scholar]
  9. Matulka, J.; Neck, R. OPTCON: An Algorithm for the Optimal Control of Nonlinear Stochastic Models. Ann. Oper. Res. 1992, 37, 375–401. [Google Scholar] [CrossRef]
  10. Blueschke-Nikolaeva, V.; Blueschke, D.; Neck, R. Optimal Control of Nonlinear Dynamic Econometric Models: An Algorithm and an Application. Comput. Stat. Data Anal. 2012, 56, 3230–3240. [Google Scholar] [CrossRef] [Green Version]
  11. Kendrick, D.A.; Amman, H.M. A Classification System for Economic Stochastic Control Models. Comput. Econ. 2006, 27, 453–481. [Google Scholar] [CrossRef]
  12. Bar-Shalom, Y.; Tse, E. Dual effect, certainty equivalence, and separation in stochastic control. IEEE Trans. Autom. Control 1974, 19, 494–500. [Google Scholar] [CrossRef]
  13. Bar-Shalom, Y.; Tse, E. Caution, probing, and the value of information in the control of uncertain systems. In Annals of Economic and Social Measurement; NBER: Cambridge, MA, USA, 1976; Volume 5, pp. 323–337. [Google Scholar]
  14. Tse, E.; Bar-Shalom, Y.; Meier, L. Wide-sense adaptive dual control for nonlinear stochastic systems. IEEE Trans. Autom. Control 1973, 18, 98–108. [Google Scholar] [CrossRef]
  15. Kendrick, D.A. Stochastic Control for Economic Models; McGraw-Hill: New York, NY, USA, 1981. [Google Scholar]
  16. Kendrick, D.A. Caution and probing in a macroeconomic model. J. Econ. Dyn. Control 1982, 4, 149–170. [Google Scholar] [CrossRef]
  17. Amman, H.; Kendrick, D. Active learning monte carlo results. J. Econ. Dyn. Control 1994, 18, 119–124. [Google Scholar] [CrossRef]
  18. Blueschke-Nikolaeva, V.; Blueschke, D.; Neck, R. OPTCON3: An Active Learning Control Algorithm for Nonlinear Quadratic Stochastic Problems. Comput. Econ. 2019, 56, 145–162. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  19. Behrens, D.; Neck, R. Approximating Solutions for Nonlinear Dynamic Tracking Games. Comput. Econ. 2015, 45, 407–433. [Google Scholar] [CrossRef] [Green Version]
  20. Athans, M.; Falb, P. Optimal Control: An Introduction to the Theory and Its Applications; McGraw-Hill Book Co.: New York, NY, USA, 1966. [Google Scholar]
  21. Chow, G.C. Analysis and Control of Dynamic Economic Systems; John Wiley & Sons: New York, NY, USA, 1975. [Google Scholar]
  22. Chow, G.C. Econometric Analysis by Control Methods; John Wiley & Sons: New York, NY, USA, 1981. [Google Scholar]
  23. Blueschke-Nikolaeva, V. OPTCON2: An Algorithm for the Optimal Control of Nonlinear Stochastic Models; Südwestdeutscher Verlag für Hochschulschriften: Saarbrücken, Germany, 2013. [Google Scholar]
  24. Conn, A.R.; Gould, N.I.M.; Toint, P.L. Trust-Region Methods; Society for Industrial and Applied Mathematics and Mathematical Programming Society: Philadelphia, PA, USA, 2000. [Google Scholar]
  25. Coleman, T.; Li, Y. An interior trust region approach for nonlinear minimization subject to bounds. SIAM J. Optim. 1996, 6, 418–445. [Google Scholar] [CrossRef] [Green Version]
  26. Kelley, C. Solving Nonlinear Equations with Newton’s Method; SIAM: Philadelphia, PA, USA, 2003. [Google Scholar]
  27. Rheinboldt, W. Methods for Solving Systems of Nonlinear Equations; SIAM: Philadelphia, PA, USA, 1998. [Google Scholar]
  28. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82D, 33–45. [Google Scholar] [CrossRef] [Green Version]
  29. Welch, G.; Bishop, G. An Introduction to the Kalman Filter; Technical Report; Department of Computer Science, University of North Carolina: Chapel Hill, NC, USA, 2006. [Google Scholar]
  30. Blueschke, D.; Blueschke-Nikolaeva, V.; Neck, R. Stochastic control of linear and nonlinear econometric models: Some computational aspects. Comput. Econ. 2013, 42, 107–118. [Google Scholar] [CrossRef]
  31. Bar-Shalom, Y.; Tse, E.; Larson, R. Some recent advances in the development of closed-loop stochastic control and resource allocation algorithms. In Proceedings of the IFAC Symposium on Stochastic Control, Budapest, Hungary, 25–27 September 1974. [Google Scholar]
  32. Neck, R.; Matulka, J. Stochastic optimum control of macroeconometric models using the algorithm OPTCON. Eur. J. Oper. Res. 1994, 73, 384–405. [Google Scholar] [CrossRef]
  33. Neck, R.; Karbuz, S. Optimal budgetary and monetary policies under uncertainty: A stochastic control approach. Ann. Oper. Res. 1995, 58, 379–402. [Google Scholar] [CrossRef]
Figure 1. Flow chart of OPTCON1.
Figure 2. Flow chart of OPTCON3.
Table 1. Computational aspects of optimal control for the MacRae model.

Method | Computational Time | Convergence ¹
--- | --- | ---
det | 0.024879 s | 2
OL | 0.036867 s | 2
OLF | 0.134725 s | 2
AL | 0.771425 s | 2
¹ The number of iterations for convergence is available for the deterministic and OL solution techniques. In the case of the OLF and AL solution techniques the number of iterations for S = 1 is given, as this is the optimization for the whole planning horizon.
Table 2. Computational aspects of optimal control for the ATOPT model.

Method | Computational Time | Convergence
--- | --- | ---
det | 0.148106 s | 15
OL | 0.134021 s | 7
OLF | 0.265504 s | 7
AL | 15.726499 s | 7