Convergence Analysis of a Stochastic Progressive Hedging Algorithm for Stochastic Programming

Stochastic programming is an approach for solving optimization problems with uncertain data whose probability distribution is assumed to be known, and the progressive hedging algorithm (PHA) is a well-known decomposition method for solving the underlying model. However, the per iteration computation of PHA can be very costly, since it solves a large number of subproblems corresponding to all the scenarios. In this paper, a stochastic variant of PHA is studied. At each iteration, only a small fraction of the scenarios are selected uniformly at random and the corresponding variable components are updated accordingly, while the variable components corresponding to the unselected scenarios are kept untouched. Therefore, the per iteration cost can be controlled freely to achieve very fast iterations. We show that, though the per iteration cost is reduced significantly, the proposed stochastic PHA converges in an ergodic sense at the same sublinear rate as the original PHA.


Introduction
Stochastic programming (SP) and robust optimization are two main approaches to deal with optimization problems with uncertain data. Unlike robust optimization, which usually assumes that the unknown data lies in a certain region, SP takes advantage of the fact that the probability distribution of the uncertain data is known or can be estimated. Therefore, SP is frequently applied to settings in which decisions are made repeatedly in essentially the same circumstances, and the goal is to make decisions that perform well on average. It has many applications in finance, portfolio investment, energy optimization, wireless communication, machine learning, etc.; see the monographs [1,17].
We now review very briefly the multistage convex SP model introduced by Rockafellar and Wets [15]. In an N-stage decision problem, decisions are made and uncertain information is revealed alternatingly and sequentially as follows:

(x_1, ξ_1, x_2, ξ_2, . . . , x_N, ξ_N).

Here, for each j, x_j denotes the decision to be made at stage j, and ξ_j represents the uncertain information revealed after decision x_j and before x_{j+1}. SP aims at determining an optimal response to the information available at each stage. Let Ξ denote the set of possible scenarios ξ = (ξ_1, . . . , ξ_N). Throughout this paper, we assume that Ξ has a finite cardinality and the probability distribution of Ξ is given, i.e., each ξ ∈ Ξ has a known probability p(ξ) such that ∑_{ξ∈Ξ} p(ξ) = 1. Assume that x_j ∈ ℜ^{n_j} and let n := ∑_{j=1}^N n_j. A response mapping x(·) : Ξ → ℜ^n is given by

x(ξ) = (x_1(ξ), x_2(ξ), . . . , x_N(ξ)), ξ ∈ Ξ.

The set of all response mappings from Ξ to ℜ^n, which forms a linear space, is denoted by L. For convenience, we endow L with the following weighted inner product

⟨x(·), y(·)⟩ := ∑_{ξ∈Ξ} p(ξ) ∑_{j=1}^N ⟨x_j(ξ), y_j(ξ)⟩, ∀ x(·), y(·) ∈ L, (1.1)

where ⟨x_j(ξ), y_j(ξ)⟩ = x_j(ξ)^T y_j(ξ) denotes the usual inner product in ℜ^{n_j}. The nonanticipativity constraint on the decision mapping is encoded by the set

N = {x(·) ∈ L | x_j(ξ) does not depend on ξ_j, . . . , ξ_N, j = 1, 2, . . . , N}.
It is easy to show that N forms a subspace of L and that its orthogonal complement is given by

N⊥ = {y(·) ∈ L | E_{ξ|ξ_1,...,ξ_{j−1}} y_j(ξ) = 0, j = 1, 2, . . . , N},

where E_{ξ|ξ_1,...,ξ_{j−1}} y_j(ξ) denotes the conditional expectation of y_j(ξ) given ξ_1, . . . , ξ_{j−1}. For each ξ ∈ Ξ, let C(ξ) be a nonempty closed convex subset of ℜ^n and define

C := {x(·) ∈ L | x(ξ) ∈ C(ξ) for all ξ ∈ Ξ}.

Besides the nonanticipativity constraint, we assume that the decision mapping satisfies the abstract constraint x(·) ∈ C. This constraint on the response mapping can be specified in particular applications. In this paper, we are interested in the progressive hedging algorithm (PHA) and the convergence analysis of a stochastic variant of it, and it is not necessary to go beyond an abstract representation of the constraint on the response mapping. We assume that C ∩ N ≠ ∅. For each ξ ∈ Ξ, we let g(·, ξ) : C(ξ) → ℜ be a lower semicontinuous convex function. An alternating sequence of decisions and observations (x_1, ξ_1, x_2, ξ_2, . . . , x_N, ξ_N) results in a cost g(x, ξ), which depends on both x and ξ, and each choice of x(·) ∈ L thus yields a function from Ξ to ℜ, namely ξ ↦ g(x(ξ), ξ). Clearly, g(x(·), ·) can be regarded as a random variable. A commonly used risk measure in SP is the expected cost G : L → ℜ given by

G(x(·)) := E_ξ g(x(ξ), ξ) = ∑_{ξ∈Ξ} p(ξ) g(x(ξ), ξ).

Here E_ξ denotes the expectation taken with respect to ξ. The classical SP model with nonanticipativity constraint originally proposed by Rockafellar and Wets in [15] aims to find x(·) ∈ C ∩ N such that the expected cost G(x(·)) is minimized, i.e.,

min {G(x(·)) | x(·) ∈ C ∩ N}. (1.2)

An important property of the expected cost is that it is completely separable with respect to scenarios, which often fulfills an important requirement in designing decomposition algorithms, e.g., PHA. The use of this risk measure also implies that the decision maker is risk neutral, since a response mapping that performs well on average is pursued. Risk averse measures are also popular in SP, e.g., the conditional value-at-risk measure. However, the separability feature is lost in that case and some additional effort is required to reactivate PHA; see the recent work [12].
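To make the finite-scenario setup concrete, here is a minimal sketch of the expected cost G and of the projection onto the nonanticipativity subspace N for a hypothetical two-stage toy problem; the scenario values, probabilities, and cost function below are illustrative assumptions, not data from this paper. In the two-stage case, projecting onto N simply replaces the first-stage component by its probability-weighted mean, i.e., a conditional expectation given no revealed information.

```python
import numpy as np

# Hypothetical two-stage toy data (illustrative, not from the paper):
# each scenario reveals a single observation xi_1 after the stage-1 decision.
scenarios = [(-1.0,), (0.0,), (2.0,)]      # values of xi_1 per scenario
probs = np.array([0.3, 0.4, 0.3])          # p(xi), summing to 1

def expected_cost(x, g):
    """G(x(.)) = sum over xi of p(xi) * g(x(xi), xi); x is an (m, n) array
    whose i-th row is the response x(xi^i)."""
    return sum(p * g(row, s) for p, s, row in zip(probs, scenarios, x))

def project_nonanticipative(x):
    """Projection onto N (w.r.t. the weighted inner product): the stage-1
    component must not depend on the observation, so it is replaced by its
    probability-weighted mean; the stage-2 component may depend on xi_1."""
    x = np.asarray(x, dtype=float).copy()
    x[:, 0] = probs @ x[:, 0]
    return x
```

Note that the projection is idempotent and leaves the stage-2 components untouched, in line with the definition of N above.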
We organize this paper as follows. In Section 2, we state the connections between the PHA and the classical alternating direction method of multipliers (ADMM). A stochastic variant of PHA is proposed in Section 3 and the main convergence results are given in Section 4. A summary is given in Section 5.

PHA and its connection with ADMM
PHA was originally introduced in [15] for multistage convex SP and has recently been extended in [13] to multistage stochastic monotone linear complementarity and variational inequality problems. It was shown in [15, Theorem 5.1] and [13, Theorem 1] that PHA is a customized application of the proximal point algorithm [7,9]. In this section, we first review very briefly the PHA for SP. In particular, we show that PHA for the SP problem (1.2) is an application of the classical ADMM for linearly constrained separable convex optimization. This explanation serves as the basis of the stochastic PHA, which can essentially be viewed as a randomized variant of ADMM.

Algorithm 1 (PHA). Given x^0(·) ∈ N, w^0(·) ∈ N⊥ and β > 0, iterate for k = 0, 1, . . .:
1. For each ξ ∈ Ξ, compute y^{k+1}(ξ) = argmin_{y ∈ C(ξ)} g(y, ξ) + ⟨w^k(ξ), y⟩ + (β/2)∥y − x^k(ξ)∥².
2. Set x^{k+1}(·) = P_N(y^{k+1}(·)).
3. Set w^{k+1}(·) = w^k(·) + β(y^{k+1}(·) − x^{k+1}(·)).

Note that the norm in the minimand of Step 1 is the common 2-norm. By categorizing PHA as a proximal point algorithm, it is explained in [15,13] that the sequence generated by PHA is contractive with respect to the solution set. Linear convergence rate results are also given in the linear quadratic case [15, Theorem 5.2], i.e., when the C(ξ)'s are polyhedra and the g(·, ξ)'s are convex quadratic functions.
To apply ADMM, we introduce an auxiliary variable y(·) and rewrite (1.2) as

min {G(y(·)) | x(·) = y(·), x(·) ∈ N, y(·) ∈ C}. (2.1)

Although the variables are duplicated, the constraint x(·) ∈ C ∩ N is now separated into x(·) ∈ N and y(·) ∈ C. Moreover, the objective function, as well as the constraints x(·) = y(·) and y(·) ∈ C, are now separable with respect to each scenario. This makes the classical ADMM [4,2] for linearly constrained separable convex optimization readily applicable. Let β > 0 be a penalty parameter and w(·) ∈ L be a Lagrange multiplier. The augmented Lagrangian function associated with (2.1) is given by

L_β(x(·), y(·), w(·)) := G(y(·)) + ⟨w(·), y(·) − x(·)⟩ + (β/2)∥y(·) − x(·)∥².

[Stat., Optim. Inf. Comput. Vol. 8, September 2020; ZHENGUO MU, JUNFENG YANG]

Note that here the norm in the last term is the one induced by the weighted inner product (1.1). Given x^0(·) ∈ L and w^0(·) ∈ L, the ADMM iterates, for k = 0, 1, . . ., as

y^{k+1}(·) = argmin_{y(·) ∈ C} L_β(x^k(·), y(·), w^k(·)), (2.2a)
x^{k+1}(·) = argmin_{x(·) ∈ N} L_β(x(·), y^{k+1}(·), w^k(·)), (2.2b)
w^{k+1}(·) = w^k(·) + β(y^{k+1}(·) − x^{k+1}(·)). (2.2c)

The following theorem summarizes the well known fact that PHA is an application of ADMM. For completeness, we give a proof.
Theorem 2. Assume that w 0 (·) ∈ N ⊥ . Then, the ADMM (2.2) generates exactly the same sequence of points as the PHA given in Algorithm 1, i.e., PHA is an application of ADMM.

Proof
It is easy to observe that (2.2a) is separable with respect to each scenario ξ ∈ Ξ, and (2.2b) is no more than a projection onto N. Thus, (2.2) is equivalent to

y^{k+1}(ξ) = argmin_{y ∈ C(ξ)} g(y, ξ) + ⟨w^k(ξ), y⟩ + (β/2)∥y − x^k(ξ)∥², ∀ ξ ∈ Ξ, (2.3a)
x^{k+1}(·) = P_N(y^{k+1}(·) + w^k(·)/β), (2.3b)
w^{k+1}(·) = w^k(·) + β(y^{k+1}(·) − x^{k+1}(·)). (2.3c)

Note that the norm in the minimand of (2.3a) is the common 2-norm. Since w^0(·) ∈ N⊥, it follows from (2.3b) and (2.3c) that w^k(·) ∈ N⊥ and x^{k+1}(·) = P_N(y^{k+1}(·)) for all k ≥ 0. Thus, the ADMM (2.3) generates exactly the same sequence of points as the PHA given in Algorithm 1, i.e., the PHA for multistage convex SP is an application of the ADMM.
Compared with the classical augmented Lagrangian method [6,8], which, for fixed w(·), minimizes L_β(x(·), y(·), w(·)) jointly over x(·) ∈ N and y(·) ∈ C, an important advantage of PHA/ADMM is that it decomposes this large and coupled subproblem into many smaller pieces. In particular, PHA needs to solve at each iteration the following subproblems:

y^{k+1}(ξ) = argmin_{y ∈ C(ξ)} g(y, ξ) + ⟨w^k(ξ), y⟩ + (β/2)∥y − x^k(ξ)∥², ξ ∈ Ξ.

Although this step is fully separable, in cases when the cardinality of Ξ is large, even a single iteration of PHA can be very costly. Note that |Ξ| = ∏_{j=1}^N |Ξ_j|, which can be very large when either N or some of the |Ξ_j|'s are large. On the other hand, the projection onto N is much cheaper. Given the above connection between PHA and ADMM, we propose in the next section a stochastic variant of PHA, which at each iteration selects a small fraction of the ξ's in Ξ and updates the corresponding variables only. Since the number of selected scenarios can be controlled, the per iteration cost of this stochastic PHA can be significantly lower than that of the original algorithm.
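As an illustration of the scenario-wise subproblems followed by the projection onto N, the following sketch runs one PHA iteration in a hypothetical two-stage setting. The quadratic costs g_i(y) = 0.5∥y − t_i∥² and the target vectors are illustrative assumptions made so that each subproblem has a closed-form minimizer; in general each subproblem is a convex program and requires an iterative solver.

```python
import numpy as np

m, n, beta = 3, 2, 1.0
probs = np.array([0.3, 0.4, 0.3])
# Illustrative quadratic scenario costs g_i(y) = 0.5*||y - t_i||^2.
targets = np.array([[1.0, -1.0], [0.0, 2.0], [-2.0, 0.5]])

def scenario_subproblem(t, w, x):
    # argmin_y 0.5*||y - t||^2 + <w, y> + (beta/2)*||y - x||^2  (closed form)
    return (t - w + beta * x) / (1.0 + beta)

def project_N(v):
    # Toy two-stage nonanticipativity: stage-1 coordinate replaced
    # by its probability-weighted mean across scenarios.
    v = v.copy()
    v[:, 0] = probs @ v[:, 0]
    return v

def pha_iteration(x, w):
    # Scenario-wise proximal subproblems (the costly, fully separable step).
    y = np.stack([scenario_subproblem(targets[i], w[i], x[i]) for i in range(m)])
    x_new = project_N(y)             # projection onto N (w^k lies in N-perp)
    w_new = w + beta * (y - x_new)   # multiplier update, stays in N-perp
    return x_new, y, w_new
```

Iterating from x^0 = 0 and w^0 = 0 drives x^k to the solution of this toy instance: the first-stage component converges to the probability-weighted mean of the targets' first coordinates, the second-stage components converge to the targets, and w^k remains in N⊥ throughout.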

A stochastic PHA
For clarity of presentation and analysis, we denote the decision variables by vectors instead of mappings. For this purpose, we now redefine some notation. Let m := |Ξ|, write Ξ = {ξ^1, . . . , ξ^m}, and set p_i := p(ξ^i) for i = 1, . . . , m. For any positive integer q, we let [q] := {1, 2, . . . , q}. Throughout this paper, we endow ℜ^{qn} with the usual inner product for q ≤ m − 1, while for q = m, i.e., ℜ^{mn}, we endow it with the weighted inner product

⟨u, v⟩ := ∑_{i=1}^m p_i ⟨u_i, v_i⟩, u = (u_1, . . . , u_m), v = (v_1, . . . , v_m) ∈ ℜ^{mn}. (3.1)

The norm ∥·∥ always denotes the norm induced by the corresponding endowed inner product. With all the above notation, (2.1) can be simplified to

min {g(y) := ∑_{i=1}^m p_i g_i(y_i) | x = y, x ∈ N, y ∈ C}, (3.2)

where N is a subspace of ℜ^{mn} corresponding to the nonanticipativity constraint, g_i(·) := g(·, ξ^i), and each C_i := C(ξ^i) is a nonempty closed convex subset of ℜ^n with C := C_1 × · · · × C_m. The augmented Lagrangian function associated with (3.2) is

L_β(x, y, w) := g(y) + ⟨w, y − x⟩ + (β/2)∥y − x∥².

Note that here the inner product in the middle term is that defined in (3.1) and the norm in the last term is the corresponding induced norm. Based on the equivalence of PHA and ADMM established in Section 2, we propose to solve (3.2) by the following stochastic variant of PHA, which can essentially be viewed as a randomized variant of ADMM.

Algorithm 3 (Stochastic PHA). Given x^0 = y^0 ∈ ℜ^{mn}, w^0 ∈ N⊥, β > 0 and θ ∈ (0, 1) such that θm is an integer, set ρ = θβ and k = 0.
1. Generate S_k ⊆ [m] uniformly at random with a fixed cardinality |S_k| = θm.
2. Perform the following updates:
y_i^{k+1} = argmin_{y_i ∈ C_i} g_i(y_i) + ⟨w_i^k, y_i⟩ + (β/2)∥y_i − x_i^k∥² for i ∈ S_k, and y_i^{k+1} = y_i^k for i ∉ S_k, (3.3a)
x^{k+1} = P_N(y^{k+1}), (3.3b)
w^{k+1} = w^k + ρ(y^{k+1} − x^{k+1}). (3.3c)
3. Set k = k + 1 and repeat until convergence.
Here, for simplicity of analysis, we assume that S_k has a fixed cardinality θm. Since w^0 ∈ N⊥, it follows from (3.3b)-(3.3c) that w^k ∈ N⊥ for all k ≥ 1. From (3.3a), only the variable components {y_i : i ∈ S_k} are updated at the kth iteration, while {y_i : i ∉ S_k} are kept untouched. Since the cardinality of S_k can be controlled freely, the per iteration cost of this stochastic variant of PHA can be significantly smaller than that of the original PHA. Our main contribution is to show in the next section that this stochastic PHA shares, in an ergodic sense, the same sublinear rate of convergence as the original algorithm.
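Under the same illustrative quadratic assumptions as in the deterministic sketch above, a stochastic PHA iteration in the spirit of (3.3a)-(3.3c) can be written as follows: only the sampled components of y are touched, and the multiplier step uses the damped step size ρ = θβ. All problem data here (sizes, probabilities, targets) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, beta, theta = 10, 2, 1.0, 0.3   # sample |S_k| = theta*m scenarios per iteration
probs = np.full(m, 1.0 / m)
rho = theta * beta                    # dual step size used in (3.3c)
# Illustrative quadratic costs g_i(y) = 0.5*||y - t_i||^2, as in the PHA sketch.
targets = rng.normal(size=(m, n))

def project_N(v):
    v = v.copy()
    v[:, 0] = probs @ v[:, 0]         # toy two-stage nonanticipativity
    return v

def stochastic_pha_iteration(x, y, w):
    S = rng.choice(m, size=round(theta * m), replace=False)  # S_k, uniform
    y = y.copy()
    for i in S:                       # (3.3a): solve only the sampled subproblems
        y[i] = (targets[i] - w[i] + beta * x[i]) / (1.0 + beta)
    x_new = project_N(y)              # (3.3b): x^{k+1} = P_N(y^{k+1})
    w_new = w + rho * (y - x_new)     # (3.3c): damped multiplier step
    return x_new, y, w_new
```

Because only θm of the m subproblems are solved per iteration, the per iteration cost scales with θ; the unsampled components of y are simply carried over from the previous iterate.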

Convergence analysis
Below we analyze the convergence properties of the stochastic PHA given in Algorithm 3. Throughout the convergence analysis, we make the following assumption.
Let S = S_k, y_S = (y_i)_{i∈S}, p_S = (p_i)_{i∈S} and g_S(y_S) = ∑_{i∈S} g_i(y_i). Let E_S denote the expectation with respect to S conditional on all previous history, while E denotes the overall expectation. For convenience of analysis, we further introduce a sequence of dummy vectors {w̃^k}, defined in (4.1). We first establish some lemmas, which characterize the one-step convergence behavior of the stochastic PHA.
Lemma 5. For all y ∈ C and k ≥ 0, inequality (4.2) holds.

Proof
Let y ∈ C be arbitrarily fixed, and consider the optimality condition of (3.3a). Multiplying both sides by p_i and taking the sum over i ∈ S, we obtain (4.3). Here and hereafter, we let "⊗" denote the Kronecker product. Note that we assume θ < 1, and thus |S| = θm < m; therefore, the inner product in (4.3) is the standard one. It follows from the convexity of g_i, i ∈ S, and the update (3.3a) that (4.4) holds. By taking expectation E_S on both sides and using the fact that y_i^{k+1} = y_i^k for all i ∉ S, we can rewrite the resulting cross term as in (4.5). Note that the inner product in the term ⟨y^{k+1} − y^k, w^k⟩ in (4.5) is defined by (3.1), since the corresponding space is ℜ^{mn}, while the inner products in all other terms of this equation are standard. Taking expectation E_S on both sides, we obtain (4.6). By taking expectation E_S on (4.3), plugging in (4.4) and (4.6), and rearranging terms, we obtain (4.7). Similarly to (4.6), we have (4.8). Plugging (4.8) into (4.7) and rearranging terms, we obtain (4.2).
Lemma 6. Let T > 0 be an integer. For any w and (x, y) such that x = y ∈ N ∩ C, it holds that (4.9)

Proof
Taking expectation on both sides of (4.2) and summing from k = 0 through T, we obtain (4.10). Recall that w̃^{T+1} is defined in (4.1). Plugging this identity into (4.10) and rearranging terms, we obtain (4.11). Since w^k, w̃^k ∈ N⊥ and x^k ∈ N for all k ≥ 1, and x = y ∈ N ∩ C, (3.3b) yields a family of identities valid for all k = 0, 1, . . . , T − 1. Plugging these identities into (4.11), we obtain (4.12). Since y^{k+1} − x^{k+1} = (1/ρ)(w^{k+1} − w^k) and ρ = θβ, it holds that (4.14). Recalling x = y, we have (4.17). It follows from (4.14) and (4.17) that (4.18) holds. Taking expectation on (4.13) and (4.18) and adding them to (4.12), we obtain (4.9). Now, we are ready to establish a key result for the ergodic convergence.
Theorem 7. Assume x^0 = y^0 and w^0 = 0. Let T > 0 be any integer and define the ergodic averages x̄_T and ȳ_T of the iterates. For any w and x = y ∈ N ∩ C, we have (4.19), where the inequality follows from x^{k+1} = P_N(y^{k+1}). Apparently, this holds trivially if x^0 ∈ N, since in that case x^k ∈ N and w^k ∈ N⊥ for all k ≥ 0. In addition, (4.20), (4.21) and (4.22) hold. Adding (4.20), (4.21) and (4.22) together, we obtain (4.23). Taking expectation on both sides of (4.23) and adding it to (4.9), we obtain (4.24).

Since x^0 = y^0 and w^0 = 0, the desired bound follows from the convexity of g and (4.24). We have thus proved (4.19).
To prove the main convergence result, we need the following lemmas, which follow directly from [3, Lemmas 3.3, 3.5].
Lemma 8. Let h be a continuous function, and let x̄ and ȳ be random vectors. If, for any (x, y, w) with x = y, which may depend on x̄ and ȳ, it holds that

then for any γ > 0 and any optimal solution (x*, y*) we also have

Lemma 9. Let (x*, y*, w*) be an optimal solution and γ > ∥w*∥. If a random vector (x̄, ȳ) satisfies

The following theorem summarizes the main convergence result of this paper.

Concluding remarks
The progressive hedging algorithm (PHA) is a classical decomposition approach for solving multistage convex stochastic programming (SP). It decomposes a large and complicated problem into many smaller pieces and can be implemented in parallel. Due to this feature, it is especially advantageous in large-scale computing. Thus, PHA has been used in various domains ever since it was invented. Lately, it has become even more popular due to the series of works by Rockafellar and his collaborators; see [16,13,12,10,11,14].
At each iteration, PHA solves a convex programming problem for each scenario. Since the cardinality of the scenario set can be very large [18,1,17], the per iteration cost of PHA can be prohibitive in practice. Based on a connection between PHA and the alternating direction method of multipliers, in this paper we proposed a stochastic PHA, which selects a small fraction of the scenarios and updates the corresponding variables only. The number of selected scenarios can be controlled freely; thus, the per iteration cost of the new algorithm can be significantly lower. Furthermore, our analysis has shown that the new algorithm shares, in expectation, the same sublinear ergodic convergence rate as the original PHA. An interesting problem is to extend the present analysis for SP to the case of multistage stochastic variational inequalities [16,13].