Nonmonotone Spectral Gradient Method for l1-regularized Least Squares

In this paper, we investigate a linearly constrained optimization reformulation of a more general form of the l1 regularization problem and establish some of its useful properties. We first show the equivalence between the linearly constrained optimization problem and the l1 regularization problem. Second, since the constraints are linear, a KKT point of the linearly constrained problem always exists; we show that half of the constraints must be active at any KKT point. In addition, we show that the KKT points of the linearly constrained problem coincide with the stationary points of the l1 regularization problem. Based on this reformulation, we propose a nonmonotone spectral gradient method and establish its global convergence. Numerical experiments on compressed sensing problems show that our approach is competitive with several known methods for the standard l2-l1 problem.


Introduction
In recent years, many studies have focused on the following ℓ1 regularization problem:

min_{x ∈ R^n} f(x) + µ∥x∥1,  (1)

where f is continuously differentiable, µ is a given nonnegative regularization parameter and ∥·∥1 is the one-norm.
A particular case of (1) is the so-called ℓ2-ℓ1 problem

min_{x ∈ R^n} (1/2)∥Ax − b∥^2 + µ∥x∥1,  (2)

where A ∈ R^{m×n} is dense (usually m ≤ n), b ∈ R^m and n is large. This problem has attracted much attention in signal/image denoising and data mining/classification [5,9,17]. Various types of algorithms have been proposed for solving (1). One of the most popular classes of methods for solving problem (2) is that of iterative shrinkage-thresholding algorithms (ISTA), in which each iteration involves matrix-vector multiplications with A and A^T followed by a shrinkage/soft-threshold step; see, e.g., [13,21]. To accelerate convergence, a two-step IST algorithm (TwIST) was developed in [6], sequential subspace optimization techniques were added to ISTA in [20], a faster shrinkage-thresholding algorithm called FISTA was constructed in [4], and a fixed-point continuation scheme (FPC) was proposed in [24]. To further improve the practical performance of the above methods, Wright, Nowak and Figueiredo [35] introduced the sparse reconstruction by separable approximation (SpaRSA) algorithm for solving (1); its rules for choosing the steplength parameter and its line search are quite different from those of the earlier methods. Hager, Phan and Zhang [25] analyzed the convergence rate of SpaRSA and proposed an improved version based on a cyclic version of the BB iteration and an adaptive choice of the reference function value in the line search. Wen, Yin, Goldfarb and Zhang [33,34] proposed the active-set algorithm FPC_AS for (2) by adding an active-set (AS) step to FPC [24].
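The shrinkage/soft-threshold step at the heart of ISTA-type methods can be sketched as follows. This is a minimal illustration in Python rather than the paper's MATLAB; the function names and the fixed steplength are our own choices, not taken from [13,21]:

```python
import numpy as np

def soft_threshold(v, tau):
    """Componentwise shrinkage: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, mu, step, iters=200, x0=None):
    """Basic iterative shrinkage-thresholding for 0.5*||Ax - b||^2 + mu*||x||_1.

    Each iteration costs one product with A and one with A.T,
    followed by a shrinkage step. `step` should be at most 1/||A^T A||.
    """
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    for _ in range(iters):
        grad = A.T @ (A @ x - b)           # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * mu)
    return x
```

With A equal to the identity, the method reduces to a single shrinkage of b, which makes the fixed point easy to verify by hand.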
Another class of algorithms, such as interior point methods [11,26,30], projected gradient methods [22] and the alternating direction method of multipliers SALSA [1,8], is designed to solve constrained optimization reformulations of the ℓ1 regularized problem. Other algorithms for ℓ1 minimization include coordinate-wise descent methods [31]; Bregman iterative regularization based methods [36]; gradient methods [27] for minimizing the more general function J(x) + H(x), where J is nonsmooth, H is smooth, and both are convex; and a smoothed penalty algorithm (SPA) [2]. We refer to [5,9,17] for a review of recent advances in this area.
The ℓ1 regularized problem (1) can be transformed into a convex quadratic program with linear inequality constraints, and many standard interior point methods [11,26,30] have been developed for solving the equivalent quadratic program. However, some numerical results [22,26,35] show that interior point methods [11,26,30] are slow. In this paper, we investigate a linearly constrained optimization reformulation of the more general form of the ℓ1 regularization problem and establish some of its useful properties. We first show the equivalence between the linearly constrained optimization reformulation and the ℓ1 regularization problem. Second, since the constraints are linear, a KKT point of the linearly constrained problem always exists; we show that half of the linear constraints must be active at any KKT point. Finally, we show that the KKT points of the linearly constrained problem coincide with the stationary points of the ℓ1 regularization problem. Based on this reformulation, we propose a projection gradient algorithm. To accelerate its convergence, the Barzilai-Borwein steplength is combined with a nonmonotone line search technique. The use of the projection gradient method reduces the storage requirement, so the method can be used to solve large-scale problems. In addition, the method has the following advantages: (a) it is suitable for the general problem (1); (b) based on a nonmonotone line search technique [23], it is shown to be globally convergent; (c) the main computational burden at each iteration is the matrix-vector multiplications involving A and A^T; (d) preliminary numerical experiments show that it is effective and competitive with well-known existing methods.
The remainder of the paper is organized as follows. We investigate some interesting properties of the reformulation in Section 2. In Section 3, we propose the algorithm and establish its global convergence. Numerical results are reported in Section 4, and conclusions are drawn in the last section.
Throughout the paper, ∥·∥ denotes the Euclidean norm of a vector. For an index set T ⊂ {1, 2, …, n}, A_T and x_T denote the submatrix of columns of A and the subvector of entries of x whose indices are in T, respectively.

Equivalent Form and Properties
The first key step of our approach is to express (1) as a constrained optimization problem, as in [26]. Specifically, problem (1) can be transformed into the following problem with linear inequality constraints:

min_{u,x} f(x) + µ ∑_{i=1}^n u_i  subject to  −u_i ≤ x_i ≤ u_i, i = 1, …, n.  (3)

In this section, we show the equivalence between the linearly constrained optimization problem (3) and the ℓ1 regularization problem (1). Second, a KKT point of the linearly constrained problem always exists since the constraints are linear; we show that half of the constraints must be active at any KKT point. Finally, we show that the KKT points of the linearly constrained problem coincide with the stationary points of the ℓ1 regularization problem. The following theorem establishes the equivalence between (1) and (3).
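The equivalence rests on the observation that, for fixed x, the smallest feasible u in (3) is u = |x|, so the two objectives coincide at the minimum. A small numerical illustration (the helper names are our own, assuming the reformulation stated above):

```python
import numpy as np

def obj_l1(x, f, mu):
    """Objective of (1): f(x) + mu * ||x||_1."""
    return f(x) + mu * np.sum(np.abs(x))

def obj_reform(u, x, f, mu):
    """Objective of the reformulation (3): f(x) + mu * sum(u),
    subject to -u <= x <= u, i.e. u >= |x| componentwise."""
    assert np.all(u >= np.abs(x) - 1e-12), "(u, x) must be feasible for (3)"
    return f(x) + mu * np.sum(u)
```

For any feasible u we have mu * sum(u) >= mu * ||x||_1, with equality exactly at u = |x|, which is the inequality chain used in the proof below.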

Theorem 2.1
Let x* be a solution of (1) and set u*_i = |x*_i| for i = 1, …, n. Then (u*, x*) is a solution of (3). Conversely, if (u*, x*) is a solution of (3), then x* is a solution of (1).

Proof
Let x̄ be any vector in R^n and choose ū such that (ū, x̄) is a feasible point of (3). Then ū_i ≥ |x̄_i| for i = 1, 2, …, n. Since x* is a solution of (1), we have

f(x*) + µ∥x*∥1 ≤ f(x̄) + µ∥x̄∥1 ≤ f(x̄) + µ ∑_i ū_i,

which shows that (u*, x*) with u*_i = |x*_i| is a solution of (3). On the other hand, suppose that (u*, x*) is a solution of (3). For any x ∈ R^n, (|x|, x) is a feasible point of (3), and we thus have

f(x*) + µ∥x*∥1 ≤ f(x*) + µ ∑_i u*_i ≤ f(x) + µ∥x∥1.

Then the first and third inequalities imply that x* is a solution of (1).
Since the constraints in (3) are all linear, a KKT point of the linearly constrained problem (3) always exists. The following theorem, which states the first-order necessary conditions for (3), is then immediate.

Theorem 2.2
Suppose that f : R^n → R is continuously differentiable and z = (u, x) is a local solution of the constrained problem (3). Then there exist multipliers λ1 ∈ R^n and λ2 ∈ R^n such that the following KKT conditions hold:

∇f(x) − λ1 + λ2 = 0,
µe − λ1 − λ2 = 0,
λ1_i (u_i + x_i) = 0, λ2_i (u_i − x_i) = 0, i = 1, …, n,
λ1 ≥ 0, λ2 ≥ 0, −u ≤ x ≤ u,  (4)

where e = (1, …, 1)^T. The following theorem shows that at any KKT point of (3), for each index i either the constraint u_i − x_i = 0 or the constraint u_i + x_i = 0 is active; that is, half of the constraints of (3) must be active at any KKT point.
Suppose first that x_i = 0 for some i. By the complementarity conditions in (4), λ1_i u_i = 0 and λ2_i u_i = 0. Furthermore, by µ = λ1_i + λ2_i with µ > 0, at least one of λ1_i and λ2_i is positive, which shows that u_i = 0, so both constraints are active. If there exists an index i with x_i ≠ 0, then u_i ≥ |x_i| > 0; by µ = λ1_i + λ2_i > 0 and the complementarity conditions, exactly one of u_i − x_i = 0 and u_i + x_i = 0 holds, i.e., u_i = |x_i|. The following theorem shows that the KKT points of the linearly constrained problem (3) are the same as the stationary points of the ℓ1 regularization problem.

Theorem 2.4
Suppose that f : R^n → R is continuously differentiable and (u, x, λ1, λ2) is a KKT point of the constrained problem (3). Then x is a stationary point of (1). Conversely, if x is a stationary point of (1), then there exist multipliers λ1 and λ2 such that (u, x, λ1, λ2) with u = |x| is a KKT point of (3).

Proof
If x_i > 0, then u_i = x_i. By the complementarity and nonnegativity conditions of (4), we get λ1_i = 0 and λ2_i = µ, and then the first condition of (4) gives ∇f_i(x) + µ = 0. If x_i < 0, then u_i = −x_i; similarly we get λ2_i = 0, λ1_i = µ and ∇f_i(x) − µ = 0. If x_i = 0, then u_i = 0, and λ1_i + λ2_i = µ with λ1_i, λ2_i ≥ 0 yields |∇f_i(x)| ≤ µ. Thus, x is a stationary point of (1). Conversely, suppose that x is a stationary point of (1) and let u = |x|. If x_i > 0, then u_i = x_i and ∇f_i(x) + µ = 0; in this case, we let λ1_i = 0 and λ2_i = µ, which satisfy (4). If x_i < 0, then u_i = −x_i and ∇f_i(x) − µ = 0; in this case, we let λ1_i = µ and λ2_i = 0, which satisfy (4). If x_i = 0, then u_i = 0 and |∇f_i(x)| ≤ µ; it suffices to let λ1_i = (µ + ∇f_i(x))/2 and λ2_i = (µ − ∇f_i(x))/2, which are nonnegative and satisfy λ1_i + λ2_i = µ. Consequently, the KKT conditions (4) hold.
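The componentwise stationarity conditions for (1) used in this proof can be checked numerically. The sketch below is our own helper (not from the paper), assuming grad holds the gradient of the smooth part f at x; it tests x_i > 0 ⇒ ∇f_i(x) + µ = 0, x_i < 0 ⇒ ∇f_i(x) − µ = 0, and |∇f_i(x)| ≤ µ when x_i = 0:

```python
import numpy as np

def is_stationary(x, grad, mu, tol=1e-8):
    """Check the stationarity conditions of (1) componentwise:
      x_i > 0  =>  grad_i + mu = 0
      x_i < 0  =>  grad_i - mu = 0
      x_i = 0  =>  |grad_i| <= mu
    `grad` is the gradient of the smooth part f at x."""
    pos = x > 0
    neg = x < 0
    zero = ~(pos | neg)
    return (np.all(np.abs(grad[pos] + mu) <= tol)
            and np.all(np.abs(grad[neg] - mu) <= tol)
            and np.all(np.abs(grad[zero]) <= mu + tol))
```

For f(x) = 0.5∥x − b∥^2, the minimizer of (1) is the soft-threshold of b, which the check confirms.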

Algorithm
In this section, we begin with some notation.
Let Ω = {(u, x) ∈ R^n × R^n : −u ≤ x ≤ u} denote the feasible set of (3), and let P_Ω(v) denote the projection of any vector v onto Ω. We first give the explicit form of P_Ω(v). Then, we propose the algorithm and give some convergence results. The following theorem shows that Ω is a nonempty closed convex set.

Theorem 3.1
The set Ω is a nonempty closed convex set.
The above theorem shows that Ω is a nonempty closed convex set. Thus, we can compute the projection of any vector v onto Ω. The following theorem gives the explicit form of P_Ω(v).

Proof
Define the Lagrange function of (6), where ρ1 ∈ R^n and ρ2 ∈ R^n are multipliers, and write the KKT conditions for (6) as (7). Note that (6) is a convex optimization problem; thus, for any solution (u, x, ρ1, ρ2) of (7), (u, x) is a solution of (6). We consider four cases. In case (i), u(i) = x(i) = 0, which satisfies (7).
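Since Ω is defined by the componentwise constraints −u_i ≤ x_i ≤ u_i, the projection decouples into n independent projections onto the two-dimensional cone {(u, x) : |x| ≤ u}. The sketch below is our own illustration of the standard formula for this cone projection, not the paper's theorem verbatim:

```python
import numpy as np

def project_cone(a, b):
    """Project the point (a, b) onto the 2-D cone {(u, x): |x| <= u}.
    Here a is the u-coordinate and b is the x-coordinate."""
    if a >= abs(b):           # already feasible: keep the point
        return a, b
    if a <= -abs(b):          # inside the polar cone: projection is the origin
        return 0.0, 0.0
    t = (a + abs(b)) / 2.0    # project onto the nearer boundary ray u = |x|
    return t, np.sign(b) * t

def project_omega(u, x):
    """Componentwise projection of (u, x) onto Omega = {(u, x): -u <= x <= u}."""
    out_u = np.empty_like(u, dtype=float)
    out_x = np.empty_like(x, dtype=float)
    for i in range(len(u)):
        out_u[i], out_x[i] = project_cone(u[i], x[i])
    return out_u, out_x
```

The three cases correspond to the point being feasible, being in the polar cone, or projecting onto the boundary ray u = |x|.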
Based on the above discussion, we propose the projection gradient method for (1) as follows.
Step 1. Perform the convergence test and terminate with an approximate solution z k if the stopping criterion is satisfied.
To accelerate the projection gradient method, we apply the Barzilai-Borwein steplength to Algorithm 1. To this end, we briefly recall the Barzilai-Borwein method (see, e.g., [3,15]). Consider the unconstrained minimization problem min_{x∈R^n} G(x), where G : R^n → R is continuously differentiable. The Barzilai-Borwein method is defined by

x^{k+1} = x^k − α_BB^k ∇G(x^k),

where the scalar α_BB^k, called the Barzilai-Borwein steplength [3], is given by

α_BB^k = (s^{k−1})^T s^{k−1} / (s^{k−1})^T y^{k−1},

with s^{k−1} = x^k − x^{k−1} and y^{k−1} = ∇G(x^k) − ∇G(x^{k−1}). Due to their easy implementation, efficiency and low storage requirements, BB-type methods have been widely used in many applications such as box constrained optimization [7,14], nonlinear equations [12] and sparse reconstruction [35,22]. The Barzilai-Borwein steplength θ_BB^k used in Algorithm 1 is defined analogously in terms of the iterates z^k. To avoid too small and too large values of θ_BB^k, we project it onto the interval [α_min, α_max], where 0 < α_min < α_max are given constants; that is, we let

α^k = max{α_min, min{θ_BB^k, α_max}}.  (9)

For simplicity, we refer to Algorithm 1 with the steplength (9) used in Step 2 as the projection Barzilai-Borwein algorithm, abbreviated PBB.
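A safeguarded BB steplength of the kind described above can be sketched as follows (a hypothetical helper with our own names; the fallback value when s^T y ≤ 0 is our own choice, not prescribed by the text):

```python
import numpy as np

def bb_steplength(z_new, z_old, g_new, g_old, alpha_min=1e-10, alpha_max=1e10):
    """Barzilai-Borwein steplength alpha = s^T s / s^T y with
    s = z_new - z_old and y = g_new - g_old (gradient difference),
    safeguarded into the interval [alpha_min, alpha_max]."""
    s = z_new - z_old
    y = g_new - g_old
    sy = np.dot(s, y)
    if sy <= 0:                       # no curvature information: use the cap
        return alpha_max
    return min(alpha_max, max(alpha_min, np.dot(s, s) / sy))
```

For a quadratic G(x) = 0.5 x^T D x with diagonal D, the BB steplength always lies between the reciprocals of the largest and smallest diagonal entries.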

Convergence Result
In this section, we analyze the convergence of Algorithm 1.To this aim, we make the following assumptions on the objective function.
(ii) f has continuous partial derivatives on an open set that contains the level set L(z^0).
The following theorem shows that every accumulation point of {z^k} is a stationary point of (3). Its proof is similar to that in [7] and is hence omitted.

Theorem 3.3
Assume that ϕ satisfies Assumption 3.1 and let {z^k} be the sequence generated by Algorithm 1. If d^k ≠ 0 for all k, then every accumulation point z* of {z^k} is a stationary point of (3). Moreover, if f is convex, then every accumulation point z* of {z^k} is a solution of (3).

Numerical Experiments
In this section, we present numerical experiments to test the performance of the proposed method and compare it with three existing solvers: l1_ls [26], FPC_AS [33] and GPSR_BB [22]. All codes are written in MATLAB 7.0, and all tests described in this section were performed on a PC with an Intel i5-3230 2.6 GHz CPU and 4 GB of RAM running Windows.
Experiments in [22,24,33,35] have confirmed the effectiveness of continuation. Therefore, we embedded our method in an adaptive continuation procedure; specifically, we use the adaptive continuation procedure of [35], which was also used in GPSR_BB [22]. We implemented Algorithm 1 with the parameters M = 5, α_min = 10^{−10}, α_max = 10^{10}, δ = 10^{−2} and η = 0.5, and implemented the continuation procedure with the parameter ς = 0.2. The initial point for all tested algorithms is the zero vector. The other three algorithms were run with their default parameters.
In Table 1, we summarize a list of symbols used in the subsequent tables and figures.
Table 1. Summary of symbols used in all subsequent tables and figures.

m, n    numbers of rows and columns of A, respectively
cpu     CPU time
nnzx    number of nonzeros in the recovered solution
nMat    total number of matrix-vector products involving A and A^T
MSE     relative error between the recovered solution x and the exact sparsest solution x_s, i.e., MSE = ∥x − x_s∥ / ∥x_s∥

ℓ 2 -ℓ 1 problem
In this subsection, we consider a typical compressed sensing scenario, namely problem (2), where the goal is to reconstruct a length-n sparse signal from m observations with m < n. The m × n measurement matrix A is obtained by first filling it with independent samples of a standard Gaussian distribution (generated by the MATLAB command randn) and then orthonormalizing its rows. To generate the signal x_s, we first generated the support by randomly selecting T indices between 1 and n, and then assigned a value to x_i for each i in the support by one of the following four methods:
Type 1: one (zero-one signal);
Type 2: the sign of a normally distributed random variable;
Type 3: a normally distributed random variable (Gaussian signal);
Type 4: a random variable uniformly distributed on (−1, 1).
In these experiments, we tested matrices A with n = 4096 and m = round(0.1n) or m = round(0.2n), and considered a range of degrees of sparseness: the number T of nonzero spikes in x_s ranges from 1 to 30 for each type of element in the support. The observation b is generated by b = Ax_s, and the regularization parameter is taken as µ = 0.05∥A^T b∥_∞. The above procedure yields a total of 240 problems. For each data set (x_s, A, b), we first ran l1_ls and stored the final value of the objective function, and then ran the other algorithms until they reached the same objective function value.
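The test-problem generation described above can be sketched in Python as follows (a hypothetical reimplementation of the MATLAB setup; the function name and argument conventions are our own):

```python
import numpy as np

def make_cs_problem(m, n, T, signal_type=1, rng=None):
    """Generate a compressed-sensing test instance as described in the text:
    A is m x n Gaussian with orthonormalized rows, x_s has T nonzero spikes.
    signal_type: 1 zero-one, 2 sign of a Gaussian, 3 Gaussian,
    4 uniform on (-1, 1)."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((m, n))
    # Orthonormalize the rows via the QR factorization of A^T.
    Q, _ = np.linalg.qr(A.T)          # Q is n x m with orthonormal columns
    A = Q.T
    support = rng.choice(n, size=T, replace=False)
    vals = {1: np.ones(T),
            2: np.sign(rng.standard_normal(T)),
            3: rng.standard_normal(T),
            4: rng.uniform(-1.0, 1.0, T)}[signal_type]
    x_s = np.zeros(n)
    x_s[support] = vals
    b = A @ x_s
    mu = 0.05 * np.linalg.norm(A.T @ b, np.inf)
    return A, x_s, b, mu
```

Orthonormal rows make A A^T the identity, which is the property the row-orthonormalization step is meant to enforce.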
A component of the signal x_s, or of the final solution obtained by each tested method, is considered nonzero when its absolute value is greater than 0.001∥x_s∥_∞. We adopt the performance profiles of Dolan and Moré [19] to evaluate the CPU time, MSE, nMat and nnzx. Figures 1-8 show the performance profiles of the tested methods with respect to these measures. They show that the PBB method performs best on the 240 test problems: it generally requires less CPU time and fewer matrix-vector products, attains a smaller MSE, and obtains the same nnzx as the other algorithms.

Group-separable regularizer
In this subsection, we examine the performance of the proposed method using the group-separable regularizer [35] ∑_{i=1}^k ∥x_[i]∥, where x_[1], …, x_[k] are k disjoint subvectors of x, and A ∈ R^{1024×4096} was obtained in the same way as in Subsection 4.1. The vector x_s has 4096 components, divided into k = 64 groups of length l_i = 64. To generate x_s, we randomly chose from one to eight groups and filled them with zero-mean Gaussian random samples of unit variance, while all the other groups were filled with zeros. The target vector is b = Ax_s + e, where e is Gaussian noise with mean zero and variance 10^{−4}. The regularization parameter is chosen as suggested in [35]: µ = 0.05∥A^T b∥_∞. We used the same stopping criterion as in Subsection 4.1. We ran 10 test problems and report the average CPU time needed by the four methods in Table 2. From Table 2, we observe that PBB is much faster than l1_ls and is comparable to the FPC_AS and GPSR_BB methods.
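The group-separable regularizer, the sum of the Euclidean norms of disjoint subvectors, can be computed as follows (a minimal sketch with our own names; `groups` is assumed to be a list of index arrays partitioning {0, …, n−1}):

```python
import numpy as np

def group_norm(x, groups):
    """Group-separable regularizer: the sum of the Euclidean norms of
    disjoint subvectors of x. `groups` is a list of index arrays."""
    return sum(np.linalg.norm(x[g]) for g in groups)
```

When every group has a single element this reduces to the ordinary ℓ1 norm, which is the sense in which (1) is a special case.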

Image deblurring problem
In this subsection, we present results for one image restoration problem, the Cameraman test image (see Figure 7). The images are 256 × 256 grayscale images; that is, n = m = 256^2 = 65536. The image restoration problem has the form (2) with µ = 0.00005. We used the same stopping criterion as in Subsection 4.1. On this test problem, the FPC_AS method fails to satisfy the stopping condition, so we do not report its CPU time or resulting image. Table 3 reports the average CPU time. The results in Table 3 and Figure 9 again indicate that PBB yields much better performance on this test problem.

Conclusion
In this paper, we investigated a linearly constrained optimization reformulation of a more general form of the ℓ1 regularization problem and established some of its useful properties. We first showed the equivalence between the linearly constrained optimization problem and the ℓ1 regularization problem. Second, since the constraints are linear, a KKT point of the linearly constrained problem always exists; we showed that half of the constraints must be active at any KKT point. In addition, we showed that the KKT points of the linearly constrained problem coincide with the stationary points of the ℓ1 regularization problem. Based on this reformulation, we proposed a nonmonotone spectral gradient method and, under appropriate conditions, showed that the method is globally convergent. The numerical results in Section 4 demonstrate the effectiveness of the algorithm for solving some standard ℓ2-ℓ1 problems.

Figure 9 .
Figure 9. From top to bottom: the original image and the reconstructions via the minimization of (2) obtained by l1_ls, PBB and GPSR_BB.