Two-Step Proximal Gradient Algorithm for Low-Rank Matrix Completion

In this paper, we propose a two-step proximal gradient algorithm to solve the nuclear norm regularized least squares problem for the purpose of recovering a low-rank data matrix from a sampling of its entries. Each iterate generated by the proposed algorithm is a combination of the latest three points, namely, the previous point, the current iterate, and its proximal gradient point. The algorithm preserves the computational simplicity of the classical proximal gradient algorithm, in which a singular value decomposition is involved in the proximal operator. Global convergence follows directly from results in the literature. Numerical results are reported to show the efficiency of the algorithm.


Introduction
Over the past few years, finding a low-rank matrix subject to given constraints has attracted significant attention due to its wide range of engineering applications. The general rank minimization problem can be expressed as

min_{X∈R^{m×n}} Rank(X) s.t. X ∈ C, (1)

where X ∈ R^{m×n} is the decision variable and C is a convex set. As a special case, the affine rank minimization problem can be expressed as

min_{X∈R^{m×n}} Rank(X) s.t. A(X) = b, (2)

where A : R^{m×n} → R^p is a linear operator, A* is its adjoint, and b ∈ R^p is a known observation vector.
The rank minimization problem is generally known to be computationally intractable (NP-hard) [1,2]. When the matrix variable is restricted to be diagonal, problem (2) reduces to the linearly constrained nonsmooth cardinality minimization problem. One common approach is to replace the cardinality term by the ℓ1-norm and then solve a convex minimization problem. Similarly, to solve (2) via a tractable problem, it is natural and common to replace "Rank(X)" by the so-called nuclear norm ∥ • ∥∗, that is,

min_{X∈R^{m×n}} ∥X∥∗ s.t. A(X) = b, (3)

where ∥X∥∗ is defined as the sum of the singular values of X. That is, if X has r positive singular values σ1 ≥ σ2 ≥ . . . ≥ σr > 0, then ∥X∥∗ = Σ_{i=1}^r σi. A frequent alternative to (3) is the following nuclear norm regularized linear least squares problem:

min_{X∈R^{m×n}} µ∥X∥∗ + (1/2)∥A(X) − b∥², (4)

where the parameter µ > 0 trades off the two terms (throughout, ∥ • ∥F denotes the Frobenius norm induced by the standard trace inner product). The nuclear norm is alternatively known as the Schatten ℓ1-norm and the Ky Fan norm [2], and it is the best convex approximation of the rank function over the unit ball of matrices [3].
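As a concrete illustration of the definition above, the nuclear norm can be computed directly from a singular value decomposition. The following is a minimal NumPy sketch (the function name and test matrix are ours, not from the paper):

```python
import numpy as np

def nuclear_norm(X):
    # ||X||_* = sum of the singular values of X.
    return np.linalg.svd(X, compute_uv=False).sum()

# Rank-1 example: X = u v^T has a single nonzero singular value ||u||*||v||.
u = np.array([3.0, 0.0, 4.0])   # ||u|| = 5
v = np.array([0.0, 2.0])        # ||v|| = 2
X = np.outer(u, v)
print(nuclear_norm(X))          # ≈ 10.0 (= ||u||·||v||)
```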
A particular category of (3) is the so-called matrix completion problem: given a random subset of the entries of a low-rank matrix, one would like to recover all the missing entries. Specifically, following Candès & Recht [4], it can be formulated as

min_{X∈R^{m×n}} ∥X∥∗ s.t. X_{ij} = M_{ij}, (i, j) ∈ Ω, (5)

where M is the unknown matrix with some available sampled entries and Ω is a set of index pairs (i, j). Throughout our discussion, the solution set of (5) is assumed to be nonempty. It is known that when the observations are corrupted by Gaussian noise, the following nuclear norm regularized linear least squares problem is preferable:

min_{X∈R^{m×n}} µ∥X∥∗ + (1/2)∥P_Ω(X) − P_Ω(M)∥²_F, (6)

where P_Ω(•) is defined componentwise as [P_Ω(X)]_{ij} = X_{ij} if (i, j) ∈ Ω and 0 otherwise. Over the past few years, numerical approaches for solving formulation (3) and its variants have drawn increasing attention, e.g., [7,8,10]. Cai, Candès & Shen [3] proposed a singular value thresholding algorithm and showed that as the parameter goes to infinity, the sequence of optimal solutions converges to the solution of (3) with minimum Frobenius norm. Ma, Goldfarb & Chen [11] introduced a fixed point continuation algorithm (FPCA), which is a matrix extension of the fixed point continuation algorithm of Hale et al. [12]. Liu, Sun & Toh [13] developed proximal point algorithms from the primal, dual, and primal-dual forms, in which the inner subproblems are solved by a gradient projection method or an accelerated proximal gradient method. Toh and Yun [5] extended Beck & Teboulle's algorithm [6] to matrix completion problems and designed an accelerated proximal gradient algorithm.
Due to its simplicity, the proximal gradient algorithm is a popular method for solving closed proper convex optimization problems. However, the classical proximal gradient algorithm was regarded as slow until the exciting progress of recent years. The "accelerated" approaches are mainly based on an extrapolation step which relies not only on the current point, but also on one or more previously computed iterates (e.g., [14,15]). For other notable algorithms, we refer to Nesterov [16], Beck & Teboulle [6], Tseng [17], O'Donoghue & Candès [18], and the references therein.
The main contribution of this paper is to further investigate the efficiency of proximal gradient algorithms for solving the matrix completion problem (6). The distinguishing feature of the proposed algorithm is that each generated iterate is a combination of the previous point, the current point, and its proximal gradient point. This combination improves the algorithm's flexibility and practicality. The algorithm is closely related to the well-known two-step iterative shrinkage/thresholding algorithm (TwIST) of Bioucas-Dias & Figueiredo [14] for sparse signal recovery. In other words, the proposed algorithm can be regarded as an extension or application of TwIST to nuclear norm regularized least squares, for the purpose of testing its efficiency in finding low-rank matrices. The proposed algorithm preserves the computational simplicity of the classical proximal gradient algorithm and mainly involves a singular value decomposition in the proximal operator. We state its global convergence property without proof under appropriate conditions. Finally, we test the algorithm to show its numerical performance.
We organize the rest of this paper as follows. In Section 2, we summarize some preliminaries that are useful for further analysis and quickly review some closely related approaches in the literature. In Section 3, we develop a two-step proximal gradient algorithm and present its convergence theorem. In Section 4, in order to investigate the benefit of the proposed algorithm, we test it on a series of numerical experiments. Finally, we conclude the paper in Section 5.
We summarize the notation used in this paper. Matrices are written as uppercase letters and vectors as lowercase letters. For a matrix X ∈ R^{m×n}, its Frobenius norm is defined as ∥X∥_F = (Σ_{i,j} X_{ij}²)^{1/2}. For any x ∈ R^n, we denote by Diag(x) the diagonal matrix with the components of the vector x on the diagonal. For any X ∈ R^{n×n}, we use λ_max(X) to denote its largest eigenvalue. We use "⊤" to denote the transpose of a vector or a matrix. Additional notation will be introduced as it occurs.

Preliminaries
In this section, we review some preliminaries on proximal gradient algorithms which are useful in the later analysis. Let f : R^n → R ∪ {+∞} be a closed proper convex function. The proximal operator Prox_{λf} : R^n → R^n of the scaled function λf is defined by

Prox_{λf}(y) = argmin_{x∈R^n} { f(x) + (1/(2λ))∥x − y∥² }, (8)

which is also called the proximal operator of f with parameter λ. The definition indicates that Prox_{λf}(y) is a point that compromises between minimizing f and staying near y. The proximal gradient algorithm solves the convex separable minimization problem

min_{x∈R^n} f(x) + g(x), (9)

where f : R^n → R and g : R^n → R ∪ {+∞} are closed proper convex and f is differentiable. The proximal gradient iteration is

x^{k+1} = Prox_{λ_k g}(x^k − λ_k ∇f(x^k)), (10)

where λ_k > 0 is a step size. It is well known that when ∇f is Lipschitz continuous with constant L, this method converges with rate O(1/k) for the fixed step size λ_k = 1/L. Accelerated versions of the basic proximal gradient algorithm mainly include an extrapolation step. One popular approach is the well-known fast iterative shrinkage-thresholding algorithm (FISTA) of Beck & Teboulle [6], whose iteration complexity achieves O(1/k²) with a proper choice of step length λ_k. Another accelerated approach is the so-called two-step iterative shrinkage/thresholding algorithm (TwIST) of Bioucas-Dias & Figueiredo [14]. TwIST is motivated by [19] for solving a linear system whose coefficient matrix can be split into two terms, one of which is positive definite and easy to invert. Compared with the standard proximal gradient iteration (10), the general two-step formulation with scalars β and α is

x^{k+1} = (1 − α)x^{k−1} + (α − β)x^k + β Prox_{λg}(x^k − λ∇f(x^k)). (11)

Clearly, the case α = 1 and β = 1 reduces to (10). It was shown in [14] that iteration (11) converges globally in the case f(x) = ∥Ax − b∥² and g(x) = ∥x∥₁ for proper values of α and β.
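To make the two-step recursion (11) concrete, the following is a minimal Python sketch for the compressive-sensing case f(x) = ½∥Ax − b∥², g(x) = µ∥x∥₁ discussed above. The function names and toy usage are ours, not taken from [14]:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t*||.||_1 : componentwise soft thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def twist_l1(A, b, mu, alpha, beta, lam, n_iter=500):
    """Two-step iteration (11) for f(x) = 0.5*||Ax - b||^2, g(x) = mu*||x||_1:
    x_{k+1} = (1-alpha)*x_{k-1} + (alpha-beta)*x_k + beta*prox(x_k - lam*grad f(x_k))."""
    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        prox = soft_threshold(x - lam * grad, lam * mu)
        x_prev, x = x, (1 - alpha) * x_prev + (alpha - beta) * x + beta * prox
    return x
```

With α = β = 1, the recursion collapses to the plain proximal gradient step (10); other admissible choices of α and β give the genuine two-step behavior.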

Two-step proximal gradient algorithm
Based on the above preliminaries, we are now ready to construct the two-step proximal gradient algorithm for problem (6). Comparing formulations (6) and (9), we can take

f(X) = (1/2)∥P_Ω(X) − P_Ω(M)∥²_F and g(X) = µ∥X∥∗.

It is easy to see that g : R^{m×n} → R is a nonsmooth continuous convex function, and that f : R^{m×n} → R is a continuously differentiable convex function with Lipschitz continuous gradient, i.e.,

∥∇f(X) − ∇f(Y)∥_F ≤ L∥X − Y∥_F for all X, Y ∈ R^{m×n},

where L is a Lipschitz constant.
For convenience, we write P_Ω(X) = A(X) and P_Ω(M) = b. With this notation, we consider the following quadratic approximation of f(X) + g(X) at a given point X^k:

Q_L(X, X^k) = f(X^k) + ⟨∇f(X^k), X − X^k⟩ + (L/2)∥X − X^k∥²_F + g(X).

Its unique minimizer admits the formulation

X^{k+1} = Prox_{g/L}( X^k − (1/L)∇f(X^k) ).

By definition, evaluating a proximal operator generally involves solving a convex optimization problem. In some cases, exploiting special structure in the problem, such as sparsity, can produce a very simple algorithm, or even an analytical solution.
Let Y^k = X^k − (1/L)∇f(X^k), and suppose that the rank of Y^k is r. The reduced singular value decomposition of Y^k can be expressed as

Y^k = U^k Diag(σ) (V^k)^⊤,

where U^k and V^k are m × r and n × r matrices with orthonormal columns, and σ ∈ R^r is the vector of positive singular values arranged in descending order. It follows from [3] that the proximal operator can be expressed as

Prox_{g/L}(Y^k) = U^k Diag( max(σ − µ/L, 0) ) (V^k)^⊤,

where "max" is interpreted componentwise. Based on the above analysis, the standard proximal gradient algorithm for solving (4) reduces to

X^{k+1} = Prox_{g/L}(Y^k), (12)

which means that the next iterate is determined only by the current X^k. To derive an algorithm of more general form, as in formula (11), we choose two positive scalars α and β and set

X^{k+1} = (1 − α)X^{k−1} + (α − β)X^k + β Prox_{g/L}(Y^k). (13)

Clearly, the new iterate is a combination of the previous point X^{k−1}, the current point X^k, and its proximal gradient point Prox_{g/L}(Y^k).
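The singular value shrinkage formula above can be sketched in a few lines of NumPy (an illustrative sketch; the function name is ours):

```python
import numpy as np

def prox_nuclear(Y, t):
    # Prox of t*||.||_* : soft-threshold the singular values of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```

Note that singular values below the threshold t are set to zero, so the operator typically returns a matrix of strictly lower rank, which is exactly why it is effective for low-rank recovery.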
For the general nuclear norm regularized linear least squares problem (4), the Lipschitz constant L, which depends on the maximum eigenvalue of A*A, is not always easily computable. In the special case of the matrix completion problem (6), however, the linear operator A satisfies A*A = I (i.e., the identity), and then L = 1 from the fact that L = λ_max(A*A). Based on the above analysis, the standard version of the two-step proximal gradient algorithm (abbr. Ts PGA) for solving the separable convex minimization problem (6) can be outlined as follows.
Step 1. Compute

X^{k+1} = (1 − α)X^{k−1} + (α − β)X^k + β Prox_{g/L}(Y^k),

where

Y^k = X^k − ∇f(X^k) = X^k − P_Ω(X^k) + P_Ω(M).

Step 2. If the termination criterion is not met, let k := k + 1 and go to Step 1.
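The two steps above can be sketched as follows, a minimal NumPy version under our own naming: mask is the 0/1 indicator of Ω and M_obs stands for P_Ω(M):

```python
import numpy as np

def ts_pga(M_obs, mask, mu, alpha, beta, n_iter=200):
    """Two-step proximal gradient sketch for (6) with L = 1:
    Y^k = X^k - (P_Omega(X^k) - P_Omega(M)),
    X^{k+1} = (1-alpha)X^{k-1} + (alpha-beta)X^k + beta*Prox_{mu||.||_*}(Y^k)."""
    def prox_nuc(Y, t):
        # Soft-threshold the singular values of Y.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
    X_prev = np.zeros_like(M_obs)
    X = X_prev.copy()
    for _ in range(n_iter):
        Y = X - mask * (X - M_obs)        # gradient step with L = 1
        X_prev, X = X, (1 - alpha) * X_prev + (alpha - beta) * X + beta * prox_nuc(Y, mu)
    return X
```

With α = β = 1 this reduces to the fixed-point iteration (12); a practical implementation would add the stopping test of Step 2 and a continuation scheme for µ.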
To aid understanding, we remark on the close relationship between Ts PGA and the solver TwIST of Bioucas-Dias & Figueiredo [14], and the solver FPCA of Ma, Goldfarb & Chen [11].

Remark 1
It is well known that TwIST is designed for solving ℓ1-norm regularized minimization problems in compressive sensing. It is also easy to see that Ts PGA is equivalent to TwIST when it is used to recover large sparse signals in compressive sensing. Generally speaking, the proposed algorithm Ts PGA can be considered an extension of TwIST to the problem of recovering a corrupted low-rank matrix. In the special case α = β = 1, iteration (13) reduces to the classical proximal gradient algorithm, also called the fixed point continuation algorithm when used to solve (4). In other words, Ts PGA can also be regarded as a generalized variant of FPCA.
Since Ts PGA is an application of the well-known TwIST to nuclear norm least squares for matrix completion, its global convergence follows directly. For completeness, we state the convergence theorem without proof at the end of this section. For the proof of the theorem, one can refer to [14, Appendix II].
Numerical experiments

We first list some notation to make our experiments easier to follow:

m: the number of rows of the matrix;
n: the number of columns of the matrix;
r: the rank of the original matrix, which is far less than min{m, n};
sr: the sample ratio;
p: the number of measurements, set to p = sr × m × n;
dr: the number of degrees of freedom of a rank-r matrix, defined as dr = r(m + n − r);
Fr: Fr = r(m + n − r)/p; Fr ∈ (0, 1) is important for successfully recovering M;
M: a real low-rank matrix to be recovered, generated as M = M_L M_R^⊤, where M_L ∈ R^{m×r} and M_R ∈ R^{n×r} are generated via the Matlab script "randn(m, r)";
Ω: the index set of known entries, selected randomly;
b: the given measurement vector, b = A(M) + ω;
ω: Gaussian noise with mean zero and standard deviation σ, generated by σ × randn(p, 1);
µ: a penalty parameter, updated by a continuation technique;
X*: an optimal solution.

All experiments are performed under Windows 7 Premium and MATLAB v7.8 (2009a) running on a Lenovo laptop with an Intel Core CPU at 2.4 GHz and 2 GB of memory. The iterative process is terminated when the iterates satisfy the criterion

∥X^{k+1} − X^k∥_F / max{1, ∥X^k∥_F} < Tol. (14)

The relative error RelErr = ∥X* − M∥_F / ∥M∥_F measures the quality of X* relative to the original M. As usual [3,20], we say that M is recovered successfully if RelErr is less than tol = 10^{−3}. The termination condition (14) is also used in [11]. For a relatively fair comparison, we also use the same Matlab package as in FPCA for matrix singular value decomposition (SVD).
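The experimental setup just described can be mirrored in a short NumPy sketch (our own translation of the Matlab recipe; names are illustrative):

```python
import numpy as np

def make_problem(m, n, r, sr, sigma=0.0, seed=0):
    """Generate a random matrix-completion instance as described above:
    M = M_L @ M_R.T with Gaussian factors; sample p = sr*m*n entries at random;
    optionally add Gaussian noise of standard deviation sigma to the observations."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
    p = int(round(sr * m * n))
    idx = rng.choice(m * n, size=p, replace=False)      # index set Omega
    mask = np.zeros(m * n, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(m, n)
    B = mask * (M + sigma * rng.standard_normal((m, n)))  # observed entries P_Omega(M + noise)
    dr = r * (m + n - r)        # degrees of freedom of a rank-r matrix
    Fr = dr / p                 # should lie in (0, 1) for recovery to be possible
    return M, mask, B, Fr
```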
Our tests are partitioned into four parts. In the first three parts, we test Ts PGA and FPCA on different cases of matrix completion problems. In the fourth part, we use two typical images to visually show the efficiency of both tested algorithms. In each test, we use the same technique as in [14, (26) and (27)] to choose the parameters α and β so as to guarantee the algorithm's convergence. The numerical results generated by Ts PGA and FPCA for matrix completion problems with different sr and r are reported in Table 1, where "Time" denotes the CPU time required by the algorithm. As can be seen from the top part of Table 1, both algorithms require more computing time when Fr exceeds 0.38. For each case with a different Fr, we can clearly see that Ts PGA performs better than FPCA in terms of running time and relative error. Moreover, FPCA failed to obtain high-accuracy solutions in the cases sr = 0.8, 0.9. From the bottom part of the table, we observe that all the problems are successfully solved for different r when Fr = 0.38, and that Ts PGA is again the winner. From this simple test, we conclude that Ts PGA is more efficient than FPCA.
In the second test, we apply Ts PGA and FPCA to some easy matrix completion problems of different dimensions. The numerical results are listed in Table 2, where Fr is restricted to (0.01, 0.2) and m and n vary from 100 to 2000. From this table, we see that both algorithms produce satisfactory solutions within almost the same running time. The accuracy of the solutions obtained by FPCA looks slightly higher, which means that the performance of the two algorithms is competitive.
In the third test, we use Ts PGA and FPCA to recover low-rank matrices corrupted by Gaussian noise. The noise level is set to σ = 1e−3. The numerical results are reported in Table 3. From this table, it is clear that Ts PGA requires less running time than FPCA to reach solutions of similar accuracy. This test shows that Ts PGA runs faster on noisy matrix completion problems.
To visually illustrate the efficiency of Ts PGA, in the last test we use both algorithms to recover a couple of randomly corrupted images that are widely used in the literature. In this test, we choose sr = 0.5 and r = 40. The original low-rank images, the corrupted images, and the images recovered by each algorithm are presented in the accompanying figure. Taking the above four cases together, we conclude that Ts PGA provides an alternative approach to low-rank matrix completion problems, and that its performance is competitive with or slightly better than the widely used solver FPCA.

Conclusions
In the field of machine learning, one often meets the problem of recovering a low-rank matrix from given noisy partial entries. It has been shown that this task can be formulated as a matrix nuclear-norm minimization problem. However, the non-smoothness of the nuclear norm makes the problem challenging to solve. Due to the simplicity and effectiveness of the proximal gradient algorithm for a wide range of convex minimization problems, the algorithm and its accelerated variants have been intensively studied in the past few years. In particular, an algorithm of this type was successfully used to solve the matrix completion problem by Ma, Goldfarb & Chen [11]. On the one hand, the proposed algorithm is an extension or generalization of the proximal gradient algorithm, in which each generated iterate is a combination of the previous point, the current point, and its proximal gradient point. On the other hand, our idea also draws on the effectiveness of the TwIST algorithm of Bioucas-Dias and Figueiredo [14] for recovering large sparse signals from limited measurements. Hence, our algorithm can also be regarded as an extension of TwIST to matrix completion problems. Although both motivations are simple, the numerical experiments illustrate that the proposed algorithm performs well and is competitive with the well-known solver FPCA; this is the contribution of this paper. The paper focused on the matrix completion problem, where the Lipschitz constant of the gradient of the least squares term is 1, which is essential to the convergence of the algorithm. The algorithm's efficiency on general linearly constrained nuclear norm minimization problems has not been studied; this is a task for future investigation.

Table 1 .
Numerical results of Ts PGA and FPCA

Table 2 .
Numerical results of Ts PGA and FPCA for easy matrix completion problems

Table 3 .
Numerical results of Ts PGA and FPCA for (6) with Gaussian noise

From the figure, we see that Ts PGA successfully recovered both corrupted images, and that the quality of each recovered image is competitive with the one obtained by FPCA.