Comparative Study of LASSO, Ridge Regression, Preliminary Test and Stein-type Estimators for the Sparse Gaussian Regression Model

This paper compares the performance characteristics of penalty estimators, namely the LASSO and ridge regression (RR), with the least squares estimator (LSE), the restricted estimator (RE), the preliminary test estimator (PTE) and the Stein-type estimators. Under the assumption of an orthonormal design matrix for the given regression model, we find that the RR estimator dominates the LSE, RE, PTE, Stein-type estimators and the LASSO estimator uniformly, while, similar to Hansen (2013), neither the LASSO nor the LSE, PTE and Stein-type estimators dominates the others uniformly. Our conclusions are based on the analysis of L_2-risks and relative risk efficiencies (RRE), together with the related RRE tables and graphs.


Introduction
It is well known that the "least squares estimators (LSE)" in linear models are unbiased with minimum variance. However, data analysts point out some deficiencies of the LSE with regard to "prediction accuracy" and "interpretation". To overcome these two important concerns, [32] proposed a popular and exciting penalty estimator, called the least absolute shrinkage and selection operator (LASSO). It defines a continuous shrinkage operation that can produce coefficients that are exactly 0, and it is competitive with "subset selection" and "ridge regression" while retaining the good features of both. The LASSO simultaneously estimates and selects the coefficients of a given model.
However, there are many shrinkage estimators, such as the "preliminary test (PT)" and Stein-type estimators (SE), in the literature. They do not select coefficients but only shrink towards a target value.
There is an extensive literature on preliminary test and Stein-type estimators; the most recent account is documented in [25]. Due to the immense impact of Stein's approach on "point estimation", scores of technical papers have appeared in the literature in various areas of application.
In 1970, Hoerl and Kennard introduced the "ridge regression" estimator, which opened the door for "penalty estimators" based on the regularization idea of [33]. Ridge regression combats the problem of multicollinearity in linear models and is the precursor of the problem of simultaneous estimation and selection of variables. Ridge regression (RR) methodology is reviewed in the next section together with the other estimators considered in this paper.

Linear Model and the Estimators
Consider the multiple linear regression model in two forms,

    Y = Xβ + e  and  Y = X_1 β_1 + X_2 β_2 + e,   (1)

where X^T X = I_p is the identity matrix, the columns of X are centered, Y = (y_1, y_2, ..., y_n)^T is an n-vector of responses, X is an n × p design matrix, β = (β_1, β_2, ..., β_p)^T is a p-vector of regression coefficients and e = (e_1, e_2, ..., e_n)^T is an n-vector of errors following the N(0, σ² I_n) distribution with known σ². The second form of the model (1) arises by partitioning β = (β_1^T, β_2^T)^T and X = (X_1, X_2), with p = p_1 + p_2, where β_1 may stand for the main effects and β_2 for the interactions, which may be insignificant; one is interested in the estimation of the main effects when β_2 is suspected to be zero (the sparsity condition).

It is well known that the least squares estimator (LSE) of β under the assumed conditions is given by β̃_n = (X^T X)^{-1} X^T Y = X^T Y. The LSE is the best linear unbiased estimator (BLUE) of β. Under normal theory, β̃_n ~ N_p(β, σ² I_p); equivalently, (β̃_1n^T, β̃_2n^T)^T ~ N_p((β_1^T, β_2^T)^T, σ² diag(I_{p_1}, I_{p_2})). We made the above simplifying assumptions in order to create a level playing field on which to compare a diverse set of shrinkage estimators of β = (β_1, β_2, ..., β_p)^T. These assumptions also produce insight into the nature of the shrinkage that can be gleaned from the orthonormality of the design matrix. From now on we designate the LSE, β̃_n = (β̃_1n^T, β̃_2n^T)^T, as the unrestricted estimator (UE) of β = (β_1^T, β_2^T)^T, and β̂_2n = 0 as the restricted estimator (RE) of β_2, so that the restricted estimator of β_R = (β_1^T, 0^T)^T is β̂_n = (β̃_1n^T, 0^T)^T.

We shall study the quantitative characteristics of the estimators based on the bias and the L_2-risk defined by

    b(β*_n) = E(β*_n) − β   (2)   and   R(β*_n; I_p) = E‖β*_n − β‖²,   (3)

where β*_n is any estimator of β. Based on (2) and (3), we see at once that the bias and the risk of the LSE and the RE are given respectively by

    b_1(β̃_n) = 0,  R_1(β̃_n; I_p) = pσ²,  b_2(β̂_n) = (0^T, −β_2^T)^T,  R_2(β̂_n; I_p) = p_1 σ² + β_2^T β_2.   (4)

From equation (4) we easily find that R_2(β̂_n; I_p) = σ²(p_1 + Δ²), where Δ² = β_2^T β_2 / σ² is the divergence parameter, which measures the distance of Δ = (Δ_21, Δ_22, ..., Δ_2p_2)^T, with Δ_2j = β_2j/σ, from the origin in the R^{p_2}-space, i.e. Δ² = Σ_j Δ_2j². All our analysis will be based on Δ² ∈ R^+ specifying the parameter space, instead of Δ = (Δ_21, Δ_22, ..., Δ_2p_2)^T ∈ R^{p_2}.

Our basic interest is to consider several shrinkage estimators which shrink towards the origin, 0. Accordingly, we first consider the preliminary test estimator (PTE),

    β̂_n^{PT} = (β̃_1n^T, β̃_2n^T I(L_n > c_α))^T,

where I(A) is the indicator function of the set A, L_n = β̃_2n^T β̃_2n / σ² is the test statistic for testing H_0: β_2 = 0, and c_α = χ²_{p_2}(α) is the upper α-th percentile of the null distribution of L_n. It is known that L_n follows a non-central chi-square distribution with p_2 degrees of freedom (D.F.) and non-centrality parameter Δ², which we defined as the divergence parameter. The PTE is a discrete rule: it either "keeps" the unrestricted component β̃_2n or "kills" it, i.e. replaces β̃_n by β̂_n. The bias vector and L_2-risk of the PTE are respectively given by

    b_3(β̂_n^{PT}) = −β_2 H_{p_2+2}(c_α; Δ²)

and

    R_3(β̂_n^{PT}; I_p) = σ² p_1 + σ² p_2 [1 − H_{p_2+2}(c_α; Δ²)] + σ² Δ² [2 H_{p_2+2}(c_α; Δ²) − H_{p_2+4}(c_α; Δ²)],

where the H_ν(·; Δ²) functions represent the CDF of the non-central chi-square distribution with ν D.F. and non-centrality parameter Δ², and c_α is the upper α-th percentile of the central chi-square distribution with p_2 degrees of freedom.

A continuous version of the PTE is the James-Stein-type estimator (JSE), defined by

    β̂_n^{JS} = (β̃_1n^T, β̃_2n^T (1 − (p_2 − 2) L_n^{-1}))^T,   p_2 ≥ 3.

The bias and L_2-risk function of the JSE are given respectively by

    b_4(β̂_n^{JS}) = −(p_2 − 2) β_2 E[χ^{-2}_{p_2+2}(Δ²)]

and

    R_4(β̂_n^{JS}; I_p) = σ² p − σ² (p_2 − 2)² E[χ^{-2}_{p_2}(Δ²)],

where χ²_ν(Δ²) is the standard non-central chi-square variable with ν D.F. and non-centrality parameter Δ². The main characteristic of the JSE is the reduction of the L_2-risk.
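As a sanity check on these definitions, the following small Monte Carlo sketch (our own illustration, with arbitrary values p_1 = 3, p_2 = 7, σ = 1 and a near-null β_2; all names are hypothetical) computes the LSE, RE, PTE and JSE under an orthonormal design and estimates their empirical L_2-risks; the LSE risk should be close to pσ².

    # Illustrative sketch (not from the paper): Monte Carlo check of the L2-risks
    # of the LSE, RE, PTE and JSE under an orthonormal design with known sigma^2.
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n, p1, p2, sigma = 100, 3, 7, 1.0
    p = p1 + p2
    beta = np.concatenate([np.full(p1, 2.0), np.full(p2, 0.3)])   # beta_2 "small" (near-sparse)

    # Orthonormal design: X'X = I_p (columns from a QR decomposition)
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))

    alpha = 0.05
    c_alpha = chi2.ppf(1 - alpha, df=p2)                          # upper alpha percentile

    risks = {"LSE": 0.0, "RE": 0.0, "PTE": 0.0, "JSE": 0.0}
    B = 5000
    for _ in range(B):
        y = X @ beta + sigma * rng.standard_normal(n)
        b_lse = X.T @ y                                           # LSE since X'X = I_p
        b_re = b_lse.copy();  b_re[p1:] = 0.0                     # restricted estimator
        L_n = b_lse[p1:] @ b_lse[p1:] / sigma**2                  # test statistic for H0: beta_2 = 0
        b_pte = b_lse if L_n > c_alpha else b_re                  # keep-or-kill rule
        b_jse = b_lse.copy();  b_jse[p1:] *= (1 - (p2 - 2) / L_n) # James-Stein shrinkage of beta_2
        for name, b in [("LSE", b_lse), ("RE", b_re), ("PTE", b_pte), ("JSE", b_jse)]:
            risks[name] += np.sum((b - beta) ** 2) / B

    print({k: round(v, 3) for k, v in risks.items()})             # LSE risk should be near p*sigma^2 = 10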
We obtain the JSE as a simple modification of the PTE, and it depends on the test statistic L_n; see [25] for details. The JSE has the property of "over-shrinkage": it may shrink beyond the target vector, resulting in changes of sign. This is due to the factor (1 − (p_2 − 2) L_n^{-1}), which becomes negative when L_n < p_2 − 2 and whose absolute value may then exceed unity. This change of sign affects the interpretation. Hence, we consider the positive-rule Stein-type shrinkage estimator (PRSE), β̂_n^{S+}, defined by

    β̂_n^{S+} = (β̃_1n^T, β̃_2n^T (1 − (p_2 − 2) L_n^{-1}) I(L_n > p_2 − 2))^T.

The bias and L_2-risk function of the PRSE are given in [25]; they involve truncated expectations of the reciprocal of a non-central χ²-distribution with p_2 + 2 degrees of freedom and non-centrality parameter Δ².

Next, we consider the basic penalty estimator, the ridge regression (RR) estimator ([18]), which for the sparse model keeps β̃_1n and applies a constant shrinkage to β̃_2n,

    β̂_n^{RR}(k) = (β̃_1n^T, (1 + k)^{-1} β̃_2n^T)^T,   k ≥ 0.

This estimator is not scale invariant: if the scales used to express the individual predictor variables are changed, the ridge coefficients do not change inversely proportionally to the changes in the variable scales. This ridge regression gives the constant shrinkage factor 1/(1 + k). The bias and L_2-risk expressions are given respectively by

    b_6(β̂_n^{RR}) = (0^T, −(k/(1 + k)) β_2^T)^T  and  R_6(β̂_n^{RR}; I_p) = σ² p_1 + σ² (p_2 + k² Δ²)/(1 + k)².

Notice that the estimator depends on the unknown tuning parameter k. In general, the ridge regression estimator combats the multicollinearity problem when the X-matrix is non-orthogonal; in our case it is a scaled version of the LSE. The shrunken component tends to the null vector 0 as k → ∞, and for k = 0 the estimator reduces to the LSE.
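For completeness, here is a minimal sketch (ours, not from [25]) of the PRSE and of the sparse ridge rule described above, written as functions of the LSE components; the function names are hypothetical.

    # Minimal sketch (our own illustration): the positive-rule Stein estimator and
    # the sparse ridge estimator as functions of the LSE under X'X = I_p.
    import numpy as np

    def prse(b_lse, p1, p2, sigma=1.0):
        """Positive-rule Stein estimator: like the JSE, but the shrinkage
        factor is truncated at zero so the sign of beta_2 never flips."""
        b = b_lse.copy()
        L_n = b[p1:] @ b[p1:] / sigma**2
        factor = max(0.0, 1.0 - (p2 - 2) / L_n)
        b[p1:] *= factor
        return b

    def ridge_sparse(b_lse, p1, k):
        """Ridge-type rule for the sparse model: beta_1 kept at its LSE,
        beta_2 shrunk by the constant factor 1/(1+k)."""
        b = b_lse.copy()
        b[p1:] /= (1.0 + k)
        return b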
Finally, we consider the LASSO (least absolute shrinkage and selection operator) estimator due to [32], which has become extremely popular in the statistical literature due to its applicability in data analysis for linear models, unlike many of the other estimators. It shrinks some coefficients and sets others to 0, and hence tries to retain the good properties of both subset selection and ridge regression. The literature on LASSO-related penalty estimators shows that they have mostly been studied under an "orthonormal" set-up of the design matrix X; see, for example, [32], [9], [11] and [35], among many others. We use the orthonormality of the design matrix in our study too.
Under the orthonormal design, the LASSO estimator is given componentwise by

    β̂_jn^{L}(λ) = sgn(β̃_jn)(|β̃_jn| − λσ)^+,   j = 1, ..., p.

Here, λ is the tuning parameter (threshold parameter) and, according to [32], the LASSO is obtained by minimizing

    (1/2)‖Y − Xβ‖² + λσ Σ_{j=1}^p |β_j|,

which provides simultaneous estimation and selection of the components of the β-vector. Our aim is the estimation of β under the L_2-risk given by (3). For this, we consider the family of diagonal linear projections (DP),

    β̂^{DP}(τ) = (τ_1 β̃_1n, ..., τ_p β̃_pn)^T,   τ_j ∈ {0, 1}.

This estimator "keeps" or "kills" a parameter estimate β̃_jn, i.e. it does subset selection. Now, we incur a risk σ² if we use β̃_jn, and a risk β_j² if we use the estimator 0 instead. Hence, our ideal choice for τ_j is I(|β_j| > σ), that is, keep all those predictors whose true coefficient exceeds the noise level σ. This yields the ideal risk

    R_σ(DP) = Σ_{j=1}^p min(β_j², σ²).

This expression is a lower bound on the risk that we can hope for. If we assume that p_1 of the β_j²'s are greater than σ² and the remaining p_2 lie below the noise level, then we obtain R_σ(DP) = σ²(p_1 + Δ²). In this case, the lower bound of the risk of the LASSO is given by

    R_7(β̂_n^{L}; I_p) ≥ σ²(p_1 + Δ²).   (20)

We shall use this lower bound to compare the LASSO with the other estimators. Further, based on Theorem 1 of [9] and equation (20), we have the following inequality:

    R_7(β̂_n^{L}(λ_p); I_p) ≤ (2 ln p + 1)[σ² + R_σ(DP)] = (2 ln p + 1) σ²(1 + p_1 + Δ²).   (22)

Next, we consider the hard threshold estimator (HTE), given componentwise by

    β̂_jn^{HT}(λ) = β̃_jn I(|β̃_jn| > λσ),   j = 1, ..., p.

[9] shows that asymptotically the LASSO comes as close as the HTE to the performance of an ideal subset selector, one that uses information about the actual parameter. As such, the result (22) holds for β̂_n^{HT}(λ_p) as well. Finally, by Theorem 4 of Donoho and Johnstone (1994), the upper bound in (22) is essentially attained for a sequence {λ*_n} of thresholds close to λ_p = (2 ln p)^{1/2} (see Tibshirani (1996), Eqn. 16). The same result holds for β̂_n^{HT}(λ_p). One may notice that the choice λ_p = (2 ln p)^{1/2} gives the smallest asymptotic L_2-risk for both the subset selection estimator and the LASSO estimator among the competing estimators.
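The soft- and hard-thresholding rules and the ideal diagonal-projection risk can be sketched as follows (an illustration with assumed values p = 10 and σ = 1; the helper names are ours). The last line evaluates the oracle-type upper bound appearing in (22).

    # Hedged sketch: soft (LASSO) and hard thresholding of the LSE under an
    # orthonormal design, the ideal diagonal-projection risk, and the bound (22).
    import numpy as np

    def lasso_orthonormal(b_lse, lam, sigma=1.0):
        """Soft thresholding: sgn(b)(|b| - lam*sigma)_+ , the LASSO solution when X'X = I_p."""
        return np.sign(b_lse) * np.maximum(np.abs(b_lse) - lam * sigma, 0.0)

    def hard_threshold(b_lse, lam, sigma=1.0):
        """Hard thresholding: keep b_j only if |b_j| > lam*sigma (subset selection)."""
        return b_lse * (np.abs(b_lse) > lam * sigma)

    def ideal_risk(beta, sigma=1.0):
        """Ideal diagonal-projection risk R_sigma(DP) = sum_j min(beta_j^2, sigma^2)."""
        return np.sum(np.minimum(beta**2, sigma**2))

    p = 10
    beta = np.array([2.0] * 3 + [0.3] * 7)
    lam_p = np.sqrt(2 * np.log(p))                            # threshold (2 ln p)^(1/2)
    b_lse = beta + np.random.default_rng(2).standard_normal(p)  # one draw of the LSE (sigma = 1)
    print(lasso_orthonormal(b_lse, lam_p))
    print(hard_threshold(b_lse, lam_p))
    bound = (2 * np.log(p) + 1) * (1.0 + ideal_risk(beta))    # oracle-type upper bound on the LASSO risk
    print(round(ideal_risk(beta), 3), round(bound, 3))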

Analysis of Dominance Properties of the Estimators
In this section, we compare and contrast the L_2-risks of the seven estimators discussed in section 2.
First, we note that the LASSO was proposed by [32] for shrinkage and selection of coefficients in linear and generalized regression models. The LASSO does not focus on subsets, but rather defines a continuous shrinkage operation that can produce coefficients that are exactly 0 and is competitive with subset selection and ridge regression in terms of prediction accuracy.
Secondly, the LASSO performs best at a point (see [7]). In our case, for Δ² = 0, it performs best among all estimators except ridge regression.
Thirdly, the LASSO enjoys the "oracle property" (see [35]) since the design matrix is orthogonal. In his pioneering paper, [32] examined the relative merits of subset selection, ridge regression and the LASSO in three different scenarios: (a) Small number of large coefficients: subset selection does best here, the LASSO not quite as well, and ridge regression does quite poorly. (b) Small to moderate number of moderate-size coefficients: the LASSO does best, followed by ridge regression and then subset selection. (c) Large number of small coefficients: ridge regression does best by a good margin, followed by the LASSO and then subset selection.
The above results refer to prediction accuracy. Recently, [17] considered the comparison of the LASSO, Stein-type estimators and subset selection based on the L_2-risk. His findings may be summarized as follows: (i) Neither the LASSO nor the LSE or Stein-type estimators uniformly dominates the others. (ii) Via simulation studies, he concludes that LASSO estimation is particularly sensitive to the coefficient parameterization and that, for a significant portion of the parameter space, the LASSO has higher L_2-risk than the LSE. However, [17] did not specify the regions where one estimator or the other has lower L_2-risk. In his analysis, [17] used normalized L_2-risk bounds (NRB) to arrive at his conclusions.

Comparison of LASSO and LSE
Consider the solution for the LASSO. In particular, suppose that p_1 (< p) coefficients satisfy the condition β_j² > σ² and the remaining p_2 coefficients lie below the noise level. In this case, the L_2-risk difference of the LSE and the LASSO (lower bound) is given by

    R_1(β̃_n; I_p) − R_7(β̂_n^{L}; I_p) = pσ² − σ²(p_1 + Δ²) = σ²(p_2 − Δ²).

Hence, the LASSO outperforms the LSE whenever 0 ≤ Δ² < p_2. Otherwise, the LSE outperforms the LASSO in the interval (p_2, ∞). Hence, neither the LSE nor the LASSO outperforms the other uniformly. Thus, by (20) and (22), we have the relative risk efficiency (RRE) bounds

    p / [(2 ln p + 1)(1 + p_1 + Δ²)] ≤ RRE[LASSO : LSE] ≤ p / (p_1 + Δ²).

At Δ² = 0, the upper bound equals p/p_1 = 1 + p_2/p_1 > 1. On the other hand, if Δ² ≠ 0, the two bounds depend on Δ² and are decreasing functions of Δ². Hence, RRE[LASSO : LSE] may be greater or smaller than 1, depending on the size of (p_1, p_2, Δ²). Hence, neither the LASSO nor the LSE outperforms the other uniformly. See Figure 1.
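For concreteness, the two RRE bounds can be tabulated over a grid of Δ² (illustrative values p_1 = 3 and p_2 = 7 assumed, not taken from the paper); the upper bound crosses 1 exactly at Δ² = p_2.

    # Illustrative numbers (ours): the two RRE bounds for LASSO vs LSE,
    # p/((2 ln p + 1)(1 + p1 + Delta^2)) <= RRE <= p/(p1 + Delta^2).
    import numpy as np

    p1, p2 = 3, 7
    p = p1 + p2
    for d2 in [0.0, 2.0, p2, 15.0]:
        upper = p / (p1 + d2)
        lower = p / ((2 * np.log(p) + 1) * (1 + p1 + d2))
        print(f"Delta^2={d2:5.1f}  lower={lower:5.3f}  upper={upper:5.3f}")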

Comparison of LASSO with RE
For the comparison of the LASSO with the RE, we first consider the relative risk efficiency (RRE) of β̂_n relative to the LSE, given by

    RRE(β̂_n : β̃_n) = R_1(β̃_n; I_p) / R_2(β̂_n; I_p) = p / (p_1 + Δ²),

which is a decreasing function of Δ², attains its maximum value 1 + p_2/p_1 at Δ² = 0 and equals unity at Δ² = p_2. Thus, β̂_n dominates β̃_n in the range 0 ≤ Δ² ≤ p_2 and β̃_n dominates β̂_n when Δ² > p_2. Thus, none of the estimators dominates the other uniformly. The ball in the Δ-space R^{p_2} defined by Δ² ≤ p_2 is the part of the parameter space where β̂_n dominates β̃_n, and outside this ball β̃_n dominates β̂_n. Now, we know that both the LASSO and the RE perform better than the LSE in the interval [0, p_2) and the LSE performs better than both the LASSO and the RE in the interval (p_2, ∞). It is evident that for a significant proportion of the parameter space both the LASSO and the RE have higher L_2-risk than the LSE. Thus, the performance characteristics of the LASSO and the RE are the same, reflecting the "oracle property" of the LASSO under an orthonormal design matrix.
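A quick numerical check with the same illustrative values p_1 = 3 and p_2 = 7 (our own numbers):

    # RRE(RE : LSE) = p/(p1 + Delta^2) equals 1 + p2/p1 at Delta^2 = 0 and 1 at Delta^2 = p2.
    p1, p2 = 3, 7
    p = p1 + p2
    print(p / (p1 + 0))      # 3.333... = 1 + p2/p1
    print(p / (p1 + p2))     # 1.0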

Comparison of LASSO with the PTE
In this section, we present the comparison of the LASSO with the PTE. In this case, the L_2-risk difference is given by

    R_3(β̂_n^{PT}; I_p) − R_7(β̂_n^{L}; I_p) = σ² p_2 [1 − H_{p_2+2}(c_α; Δ²)] − σ² Δ² [1 − 2 H_{p_2+2}(c_α; Δ²) + H_{p_2+4}(c_α; Δ²)].

Then, the LASSO performs better than the PTE whenever

    Δ² [1 − 2 H_{p_2+2}(c_α; Δ²) + H_{p_2+4}(c_α; Δ²)] ≤ p_2 [1 − H_{p_2+2}(c_α; Δ²)],

that is, for Δ² ≤ Δ²_{PTE}, where Δ²_{PTE} is the point at which the risk difference changes sign. Otherwise, the PTE is better than the LASSO for Δ² > Δ²_{PTE}. Hence, neither the PTE nor the LASSO dominates the other uniformly for any α ∈ (0, 1).
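A short numerical sketch (using the PTE risk expression above, with illustrative values p_1 = 3, p_2 = 7 and α = 0.05 assumed by us) locates Δ²_{PTE} by scanning a grid of Δ²-values:

    # Sketch: find the crossing point of the PTE risk and the LASSO lower bound.
    import numpy as np
    from scipy.stats import chi2, ncx2

    p1, p2, alpha, sigma2 = 3, 7, 0.05, 1.0
    c_a = chi2.ppf(1 - alpha, df=p2)

    def pte_risk(d2):
        H2 = ncx2.cdf(c_a, df=p2 + 2, nc=d2)
        H4 = ncx2.cdf(c_a, df=p2 + 4, nc=d2)
        return sigma2 * (p1 + p2 * (1 - H2) + d2 * (2 * H2 - H4))

    def lasso_lower(d2):
        return sigma2 * (p1 + d2)

    grid = np.linspace(0.01, 30.0, 3000)
    diff = np.array([pte_risk(d) - lasso_lower(d) for d in grid])
    crossing = grid[np.argmax(diff < 0)] if np.any(diff < 0) else None
    print("approximate Delta^2_PTE:", crossing)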

Comparison of LSE with JSE and PRSE
Now, we consider the comparison of the JSE and the LSE. By a simple L_2-risk difference, it is easy to see that

    R_1(β̃_n; I_p) − R_4(β̂_n^{JS}; I_p) = σ² (p_2 − 2)² E[χ^{-2}_{p_2}(Δ²)] ≥ 0   for all Δ² ∈ R^+ (p_2 ≥ 3).

Hence, β̂_n^{JS} dominates β̃_n uniformly. Similarly, the L_2-risk difference of β̂_n^{S+} and β̃_n is non-negative for all Δ², since the PRSE further reduces the risk of the JSE. Thus, we have the following ordering:

    R_5(β̂_n^{S+}; I_p) ≤ R_4(β̂_n^{JS}; I_p) ≤ R_1(β̃_n; I_p)   ∀ Δ² ∈ R^+.

It is clear that the gains of the JSE and the PRSE are highest when the coefficients are small (Δ² near zero).
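The uniform dominance of the JSE over the LSE can also be checked by simulation; the sketch below (our own, with illustrative values) estimates the JSE risk for several Δ² and compares it with pσ².

    # Monte Carlo sketch (illustrative): the JSE risk stays below p*sigma^2 for every Delta^2.
    import numpy as np

    rng = np.random.default_rng(1)
    p1, p2, sigma = 3, 7, 1.0
    p = p1 + p2

    def jse_risk(delta2, B=20000):
        beta2 = np.zeros(p2)
        beta2[0] = np.sqrt(delta2) * sigma                        # any beta_2 with ||beta_2||^2/sigma^2 = Delta^2
        b2 = beta2 + sigma * rng.standard_normal((B, p2))         # LSE of beta_2
        L_n = np.sum(b2**2, axis=1) / sigma**2
        shrunk = (1 - (p2 - 2) / L_n)[:, None] * b2
        return p1 * sigma**2 + np.mean(np.sum((shrunk - beta2)**2, axis=1))

    for d2 in [0.0, 5.0, 20.0]:
        print(d2, round(jse_risk(d2), 3), "<=", p * sigma**2)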

Comparison of LASSO with JSE and PRSE
How does the LASSO perform compared with the JSE? For this, we first consider the difference between the L_2-risk of the JSE and the lower-bound risk of the LASSO (p_2 ≥ 3), given by

    R_4(β̂_n^{JS}; I_p) − σ²(p_1 + Δ²) = σ² [p_2 − (p_2 − 2)² E(χ^{-2}_{p_2}(Δ²)) − Δ²].

Hence, the LASSO outperforms the JSE whenever

    Δ² ≤ p_2 − (p_2 − 2)² E[χ^{-2}_{p_2}(Δ²)].   (36)

This means that the region in which the LASSO performs better than the JSE is given by (36); otherwise, the JSE performs better than the LASSO. Now, we consider the comparison of the LASSO and the positive-rule Stein estimator and consider the L_2-risk difference R_5(β̂_n^{S+}; I_p) − σ²(p_1 + Δ²). The LASSO outperforms the PRSE whenever Δ² ∈ Ω*, where Ω* is the set of Δ²-values for which this risk difference is non-negative.
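The following sketch (ours; it uses the JSE risk expression given in section 2 and the LASSO lower bound σ²(p_1 + Δ²), with illustrative values p_1 = 3 and p_2 = 7) shows numerically where each estimator has the smaller risk.

    # Numerical illustration: JSE risk vs the LASSO lower bound over a grid of Delta^2.
    import numpy as np
    from scipy.stats import poisson

    p1, p2, sigma2 = 3, 7, 1.0
    p = p1 + p2

    def e_inv_chi2(df, d2, kmax=500):
        """E[1/chi^2_df(Delta^2)] via the Poisson mixture representation."""
        k = np.arange(kmax)
        return np.sum(poisson.pmf(k, d2 / 2.0) / (df - 2 + 2 * k))

    for d2 in [0.0, 2.0, 5.0, 10.0, 20.0]:
        jse = sigma2 * (p - (p2 - 2) ** 2 * e_inv_chi2(p2, d2))
        lasso_lb = sigma2 * (p1 + d2)
        print(f"Delta^2={d2:5.1f}  JSE={jse:6.3f}  LASSO lower bound={lasso_lb:6.3f}")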

Comparison of Ridge Regression vs LSE and RE estimators
First, recall that the bias and L_2-risk of the ridge regression estimator are given respectively by

    b_6(β̂_n^{RR}(k)) = (0^T, −(k/(1 + k)) β_2^T)^T  and  R_6(β̂_n^{RR}(k); I_p) = σ² p_1 + σ² (p_2 + k² Δ²)/(1 + k)²,

where k is the tuning parameter and Δ² = β_2^T β_2 / σ². We first prove the following theorem on the dominance of the ridge estimator over the LSE.

Theorem 3.2.1. There always exists a k > 0 in the range 0 < k < k* = p_2 σ²/(β_2^T β_2) such that the ridge regression estimator β̂_n^{RR}(k) has smaller L_2-risk than the LSE.

Proof: It is obvious that for k = 0, R_6(β̂_n^{RR}(0); I_p) = pσ², which is the risk of the LSE. Now, consider the two terms of R_6(β̂_{2n}^{RR}(k); I_{p_2}) = p_2 σ²/(1 + k)² + σ² k² Δ²/(1 + k)². The first term is a continuous, monotonically decreasing function of k whose derivative with respect to k is strictly negative at k = 0^+; the second term is a continuous, monotonically increasing function of k whose derivative tends to zero as k → 0^+, and σ² k² Δ²/(1 + k)² itself converges to 0 as k → 0^+. Note that

    (d/dk) R_6(β̂_{2n}^{RR}(k); I_{p_2}) = 2σ² (k Δ² − p_2)/(1 + k)³.   (42)

Thus, a sufficient condition for (42) to be negative is that k lies in the interval 0 < k < k* = p_2/Δ², so the risk decreases throughout this interval. Substituting k* into R_6(β̂_{2n}^{RR}(k); I_{p_2}), we obtain

    R_6(β̂_n^{RR}(k*); I_p) = σ² (p_1 + p_2 Δ²/(p_2 + Δ²)).

As Δ² → ∞, the above expression tends to σ² p, and at Δ² = 0 it equals σ² p_1. Further, the risk of the PRSE increases monotonically towards the pσ²-line, while the risk of the RR estimator (at k*) increases monotonically below the graph of the risk of the PRSE. Hence,

    R_6(β̂_n^{RR}; I_p) ≤ R_5(β̂_n^{S+}; I_p) ≤ R_4(β̂_n^{JS}; I_p) ≤ R_1(β̃_n; I_p)   ∀ Δ² ∈ R^+.
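The role of k* can be verified numerically; the sketch below (illustrative values assumed by us) minimizes the ridge risk over a grid of k and compares the minimizer and the minimum with k* = p_2/Δ² and σ²(p_1 + p_2 Δ²/(p_2 + Δ²)).

    # Sketch of the dominance argument: the ridge risk sigma^2*(p1 + (p2 + k^2*Delta^2)/(1+k)^2)
    # is minimized at k* = p2/Delta^2, giving sigma^2*(p1 + p2*Delta^2/(p2 + Delta^2)) <= p*sigma^2.
    import numpy as np

    p1, p2, sigma2 = 3, 7, 1.0
    p = p1 + p2

    def ridge_risk(k, d2):
        return sigma2 * (p1 + (p2 + k**2 * d2) / (1 + k)**2)

    for d2 in [2.0, 7.0, 20.0]:
        ks = np.linspace(1e-6, 50, 200001)
        k_num = ks[np.argmin(ridge_risk(ks, d2))]           # numerical minimizer
        k_star = p2 / d2                                    # claimed minimizer
        print(round(k_num, 3), round(k_star, 3),
              round(ridge_risk(k_star, d2), 3), "<=", p * sigma2)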

Comparison of Ridge Regression and LASSO
The L_2-risk difference of the sparse LASSO (lower bound) and the ridge regression estimator at k* is given by

    R_7(β̂_n^{L}; I_p) − R_6(β̂_n^{RR}; I_p) = σ²(p_1 + Δ²) − σ²(p_1 + p_2 Δ²/(p_2 + Δ²)) = σ² Δ⁴/(p_2 + Δ²) ≥ 0.

Hence, R_6(β̂_n^{RR}; I_p) ≤ R_7(β̂_n^{L}; I_p) for all Δ² ∈ R^+. See Tables 3.3-3.4 for the numerical efficiencies of all proposed estimators.
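In the spirit of Tables 3.3-3.4, a small script (our own illustrative values, not the paper's tables) can compute RREs relative to the LSE from the closed-form or lower-bound risks used above.

    # Illustrative RRE computation: RRE = R(LSE)/R(estimator) over a grid of Delta^2.
    import numpy as np

    p1, p2, sigma2 = 3, 7, 1.0
    p = p1 + p2

    def rre_table(d2):
        lse = p * sigma2
        re = sigma2 * (p1 + d2)                     # also the LASSO lower-bound risk
        rr = sigma2 * (p1 + p2 * d2 / (p2 + d2))    # ridge with the optimal k
        return {"RE/LASSO": lse / re, "RR": lse / rr}

    for d2 in [0.0, 1.0, p2, 20.0]:
        print(d2, {k: round(v, 3) for k, v in rre_table(d2).items()})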
The analysis of these tables is consistent with the theoretical comparisons of the proposed estimators presented in section 3.

Acknowledgement
The authors are thankful to the referees and the editor for their valuable comments and suggestions, which certainly improved the presentation of the paper.