Generalized Ridge Regression Estimator in High Dimensional Sparse Regression Models

Modern statistical analysis often encounters linear models in which the number of explanatory variables is much larger than the sample size. Estimation in these high-dimensional problems requires regularization methods because of the rank deficiency of the design matrix. In this paper, ridge estimators are considered and their restricted regression counterparts are proposed for dependent errors under a combined multicollinearity and high-dimensionality setting. The asymptotic distributions of the proposed estimators are derived exactly. Incorporating the information contained in the restricted estimator, a shrinkage-type ridge estimator is also exhibited and its asymptotic risk is analyzed in some special cases. To evaluate the efficiency of the proposed estimators, a Monte Carlo simulation and a real-data example are presented.


Introduction
Consider the linear regression model
y_i = x_i^⊤ β + ε_i, i = 1, …, n, (1.1)
where the y_i are responses, x_i = (x_i1, …, x_ip)^⊤ are design points, β = (β_1, …, β_p)^⊤ is the vector of unknown coefficients, the ε_i are unobservable random errors, and the superscript (⊤) denotes the transpose of a vector or matrix. Further, ε = (ε_1, …, ε_n)^⊤ has cumulative distribution function F(ε), E(ε) = 0 and Var(ε) = σ²V_n, where σ² is finite and V_n is a known matrix belonging to S(n), the space of all positive definite matrices of dimension n × n. Many modern data problems carry the structure in which the number of covariates p may exceed the sample size n, known as "small n, large p" problems. In such cases, which arise in the study of genomics, financial markets, mobile phone communication, bioinformatics and risk management, regularization methods should be employed for inference about the parameters of interest (see [9, 5]).
In this paper, when a subset of coefficients is zero, the underlying model is called sparse.Under the sparsity assumption, the vector of coefficients β can be partitioned as (β 1 , β 2 ) where β 1 is the coefficient vector for main effects and β 2 is the vector for nuisance effects or insignificant coefficients.We are essentially interested in the estimation of β 1 when it is reasonable that β 2 is close to zero.
The paper is organized as follows: The problem of interest is stated in Section 2. In Section 3, the sparse ridge model is considered and the new estimators are proposed, while their asymptotic properties are derived in Section 4. Section 5 is devoted to exhibiting a shrinkage-type ridge estimator and analyzing its asymptotic properties. Efficiencies of the proposed shrinkage estimators relative to the ordinary estimator are evaluated through a Monte Carlo simulation as well as a real-data example in Section 6. We conclude our study in Section 7 with a summary.

Generalized Ridge Estimator
In this section, we discuss a biased estimation technique for regression models under multicollinearity; the following preliminaries are needed. In full matrix notation, the model (1.1) can be represented as
y_n = X_n β + ε, (2.2)
where y_n = (y_1, …, y_n)^⊤ and X_n = (x_1, …, x_n)^⊤. As is known, both the ordinary least squares estimator (OLSE) and its covariance matrix depend heavily on the characteristics of the matrix S_n = X_n^⊤ X_n. If S_n is ill-conditioned, then the OLSE may be affected by various numerical errors. The problem of multicollinearity can be mitigated by collecting additional data, reparameterizing the model, or reselecting the variables. There are also two well-known mathematical methods for overcoming multicollinearity: the principal components regression method and the ridge regression method. Here, we discuss the ridge regression method.
It is known that the spectral decomposition of the (symmetric) positive definite matrix S_n can be written as
S_n = Γ Λ Γ^⊤, Λ = diag(λ_1, …, λ_p),
where the columns of Γ are the eigenvectors of S_n and the scalars λ_1, …, λ_p are its eigenvalues, satisfying λ_1 ≥ … ≥ λ_p without loss of generality. Therefore, the orthogonal (canonical) version of the model (2.2) is given by
y_n = X*_n α + ε, (2.4)
where X*_n = X_n Γ and α = Γ^⊤ β. However, when S_n is ill-conditioned, there exists an approximate linear dependency among its columns. Then the OLSE of β has a large variance and multicollinearity is said to be present. In such a situation, there exists a small positive constant ε such that λ_j < ε for j > r, where r is called the numerical rank of S_n (Watkins, 2002). The closer the small eigenvalues are to the origin, the stronger the multicollinearity. To overcome the damaging effect of the small eigenvalues, we need to increase them.
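As a small numerical illustration (not part of the paper), the following Python sketch shows the mechanism described above: a nearly collinear design makes the smallest eigenvalue of S_n collapse toward zero, and shifting every eigenvalue by a ridge constant k bounds the spectral condition number. The design matrix, seed and the choice k = 1 are arbitrary assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nearly collinear design: the second column is almost a copy of the first.
n, p = 50, 3
x1 = rng.standard_normal(n)
X = np.column_stack([x1,
                     x1 + 1e-4 * rng.standard_normal(n),
                     rng.standard_normal(n)])

S = X.T @ X
eigvals = np.linalg.eigvalsh(S)           # eigenvalues of S_n, ascending
cond_S = eigvals.max() / eigvals.min()    # spectral condition number kappa_2(S_n)

# Ridge shifts every eigenvalue by k, which caps the condition number:
# kappa_2(S_n(k)) = (lambda_1 + k) / (lambda_p + k) <= kappa_2(S_n).
k = 1.0
cond_Sk = (eigvals.max() + k) / (eigvals.min() + k)
```

With this design, `cond_S` is astronomically large while `cond_Sk` stays of moderate size, which is the numerical-stability argument for ridge regression.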
From another point of view based on the spectral decomposition, in ridge regression the matrix S_n is replaced by the matrix S_n(k) defined by
S_n(k) = S_n + k I_p = Γ (Λ + k I_p) Γ^⊤, k > 0. (2.6)
Hence, we have
κ_2(S_n(k)) = (λ_1 + k)/(λ_p + k) ≤ λ_1/λ_p = κ_2(S_n),
where κ_2(·) stands for the spectral condition number and ∥x∥_2 denotes the Euclidean norm of the vector x. Thus, the computations of ridge regression are numerically more stable than those of the OLSE. In an extended scheme, the canonical generalized ridge estimator was proposed by Hoerl and Kennard [12] as
α̂(K_n) = (Λ + K_n)^{-1} X*_n^⊤ y_n, (2.7)
for some suitably chosen diagonal matrix K_n = diag(k_n^(1), …, k_n^(p)), k_n^(i) ≥ 0. Hoerl and Kennard [12] showed that for the known optimal values
k_n^(i) = σ²/α_i², i = 1, …, p,
the generalized ridge regression estimator is superior to all other estimators within the class of biased estimators, where σ² is the error variance of the model (2.2) and α_i is the ith entry of α. Moreover, Hocking et al. [13] showed that, with these optimal values k_n^(i), the proposed estimator (2.7) is superior to all estimators within the class of biased estimators they considered. However, the optimal value of k_n^(i) fully depends on the unknown quantities σ² and α_i, which must be estimated from the observed data. Hoerl and Kennard [12] suggested replacing σ² and α_i² by their corresponding unbiased estimators, that is,
k̂_n^(i) = σ̂²/α̂_i²,
where σ̂² is an unbiased and efficient estimator of σ² and α̂_i is the ith entry of α̂, an unbiased estimator of α. Hoerl and Kennard [12] also suggested an iterative procedure to estimate k_n^(i). Two other methods for selecting k_n^(i) were proposed by Hemmerle and Brantle [10], and some exact finite-sample properties of (2.7) can be found in Hemmerle and Carey [11].
Now, considering the high-dimensional case p > n, we are primarily interested in estimating the regression vector-parameter β in the model (2.2). Since S_n is rank deficient, a regularization method such as Tikhonov [21] regularization is needed to combat this ill-conditioning. In our setup, estimating β amounts to minimizing the general criterion
(y_n − X_n β)^⊤ V_n^{-1} (y_n − X_n β) + β^⊤ K_n^(p) β,
whose minimizer
β̂(K_n^(p)) = (X_n^⊤ V_n^{-1} X_n + K_n^(p))^{-1} X_n^⊤ V_n^{-1} y_n
is called the generalized ridge estimator (GRE).
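The canonical generalized ridge estimator (2.7) with the Hoerl-Kennard plug-in constants can be sketched as follows; the data-generating model, seed and dimensions are illustrative assumptions, not taken from the paper, and V_n = I is assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ beta + 0.5 * rng.standard_normal(n)

# Spectral decomposition S_n = Gamma Lambda Gamma^T and canonical design.
S = X.T @ X
lam, Gamma = np.linalg.eigh(S)
Xstar = X @ Gamma                         # X*_n = X_n Gamma

# Canonical OLS coefficients: Xstar^T Xstar = diag(lam).
alpha_ols = np.linalg.solve(np.diag(lam), Xstar.T @ y)

# Plug-in Hoerl-Kennard constants k_i = sigma_hat^2 / alpha_hat_i^2.
resid = y - Xstar @ alpha_ols
sigma2_hat = resid @ resid / (n - p)
K = np.diag(sigma2_hat / alpha_ols**2)

# Canonical generalized ridge estimator (2.7) and its beta-scale version.
alpha_gre = np.linalg.solve(np.diag(lam) + K, Xstar.T @ y)
beta_gre = Gamma @ alpha_gre
```

Since each k_i is strictly positive, every canonical coefficient is shrunk toward zero relative to its OLS counterpart.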
For the case K_n^(p) = k I_p, k > 0, the GRE reduces to the ordinary ridge regression estimator introduced by Hoerl and Kennard [12]. The bias and covariance matrix of β̂(K_n^(p)) follow from its definition after some algebra; for the high-dimensional case p > n, they can be expressed through the spectral decomposition of S_n. Under some mild conditions, further asymptotic properties can be established; for more detail on the high-dimensional properties, we refer to [7].

Sub-model Approach
Since p > n, one methodology for inference about the regression parameters is to reduce the number of features from p to p_1 < n. Hence, the regression parameter β is partitioned as β = (β_1^⊤, β_2^⊤)^⊤, where the sub-vector β_i has dimension p_i, i = 1, 2, and p_1 + p_2 = p. Thus, the underlying model has the form
y_n = X_n^(1) β_1 + X_n^(2) β_2 + ε, (3.1)
where the design matrix is partitioned conformably as X_n = (X_n^(1), X_n^(2)). With respect to this partitioning, the generalized least squares estimators (GOLSEs) of β_1 and β_2 are obtained from the corresponding normal equations weighted by V_n^{-1}. The sparse model is defined when H_o: β_2 = 0 is true; in this paper, we refer to the sparse model as the restricted regression model (RRM). For the RRM, the generalized restricted estimator based on the OLS estimator (GROLSE) is obtained by fitting the sub-model with X_n^(1) alone. According to [20], the GROLSE performs better than the GOLSE when the model is sparse; however, the former estimator performs poorly as β_2 moves away from zero. In a similar fashion as in [22], the relation (3.5) between the sub-model and full-model estimators of β_1 can be easily obtained.

Sparse ridge model
Under the sparsity assumption, i.e., β_2 = 0 in the high-dimensional problem, following [19], the generalized restricted ridge estimator (GRRE) is given by
β̂_1(K_n^(p1)) = (X_n^(1)⊤ V_n^{-1} X_n^(1) + K_n^(p1))^{-1} X_n^(1)⊤ V_n^{-1} y_n,
where K_n^(p1) is the ridge parameter matrix, a function of the sample size n. Similarly, the generalized ridge estimators (GREs) of β_1 and β_2 are the corresponding sub-vectors of the full-model GRE. If A_12 = 0, then δ = 0, γ = µ_11.2 and A_11.2 = A_11, and all the asymptotic risk functions reduce to the common value σ² tr(A_11^{-1}) + µ_11.2^⊤ µ_11.2 for all ξ. In a similar fashion as in [20], if A_12 ≠ 0, then as A_11^{-1} A_12 moves away from 0, the GRRE of β_1 loses its superiority over the GRE in the sense of having a smaller asymptotic risk function.
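The effect of imposing the restriction β_2 = 0 can be illustrated with a small sketch (my own toy setup, with V_n = I, an arbitrary common ridge constant k, and illustrative dimensions): under a sparse truth, the restricted ridge fit of β_1 is typically much closer to the true β_1 than the full high-dimensional ridge fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 30, 4, 60                     # high-dimensional: p = p1 + p2 > n
X1 = rng.standard_normal((n, p1))
X2 = rng.standard_normal((n, p2))
beta1 = np.array([2.0, -1.0, 0.5, 1.5])
y = X1 @ beta1 + 0.3 * rng.standard_normal(n)   # sparse truth: beta_2 = 0

k = 0.1
X = np.hstack([X1, X2])

# Full-model ridge estimator of (beta1, beta2); the ridge term makes
# X^T X + k I invertible despite the rank deficiency (p > n).
beta_gre = np.linalg.solve(X.T @ X + k * np.eye(p1 + p2), X.T @ y)

# Restricted ridge estimator: impose beta_2 = 0 and regress on X1 alone.
beta1_grre = np.linalg.solve(X1.T @ X1 + k * np.eye(p1), X1.T @ y)

# Under sparsity the restricted fit is typically far closer to the truth.
err_full = np.sum((beta_gre[:p1] - beta1) ** 2)
err_restricted = np.sum((beta1_grre - beta1) ** 2)
```

This mirrors the paper's message: the restricted estimator wins when the sparsity restriction is (approximately) true, but the comparison reverses as β_2 departs from zero.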

Shrinkage Estimator
Following the analysis in the previous section, one may consider improving the GRE of β_1 by shrinking it toward 0 (see [20] for more details). In this respect, one plausible choice is to combine the GRE and the GRRE of β_1 as
β̂_S(K_n^(p1)) = α_1 β̂_1^GRE + α_2 β̂_1^GRRE, 0 ≤ α_1, α_2 ≤ 1.
Now, let β* be any estimator of β_1. The asymptotic distributional risk (ADR) of β* is evaluated from its asymptotic distribution under the regularity conditions and the local alternatives {K_(n)}. Under these conditions, the asymptotic risk of the shrinkage ridge estimator (SRE) β̂_S, after some algebra, can be derived. Obviously, under ξ = 0, β̂_S dominates the GRE; for the special case α_1 = 1, α_2 = 0, the SRE reduces to the GRE, and for the special case α_2 = 1, α_1 = 0, it reduces to the GRRE. (5.4)
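The weighted combination above is simple enough to sketch directly; the weight values below are the optimal ones reported later in the simulation (α_1 = 0.4749, α_2 = 0.7449), while the coefficient vectors are made-up placeholders.

```python
import numpy as np

def shrinkage_ridge(beta_gre1, beta_grre1, alpha1, alpha2):
    # Weighted combination of the full-model (GRE) and restricted (GRRE)
    # ridge estimators of beta_1.  alpha1 = 1, alpha2 = 0 recovers the GRE;
    # alpha1 = 0, alpha2 = 1 recovers the GRRE.
    return alpha1 * np.asarray(beta_gre1) + alpha2 * np.asarray(beta_grre1)

beta_gre1 = np.array([1.2, -0.8, 2.1])    # hypothetical GRE of beta_1
beta_grre1 = np.array([1.0, -1.0, 2.0])   # hypothetical GRRE of beta_1
beta_sre = shrinkage_ridge(beta_gre1, beta_grre1, 0.4749, 0.7449)
```

In the paper the weights are chosen numerically by minimizing the ADR of the SRE, not fixed in advance.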

Numerical Results
In this section, Monte Carlo studies and a real-data example concerning the riboflavin production data set (see [6]) are considered to justify the assertions.

Monte Carlo simulation studies
To examine the performance of the proposed estimators, a Monte Carlo simulation is performed. To achieve different degrees of collinearity, following [18, 8], the explanatory variables were generated for n = 25 and p = 30, 50 and 70 from the model
x_ij = (1 − γ²)^{1/2} z_ij + γ z_{i,p+1}, i = 1, …, n, j = 1, …, p, (6.1)
where the z_ij are independent standard normal pseudo-random numbers and γ is specified so that the correlation between any two explanatory variables is γ². These variables are then standardized so that X_n^⊤ X_n and X_n^⊤ y_n are in correlation form. Three different values γ = 0.75, 0.90 and 0.99 are considered for the correlation. Then the observations for the dependent variable are determined by
y_n = X_n^(1) β_1 + X_n^(2) β_2 + ε, (6.2)
with β_1 = (−1.25, 1, 2.5, 4, −3, −5)^⊤, p_1 = 6 and n = 25. To achieve the sparse model, β_2 is generated as N_{p2}(0, 0.01 × I_{p2}), p_2 = p − p_1. The error term of the model (6.2) is generated with the dependence structure Var(ε) = σ²V_n. The Monte Carlo simulation is performed with M = 10³ replications, obtaining the ridge estimators in the sparse restricted regression model. The relative efficiencies of the above methods with respect to the first method are estimated from the simulated risks, where β̂^(m) denotes the estimate obtained in the mth replication. All numerical computations were conducted using the statistical package R. In Tables 1 to 4, the proposed estimators along with their ADR(·) values, the efficiencies of the proposed estimators relative to the GRE, and the optimal values of α_1 and α_2 are reported. The asymptotic risk of the SRE was employed to select the optimal parameters α_1 and α_2 numerically, by minimizing the ADR function of the SRE. The minimum of the ADR occurred at approximately α_1^opt = 0.4749 and α_2^opt = 0.7449 for p = 70 and γ = 0.99. Figure 1 shows the ADR functions versus α_1 and α_2. Since the results were similar across cases, only the results for p = 70 and γ = 0.99 are reported to save space.
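The collinearity-generating device (6.1) can be checked empirically with the sketch below (a large n is used here, unlike the paper's n = 25, purely so the sample correlations settle near their target value γ²).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, gamma = 5000, 4, 0.9

# Device (6.1): a shared standard normal component z_{i,p+1} induces
# correlation gamma^2 between any two explanatory variables.
Z = rng.standard_normal((n, p + 1))
X = np.sqrt(1.0 - gamma**2) * Z[:, :p] + gamma * Z[:, [p]]

corr = np.corrcoef(X, rowvar=False)
off_diag = corr[np.triu_indices(p, k=1)]
# Each off-diagonal sample correlation should be close to gamma^2 = 0.81.
```

Raising γ toward 0.99, as in the paper's hardest setting, drives every pairwise correlation toward 0.98 and makes S_n severely ill-conditioned.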
As can be seen from Tables 1-3, the SRE performs better than the other proposed estimators in the sense of having a smaller ADR. Also, because of the sparsity assumption, both the GRRE and the SRE are more efficient than the GRE. Moreover, since the simulated model is sparse, as is evident from Table 4, the weight allocated to the GRRE (α_2) is greater than that of the GRE (α_1). From Figure 1, one can distinguish the optimal regions of superiority of the SRE over the GRE and the GRRE.

Application to Riboflavin Production Data
To illustrate the usefulness of the suggested strategies for high-dimensional data, the data set on riboflavin (vitamin B2) production by Bacillus subtilis is considered; it can be found in the R package "hdi". There is a single real-valued response variable, the logarithm of the riboflavin production rate, and p = 4088 explanatory variables measuring the logarithm of the expression level of 4088 genes. The data form one rather homogeneous set of n = 71 samples that were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains and strains grown under different fermentation conditions were analyzed. Based on 100-fold cross-validation, the Lasso shrinks 4047 parameters to zero and retains p_1 = 41 significant explanatory variables. So the specification of the sparse regression model is
y_n = X_n^(1) β_1 + X_n^(2) β_2 + ε, (6.3)
where p_1 = 41 and p_2 = 4047.
According to [23], the unknown V_n can be estimated by a consistent estimator constructed from the residuals of the ordinary least squares fit of the sparse regression model.
Table 5 shows a summary of the results. According to Figure 2, the minimum of the ADR occurred at approximately α_1^opt = 0.00001 and α_2^opt = 0.9999. The ADR functions of the proposed estimators versus α_1 and α_2 are also plotted in Figure 3, which can be used in a similar fashion as in the simulation.

Summary
In this paper, generalized ridge estimators were proposed for the sparse multiple regression model. In this context, a generalized restricted ridge estimator was exhibited under the sparsity restriction β_2 = 0, where the regression vector-parameter is partitioned as β = (β_1^⊤, β_2^⊤)^⊤. The information contained in the generalized restricted ridge estimator was employed to construct a shrinkage ridge estimator. Asymptotic distributional results were given for the proposed estimators, and superiority conditions were discussed under some regularity conditions. All the analyses were conducted from a classical viewpoint, using Tikhonov regularization.
It is well known that as the sample size tends to infinity, Bayesian and classical analyses give almost the same results. However, under a finite sample size, one may consider a generalized Tikhonov regularization and derive ridge regression estimators in a Bayesian context. To see this, take Σ, β_o and Ω as the covariance of y_n, the expected value of β and the covariance of β, respectively. Then the generalized ridge estimator has the form of the posterior mean
β̂_B = (X_n^⊤ Σ^{-1} X_n + Ω^{-1})^{-1} (X_n^⊤ Σ^{-1} y_n + Ω^{-1} β_o).
From the numerical results, β̂_S(K_n^(p1)) turned out to be the best estimator among the others, since it offers a smaller risk and a larger efficiency value in all cases. According to Figure 1, the global minimum occurs under the ADR of the GRE and the GRRE, and there exists an optimal region in which the SRE outperforms the GRE and the GRRE. For the real example, from Table 5 as well as Figures 2 and 3, it can be deduced that β̂_S(K_n^(p1)) is quite efficient in the sense that it has a significant goodness-of-fit value.
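The Bayesian form above can be sketched as follows; the design, prior mean β_o = 0, prior covariance Ω = 10 I and Σ = I are illustrative assumptions. With these choices the posterior mean coincides with an ordinary generalized ridge estimator with K = Ω^{-1}.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 0.0, -2.0])
y = X @ beta_true + rng.standard_normal(n)

Sigma = np.eye(n)                 # covariance of y_n (illustrative)
beta0 = np.zeros(p)               # prior mean beta_o of beta
Omega = 10.0 * np.eye(p)          # prior covariance of beta

# Posterior mean under the Gaussian prior: the generalized Tikhonov /
# ridge form (X' Sigma^-1 X + Omega^-1)^-1 (X' Sigma^-1 y + Omega^-1 beta_o).
Si = np.linalg.inv(Sigma)
Oi = np.linalg.inv(Omega)
beta_bayes = np.linalg.solve(X.T @ Si @ X + Oi, X.T @ Si @ y + Oi @ beta0)
```

With Σ = I, β_o = 0 and Ω = 10 I, this is exactly the ridge estimator with penalty matrix 0.1 I, which is the connection between the Bayesian and the classical Tikhonov viewpoints that the summary alludes to.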

Figure 1. The diagram of ADRs versus α_1 and α_2 for the simulated data set.

Then, the generalized restricted ridge estimator can be verified for the sparse (restricted) model. Results of the Monte Carlo simulation for different γ, n = 25, p = 30, 50 and 70 were presented in Tables 1 to 4 and Figure 1. From these tables, it is realized that β̂_S(K_n^(p1)) performs best.

Table 1. Evaluation of the proposed estimators for γ = 0.75.

Table 2. Evaluation of the proposed estimators for γ = 0.90.

Table 3. Evaluation of the proposed estimators for γ = 0.99.

Table 4. Optimal values of shrinkage parameters for different p and γ values.
Figure 2. The diagram of ADR of SRE versus α_1 and α_2 for the real data set.

Table 5. Evaluation of the proposed estimators for the real data set.
Figure 3. The diagram of ADRs versus α_1 and α_2 for the real data set.