Variable selection and structure identification for ultrahigh-dimensional partially linear additive models with application to cardiomyopathy microarray data

In this paper, we introduce a two-step procedure, in the context of ultrahigh-dimensional additive models, to identify nonzero and linear components. We first develop a sure independence screening procedure based on the distance correlation between predictors and marginal distribution function of the response variable to reduce the dimensionality of the feature space to a moderate scale. Then a double penalization based procedure is applied to identify nonzero and linear components, simultaneously. We conduct extensive simulation experiments to evaluate the numerical performance of the proposed method and analyze a cardiomyopathy microarray data for an illustration. Numerical studies confirm the fine performance of the proposed method for various semiparametric models.


Introduction
Suppose we have a random sample (y i , x i1 , . . ., x ip ), 1 ≤ i ≤ n, where y i is the response variable and (x i1 , . . ., x ip ) is a p-dimensional covariate vector.Consider the partially linear additive model (PLAM) where S 1 and S 2 are mutually exclusive and complementary subsets of {1, ..., p}, {β j : jϵS 1 } are the regression coefficients of covariates, {f j : jϵS 2 } are unknown smooth functions and the model error ε has conditional mean zero and finite variance σ 2 .To ensure identifiability of the nonparametric functions, we assume that E[f j (X j )] = 0 for jϵS 2 .Estimation and variable selection for PLAM have been well studied in literature, for example, [16,12,7,1,17].The use of model ( 1) is based on the assumption that the linear and nonlinear parts are known in advance.However, such prior information is usually unavailable, especially when the number of covariates is large.Thus, in addition to distinguish nonzero components, it is of great interest to develop some efficient methods to identify linear components from nonlinear ones.
Zhang et al. [21] studied the model selection using two penalties, simultaneously, to identify the zero and linear components in PLAM.They did not prove any selection consistency results for general partially linear models.
Motivated by this, Huang et al. [9] proposed a semiparametric regression pursuit method for distinguishing linear from nonlinear components using a group MCP penalty.Lian [13] provided a way to determine linear components by using SCAD penalty.This was a new usage of SCAD in which no variable selection is performed.Lian [14] successfully identified nonzero and linear components of model (1) by applying a two-fold SCAD penalty in the additive regression.Lian et al. [15] proposed another two penalty procedure in high dimensional setting using adaptive group LASSO in which insignificant predictors and parametric components were simultaneously identified.
However, with the rapid development of data collecting technologies, the PLAM often face the challenge of ultrahigh-dimensionality.The aforementioned regularization methods may not perform well for ultrahighdimensional data due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability ( [4]).Thus a natural question is how to identify the truly relevant predictors and parametric components in such ultrahigh-dimensional PLAM.
In this paper, this question is addressed on the basis of a two-stage procedure: (1) reducing the dimension from ultrahigh to moderate using a fast and efficient independence screening procedure; (2) identify nonzero and linear components from the screened submodel by a double penalization based procedure.Regrading the first stage, Fan and LV [3] first advocated the sure independence screening (SIS) method for the ultrahigh-dimensional linear models based on the Pearson correlation learning.Many authors further developed the SIS method and applied it to various statistical models, such as generalized linear models ( [4,6]), nonparametric additive models (NIS, [5]).Furthermore, in order to avoid the specification of a particular model structure, Zhu et al. [23] proposed a sure independent ranking and screening (SIRS) procedure for ultrahigh-dimensional data in the framework of the general multi-index models.Thereafter, a model-free SIS based on the distance correlation (DC-SIS) was developed by Li et al. [10].Li et al. [11] proposed a robust rank correlation screening (RRCS) method based on the Kendallcorrelation coefficient between response and predictors.
We propose a more robust approach, called robust distance correlation sure independence screening (RDC-SIS), to reduce dimensionality which ranks each covariate through its distance correlation with the marginal distribution function of the response variable.This method is model-free and we can expect that the procedure works well for skew or heavy tailed response variable.This procedure is a modification of DC-SIS proposed by Li et al. [10] in which Y is replaced by F (Y ).In the second stage, a double penalization based method is applied to refine the screened submodel and identify nonzero and linear components, simultaneously.
The rest of this paper is organized as follows.In Section 2, a modification of DC-SIS procedure is introduced for dimension reduction.Afterwards, a doubly penalized estimation method is explained in details in Section 3. We just focus on the SCAD penalty ( [2]) but other penalties such as the LASSO ( [19]) and MCP ( [20]) could also be applied.In section 4, simulation studies are carried out to assess the performance of the proposed method and to compare it with some existing methods.A real data example is used for illustration in Section 5.

Screening Procedure (RDC-SIS)
At the start of the analysis, we do not know which component functions are linear or actually zero and thus the following general additive model is used initially: ( We consider the problem of nonlinear variable screening in ultrahigh-dimensional feature space.The goal is to rapidly reduce the dimension of the covariate space p to a moderate scale via a computationally convenient procedure.We propose a robust feature screening procedure for model (2) using distance correlation between predictors and marginal distribution function of response variable.A review of distance correlation is as follows.Sz'ekely, Rizzo, and Bakirov [18] introduced distance correlation as a measurement of dependence between two random vectors.The distance correlation between random vectors U and V with finite first moments is a nonnegative value which is defined by if dcov(U, U )dcov(V, V ) > 0, and equals 0 otherwise.They stated that dcov 2 (U, V ) = S 1 + S 2 − 2S 3 , where S j , j = 1, 2, and 3, are defined as: and ( Ũ , Ṽ ) is an independent copy of (U, V ).In the category of model-free feature screening procedures for ultrahigh-dimensional setting, Li et.al [10] developed a sure independence screening method (DC-SIS) based on the distance correlation (DC-SIS) and showed that the DC-SIS has the sure screening property.Two remarkable properties of the distance correlation motivate them to use it in a feature screening procedure.The first one is the relationship between the distance correlation and the Pearson correlation coefficient.For two univariate normal random variables U and V , with the Pearson correlation coefficient ρ, Szekely, Rizzo, and Bakirov [18] showed that dcorr(U, V ) is strictly increasing in |ρ|.This property implies that the distance correlation based feature screening procedure is equivalent to the marginal Pearson correlation learning for linear regression with normally distributed predictors and random error.The second remarkable property of the distance correlation is that dcorr(U, V ) = 0 if and only if U and V are independent.Distance correlation has properties of a true dependence measure, analogous to Pearson correlation ρ.It has the advantage that it can detect nonlinear relationships which are ignored by marginal correlation.For more details about distance correlation, see Szekely, Rizzo, and Bakirov [18].
Our new measure of correlation between Y and X k is proposed by substituting F (Y ) instead of Y , i.e., marginal utility measure for predictor ranking is as: where F (Y ) is the marginal distribution function of Y .We note that two univariate random variables U and V are independent if and only if U and h(V ), a strictly monotone transformation of V , are independent.This implies that dcorr(X, F (Y )) = 0 if and only if X and Y are independent.Furthermore, when the response is the skew or heavytailed, it can be expected that this procedure has a good performance.This screening procedure is a model-free in which one does not need to specify a model structure between the predictors and the response.
It is desirable to derive an estimator of ω k based on the n independent and identical observations.Suppose that we have a random sample, (X i , Y i ) n i=1 , from the nonparametric additive model (2).We estimate S 1 , S 2 , and S 3 through the usual moment estimation as follows: Similarly, we can define the sample distance covariances dcov(X k , X k ) and dcov(F (Y ), F (Y )).Accordingly, the sample distance correlation between X k and F (Y ) can be defined by .
Thus we estimate ω k with ωk = dcorr(X k , F (Y )).We select a set of important predictors with large ωk .That is, the screened sub-model be determined by, where ν n is a pre-determined positive number.This procedure reduces the dimensionality from p to a possibly much smaller space with model size d = | Mνn |.

Variable Selection and Structure Identification
In where two penalties p λ1 (.) and p λ2 (.) are used to identify the zero and the linear coefficients, respectively, with two regularization parameters λ 1 and λ 2 , and A j and There is some flexibility in choosing A j and D j but one requirement is that is a linear function, so that the two penalties can be used to identify zero and linear components, respectively.One natural choice is and Then (6) can be written in matrix form as For later use we denote the objective function on the right hand side (7) as Q(b).There are different way to specify the penalty functions, but here we only focus on the SCAD penalty function, defined by its first derivative with a > 2 and P ′ a,λ (0) = 0. We use a = 3.7 as suggested in [2].To find the minimum of (7) for fixed tuning parameters, we use the iterative local quadratic approximation (LQA) proposed in [2].Using a simple Taylor expansion, given an initial estimate b 0 j , if ∥b j ∥ Aj > 0 and ∥b j ∥ Dj > 0, we approximate the penalty terms by and After removing some irrelevant terms, the criterion becomes for two dK × dK matrices Ω 1 and Ω 2 which are defined by and Note that ( 9) is a quadratic function and thus there exists a closed-form solution.Then the updating equation given the current estimate b The algorithm repeatedly solves the minimization criterion ( 9) and updates b (m) to b (m+1) , m = 0, 1, ... until the convergence is satisfied.That is, in the m-th iteration, we solve (9), where Ω 1 and Ω 2 are as above but with b 0 j replaced by the current estimate b (m) j . The solution obtained from ( 9) is the new estimate b (m+1) .During the iterations, as soon as some ∥b j ∥ Aj (respectively, ∥b j ∥ Dj ) drops below a certain threshold (10 −6 in our implementation), the component is identified as a zero function (respectively, linear function).

Simulation Studies
In this section, some simulation studies have been conducted to asses the finite sample performance of our methods.We first consider three models to illustrate our proposed screening procedure (RDC-SIS).The performance of the RDC-SIS is then compared with the existing competitors, such as DC-SIS ( [10]), SIRS ( [23]), SIS ( [3]) and NIS ( [5]).
To evaluate the performance of proposed mehod, three criteria are considered.The first criterion is the minimum model size (denoted by M), that is the smallest number of covariates needed to ensure that all the active variables are selected.To get better inference, the 5%, 25%, 50%, 75% and 95% quantiles of M out of 200 replications were also presented.The second criterion (denoted by P j ) is the empirical probability that the active covariate X j is selected, when the threshold d = 2 [ n/log(n) ] is adopted.The last criterion is the proportion (denoted by S) of truly active predictors that are identified by the screening procedure.Note that the first criterion does not need to specify a threshold.The more reliable screening procedure, the closer M value to the number of active predictor and also the closer S and P j value to 1.
We also carry out some Monte Carlo studies to assess the effectiveness of our two stage proposed method to separation of the linear and nonlinear components and to identify insignificant covariates simultaneously in partial linear additive models of non-polynomial (NP) dimensionality based on double penalization.
It is also needed to find a data-driven procedure to choose the regularization parameters λ 1 and λ 2 .We use the BIC-type criterion which is defined as log( where bλ is the minimizer of ( 7) for given λ = (λ 1 , λ 2 ), d 1 is the number of nonparametric components and d 2 is the number of parametric components, both for the given λ.
From Table 1, when the random error has a normal distribution, for both cases c = 1, and c = 2, all four screening methods perform quit well.However, it is consistently superior for the heavy-tailed error distribution.When the error has a t-distribution with one degree of freedom, however, the performance of DC-SIS nad SIS quickly deteriorates, while our method continues to perform well.DC-SIS do not work well in identifying active covariates.Compared with the SIRS method, the performances of the RDC-SIS and SIRS are equally good in all the considered scenarios; both of them deliver more satisfactory results than the DC-SIS and SIS procedures.The SIS has little chance to identify the important predictors.
Table 2 indicates that, for all types of distribution error, the performance of the proposed RDC-SIS is very well and outperform other methods.All P j and S of the RDC-SIS is equal 1.Thus, all active predictors can perfectly be selected into the resulting model across all three different error distributions.We can see that the DC-SIS procedure is comparable to the RDC-SIS method for all types of distribution error.
In model 3, where the model has a more complex structure, RDC-SIS is effective in identifying the number of active variables for all types of errors; while DC-SIS and NIS fail to identify some important predictors.The NIS has no chance for to identify X 3 and little chance to identify other active covariates.RDC-SIS and SIRS have similar performances and are equally well.Both of them outperform the NIS and DC-SIS procedures.Example 2. In this example, we first apply the RDC-SIS method to reduce dimensionality, and then fit a partial linear model (PLAM) where two penalty is used to simultaneously identify nonzero and linear components.We generated data from the model where Thus the true number of nonparametric components is 2 and the true number of linear components is 3.
To generate covariates, we let X j be marginally standard normal with correlations given by cov(X i , X j ) = 0.5 |i−j| .
To illustrate the efficiency of our proposed method, the following two different error distributions are considered: standard normal distribution N(0, 1) and t-distribution with three degree of freedom, t 3 , which is used to produce heavy-tailed distribution.We performed simulations with n = 200, 400, p = 1000, 2000 and the results are summarized in Table 4.
We used several criterion to measure the model identification performance: "NN": average number of nonlinear components selected; "NNT":average number of nonlinear components selected that are truly nonlinear; NL: average number of linear components selected; "NLT":average number of linear components selected that are truly linear.The numbers in parenthesis are the corresponding standard errors.The simulation results indicate that the proposed two-stage method is effective in estimating and identifying nonzero components as well as linear components from nonlinear ones simultaneously.

Data Analysis
To illustrate the usefulness of the suggested strategies for ultrahigh-dimensional data in the semiparametric regression model, we consider cardiomyopathy microarray data.This data set has attracted considerable attention and been systematically investigated by many researchers.They aim to identify the influential genes that affect the overexpression of a G protein-coupled receptor, called Ro1, in mice.The Ro1 expression level was measured for 30 specimens, and the predictors, the genetic expression levels, were obtained for p = 6319 genes.This data set has been studied by many researchers.Hall and Miller [8] showed that both genes Msa.2877.0 and Msa.1166.0 are particularly important using the generalized correlation.Li et al. [10] used the DC-SIS procedure that ranks two genes, Msa.2134.0 and Msa.2877.0, at the top.Li et al. [11] showed that Msa.1166.0 and Msa.7019.0 are particularly important using the RRCS procedure.The NIS procedure in [5] ranks two genes, labeled as Msa.2877.0 and Msa.1166.0, at the top.The RDC-SIS procedure ranks two genes, Msa.2134.0 and Msa.2877.0, in the top, which is similar to DC-SIS and SIRS of Zhu et al. [23].The scatter plots of Y versus these two gene expression levels with cubic spline fit curves in Figure 1 indicate clearly the existence of nonlinear patterns.It can be noted that the distance correlation has the advantage that it can detect nonlinear relationships which are ignored by marginal correlation.Our proposed RDC-SIS procedure shows the advantage that it detected two important genes having nonlinear relationships with Ro1, which might be ignored by some other methods.
A natural question arises: which screening procedure does perform better in terms of ranking?Following Li et al. [11], to compare the performance of these procedures, we fit three different additive models as follows: where X k1 and X k2 are the top two genes, g k1 and g k2 are two unknown link functions, ε is an error term.We fit g k1 and g k2 by using the "gam" function in the R "mgcv" package, where "gam" can be used to fit a generalized additive model to data.We also measure the performance of goodness of fit by the adjusted R 2 values and the explained deviance, where deviance implies the proportion of the null deviance explained by the proposed model, with a larger value indicating better performance.The RDC-SIS, corresponding to k = 1, regards Msa.2134.0 and Msa.2877.0 as the two predictors, while the generalized correlation ranking proposed by Hall and Miller [8], corresponding to k = 2, regards Msa.2877.0 and Msa.1166.0 as predictors in the above model.Also the RRCS proposed by Li et al. [11], corresponding to k = 3, regards Msa.1166.0 and Msa.5758.0 as the two predictors.The RDC-SIS method clearly achieves better performance with the adjusted R 2 of 96.8% and the deviance explained of 98.3%, in contrast to the adjusted R 2 of 84.5% and the deviance explained of 86.6% for the generalized correlation ranking method, and the adjusted R 2 of 77.9% and the deviance explained of 81.5% for the RRCS method.
From the above, we can conclude that the RDC-SIS is an efficient method for dimension reduction in ultrahigh dimensional data.For variable selection and structure identification in cardiomyopathy dataset, we first applied RDC-SIS to reduce the covariate dimension to the size of 3[n/ log(n)] = 24.After cleaning, two-fold SCAD penalty was used to identify parametric and non parametric parts.We have identified 4 genes of linear effects and 13 genes of nonlinear effects.The genes of linear effects are Msa.1920.0,Msa.2877.0,Msa.5595.0 and Msa.741.0.

Conclusion
In this paper, we developed a two stage procedure for variable selection and structure identification in ultrahighdimensional partially linear additive models.In the first stage, a sure independence procedure was used to dimension reduction from ultrahigh to moderate scale.This procedure ranks covariates trough their distance correlation with marginal distribution function of response variable.The proposed methodology has been supported by numeric examples and a real data analysis.This method is model-free and is robust for skew or heavy tailed response variable.In second stage, in order to distinguish linear and nonlinear parts and to identify insignificant covariates simultaneously, we used a double penalization based procedure.
Zhao and Li (2012)ning procedure is proposed to reduce the model size from a very large value p to a moderate scale d by specifying sensible threshold parameters ν n , although it is difficult to choose in practice.A practical way is to select the top d variables by ranking marginal utilities.The choice of d plays a very important role in the screening stage.Fan and Lv[3]recommended d = [?n/log(n)] as a sensible choice.Zhao and Li (2012)proposed an approach to select d for Cox models by controlling false positive rate.A larger value of the specified d would give a greater chance to include inactive variables.This can be solved by a penalty-based variable selection procedure given below.Now, suppose that d variables are selected in the screening stage.Consider a joint nonparametric additive model Y = ∑ d j=1 f j (X j ) + ϵ.B-spline basis is used to approximate each of unknown smooth functions, i.e., f j (x) ≈ ∑ K k=1 b jk B jk (x) for j = 1, ..., d.We use the two-fold penalization procedure to automatically identify different types of components, i.e., the coefficient b

Table 1 .
Five quantiles of minimum model size M , the empirical probability P j and the proportion of S in Model 1.

Table 2 .
Five quantiles of minimum model size M , the empirical probability P j and the proportion of S in Model 2.

Table 3 .
Five quantiles of minimum model size M , the empirical probability P j and the proportion of S in Model 3.

Table 4 .
Model identification results for Example 2.