Alternative Robust Variable Selection Procedures in Multiple Regression

Most commonly used linear regression variable selection techniques are affected by the presence of outliers and high leverage points and can often produce misleading conclusions. This article proposes robust variable selection methods in which suspected outliers and high leverage points are first identified by regression diagnostic tools, and the best variables are then selected after diagnostic checking. The performance of the proposed methods is compared with that of the classical non-robust criteria and the existing robust criteria via simulations. Furthermore, the Hawkins-Bradu-Kass data set is analyzed for illustration.


Introduction
Variable selection is one of the important topics in regression modeling and has gained the interest of many authors. Besides the common stepwise deletion and subset selection methods, others have proposed a penalized likelihood approach [1]. Recently, [2] considered the problem of variable selection for ultrahigh-dimensional additive models, and [3] employed the firefly algorithm to select variables in count data regression.
Consider the multiple linear regression model, given in the form
\[ y_i = x_i^T \beta + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1) \]
where x_i = (x_{i1}, ..., x_{ip})^T is a vector containing p explanatory variables, y_i is the response variable, β is a vector of p parameters, and ε_i is the error component, independent and identically distributed (iid) with mean 0 and variance σ^2. There are different criteria for model selection; the most common are Mallows' Cp [4], the Schwarz criterion (SIC) [5], and the Akaike information criterion (AIC) [6], which can be written in the general form
\[ G(\mathrm{SSE}) + c\,p, \qquad (2) \]
where G(SSE) is a function of the sum of squared errors, SSE = \sum_{i=1}^{n} r_i^2, with residual r_i = y_i - x_i^T \hat\beta, and c is a constant. G(SSE) equals SSE/\hat\sigma^2, n\log(SSE/n), and n\ln(SSE/n) for Cp, SIC, and AIC, respectively, where n is the sample size. In the classical criteria, \hat\beta is the OLS estimator corresponding to the traditional square loss, defined by
\[ \hat\beta_{OLS} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^T\beta\right)^2. \qquad (3) \]
Indirect approaches to robust model selection use residuals r_i from a robust fit. [7], [8], and [9] proposed robust versions of AIC, SIC, and Cp (RAIC, RSIC, and RCp), in which the residuals r_i follow a robust M-estimator, obtained by choosing \hat\beta to minimize \sum_i \rho(r_i), where β is the vector of p parameters of the linear regression model and ρ(·) is a function less sensitive to outliers than the square, yielding the estimating equation
\[ \sum_{i=1}^{n} \psi(r_i)\, x_i = 0, \qquad \psi = \rho'. \qquad (4) \]
M-estimators are efficient and highly robust to unusual values of y, but one rogue leverage point can break them down completely.
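As a quick illustration, the following R sketch computes the three classical criteria (up to additive constants that do not affect model ranking) for one candidate model. The data frame `d` and its columns are hypothetical, and \hat\sigma^2 is taken from the full model, one common convention for Cp.

```r
# Minimal sketch (hypothetical data frame `d` with columns y, x1, x2, x3).
fit_full <- lm(y ~ x1 + x2 + x3, data = d)   # full model supplies the scale for Cp
sigma2   <- summary(fit_full)$sigma^2

fit <- lm(y ~ x1 + x2, data = d)             # one candidate model
n   <- nrow(d)
p   <- length(coef(fit))                     # number of fitted parameters
SSE <- sum(residuals(fit)^2)

Cp  <- SSE / sigma2 - n + 2 * p              # Mallows' Cp
AIC <- n * log(SSE / n) + 2 * p              # AIC-type value
SIC <- n * log(SSE / n) + p * log(n)         # SIC-type value
```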
In recent years, a good deal of work on direct approaches to outlier identification has focused on single-case diagnostics (see [10], [11], and [12]). A general idea in outlier diagnostics is to form a clean subset of the data that is free of outliers. Let R be the set of indices of the observations in the clean subset, y_R and X_R the subsets of observations indexed by R, and \hat\beta_R the regression coefficients estimated by fitting the model to the set R. Let SSE_R be the corresponding residual sum of squares, so that the estimates correspond to the clean sample having the smallest sum of squared residuals. As such, as expected, the breakdown point can be as high as 50%. When R contains all n observations, \hat\beta_R = \hat\beta_{OLS}. This study suggests using SSE_R in different model selection criteria.
The paper is organized as follows. Section 2 illustrates the extreme sensitivity of existing robust model selection criteria to leverage points. Section 3 introduces the diagnostic variable selection criteria, and Section 4 derives their breakdown point. Section 5 reviews popular regression diagnostic tools used to identify the clean subset. Section 6 presents the results of our simulation study and a real-data example, and Section 7 concludes the study.

Robust model selection criteria
The influence of leverage points on RAIC, RSIC, and RCp is illustrated through the presence of outliers in the x-direction. For simplicity, a set of independent uniform random variables x on [-2, 2] was generated according to the simple regression model
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \]
where the ε_i are iid normally distributed with expectation 0 and variance 0.1^2. The data are presented in Table 1 and Figure 1.
For this, a point with coordinates (x_{10}, 0) is added; Figure 2 shows the situation. Figure 3 shows that the values of the different criteria increase as the size of the contamination in x_{10} increases, as expected, and when x_{10} is extremely large, the values of RAIC, RSIC, and RCp change dramatically, which means the straight-line model is rejected. This shows the high sensitivity of the RAIC, RSIC, and RCp selection procedures. Indeed, a very small change in the observation at x = 0 already has the effect of changing the selected model.
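This sensitivity can be reproduced with a few lines of R. The sketch below is illustrative only: the sample size, seed, and regression coefficients are assumptions, and the exact RAIC of [7] is replaced by an AIC-type value computed from M-estimation residuals.

```r
library(MASS)                        # rlm(): M-estimation
set.seed(1)
n <- 30                              # assumed sample size
x <- runif(n, -2, 2)
y <- 1 + 2 * x + rnorm(n, sd = 0.1)  # assumed true line

i0 <- which.min(abs(x))              # the observation closest to x = 0
for (shift in c(0, 5, 20, 100)) {
  xc     <- x
  xc[i0] <- xc[i0] + shift           # contaminate in the x-direction only
  rfit   <- rlm(y ~ xc, maxit = 100)
  SSE    <- sum(residuals(rfit)^2)
  cat("shift =", shift, " AIC-type value =", round(n * log(SSE / n) + 4, 2), "\n")
}
```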

The diagnostic version of model selection criteria
Consider the diagnostic sum of squared errors SSE_R, computed from the diagnostic-OLS (OLS_R) estimator defined as
\[ \hat\beta_{OLS_R} = \arg\min_{\beta} \sum_{i \in R} \left(y_i - x_i^T\beta\right)^2. \qquad (5) \]
Replacing SSE in equation (2) by SSE_R, the new robust criteria AIC_R, Cp_R, and SIC_R can be expressed as
\[ AIC_R = n \ln(SSE_R/n) + 2p, \qquad (6) \]
\[ Cp_R = SSE_R/\hat\sigma^2 - n + 2p, \qquad (7) \]
\[ SIC_R = n \log(SSE_R/n) + p \log n. \qquad (8) \]
Therefore, OLS_R amounts to finding the clean subset R of observations whose least squares fit produces the lowest sum of squared residuals, and it has a high breakdown point: it is resistant to outliers, including leverage points. In equation (5), the estimates correspond to the R observations having the smallest sum of squared residuals. This is the most direct implementation of the idea that one wants to find the model which fits the majority of the data best. Moreover, the distributional properties of OLS residuals are much better understood than those of robust-fit residuals.
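A minimal R sketch of the substitution, assuming `clean` holds the index set R returned by the diagnostics of Section 5 (all object names are illustrative):

```r
# Fit the candidate model on the clean subset only and substitute SSE_R for SSE.
fit_R <- lm(y ~ x1 + x2, data = d, subset = clean)
SSE_R <- sum(residuals(fit_R)^2)
p     <- length(coef(fit_R))

AIC_R <- n * log(SSE_R / n) + 2 * p          # equation (6)
Cp_R  <- SSE_R / sigma2 - n + 2 * p          # equation (7); sigma2 from a clean full fit
SIC_R <- n * log(SSE_R / n) + p * log(n)     # equation (8)
```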

Breakdown point
Definition. The breakdown point of an estimator \hat\beta of the parameters β is the largest amount of contamination that the data may contain such that \hat\beta still gives some information about β. In other words, the breakdown point of an estimator \hat\beta shows the effect of replacing several data values by outliers [17]. Hence, the breakdown point of the regression estimator \hat\beta for the sample Z = (X, y) can be defined as
\[ \varepsilon^*(\hat\beta, Z) = \min\left\{ \frac{m}{n} : \sup_{\tilde Z_m} \big\| \hat\beta(\tilde Z_m) \big\| = \infty \right\}, \]
where \tilde Z_m are contaminated data obtained from Z by replacing m of the original n data points by outliers. We obtain the following result for the breakdown point of the OLS_R estimator.

Remark 1. The breakdown point of the diagnostic-OLS estimator \hat\beta_{OLS_R} with subset size R ≤ n is given by
\[ \varepsilon^*\big(\hat\beta_{OLS_R}, Z\big) = \frac{n - R + 1}{n}. \]

Applying Remark 1 to the OLS (n − R = 0) yields a finite-sample breakdown point of 1/n. Hence, only one outlier can already drive the OLS estimate to infinity; this fragility of the classical OLS comes from the use of squared residuals. Replacing the square by another convex loss function does not solve the problem, whereas restricting the fit to the clean subset, as we have done with the diagnostic-squared residuals, does, and results in a breakdown point of (n − R + 1)/n. With smaller values of R the breakdown point is higher, and by making R small enough (n − R big enough) it may even exceed 50%. Instead, we suggest using this method with data containing no more than 25% outliers, and then taking a value of R equal to 75% of the sample size, so that the diagnostic estimate is based on a sufficiently large number of observations. This retains high efficiency, as will be shown in the numerical examples (simulations).
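For illustration, take n = 100 and the suggested subset size R = 0.75n = 75; Remark 1 then gives
\[ \varepsilon^* = \frac{n - R + 1}{n} = \frac{100 - 75 + 1}{100} = 26\%, \]
so roughly a quarter of the sample may be replaced by arbitrary outliers before the diagnostic-OLS estimate can be carried off to infinity.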

Techniques used to identify vertical outliers and leverage points
In this section we introduce a different way of finding the clean subset R. The residual mean square is \hat\sigma^2 = \sum_i (y_i - \hat y_i)^2/(n - p), and the ordinary residual vector is defined as
\[ r = y - \hat y = (I - H)\,y, \qquad H = X(X^TX)^{-1}X^T, \]
where H is the hat, or leverage, matrix, a symmetric matrix that contains the information on the influence of the response values y on the corresponding fitted values \hat y_i = h_i^T y. Here h_i^T is the ith row of H, and the h_{ii} are its diagonal elements.
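In R these quantities are available directly from a fitted model; a brief sketch for an arbitrary lm fit `fit`:

```r
r      <- residuals(fit)         # ordinary residuals r = (I - H) y
h      <- hatvalues(fit)         # diagonal elements h_ii of the hat matrix H
n      <- length(r)
p      <- length(coef(fit))
sigma2 <- sum(r^2) / (n - p)     # residual mean square, SSE / (n - p)
```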
[10] suggested using the least median of squares (LMS) estimator, which minimizes the median of the squared residuals, to detect regression outliers. The method begins by computing the residuals r_i associated with the LMS regression together with the preliminary scale estimate
\[ s = 1.4826\left(1 + \frac{5}{n - p}\right)\sqrt{M_r}, \qquad (14) \]
where M_r is the median of r_1^2, ..., r_n^2 and p is the number of predictors. An observation i is then declared a regression outlier if it satisfies |r_i/s| > 2.5.
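A sketch of this rule in R, using lmsreg from the MASS package (the formula and data names are illustrative):

```r
library(MASS)                                   # lmsreg(): least median of squares
n   <- nrow(d)
p   <- 3                                        # number of predictors
lms <- lmsreg(y ~ x1 + x2 + x3, data = d)
r   <- residuals(lms)
s   <- 1.4826 * (1 + 5 / (n - p)) * sqrt(median(r^2))   # equation (14)
vertical_outliers <- which(abs(r / s) > 2.5)
```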
Later, [11] introduced potentials as a single-deletion measure of leverage:
\[ p_{ii} = x_i^T\left(X_{(i)}^T X_{(i)}\right)^{-1} x_i = \frac{h_{ii}}{1 - h_{ii}}, \qquad (15) \]
where X_{(i)} denotes X with the ith row deleted. Observation i is flagged as a leverage point when p_{ii} exceeds
\[ \mathrm{Med}(p_{ii}) + c \cdot \mathrm{MAD}(p_{ii}), \]
where MAD is the median absolute deviation, \mathrm{MAD}(p_{ii}) = \mathrm{Med}\{|p_{ii} - \mathrm{Med}(p_{ii})|\}/0.6745, and c is an appropriately chosen constant.
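Since p_ii = h_ii/(1 − h_ii), the potentials and the cutoff are easy to compute in R; note that mad() with its default constant 1.4826 matches Med{|p_ii − Med(p_ii)|}/0.6745:

```r
h   <- hatvalues(fit)
pii <- h / (1 - h)                               # potentials, equation (15)
cst <- 3                                         # one common choice for the constant c
leverage_points <- which(pii > median(pii) + cst * mad(pii))
```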
The well-known Mahalanobis distances MD_i have also been suggested as measures of leverage in the literature (e.g. [12]). Another study [14] reviewed different types of residuals for diagnostic purposes, of which the most commonly used are the studentized residuals, defined as
\[ t_i = \frac{r_i}{\hat\sigma_{(i)}\sqrt{1 - h_{ii}}}, \]
where \hat\sigma_{(i)} is the residual scale estimated with observation i deleted; observation i is termed an outlier if |t_i| > c, where c is a constant with 2 ≤ c ≤ 3. Following [15], DFFITS is introduced as
\[ DFFITS_i = t_i\sqrt{\frac{h_{ii}}{1 - h_{ii}}}, \]
and the authors recommend considering an observation influential if |DFFITS_i| ≥ 2\sqrt{p/n}. The quantity DFFITS is closely related to the well-known Cook's distance [16], which is defined as
\[ CD_i = \frac{r_i^2\, h_{ii}}{p\,\hat\sigma^2 (1 - h_{ii})^2}. \]
The relationship between CD_i and DFFITS_i is given by
\[ CD_i = \frac{\hat\sigma_{(i)}^2}{p\,\hat\sigma^2}\, DFFITS_i^2. \]
[13] suggested that values of h_{ii} less than 0.2 appear to be safe, values between 0.2 and 0.5 are risky, and values greater than 0.5 should, if possible, be avoided by controlling the design matrix.
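All of these single-case diagnostics have base R extractors; a sketch for an lm fit `fit` (the cutoff constants follow the text):

```r
t_i <- rstudent(fit)               # externally studentized residuals t_i
dff <- dffits(fit)                 # DFFITS_i
cd  <- cooks.distance(fit)         # Cook's distance CD_i
h   <- hatvalues(fit)              # leverages h_ii
n   <- length(t_i)
p   <- length(coef(fit))

outliers    <- which(abs(t_i) > 2.5)              # |t_i| > c with c = 2.5
influential <- which(abs(dff) >= 2 * sqrt(p / n)) # DFFITS rule of [15]
risky       <- which(h > 0.2)                     # [13]'s rule of thumb for h_ii
```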

Numerical examples
A simulation study is carried out to investigate the performance of the AIC_R, Cp_R, and SIC_R statistics for selecting the best variables in the regression model in equation (1), based on equations (6), (7), and (8).
The simulation study aims to compare the performance of the three proposed robust measures, AIC_R, Cp_R, and SIC_R, with the non-robust measures AIC, Cp, and SIC, as well as with the robust measures based on M-estimation, RAIC, RCp, and RSIC. In this simulation, 50 independent observations of three independent uniform random variables x_{i1}, x_{i2}, and x_{i3} on [-1, 1] and 50 independent normally distributed errors ε_i ∼ N(0, 1) were generated. The true model involves only the two variables x_{i1} and x_{i2}:
\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, \dots, 50. \]
In order to illustrate robustness to outliers, we consider the following cases:
1. Vertical outliers (outliers in y only),
2. Bad leverage points (outliers in some x only),
3. Good leverage points (outliers in both y and x).
For all cases we introduce outliers into the data with varying percentages of contamination (0%, 5%, 10%, and 20%), drawn from a N(50, 0.1^2) distribution. For each of these settings we simulate 1000 samples. We use the LMS rule given in equation (14) and the potentials given in equation (15) to identify the vertical outliers and leverage points, respectively. The simulations were performed in R, using the function lmsreg from the MASS package for the LMS fit. Tables 2 to 4 show the following results (a sketch of one replication is given after this list):
1. The classical methods AIC, Cp, and SIC work better than the robust methods for data without outliers.
2. When the percentage of vertical outliers increases from 5% to 20%, the classical methods tend to underfit (selecting only x_{i1} or x_{i2}). By contrast, the variable selection methods with the diagnostic tool (AIC_R, Cp_R, and SIC_R) perform well, with reasonably good power, while the robust methods based on M-estimation (RAIC, RCp, and RSIC) continue to perform well up to 20% contamination. Nonetheless, the proposed methods AIC_R, Cp_R, and SIC_R perform well compared to RAIC, RCp, and RSIC on both uncontaminated and contaminated regression data.
3. In the presence of bad leverage points, the model selection criteria based on OLS and M-estimation often overfit or select a wrong model. Interestingly, the methods based on the diagnostic tool tend to identify the true model correctly more often.
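As promised above, here is a sketch of one replication under case 1 with 5% vertical outliers; the seed and the unit coefficients of the true model are assumptions made for illustration.

```r
library(MASS)                                 # lmsreg()
set.seed(123)
n  <- 50
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1); x3 <- runif(n, -1, 1)
y  <- x1 + x2 + rnorm(n)                      # true model: x1 and x2 only (coefficients assumed)

m  <- round(0.05 * n)                         # 5% vertical outliers from N(50, 0.1^2)
y[sample(n, m)] <- rnorm(m, mean = 50, sd = 0.1)

lms   <- lmsreg(y ~ x1 + x2 + x3)             # identify vertical outliers via equation (14)
r     <- residuals(lms)
s     <- 1.4826 * (1 + 5 / (n - 3)) * sqrt(median(r^2))
clean <- which(abs(r / s) <= 2.5)             # the clean subset R

fit_R <- lm(y ~ x1 + x2, subset = clean)      # one candidate model on the clean subset
SSE_R <- sum(residuals(fit_R)^2)
AIC_R <- n * log(SSE_R / n) + 2 * length(coef(fit_R))   # equation (6)
```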
The simulation results above illustrate that the proposed methods (AIC_R, Cp_R, and SIC_R) yield selection power and correct-fit rates that compare favorably with those obtained from the classical criteria and from RAIC, RCp, and RSIC, in the presence of both vertical outliers and leverage points.

Example
In this example, the Hawkins-Bradu-Kass data set is used. These data are available in the R package wle as data(artificial). The artificial data set contains 75 observations, with 10 outliers (cases 1 to 10) and 14 high leverage points (cases 1 to 14). Scatter plots of y against each of the three x's, shown in Figure 4, suggest that the data are highly concentrated. All 2^3 = 8 possible models, formed from every combination of these covariates, were fitted, and the values of the various model selection methods were computed for each model (see Tables 5 to 7). The best three selected models based on each version of the AIC, Cp, and SIC methods are given in Table 8.
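A sketch of the model search in R. Since the wle package is archived on CRAN, the same Hawkins-Bradu-Kass data are loaded here from robustbase::hbk instead; this substitution, and the use of the classical AIC for brevity, are assumptions, not the paper's original workflow.

```r
library(robustbase)
data(hbk)                                     # 75 observations; columns X1, X2, X3, Y

vars    <- c("X1", "X2", "X3")
subsets <- unlist(lapply(1:3, function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
subsets <- c(list(character(0)), subsets)     # include the intercept-only model

for (s in subsets) {
  f   <- if (length(s)) reformulate(s, "Y") else Y ~ 1
  fit <- lm(f, data = hbk)
  n   <- nrow(hbk); p <- length(coef(fit)); SSE <- sum(residuals(fit)^2)
  cat(format(deparse(f), width = 22),
      "AIC =", round(n * log(SSE / n) + 2 * p, 2), "\n")
}
```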
We observe from the results that all of the commonly used model selection measures fail to select the best variables. Tables 5 to 7 present the commonly used model selection criteria AIC, Cp, and SIC together with the robust RAIC, RCp, and RSIC methods and with AIC_R, Cp_R, and SIC_R. It is clear from the results presented in these tables that the variables selected by the classical selection methods are not correct. Although the robust model selection criteria based on M-estimation improve on the classical ones, the diagnostic-based criteria identify the best variables most reliably (Table 8).

Table 8. Hawkins-Bradu-Kass data: the best variables selected from the best three models based on the classical criteria, the robust criteria with M-estimation, and the robust criteria using a deletion estimate of the scale.

Conclusion
Diagnostic regression measures are robust regression methods that are frequently used in practice. Nevertheless, they are not commonly used in model selection. This paper introduced variable selection criteria based on a diagnostic scale, which are robust against outliers and leverage points. The simulation results illustrated the good performance of the proposed diagnostic variable selection criteria.