Hybridized Support Vector Machine and Recursive Feature Elimination with Information Complexity

In statistical data mining research, datasets are often both nonlinear and high-dimensional, which makes them difficult to analyze comprehensively with traditional statistical methodologies. In this paper, a novel wrapper method called SVM-ICOMP_PERF-RFE is introduced and developed. It hybridizes the support vector machine (SVM) and recursive feature elimination (RFE) with an information-theoretic measure of complexity (ICOMP) in order to classify high-dimensional data sets and to carry out subset selection in the original data space, finding the subset of features that best discriminates between the groups. RFE ranks the features based on the ICOMP criterion. ICOMP plays an important role not only in choosing an optimal kernel function from a portfolio of kernel functions, but also in selecting important subset(s) of features. The potential and flexibility of our approach are illustrated on two real benchmark data sets: the ionosphere data, which consist of radar returns from the ionosphere, and the aorta data, which are used for the early detection of atheroma, the condition that most commonly results in heart attack. The proposed method is also compared with other RFE-based methods that use different measures (i.e., weight and gradient) for feature ranking.


Introduction
In many classification problems the input datasets are very high-dimensional, and finding the subset of the original input features or variables that contributes most to the separation of the classes or groups is a challenge. Feature selection is therefore a difficult combinatorial problem in machine learning, and it is of high practical importance in many applications.
Kernel-based methods have gained popularity for classification, clustering, and regression analysis in machine learning since the introduction of the support vector machine (SVM) during the early 1990s. After obtaining support vectors (SVs) to classify a data set, questions such as "How do we know which features are more responsible for, and important to, the classification?" have often been raised. This is because the mapping in SVM is not one-to-one and onto. The application of a kernel function is thus a non-invertible process, and there is no way to go from the feature space back to the original space. Because of this geometry, SVM does not lend itself easily to automated internal selection of relevant features. Hence, algorithms for feature selection play an important role in SVM.
In the machine learning literature, as discussed in detail in [11], there are two main approaches to the feature selection problem: (a) the filter approach, and (b) the wrapper approach. The two approaches differ in the way they evaluate a given feature subset. The filter method uses some relevance measure that is independent of the performance of the learning algorithm. In the wrapper method, on the other hand, each feature subset is evaluated together with the classifier. That is, the features are evaluated by estimating the generalization performance (i.e., the expected risk) of the learning machine.
In this paper, the wrapper method called SVM-ICOMP_PERF-RFE is considered and emphasized. It combines recursive feature elimination, developed by [12] as a feature selection method designed especially for SVM, with an information-theoretic measure of complexity (ICOMP) criterion. In the usual RFE, backward feature elimination is performed to find, say, m features which lead to the largest margin of class separation. This combinatorial problem is solved in a greedy fashion. In the two-class case, the RFE algorithm begins with the set of all features and successively eliminates the feature which induces the smallest change, based on sensitivity analysis, in an appropriately defined cost function that is a measure of predictive ability (and is inversely proportional to the margin). That is, at each step the RFE algorithm eliminates the feature whose removal keeps this quantity small, assuming that the change in the set of support vectors when removing only one feature is negligible.
An information-theoretic measure of complexity (ICOMP) criterion [3,4,5,6,7] is used as an effective measure in the RFE ranking of the features. ICOMP plays an important role not only in choosing an optimal kernel function from a portfolio of kernel functions, but also in selecting important subset(s) of features. It takes into account both the "badness of fit" (or the "lack of fit") and the model complexity at the same time in one criterion function.
The proposed method is compared with two other RFE-based methods [12,32,10] on two real benchmark data sets.
The rest of this paper is organized as follows. In Section 2, the background of SVM [30,27,9] and several forms of kernel functions are presented. In Section 3, the information complexity (ICOMP) criterion for choosing the optimal kernel function and selecting the best subset of features using HSVM-RFE is introduced. Section 4 discusses the recursive feature elimination (RFE) technique and provides algorithms for three RFE-based methods. In Section 5, numerical examples on two real benchmark data sets are provided to study the efficiency of the proposed method. The proposed method and two existing RFE-based methods are compared in Section 6. Section 7 concludes the paper.

Support Vector Machine (SVM)
Consider the case of classifying a set of linearly separable data into two groups. Assume a set of training data is given by $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is an input vector, $y_i \in \{-1, 1\}$ is a binary class index, and $n$ is the size of the training data. SVM finds the optimal separating hyperplane that maximizes the margin between the classes [30]. A decision boundary (i.e., classifier) that partitions the underlying vector space into two classes can then be represented by the following hyperplane:

$$w^T x + b = 0,$$

where $w$ is the weight vector and $b$ is the bias. The objective of SVM is to find the maximum margin (M) decision boundary between two parallel hyperplanes, $w^T x + b = 1$ and $w^T x + b = -1$. An example of SVM is illustrated in Figure 1. Since the margin is given by $2/\|w\|$, the corresponding optimization problem can be written as follows:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$

where $\xi_i$ is the nonnegative slack variable and $C$ ($> 0$) is a pre-defined regularization coefficient. The linearly constrained optimization problem can be solved as a dual problem that maximizes the following function:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Once the optimum values $(\alpha^*, b^*)$ are obtained from the training set of points, a new point $x_{new}$ of the test data set is classified by the following decision rule:

$$D(x_{new}) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i K(x_i, x_{new}) + b^* \right),$$

where $D(\cdot)$ is a classifier based on the training data set and $K(x_i, x_{new})$ is the kernel function of the kernel trick proposed by [1]. The kernel trick maps the input data in the original space nonlinearly into a high-dimensional feature space. Table 1 presents some common kernel functions.
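The decision rule can be illustrated with a minimal Python sketch. The kernel forms below are common textbook parameterizations and are assumptions here, since the entries of Table 1 are not reproduced in this text; the function names and the toy support vectors are hypothetical, and the multipliers are supplied by hand rather than obtained from a trained SVM.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=3, coef0=1.0):
    # Assumed form: (u.v + coef0)^degree
    return (linear_kernel(u, v) + coef0) ** degree

def gaussian_kernel(u, v, sigma=1.0):
    # Assumed form: exp(-||u - v||^2 / (2 sigma^2))
    return math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2))

def cauchy_kernel(u, v, c=1.0):
    # Assumed form: 1 / (1 + ||u - v||^2 / c)
    return 1.0 / (1.0 + sq_dist(u, v) / c)

def inverse_multiquadric_kernel(u, v, c=1.0):
    # Assumed form: 1 / sqrt(||u - v||^2 + c^2)
    return 1.0 / math.sqrt(sq_dist(u, v) + c ** 2)

def decide(x_new, support_vectors, labels, alphas, bias, kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x_new) + b."""
    score = bias + sum(
        a * y * kernel(x, x_new)
        for x, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1

# Hypothetical toy values standing in for a trained SVM.
svs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [1.0, 1.0]
print(decide((2.0, 0.0), svs, ys, alphas, 0.0, linear_kernel))
print(decide((-3.0, 1.0), svs, ys, alphas, 0.0, cauchy_kernel))
```

Passing the kernel as an argument keeps the decision rule unchanged when kernels are swapped, which matches how a portfolio of kernels is compared later in the paper.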

Information-Theoretic Measure of Complexity
An information-theoretic measure of complexity called ICOMP has been proposed by Bozdogan [3,4,5,7] as a decision rule for model selection, in the same spirit as AIC [2] and BIC [24]. The development and construction of ICOMP is based on a generalization of the covariance complexity index originally introduced by [29]. Instead of penalizing the number of free parameters directly, ICOMP penalizes the covariance complexity of the model. It is defined by

$$\mathrm{ICOMP} = -2 \log L(\hat{\theta}_k) + 2\, C\big(\widehat{\mathrm{Cov}}(\hat{\theta}_k)\big),$$

where $L(\hat{\theta}_k)$ is the maximized likelihood function, $\hat{\theta}_k$ is the maximum likelihood estimate of the parameter vector $\theta_k$ under the model $M_k$, $C$ represents a real-valued complexity measure, and $\widehat{\mathrm{Cov}}(\hat{\theta}_k) = \hat{\Sigma}_{Model}$ represents the estimated covariance matrix of the parameter vector of the model. ICOMP should not be confused with the stochastic complexity (SC) or the minimum description length (MDL) of Rissanen [21,22,23], although both use the notion of complexity of a model class based on coding theory. The information-theoretic measure of complexity (ICOMP) is recapitulated in detail in the following subsections for the benefit of readers who may not be familiar with the ICOMP criterion.

Mutual Information in High Dimensions
For a random vector, the complexity is defined as follows.
Definition: The complexity of a random vector is a measure of the interdependency between its components. Consider a continuous p-variate distribution with joint density function $f(x) = f(x_1, \ldots, x_p)$ and marginal density functions $f_j(x_j)$, $j = 1, \ldots, p$. Following [15] and [13], the information measure of dependence is defined as follows:

$$I(x) = I(x_1, \ldots, x_p) = \int f(x_1, \ldots, x_p) \log \frac{f(x_1, \ldots, x_p)}{f_1(x_1) \cdots f_p(x_p)} \, dx_1 \cdots dx_p,$$

which is the Kullback-Leibler information divergence [16] against independence. The properties of the Kullback-Leibler information divergence are as follows:
• $I(x) \equiv I(x_1, \ldots, x_p) \ge 0$, i.e., the expected mutual information is nonnegative.
The KL divergence is related to Shannon's entropy [25] by the important identity

$$I(x) = I(x_1, \ldots, x_p) = \sum_{j=1}^{p} H(x_j) - H(x_1, \ldots, x_p), \tag{1}$$

where $H(x_j)$ is the marginal entropy and $H(x_1, \ldots, x_p)$ is the global or joint entropy. [31] calls $I(x)$ the strength of structure and a measure of inter-dependence.
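To define the information-theoretic measure of complexity of a multivariate distribution, let $f(x) = f(x_1, \ldots, x_p)$ be a multivariate Gaussian density function given by

$$f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}.$$

As a shorthand, let $x \sim N_p(\mu, \Sigma)$. Then the joint entropy $H(x) = H(x_1, \ldots, x_p)$ from equation (1), for the case in which $\mu = 0$, is given by

$$H(x) = \frac{p}{2} \left[ \log(2\pi) + 1 \right] + \frac{1}{2} \log |\Sigma|. \tag{2}$$

From equation (2), the marginal entropy $H(x_j)$ is

$$H(x_j) = \frac{1}{2} \left[ \log(2\pi) + 1 \right] + \frac{1}{2} \log \sigma_j^2,$$

where $\sigma_j^2$ is the variance of the j-th variable.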
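As a worked illustration of the entropy identity in the Gaussian case, the following sketch evaluates the standard closed-form entropies for a bivariate Gaussian and recovers the mutual information, which for correlation ρ equals −½ log(1 − ρ²). The function names are hypothetical and the example is restricted to p = 2 so that the determinant can be written out by hand.

```python
import math

def joint_entropy(var1, var2, cov12):
    """Joint entropy of a zero-mean bivariate Gaussian (p = 2)."""
    det = var1 * var2 - cov12 ** 2
    p = 2
    return 0.5 * p * (math.log(2.0 * math.pi) + 1.0) + 0.5 * math.log(det)

def marginal_entropy(var):
    """Entropy of a univariate Gaussian with variance var."""
    return 0.5 * (math.log(2.0 * math.pi) + 1.0) + 0.5 * math.log(var)

def mutual_information(var1, var2, cov12):
    """Sum of the marginal entropies minus the joint entropy."""
    return (marginal_entropy(var1) + marginal_entropy(var2)
            - joint_entropy(var1, var2, cov12))

rho = 0.6
print(mutual_information(1.0, 1.0, rho))   # correlated components
print(-0.5 * math.log(1.0 - rho ** 2))     # closed form for comparison
print(mutual_information(1.0, 1.0, 0.0))   # independent components
```

The constants involving log(2π) cancel in the difference, so only the variances and the determinant of the covariance matrix matter.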

Initial Definition of Covariance Complexity
[29, p. 61] provides a reasonable initial definition of the complexity of a covariance matrix $\Sigma$ for the multivariate Gaussian distribution. This measure is given by

$$C_0(\Sigma) = \sum_{j=1}^{p} H(x_j) - H(x_1, \ldots, x_p).$$

This reduces to

$$C_0(\Sigma) = \frac{1}{2} \sum_{j=1}^{p} \log \sigma_{jj} - \frac{1}{2} \log |\Sigma|, \tag{3}$$

where $\sigma_{jj} \equiv \sigma_j^2$ is the variance of the j-th variable, that is, the j-th diagonal element of $\Sigma$. As pointed out by [29], the result in equation (3) is not an effective measure of the amount of complexity in the covariance matrix $\Sigma$, since:
• $C_0(\Sigma)$ depends on the coordinates of the original random variables $x_1, \ldots, x_p$.
• The first term of $C_0(\Sigma)$ in equation (3) is not invariant under orthonormal transformations.
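The coordinate dependence of C_0 can be checked numerically. The sketch below uses the form C_0(Σ) = ½ Σ_j log σ_jj − ½ log|Σ| for a 2×2 covariance matrix and rotates the coordinate system; the helper names are hypothetical. A diagonal matrix has zero complexity, while the same matrix expressed in rotated coordinates does not.

```python
import math

def c0(sigma):
    """Initial covariance complexity for a symmetric 2x2 matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    return 0.5 * (math.log(a) + math.log(d)) - 0.5 * math.log(det)

def rotate(sigma, theta):
    """Return R * sigma * R^T for the rotation R = [[cos, -sin], [sin, cos]]."""
    (a, b), (_, d) = sigma          # sigma is assumed symmetric
    co, si = math.cos(theta), math.sin(theta)
    a2 = co * co * a - 2.0 * co * si * b + si * si * d
    d2 = si * si * a + 2.0 * co * si * b + co * co * d
    b2 = co * si * (a - d) + (co * co - si * si) * b
    return [[a2, b2], [b2, d2]]

sigma = [[4.0, 0.0], [0.0, 1.0]]
for degrees in (0, 15, 30, 45):
    rotated = rotate(sigma, math.radians(degrees))
    print(degrees, round(c0(rotated), 6))
```

The determinant term is unchanged by the rotation, so all of the variation comes from the first term, the sum of the log variances.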

Definition of Maximal Covariance Complexity
To improve upon C 0 (Σ) in equation ( 3), we propose the following.
Proposition: A maximal information-theoretic measure of complexity of a covariance matrix $\Sigma$ of a multivariate Gaussian distribution is defined as follows:

$$C_1(\Sigma) = \max_{T} C_0(\Sigma) = \frac{p}{2} \log \left[ \frac{\mathrm{tr}(\Sigma)}{p} \right] - \frac{1}{2} \log |\Sigma| = \frac{p}{2} \log \left( \frac{\bar{\lambda}_a}{\bar{\lambda}_g} \right),$$

where the maximum is taken over the orthonormal similarity transformations $T$ of the overall coordinate system $x_1, \ldots, x_p$, and $\bar{\lambda}_a$ and $\bar{\lambda}_g$ are the arithmetic and geometric means of the eigenvalues. The properties of the maximal information-theoretic measure of complexity are as follows:
• $C_1(\Sigma)$ is the log ratio between the arithmetic and geometric means of the eigenvalues.
• $C_1(\Sigma)$ incorporates the two most basic scalar measures of multivariate scatter: the trace and the determinant.
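A minimal sketch of C_1 computed directly from eigenvalues, assuming the log-ratio form (s/2) log(λ̄_a/λ̄_g) with s the number of positive eigenvalues supplied. The function name is hypothetical, and the eigenvalues are passed in as a plain list rather than computed from a covariance matrix.

```python
import math

def c1(eigenvalues):
    """Maximal covariance complexity from strictly positive eigenvalues."""
    s = len(eigenvalues)
    arithmetic_mean = sum(eigenvalues) / s
    # Work with the log of the geometric mean to avoid overflow in products.
    log_geometric_mean = sum(math.log(lam) for lam in eigenvalues) / s
    return 0.5 * s * (math.log(arithmetic_mean) - log_geometric_mean)

print(c1([4.0, 1.0]))        # unequal eigenvalues give positive complexity
print(c1([2.0, 2.0, 2.0]))   # equal eigenvalues give zero complexity
print(c1([8.0, 2.0]))        # rescaling the eigenvalues leaves C_1 unchanged
```

By the arithmetic-geometric mean inequality the value is never negative, and it vanishes only when all eigenvalues coincide.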

Modified Maximal Covariance Complexity
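Following [29], the geometric definition of covariance complexity is based on the Frobenius norm and is given by

$$C_F(\Sigma) = \frac{1}{s} \|\Sigma\|^2 - \left( \frac{\mathrm{tr}(\Sigma)}{s} \right)^2,$$

where $\|\Sigma\|^2 = \mathrm{tr}(\Sigma^T \Sigma)$ is the square of the Frobenius norm of $\Sigma$.
In terms of the eigenvalues (or singular values), $C_F(\Sigma)$ reduces to

$$C_F(\Sigma) = \frac{1}{s} \sum_{j=1}^{s} (\lambda_j - \bar{\lambda}_a)^2,$$

where $s$ is the rank of $\Sigma$, $\lambda_j$ is the j-th eigenvalue of $\Sigma > 0$, $j = 1, 2, \ldots, s$, and $\bar{\lambda}_a$ is the arithmetic mean of the eigenvalues. Note that $C_F(\Sigma) \ge 0$, with $C_F(\Sigma) = 0$ only when all $\lambda_j = \bar{\lambda}_a$. $C_1(\Sigma)$ can be approximated in terms of the eigenvalues $\lambda_j$, $j = 1, 2, \ldots, s$, by

$$C_1(\Sigma) \approx \frac{1}{4 \bar{\lambda}_a^2} \sum_{j=1}^{s} (\lambda_j - \bar{\lambda}_a)^2.$$

Since in the feature space we are dealing with orthonormal matrices, and in order to prevent the $C_1$ complexity from going to zero, we relate $C_1$ and $C_F$ through a second-order equivalent measure of complexity denoted by $C_{1F}$. Hence, the modified maximal entropic complexity $C_{1F}(\Sigma)$ is defined as follows:

$$C_{1F}(\Sigma) = \frac{s}{4} \, \frac{C_F(\Sigma)}{\left( \mathrm{tr}(\Sigma)/s \right)^2}.$$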
In terms of the eigenvalues, $C_{1F}(\Sigma)$ is given by

$$C_{1F}(\Sigma) = \frac{1}{4 \bar{\lambda}_a^2} \sum_{j=1}^{s} (\lambda_j - \bar{\lambda}_a)^2,$$

where $s = \mathrm{rank}(\Sigma)$. The properties of the modified maximal entropic complexity $C_{1F}$ are as follows:
• $C_{1F}(\Sigma)$ is scale-invariant, and $C_{1F}(\Sigma) \ge 0$ with $C_{1F}(\Sigma) = 0$ only when all $\lambda_j = \bar{\lambda}_a$.
• $C_{1F}(\Sigma)$ measures the relative variation in the eigenvalues rather than their absolute variation.
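The two listed properties, and the role of C_1F as a second-order counterpart of C_1, can be checked with a short sketch. It assumes the eigenvalue form C_1F = Σ_j (λ_j − λ̄_a)² / (4 λ̄_a²) and the log-ratio form of C_1; the function names are hypothetical.

```python
import math

def c1(eigenvalues):
    """Log ratio of arithmetic to geometric mean, scaled by s / 2."""
    s = len(eigenvalues)
    mean = sum(eigenvalues) / s
    return 0.5 * s * (math.log(mean) - sum(math.log(l) for l in eigenvalues) / s)

def c1f(eigenvalues):
    """Squared deviations of the eigenvalues from their mean, relative to the mean."""
    s = len(eigenvalues)
    mean = sum(eigenvalues) / s
    return sum((l - mean) ** 2 for l in eigenvalues) / (4.0 * mean ** 2)

print(c1f([4.0, 1.0]), c1f([8.0, 2.0]))     # scale invariance
print(c1f([3.0, 3.0, 3.0]))                 # zero when all eigenvalues are equal
print(c1([1.1, 0.9]), c1f([1.1, 0.9]))      # close when eigenvalues are nearly equal
```

The agreement between the last two values reflects a second-order Taylor expansion of the logarithm around the mean eigenvalue; the gap widens as the eigenvalues spread out.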

ICOMP as a Performance Measure: ICOMP_PERF
Singularity of the estimated covariance matrix is a common problem that has recently attracted the attention of many researchers, and many methods have been proposed to make the estimated covariance matrix well-conditioned. The usual response to singular or ill-conditioned covariance matrix estimates is the "naive" ridge regularization, $\Sigma^* = \hat{\Sigma} + \alpha I_p$, which counteracts the ill-conditioning by adjusting the eigenvalues of $\hat{\Sigma}$. The ridge parameter, $\alpha$, is typically chosen to be very small. This, of course, raises the following questions:
• How large a perturbation do we need?
• How small a perturbation can we get away with?
This is a case where simplicity is not necessarily a good thing; the ridge approach does not solve the problem for many real datasets. Another approach that does not seem to work well in practice is to augment $\hat{\Sigma}$ with a multiple of the kernel matrix, as suggested by [17]. After much experimentation with a variety of methods for improving the condition of the covariance matrix, a stabilization method [28] is adopted to resolve the ill-conditioning. It is embedded in a two-stage stabilization and smoothing process, which provides a well-conditioned covariance matrix that is both nonsingular and positive definite.
• Stage 1: Stabilization algorithm [28]
1. Perform the spectral decomposition $\hat{\Sigma} = V \Lambda V^T$, where $V$ is the matrix of eigenvectors and $\Lambda$ has the eigenvalues on the diagonal.
2. Form a new matrix of eigenvalues, $\Lambda^*$.
3. Finally, recompose the new stabilized matrix, $\hat{\Sigma}_{STA} = V \Lambda^* V^T$.

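• Stage 2: Compute the stabilized and smoothed convex sum covariance estimator
The second stage feeds the stabilized covariance matrix into a smoothed convex sum covariance matrix estimator (CSE), which was proposed based on the quadratic loss function used by [20] and later by [8]. The stabilized and smoothed convex sum covariance estimator (STA-CSE) is as follows:

$$\hat{\Sigma}_{STA\_CSE} = \frac{n}{n+m} \hat{\Sigma}_{STA} + \left( 1 - \frac{n}{n+m} \right) \hat{D}_{STA},$$

where $\hat{D}_{STA} = \left( \frac{1}{p} \mathrm{tr}(\hat{\Sigma}_{STA}) \right) I_p$. For $p \ge 2$, $m$ is chosen such that

$$0 < m < \frac{2 \left[ p (1 + \beta) - 2 \right]}{p - \beta},$$

where

$$\beta = \frac{\left( \mathrm{tr}\, \hat{\Sigma}_{STA} \right)^2}{\mathrm{tr}\big( \hat{\Sigma}_{STA}^2 \big)}.$$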
This estimator improves upon $\hat{\Sigma}_{STA}$ by shrinking all of its estimated eigenvalues toward their common mean. The motivation for using both stabilization and smoothing of the covariance matrix in the ranking process of RFE subset selection is to extract more information, since a reduced rank problem occurs in kernel-based methods. Using both stabilization and smoothing of the covariance matrix is therefore an attractive remedy for this problem in the usual kernel-based methods.
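The shrinkage effect can be seen directly on the eigenvalues. Because the target matrix is a multiple of the identity, a convex sum with it leaves the eigenvectors unchanged and moves every eigenvalue toward the common mean. The sketch below assumes a convex weight of n/(n+m) on the covariance estimate, with m supplied by the user; the function names and numbers are hypothetical and the Stage 1 stabilization step is not implemented.

```python
import math

def convex_sum_eigenvalues(eigenvalues, n, m):
    """Eigenvalues of w * S + (1 - w) * mean(eig) * I, with w = n / (n + m)."""
    p = len(eigenvalues)
    mean = sum(eigenvalues) / p
    w = n / (n + m)                      # assumed convex weight
    return [w * lam + (1.0 - w) * mean for lam in eigenvalues]

def condition_number(eigenvalues):
    """Ratio of largest to smallest eigenvalue; infinite if singular."""
    smallest = min(eigenvalues)
    return math.inf if smallest <= 0.0 else max(eigenvalues) / smallest

raw = [4.0, 1.0, 0.0]                    # a singular covariance estimate
shrunk = convex_sum_eigenvalues(raw, n=10, m=10)
print(shrunk)
print(condition_number(raw), condition_number(shrunk))
```

The trace is preserved by construction, so the total variance is unchanged while the zero eigenvalue is lifted away from zero.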
The choice of the best mapping function is neither simple nor automatic. In the literature, a valid method for selecting the appropriate kernel function does not yet exist. The goal of SVM is to minimize the probability of misclassification error. Intuitively, then, the penalty term for a poorly fitting model should be based on the classification error rate. In SVM problems, the error variance, $\sigma^2$, is estimated by the mean squared difference between the actual group labels ($y_i$) and the predicted group labels ($\hat{y}_i$), given by

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

Following the work of [14], the information measure of complexity as a performance measure of SVM, ICOMP_PERF, is then defined from this lack-of-fit estimate and the complexity of $\hat{\Sigma}_{STA\_CSE}$, the stabilized and smoothed convex sum covariance matrix estimator (STA-CSE). First, the hybrid covariance estimate is calculated, and then the diagonal matrix of the largest singular values is computed as a reduced rank approximation of $\hat{\Sigma}_{STA\_CSE}$. By minimizing ICOMP_PERF, the classification error is minimized under the best fitting model. ICOMP_PERF is also used to choose an optimal kernel function.

One of the major motivations for introducing the information measure of complexity (ICOMP) criterion is that in SVM-RFE subset selection problems the number of features is the same from one subset to another. In such cases the models are considered equivalent in terms of the number of parameters. AIC, BIC, or MDL type criteria have no provision for distinguishing one equivalent model from another, since their penalty terms are fixed and do not vary. Cross-validation-based criteria have also been used for feature selection in the literature, but because of the high dimensionality of the feature space these criteria are too time-consuming. The proposed method shortens the feature selection time.
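The link between the error variance estimate and the classification error rate can be made concrete. With labels coded as −1 and +1, each misclassified point contributes (±2)² = 4 to the sum of squared differences, so the mean squared difference equals four times the error rate. The sketch below illustrates only this lack-of-fit ingredient, not the full criterion; the function names and the toy labels are hypothetical.

```python
def error_variance(y_true, y_pred):
    """Mean squared difference between actual and predicted labels."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

def error_rate(y_true, y_pred):
    """Fraction of points whose predicted label differs from the actual one."""
    n = len(y_true)
    return sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat) / n

y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]     # one of four points is misclassified
print(error_variance(y_true, y_pred), error_rate(y_true, y_pred))
```

A perfectly separating classifier gives an estimate of zero, so any criterion that takes the logarithm of this quantity needs a guard for that case.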

Recursive Feature Elimination (RFE)
A feature selection method based on RFE, called SVM-RFE, was developed by [12]. SVM-RFE is an application of recursive feature elimination based on sensitivity analysis using an appropriately defined cost function (w: weight). The SVM-Gradient-RFE method [32,10] uses the gradient as the cost function. In this paper, the cost function is ICOMP_PERF. In our approach, the least sensitive feature, which has the minimum value of ICOMP_PERF, is eliminated first. This eliminated feature receives rank p (p: number of features). The machine is then retrained on the remaining p − 1 features, and the feature with the minimum value of ICOMP_PERF is eliminated again; this feature receives rank p − 1. The process continues iteratively until no feature is left in the subset, so that at the end of this ranking scheme all the features are ranked according to the ICOMP_PERF criterion. This differs from Guyon's ranking scheme [12], in which only the weights are considered, without taking into account the model fit and the complexity of the model.
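The ranking scheme can be sketched as a generic backward elimination loop with a pluggable cost function. The sketch below is one reading of the procedure, in which the criterion is evaluated for the model refit without each candidate feature and the candidate giving the smallest value is eliminated. The names `rfe_rank` and `toy_cost`, and the importance values, are hypothetical; in the paper's setting the cost would be the criterion computed from a retrained SVM.

```python
def rfe_rank(features, cost):
    """Rank features by backward elimination.

    cost(subset) returns the criterion value of a model fitted on `subset`.
    The first feature eliminated receives rank p, the last one rank 1.
    Returns the features ordered from rank 1 (kept longest) to rank p.
    """
    remaining = list(features)
    rank = {}
    while remaining:
        # Criterion value of the model refit without each candidate feature.
        scores = {k: cost([f for f in remaining if f != k]) for k in remaining}
        least_sensitive = min(remaining, key=lambda k: scores[k])
        rank[least_sensitive] = len(remaining)
        remaining.remove(least_sensitive)
    return sorted(rank, key=rank.get)

# Hypothetical stand-in for a fitted-model criterion: lower is better, and
# subsets that retain important features score lower.
importance = {0: 5.0, 1: 0.1, 2: 2.0}

def toy_cost(subset):
    return -sum(importance[f] for f in subset)

print(rfe_rank([0, 1, 2], toy_cost))
```

Each pass refits one model per remaining feature, so a full ranking of p features costs on the order of p²/2 model fits, which is the price of the greedy search.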

SVM-RFE Algorithm
1. Train the SVM on the features in the non-ranked subset, s, to obtain $\alpha$.
2. For each feature i in the non-ranked subset, s, compute the cost function

$$DJ(i) = \frac{1}{2} \alpha^T H \alpha - \frac{1}{2} \alpha^T H(-i) \alpha,$$

where $H$ is the matrix with elements $H_{jl} = y_j y_l K(x_j, x_l)$, and $H(-i)$ is the $H$ matrix computed without the i-th feature.
3. Find the feature k with the smallest cost function value, add k to the ranked subset, r, and remove k from the subset, s.
4. Repeat 1-3 until the subset s is empty.
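For the weight-based cost of SVM-RFE, the sensitivity of the dual objective to removing a feature can be sketched as follows, assuming the cost takes the form DJ(i) = ½ αᵀHα − ½ αᵀH(−i)α with the multipliers α held fixed. The example uses a linear kernel, for which this quantity reduces to ½ w_i², where w = Σ_j α_j y_j x_j. The function names and the toy data are hypothetical, and α is supplied by hand rather than obtained from a trained SVM.

```python
def linear_h(X, y, skip=None):
    """H[j][l] = y_j * y_l * <x_j, x_l>, optionally leaving out feature `skip`."""
    n = len(X)

    def dot(u, v):
        return sum(a * b for idx, (a, b) in enumerate(zip(u, v)) if idx != skip)

    return [[y[j] * y[l] * dot(X[j], X[l]) for l in range(n)] for j in range(n)]

def quad_form(alpha, H):
    """alpha^T H alpha."""
    n = len(alpha)
    return sum(alpha[j] * H[j][l] * alpha[l] for j in range(n) for l in range(n))

def dj(X, y, alpha, i):
    """Change in the quadratic term of the dual when feature i is removed."""
    full = quad_form(alpha, linear_h(X, y))
    reduced = quad_form(alpha, linear_h(X, y, skip=i))
    return 0.5 * full - 0.5 * reduced

X = [[1.0, 2.0], [-1.0, 0.5]]
y = [1, -1]
alpha = [0.5, 0.5]           # hypothetical multipliers, not a fitted solution

# Weight vector of the linear SVM implied by alpha.
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]
print([dj(X, y, alpha, k) for k in range(2)], [0.5 * wk ** 2 for wk in w])
```

For nonlinear kernels the same code structure applies with the kernel evaluated on the reduced feature vectors, but the shortcut through w is no longer available.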

Ionosphere Data
The ionosphere data are radar data collected by a system in Goose Bay, Labrador [26]. The system measures radar returns from the ionosphere. The data consist of 351 observations and 34 features with two classes: good and bad returns. Figure 3 shows scatter plots of the data with the groups identified by blue (circle) and red (cross) colors. As shown in Figure 3, the separation in dimension 5 against dimensions 13 and 19 and against dimensions 18 and 29 is quite poor. Tables 2 and 3 show the performance of the experiments based on ICOMP_PERF. In Table 2, the polynomial kernel with degree 3 on the 20% set shows a narrower confidence interval than the other kernel functions for both the training and test sets. As shown in Tables 2 and 3, the smallest ICOMP_PERF values are obtained with the polynomial kernel of degree 3 for both the 20% set and the 80% set. Tables 4 and 5 show the best subset selection based on the smallest ICOMP_PERF values. The training and test errors of the best subsets in both partitioned sets are within the 95% error confidence intervals.

Aorta Data
The aorta data are from medical imaging for a study of heart tissue. Hardening of the arteries is the leading cause of death and debility in the industrial world. Nuclear magnetic resonance (NMR) imaging plays a role in examining the arteries for the prognosis of heart attack. The NMR aorta data were used by [19]. The dataset was sampled from 418 patients on 20 different characteristics. The first group consists of 194 patients who exhibited early atheroma, and the second group consists of 224 patients who were healthy. Figure 4 shows grouped scatter plots illustrating the poor separation of dimension 3 against dimensions 13 and 19 and against dimensions 10 and 20 (group 1: blue, group 2: red), respectively. Tables 6 and 7 show that the best subset based on ICOMP_PERF is obtained with the Cauchy kernel in the 20% set and with the inverse multi-quadric kernel in the 80% set. The confidence intervals, which are obtained based on ICOMP_PERF, are narrow in both sets. Tables 8 and 9 show the best subset selected based on ICOMP_PERF.

Comparison with Other RFE Based Methods
To compare the three RFE-based methods (SVM-RFE, SVM-Gradient-RFE, and SVM-ICOMP_PERF-RFE), the ionosphere and aorta datasets are used with the same kernel functions as in Tables 2, 3, 6, and 7. The datasets are randomly partitioned in two ways: 20%/80% and 80%/20% as training/test sets. Tables 10 and 11 present comparisons of the three RFE-based methods using the ionosphere data with four different kernel functions in the two cases. The average error rate represents the misclassification error rate for the test set. SVM-ICOMP_PERF-RFE is the clear winner for most kernel functions, except for the linear kernel in the 80%/20% case. The best performance is obtained using the Cauchy kernel in the two cases, with 88.12% and 93.28% accuracies. Tables 12 and 13 present comparisons of the three RFE-based methods using the aorta data with four different kernel functions in the two cases. As shown in Tables 12 and 13, SVM-ICOMP_PERF-RFE is the best method for the polynomial kernel (degree=2) with 99.99% accuracy in the 20%/80% case, for the polynomial kernel (degree=2) with 99.88% accuracy in the 80%/20% case, and for the inverse multi-quadric kernel with 100% accuracy in the 80%/20% case. Figure 5 shows line plots of the test set error rates for the Cauchy kernel function, which gives the smallest average error rates for the ionosphere data in Tables 10 and 11. Figure 6 shows line plots of the test set error rates for the polynomial kernel (degree=2) and the inverse multi-quadric kernel functions, which give the smallest average error rates for the aorta data in Tables 12 and 13. SVM-ICOMP_PERF-RFE is competitive with both SVM-RFE and SVM-Gradient-RFE, as shown in Figure 5. In addition, SVM-ICOMP_PERF-RFE outperforms both SVM-RFE and SVM-Gradient-RFE with few features, as shown in Figure 6.

Conclusion and Discussion
In this paper, a novel SVM-ICOMP_PERF-RFE method is proposed using an information complexity (ICOMP_PERF) criterion. SVM-RFE is used in conjunction with ICOMP_PERF not only to choose an optimal

Figure 2. Illustration of radar refraction by ionosphere and heart anatomy.

Figure 5. Best results of SVM-ICOMP_PERF-RFE using ionosphere data: (a) Cauchy kernel function with 20% set (b) Cauchy kernel function with 80% set.

Table 2. Top subset features selected with 20% set using SVM-RFE ranking.

Table 3. Top subset features selected with 80% set using SVM-RFE ranking.

Table 6. Top subset features selected with 20% set using SVM-RFE ranking.

Table 7. Top subset features selected with 80% set using SVM-RFE ranking. Columns: Kernel, Best Subset, ICOMP_PERF, CI for Training, CI for Test.

Table 8. Subset selection based on ICOMP_PERF with 20% set (Cauchy).