Supersaturated Plans for Variable Selection in Large Databases

Over the last decades, the collection and storage of data has become massive with the advance of technology and variable selection has become a fundamental tool to large dimensional statistical modelling problems. In this study we implement data mining techniques, metaheuristics and use experimental designs in databases in order to determine the most relevant variables for classification in regression problems in cases where observations and labels of a large database are available. We propose a database-driven scheme for the encryption of specific fields of a database in order to select an optimal supersaturated design consisting of the variables of a large database which have been found to influence significantly the response outcome. The proposed design selection approach is quite promising, since we are able to retrieve an optimal supersaturated plan using a very small percentage of the available runs, a fact that makes the statistical analysis of a large database computationally feasible and affordable.


Introduction
The advent of new technologies has enabled scientists to measure the class label of hundreds of variables simultaneously and large dimensional problems are 162 C. PARPOULA, C. KOUKOUVINOS, D. SIMOS & S. STYLIANOU becoming more and more common since large amounts of data are increasingly produced and stored.Therefore, factor screening has become a challenge that many statisticians face in large-dimensional problems, and an essential activity in which the main goal is to identify correctly and parsimoniously the factors that have an important influence on the measured response.
Large databases exist in diverse fields of science, and extensive research into variable selection has been carried out over the last decades (see, for example [6] and [18]).Stepwise deletion and subset selection [21] are some of the existing traditional variable selection techniques which are useful for exploratory investigations but are very time-consuming or even impossible in cases where the number of predictor variables of interest is large.Variable selection procedures via penalized likelihood (see, for example [5] and [16]) are easily and quickly implemented even in a large-dimensional problem, but they remain very timeconsuming when they are applied during a large dimensional statistical analysis.This computational difficulty prevents these methods from being widely used when there is a large number of predictors in real life problems.
In this paper, we extend the idea of using heuristic algorithms for variable selection from the data of a database, as presented in [15].In this paper, we deal with a large-dimensional statistical modelling problem, and study the variable selection issue considering an alternative approach.We propose a step-by-step database-driven design selection scheme for the encryption of specific fields of a database which correspond to the significant variables of a regression problem in cases where observations and labels of a database are available.The proposed data-driven scheme is a combination of metaheuristics and data mining techniques, and enables the experimenter to identify the optimal supersaturated plan retrieved from a database for variable selection purposes.The close interaction between data mining and statistical analysis enabled the successful transition from collecting data, through modeling the underlying structures, to understandable and profitable results.
The rest of this paper is organized as follows.In Section 2, we discuss the need of considering design of experiments, and specifically supersaturated designs (SSDs) for variable selection issues.In Section 3, we describe the statistical methods employed in this work.We also present the proposed method of identifying an optimal supersaturated plan given a database.In Section 4, we describe the criteria used for performance evaluation; all the above procedures are applied to the real medical data, and the merits of the alternative approach using SSDs are presented.Finally in Section 5, the obtained results are discussed and some concluding remarks are made.

The use of SSDs for variable selection
Recently the experimental designs have been used for variable selection purposes.Pumplün et al. (see, [24,25]) introduced the use of experimental designs for variable selection problems given a database of observations However, in that case the retrieved plan arose from the class of D-optimal designs while in our case we are interested in supersaturated designs.Schiffner and Weihs [27] extended the study of [24] and investigated the appropri-ateness of D-optimal plans for training classification methods.Rüping and Weihs [26] dealt with the variable selection issue given a database of observations using statistical design of experiments and kernel methods.
Since nowadays massive data sets become available without predefined purposes, it is usually preferable to identify some important features in the data sets that will provide valuable information to support decision making [18].For situations where there is no prior knowledge of the factor effects, but factor sparsity holds [1], and where the aim is to identify any dominant factors, experimenters should seriously consider using SSDs as suggested in [9].SSDs are widely used in experimental situations in which a large number of factors are studied and only a few of them are expected to influence significantly the measured response.SSDs can be generally described as fractional factorial designs in which the number of factors m to be estimated exceeds the number of experimental runs n (m ≥ n).Recent research has targeted on the class of SSDs due to their mathematical novelty and their run-size economy Parpoula et al. [22] presented a new variable selection approach inspired by SSDs given a dataset of observations, and dealt with the large dimensional statistical modelling problem by employing nonconcave penalized likelihood methods and best subset techniques.Since this work focuses on the idea of implementing SSDs for variable selection issues, we do not present here more details on construction and analysis methods of SSDs, and we refer the interested reader to recent reviews (see, for example [9,14] for construction methods of SSDs; see, for example [17,11,8] for analysis methods of SSDs).The interested reader may also refer to [9,13] regarding the practical use of SSDs in real life problems.

Association rule mining
Association rule mining finds interesting associations and/or correlation relationships among large set of data items.Association rules show attribute value conditions that occur frequently together in a given data set.Association rules provide information of this type in the form of "if-then" statements.In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two criteria that express the degree of uncertainty about the rule.The first criterion is called support which refers to the percentage of records in the training data for which the antecedents (the "if" part of the rule) are true.The other criterion is known as confidence which is based on the records for which the rule's antecedents are true, and is the percentage of those records for which the consequent(s) are also true.In other words, it is the percentage of predictions based on the rule that are correct; rules with lower confidence than the specified criterion are discarded.
Generalized Rule Induction (GRI) algorithm is the most suitable methodology for our study because it can handle categorical or numerical variables as inputs, and requires categorical variables as outputs.GRI applies an information-theoretic approach [28] to determine the interestingness of a candidate association rule using the quantitative measure J GRI uses this quantitative measure J to calculate how interesting a rule may be and uses bounds on the possible values this measure may take to constrain the rule search space.The association rules generated from GRI take the form " If X = x then Y = y" where X and Y are two fields (attributes) and x and y are values for those fields.GRI extracts rules with the highest information content based on the Jindex that takes both the generality (support) and accuracy (confidence) of rules into account.
This preliminary stage of the statistical analysis is very important since it enables the experimenter to locate fields and records that are most likely to be of interest in modeling, and create a database-driven scheme for the encryption of specific fields of a database.

Simple genetic algorithm
In this paper, we assume some basic familiarity with genetic algorithm concepts like reproduction, mutation and crossover and will only describe what is needed for our approach.The interested reader may refer to Goldberg [10] and to Davis [4] for more details concerning with the necessary concepts for a description of the Simple Genetic Algorithm (SGA).
In this paper, an SGA is implemented.We correspond records of the database to chromosomes of the SGA where we are interested in the case where the chromosome length is very small when compared to the total number of attributes of the database.The database is encoded in {−1, 1} values, and we use the same encoding for SGA.Every optimization algorithm depends on a certain objective function (OF) that needs to be optimized.For SSDs, a widely used optimality criterion is the r max criterion.We give the definition, below.
r max criterion A reasonable criterion for comparing supersaturated designs is the minimization of max i<j |s ij /n| where s ij /n = r ij is the correlation of two columns c i , c j The largest absolute value of r ij between all pairs of columns is denoted by r max More details can be found in Lin [19].

L 1 -norm support vector machine
Support vector machines (SVMs) are a supervised learning classification method based on ideas originated in statistical learning theory [30,3].SVMs use the training data in order to generate input-output mapping functions and determine the maximal margin hyperplane.Since in our study we deal with a binary classification problem, the optimal hyperplane in terms of classification performance, is the one with the maximal margin of separation between the two classes [30].SVMs can also be formulated as a regularized function estimation problem, corresponding to a hinge loss function plus a regularization term on the fitted coefficients [34].The Least Absolute Shrinkage and Selection Operator (LASSO) method [29] is one of the most common approaches in regression for parameter estimation.The L 1 penalty term (LASSO) was adapted to SVM methodology in order to perform automatical variable selection to classification problems (see, for example [2,35,7]).Recently the elastic net penalty term [36] was adapted to SVMs by using a mix of the L 1 -norm and the L 2 -norm penalties (see, [31,32]).The elastic net SVM is especially useful for cases in which the number of variables exceeds the sample size.The L 1 -norm SVM is suitable for our real data analysis, since the dimension of the data (m=44 input variables) is not larger than the number of training samples (n=8862 observations).

The proposed method
In this section, we present analytically the proposed method for harvesting an optimal super-saturated design from the records and specific fields (attributes) of a database.The proposed method proceeds as follows: 1. Let y i denote the i-th response in the data set and x i denote the m vector of explanatory variables of the data set.2. Split the data set into training (90%) and test (10%) set.For the partitioning, the total observations were randomly selected to create the training and test set, according to their predefined size.3. Perform GRI algorithm to (x i , y i ) i=1,...,n and generate the association rules which are used for initializing the random chromosomes that are used in the mating pool of an SGA.In particular, apply the GRI mining task iteratively i.e., run an "up" search on the entire training set and then run a "down" search on the remainder to weed out low-performing segments.4. Select the initial runs for the optimal plan by matching the generated GRI rules with fields of the database.If n d are the runs retrieved using the GRI rules then the algorithmic procedure proceeds by fixing the total number of runs n of the supersaturated design as n = n d + n h where n h are the runs that are being selected from the SGA. 5. Use 1-point crossover and keep track of the selected attributes using the GRI rules without changing their values.
6. Begin a heuristic search using as an initial seed the selected factors of the optimal plan.In particular, guide the heuristic to search for k factors, where a subset of them is always keep fixed and the others are randomized.7. Retrieve the remaining factors of the optimal plan subject to r max optimality criterion.The implemented SGA outputs an optimal plan belonging to the class of supersaturated designs when a value of r max <1 is detected.

Medical data
The Trauma data set used here was collected in an annual registry conducted during the period 01/01/2005 -31/12/2005 by the Hellenic Trauma and Emergency Surgery Society involving 30 General Hospitals in Greece.Altogether, 8862 patients were recorded and for each of them the binary response variable y taking only two possible outcomes, denoted by -1 and 1 for "survival"and "death", respectively was reported.The Trauma data set which is used for further analysis, includes all of the 8862 available patients and the 44 input explicative variables (see Appendix, Table 3 for the list of the variables), that include demographic, transport and intrahospital data.After medical advice, all of the factors are treated equally during the data mining approach, meaning that there was no factor that should be always maintained in the model.The data set was split into training (90%) (=7975 patterns) and test (10%) (=887 patterns) sets for classification and association analysis.All results were derived from SPSS Clementine 12.0 and MATLAB software.

Performance criteria
Assessing the reliability of a classifier is essential to ensure data quality The most common criterion to assess the quality of a classification model is discrimination which measures how well the two classes in the data set are separated [33].
We consider the most commonly used measures of discrimination for evaluating the performance of the employed method To discuss these performance criteria, we adopt the standard definitions used in binary classification; given a classifier and a record, there are four possible scenarios.Positive records are correctly predicted as positive (True Positive-TP), positive records are incorrectly identified as negative (False Negative-FN), negative records are classified as positive ones (False Positive-FP) and finally negative records are correctly identified as negative (True Negative-TN).
The classification accuracy is used as first criterion.Accuracy is defined as the percentage of correct classified records in the test set for every used method.The other two criteria used are the sensitivity and specificity which are two statistical Stat., Optim.Inf.Comput.Vol. 2, June 2014.SUPERSATURATED PLANS FOR VARIABLE SELECTION 167 measures of the performance of a binary classification test and are closely related to the concepts of Type I and Type II errors.Sensitivity measures the proportion of actual positives which are correctly identified as such whereas specificity measures the proportion of negatives which are correctly identified.The other used criteria in information retrieval are the recall which corresponds to sensitivity and precision which is the proportion of true positives among all predicted positives.In classification tasks, another widely used performance metric is the geometric mean of class accuracies which puts all classes on an equal footing, and does not give higher priority to the rare positive classes.A performance metric that allows for this is the F-measure, which does not take account of performance on the negative classes.
In two-class problems, the above mentioned performance measures are defined as follows: where the β parameter, 0 < β < 1, allows the user to assign relative weights to precision and recall, with 0.5 giving them equal importance.
Another popular statistical tool is the Receiver Operating Characteristic (ROC) curve [23] which by definition is used to evaluate the performance of a system with dichotomous outcomes.An ROC curve is presented as a plot of Sensitivity as a function of (1-Specificity) for all the possible cutoffs.Traditionally the Area Under the ROC Curve (AUC) is used as a summary index of test accuracy [12] and is useful as a descriptive of overall test performance.Statistically speaking, the AUC of a classifier is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
As with many decision problems, errors of various types must be balanced against costs.In screening designs, there is a cost of declaring an inactive factor to be active (Type I error), and also a cost of declaring an active effect to be inactive (Type II error).Type II errors are troublesome, as addressed in [20], as well as Type I errors, since they can result in unnecessary cost in follow-up experiments and can cause detrimental actions if the experiment has immediate impact on practice.Under situations of effect sparsity Type I errors are very likely to occur.
Sensitivity and Specificity can be alternatively described as follows:

The optimal supersaturated plan
For initializing the random chromosomes that were used in the mating pool of an SGA, we employed the following GRI rules.Thus, we selected three initial runs for the optimal plan by matching these rules with fields of the database.We give below the optimal plan selected using the GRI rules and the SGA in 6 runs and 8 factors.This is interpreted as selecting 6 records with 8 field attributes.
We note here that the SGA also detected another optimal plan belonging to the class of SSDs with a value of r max <1.We give below the second optimal plan selected using the GRI rules and the SGA in 6 runs and 8 factors.This is also interpreted as selecting 6 records with 8 field attributes.

169
Note here that the x7 column, in both design matrices, is fully aliased to the mean since all of its values are equal to -1.Hence, its effect is not estimable.This fact is not troublesome for our study which is designed for screening purposes.We observe that the only difference between the first and second optimal plan retrieved from the database is the selection of the 7-th column, i.e., x14 instead of x16 respectively Since the ±1 signs are identically allocated to both x14 and x16 columns of the respective design matrices, the derived results will be totally identical either we consider to employ the first or the second detected optimal plan for further analysis.The fact that both plans retrieved from the database have identical ±1 signs allocated to each column of the design matrix is crucial, since it provides strong evidence that the desired optimal plan is indeed detected correctly (either the plan is by [x8,x5,x7,x13,x6,x2,x14,x11] or by [x8,x5,x7,x13,x6,x2,x16,x11]).

Comparative results
The L 1 -norm SVM methodology is used to evaluate the performance of the proposed method.We firstly perform L 1 -norm SVM method to the whole trauma dataset consisting of 8862 patients and 44 possible risk factors (Model I ).The fact that L 1 -norm SVMs are fast in train-ing, classifying, and are not computationally time-expensive allowed us to apply this method to the whole large dimensional dataset.We then perform L 1 -norm SVM to the proposed method, using as design matrix the identified SSD (68) and as response its corresponding outcome class labels (Model II ).A perfect predictor would be described as 100% sensitive and 100% specific.Sensitivity and specificity relate to the test's ability to identify positive and negative results, respectively.A test with high sensitivity and high specificity can be considered as a reliable indicator of a test with which has a low Type II error rate and a low Type I error rate, respectively.The L 1 -norm SVM (Model I) reaches the percentage of 99% for specificity which means that the classifier recognizes almost all actual negatives; in other words this means that it has a low Type I error rate.This measure alone does not tell us how well the classifier recognizes positive cases and so it is necessary to take into consideration both sensitivity of the used classifier.When the L 1norm SVM (Model I) is evaluated against the sensitivity has a clear disadvantage having lowest percentages, which means that the Type II error rates are higher.In general, Table 1 shows that the L 1 -norm SVM (Model I) tends to declare at a lower rate inactive variables to be active, and at a higher rate active variables to be inactive.The AUC for Model I takes the lowest value (AUC M odelI =0.57) compared to Model II (AUC M odelII =0.88).
The L 1 -norm SVM applied to the proposed method (Model II) has clearly better clas-sification accuracy sensitivity and specificity which almost reaches the absolute percentage of 100%.In general, Table 1 shows that the proposed method (Model II) has very low Type I errors (Type I error rate is almost 0, and this corresponds to cases where almost none of the inactive effects are declared as active) and very low Type II errors (Type II error rate is almost 0, and this corresponds to cases where almost none of the active effects are detected wrongly).In other words this means that the proposed method tends to declare at a low rate inactive variables to be active, and active variables to be inactive.Therefore, the proposed method is indeed stable in this sense.
Sensitivity and specificity values alone cannot be used to determine whether a test is useful in practice, and may be misleading.In medical diagnostics, sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas specificity is the ability of the test to correctly identify those without the disease (true negative rate).The "worst-case" sensitivity or specificity must be calculated in order to avoid reliance on experiments with few results.A common way to do this is to calculate the binomial proportion confidence intervals for sensitivity and specificity giving the range of values within which the correct value lies at a given confidence level (95%).Table 2 shows that sensitivity and specificity values for both Model I and Model II are satisfactory since the estimated values lie at the corresponding estimated confidence interval.Note here that the three basic things that usually impact the width of a confidence interval are the confidence level, variability and sample size.The fact that the confidence intervals in Table 2, for Model II are much broader than Model I is not surprising since smaller sample sizes generate wider intervals.The L 1 -norm SVM applied to the whole dataset (Model I ) detected a set of 33 out of 44 variables as statistically significant.The generated Model I excluded 11 variables as unimportant, i.e., x19, x20, x23, x24, x29, x30, x31, x38, x40, x42, x43.The proposed method (Model II ) also succeeded to exclude these 11 variables as unimportant.Moreover, the proposed method (Model II ) achieved to exclude more unimportant variables leading to a more parsimonious model compared to Model I.The proposed method (Model II ) detected a set of 8 out of 44 variables as statistically significant i.e., x2, x5, x6, x7, x8, x11, x13, x14 (or x16).Since the decision between x14 and x16 seems to be arbitrary and in order to avoid excluding a variable and missing probably a very important factor we further performed a subsequent analysis examining three new models, firstly including only x14, secondly including only x16 and finally a model including both variables (x14 and x16).The derived results are presented below.

Subsequent Analysis
The L 1 -norm SVM methodology is also used to evaluate the performance of the new models considered for subsequent analysis.We firstly perform L 1norm SVM method to the dataset consisting of m= 8   We observe from Table 3 that Model B or Model C clearly outperform Model A in terms of training error, sensitivity recall, precision, G-mean and F-measure values.Model B and Model C have similar performance with small differences in values of the examined criteria.Moreover, we observe from Table 4 that sensitivity and specificity values for all three models are satisfactory since the estimated values lie at the corresponding estimated confidence interval.Since the decision between selecting Model B (including only x16) or Model C (including both x14 and x16) seems not to be clear and might be arbitrary due to the similar performance of both models, we recommend selecting Model C as the "best" model derived, order to avoid excluding a variable and missing probably an important factor.

Concluding Remarks
The innovative nature of our study lies on using the class of SSDs in conjunction with data mining methods and genetic algorithms that enabled us to deal with the problem of variable selection in a large-dimensional database with a feasible computational effort.The proposed method achieved to declare at a low rate inactive variables to be active, and active variables to be inactive.Therefore, the proposed method is indeed stable in this sense.The proposed method using SSDs is very important for the statistical analysis of large data, since it allowed us to identify effectively and parsimoniously the important prognostic factors (9 statistically significant out of the available 44 predictor variables) using only few runs (6 runs of the available 8862 runs).We recommend this alternative approach for variable selection purposes given a database of observations since it enables the experimenter to use only a very small percentage of the available runs, a fact that makes the statistical analysis of a large database computationally feasible and affordable.The proposed approach might be preferable to practitioners who find it expensive to consider levels of certain factors which are associated with high costs and would like to achieve specific levels of certain factors in order to minimize the experimental cost.In the proposed method, only main effects models have been considered.It will naturally be of interest to incorporate interaction effects in the models, and then to develop a suitable method for subsequent analysis.Work is currently under progress in this direction and we hope to report these findings in a future paper.

Table 1 :
Advanced comparison of models performance

Table 3 :
Advanced comparison of models performance