Improvement of the CPU Time of Linear Discriminant Functions Based on the MNM Criterion by IP

Shinmura [12, 13, 14] proposes an optimal linear discriminant function (OLDF) using integer programming (IP), called IP-OLDF, based on the minimum number of misclassifications (MNM) criterion. It is defined on both the data space and the discriminant coefficient space, so the relation between a linear discriminant function (LDF) and its NM can be understood clearly. This basic knowledge reveals several new facts of discriminant theory. If the data satisfies Haar's condition [1] (general position), IP-OLDF can obtain the true MNM. If the data does not satisfy it, IP-OLDF may not find the true MNM, because of the unresolved problem of discriminant analysis: no LDF can correctly discriminate the cases x_i on the discriminant hyperplane (f(x_i) = 0). Therefore, Revised IP-OLDF [15, 16] is developed. However, it requires a long elapsed runtime (CPU time) because it is solved by IP. In this paper, we show how to reduce CPU time by Revised IPLP-OLDF, whose NMs are good estimates of MNMs. We evaluate whether the NM of Revised IPLP-OLDF is almost as low as the MNM obtained by Revised IP-OLDF, and show that the CPU time of Revised IPLP-OLDF is remarkably improved compared with Revised IP-OLDF. These results are examined by a total of 149 different discriminant functions using real training samples and re-sampling validation samples.


Introduction
In this paper, four linear discriminant functions by mathematical programming (MP) are introduced.
1. IP-OLDF is defined by IP and looks for the discriminant function f_j(x) = t b_j * x + 1 with the minimum NM on the data space. The interior points of a convex polyhedron have a unique NM, because they are surrounded by specific k linear equations and NM is decided by the number of minus half-planes of H_i(b) = 0 (i = 1, 2, ..., n). Until now, the relation between a discriminant function and its NM could not be found, because the constant was treated as a free variable defining a (p + 1)-dimensional coefficient space. A case x_i on the data space corresponds to the linear equation H_i(b) = t x_i * b + 1 = 0 on the discriminant coefficient space, and a point b_j on the coefficient space corresponds to the discriminant function f_j(x) = t b_j * x + 1. This correspondence becomes clear by fixing the constant of the discriminant function to 1.
2. The optimal convex polyhedron (OCP) is defined as the convex polyhedron whose NM equals MNM. Until now, no discriminant function can avoid having some cases on f(x) = 0, and there is no rule for discriminating these cases into class 1 or class 2 correctly. This unresolved problem has been abandoned until now. It means that the NMs of all discriminant functions may not be correct: if we judge |f(x)| <= 10^-6 as zero and the number of cases with f(x) = 0 is m, the true NM may increase by up to m. It is found that IP-OLDF reaches a vertex of the OCP if the data is in general position. Only Revised IP-OLDF can find an interior point of the OCP directly. If the data is not in general position, IP-OLDF may not find a vertex of the OCP. A point b_j on a vertex or edge of a convex polyhedron is not free from the unresolved problem, because there are cases x_i with f_j(x_i) = 0. If an LDF finds an interior point b_j, that function is theoretically free from the unresolved problem. This is confirmed by checking that the number of cases with |f(x)| <= 10^-6 is zero. Therefore, all discriminant functions except Revised IP-OLDF must output both NM and this number.
3. MNM decreases monotonously: MNM_q >= MNM_(q+1), where MNM_q is the MNM of q features and MNM_(q+1) is the MNM of (q + 1) features obtained by adding one feature to the existing q features. The proof is very simple, because the OCP in the q-dimensional coefficient space is included in the (q + 1)-dimensional coefficient space. This implies an important fact: if MNM_q = 0, the MNM of every model including these q features is zero. Secondly, Revised IPLP-OLDF is compared with Fisher's LDF and logistic regression by 100-fold cross validation using 100 re-sampling samples [18, 19].
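The monotonic decrease of fact 3 can be checked numerically on a small example. The sketch below is not the paper's IP formulation; it computes MNM by exhaustive search over candidate hyperplanes passing through p data points, which is exact for small data in general position and, like IP-OLDF, counts cases lying exactly on the hyperplane as correct. The toy data values are hypothetical.

```python
import numpy as np
from itertools import combinations

def mnm_bruteforce(X, y):
    """Exhaustive MNM for small data in general position.

    An optimal hyperplane can be moved until it passes through p data
    points without increasing the misclassification count, so it is
    enough to enumerate hyperplanes through every p-subset of cases.
    Cases lying exactly on the hyperplane are counted as correct,
    mirroring IP-OLDF's treatment of f(x) = 0.
    """
    n, p = X.shape
    best = n
    for idx in combinations(range(n), p):
        pts = X[list(idx)]
        if p == 1:
            w, c = np.ones(1), pts[0, 0]
        else:
            # normal vector of the flat spanned by the chosen points:
            # the last right-singular vector of the difference matrix
            w = np.linalg.svd(pts[1:] - pts[0])[2][-1]
            c = w @ pts[0]
        f = X @ w - c
        for s in (1, -1):                     # try both orientations
            best = min(best, int(np.sum(s * f * y < 0)))
    return best

# hypothetical toy data: not separable on x1 alone, separable on (x1, x2)
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])
mnm1 = mnm_bruteforce(X[:, :1], y)   # MNM with q = 1 feature
mnm2 = mnm_bruteforce(X, y)          # MNM with q + 1 = 2 features
```

On this data the single feature x1 cannot separate the classes (mnm1 = 1), while the two-feature model is linearly separable (mnm2 = 0), illustrating MNM_q >= MNM_(q+1).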

Fisher's LDF and logistic regression
Fisher [3] introduces Fisher's LDF based on maximization of the variance ratio (between classes / within class). If we accept Fisher's assumption that the two classes follow normal distributions such as F_1(x; m_1, Σ_1) and F_2(x; m_2, Σ_2), and the variance-covariance matrices are the same (Σ_1 = Σ_2), the same Fisher's LDF is derived by the plug-in rule log(F_1/F_2) = 0. If the variance-covariance matrices of the two classes are not the same (Σ_1 ≠ Σ_2), the quadratic discriminant function (QDF) is introduced. Multi-class discrimination and the MT (Mahalanobis-Taguchi) theory [22] in QC are defined by the Mahalanobis distance. The variance-covariance matrix plays an important role in discrimination theory. Model selection is achieved by the sweep operator [5]. However, several serious problems are found, as follows.
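Under the equal-covariance assumption, the plug-in rule reduces to the familiar closed form w = Σ_pooled^-1 (m_1 - m_2) with the midpoint cutoff under equal priors. A minimal sketch on hypothetical toy data (the pooled-covariance estimator and equal priors are the assumptions here):

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Fisher's LDF by the plug-in rule under Sigma1 == Sigma2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled within-class variance-covariance matrix
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, m1 - m2)   # discriminant coefficients
    c = w @ (m1 + m2) / 2             # midpoint cutoff, equal priors
    return w, c

# hypothetical two-class normal samples
rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 1.0, size=(50, 2))
X2 = rng.normal([-2.0, -2.0], 1.0, size=(50, 2))
w, c = fisher_ldf(X1, X2)
```

A case x is assigned to class 1 when w @ x - c > 0; in particular each class mean falls on its own side of the cutoff.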
1. In general, the NMs or error rates of Fisher's LDF and QDF are worse than those of logistic regression. Therefore, users in the medical and economic fields use logistic regression instead of Fisher's LDF and QDF. This is because logistic regression is free from a specific distribution such as the normal distribution.
2. The NMs of Fisher's LDF and QDF are not zero for linearly separable data such as the Swiss bank note data and 18 pass/fail determinations of exams [17]. For the latter, the error rates of Fisher's LDF range from 2.2% to 16.7%, and those of QDF range from 0.8% to 10.8% [20]. These problems arise because real data does not satisfy Fisher's assumption.

IP-OLDF
IP-OLDF is defined in (1). Vector b consists of the p discriminant coefficients. From n cases, we obtain the optimal coefficients b that minimize Σ e_i by IP. The constant of the linear equation H_i(b) = t x_i * b + 1 is fixed to 1 for i = 1, ..., n. This notation shows the relation between LDFs and NMs. The decision variable e_i is a 0/1 integer variable for x_i. If x_i is classified into class 1 or class 2 correctly, e_i = 0 and y_i * ( t x_i * b + 1) >= 0. However, if there are cases on the discriminant hyper-plane ( t x_i * b + 1) = 0, IP-OLDF still treats them as e_i = 0 with y_i * ( t x_i * b + 1) >= 0, even though we cannot judge which class these cases belong to. For a misclassified case x_i, e_i = 1 and y_i * ( t x_i * b + 1) >= -10000. This means that the binary integer variable chooses ( t x_i * b + 1) = 0 or ( t x_i * b + 1) = -10000 as the linear discriminant hyper-plane for classified/misclassified cases. Therefore, we obtain MNM as the optimal solution if the data is in general position. If the data is not in general position, the objective function may not be the true MNM.

MIN = Σ e_i;   y_i * ( t x_i * b + 1) >= -M * e_i   (i = 1, ..., n);   (1)

(e_i: 0/1 decision variable corresponding to each x_i; b: p-discriminant coefficients vector; M: Big M constant, such as 10000.)

It is very important in model (1) that the constant is fixed to 1. We can exchange the roles of b and x, and obtain the linear equation H_i(b) = t x_i * b + 1 on the coefficient space. Therefore, this model is considered on both the data and discriminant coefficient spaces. On the data space, the linear hyper-plane f_j(x) = 0 misclassifies some cases x_i. On the coefficient space, the n linear hyper-planes H_i(b) = 0 divide the space into a finite number of convex polyhedrons. The interior points of each convex polyhedron are included in the plus or minus half-plane of each H_i(b) = 0.
Therefore, the interior points of the same convex polyhedron have a unique NM. The LDFs corresponding to these interior points classify the same cases correctly and misclassify the same others. If we choose an LDF corresponding to an interior point, it is free from the unresolved problem. If the data is in general position, IP-OLDF stops the optimization by choosing just p constraints that become H_i(b) = 0 out of the n constraints. The interior points b_j of the OCP are located on the plus side of the hyper-planes H_i(b) = 0 that compose the OCP. If the data is not in general position, IP-OLDF may choose over (p + 1) constraints. We cannot theoretically discriminate these (p + 1) cases into class 1 or class 2. Until now, this important fact has been disregarded.
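The stopping behavior described above suggests a naive solver for tiny data sets: since an optimal solution of (1) lies at a vertex where p equations t x_i * b + 1 = 0 hold, one can simply enumerate all p-subsets of cases, solve for b, and keep the vertex with the smallest NM. The sketch below (hypothetical toy data, not the paper's IP code) does exactly that, and it deliberately inherits the unresolved problem by counting cases with f(x) = 0 as correct:

```python
import numpy as np
from itertools import combinations

def ip_oldf_vertex_search(X, y):
    """Enumerate vertices b solving t x_i * b = -1 for p chosen cases.

    Mimics where IP-OLDF stops when the data is in general position;
    feasible only for very small n and p.
    """
    n, p = X.shape
    best_nm, best_b = n + 1, None
    for idx in combinations(range(n), p):
        A = X[list(idx)]
        try:
            b = np.linalg.solve(A, -np.ones(p))  # H_i(b) = t x_i * b + 1 = 0
        except np.linalg.LinAlgError:
            continue                             # degenerate subset
        f = X @ b + 1                            # f_j(x) = t b_j * x + 1
        nm = int(np.sum(y * f < 0))              # f == 0 counted as correct
        if nm < best_nm:
            best_nm, best_b = nm, b
    return best_b, best_nm

# hypothetical toy data, linearly separable with the constant fixed to 1
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
b, nm = ip_oldf_vertex_search(X, y)
```

Note that the winning vertex here places two cases exactly on f(x) = 0 and still reports nm = 0, which is precisely the weakness that motivates Revised IP-OLDF.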

Revised IP-OLDF and Revised LP-OLDF
Revised IP-OLDF is defined in (2). The constant of this discriminant function is a free variable b_0. The right-hand constants of the constraints are changed to 1. For correctly classified cases, e_i = 0 and y_i * ( t x_i * b + b_0) >= 1; for the misclassified cases, the constraints are relaxed to y_i * ( t x_i * b + b_0) >= 1 - M. The Big M constant is very important to prevent cases from lying on the discriminant hyper-plane, because the misclassified cases are moved from the support vectors (SVs) y_i * ( t x_i * b + b_0) = 1 to the alternative SV y_i * ( t x_i * b + b_0) = -9999, and there are no cases between them on f(x) = 0.

MIN = Σ e_i;   y_i * ( t x_i * b + b_0) >= 1 - M * e_i   (i = 1, ..., n);   (2)

(e_i: 0/1 decision variable; b_0: constant term (free variable); b: p-discriminant coefficients vector; M: Big M constant, such as 10000.)
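Model (2) can be written down directly with any MIP solver. The sketch below uses SciPy's `milp` as an assumed toolchain (the paper itself uses LINGO), on hypothetical 1-dimensional toy data whose true MNM is 1:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def revised_ip_oldf(X, y, M=10000.0):
    """Model (2): MIN sum(e_i) s.t. y_i*(x_i.b + b0) >= 1 - M*e_i."""
    n, p = X.shape
    # variable vector z = [b (p), b0, e (n)]
    c = np.concatenate([np.zeros(p + 1), np.ones(n)])            # MIN sum e_i
    A = np.hstack([y[:, None] * X, y[:, None], M * np.eye(n)])   # y_i*(x_i.b + b0) + M*e_i
    con = LinearConstraint(A, lb=np.ones(n))                     # ... >= 1
    integrality = np.concatenate([np.zeros(p + 1), np.ones(n)])  # e_i integer
    lb = np.concatenate([np.full(p + 1, -np.inf), np.zeros(n)])
    ub = np.concatenate([np.full(p + 1, np.inf), np.ones(n)])    # e_i in {0, 1}
    res = milp(c, constraints=con, integrality=integrality,
               bounds=Bounds(lb, ub))
    return res.x[:p], res.x[p], int(round(res.fun))

# hypothetical overlapping data: no threshold separates all four cases
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, -1, 1])
b, b0, mnm = revised_ip_oldf(X, y)
```

Any 1-dimensional LDF must misclassify at least one of these four alternating cases, so the solver returns mnm = 1 together with coefficients attaining it.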
Revised LP-OLDF is defined by changing e_i from a 0/1 decision variable to a non-negative real variable. This method is one of the L1-norm methods [7, 21]. The objective function is the summation of the distances of the misclassified cases from the discriminant hyper-plane, because e_i = 0 for the correctly classified cases. It is the same as the soft-margin SVM (S-SVM) when the penalty c is a large positive number.
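Relaxing e_i to a non-negative real variable turns (2) into an ordinary LP. A sketch with SciPy's `linprog` (again an assumed toolchain, on hypothetical separable toy data):

```python
import numpy as np
from scipy.optimize import linprog

def revised_lp_oldf(X, y):
    """LP relaxation of (2): e_i >= 0 real instead of 0/1."""
    n, p = X.shape
    # variable vector z = [b (p), b0, e (n)]
    c = np.concatenate([np.zeros(p + 1), np.ones(n)])
    # y_i*(x_i.b + b0) + e_i >= 1  ->  -y_i*x_i.b - y_i*b0 - e_i <= -1
    A_ub = np.hstack([-(y[:, None] * X), -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (p + 1) + [(0, None)] * n   # b, b0 free; e_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p], res.x[p], res.x[p + 1:]

# hypothetical linearly separable data
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1])
b, b0, e = revised_lp_oldf(X, y)
```

For separable data the optimum is Σ e_i = 0 and every case satisfies the SV constraint y_i * (x_i.b + b_0) >= 1; for overlapping data, positive e_i flag the cases the LP could not classify, which is exactly what the first phase of Revised IPLP-OLDF uses.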

Revised IPLP-OLDF
Revised IPLP-OLDF is defined in two phases, as follows. In the first phase, Revised LP-OLDF is applied to all cases, and the cases are categorized into two groups: cases that are classified correctly (e_i = 0) and cases that are not (e_i > 0).In the second phase, Revised IP-OLDF is applied only to the latter cases.
CPU time may be reduced because Revised IP-OLDF analyzes a restricted set of cases. This method is called Revised IPLP-OLDF.
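Under the same assumed SciPy toolchain as above, the two phases can be sketched as follows: phase 1 solves the LP relaxation; phase 2 re-solves the mixed-integer model, but only the cases that phase 1 failed to classify keep binary e_i, while the phase-1-correct cases are fixed at e_i = 0 (one plausible reading of "applied to the latter cases"; the paper's own implementation may differ). The toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog, milp, LinearConstraint, Bounds

def revised_iplp_oldf(X, y, M=10000.0, tol=1e-8):
    n, p = X.shape
    # --- phase 1: Revised LP-OLDF (e_i real, non-negative) ---
    c = np.concatenate([np.zeros(p + 1), np.ones(n)])
    A_ub = np.hstack([-(y[:, None] * X), -y[:, None], -np.eye(n)])
    lp = linprog(c, A_ub=A_ub, b_ub=-np.ones(n),
                 bounds=[(None, None)] * (p + 1) + [(0, None)] * n,
                 method="highs")
    hard = lp.x[p + 1:] <= tol          # cases classified in phase 1
    # --- phase 2: Revised IP-OLDF restricted to the remaining cases ---
    A = np.hstack([y[:, None] * X, y[:, None], M * np.eye(n)])
    integrality = np.concatenate([np.zeros(p + 1), (~hard).astype(float)])
    lb = np.concatenate([np.full(p + 1, -np.inf), np.zeros(n)])
    ub = np.concatenate([np.full(p + 1, np.inf),
                         np.where(hard, 0.0, 1.0)])  # e_i = 0 if phase-1 correct
    res = milp(c, constraints=LinearConstraint(A, lb=np.ones(n)),
               integrality=integrality, bounds=Bounds(lb, ub))
    return res.x[:p], res.x[p], int(round(res.fun))

# hypothetical overlapping data with true MNM = 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, -1, 1])
b, b0, nm = revised_iplp_oldf(X, y)
```

The speed-up comes from shrinking the set of binary variables: only the phase-1 misclassified cases enter the IP, so the branch-and-bound tree is much smaller than in the full model (2).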

Comparison of Revised IP-OLDF and Revised IPLP-OLDF
In this study, four kinds of real data are used as the training samples. The student data [16] consists of 40 students with five features. The object variable consists of two classes: 25 students who pass the exam and 15 students who fail. All combinations of features (31 = 2^5 - 1) are investigated. The iris data [2] consists of 100 cases with 4 features. The object variable consists of two species: 50 versicolor and 50 virginica. All combinations of features (15 = 2^4 - 1) are investigated. The CPD data [11] consists of 240 patients with 19 features. The object variable consists of two classes: 180 pregnant women whose babies are born by natural delivery and 60 pregnant women whose babies are born by Caesarean section. Forty models selected by the forward and backward stepwise methods are investigated, because there are (2^19 - 1) models by all combinations of features. The Swiss bank note data [4] consists of 200 cases with 6 features. The object variable consists of two kinds of bills: 100 genuine and 100 counterfeit bills. A total of (63 = 2^6 - 1) models are investigated. Four kinds of re-sampling data are generated by Speakeasy. These samples consist of 20,000 cases and are used as the validation samples. Revised IP-OLDF and Revised IPLP-OLDF are applied to both the training and validation samples by LINGO (Optimization Modeling Software for Linear, Nonlinear, and Integer Programming) Ver. 10 [10], developed by LINDO Systems Inc. in 2008. Both CPU times are compared in Table 1 to Table 4. In addition to these results, the NMs of 135 models by Revised IPLP-OLDF are compared with the 135 NMs of Fisher's LDF and logistic regression by 100-fold cross validation in Table 5, using LINGO Ver. 14 in 2014.

Swiss Bank Notes Data
Table 1 shows the result of the Swiss bank note data (bank data). The first column (Var.) shows the 63 models from 6 features (p = 6) to 1 feature (p = 1). Within the same number of features (p), the models are arranged in descending order of R-squared. Here, x_1 is the length of the bill (mm); x_2 and x_3 are the widths of the left and right edges (mm); x_4 and x_5 are the bottom and top margin widths (mm); x_6 is the length of the image diagonal (mm). Variable names are shown only by their suffix numbers in the table. The third column (IP) shows MNM by Revised IP-OLDF. IP-OLDF finds that the MNM of (x_4, x_6) is zero. Therefore, the 16 models including (x_4, x_6) are linearly separable. The fourth column (EC1) shows the NM of the re-sampling data (or validation data) obtained by the 63 discriminant functions of Revised IP-OLDF. The fifth column (%) shows the difference of the two error rates, defined by the formula (EC1/20000 - IP/200) * 100. This column is one index of the generalization ability of each Revised IP-OLDF. Six differences are greater than 4%. We had better consider the generalization ability of each model in addition to the whole set of models. The LP column shows the NMs by Revised LP-OLDF in the first phase of Revised IPLP-OLDF. The NM of the 1-variable model (x_5) is 45 and is less than MNM = 47. Since MNM is the lower limit of the NMs of all LDFs, this shows that Revised LP-OLDF is not free from the unresolved problem. The IPLP column shows the estimates of MNM by Revised IPLP-OLDF in the second phase. All 63 results of both functions are the same. The EC2 column shows the NMs in the validation samples. The second '%' column shows the difference of the two error rates by the formula (EC2/20000 - IPLP/200) * 100. Comparison of the two '%' columns tells us that the values of Revised IPLP-OLDF are less than those of Revised IP-OLDF. This may show that the generalization ability of Revised IPLP-OLDF is better than that of Revised IP-OLDF over the whole set of models. MP-based models are solved by fixing some cases on the discriminant hyper-plane or on the SVs. Therefore, these discriminant functions cannot count NM correctly, because some cases lie on the discriminant hyper-plane. In the case of Revised LP-OLDF, some cases are fixed on the SVs. However, if e_i = 1/10000 = 0.0001, x_i lies on the discriminant hyper-plane. This will be examined in future research.
The CPU times of Revised IP-OLDF and Revised IPLP-OLDF for the 63 models are 133,399 seconds and 2,688 seconds, respectively. Revised IPLP-OLDF is approximately 50 times faster than Revised IP-OLDF.

Table 2 shows the result of the iris data. The first column (Var.) shows the 15 models from p = 4 to p = 1. x_1 through x_4 mean sepal length (x_1), sepal width (x_2), petal length (x_3), and petal width (x_4); the object variable (x_5) is the species. The third column (IP) shows MNM by Revised IP-OLDF. The fourth column (EC1) shows the NM of the re-sampling data obtained by the 15 discriminant functions of Revised IP-OLDF. The fifth column (%) is defined by the formula (EC1/20000 - IP/100) * 100. The LP column shows the NMs by Revised LP-OLDF. The NMs of two 1-variable models, (x_1) and (x_2), are less than their MNMs. The IPLP column shows the estimates of MNM by Revised IPLP-OLDF in the second phase. All 15 results of both functions are the same. The EC2 column shows the NM in the validation samples. The second '%' column is defined by the formula (EC2/20000 - IPLP/100) * 100. All absolute values of both '%' columns are less than 0.4%. This implies that both Revised IP-OLDF and Revised IPLP-OLDF have good generalization ability for the iris data. The CPU times of Revised IP-OLDF and Revised IPLP-OLDF for the 15 models are 446 seconds and 30 seconds. Revised IPLP-OLDF is approximately 15 times faster than Revised IP-OLDF.

Table 3 shows the result of the student data. The first column (Var.) shows the 31 models from p = 5 to p = 1. x_1 through x_5 mean the hours of study per day, the number of days drinking per week, spending money per month, sex (0/1 dummy variable), and smoking (0/1 dummy variable). The third column (IP) shows the MNM of the student data. The fourth column (EC1) shows the NM of the re-sampling data by Revised IP-OLDF. The fifth column (%) shows the difference of the two error rates by the formula (EC1/20000 - IP/40) * 100. Three 2-variable models and two 1-variable models in the LP column are less than the MNMs in the IP column. The IPLP column shows the estimates of MNMs by Revised IPLP-OLDF. All 31 results of both functions are the same. The EC2 column shows the NM of the re-sampling data by Revised IPLP-OLDF. The second '%' column is defined by the formula (EC2/20000 - IPLP/40) * 100. The values of the two '%' columns are the same except for the 1-variable model (x_2). The absolute values of both '%' columns are larger than those of the other data sets. The CPU times of Revised IP-OLDF and Revised IPLP-OLDF are 20 seconds and 40 seconds. Revised IPLP-OLDF is slower than Revised IP-OLDF, because all features are integers and many values overlap.

CPD Data
Table 4 shows the result of the CPD data. The first column (p) shows the 40 models from p = 1 to p = 19. "F, B, f, and b" in the Type column show the forward (F) and backward (B) models for the full model, and the forward (f) and backward (b) models for the 16-variable model obtained by dropping three variables (x_4, x_7, and x_14) related to multicollinearity. The features are as follows. x_1: age of the pregnant woman, x_2: number of deliveries, x_3: number of the sacrum, x_4: anteroposterior distance at the pelvic inlet, x_5: anteroposterior distance at the wide pelvis, x_6: anteroposterior distance at the narrow pelvis, x_7: the shortest anteroposterior distance, x_8: fetal biparietal diameter, x_9: x_7 - x_8, x_10: anteroposterior distance at the pelvic inlet, x_11: biparietal diameter at the pelvic inlet, x_12: x_13 - x_14, x_13: area at the pelvic inlet, x_14: area of the fetal head, x_15: area at the bottom length of the uterus; the remaining features x_16 to x_19 are listed in the note to Table 4.

The mean error rates of the difference (Fisher's LDF - Revised IPLP-OLDF) for the training samples are summarized by their minimum and maximum values. The minimum and maximum values over the 15 models of the iris training samples are 0.55% and 5.23%. This means that the mean error rates of Fisher's LDF are from 0.55% to 5.23% worse than those of Revised IPLP-OLDF. The minimum and maximum values over the 15 models of the iris validation samples are -0.6% and 2.36%. Only two of the 15 models of Fisher's LDF are better than Revised IPLP-OLDF in the validation samples. In the training samples, the 135 models of Fisher's LDF are worse than those of Revised IPLP-OLDF. Only 15 models of Fisher's LDF are better than Revised IPLP-OLDF in the validation samples. The mean error rates of the difference (logistic regression - Revised IPLP-OLDF) tell us that only 3 and 33 models of logistic regression are better than Revised IPLP-OLDF for the training and validation samples, respectively. In 2014, these results were recalculated using LINGO Ver. 14. The elapsed runtimes of Revised IPLP-OLDF are less than 3 minutes, whereas the elapsed runtimes of Fisher's LDF and logistic regression by JMP are over 21 minutes. The elapsed runtimes of Revised IPLP-OLDF in Ver. 13 were slower than those of Fisher's LDF and logistic regression, so a reversal of CPU times has occurred this time.

Conclusion
Revised IPLP-OLDF is 50, 15, and 100 times faster than Revised IP-OLDF for the bank data, iris data, and CPD data, respectively (2009 results). Flury & Riedwyl [4] collected the Swiss bank note data of 200 genuine and counterfeit bills having 6 features, and wrote a textbook about discriminant theory. IP-OLDF finds that the MNM of the two features (x_4, x_6) is zero. Therefore, the 16 models including (x_4, x_6) are linearly separable. It is concluded that Fisher's LDF and QDF, based on the variance-covariance matrices, can hardly recognize linearly separable data [20].
In this research, two comparisons are made. First, Revised IP-OLDF resolves the problems of discriminant theory, but it requires much CPU time because it is solved by IP. Therefore, Revised IPLP-OLDF, which looks for a good estimate of MNM, is developed. The CPU times and NMs of Revised IPLP-OLDF are compared with those of Revised IP-OLDF. It is concluded that the CPU time of Revised IPLP-OLDF is faster than that of Revised IP-OLDF, and the error rates of Revised IPLP-OLDF are less than or equal to those of Revised IP-OLDF in the validation samples.

Table 4. Result of CPD Data

(Continuation of the feature list: x_16: abdominal circumference, x_17: external conjugate, x_18: intertrochanteric diameter, and x_19: lateral conjugate. Small random noises are added to x_9 and x_12.)

The fourth column (IP) shows MNM by Revised IP-OLDF. The fifth column (EC1) shows the NM of the re-sampling data by Revised IP-OLDF. The sixth column (%) is defined by the formula (EC1/20000 - IP/240) * 100. All NMs in the LP column are greater than or equal to those in the IP column. The IPLP column shows the estimates of MNM in the second phase. All 40 results of both functions are the same. The EC2 column shows the NM of the re-sampling data. The second '%' column is defined by the formula (EC2/20000 - IPLP/240) * 100. Comparison of the two '%' columns is as follows: there are 32 models for which both '%' values are the same; seven '%' values of Revised IP-OLDF are greater than those of Revised IPLP-OLDF; and only one error rate of Revised IPLP-OLDF is greater than that of Revised IP-OLDF. The CPU times of Revised IP-OLDF and Revised IPLP-OLDF for the 40 models are 38,170 seconds and 380 seconds. Revised IPLP-OLDF is approximately 100 times faster than Revised IP-OLDF. This large difference in CPU time may be caused by multicollinearity, because it may require a long time to check convergence.

One hundred re-sampling samples are generated from the four data sets. The NMs of Revised IPLP-OLDF are compared with those of Fisher's LDF and logistic regression by 100-fold cross validation [18, 19]. The results of Revised IPLP-OLDF are obtained by LINGO Ver. 14 in 2014. The results of Fisher's LDF and logistic regression are obtained by JMP Ver. 10 [8]. All possible models of the iris (2^4 - 1 = 15 models), student (2^5 - 1 = 31 models), and Swiss bank note (2^6 - 1 = 63 models) data are computed. There are (2^19 - 1) models of the CPD data; therefore, only 26 models selected by the forward and backward stepwise methods are computed. At first, 100 NMs are computed for the 135 different models, and the means of the error rates are computed for the 135 models. Next, these 13,500 discriminant functions are applied to the validation samples, and the mean error rates for the validation samples are computed. Last, the four differences are computed in Table 5.

Table 5 .
Comparison of the mean error rates of Revised IPLP-OLDF vs. (Fisher's LDF and logistic regression)