A Trivial Linear Discriminant Function

In this paper, we focus on a new model selection procedure for discriminant analysis. Combining a resampling technique with k-fold cross-validation, we develop the "k-fold cross-validation for small sample" method. With this method, we obtain the mean error rate in the validation samples (M2) and the 95% confidence interval (CI) of the discriminant coefficients. Moreover, we propose a model selection procedure in which the model having the minimum M2 is chosen as the best model. We apply this new method and procedure to the pass/fail determination of exam scores. In this case, we fix the constant = 1 for seven linear discriminant functions (LDFs) and obtain several good results, as follows: 1) M2 of Fisher's LDF is over 4.6% worse than that of Revised IP-OLDF. 2) A soft-margin SVM with penalty c = 1 (SVM1) is worse than the other mathematical programming (MP) based LDFs and logistic regression. 3) The 95% CIs of the best discriminant coefficients are obtained. The seven LDFs other than Fisher's LDF are almost the same as a trivial LDF for the linearly separable model. Furthermore, if we choose the medians of the coefficients of the seven LDFs other than Fisher's LDF, those are almost the same as the trivial LDF for the linearly separable model.


Introduction
In this paper, we propose a new model selection procedure for discriminant analysis based on the "k-fold cross-validation for small sample" method [14,18]. Fisher [1] described the linear discriminant function (Fisher's LDF) and founded discriminant theory. He never formulated the standard error (SE) of Fisher's LDF. Therefore, there was no sophisticated model selection procedure for discriminant analysis. Although Lachenbruch et al. [4] proposed a leave-one-out (LOO) method for model selection in discriminant analysis, they could not achieve the new method because of the lack of computer power. If we fix k = 100, we can obtain 100 LDFs and 100 error rates in both the training and validation samples. From the 100 error rates, we calculate the two mean error rates M1 and M2 in the training and validation samples, respectively. We consider the model with the minimum M2 among all possible combination models [3] to be the best model. We apply this new method and procedure to three data sets of pass/fail determinations and obtain good results. We should distinguish these computer-intensive approaches from traditional inferential statistics based on the SE. Genuine statisticians without computer power established inferential statistics by their intellectual effort. Now, we can utilize the computing power of statistical software and MP solvers. Researchers who wish to discriminate their data can decide the best model. We obtain the 95% confidence intervals (CIs) of the error rates and discriminant coefficients [14,17,18,21,22]. The new method and procedure give us a precise and deterministic judgment about model selection in discriminant analysis.
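The resampling scheme can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the authors' implementation: the small sample is copied k times into a pseudo-sample, shuffled, and split into k folds, and the error rates are averaged over the k trials. The names `kfold_small_sample` and `fit_trivial` are ours, and `fit_trivial` is a toy stand-in classifier (the trivial LDF for a pass mark of 50) rather than any of the eight LDFs.

```python
import random
from statistics import mean

def kfold_small_sample(data, labels, fit, k=100, seed=0):
    """Copy the small sample k times into a pseudo-sample, shuffle,
    split into k folds, and return the mean training error rate (M1)
    and mean validation error rate (M2) over the k trials."""
    rng = random.Random(seed)
    pseudo = [(x, y) for _ in range(k) for x, y in zip(data, labels)]
    rng.shuffle(pseudo)
    fold = len(pseudo) // k
    m1s, m2s = [], []
    for i in range(k):
        val = pseudo[i * fold:(i + 1) * fold]
        train = pseudo[:i * fold] + pseudo[(i + 1) * fold:]
        f = fit([x for x, _ in train], [y for _, y in train])
        def err(sample):
            return mean(1.0 if f(x) != y else 0.0 for x, y in sample)
        m1s.append(err(train))
        m2s.append(err(val))
    return mean(m1s), mean(m2s)

# Stand-in classifier: the trivial LDF for a pass mark of 50
# (labels: 1 = pass, -1 = fail); it ignores the training sample.
fit_trivial = lambda X, y: (lambda x: 1 if sum(x) - 50 >= 0 else -1)
```

On data whose labels agree with the pass mark, this stand-in yields M1 = M2 = 0, which is the behavior the paper reports for the LDFs on linearly separable models.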

Eight LDFs
In this research, we compare two statistical LDFs and six MP-based LDFs [21]. The two statistical LDFs are Fisher's LDF and logistic regression. Fisher proposed Fisher's LDF based on the variance-covariance matrices and founded discriminant analysis. After Fisher's LDF, we can use a quadratic discriminant function (QDF) and multiclass discrimination based on the variance-covariance matrices. From 1971 to 1974, we developed the diagnostic logic for discriminating normal and abnormal symptoms in electrocardiogram (ECG) data by Fisher's LDF and QDF. Our result was inferior to the decision tree logic developed by the medical doctor. After this experience, we concluded that these discriminant functions are fragile for discriminating normal and abnormal cases, for two main reasons.
1) There are many cases near the discriminant hyperplane. All LDFs except Revised IP-OLDF cannot discriminate the cases on the discriminant hyperplane correctly (the first problem of discriminant analysis, problem1). Problem1 means that the number of misclassifications (NM) of these LDFs may not be correct.
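Problem1 can be made concrete with a short check. A sketch under our own conventions (the function name, the tie-breaking rule f(x) >= 0 → class1, and the toy data are ours): we count the cases lying exactly on the hyperplane, whose classification depends only on an arbitrary convention, so the reported NM may not be correct.

```python
def nm_and_on_plane(scores, labels, f):
    """Return (NM, number of cases on the hyperplane) for a
    discriminant function f, classifying f(x) >= 0 as class1 (+1).
    Cases with f(x) == 0 sit on the hyperplane: their class depends
    only on the tie-breaking convention, so NM may not be correct."""
    on_plane = sum(1 for x in scores if f(x) == 0)
    nm = sum(1 for x, y in zip(scores, labels)
             if (1 if f(x) >= 0 else -1) != y)
    return nm, on_plane

# Trivial LDF f = T1 + T2 + T3 + T4 - 50; the first student scores
# exactly 50 points and therefore lies on the hyperplane.
f = lambda x: sum(x) - 50
scores = [(25, 25, 0, 0), (10, 10, 10, 10), (20, 20, 20, 20)]
labels = [1, -1, 1]
```

Here NM = 0 only because the convention happens to classify the on-hyperplane case correctly; flipping the convention to f(x) > 0 would change NM without changing the hyperplane.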
2) If the values of some variables increase or decrease, the probability of belonging to the abnormal class increases from 0 to 1. The discriminant functions based on the variance-covariance matrices assume that the typical abnormal patient is the average of the abnormal class. However, the typical abnormal patients are far from the normal patients. Taguchi and Jugulum proposed the Mahalanobis-Taguchi (MT) method using the Mahalanobis distance based on the variance-covariance matrix [25]. They claim that the cases belonging to abnormal states are far from the normal state. Their claim is the same perception as ours. The Framingham study developed logistic regression as in equation (1). If some independent variable increases or decreases, the probability 'p' of belonging to class1 increases from 0 (class2) to 1 (class1). Most statisticians respect Fisher, who opened new frontiers of statistical theory by his intellectual consideration without computer power. Therefore, we think no researchers have discussed this point seriously. However, we can now use powerful computers and software such as the statistical software JMP [5] and the MP solver LINGO [7]. We are a blessed generation, unlike in Fisher's era. We should develop analytical techniques tailored to the characteristics of the individual data without assuming the normal distribution.
log(p / (1 − p)) = x^t b + b0    (1)
where p: the probability of belonging to class1; x: the independent variables.
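As a numerical illustration of equation (1), the following sketch evaluates p directly; the coefficient values in the usage below are hypothetical, chosen only to mimic the trivial LDF with a pass mark of 50.

```python
import math

def logistic_prob(b0, b, x):
    """p = 1 / (1 + exp(-(b0 + b.x))): the probability of belonging
    to class1 under the logistic model of equation (1)."""
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))
```

With b0 = -50 and b = (1, 1, 1, 1), p crosses 0.5 exactly at the pass mark and approaches 1 as the total score grows, which is the monotone behavior described above.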
We can obtain the maximum/minimum value of a function by MP, regardless of the presence or absence of constraints. Schrage [6] introduced several definitions of regression models. Quadratic programming (QP) defines the ordinary least squares method. Linear programming (LP) defines the "Least Absolute Values (LAV) regression". Nonlinear programming (NLP) represents several Lp-norm regressions. However, there was little research about regression analysis by MP. On the other hand, there was much research about MP-based discriminant models [24]. However, statistical users rarely used these discriminant functions because there was no evaluation on real data.
Vapnik [26] proposed three different SVM models. H-SVM discriminates linearly separable data clearly; it is obtained from equation (2) if we fix c = 0 and drop e_i. Real data are rarely linearly separable. For this reason, S-SVM has been defined in equation (2) with two objects. These two objects are combined by defining some "penalty c." However, S-SVM has no rule to determine 'c' correctly. In this research, two S-SVMs, SVM4 (c = 10^4) and SVM1 (c = 1), are examined. We know the M1 and M2 of SVM4 are almost always better than those of SVM1. Some researchers misunderstand that S-SVM can discriminate linearly separable data exactly and prefer to choose a small penalty c without examining real data.
MIN = ||b||^2 / 2 + c × Σ e_i;  y_i × (x_i^t b + b0) ≥ 1 − e_i;    (2)
where y_i = 1/−1 for x_i ∈ class1/class2; x_i: p independent variables (p-variables); b: p discriminant coefficients; b0: the constant and free variable; c: penalty c; e_i: non-negative decision variable.
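The role of the penalty c can be seen by evaluating the S-SVM object of equation (2) directly. A sketch (the function name is ours): e_i is the smallest non-negative slack satisfying the constraint, so the object trades the margin term ||b||^2 / 2 against c times the total slack.

```python
def ssvm_objective(b, b0, c, X, y):
    """S-SVM object of equation (2): ||b||^2 / 2 + c * sum(e_i),
    where e_i = max(0, 1 - y_i * (x_i.b + b0)) is the smallest
    non-negative slack satisfying y_i * (x_i.b + b0) >= 1 - e_i."""
    slacks = [max(0.0, 1.0 - yi * (sum(bj * xj for bj, xj in zip(b, xi)) + b0))
              for xi, yi in zip(X, y)]
    return 0.5 * sum(bj * bj for bj in b) + c * sum(slacks)
```

On linearly separable data a suitable b makes every slack zero, and the object reduces to the H-SVM margin term; with a small c, a short-coefficient solution with positive slacks can score better than a separating one, which is consistent with SVM1 behaving worse than SVM4.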
On the other hand, Shinmura [8] developed an optimal LDF by integer programming (IP-OLDF) based on the MNM criterion. We found several new facts about discriminant theory [9,10]. However, we found that IP-OLDF cannot find the true MNM if the data do not satisfy the general position [11,12,13]. Therefore, we developed Revised IP-OLDF in equation (3):
MIN = Σ e_i;  y_i × (x_i^t b + b0) ≥ 1 − M × e_i;    (3)
where e_i: 0/1 integer decision variable and M: a big-M constant such as 10,000. Only Revised IP-OLDF is free from problem1. Other LDFs must count the number of cases on the discriminant hyperplane. Moreover, only H-SVM and Revised IP-OLDF can recognize a linearly separable model theoretically. The other LDFs cannot recognize linearly separable data and cannot judge whether the data overlap or not (the second problem, problem2).
If e_i is a non-negative real variable in equation (3), we obtain Revised LP-OLDF, which is an L1-norm LDF. Revised IPLP-OLDF is a combined model of Revised LP-OLDF and Revised IP-OLDF. The CPU time of Revised IPLP-OLDF was much faster than that of Revised IP-OLDF before 2012 [15]. However, it is slower than Revised IP-OLDF after 2012 because the LINGO IP solver improved its computation time tremendously [20]. We expect Revised IPLP-OLDF to obtain an estimate of MNM faster than Revised IP-OLDF for large samples in the near future.
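The MNM criterion itself is easy to state. As an illustration only (a toy brute-force search over a coarse coefficient grid, not the integer program solved by LINGO):

```python
from itertools import product

def mnm_brute_force(X, y, grid):
    """Toy illustration of the MNM criterion: search a coarse grid of
    candidate coefficients (b, b0) and return the minimum number of
    misclassifications. y_i * f(x_i) <= 0 counts as misclassified, so
    cases on the hyperplane are never counted as correct, in the spirit
    of the strict constraint of Revised IP-OLDF."""
    best_nm, best_cand = len(y) + 1, None
    for cand in product(grid, repeat=len(X[0]) + 1):
        *b, b0 = cand
        nm = sum(1 for xi, yi in zip(X, y)
                 if yi * (sum(bj * xj for bj, xj in zip(b, xi)) + b0) <= 0)
        if nm < best_nm:
            best_nm, best_cand = nm, cand
    return best_nm, best_cand
```

For linearly separable toy data the search returns MNM = 0; the real Revised IP-OLDF obtains the exact MNM without a grid by letting the IP solver choose b and b0 freely.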
In this research, we compare Revised IP-OLDF with seven LDFs by the "k-fold cross-validation for small sample" method and evaluate the best model of Revised IP-OLDF among eight LDFs.

The Model Selection Procedure by K-fold Cross-validation for Small Sample Method
We examined Revised IP-OLDF on several small samples. It was difficult for us to compare Revised IP-OLDF with the seven other LDFs because we could not validate the effectiveness of Revised IP-OLDF. Therefore, we proposed the "k-fold cross-validation for small sample" method and can evaluate the eight LDFs by the two mean error rates M1 and M2 in the training and validation samples. Although Fisher developed Fisher's LDF, he never formulated the equations of the SEs of the error rates and discriminant coefficients. Therefore, there was no good model selection procedure aside from the LOO method. Until now, we obtained some NMs of Fisher's LDF and QDF by the two JMP options "prior probability = Same" and "prior probability = Proportional". After this paper, we fix the option as "prior probability = Proportional".
In this research, we propose the new method and model selection procedure as follows:
1) We discriminate the original data by the eight LDFs and two discriminant functions, QDF and Regularized Discriminant Analysis (RDA) [2]. In principle, we discriminate all possible models. Goodnight established this technique in regression analysis by the sweep operator [3]. By this technique, we can overlook the whole picture of the study.
2) We discriminate re-sampled samples by the new method. In this research, we fix k = 100 to obtain the two mean error rates and the 95% CIs of the error rates and discriminant coefficients [22].
3) We consider the best model to be the one with the minimum value of M2 (the minimum M2 standard) and compare the eight M2s of the eight best models. Although Vapnik defined the generalization ability, we claim the best model has good generalization ability. Moreover, we discuss the 95% CIs of the discriminant coefficients by fixing the constant = 1.
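Steps 1) and 3) together amount to an argmin over all variable combinations. A sketch with a hypothetical M2 table (in the paper each M2 comes from the k = 100 resampling runs of one LDF; the numbers below are invented for illustration):

```python
from itertools import combinations

def best_model_by_min_m2(variables, m2_of):
    """Minimum M2 standard: enumerate every non-empty combination of
    independent variables (the all-possible-models idea) and return
    the combination whose mean validation error rate M2 is smallest."""
    models = [c for r in range(1, len(variables) + 1)
              for c in combinations(variables, r)]
    return min(models, key=m2_of)

# Hypothetical M2 values; all other combinations default to 0.30.
m2_table = {("T1", "T2", "T3", "T4"): 0.02, ("T4", "T2"): 0.05}
best = best_model_by_min_m2(("T1", "T2", "T3", "T4"),
                            lambda m: m2_table.get(m, 0.30))
```

With four testlets this enumerates the 15 models of Table 1 and, under the invented table, selects the full model, mirroring the result reported at the 50% and 90% levels.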

Original Data and Re-sampling Sample
Since 2010, we have taught a preliminary statistics course for approximately 130 freshmen. The midterm and final exams consisted of 100 questions with ten choices. We consider discrimination using the four testlet scores as independent variables [16]. If the pass mark is 50 points, we can easily obtain a trivial LDF (f = T1 + T2 + T3 + T4 − 50) whose NM is zero. If f ≥ 0 or f < 0, the student passes or fails the exam, respectively. Usually, all LDFs except Revised IP-OLDF are not free from problem1. Because we can define the discriminant rule by the exam scores (the independent variables), we obtain the above trivial LDF, which is free from problem1. We claim that the pass/fail determination using exam scores gives us deep knowledge about discrimination and offers good test data for linearly separable models. We had many papers about medical diagnosis. However, we have no connection with medical doctors now. Therefore, we use these data instead of medical data.

Table 1 shows the discrimination of the four testlet scores at the 10%, 50% and 90% levels of the midterm exam in 2012. One hundred twenty-four students attended the exam. 'p' denotes the number of independent variables selected by the forward stepwise technique. At the 10% level, the 2-variables model (T4, T2) is linearly separable. There are 15 discriminant models from all combinations of the four variables, only four of which are linearly separable. The pass mark is 36 points, and ten students fail the exam. At the 50% level, only the full model is linearly separable. The pass mark is 63 points, and fifty-seven students fail the exam. At the 90% level, only the full model is linearly separable. The pass mark is 78 points, and one hundred twelve students fail the exam. "RIP" and "Logi" are Revised IP-OLDF and logistic regression. Both LDFs can discriminate a linearly separable model correctly. However, logistic regression sometimes cannot discriminate linearly separable data correctly.
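The trivial LDF is direct to write down; this sketch uses hypothetical testlet scores:

```python
def trivial_ldf(t1, t2, t3, t4, pass_mark=50):
    """Trivial LDF f = T1 + T2 + T3 + T4 - pass_mark: the student
    passes if f >= 0 and fails if f < 0, so NM = 0 by construction
    whenever pass/fail is actually decided by the total score."""
    f = t1 + t2 + t3 + t4 - pass_mark
    return "pass" if f >= 0 else "fail"
```

The same function with pass_mark = 63 or 78 gives the trivial LDFs of the 50% and 90% levels discussed below.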
On the other hand, Fisher's LDF and QDF cannot discriminate all linearly separable models in these data. Fig. 1 shows three scatter plots. The x-axis is the first principal component. The y-axes correspond to the second, third and fourth principal components from left to right. The three 95% probability ellipses correspond to three groups: SCORE <= 35, 36 <= SCORE <= 77 and 78 <= SCORE. The three groups consist of 10, 102 and 12 students. The first and third groups are almost symmetric because the two ellipses and cases are almost the same. If we check the NMs of the full model at the three levels in Table 1, those of Fisher's LDF and QDF are 1, 7, and 10, respectively. We cannot explain whether this increasing trend is common for other data.

Pass/Fail Determination by Exam Scores (50% level in 2012)
In this chapter, we discuss the discrimination at the 50% level. The pass mark is 63 points. Table 1 tells us only the full model is linearly separable. We know the trivial LDF is f = T1 + T2 + T3 + T4 − 63. In this research, we omit the four 1-variable models because they are meaningless for the discrimination. Table 2 shows the MNM and the nine 'Diff2' values of seven LDFs and two discriminant functions. The seven LDFs are as follows: H-SVM, SVM4, SVM1, Revised LP-OLDF (LP), Revised IPLP-OLDF (IPLP), logistic regression (Logistic) and Fisher's LDF (LDF). The two discriminant functions are QDF and RDA. The nine 'Diff2' values are the differences (nine NMs − MNM). The first column is the sequential number (SN) of the eleven models from the 4-variables model to the 2-variables models, which correspond to the second column 'Model'. We check the number of cases on the discriminant hyperplane for the seven LDFs and show it in the parenthesized numbers. Only five 2-variables models of Revised LP-OLDF cannot avoid problem1. We cannot check the number of cases on the discriminant hyperplane for the four statistical discriminant functions analyzed by JMP. SVM1, LDF, QDF, and RDA cannot recognize the linearly separable model (problem2). 'Diff2' tells us the following facts.
1) We can roughly evaluate the nine discriminant functions as follows. Revised IPLP-OLDF shows the best result because ten NMs of IPLP are the same as MNM. Logistic regression is the second best because four NMs of 'Logistic' are the same as MNM. The three statistical discriminant functions except logistic regression show the worst results.

NMs of Original Data
2) Two models of LP have minus values because there are several cases on the discriminant hyperplane for these two models. We cannot evaluate problem1 for the other three models (SN = 9, 10, 11). Until now, we have discussed problem1 for the models having minus values of 'Diff2'. However, we cannot solve problem1 clearly for the five models of Revised LP-OLDF. Moreover, we cannot find problem1 for the four statistical discriminant functions because we cannot know the number of cases on the discriminant hyperplane. We expect statistical software to output this number for users.

Table 3 shows the results of the new method. We omit QDF and RDA because they are not LDFs. We examine the 11 discriminant models of the seven LDFs, and only the full model of H-SVM. The first 11 rows are the eleven models of Revised IP-OLDF (RIP). M1 and M2 are the mean error rates in the training and validation samples. We confirm the full models of the eight LDFs are the best models because their M2 values are the minimum among the 11 models.

The 95% CI of Discriminant Coefficients
We examine the 95% CIs of the best models. Table 4 shows the medians and 95% CIs of the six MP-based LDFs obtained by fixing the constant = 1. By our mistake, the JMP script we designed does not output the 100 discriminant coefficients of Fisher's LDF and logistic regression. If the 95% CI includes zero, we judge the pseudo-population coefficient to be zero. If the 2.5% value is greater than 0 or the 97.5% value is less than 0, we estimate the pseudo-population coefficient to be positive or negative, respectively. Following this judgment, only the five coefficients of T2 except for SVM1 are zero, and the other coefficients are negative. The four medians of the full model are almost -0.016. This fact implies these LDFs are the same as the trivial LDF in equation (4). We found this surprising result by fixing the constant = 1. When we did not fix the constant, we could not find this result [22]. We discriminate the pseudo-population sample having 12,400 cases by Fisher's LDF and logistic regression. Equation (5) is the logistic regression. If we divide its five coefficients by 2178, we obtain the same trivial LDF. Because the numbers in parentheses are SEs, all coefficients are judged to be zero.
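The CI judgment can be sketched as follows, assuming the 100 resampled coefficients of one variable are in hand (the function names and the synthetic coefficient values are ours):

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile of pre-sorted values."""
    i = (len(sorted_vals) - 1) * p / 100.0
    lo = int(i)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (i - lo)

def coef_judgment(coefs):
    """Median and 95% CI of one coefficient over the k resampled LDFs;
    the pseudo-population coefficient is judged zero when the CI
    contains zero, else positive or negative by the CI's sign."""
    s = sorted(coefs)
    lo, med, hi = percentile(s, 2.5), percentile(s, 50), percentile(s, 97.5)
    sign = "zero" if lo <= 0 <= hi else ("positive" if lo > 0 else "negative")
    return med, (lo, hi), sign
```

A coefficient like T2, whose resampled values straddle zero, is judged "zero", while a coefficient whose 97.5% value stays below zero is judged "negative", exactly as in the rule stated above.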
On the other hand, we get Fisher's LDF in equation (6). We obtain these coefficients by regression analysis. If we divide the coefficients by -3.22, we see that Fisher's LDF is not the same as the trivial LDF because the third coefficient becomes -0.008. This fact tells us that Fisher's LDF does not follow the real data. It assumes the data are normally distributed.

Pass/Fail Determination by Exam Scores (90% level in 2012)
In this section, we discuss the discrimination at the 90% level. The pass mark is 78 points. Table 1 tells us only the full model is linearly separable. We know the trivial LDF is f = T1 + T2 + T3 + T4 − 78 (or 77.5). 2) SVM1, LDF, RDA, and QDF cannot discriminate the linearly separable model exactly (problem2).

NMs of Original Data
3) Although logistic regression is the second best at the 50% level, ten NMs of its non-linearly separable models are greater than MNM. Moreover, logistic regression is worse than SVM4, SVM1, LP, and IPLP. This result is crucial because logistic regression is the same as IPLP and better than SVM4, SVM1, and LP on other data.
4) The three statistical discriminant functions are worse than RIP and IPLP.
5) We find problem1 of Revised LP-OLDF for three models (SN = 3, 8, 10).

Table 7 shows the medians and 95% CIs of the six MP-based LDFs. The three coefficients of T2 of RIP, IPLP and LP are zero, and the other coefficients are negative. If the four medians of the full model are -0.0128, the LDF is the same as a trivial LDF such as f = T1 + T2 + T3 + T4 − 78 (or 77.5). All MP-based LDFs are almost equal to the trivial LDF in equation (7).
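The claim can be checked by a line of arithmetic: dividing the LDF f = Σ c_i T_i + 1 through by one median coefficient recovers the trivial LDF (the medians below are the −0.0128 ≈ −1/78 values quoted above; dividing by a negative number flips the sign of f, which only swaps the two class labels but leaves the hyperplane unchanged).

```python
# Full-model medians from Table 7, all about -0.0128 = -1/78:
coefs = [-0.0128, -0.0128, -0.0128, -0.0128]
c0 = coefs[0]
# Divide f = c1*T1 + ... + c4*T4 + 1 through by c0:
rescaled = [round(c / c0, 3) for c in coefs] + [round(1.0 / c0)]
# rescaled gives coefficients (1, 1, 1, 1) and constant -78, i.e. the
# same hyperplane as the trivial LDF f = T1 + T2 + T3 + T4 - 78.
```

The same arithmetic with the −0.016 ≈ −1/63 medians of the 50% level recovers f = T1 + T2 + T3 + T4 − 63.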

The 95% CI of Discriminant Coefficients
We examine the coefficients of the full model as the best model. The trivial LDF is given by equation (10).

Conclusion
In this research, we discussed the new method and model selection procedure for discriminant analysis. We discriminated the pass/fail determinations at the 10%, 50%, and 90% levels. We selected the best models of the eight LDFs by the "minimum M2 standard". The two studies at the 50% and 90% levels choose the same best models because only the full models are linearly separable. We obtained the surprising result that the best models of all MP-based LDFs and logistic regression are almost the same as the trivial LDFs. Both Fisher's LDFs are quite different from the trivial LDFs. We could obtain these results by fixing the constant = 1. The absolute values of 'M2Diff.' of the six LDFs except Fisher's LDF are within 0.08% and 0.16%, respectively. However, those of Fisher's LDF are 4.58% and 11.53%, respectively. Next, we selected the second best model among the ten non-linearly separable models for the six LDFs. The analysis at the 50% level selects the second 3-variables model, and all 'M2Diff.' values are greater than zero. In particular, that of Fisher's LDF is 3.026%. The analysis at the 90% level selects the fourth 3-variables model, and all 'M2Diff.' values are greater than 0.04%. In particular, that of Fisher's LDF is 10.247%. On the other hand, there are four linearly separable models at the 10% level. Four LDFs (RIP, IPLP, LP and Fisher's LDF) select the fourth 2-variables model, and the other four (H-SVM, SVM4, SVM1 and logistic regression) select the full model. We choose the full model as the best model. Moreover, no LDF is the same as the trivial LDF. Based on the above results at the 10%, 50%, and 90% levels, we summarize as follows.
1) The three M2s of Fisher's best LDFs are 9.66%, 4.58% and 11.53% worse than those of Revised IP-OLDF. Only Fisher's LDFs are fragile for the pass/fail determination by exam scores. Therefore, we are concerned that we may obtain the same results for medical diagnosis.
2) The two best models of Revised IP-OLDF and logistic regression are the same as the trivial LDFs at the 50% and 90% levels. However, no LDF is the same as the trivial LDF at the 10% level. We cannot explain the reason theoretically.
3) If we select the second best LDFs for the non-linearly separable models, all LDFs select the same models and the M2s of Revised IP-OLDF have the minimum values. This fact may imply Revised IP-OLDF is superior to the other LDFs for non-linearly separable models, although only Revised IP-OLDF and H-SVM can recognize the linearly separable models.