Applying Bayesian Decision Theory in RBF Neural Network to Improve Network Precision in Data Classification

The RBF neural network is a common tool for data classification. In neural networks such as the RBF network, the features within a layer are not connected to one another, so feature values are never multiplied together, and the interaction and dependency between the values of one feature and those of other features are not considered in classification or regression. The main reason for this lack of connectivity among features is the difficulty of learning the corresponding weights. Motivated by this, the present research multiplies the event probabilities of the features, in the style of classification in Bayesian decision theory, to improve the efficiency of data classification in the RBF neural network. Linear weight coefficients at the final layer of the RBF neural network are used to determine the importance of each feature event in the final decision. The research uses the capability of the RBF neural network to assign event probabilities to the values of the input features based on data centers. Because the weights in the final layer are linear, they can be learned easily, and learning them improves the classification. In empirical experiments, the proposed method achieved higher classification precision and accuracy than the networks it was compared with.


Introduction
Neural networks are among the common tools for regression and data classification [1]. A neural network can create the required nonlinear relationship between inputs and outputs, and this relationship is obtained by learning weights. Because the features of a neural network are not connected within each layer, the direct relationship among feature values, and hence the dependency among the features, is not considered in determining the final output. The most important reason for the lack of connections among the input features is the difficulty of learning the weights [2]; for this reason, connections among the features are removed. An important example is the removal of connections among the features in the Boltzmann machine, which reduces it to the restricted Boltzmann machine [4]. One method that does consider the values of the features and their dependencies in classification is Bayesian decision theory [5]. Bayesian decision theory is an appropriate method for making decisions based on the dependency among the features and for considering the interaction of features in the final output. However, modeling and calculating the degree of dependency among all features is difficult when the number of features is very high. In addition, the features take many different values, which makes Bayesian decision theory difficult to use. In principle this approach can consider all dependencies, but as the number of features increases, building a Bayesian network that models the dependency relationships among all the features becomes infeasible. Thus, in addition to modeling the relationship among the features by multiplying their values, this research models the dependency among them using learnable coefficients. The linear weights, determined from the desired output of each class, minimize as far as possible the error caused by the dependency among the features. Since the coefficients are linear, they do not have the weight-learning complexities of multilayer neural networks. In this research, the backpropagation algorithm was modified so that it can easily be used to learn the weights in the proposed neural network structure. The proposed method shows good results, particularly for a data set whose features are strongly dependent. Section 2 reviews the literature on common neural networks used in classification, Section 3 presents the methodology, and Section 4 presents the results of the experiments.

Literature Review
Neural networks are a popular tool for data classification because of their efficiency and relatively easy learning. The need to establish a relationship between the desired input and output has led to relatively diverse network structures. This section briefly reviews the types of neural networks used in the experiments.

Backpropagation neural network
In a backpropagation neural network [6], the output error is propagated from the output side back to the lower layers to calculate the error in each layer. The weights are then learned based on quantities such as the learning error, the desired output, the actual output, the derivative of the activation function, and the input value. The method is therefore suited to supervised learning. Backpropagation can be implemented on flat neural networks. It calculates the gradient that is needed to compute the weights used in the network [7]. In the learning process, backpropagation is commonly used together with the gradient descent optimization algorithm: it adjusts the weights of the neurons by calculating the gradient of the loss function [8]. For these reasons, this algorithm was used to learn the linear weight coefficients in this paper.

RBF neural networks
The RBF neural network [9] is a common tool in data classification [10]. In RBF neural networks, classification is performed using the value of a membership function defined on each feature. Because the membership functions are built around data centers, the statistical center of the data plays an important role in determining the output of the RBF neural network.

Naive Bayesian Classifier
The main purpose of Bayesian classifiers is to minimize the probability of misclassification [11]. Naive Bayes classifiers are a family of probabilistic classifiers. The naive Bayesian classifier is suitable for problems with a high number of independent input variables [12]. If the input variables are not independent, the classification approach should be reconsidered.

Deep Belief Network
Deep belief networks (DBNs) [13] are a new generation of neural networks that perform classification and regression using higher-level features [14]. A DBN extracts the higher-level features using a statistical approach. These networks are built by stacking restricted Boltzmann machines [15]. The restricted Boltzmann machine is a kind of Boltzmann machine in which the connections between features of the same level have been removed. Figure 1 shows the structure of the Boltzmann machine and the restricted Boltzmann machine for the visible input vector v and the hidden layer vector h [16]. In Figure 1, L is the symmetric weight between visible-layer features, J is the symmetric weight between hidden-layer features, and W is the symmetric weight between the visible and hidden layers. The difficulty of learning the Boltzmann machine weights has caused the connections among features of the same level to be removed [16]. Despite the removal of connections among same-level features in the hidden and visible layers, the restricted Boltzmann machine is considered an efficient tool in feature engineering.

Methodology
In the methodology section, Bayesian decision theory is used to determine the class of the data by directly multiplying the values of the data features. With this approach, the relationship among the features and their interaction in the final decision can be considered. On the other hand, because each feature can take different values, the probability of a feature event is determined relative to the center of the data set assigned to that feature. The RBF neural network was therefore used, owing to its ability to use radial functions. The methodology of this paper combines these two approaches. The motivation for the method is the lack of relationship among the features in the input layer. To show this lack of relationship in the neural network, the $m$-th output in the $n$-th layer is given by Equation (1) [17].
In Equation (1), $N^{(n-1)}$ is the number of nodes in layer $(n-1)$, $w^{(n)}_{N,m}$ is the connection weight between node $m$ in layer $n$ and node $N$ in layer $(n-1)$, and $X^{(n-1)}_{N}$ is the value of node $N$ in layer $(n-1)$. Equation (1) can be written in terms of the network input in the layer, as in Equation (2). Given the max and min terms in that equation, it can be rewritten as Equation (3).
In fact, Equation (2) can be written as a set of $N^{(n-1)} \times N^{(n-2)}$ max terms. Each max term consists of a product of weights applied to a single feature. Accordingly, no two features are ever multiplied together, and the direct effect of the feature values on one another is not considered in the calculation of the output. Even in the learning process, the gradient is calculated separately for each attribute, based on the ratio of changes with respect to that attribute's weight coefficients. The reason is the lack of relationship among the features of each layer in the structure of the neural network. The difficulty of learning the network weights is the most important reason for removing the relationship among the features of each layer, as with the learning problem of the Boltzmann machine [16]. This paper presents a method for learning the weights of the network and calculating the output based on the direct multiplication of the features. In the proposed method, the network weights are set using statistical parameters and computations.
Stat., Optim. Inf. Comput., Vol. 6, December 2018. Nader Rezazadeh.
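As a reading aid for the layer equations discussed in this section, a standard feed-forward form consistent with the symbols $N^{(n-1)}$, $w^{(n)}$ and $X^{(n-1)}$ can be sketched as follows. This is an assumed textbook form, not necessarily the notation of [17]: the activation $f$, the index names $k$ and $l$, and the choice of the identity activation in the second line are illustrative assumptions.

```latex
% Assumed standard form of the m-th output of layer n; f is the activation.
X^{(n)}_{m} = f\!\left( \sum_{k=1}^{N^{(n-1)}} w^{(n)}_{k,m}\, X^{(n-1)}_{k} \right)

% Taking f as the identity (for illustration only) and substituting layer n-1:
X^{(n)}_{m} = \sum_{k=1}^{N^{(n-1)}} \sum_{l=1}^{N^{(n-2)}}
              w^{(n)}_{k,m}\, w^{(n-1)}_{l,k}\, X^{(n-2)}_{l}
```

Under these assumptions the expansion has $N^{(n-1)} \times N^{(n-2)}$ terms, and each term multiplies weights by exactly one input value, so no product of two feature values such as $X_{l} X_{l'}$ ever appears.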

The implementation of the Bayesian decision theory
Before introducing the proposed method, this section describes how membership functions and weight coefficients are used in Bayesian classification. In a Bayesian classifier, given the input set $X = \{X_1, \ldots, X_n\}$, the probability of the class event $S_i$ is given by Equation (4) [18]:

$$P(S_i \mid X) = \frac{P(X \mid S_i)\, P(S_i)}{P(X)} \tag{4}$$

Equation (5) can be written for the conditional probability of $S_i$ given multiple independent events $X_1, \ldots, X_n$ [19]. Assuming the independence of the events $X_1, \ldots, X_n$, the expression $P(\cap_{j=1,\ldots,n} X_j \mid S_i)$ can be written as follows [20]:

$$P\Big(\bigcap_{j=1}^{n} X_j \,\Big|\, S_i\Big) = \prod_{j=1}^{n} P(X_j \mid S_i) \tag{6}$$

Therefore, if the observations $X = \{X_1, \ldots, X_n\}$ are pairwise independent, Equation (7) applies. According to Equation (7), when the features are independent and there is no dependency among them, the most probable class is obtained using Equation (8) [18]:

$$i^{*} = \arg\max_{i} \; P(S_i) \prod_{j=1}^{n} P(X_j \mid S_i) \tag{8}$$

Because each feature $X_j$, $1 \le j \le n$, can take different values, a Gaussian membership function can be used to implement the value of $P(X_j \mid S_i)$. The Gaussian membership function models the event probability of the feature $X_j$ under the assumption that the event $S_i$ has occurred. For this purpose, a set of base points is required, against which the input is compared to assess the probability of the event $X_j$. The key question is which point to choose: it should be a point that always occurs for a class, or at least occurs most often. Such a point can serve as the center of the Gaussian membership function. The solution proposed in this paper is to use the data centers. With an initial unsupervised preprocessing step, the data centers are calculated for the class $S_i$ and the center $C^i = \{C^i_1, \ldots, C^i_n\}$ is extracted. To measure the distance of the input $X_j$ from the center $C^i_j$ and assign an appropriate probability value to $X_j$, the Gaussian membership function [21] shown in Equation (9) is used. In simple terms, the data center of a feature, as the point with the maximum occurrence, takes the maximum event probability, and the probability of other values of the feature is determined by their proximity to the data center. Given the above, the Bayesian equation for classification can be rewritten as Equation (10), where the parameter $m$ is the number of classes. The value of $P(S_i)$ is easily calculated by dividing the number of training vectors of class $i$ ($S_i$) by the total number of training vectors.

Equation (10) holds if the event probabilities of the observations $\{X_1, \ldots, X_n\}$ are pairwise independent. If there are dependencies between the observations, the equation has an error that depends on the degree of dependency. In the implementation, it is assumed that no information is available on the level of dependency between observations. To obtain a more accurate probability, a set of weight coefficients $W^i = \{W^i_1, \ldots, W^i_n\}$ is used to increase the precision of the calculation of $P(S_i \mid X_1, \ldots, X_n)$, as shown in Equation (11). The set $W^i = \{W^i_1, \ldots, W^i_n\}$ is determined exclusively for $S_i$. If the values of these weight coefficients are determined correctly, the computational error caused by not modeling the relationship between the feature events of each data set is reduced. Because extracting the graph of relationships between the event probabilities of the features becomes very difficult as the number of features increases, the proposed method can be useful. These linear coefficients can be learned in the network. As only these weights are present in the classification layer, the gradient of the network error with respect to the weights can be extracted. The problems of learning the weights of connected features, mentioned in the previous section, therefore do not arise.
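The unweighted classification rule described in this section, with Gaussian memberships standing in for $P(X_j \mid S_i)$, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the per-class feature mean is used as the data center (the text only says that centers come from an unsupervised preprocessing step), the membership is taken as $\exp(-(x-c)^2 / (2\sigma^2))$, the scores are normalized over the classes, and all function names are hypothetical.

```python
import math


def class_centers(samples, labels):
    """Return per-class centers and priors.

    Assumption: the center C^i is the per-feature mean of the training
    vectors of class i; the prior P(S_i) is the class frequency.
    """
    sums, counts = {}, {}
    for x, s in zip(samples, labels):
        acc = sums.setdefault(s, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[s] = counts.get(s, 0) + 1
    centers = {s: [v / counts[s] for v in acc] for s, acc in sums.items()}
    priors = {s: counts[s] / len(labels) for s in counts}
    return centers, priors


def membership(x, c, sigma):
    """Gaussian membership of value x relative to center c (assumed form)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def posterior(x, centers, priors, sigma=0.1):
    """Normalized score P(S_i) * prod_j phi(X_j, C^i_j) for every class."""
    scores = {}
    for s, center in centers.items():
        likelihood = math.prod(
            membership(xj, cj, sigma) for xj, cj in zip(x, center)
        )
        scores[s] = priors[s] * likelihood
    total = sum(scores.values())
    if total == 0.0:
        # All memberships underflowed; fall back to the priors (assumption).
        return dict(priors)
    return {s: v / total for s, v in scores.items()}


def classify(x, centers, priors, sigma=0.1):
    """Pick the class with the largest normalized score (argmax rule)."""
    post = posterior(x, centers, priors, sigma)
    return max(post, key=post.get)
```

Calling `class_centers` on the training vectors and then `classify` on a new vector gives the most probable class under the independence assumption. The default `sigma=0.1` mirrors the standard deviation reported in the experiments; a larger value may be needed when feature values lie far from every center.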

The proposed RBF neural network structure
In the proposed method, the RBF neural network structure is changed so that Equation (11) can be implemented. To this end, the activation function was removed, and an AND gate was used to multiply the event probability values of the features and the weight coefficients. To implement these changes, the method for learning the linear weights must also be changed, as explained below. Figure (2) shows the modified RBF network that implements the Bayesian decision rule. The linear weight vector $W^i$ is learned based on the decision for class $i$, $S_i$. The weight vector $V^i$ is set so that it routes the input $X_j$ only to $\varphi(X_j, C^i_j)$, so that its membership value is determined only by the data center $C^i_j$. The weights are therefore initially set as in Equation (12).

Weight learning algorithm
The learning algorithm for the weight set $W^i = \{W^i_1, \ldots, W^i_n\}$ is based on the backpropagation algorithm. Because the network structure has been modified, the most important part of weight learning is the extraction of the gradient of the network output. The following explains how the network output is made compatible with the backpropagation algorithm. In the backpropagation algorithm, the weights are learned using Equation (13) [22], where the parameter $n$ is the number of features, $\delta_j$ is the error of neuron $j$, and $\eta$ is the learning rate. Given the assumptions of the problem, Equation (13) can be written as Equation (14), where the parameter $m$ is the number of classes. As the weights are linear for each output, the weight learning process is performed simply, as shown in Equation (15). In Equation (14), the gradient $\partial E / \partial W^i_j$ can be rewritten as Equation (16). No activation function is needed to calculate the Bayesian probability, so the network output and the final output are the same, and Equation (16) can be rewritten accordingly. The set of linear weights should be chosen so that the network produces the output 1 for each class $i$; in other words, the network should assign the highest possible probability to the training data. This means that the target value in Equation (11) for class $S_i$ equals 1. The aim of weight learning is to achieve the output 1 for the training data, given that $S_i$ occurred.
Based on this description, the value of $\partial E / \partial y$ is calculated as in Equation (17).
On the other hand, given the linear Equation (11), the value of $\partial y / \partial W^i_j$ is given by Equation (18).
If the prior probability is not to be used in decision making, the value of $\partial y / \partial W^i_j$ is given by Equation (19).
Thus, the final equation for adjusting the weights $W^i(t) = \{W^i_1(t), \ldots, W^i_n(t)\}$ based on the error propagation rule is Equation (21). Similarly, the weights are adjusted without considering the prior probability as in Equation (22). The adjustment of the weights can continue until the desired error is reached. In practice, it is preferable to fix the number of training epochs, because the network may not be able to reach the optimal error.
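The weight adjustment described in this section can be illustrated with a short Python sketch. Several details are assumptions rather than the author's stated equations: the class output is taken to have the product form $y = P(S_i) \prod_j W^i_j \, \varphi(X_j, C^i_j)$, the error is taken as $E = \tfrac{1}{2}(1 - y)^2$ with target 1, the membership is the Gaussian $\exp(-(x-c)^2/(2\sigma^2))$, the weights start at 1, and the function names are hypothetical. The chain rule $\partial E / \partial W^i_j = (\partial E / \partial y)(\partial y / \partial W^i_j)$ then drives a plain gradient-descent update.

```python
import math


def membership(x, c, sigma):
    """Gaussian membership of x relative to center c (assumed form)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def forward(x, center, weights, prior, sigma):
    """Assumed class output: prior * prod_j (W_j * phi_j); no activation."""
    phis = [membership(xj, cj, sigma) for xj, cj in zip(x, center)]
    y = prior * math.prod(w * p for w, p in zip(weights, phis))
    return y, phis


def mean_loss(samples, center, weights, prior, sigma):
    """Mean of E = 0.5 * (1 - y)^2 over the samples of one class."""
    total = 0.0
    for x in samples:
        y, _ = forward(x, center, weights, prior, sigma)
        total += 0.5 * (1.0 - y) ** 2
    return total / len(samples)


def train_class_weights(samples, center, prior, sigma=0.5, eta=0.12, epochs=100):
    """Learn the linear weights of one class by per-sample gradient descent."""
    n = len(center)
    weights = [1.0] * n  # assumed initial value
    for _ in range(epochs):
        for x in samples:
            y, phis = forward(x, center, weights, prior, sigma)
            dE_dy = y - 1.0  # derivative of 0.5 * (1 - y)^2 with target 1
            grads = []
            for j in range(n):
                # dy/dW_j = prior * phi_j * prod over k != j of (W_k * phi_k)
                others = math.prod(
                    weights[k] * phis[k] for k in range(n) if k != j
                )
                grads.append(dE_dy * prior * phis[j] * others)
            weights = [w - eta * g for w, g in zip(weights, grads)]
    return weights
```

The defaults `eta=0.12` and `epochs=100` follow the learning rate and number of epochs reported in the experiments; `sigma` is left as a parameter. To drop the prior from the decision, as the text allows, `prior=1.0` can be passed.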

Experimental results
The data sets used in the experiments were selected for a specific purpose: in each of them there is a high dependency between the feature values and the network output. For example, in the Deal or No Deal [17] data set, no final agreement is reached unless every feature has an acceptable value. It is therefore necessary to consider the dependency between the values of the features of each class in the final decision, as Bayesian decision theory does. The conditional entropy [18] criterion was used to measure the degree of dependency between the event of one feature and the other features. For this purpose, the conditional entropy criterion is computed over the data set samples as presented in Equation (23) [23].
The criterion $H(Y \mid X)$ is the amount of information required to estimate the value of $Y$ given that $X$ has occurred. It is therefore a good criterion for measuring the degree of dependency between the event of one feature and the other features in determining the final output. The conditional entropy of two random variables $X$ and $Y$ is defined as in Equation (24).
The mean conditional entropy among all features can also be defined as in Equation (25), which simply represents the mean value of the dependencies among the feature events of a data set. This criterion is considered in the experimental results section, in particular for data set 2.
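The conditional entropy criterion can be estimated from samples with a few lines of Python. The sketch below uses the standard definition $H(Y \mid X) = -\sum_{x,y} p(x,y) \log_2 p(y \mid x)$ with empirical frequencies. It assumes discrete (or already discretized) feature values and base-2 logarithms, and the averaging over all ordered pairs of distinct features in `mean_conditional_entropy` is an assumed reading of the mean criterion; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def conditional_entropy(xs, ys):
    """Estimate H(Y|X) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    marginal_x = Counter(xs)       # counts of x alone
    h = 0.0
    for (x, _y), c_xy in joint.items():
        p_xy = c_xy / n
        p_y_given_x = c_xy / marginal_x[x]
        h -= p_xy * math.log2(p_y_given_x)
    return h


def mean_conditional_entropy(columns):
    """Average H(column_b | column_a) over all ordered pairs a != b.

    Assumption: this pairwise average stands in for the mean criterion.
    """
    pairs = list(permutations(range(len(columns)), 2))
    if not pairs:
        raise ValueError("at least two feature columns are required")
    total = sum(conditional_entropy(columns[a], columns[b]) for a, b in pairs)
    return total / len(pairs)
```

A value of 0 means that one feature fully determines the other, while larger values indicate more remaining uncertainty about $Y$ once $X$ is known.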
In the experiments, classification was performed on each data set using the proposed Bayesian RBF method, the backpropagation neural network, the RBF neural network, and the deep belief network. The number of training epochs was 100, and the learning rate and standard deviation for all experiments were 0.12 and 0.1, respectively. Figure (3) shows the MSE learning error during training on data set 1.
As shown in Figure (3), the Bayesian RBF showed a lower MSE value throughout the experiment. Moreover, the learning error of this network converged faster than that of the other methods, especially the backpropagation neural network. As only linear coefficients are learned in the proposed method, this convergence rate was to be expected. Figure (4) shows the MSE learning error during training on data set 2.
In data set 2, the need for symmetric weights among the feature values, to capture the relationship between them, causes the learning process to have a higher error rate and a lower convergence rate. Because the dependency among the features is higher, the need for direct multiplication of the feature values is also greater. Because the Bayesian RBF addresses this problem, the network has a higher convergence rate and a lower learning error than the other methods. Next, the mean classification precision of the Bayesian RBF, RBF, backpropagation, and deep belief network methods is compared on data sets 1 and 2. The results are shown in Table (2).
The proposed Bayesian RBF method performed better than the other methods used in these tests. This superiority is most evident in the classification precision on data set 2. According to Table (2), the precision of all methods is significantly lower on data set 2 than on data set 1; however, the reduction was much smaller for the proposed method. The most important reason for the reduction in precision is the dependency among the feature values in determining the network output, and the resulting need for interaction terms among the features of the Deal or No Deal data set for correct estimation.
Next, the receiver operating characteristic (ROC) curve is used to illustrate the diagnostic ability of each classifier. The ROC curve is defined for binary classifiers. Hence, the target values of the student performance data set are divided into two main classes. The Deal or No Deal data set has two classes, so the ROC curve can be applied to it directly. The diagnostic ability of the selected classifiers on data set 1 is shown using the ROC curves in Figure (5).
To analyze the four curves accurately, the accuracy value corresponding to each ROC curve can be used. The accuracy value of each curve is obtained from the area under the corresponding curve in Figure (5). As shown in Figure (5) and Table (3), the Bayesian-RBF and the deep belief network have a higher accuracy than the others. Next, the diagnostic ability of the selected classifiers on data set 2 is shown using the ROC curves in Figure (6). In the same way, the accuracy value for each ROC curve is given in Table (4). In both experiments, the Bayesian-RBF neural network had a higher precision and accuracy than the other networks. The Bayesian RBF method estimated class values more successfully by learning operating weights for the features. The superiority of the proposed network's accuracy is more evident on data set 2, even though the accuracy of the Bayesian-RBF, like that of the other networks, decreased in the second experiment. The reason for the reduced accuracy of all networks in the second experiment is the increase in the mean conditional entropy between features. Increasing conditional entropy means increasing ambiguity in the conditional probability of the feature event, and this ambiguity reduces the classification accuracy of all networks. The proposed method, using embedded operating weights, reduces the effect of this increased ambiguity. The embedded weights reduce the classifier error caused by the dependency between feature events. With the help of the embedded linear weights, the max terms of the decision-making formula are adjusted more accurately. This adjustment is achieved using the targets of the training data set and the corresponding learning algorithm. Consequently, the proposed classifier has a better accuracy than the other classifiers selected in the experiments.
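The step from an ROC curve to an area-based accuracy value can be sketched in Python as follows. The sketch assumes binary labels coded as 1 (positive) and 0 (negative) and one real-valued score per sample; the trapezoidal rule is an assumption, since the text does not state how the area was computed, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    The threshold sweeps from the highest score to the lowest; samples
    with equal scores are consumed together so ties give one point.
    """
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An area of 1.0 corresponds to a classifier that ranks every positive sample above every negative one, and 0.5 corresponds to a random ranking, which is why the area is a convenient single number for comparing the curves of several classifiers.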

Conclusion
One shortcoming of neural networks is the lack of connections among the features of each layer, which means that the direct relationship of one feature with the others is not considered in classification. The proposed Bayesian RBF network makes it possible to consider the effect of the values of one feature on other features in decision making, by multiplying the values of the features and setting linear interaction weights among them. In addition, because the coefficients are linear, they can be learned by a backpropagation learning algorithm. Although the number of connections and linear weights considered among the features is smaller than the total number of possible edges of a dependency graph, they can greatly reduce the classification error. A low classification error and a high convergence rate in the learning process are characteristics of the Bayesian RBF method, as is also evident in the empirical experiments. The proposed method showed higher precision and accuracy in classification on data sets in which there are influential dependencies among the features.

Figure 2. The changed structure of the RBF network for Bayesian classification on class $S_i$.

Table 1. Datasets used in experiments.

Table 2. Results of algorithm estimation precision.

Table 3. Accuracy of classifiers.

Figure 6. ROC curves of the four selected classifiers on data set 2.

Table 4. Accuracy of classifiers.