Weighted Machine Learning

Not all training samples are equally important in supervised machine learning. Some samples may be measured by more accurate devices, come from sources with different reliabilities, carry more confidence, or be more relevant than others; for any of these reasons, a user may want to put more emphasis on particular training samples. Non-weighted machine learning techniques are designed for equally important training samples: (a) the cost of misclassification is the same for every training sample in parametric classification techniques, (b) all residuals are equally important in parametric regression models, and (c) when voting in non-parametric classification and regression models, training samples either have equal weights or weights determined internally by kernels in the feature space, with no external weights. The weighted least squares model is an example of a weighted machine learning technique that takes the training samples' weights into account. In this work, we develop weighted versions of the Bayesian predictor, perceptron, multilayer perceptron, SVM, and decision tree, and show how their results differ from their non-weighted counterparts.


Introduction
In this paper, machine learning algorithms are modified to take the training samples' weights into account. The weighted machine learning techniques developed in this work not only provide developers with the opportunity to give different weights to training samples, but can also be embedded into other algorithms such as AdaBoost [1,2], where there is a hierarchy of classifiers, each of which must be trained with a different weighting over the training samples. Figure 1 shows, schematically, how non-weighted linear predictors become biased when the training samples' weights are taken into account. Figure 2 shows the same for nonlinear classifiers. There are two classes, one indicated with circles and the other with squares. The darkness of a training sample shows its weight, and the classifier is represented with a dashed line. As shown in the figure, the weighted classifier decides in favor of more important samples by keeping more distance from them.
To bias the predictor in favor of more important training samples, we embed the training samples' weights into the cost function. This way we regulate the misclassification cost based on the weights during training. In other words, misclassifying more important training samples is more costly, and the predictor will attempt to avoid it. This approach is only possible for machine learning algorithms that are based on minimizing a cost function. For Bayesian predictors, we instead embed the training samples' weights into the probability distribution functions. This way we increase the likelihood of a class when the irresponsive sample (the sample with an unknown output) is close to training samples with large weights in that class. In short, the weighted predictor is more concerned with correctly predicting training samples with larger weights than those with smaller weights. As a result, the trained model predicts in favor of training samples with larger weights. This makes the weighted predictor different from its non-weighted counterpart.
In this paper, we use the training dataset in Table 1 to show the difference between the weighted machine learning techniques developed here and their non-weighted counterparts. We consider two classes $\omega_1$ and $\omega_2$, each with 10 samples, and two features $l_1$ and $l_2$ to simplify the visualization. The training samples and their weights in this table are chosen carefully to emphasize the difference between weighted and non-weighted predictors. The training dataset is shown in Figure 3. Circles represent class $\omega_1$ and squares represent class $\omega_2$. The darkness of a training sample shows its weight.

Classification
The Bayes classifier calculates the probability of each class given the observed feature vector as $p(\omega_j|\mathbf{x}) = p(\omega_j)p(\mathbf{x}|\omega_j)/p(\mathbf{x})$ and then assigns $\mathbf{x}$ to the class with the highest probability [3,4]; here $p(\omega_j|\mathbf{x})$ is the posterior probability, $p(\omega_j)$ is the prior probability, and $p(\mathbf{x}|\omega_j)$ is the likelihood. The denominator, $p(\mathbf{x})$, is usually ignored in calculations as it is the same for all classes. A simple way to embed the training samples' weights ($g_i$) into the Bayes classifier is to define the prior probability $p(\omega_j)$ as the sum of the weights of the training samples belonging to class $\omega_j$ divided by the sum of all weights:

$$p(\omega_j) = \frac{\sum_{i \in \omega_j} g_i}{\sum_{i=1}^{N} g_i} \quad (1)$$

Table 1. Training samples and their weights.
Regardless of whether the likelihood $p(\mathbf{x}|\omega_j)$ is defined parametrically or non-parametrically, an important drawback of this simple approach is that it does not consider where the irresponsive sample $\mathbf{x}$ is situated with respect to the more important training samples in each class. For example, the irresponsive sample in Figure 4, shown with a cross, is closer to the more important samples (the darker ones in the figure) in $\omega_2$, and one expects it to be classified in $\omega_2$. However, based on the aforementioned approach, it will be classified in $\omega_1$ because $\omega_1$ has a larger prior ($p(\omega_1) > p(\omega_2)$) and the likelihoods of the two classes are equal ($p(\mathbf{x}|\omega_1) = p(\mathbf{x}|\omega_2)$); the likelihoods are calculated without considering the weights. To solve this problem, the weights need to be considered in the likelihoods. To take into account the position of $\mathbf{x}$ with respect to the more important training samples in each class, instead of calculating the priors from (1), we define the likelihood $p(\mathbf{x}|\omega_j)$ based on non-parametric Parzen windows [5]:

$$p(\mathbf{x}|\omega_j) = \frac{\sum_{i=1}^{N_j} g_i\, K(\mathbf{x} - \mathbf{x}_i, \Sigma_j/\sigma)}{\sum_{i=1}^{N_j} g_i} \quad (2)$$

In this equation, $N_j$ is the size of class $\omega_j$, $\mathbf{x}_i$ is the $i$-th training sample's feature vector, $\mathbf{x}$ is the irresponsive sample's feature vector, $g_i$ is the $i$-th training sample's weight, $K$ is the kernel function, and $\Sigma_j$ is the covariance matrix of the features computed only from the samples of class $\omega_j$. The step kernel in (3) or the Gaussian kernel in (4) can be used in (2):

$$K(\mathbf{x} - \mathbf{x}_i, \Sigma) = \begin{cases} 1 & \text{if } |x_k - x_{ik}| \le h_k \text{ for all } k = 1, \ldots, l \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

$$K(\mathbf{x} - \mathbf{x}_i, \Sigma) = \frac{1}{(2\pi)^{l/2}\, |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}\, (\mathbf{x} - \mathbf{x}_i)^T \Sigma^{-1} (\mathbf{x} - \mathbf{x}_i)\right) \quad (4)$$

where $l$ is the dimension of the feature space, the subscript $k$ in $x_k$ and $x_{ik}$ refers to the $k$-th feature of the corresponding feature vector, and $h_k$ is the bandwidth along the $k$-th feature. More kernels are available in Hardle [6] and Fan and Gijbels [7]. Instead of choosing the kernel bandwidth to be a constant value, which is the common practice, we choose the covariance matrix of class $\omega_j$ (denoted $\Sigma_j$) divided by a constant $\sigma$ as the kernel bandwidth for class $j$. The constant $\sigma$ can be tuned using cross-validation.
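As a concrete sketch of the weighted prior (1), the weighted Parzen likelihood (2), and the Gaussian kernel (4), the following minimal numpy implementation may help; the function names and the class representation (a list of `(samples, weights)` pairs) are ours, not from the paper:

```python
import numpy as np

def gaussian_kernel(u, cov):
    """Multivariate Gaussian kernel K(u, Sigma), as in (4)."""
    l = u.shape[0]
    norm = 1.0 / ((2.0 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * u @ np.linalg.inv(cov) @ u)

def weighted_parzen_likelihood(x, X_j, g_j, sigma=3.0):
    """Weighted Parzen estimate of p(x | omega_j), as in (2).
    X_j: (N_j, l) samples of class j; g_j: (N_j,) their weights.
    Bandwidth is the class covariance divided by sigma, per the text."""
    cov = np.cov(X_j.T) / sigma
    k = np.array([gaussian_kernel(x - xi, cov) for xi in X_j])
    return np.sum(g_j * k) / np.sum(g_j)

def classify(x, classes, sigma=3.0):
    """classes: list of (X_j, g_j) pairs.  Priors from (1); returns the
    index of the class with the largest posterior score."""
    total = sum(g.sum() for _, g in classes)
    scores = [(g.sum() / total) * weighted_parzen_likelihood(x, X, g, sigma)
              for X, g in classes]
    return int(np.argmax(scores))
```

With equal weights this reduces to an ordinary Parzen-window Bayes classifier; increasing $g_i$ for samples near $\mathbf{x}$ raises the likelihood of their class, as the text describes.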

Regression
In the case of regression, (5) can be used to estimate the response at the irresponsive sample $\mathbf{x}$:

$$\hat{y}(\mathbf{x}) = \frac{\sum_{i=1}^{N} g_i\, K(\mathbf{x} - \mathbf{x}_i, \Sigma)\, y_i}{\sum_{i=1}^{N} g_i\, K(\mathbf{x} - \mathbf{x}_i, \Sigma)} \quad (5)$$

This equation estimates the response at $\mathbf{x}$ as the weighted average of the training samples' responses, where each training sample's weight is its original weight ($g_i$) multiplied by the output of the kernel for that sample ($K(\mathbf{x} - \mathbf{x}_i, \Sigma)$). In other words, a training sample's effective weight in this equation ($g_i K(\mathbf{x} - \mathbf{x}_i, \Sigma)$) combines its importance with its distance to the irresponsive sample in the feature space; the latter is what the kernel captures. Since there are no classes in regression, the covariance matrix of the features ($\Sigma$) in (5) is computed over all training samples.
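The estimate (5) can be sketched in a few lines of numpy; the function name is ours, and we use the covariance of all samples divided by a tuning constant $\sigma$ as the Gaussian bandwidth, by analogy with the classification section (an assumption on our part):

```python
import numpy as np

def weighted_kernel_regression(x, X, y, g, sigma=3.0):
    """Estimate the response at x via (5): a weighted average of the
    responses y_i, each sample weighted by g_i times its Gaussian-kernel
    proximity to x.  X: (N, l) samples, y: (N,) responses, g: (N,) weights."""
    cov = np.cov(X.T) / sigma          # bandwidth over all training samples
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    l = X.shape[1]
    norm = 1.0 / ((2.0 * np.pi) ** (l / 2) * np.sqrt(det))
    k = np.array([norm * np.exp(-0.5 * (x - xi) @ inv @ (x - xi)) for xi in X])
    w = g * k                          # g_i * K(x - x_i, Sigma)
    return np.sum(w * y) / np.sum(w)
```

Raising $g_i$ pulls the estimate toward that sample's response, which is exactly the intended bias in favor of important samples.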

Experiment
Here we use the dataset in Table 1 to show the effect of embedding the training samples' weights in the likelihoods (2) on the classification of the irresponsive sample. The priors are considered equal since the frequencies of the two classes are the same. The Gaussian kernel in (4) with a bandwidth of $\Sigma_j/3$ is used as the Parzen window, where $\Sigma_j$ is the covariance matrix of class $\omega_j$. Figure 5 shows the division of the feature space between the two classes with and without considering the training samples' weights in calculating the likelihoods. When the weighted Bayesian classifier is applied, the classification of the irresponsive sample (shown with a cross) switches from class $\omega_2$ to class $\omega_1$ because of its proximity to some important samples in class $\omega_1$.

Least squares (LS)
The output of the least squares (LS) predictor is $\mathbf{x}^T \mathbf{w}$, where $\mathbf{w}$ is the weight vector extended to include the threshold or intercept ($w_0$) and $\mathbf{x}$ is the feature vector extended to include a 1. The desired output is denoted $y_i$. The weight vector is computed so as to minimize the sum of squared errors between the desired and true outputs [8], that is:

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i^T \mathbf{w} \right)^2 \quad (6)$$

where $N$ is the number of training samples. Minimizing the cost function in (6) with respect to $\mathbf{w}$ results in:

$$\left( \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \right) \mathbf{w} = \sum_{i=1}^{N} \mathbf{x}_i y_i \quad (7)$$

Let us define:

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \quad (8)$$

where $X$ is an $N \times (l+1)$ matrix whose rows are the feature vectors with an additional 1, $l$ is the number of features, and $\mathbf{y}$ is the vector of the corresponding desired responses. Then:

$$\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T = X^T X, \qquad \sum_{i=1}^{N} \mathbf{x}_i y_i = X^T \mathbf{y} \quad (9)$$

By substituting (9) in (7) we have:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y} \quad (10)$$

The matrix $X^+ = (X^T X)^{-1} X^T$ is known as the pseudoinverse of $X$ and equals $X^{-1}$ when $X$ is square. To develop the weighted version of the LS predictor, we adjust the cost of each error based on the weight of the corresponding training sample ($g_i$):

$$J(\mathbf{w}) = \sum_{i=1}^{N} g_i \left( y_i - \mathbf{x}_i^T \mathbf{w} \right)^2 \quad (11)$$

Minimizing the cost function in (11) with respect to $\mathbf{w}$ results in:

$$\left( \sum_{i=1}^{N} g_i \mathbf{x}_i \mathbf{x}_i^T \right) \mathbf{w} = \sum_{i=1}^{N} g_i \mathbf{x}_i y_i \quad (12)$$

Let us define:

$$G = \operatorname{diag}(g_1, g_2, \ldots, g_N) \quad (13)$$

Then:

$$\sum_{i=1}^{N} g_i \mathbf{x}_i \mathbf{x}_i^T = X^T G X, \qquad \sum_{i=1}^{N} g_i \mathbf{x}_i y_i = X^T G \mathbf{y} \quad (14)$$

Substituting (14) in (12) results in:

$$\mathbf{w} = (X^T G X)^{-1} X^T G \mathbf{y} \quad (15)$$

Equation (15) is known as weighted least squares [9]. Let us investigate what happens if the weight of every training sample equals a constant $c$. In this case $G = c \times I_{N \times N}$, where $I_{N \times N}$ is the $N \times N$ identity matrix. Substituting this in (15) results in:

$$\mathbf{w} = (c\, X^T X)^{-1} c\, X^T \mathbf{y} = (X^T X)^{-1} X^T \mathbf{y} \quad (16)$$

In other words, the weighted LS is no different from the non-weighted LS if all weights are equal. This is the case for all weighted predictors developed in this work.
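The closed form (15) is a one-liner; a minimal numpy sketch (the function name is ours), which also illustrates the equal-weight reduction (16):

```python
import numpy as np

def weighted_least_squares(X, y, g):
    """Solve (15): w = (X^T G X)^{-1} X^T G y, where X is already extended
    with a column of ones and G = diag(g)."""
    G = np.diag(g)
    return np.linalg.solve(X.T @ G @ X, X.T @ G @ y)
```

With all weights equal to a constant, the result matches the ordinary LS solution (10), as (16) predicts; increasing one sample's weight shrinks that sample's residual at the expense of the others.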

Experiment.
Here we use the dataset in Table 1 to show the effect of embedding the training samples' weights in LS (15).

Perceptron
Stat., Optim. Inf. Comput. Vol. 6, December 2018

The perceptron cost function is defined as [10]:

$$J(\mathbf{w}) = \sum_{i=1}^{N} \max\!\left(0,\, -y_i \mathbf{w}^T \mathbf{x}_i\right) \quad (17)$$

where $N$ is the number of training samples, $\mathbf{x}_i$ is the $i$-th feature vector including an additional 1 as its last element, and $\mathbf{w}$ is the weight vector (the vector perpendicular to the hyperplane classifier, pointing toward class $\omega_1$) including the threshold ($w_0$) as its last element. The cost function is minimized when the classifier produces a positive response for the samples of class $\omega_1$ and a negative response for the samples of class $\omega_2$. We can iteratively find the weight vector that minimizes the perceptron cost function using the gradient descent scheme [10,11]:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \sum_{i \in \mathcal{M}_t} y_i \mathbf{x}_i \quad (18)$$

where $\mathbf{w}_t$ is the weight-vector estimate at the $t$-th iteration, $\mathcal{M}_t$ is the set of training samples misclassified by $\mathbf{w}_t$, and $\alpha$ is the training rate, a small positive number.
We embed the training samples' weights ($g_i$) in the perceptron cost function (19) to punish the classifier more for misclassifying training samples with larger weights and less for those with smaller weights:

$$J(\mathbf{w}) = \sum_{i=1}^{N} g_i \max\!\left(0,\, -y_i \mathbf{w}^T \mathbf{x}_i\right) \quad (19)$$

In other words, the training samples' weights enter the cost function to adjust the perceptron cost based on the importance of each training sample. The perceptron classifier is no longer equally fair to all training samples.
With the new cost function, the iterative steps for updating the weight vector through the gradient descent scheme change to:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \sum_{i \in \mathcal{M}_t} g_i y_i \mathbf{x}_i \quad (20)$$

If we define $\alpha_i = \alpha g_i$, we obtain:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \sum_{i \in \mathcal{M}_t} \alpha_i y_i \mathbf{x}_i \quad (21)$$

Therefore, the weighted perceptron classifier can be obtained by including the weights in the cost and defining the training rate as $\alpha_i = \alpha g_i$, which amounts to a different training rate for each training sample based on its weight. Adjusting the training rate based on the training samples' weights, i.e. including the weights in the cost function, biases the trained perceptron in favor of the training samples with larger weights.
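The per-sample training rate $\alpha_i = \alpha g_i$ amounts to one extra multiplication in the batch update; a minimal sketch (our own function name and toy conventions):

```python
import numpy as np

def weighted_perceptron(X, y, g, alpha=0.1, epochs=200):
    """Batch gradient descent on the weighted perceptron cost: each
    misclassified sample pulls the weight vector with an effective
    training rate alpha_i = alpha * g_i, as in (21).
    X rows end with a 1 (threshold term); y is in {+1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mis = y * (X @ w) <= 0              # currently misclassified samples
        if not mis.any():
            break                            # all samples correctly classified
        w = w + alpha * np.sum((g[mis] * y[mis])[:, None] * X[mis], axis=0)
    return w
```

Setting all $g_i$ equal recovers the standard perceptron update (18).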

Experiment.
Here we use the dataset in Table 1 to show the effect of including the training samples' weights in the perceptron classifier. Figure 7 shows the division of the feature space between the two classes with and without considering the training samples' weights in computing the linear classifier. The high cost of misclassifying the important samples of class $\omega_1$ in the weighted perceptron pushes the border toward class $\omega_2$.

Two linearly separable classes.
Assume $\omega_1$ and $\omega_2$ are two linearly separable classes, shown in Figure 8. SVM [12,13,14] maximizes the margin around the hyperplane separating the two classes. We know that the distance between a sample $\mathbf{x}_i$ and a hyperplane $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = 0$ is $|f(\mathbf{x}_i)|/\|\mathbf{w}\|$. Assume $\mathbf{x}_1$ is the sample of class $\omega_1$ nearest to the hyperplane $f(\mathbf{x})$ and $\mathbf{x}_2$ is the sample of class $\omega_2$ nearest to it; then $\mathbf{x}_1$ and $\mathbf{x}_2$ are called support vectors. To maximize the margin, the hyperplane $f(\mathbf{x})$ must intersect the line connecting $\mathbf{x}_1$ and $\mathbf{x}_2$ at its midpoint, as shown in Figure 8. Therefore, we can scale $\mathbf{w}$ and $w_0$ so that $f(\mathbf{x}_1) = 1$ and $f(\mathbf{x}_2) = -1$. This leads to a margin of:

$$\frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|} \quad (22)$$

Since $\mathbf{x}_1$ and $\mathbf{x}_2$ are the samples closest to the hyperplane $f(\mathbf{x})$, the distance of every other sample from the hyperplane is greater than $1/\|\mathbf{w}\|$, as shown in Figure 8. Therefore, we have:

$$\mathbf{w}^T \mathbf{x}_i + w_0 \ge 1 \;\text{ for } \mathbf{x}_i \in \omega_1, \qquad \mathbf{w}^T \mathbf{x}_i + w_0 \le -1 \;\text{ for } \mathbf{x}_i \in \omega_2 \quad (23)$$

We define:

$$y_i = \begin{cases} +1 & \mathbf{x}_i \in \omega_1 \\ -1 & \mathbf{x}_i \in \omega_2 \end{cases} \quad (24)$$

Substituting (24) in (23) results in:

$$y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \ge 1, \quad i = 1, 2, \ldots, N \quad (25)$$

We need to maximize the margin $2/\|\mathbf{w}\|$ in (22), which is equivalent to minimizing the norm $\|\mathbf{w}\|$. The mathematical formulation for finding $\mathbf{w}$ and $w_0$ of the hyperplane is:

$$\min_{\mathbf{w}, w_0} \frac{1}{2} \|\mathbf{w}\|^2 \quad (26)$$

$$\text{subject to } y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \ge 1, \quad i = 1, 2, \ldots, N \quad (27)$$

where $N$ is the number of training samples. The above cost function is convex, and the constraints are linear and define a convex set of feasible solutions. The corresponding Lagrangian function $L(\mathbf{w}, w_0, \boldsymbol{\lambda})$ for this convex programming problem is [15,16,17,18]:

$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \left[ y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) - 1 \right] \quad (28)$$

where $\lambda_i$, $i = 1, 2, \ldots, N$ are the Lagrangian multipliers associated with the constraints in (27). We find $\mathbf{w}$, $w_0$, and $\boldsymbol{\lambda}$ by solving the Lagrangian duality $\max_{\boldsymbol{\lambda} \ge 0} \min_{\mathbf{w}, w_0} L(\mathbf{w}, w_0, \boldsymbol{\lambda})$ [15,16,17,18]. The Karush-Kuhn-Tucker conditions that $\min_{\mathbf{w}, w_0} L(\mathbf{w}, w_0, \boldsymbol{\lambda})$ has to satisfy are [15,16,17,18]:

$$\mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i \quad (29)$$

$$\sum_{i=1}^{N} \lambda_i y_i = 0 \quad (30)$$

$$\lambda_i \ge 0, \quad i = 1, 2, \ldots, N \quad (31)$$

$$\lambda_i \left[ y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) - 1 \right] = 0, \quad i = 1, 2, \ldots, N \quad (32)$$

Equations (29) and (30) depend only on the training samples with $\lambda_i \ne 0$, referred to as support vectors. On the other hand, the conditions in (32) state that either $\lambda_i$ or $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1$ must be zero. Therefore the support vectors are the training samples with $|\mathbf{w}^T \mathbf{x}_i + w_0| = 1$ (i.e. $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 = 0$ and $\lambda_i \ne 0$), which means they lie on the boundary of the margin. Consequently, (29) and (30) depend only on the support vectors, and the hyperplane classifier is designed based only on the support vectors and is independent of the other training samples, whose $\lambda_i$ is zero. While none of the training samples falls inside the margin (by construction), this is not necessarily the case for irresponsive samples. The intuition is that maximizing the margin on the training samples leads to good separation of the irresponsive samples.
By expanding (28), we have:

$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \frac{1}{2} \|\mathbf{w}\|^2 - \mathbf{w}^T \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i - w_0 \sum_{i=1}^{N} \lambda_i y_i + \sum_{i=1}^{N} \lambda_i$$

By replacing $\sum_{i=1}^{N} \lambda_i y_i = 0$ from (30) in the above equation, we get:

$$L(\mathbf{w}, \boldsymbol{\lambda}) = \frac{1}{2} \|\mathbf{w}\|^2 - \mathbf{w}^T \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i + \sum_{i=1}^{N} \lambda_i$$

By substituting $\mathbf{w}$ from (29), we have:

$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad (33)$$

Now we maximize the above Lagrangian function with respect to $\boldsymbol{\lambda}$:

$$\max_{\boldsymbol{\lambda}} L(\boldsymbol{\lambda}) \quad (34)$$

$$\text{subject to } \lambda_i \ge 0, \quad i = 1, 2, \ldots, N, \quad \text{and} \quad \sum_{i=1}^{N} \lambda_i y_i = 0 \quad (35)$$

Once the optimal Lagrangian multipliers ($\lambda_i$) have been computed by maximizing (33), $\mathbf{w}$ is obtained by substituting them in (29), and $w_0$ is computed as an average of the values obtained from the complementary slackness conditions in (32) for the support vectors ($\lambda_i \ne 0$).
The weighted version of SVM needs to be more sensitive to training samples with larger weights. In other words, the distance from a training sample to the classifier hyperplane needs to be adjusted based on its weight. From a geometric point of view, we develop the weighted SVM by moving training samples toward the classifier hyperplane by a factor proportional to their weight ($g_i$). We measure the distance of a training sample $\mathbf{x}_i$ from the classifier hyperplane $f(\mathbf{x})$ through (36), where the actual distance is reduced by a factor of $1/(1+g_i)$:

$$d(\mathbf{x}_i) = \frac{|f(\mathbf{x}_i)|}{(1+g_i)\, \|\mathbf{w}\|} \quad (36)$$

If a training sample's weight is zero, its distance to the classifier hyperplane in (36) remains intact, and if its weight is very large, its distance becomes close to zero.
Assume $\mathbf{x}_1$ is the sample of class $\omega_1$ nearest to the classifier hyperplane according to the distance in (36), and $\mathbf{x}_2$ is the nearest sample of class $\omega_2$. We can scale $\mathbf{w}$ and $w_0$ so that $f(\mathbf{x}_1)/(1+g_1) = 1$ and $f(\mathbf{x}_2)/(1+g_2) = -1$. This leads to a margin of:

$$\frac{2}{\|\mathbf{w}\|} \quad (37)$$

Since $\mathbf{x}_1$ and $\mathbf{x}_2$ are the samples closest to the hyperplane, the scaled response of every other sample (based on (36)) is larger than 1 in magnitude. Therefore, we have:

$$\frac{\mathbf{w}^T \mathbf{x}_i + w_0}{1+g_i} \ge 1 \;\text{ for } \mathbf{x}_i \in \omega_1, \qquad \frac{\mathbf{w}^T \mathbf{x}_i + w_0}{1+g_i} \le -1 \;\text{ for } \mathbf{x}_i \in \omega_2 \quad (38)$$

We define:

$$y_i = \begin{cases} +1 & \mathbf{x}_i \in \omega_1 \\ -1 & \mathbf{x}_i \in \omega_2 \end{cases} \quad (39)$$

Substituting (39) in (38) results in:

$$\frac{y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right)}{1+g_i} \ge 1, \quad i = 1, 2, \ldots, N \quad (40)$$

We need to maximize the margin $2/\|\mathbf{w}\|$ in (37), which is equivalent to minimizing the norm $\|\mathbf{w}\|$. The mathematical formulation for finding $\mathbf{w}$ and $w_0$ of the hyperplane is:

$$\min_{\mathbf{w}, w_0} \frac{1}{2} \|\mathbf{w}\|^2 \quad (41)$$

$$\text{subject to } \frac{y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right)}{1+g_i} \ge 1, \quad i = 1, 2, \ldots, N \quad (42)$$

where $N$ is the number of training samples. The above cost function is convex, and the constraints are linear and define a convex set of feasible solutions. The corresponding Lagrangian function $L(\mathbf{w}, w_0, \boldsymbol{\lambda})$ for the above convex programming problem is:

$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \left[ \frac{y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right)}{1+g_i} - 1 \right] \quad (43)$$

where $\lambda_i$, $i = 1, 2, \ldots, N$ are the Lagrangian multipliers associated with the constraints in (42). We find $\mathbf{w}$, $w_0$, and $\boldsymbol{\lambda}$ by solving the Lagrangian duality $\max_{\boldsymbol{\lambda} \ge 0} \min_{\mathbf{w}, w_0} L(\mathbf{w}, w_0, \boldsymbol{\lambda})$ [15,16,17,18]. The Karush-Kuhn-Tucker conditions that $\min_{\mathbf{w}, w_0} L(\mathbf{w}, w_0, \boldsymbol{\lambda})$ has to satisfy are [15,16,17,18]:

$$\mathbf{w} = \sum_{i=1}^{N} \frac{\lambda_i y_i}{1+g_i}\, \mathbf{x}_i \quad (44)$$

$$\sum_{i=1}^{N} \frac{\lambda_i y_i}{1+g_i} = 0 \quad (45)$$

$$\lambda_i \ge 0, \quad i = 1, 2, \ldots, N \quad (46)$$

$$\lambda_i \left[ \frac{y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right)}{1+g_i} - 1 \right] = 0, \quad i = 1, 2, \ldots, N \quad (47)$$

The conditions in (47) state that either $\lambda_i$ or $y_i(\mathbf{w}^T \mathbf{x}_i + w_0)/(1+g_i) - 1$ must be zero. Therefore the support vectors are the training samples whose scaled margin equals one. It is now clear how the modified distance function in (36) affects the choice of support vectors: before, the support vectors were the samples geometrically closest to the hyperplane, but now a trade-off between the weight ($g_i$) and the geometric distance to the hyperplane determines whether a training sample becomes a support vector.
By expanding (43), replacing $\sum_{i=1}^{N} \frac{\lambda_i y_i}{1+g_i} = 0$ from (45), and substituting $\mathbf{w}$ from (44), we obtain:

$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\lambda_i \lambda_j y_i y_j}{(1+g_i)(1+g_j)}\, \mathbf{x}_i^T \mathbf{x}_j \quad (48)$$

Now we maximize the above Lagrangian function with respect to $\boldsymbol{\lambda}$:

$$\max_{\boldsymbol{\lambda}} L(\boldsymbol{\lambda}) \quad \text{subject to } \lambda_i \ge 0 \text{ and } \sum_{i=1}^{N} \frac{\lambda_i y_i}{1+g_i} = 0 \quad (49)$$

Once the optimal Lagrangian multipliers ($\lambda_i$) have been computed by maximizing (48), $\mathbf{w}$ is obtained by substituting them in (44), and $w_0$ is computed as an average of the values obtained from the complementary slackness conditions in (47) for the support vectors ($\lambda_i \ne 0$). An interesting observation is that the term $(1+g_i)$ appears everywhere in the computations as a denominator of $y_i$. It means the weighted SVM can be obtained by replacing $y_i$ with $y_i/(1+g_i)$ in the non-weighted SVM computations.
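For small problems the weighted primal (41)-(42) can be handed directly to a generic constrained solver; the following sketch uses scipy's SLSQP (our own formulation of the constraints; in practice a dedicated QP solver on the dual (48) would be the usual choice):

```python
import numpy as np
from scipy.optimize import minimize

def weighted_svm_separable(X, y, g):
    """Numerically solve the weighted primal (41)-(42):
    minimize 0.5*||w||^2 subject to y_i (w^T x_i + w_0)/(1 + g_i) >= 1.
    The optimization variable is v = [w, w_0]."""
    n, l = X.shape
    cost = lambda v: 0.5 * np.dot(v[:l], v[:l])
    cons = [{'type': 'ineq',
             'fun': (lambda v, i=i:
                     y[i] * (X[i] @ v[:l] + v[l]) / (1.0 + g[i]) - 1.0)}
            for i in range(n)]
    res = minimize(cost, np.zeros(l + 1), method='SLSQP', constraints=cons)
    return res.x[:l], res.x[l]
```

Setting all $g_i = 0$ recovers the standard hard-margin primal (26)-(27); larger $g_i$ tighten the effective constraint for that sample and push the hyperplane away from it.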

Two linearly nonseparable classes.
If the two classes are not linearly separable, which is usually the case in real-world problems (e.g. Figure 9), then it is not possible to find an empty band separating them. Each training sample satisfies one of the following constraints, as shown in Figure 9:
• it falls outside the band and is correctly classified, i.e. $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1$;
• it falls inside the band and is correctly classified, i.e. $0 \le y_i(\mathbf{w}^T \mathbf{x}_i + w_0) < 1$;
• it is misclassified, i.e. $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) < 0$.
We can summarize the three constraints in one by introducing the slack variable $\xi_i$ [12]: $\xi_i = 0$ if $\mathbf{x}_i$ is outside the band and correctly classified, $0 < \xi_i \le 1$ if $\mathbf{x}_i$ is inside the band and correctly classified, and $\xi_i > 1$ if $\mathbf{x}_i$ is misclassified. The optimization task is now to maximize the margin (minimize the norm) while minimizing the slack variables [12]. The mathematical formulation for finding $\mathbf{w}$ and $w_0$ of the hyperplane is:

$$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad (52)$$

$$\text{subject to } y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N \quad (53)$$

$$\xi_i \ge 0, \quad i = 1, 2, \ldots, N \quad (54)$$

The smoothing parameter $C$ is a positive user-defined constant that controls the trade-off between the two competing terms in the cost function. The two terms compete because minimizing the norm (i.e. maximizing the margin) increases the slack variables by increasing the number of training samples inside the band; conversely, decreasing the number of samples inside the band is equivalent to decreasing the margin. Therefore, by choosing a very large $C \to \infty$, the width of the margin vanishes, $2/\|\mathbf{w}\| \to 0$, because we then allow the norm to grow much faster than the slack variables ($\xi_i$). The corresponding Lagrangian function $L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ for the above convex programming problem is [15,16,17,18]:

$$L = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \lambda_i \left[ y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i \quad (55)$$

where $\lambda_i$, $i = 1, 2, \ldots, N$ are the Lagrangian multipliers associated with the constraints in (53) and $\mu_i$, $i = 1, 2, \ldots, N$ are the Lagrangian multipliers associated with the constraints in (54). We find $\mathbf{w}$, $w_0$, and $\boldsymbol{\lambda}$ by solving the Lagrangian duality $\max_{\boldsymbol{\lambda} \ge 0} \min_{\mathbf{w}, w_0, \boldsymbol{\xi}} L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ [15,16,17,18]. The Karush-Kuhn-Tucker conditions that $\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} L$ has to satisfy are [15,16,17,18]:

$$\mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i \quad (56)$$

$$\sum_{i=1}^{N} \lambda_i y_i = 0 \quad (57)$$

$$C - \mu_i - \lambda_i = 0, \quad i = 1, 2, \ldots, N \quad (58)$$

$$\mu_i \xi_i = 0, \quad i = 1, 2, \ldots, N \quad (59)$$

$$\lambda_i \ge 0, \quad \mu_i \ge 0, \quad i = 1, 2, \ldots, N \quad (60)$$

$$\lambda_i \left[ y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) - 1 + \xi_i \right] = 0, \quad i = 1, 2, \ldots, N \quad (61)$$

Equations (56) and (57) depend only on the training samples with $\lambda_i \ne 0$, referred to as support vectors. On the other hand, the conditions in (61) state that either $\lambda_i$ or $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 + \xi_i$ must be zero. Correctly classified training samples outside the margin are not support vectors, because for them $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) > 1$, so $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 + \xi_i$ cannot be zero given $\xi_i \ge 0$. It follows that the support vectors are the samples on the edge of the margin ($\xi_i = 0$), correctly classified inside the margin ($0 < \xi_i < 1$), or misclassified ($\xi_i \ge 1$), as shown in Figure 9. From (58) and (59) we see that $\lambda_i = C$ for support vectors falling inside the margin ($\xi_i > 0$) and $0 < \lambda_i < C$ for support vectors on the edge of the margin ($\xi_i = 0$). Therefore, (56) and (57) depend only on the support vectors, and consequently the hyperplane classifier is designed based only on the support vectors and is independent of the other training samples, whose $\lambda_i$ is zero.
By expanding (55), replacing $\sum_{i=1}^{N} \lambda_i y_i = 0$ from (57) and $C - \mu_i - \lambda_i = 0$ from (58), and substituting $\mathbf{w}$ from (56), we end up with:

$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad (62)$$

Now we maximize the above Lagrangian function with respect to $\boldsymbol{\lambda}$:

$$\max_{\boldsymbol{\lambda}} L(\boldsymbol{\lambda}) \quad \text{subject to } 0 \le \lambda_i \le C, \quad i = 1, 2, \ldots, N \quad (63)$$

$$\text{and} \quad \sum_{i=1}^{N} \lambda_i y_i = 0 \quad (64)$$

Once the optimal Lagrangian multipliers ($\lambda_i$) have been computed by maximizing (62), $\mathbf{w}$ is obtained by substituting them in (56), and $w_0$ is computed as an average of the values obtained from the complementary slackness conditions in (61) for the support vectors ($\lambda_i \ne 0$). However, $\xi_i$ is also unknown in (61). We know from (58) and (59) that $\xi_i$ is zero for training samples with $\lambda_i < C$. Therefore, if we use only the training samples with $0 < \lambda_i < C$ (support vectors on the edge of the margin) to find $w_0$ via (61), we can take $\xi_i = 0$.
In the linearly nonseparable case, the Lagrangian multipliers ($\lambda_i$) are bounded above by $C$; this is the only difference between the linearly separable and nonseparable cases. The slack variables $\xi_i$ and their associated Lagrangian multipliers $\mu_i$ are not directly involved in finding the classifier hyperplane, but their effect is felt indirectly through $C$ [4].
The weighted version of SVM needs to be more sensitive to training samples with larger weights ($g_i$). In other words, the distance from a training sample to the classifier hyperplane needs to be adjusted based on its weight. However, the modified distance function in the case of two nonseparable classes differs from the separable case. When the two classes are separable, we always move training samples toward the classifier hyperplane by a factor proportional to their weight, because every training sample lies on the correct side of the hyperplane. In the case of two nonseparable classes, a training sample might lie on the wrong side of the classifier hyperplane. Therefore, if a training sample lies on the correct side of the classifier hyperplane we move it toward the hyperplane, and otherwise away from it, by a factor proportional to its weight. This way we increase the sensitivity of the classifier to training samples with large weights. We introduce the modified distance function for the weighted SVM, in the case of two nonseparable classes, as:

$$d(\mathbf{x}_i) = \begin{cases} \dfrac{|f(\mathbf{x}_i)|}{(1+g_i)\, \|\mathbf{w}\|} & \text{if } \mathbf{x}_i \text{ is correctly classified} \\[1.5ex] \dfrac{(1+g_i)\, |f(\mathbf{x}_i)|}{\|\mathbf{w}\|} & \text{if } \mathbf{x}_i \text{ is misclassified} \end{cases} \quad (65)$$

We define:

$$z_i = (1+g_i)^{-\operatorname{sign}\left( y_i f(\mathbf{x}_i) \right)} \quad (66)$$

Using (66), we can combine the two distance functions in (65) into one:

$$d(\mathbf{x}_i) = \frac{z_i\, |f(\mathbf{x}_i)|}{\|\mathbf{w}\|} \quad (67)$$

By scaling $\mathbf{w}$ and $w_0$ and introducing the slack variable $\xi_i$, we can define the following constraint for the training samples:

$$z_i\, y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad (68)$$

The optimization task is now to maximize the margin (minimize the norm) while minimizing the slack variables ($\xi_i$). The mathematical formulation for finding $\mathbf{w}$ and $w_0$ of the hyperplane is:

$$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad (69)$$

$$\text{subject to } z_i\, y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N \quad (70)$$

$$\xi_i \ge 0, \quad i = 1, 2, \ldots, N \quad (71)$$

The corresponding Lagrangian function $L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ for the above convex programming problem is:

$$L = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \lambda_i \left[ z_i\, y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i \quad (72)$$

where $\lambda_i$, $i = 1, 2, \ldots, N$ are the Lagrangian multipliers associated with the constraints in (70) and $\mu_i$, $i = 1, 2, \ldots, N$ are those associated with the constraints in (71). We find $\mathbf{w}$, $w_0$, and $\boldsymbol{\lambda}$ by solving the Lagrangian duality $\max_{\boldsymbol{\lambda} \ge 0} \min_{\mathbf{w}, w_0, \boldsymbol{\xi}} L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ [15,16,17,18]. The Karush-Kuhn-Tucker conditions that $\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ has to satisfy are [15,16,17,18]:

$$\mathbf{w} = \sum_{i=1}^{N} \lambda_i z_i y_i \mathbf{x}_i \quad (73)$$

$$\sum_{i=1}^{N} \lambda_i z_i y_i = 0 \quad (74)$$

$$C - \mu_i - \lambda_i = 0, \quad i = 1, 2, \ldots, N \quad (75)$$

By expanding (72), replacing $C - \mu_i - \lambda_i = 0$ from (75) and $\sum_{i=1}^{N} \lambda_i z_i y_i = 0$ from (74), and substituting $\mathbf{w}$ from (73), we obtain the dual, which is the non-weighted dual (62) with $y_i$ replaced by $z_i y_i$:

$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j z_i z_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \qquad 0 \le \lambda_i \le C, \quad \sum_{i=1}^{N} \lambda_i z_i y_i = 0$$

Table 2. Training samples and their weights for SVM.
The weighted SVM classifier can thus be found as the non-weighted SVM classifier hyperplane based on the relocated training samples. Therefore, the following algorithm offers an alternative but geometrically equivalent approach with the same time complexity, and it can take advantage of existing software and libraries for non-weighted SVM to develop the weighted SVM.
• $\mathbf{w}$ and $w_0$ are initialized from a non-weighted SVM;
• Loop: repeat until convergence:
• relocate the training samples through (82) to obtain $X_t$;
• find $\mathbf{w}_t$ and $w_{0t}$ for the non-weighted SVM classifier hyperplane based on $X_t$;
where the subscript $t$ stands for the iterator inside the loop. In the first step of the loop, $X$ is the feature matrix (each row representing one training sample), $\mathbf{y}$ is a column vector containing the responses, $\mathbf{w}$ is a column vector normal to the classifier hyperplane, $w_0$ is the intercept of the classifier hyperplane, and $\mathbf{g}$ is a column vector containing the training samples' weights. A dot denotes array (element-wise) operations, as opposed to matrix operations, denoted with a cross. In (82), the first factor is the magnitude by which each training sample has to be moved and the second factor is the movement direction. The movement magnitude is proportional to the training sample's weight. The movement is in the direction of the classifier's normal vector ($\mathbf{w}$) for training samples in class $\omega_2$ ($y = -1$) and in the opposite direction of $\mathbf{w}$ for training samples in class $\omega_1$ ($y = 1$). In other words, we update the position of a training sample by moving it toward the classifier hyperplane if it is correctly classified, or the same amount away from the hyperplane if it is misclassified. Therefore, training samples with large weights that would not normally be selected as support vectors now have a higher chance of being selected as support vectors, if this shift drops them inside the margin.

Experiment.
Due to SVM's stability under changes to a small part of the training data, the dataset in Table 1 cannot differentiate between the non-weighted and weighted SVM; the two classifiers are identical for that dataset. Instead, we use the dataset in Table 2 to show the effect of embedding the training samples' weights in SVM. Figure 10 shows the division of the feature space between the two classes with and without considering the training samples' weights in computing the linear classifier. In the weighted SVM, the important samples of class $\omega_1$ move toward class $\omega_2$ through (82) and push the border toward class $\omega_2$.

Decision trees
Ordinary binary decision trees (OBDTs) split the feature space into hyperrectangles with sides parallel to the axes [19]. Nodes in an OBDT, shown in Figure 11, are binary questions whose answers are either yes or no, and the answers to these questions determine the path to a leaf, which is equivalent to a response (a nominal label in classification or a numerical estimate in regression). Questions at nodes are of the form "is $x_k \le \alpha$?", where $x_k$ is the $k$-th feature and $\alpha$ is a threshold. To predict the response of an irresponsive sample, one answers the question at each node and traverses to the left or right child based on the answer, until a leaf (response) is reached. The training process involves designing the questions, structuring the tree, and associating each leaf with a response. Each node splits the training dataset into two disjoint groups, each corresponding to one of the answers, yes or no. Many questions can be asked at a node, depending on which feature ($x_k$) is chosen and which threshold ($\alpha$) is used. The thresholds considered for a specific feature at a node are determined by the training samples at that node: if there are $N$ samples at a node, there can be $N-1$ candidate thresholds, each taken halfway between consecutive distinct values of $x_k$ among the training samples at that node. Therefore, if there are $l$ features and $N$ training samples at a node, $(N-1) \times l$ different questions can be asked. The best question to ask at a node is the one that maximizes the impurity decrease ($\Delta I$), calculated through (83) [19]:

$$\Delta I = I - \frac{N_Y}{N} I_Y - \frac{N_N}{N} I_N \quad (83)$$

where $I$ is the impurity of the ancestor node, $N$ is the number of training samples in the ancestor node, $N_Y$ is the number of training samples in the descendant node corresponding to the answer yes, $N_N$ is the number of training samples in the descendant node corresponding to the answer no, and $I_Y$ and $I_N$ are the impurities of the descendant nodes. The entropy of the training samples at a node, in (84), is a common definition of node impurity in classification tasks ($I_{classification}$) [19]:

$$I_{classification} = -\sum_{i=1}^{M} \frac{N(\omega_i)}{N} \log_2 \frac{N(\omega_i)}{N} \quad (84)$$

where $N$ is the number of training samples at the node, $M$ is the number of classes, and $N(\omega_i)$ is the number of training samples of class $\omega_i$ at the node. Therefore, in classification, the impurity of a node is proportional to the heterogeneity of the classes among the training samples at that node. The largest impurity ($\log_2 M$) occurs when the training samples are equally distributed among the classes, and the smallest impurity (0) occurs when all training samples belong to the same class.
The impurity of a node in regression tasks ($I_{regression}$) is commonly calculated as the variance:

$$I_{regression} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2 \quad (85)$$

where $y_i$ is the response of the $i$-th training sample at the node and $\bar{y}$ is the average of the responses at the node.
A node is considered a leaf if the maximum impurity decrease ($\Delta I_{max}$) for that node is less than a user-defined threshold or if it contains only a few training samples, although alternative conditions have been used in the literature [19,20]. The majority rule in classification, or the average rule in regression, is commonly used to determine the response at a leaf [19].
In the weighted version of OBDT, the impurity decrease ($\Delta I$) and the impurities ($I$) are calculated through the following equations:

$$\Delta I = I - \frac{\sum g_Y}{\sum g} I_Y - \frac{\sum g_N}{\sum g} I_N \quad (86)$$

$$I_{classification} = -\sum_{i=1}^{M} \frac{g(\omega_i)}{\sum g} \log_2 \frac{g(\omega_i)}{\sum g} \quad (87)$$

$$I_{regression} = \frac{\sum_i g_i \left( y_i - \bar{y} \right)^2}{\sum_i g_i} \quad (88)$$

where $\sum g_Y$ and $\sum g_N$ are the sums of the weights of the training samples corresponding to the answers yes and no, respectively, $\sum g$ is the sum of the weights of all training samples at the ancestor node, $g(\omega_i)$ is the sum of the weights of the training samples belonging to class $\omega_i$, and $g_i$ is the $i$-th training sample's weight.
A node is considered a leaf if the maximum impurity decrease ($\Delta I_{max}$) for that node is less than a user-defined threshold or if the total weight of the training samples inside it is too small. In classification, the class with the largest total weight ($\arg\max_{\omega_j} \sum_{i \in \omega_j} g_i$) is associated with the leaf. In regression, the weighted average of the responses ($\sum_{i \in leaf} g_i y_i / \sum_{i \in leaf} g_i$) is associated with the leaf. In the weighted decision tree, samples with larger weights play a more important role in deciding which question to ask at a node (by playing a more significant role in calculating the impurity and the impurity decrease), when to stop splitting, and which response to associate with a leaf.
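The weighted split criterion (86)-(87) for a single feature can be sketched as follows (function names are ours; a full tree would apply this to every feature at every node):

```python
import numpy as np

def weighted_entropy(labels, g, classes):
    """Weighted node impurity (87): entropy computed from class weight
    masses g(omega_i) instead of class counts."""
    total = g.sum()
    ent = 0.0
    for c in classes:
        p = g[labels == c].sum() / total
        if p > 0:
            ent -= p * np.log2(p)
    return ent

def best_weighted_split(x, labels, g):
    """Scan the midpoints between consecutive distinct values of a single
    feature x and return the threshold maximizing the weighted impurity
    decrease (86)."""
    classes = np.unique(labels)
    parent = weighted_entropy(labels, g, classes)
    vals = np.unique(x)
    best_t, best_gain = None, -np.inf
    for a, b in zip(vals[:-1], vals[1:]):
        t = 0.5 * (a + b)                 # candidate threshold, halfway
        yes = x <= t
        p_yes = g[yes].sum() / g.sum()
        p_no = g[~yes].sum() / g.sum()
        gain = parent \
            - p_yes * weighted_entropy(labels[yes], g[yes], classes) \
            - p_no * weighted_entropy(labels[~yes], g[~yes], classes)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

With equal weights this reduces to the count-based criterion (83)-(84); heavy samples shift both the entropies and the yes/no masses, and hence the chosen threshold.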

Experiment.
Here we use the dataset in Table 1 to show the effect of embedding the training samples' weights in the decision tree. Figure 12 shows the division of the feature space between the two classes with and without considering the training samples' weights in developing the decision tree. The important samples of class $\omega_1$ change how the weighted decision tree divides the feature space between the two classes in comparison with the non-weighted decision tree.

Multilayer perceptron (MLP)
In the backpropagation algorithm [21,22,23], the architecture of the network is fixed and its synaptic weights are computed so as to minimize a cost function defined as:

J = ∑_{i=1}^{N} ϵ(i) (89)

where N is the number of training samples and ϵ(i) is a function of the network's output (ŷ(i)) and the desired output (y(i)) for the i-th training sample. A common choice for ϵ(i) is the sum of squared errors in the output nodes [24,25,22,23]:

ϵ(i) = (1/2) ∑_{j=1}^{k_L} (ŷ^L_j(i) − y^L_j(i))² (90)

where L refers to the output layer, k_L is the number of nodes in the output layer, ŷ^L_j(i) is the output of the j-th node in the output layer, and y^L_j(i) is its corresponding desired value. We also have the following equations for calculating the output of the j-th node at the r-th layer for the i-th training sample (ŷ^r_j(i)):

ŷ^r_j(i) = f^r_j(ν^r_j(i)) (91)

ν^r_j(i) = ∑_{k=1}^{k_{r−1}} w^r_{jk} ŷ^{r−1}_k(i) (92)

where f^r_j is the activation function at the j-th node of the r-th layer, k_{r−1} is the number of nodes at the (r−1)-th layer, ŷ^{r−1}_k(i) is the output of the k-th node in the (r−1)-th layer, and w^r_{jk} is the synaptic weight from the k-th node at the (r−1)-th layer to the j-th node at the r-th layer.
We can iteratively find the synaptic weight vectors that minimize the cost function using the gradient descent scheme [21,22,23]. In each iteration, the weight vector (including the threshold) of the j-th node in the r-th layer (w^r_j) is modified through (93):

w^r_j(new) = w^r_j(old) + ∆w^r_j (93)

The modification term in (93) (∆w^r_j) is computed through (94) according to the gradient descent scheme, µ being the learning rate:

∆w^r_j = −µ ∂J/∂w^r_j (94)

By substituting the cost function from (89) in (94) and applying the chain rule in differentiation, we obtain:

∆w^r_j = −µ ∑_{i=1}^{N} (∂ϵ(i)/∂ν^r_j(i)) (∂ν^r_j(i)/∂w^r_j) (95)

By defining δ^r_j(i) = ∂ϵ(i)/∂ν^r_j(i) in the above equation, we obtain:

∆w^r_j = −µ ∑_{i=1}^{N} δ^r_j(i) ∂ν^r_j(i)/∂w^r_j (96)

We can calculate ∂ν^r_j(i)/∂w^r_j using (92) as follows:

∂ν^r_j(i)/∂w^r_j = ŷ^{r−1}(i) (97)

where k_{r−1} is the number of nodes in the (r−1)-th layer and ŷ^{r−1}(i) is the output vector of the (r−1)-th layer for the i-th training sample. By substituting (97) in (96) we obtain:

∆w^r_j = −µ ∑_{i=1}^{N} δ^r_j(i) ŷ^{r−1}(i) (98)

The above equation gives the correction term for the batch mode [26]. In the online or pattern mode, instead of summing the corrections over all training samples and updating the weights at once, the weights are updated once for each individual training sample before moving on to the next [26]. In the stochastic mode, the gradient at each iteration is calculated based on a random subset of training samples [27]. Now we have to compute δ^r_j(i) based on the definition of the cost function given in (90). First we calculate this term for the output layer (r = L):

δ^L_j(i) = ∂ϵ(i)/∂ν^L_j(i)

By substituting (90) and (91) in the above equation we get:

δ^L_j(i) = ∂/∂ν^L_j(i) [ (1/2) ∑_{m=1}^{k_L} (f^L_m(ν^L_m(i)) − y^L_m(i))² ]

By keeping only the terms that are dependent on ν^L_j(i) we get:

δ^L_j(i) = (ŷ^L_j(i) − y^L_j(i)) f′^L_j(ν^L_j(i))

where ŷ^L_j(i) is the output of the j-th node in the output layer for the i-th training sample, y^L_j(i) is its corresponding desired value, and f^L_j is the activation function of the j-th node in the output layer, which takes ν^L_j(i) as input. Now we compute δ^r_j(i) for the hidden layers (r < L):

δ^r_j(i) = ∂ϵ(i)/∂ν^r_j(i) = ∑_{m=1}^{k_{r+1}} δ^{r+1}_m(i) ∂ν^{r+1}_m(i)/∂ν^r_j(i)

We use (92) to calculate:

ν^{r+1}_m(i) = ∑_{k=1}^{k_r} w^{r+1}_{mk} ŷ^r_k(i)

Replacing ŷ^r_k(i) with f^r_k(ν^r_k(i)) based on (91), we get:

ν^{r+1}_m(i) = ∑_{k=1}^{k_r} w^{r+1}_{mk} f^r_k(ν^r_k(i))

By keeping only the terms that are dependent on ν^r_j(i) we get:

δ^r_j(i) = f′^r_j(ν^r_j(i)) ∑_{m=1}^{k_{r+1}} δ^{r+1}_m(i) w^{r+1}_{mj}
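The batch-mode correction above can be made concrete with a small NumPy sketch of backpropagation for a single hidden layer with logistic activations. This is our own toy implementation (all names, initializations, and defaults are ours, not the paper's); the per-sample factor g_i shows where training-sample weights would scale each sample's correction, with uniform g_i reducing to the standard non-weighted update.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)

def train_mlp(X, y, hidden=2, lr=1.0, iters=2000, sample_weights=None):
    """Batch-mode backpropagation, one hidden layer, logistic activations.
    Each sample's correction is scaled by its weight g_i (uniform by default)."""
    n, l = X.shape
    g = np.ones(n) if sample_weights is None else np.asarray(sample_weights, float)
    W1 = rng.normal(scale=0.5, size=(hidden, l + 1))  # hidden synaptic weights + threshold
    W2 = rng.normal(scale=0.5, size=(1, hidden + 1))  # output synaptic weights + threshold
    Xb = np.hstack([X, np.ones((n, 1))])              # constant input for the threshold
    for _ in range(iters):
        h = logistic(Xb @ W1.T)                       # hidden-layer outputs
        hb = np.hstack([h, np.ones((n, 1))])
        out = logistic(hb @ W2.T).ravel()             # network output y_hat
        d2 = (out - y) * out * (1.0 - out)            # delta at the output node
        d1 = (d2[:, None] * W2[:, :hidden]) * h * (1.0 - h)  # deltas backpropagated to hidden layer
        W2 -= lr * (g * d2) @ hb / n                  # weighted batch correction, output layer
        W1 -= lr * (g[:, None] * d1).T @ Xb / n       # weighted batch correction, hidden layer
    return W1, W2

def predict(W1, W2, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    hb = np.hstack([logistic(Xb @ W1.T), np.ones((len(X), 1))])
    return (logistic(hb @ W2.T).ravel() > 0.5).astype(int)
```

The delta terms d2 and d1 follow the output-layer and hidden-layer expressions derived above, with f′(ν) = f(ν)(1 − f(ν)) for the logistic activation.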

Nonlinear SVM
In nonlinear SVM [28], training samples are nonlinearly mapped from their original l-dimensional space (where they cannot be linearly separated) into a k-dimensional space (k ≫ l) where they are more likely to be linearly separable [29,4]. However, there is no guarantee that the training samples will be linearly separable in the new k-dimensional space. Therefore, linear SVM with slack variables is used to find the hyperplane separating the two classes in the k-dimensional space. Although the classifier is a hyperplane in the k-dimensional space, it is a hypersurface in the l-dimensional space due to the nonlinear mapping, hence the name nonlinear SVM. The next step is to find the dimensionality of the new space (k) and the mapping function. We use equations (113), (118), and (119), obtained in Section 3.3.2, to find the SVM classifier hyperplane f(x) = w^T x + w_0 in the k-dimensional space, where x̃_i is the i-th feature vector (x_i) mapped into the k-dimensional space.
An elegant property of the SVM helps to implicitly map the training samples into the k-dimensional space without knowing the mapping function or k. Notice that the training samples enter into (113), (118), and (119) in pairs, in the form of inner products (x̃_i^T x̃_j) in the k-dimensional space. Therefore, for finding w and w_0 of the hyperplane in the k-dimensional space, and even for classifying a new sample using (118), only the inner product of pairs of feature vectors in the k-dimensional space is required; knowing the mapping function and the dimensionality of the new space (k) is not necessary. We can use the kernel trick to find the inner product of two feature vectors in the k-dimensional space without actually mapping them from the l-dimensional space into the k-dimensional space. According to Mercer's theorem, for any kernel K satisfying certain conditions, there exists a space in which K(x_i, x_j) = x̃_i^T x̃_j [30,31,32]. Equations (120), (121), and (122) are examples of kernel functions [31], called polynomial, radial basis function, and hyperbolic tangent, respectively, where σ is the kernel's bandwidth.
K(x_i, x_j) = tanh(βx_i^T x_j + γ) , for appropriate values of β and γ, e.g. β = 2 and γ = 1 (122)

Therefore, to convert the linear SVM to nonlinear SVM we just need to replace the inner product of the mapped feature vectors (x̃_i^T x̃_j) with a kernel function of the original feature vectors, K(x_i, x_j). Although f(x) is linear in the k-dimensional space, it is nonlinear in the l-dimensional space due to the nonlinearity of the kernel function.
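The three kernels named above can be sketched as plain functions. The hyperbolic tangent kernel follows (122); since the paper's equations (120) and (121) are not reproduced here, the polynomial and RBF forms below use common conventions (the +1 offset, degree d, and the 2σ² scaling are our assumptions), and all parameter defaults are ours.

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    # assumed polynomial form: (x_i' x_j + 1)^d
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    # assumed RBF form with bandwidth sigma: exp(-||x_i - x_j||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def tanh_kernel(xi, xj, beta=2.0, gamma=1.0):
    # hyperbolic tangent kernel as in (122)
    return np.tanh(beta * (xi @ xj) + gamma)

def gram_matrix(X, kernel):
    # K[i, j] = K(x_i, x_j): every pairwise "inner product" in the
    # high-dimensional space, computed without ever mapping the samples there.
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```

The Gram matrix is all the kernel trick requires: replacing x̃_i^T x̃_j by K(x_i, x_j) in the dual formulation turns the linear SVM into its nonlinear counterpart.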
Here we explain why we cannot develop the weighted version of nonlinear SVM. We use the equations obtained in Section 3.3.2 to find the weighted SVM classifier hyperplane f(x) = w^T x + w_0 in the k-dimensional space, where x̃_i is the i-th feature vector (x_i) mapped into the k-dimensional space.
In real-world applications, a weight may accompany each training sample, reflecting its reliability or accuracy. Due to the lack of such weights in this dataset (and, to our knowledge, in other machine learning datasets), we apply the following procedure to produce artificial weights for the training samples. A function runs through each training sample, switches its output class to the other class with a random probability 0 < k < 1, and assigns a weight of 1 − k to that training sample. This way, a training sample's weight shows how reliable that training sample is, and no changes are imposed on the test samples. Weighted machine learning techniques take advantage of these weights to take the training samples' reliability into account, while non-weighted machine learning techniques ignore them. Leave-one-out cross-validation is used to estimate the accuracy of the weighted and non-weighted machine learning techniques, reported in Table 3, where hyperparameters are optimized using cross-validation. The higher accuracy of the weighted machine learning techniques (see Table 3) comes as no surprise, since they take advantage of the weights while the non-weighted techniques do not. Nevertheless, it demonstrates the proposed weighted machine learning techniques' efficiency in appropriately taking the training samples' weights into account, whenever such weights are available, in order to improve the prediction accuracy. As mentioned earlier, the weighted SVM classifier shows the least difference from its non-weighted counterpart, underlining its relative reluctance to react to weights in comparison with the other weighted predictors. The weighted least squares and weighted perceptron achieve the highest accuracy, shedding light on the linear separability of the two classes in this specific application. On the other hand, the weighted Bayesian classifier makes more dramatic changes in the border, resulting in a lower accuracy than the other weighted nonlinear classifiers (decision tree and MLP), for the same reason, i.e. the linear separability of the two classes in this specific application.
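The weight-generation procedure described above can be sketched as follows. The function and variable names are ours; labels are assumed binary, and each sample's label is flipped with its own random probability k, after which the sample receives weight 1 − k as its reliability.

```python
import random

def add_artificial_weights(samples, seed=None):
    """samples: iterable of (features, label) pairs with labels in {0, 1}.
    Flips each label with a random probability k and assigns weight 1 - k,
    so the weight reflects how reliable the (possibly corrupted) label is."""
    rng = random.Random(seed)
    weighted = []
    for features, label in samples:
        k = rng.random()              # per-sample flip probability, 0 <= k < 1
        if rng.random() < k:          # switch the output class with probability k
            label = 1 - label
        weighted.append((features, label, 1.0 - k))
    return weighted
```

Test samples are left untouched, so any accuracy gain of a weighted technique over its non-weighted counterpart comes purely from exploiting the reliability weights.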

Conclusions
The weighted machine learning techniques developed in this paper provide developers with the opportunity to give different weights to training samples. These weights are used to adjust the classifier/regressor in favor of the more important samples. It is worth noting that the weighted linear and nonlinear classifiers change the division of the feature space only around the border (the most uncertain area), and areas far from the border are unlikely to change their label. The weighted SVM classifier showed the least difference from its non-weighted counterpart when the training samples' weights do not vary much. The reason is that the SVM classifier is designed based only on the support vectors, not all training samples. If the weights of the training samples are not very different, their relocation based on their weights might not be large enough to change the selection of support vectors. In other words, the weighted SVM classifier differs from its non-weighted version only if relocating the training samples according to their weights rearranges them enough to change the selection of support vectors; the weighted SVM thus has the highest stability with respect to weight changes in a small subset of training samples. The weighted MLP also takes the training samples' weights into account, through small adjustments of its nonlinear border. However, the MLP's behavior is highly dependent on the network's size: a larger number of hidden nodes results in a more significant difference between the weighted and non-weighted MLP. On the other hand, the weighted decision tree and weighted Bayesian classifiers showed the most dramatic changes in how the feature space is divided between the two classes in comparison with their non-weighted counterparts. The reason is that these two models are highly local, especially the non-parametric Bayesian classifier. Therefore, even small changes in the training samples' weights result in a different classification of the feature space. The weighted LS and perceptron showed slight and similar changes in how they divide the feature space between the two classes in comparison with their non-weighted counterparts. The similar behavior arises because both models minimize the sum of squared errors, although in different ways; the slightness of the difference originates from their linear nature and, consequently, their rather restricted flexibility in modifying their shape. How much the weighted machine learning techniques developed in this work will improve the prediction accuracy in different real-world applications remains to be seen. Our next step toward underscoring the significance of weighted machine learning models is to apply them to spatial-temporal or environmental data, where a training sample's weight reflects the spatial-temporal autocorrelation between that training sample and the unobserved (test) sample.

Figure 3 .
Figure 3. Training samples from two classes, circles and squares, shaded based on their weights.

Figure 4 .
Figure 4. Two classes shown with circles and squares where the darkness of samples shows their weight.

Figure 5 .
Figure 5. Division of the feature space between the two classes, circles and squares, without (left) and with (right) considering the training samples' weights (darkness of samples) in Bayesian classifier.

Figure 6
shows the division of the feature space between the two classes with and without considering the training samples' weights in computing the linear classifier. Circles represent class ω_1 and squares represent class ω_2. Darkness of training samples shows their weight. In the weighted LS classifier, training samples with large weights from class ω_1 push the border toward class ω_2.

Figure 6 .
Figure 6. Division of the feature space between the two classes, circles and squares, without (solid line) and with (dashed line) considering the training samples' weights (darkness of samples) in LS classifier.

Figure 7 .
Figure 7. Division of the feature space between the two classes, circles and squares, without (solid line) and with (dashed line) considering the training samples' weights (darkness of samples) in perceptron classifier (logistic activation function and adaptive training rate with 1000 iterations).

Figure 8 .
Figure 8. SVM classifier for two linearly separable classes; black points show support vectors.

Figure 9 .
Figure 9. SVM classifier for two linearly nonseparable classes; black points show support vectors.

Figure 10 .
Figure 10. Division of the feature space between the two classes, circles and squares, without (solid line) and with (dashed line) considering the training samples' weights (darkness of samples) in SVM classifier (C = 1).

Figure 11 .
Figure 11. Ordinary binary decision trees; Q stands for question and R stands for response.

Figure 12 .
Figure 12. Division of the feature space between the two classes, circles and squares, without (solid line) and with (dashed line) considering the training samples' weights (darkness of samples) in decision tree classifier (the minimum impurity decrease for splitting a node is set to 0.1).

4.2.1.
Experiment. Here we use the dataset in Table 1 to show the effect of embedding the training samples' weights in the cost function (107) and the backpropagation algorithm (111). The MLP is designed with one hidden layer including two nodes. Including more hidden nodes would result in all training samples being correctly classified by both the non-weighted and weighted MLP, a zero classification cost for both the non-weighted (89) and weighted (107) cost functions, and consequently similar classifiers. With two hidden nodes, some training samples cannot be correctly classified, so we can see the difference between the non-weighted and weighted MLP classifiers. Figure 13 shows the division of the feature space between the two classes with and without considering the training samples' weights in the cost function and the backpropagation algorithm. Although both the weighted and non-weighted MLP misclassify the same four training samples, the weighted MLP classifier provides a better fit (a lower error) for the two more important samples from class ω_1.

Figure 13 .
Figure 13. Division of the feature space between the two classes, circles and squares, without (solid line) and with (dashed line) considering the training samples' weights (darkness of samples) in MLP classifier (logistic activation function and adaptive training rate with 2000 iterations).

Table 3.
Table 3. Accuracy of different classification techniques for breast cancer prediction [35].
Cost function: sum of squared errors.
Maximum number of iterations: 1000.
The weights are not updated after iterations that result in an increase in the total cost.
Adaptive training rate: multiply all learning rates by 1.1 or 0.8 after each step, based on whether the total cost decreases or increases.
Adaptive learning rate: multiply the learning rate for a parameter by 1.2 if the partial derivative of the loss with respect to that parameter keeps the same sign in successive steps, and by 0.7 otherwise [35].