Hidden Markov Models Training Using Hybrid Baum Welch-Variable Neighborhood Search Algorithm

Hidden Markov Models (HMMs) are used in a wide range of artificial intelligence applications, including speech recognition, computer vision, computational biology, and finance. Estimating the parameters of an HMM is often addressed via the Baum-Welch algorithm (BWA), but this algorithm tends to converge to a local optimum of the model parameters. Optimizing HMM parameters therefore remains a crucial and challenging task. In this paper, a Variable Neighborhood Search (VNS) combined with the Baum-Welch algorithm (VNS-BWA) is proposed. The idea is to use VNS to escape from local optima, enable greater exploration of the search space, and enhance the learning capability of HMMs. The proposed algorithm takes full advantage of combining the search mechanism of the VNS algorithm, which trains without gradient information, with the BWA, which exploits this kind of knowledge. The performance of the proposed method is validated on a real dataset. The results show that VNS-BWA performs better at finding the optimal parameters of HMM models, enhancing their learning capability and classification performance.


Introduction
Hidden Markov models are statistical models that have enjoyed great success and are widely used in a vast range of application fields such as computational biology, speech processing, pattern recognition, and finance [1,2,3,4,5,6,7], among other disciplines.
Several authors have attempted to estimate HMM parameters [9,10,11,12,13], but the dominant approach to the HMM parameter estimation problem is the Baum-Welch algorithm [8]. Despite its widespread use in practice, the Baum-Welch algorithm can get trapped in local optima of the model parameters. There is thus a need for an algorithm that can escape from a local optimum and probe the solution space to reach the global optimum of the model parameters.
Variable neighborhood search [15] is among the class of metaheuristics that have provided optimal solutions in many different problem domains, by considering changes of neighborhood both in the descent phase (to find a local optimum) and in the perturbation phase (to get out of the corresponding basin of attraction). VNS has been successfully applied to a wide variety of optimization problems such as the clustered vehicle routing problem, the maximum min-sum dispersion problem, and the financial derivative problem [16,17,18,19], to cite just a few recent works.

In this context, this paper provides the design, implementation, and experimental evaluation of a hybridization of VNS and BWA to approximately estimate optimal model parameters for hidden Markov models, and assesses its performance on a real financial dataset.
The remainder of this paper is organized as follows. In Section 2, we give some basic descriptions of hidden Markov model notation, followed by a description of the BWA, VNS, and hybrid VNS-BWA algorithms for training HMMs. In Section 3 we describe the data used in this paper and explain the experimental setup and results. Finally, we conclude the paper.

Methodology and Notations
In this section, we first present the mathematical framework of discrete HMMs, followed by the Baum-Welch and VNS algorithms. We then describe the proposed process for estimating HMM model parameters.

Elements of a Hidden Markov Models
An HMM is characterized by the following elements:
• N, the number of states in the model, and T, the length of the observation sequence. Let S = (S_0, S_1, ..., S_{T-1}) be the state sequence.
• M, the number of observation symbols, and O = (O_0, O_1, ..., O_{T-1}) the observation sequence. In this paper we consider the values of O_t and S_t to be discrete.
• A = {a_ij}, the state transition probability matrix, where A ∈ R^{N×N} and a_ij = P(S_{t+1} = j | S_t = i).
• B = {b_j(k)}, the observation (emission) probability matrix, where B ∈ R^{N×M} and b_j(k) = P(O_t = k | S_t = j).
• π = {π_i}, the initial probability vector, where π ∈ R^N and π_i = P(S_0 = i).
We denote an HMM as a triplet λ = (π, A, B).
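The triplet λ = (π, A, B) above can be sketched directly as stochastic arrays. The following is a minimal illustration with hypothetical values (N = 2 states, M = 3 symbols), not taken from the paper:

```python
import numpy as np

# Hypothetical discrete HMM with N = 2 hidden states and M = 3 observation
# symbols; all probability values below are illustrative.
N, M = 2, 3

pi = np.array([0.6, 0.4])                 # pi_i = P(S_0 = i)
A = np.array([[0.7, 0.3],                 # a_ij = P(S_{t+1} = j | S_t = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # b_j(k) = P(O_t = k | S_t = j)
              [0.1, 0.3, 0.6]])

# pi and every row of A and B must sum to one (stochasticity constraints).
assert np.allclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)

lam = (pi, A, B)                          # the triplet lambda = (pi, A, B)
```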

Baum-Welch learning
A. Training an HMM with a single observed variable
In the following, we summarize the Baum-Welch algorithm (BWA) for estimating the parameters λ of an HMM that generates a sequence of observations O = (O_0, ..., O_{T-1}). We first define the following probabilities [8]:
α_t(i) = P(O_0, ..., O_t, S_t = i | λ), the forward probability;
β_t(i) = P(O_{t+1}, ..., O_{T-1} | S_t = i, λ), the backward probability;
γ_t(i) = P(S_t = i | O, λ);
ζ_t(i, j) = P(S_t = i, S_{t+1} = j | O, λ).
The parameter re-estimation formulas are given by the following expressions:
π_i = γ_0(i),
a_ij = Σ_{t=0}^{T-2} ζ_t(i, j) / Σ_{t=0}^{T-2} γ_t(i),
b_j(k) = Σ_{t : O_t = k} γ_t(j) / Σ_{t=0}^{T-1} γ_t(j).
The parameter λ is estimated iteratively by the Baum-Welch procedure as follows:
1. Initialize the model with λ = λ_0.
2. Compute α_t(i), β_t(i), ζ_t(i, j) and γ_t(i).
3. Update λ.
4. Repeat steps 2 and 3 until P(O|λ) no longer increases.
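A single Baum-Welch re-estimation step for the discrete single-variable case can be sketched as follows. This is a minimal unscaled implementation (a practical one would scale α and β to avoid underflow on long sequences); the toy model values at the end are hypothetical:

```python
import numpy as np

def forward(pi, A, B, O):
    # alpha_t(i) = P(O_0..O_t, S_t = i | lambda), unscaled for brevity
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def backward(A, B, O):
    # beta_t(i) = P(O_{t+1}..O_{T-1} | S_t = i, lambda)
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, O):
    # One re-estimation step of the BWA; returns updated (pi, A, B) and P(O|lambda).
    T, N = len(O), len(pi)
    alpha, beta = forward(pi, A, B, O), backward(A, B, O)
    likelihood = alpha[-1].sum()                  # P(O | lambda)
    gamma = alpha * beta / likelihood             # gamma_t(i)
    zeta = np.zeros((T - 1, N, N))                # zeta_t(i, j)
    for t in range(T - 1):
        zeta[t] = (alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]) / likelihood
    new_pi = gamma[0]
    new_A = zeta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(O) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, likelihood

# Illustrative run on a hypothetical model and observation sequence.
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.7, 0.3], [0.4, 0.6]])
B0 = np.array([[0.6, 0.4], [0.2, 0.8]])
O = [0, 1, 1, 0, 1]
pi1, A1, B1, L1 = baum_welch_step(pi0, A0, B0, O)
_, _, _, L2 = baum_welch_step(pi1, A1, B1, O)   # likelihood is non-decreasing
```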
In this paper the initial structure λ_0 is computed from empirical counts on the training data. The HMM can be extended to support multiple observed variables with one common hidden sequence [14], assuming independence between the M observed variables. The triplet λ is then defined as λ = (π, A, B_{0:M-1}), where the state transition probability matrix A and the initial probability vector π have the same structure as defined in Section 2.1. The only difference is that there are multiple emission matrices, the m-th emission probability matrix describing the m-th observed variable. Parameter learning can be performed by means of the Baum-Welch algorithm extended to multiple observed variables, based on the probabilities α_t(i), β_t(i), γ_t(i) and ζ_t(i, j) computed analogously, and λ is updated with the corresponding re-estimation formulas.

Variable Neighborhood Search Algorithm
The basic idea of the VNS algorithm [15] is to explore a set of predefined neighborhoods to provide a better solution. The algorithm begins with a descent method to a local minimum, then explores increasingly distant neighbourhoods of this solution. Each time, one or several points within the current neighbourhood are used as initial solutions for a local descent. The method jumps from a solution to a new one if and only if a better solution has been found. More formally, a finite set of neighborhood structures N_k (k = 1, ..., k_max) is first defined, with N_k(x) the set of solutions in the k-th neighborhood of x. Next, an initial solution x is randomly generated or constructed. At each iteration, the neighborhood index k is initialized to 1. The VNS procedure is composed of three steps: shaking, local search and move. In the shaking step, a solution x' in the k-th neighborhood of the incumbent solution x is randomly generated. Then, a local search procedure is applied to the shaking's output x'. The local search's output is denoted x''. If x'' is better than x, x'' replaces x and the search continues with k = 1. Otherwise, k is incremented and a new shaking step starts using the (k+1)-th neighborhood structure. The procedure continues until a termination criterion is met. The most common criteria are a maximum allowed computational time, a maximum number of iterations, or a maximum number of iterations between two consecutive improvements. The pseudocode of a VNS algorithm is given in Algorithm 1.
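The shaking/local-search/move scheme described above can be sketched generically as follows. The `shake` and `local_search` functions and the toy objective below are hypothetical illustrations (maximizing -(v - 3)^2 over the reals), not the problem-specific operators used later in the paper:

```python
import random

def vns(x0, fitness, shake, local_search, k_max, max_iter):
    # Basic VNS: shaking, local search, and move-or-not, with sequential
    # neighborhood change (k is reset to 1 after every improving move).
    x, best = x0, fitness(x0)
    for _ in range(max_iter):          # stopping criterion: iteration budget
        k = 1
        while k <= k_max:
            x1 = shake(x, k)           # random point in the k-th neighborhood of x
            x2 = local_search(x1, fitness)
            if fitness(x2) > best:     # move: accept and restart from N_1
                x, best, k = x2, fitness(x2), 1
            else:                      # no improvement: try a larger neighborhood
                k += 1
    return x, best

# Hypothetical one-dimensional illustration.
def shake(v, k):
    return v + random.uniform(-k, k)   # neighborhood size grows with k

def local_search(v, f):
    step = 0.1                         # simple fixed-step hill climbing
    while f(v + step) > f(v) or f(v - step) > f(v):
        v = v + step if f(v + step) > f(v) else v - step
    return v

x_best, f_best = vns(0.0, lambda v: -(v - 3.0) ** 2, shake, local_search,
                     k_max=3, max_iter=20)
```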
Further modifications to basic VNS as well as several strategies for parallelizing are discussed in [16,20].

VNS-HMM MODEL
In VNS-BWA training, we first choose an initial solution λ_0 = (π_0, A_0, B_0), then re-estimate λ_0 iteratively using the BWA, until convergence or until an iteration limit is reached, obtaining a solution λ*. Then λ* is encoded into a string W = (w_0, w_1, ..., w_p) (Fig. 1) and used to construct an initial solution for the VNS algorithm; the solution space includes all admissible parameters for a specified number of observation symbols and hidden states.
In this paper, we used the basic VNS variant, which alternates a simple local search procedure and a shaking procedure, together with a neighborhood change step, until a predefined stopping criterion is fulfilled. The steps of basic VNS with a sequential neighborhood change step are given in Algorithm 1.

Algorithm 1 Basic VNS
1: Initialization. Select the set of neighborhood structures N_k, k = 1, ..., k_max; find an initial solution x; choose a stopping condition
2: while the stopping condition is not met do
3:    k ← 1
4:    while k ≤ k_max do
5:       Shaking. Generate a point x' ∈ N_k(x) at random
6:       Local search. Apply some local search method with x' as the initial solution; denote by x'' the so-obtained local optimum
7:       Move or not. If x'' is better than the current solution, set x ← x'' and k ← 1; otherwise set k ← k + 1
8:    end while
9: end while

In the proposed VNS algorithm, two neighborhood actions are defined to generate the neighborhood structures of a candidate solution. The first neighborhood action replaces a component of the candidate solution by a number r, giving (w_0, w_1, ..., r, ..., w_p), where 0 ≤ r ≤ 1.
The second neighborhood action adds a number r to a component of the candidate solution, giving (w_0, w_1, ..., w_i + r, ..., w_p), followed by normalization.
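The two neighborhood actions can be sketched as follows, operating on the encoded solution W = (w_0, ..., w_p). The example vector is hypothetical; for simplicity the sketch normalizes the whole vector after the second action, whereas in practice normalization would be applied per probability row of the decoded HMM:

```python
import random

def neighborhood_replace(w, i, r):
    # Action 1: replace component i by a number r, with 0 <= r <= 1.
    w1 = list(w)
    w1[i] = r
    return w1

def neighborhood_add(w, i, r):
    # Action 2: add r to component i, then normalize so the vector sums to one.
    w1 = list(w)
    w1[i] += r
    s = sum(w1)
    return [v / s for v in w1]

# Hypothetical encoded solution W = (w_0, ..., w_p).
W = [0.2, 0.5, 0.3]
W_a = neighborhood_replace(W, 1, random.random())
W_b = neighborhood_add(W, 0, 0.1)
```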
We propose two fitness functions to evaluate the degree of optimality, depending on the learning type. When the learning is supervised, we maximize the accuracy, f(W) = (number of correctly classified observations) / (total number of observations); when the learning is unsupervised, we maximize the log-likelihood of the observation sequence, g(W) = log P(O|λ).
At the beginning of VNS, the first neighbourhood is selected (k ← 1), and at each iteration the parameter k is increased (k ← k + 1). The parameter k is increased whenever the algorithm is trapped at a local optimum, until the maximum size of the shaking phase, k_max, is reached; then k is re-initialized to the first neighbourhood (k ← 1).
The optimization process stops when the termination condition is met (e.g. a maximum allowed CPU time, reaching the maximum number of iterations iter_max, or an improvement of the best fitness below a specified threshold), and the best solution is adopted as the HMM parameter λ_best. The overall hybrid VNS-BWA algorithm is shown in Fig. 2.

Experimental Study
This section describes the experimental study used to evaluate the implementation of our model, including the test instances, evaluation metrics, dataset split, experimental design, and experimental results.

Test Instances
The dataset used in this study is a public dataset available from Lending Club [21]. Lending Club is one of the largest online peer-to-peer lending agencies in the United States, and makes loan characteristic information available for the years 2007 to 2018. We excluded records containing obvious errors and characteristics with missing information, and kept accepted records with an observed good/bad status. The final dataset used for the analysis consists of 1,266,782 issued loans, including 247,426 defaults, with 12 attributes, 5 numerical and 7 categorical: loan amount, annual income, FICO score, DTI ratio, interest rate, address state, employment length, home ownership, grade, purpose, delinquency in the last 2 years, and the loan status. Tables 1 and 2 summarize the descriptive statistics of the Lending Club dataset.

Binning
Binning techniques are extensively used in machine learning applications, particularly in credit scoring [22]. There are various discretization tools for solving the optimal binning problem. In this paper, a recent optimal binning discretization tool is employed to determine the optimal discretization of the variables of the Lending Club dataset [23]. Table 3 shows the binning results for each characteristic of the dataset: name, data type and number of bins, in addition to several quality metrics: the Gini index, the Information Value (IV), and the quality score. The resulting bins are used in the data transformation step for HMM modeling. The binning process also helps to address statistical noise, handle data scaling, and reduce the complexity of the HMM model.
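As a rough illustration of the discretization step, the sketch below performs simple equal-frequency binning. It is a simplified stand-in for the optimal binning tool of [23], which instead chooses bin edges by optimizing metrics such as the IV; the income data is synthetic:

```python
import numpy as np

def quantile_bin(x, n_bins):
    # Equal-frequency binning: bin edges at the interior quantiles of x.
    # (The optimal binning tool used in the paper optimizes a quality metric
    # instead of using quantiles; this is only an illustration.)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)      # bin index in {0, ..., n_bins - 1}

rng = np.random.default_rng(0)
income = rng.lognormal(10, 1, size=1000)   # hypothetical annual incomes
bins = quantile_bin(income, 5)             # 5 observation symbols for the HMM
```

The bin indices produced this way play the role of the discrete observation symbols O_t fed to the HMM.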

Evaluation Metrics
There are many indicators for measuring the performance of a credit scoring model. In this work, we consider the AUC metric, which refers to the area under the receiver operating characteristic (ROC) curve, a comprehensive indicator that reflects the trade-off between sensitivity and specificity. The AUC usually ranges between 0.5 and 1. An AUC value of 0.5 means that the classifier guesses at random; the higher the AUC value, the better the predictive power of the classifier.
To measure the fitting performance of the proposed model we use the accuracy metric: Accuracy = (TP + TN) / (TP + TN + FP + FN), where true positives (TP) are the instances that actually belong to the good group and were correctly classified as good, true negatives (TN) are the instances that belong to the bad group and were correctly classified as bad, false positives (FP) are the instances from the bad group that were mistakenly classified as good, and false negatives (FN) are the instances from the good group that were incorrectly classified as bad.
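Both metrics can be computed directly from labels and scores. The sketch below computes accuracy from the confusion counts and the AUC via the equivalent rank-sum (Mann-Whitney) statistic:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. fraction of correct labels.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def auc(y_true, scores):
    # AUC as the probability that a randomly chosen positive instance is
    # scored above a randomly chosen negative one (ties count one half).
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```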

Dataset split
To minimize the impact of data dependency and improve the reliability of the estimates, k-fold cross-validation is used to create random partitions of the dataset as follows:
1. The dataset is split into k mutually exclusive folds of nearly equal size.
2. The first subset is chosen as the testing set and the remaining k-1 subsets as the training set.
3. The model is built on the training set.
4. The model is evaluated on the testing set by calculating the evaluation metrics.
5. The next subset is then chosen as the testing set, with the remaining k-1 subsets as the training set.
6. In this way the model is trained k times, each time using k-1 subsets as the training set, and its performance is evaluated on the remaining subset (testing set).
7. The predictive power of the classifier is obtained by averaging the k fold estimates found during the k runs of the cross-validation process.
Common values for k are 5 and 10; in this experiment, we take k = 5. Cross-validation was mainly used to assist in tuning the model parameters and evaluating the final model results.
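The cross-validation procedure above can be sketched as follows. The `fit` and `score` callables and the majority-class "model" in the example are hypothetical placeholders for the actual classifiers and metrics:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Split n sample indices into k mutually exclusive folds of nearly equal
    # size, after a random shuffle.
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, fit, score, k=5):
    # Train on k-1 folds, evaluate on the held-out fold, average the k scores.
    folds = kfold_indices(len(y), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    return float(np.mean(scores))

# Hypothetical illustration with a trivial majority-class "model".
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)
fit = lambda X_tr, y_tr: int(np.bincount(y_tr).argmax())
score = lambda model, X_te, y_te: float(np.mean(y_te == model))
mean_acc = cross_validate(X, y, fit, score, k=5)
```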

Experimental Design
In the experimental setup using the Lending Club P2P dataset, the risk of being a good or bad applicant is represented by two hidden states in the HMM, denoted S_0 and S_1. Each observation sequence corresponds to an input variable preprocessed into a canonical form for its HMM, where the number of observation symbols is the number of bins resulting from the discretization procedure defined in Section 3.2. We compute the initial HMM structure for each characteristic as described in Section 2.2.A.
To compare the classification performance of the VNS-HMM model with the HMM model, we proceed as follows:
Step 1: Randomly under-sample the dataset to handle the class imbalance.
Step 2: Split the dataset into 5 mutually exclusive folds of nearly equal size.
Step 3: Build the credit scoring models using the training folds.
Step 4: Use the classifiers built in Step 3 to predict the PD and the label of the samples in the test folds.
Step 5: Evaluate and compare the performances of the VNS-HMM and HMM models by averaging the result values and plotting the ROC curves.
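Step 1 of the procedure above, random under-sampling, can be sketched as follows (the data in the example is hypothetical):

```python
import numpy as np

def undersample(X, y, seed=0):
    # Randomly under-sample the majority class so that both classes have the
    # same number of instances (Step 1 of the experimental procedure).
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    if len(idx0) > len(idx1):
        idx0 = rng.choice(idx0, size=len(idx1), replace=False)
    else:
        idx1 = rng.choice(idx1, size=len(idx0), replace=False)
    keep = rng.permutation(np.concatenate([idx0, idx1]))
    return X[keep], y[keep]

# Hypothetical imbalanced sample: 14 good (0) vs 6 bad (1) applicants.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 14 + [1] * 6)
Xb, yb = undersample(X, y)
```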
Additionally, we include a total of six machine learning classifiers, namely Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost), CatBoost, and Light Gradient Boosting Machine (LightGBM), and compare their performances to the VNS-HMM model (Figure 4).

Experiments Results and Analysis
From the experimental results, the following conclusions can be drawn. First, employing the VNS method for HMM training yielded significantly better accuracies during the search. VNS improved the fitting accuracy and enabled the training to escape local optima, demonstrating the effectiveness of our approach. Furthermore, the VNS training did not require many iterations to achieve better performance. Second, the predictive performance in terms of AUC of VNS-HMM outperforms that of HMM in all five folds. Table 4 shows the compared performance of the proposed model and the other classifiers. The VNS-HMM model obtains the best AUC value, followed by SVM, and the third best accuracy after SVM and MLP.

Conclusion
This paper proposed VNS-BWA as a new approach for training hidden Markov models, in which VNS is used in combination with the BWA to explore the search space for the optimal parameter structure of HMMs. The results demonstrated the ability of VNS-BWA to find better HMM parameters and enhance their predictive performance. We limited our approach to discrete HMMs; continuous observations can be handled when the density functions are normal or follow another known probability distribution, which is not the case for our data. Several directions remain to be explored. Since both VNS and BWA lend themselves to GPU computing, we plan to work on a parallel implementation to tackle larger problems. It would also be interesting to compare the VNS-HMM model to semi-supervised machine learning techniques in future research.