Probability Model Based on Cluster Analysis to Classify Sequences of Observations for Small Training Sets

The problem of pattern recognition with few available training data is particularly relevant and arises when the collection of training data is expensive or essentially impossible. This work proposes a new probability model, MC&CL (Markov Chain and CLusters), based on a combination of a Markov chain and a clustering algorithm (Kohonen self-organizing map, k-means), to solve the problem of classifying sequences of observations when the training set is small. An original experimental comparison is made between the developed model (MC&CL) and a number of other popular models for classifying sequences: HMM (Hidden Markov Model), HCRF (Hidden Conditional Random Fields), LSTM (Long Short-Term Memory), and kNN+DTW (k-Nearest Neighbors algorithm with the Dynamic Time Warping algorithm). The comparison uses synthetic random sequences generated from a hidden Markov model, with noise added to the training samples. The suggested model shows the best classification accuracy among those under review when the amount of training data is low.


Introduction
Sequences of observations are classified when solving problems of recognizing speech [1,2], handwritten text [3], hand/head gestures [4,5], and states of technical objects [6,7,8]. Due to the intense introduction of machine learning into various areas of human activity, machine learning engineers often have to deal with small training sets whose structure and characteristics are almost unknown. To classify sequences of observations, the following machine learning methods have been widely used: the Hidden Markov Model (HMM), Hidden Conditional Random Fields (HCRF), Long Short-Term Memory (LSTM), and the k-Nearest Neighbors algorithm (kNN) with the Dynamic Time Warping algorithm (DTW). kNN is a popular metric non-parametric classification algorithm based on computing distances between a test sample and the samples of the training set. Several studies on applying the kNN method together with the DTW algorithm were undertaken by Professor Eamonn Keogh and his colleagues; these studies have shown that the kNN+DTW method demonstrates the best results in classifying one-dimensional sequences of observations [9]. LSTM is a recurrent neural network architecture developed to classify time sequences. There are works showing the superiority of LSTM neural networks over other machine learning algorithms in recognizing handwritten text [10] and speech [11]. HMM and HCRF are probability models with the concept of states [12,13,14,15]. The basic idea of such models is that, at each instant of time, the system (process) under review is in one state out of a finite set of states, and, as time passes, transitions from one state to another take place based on the Markov assumption of conditional independence. HMM is a generative model that requires estimating the parameters of the probability distributions of the data observed in each state [16,17].
HCRF is a discriminative model that requires estimating the parameters of a separating hyperplane between the observed data of different classes in each state [18,19,20]. As of now, there are no publicly available experimental data on the quality of classification demonstrated by the above models when the amount of training data is low (up to 100 samples). Hence, the objective of this work is to examine the behavior of the above-listed methods in solving the classification problem on a small training set, and to develop a new probability model, MC&CL (Markov Chain and CLusters), which remedies the shortcomings of the reviewed models. Synthetic random sequences are used as experimental data: training and test sequences of observations are generated by sampling from a hidden Markov model with a Gaussian probability density function. Adding noise to the training samples makes the training and test sets differ. The MC&CL method develops and modifies a probability model we have previously proposed, based on a Markov chain and a Kohonen self-organizing map / growing neural gas [21,22]. As distinguished from the previous works, the suggested method is generalized to an arbitrary clustering algorithm and tailored to solving the classification problem when the amount of training data is low. A small amount of training data is dangerous in machine learning because it does not allow forming a statistically significant representation of the research subject: with few experiments performed, it is not possible to separate noise components from useful information about the object. This work shows that the developed MC&CL model effectively solves the classification problem with few training data, through efficient leveling of the noise components in the training set.
The following two problems may cause low classification accuracy when there are few training data: 1) many free parameters [23]; 2) the absence of parameters of the observed-data distribution in the model [24,25,26]. The kNN+DTW model is non-parametric, and it might be said to have very many free parameters if each sample of the training set is considered a parameter. Hence, the kNN+DTW model is hardly suitable for classification on a small set, since it corresponds to problem 1. The HCRF and LSTM models are discriminative and do not estimate the parameters of the observed-data distribution; therefore, they are not really suitable for classification on a small set either, since they correspond to problem 2. The HMM model has few parameters and estimates the parameters of the observed-data distribution. Thus, it satisfies both criteria and is suitable for solving the classification problem when there are few training data. In this paper, an attempt is made to reduce the number of free parameters of the HMM model and to alter its training algorithm, thereby suggesting the new MC&CL model. Chapter 2 contains a description of the developed MC&CL model. Chapter 3 contains information about the experiments. Chapters 4 and 5 comprise the Discussion and Conclusion, respectively.

Description of probability MC&CL model
It is assumed that a sequence of observations can be represented as a set of multidimensional random variables $\bar{O} = \{\bar{o}_1, \ldots, \bar{o}_T\}$, where $\bar{o}_t = \{o_1, \ldots, o_n\}$ is the $t$-th element of the sequence, a feature vector containing $n$ components $o_i$. The structure of the HMM model is taken as the basis of the suggested model, but with the difference that the states of the model are explicit rather than hidden. The sequence of observations is broken down into clusters using any known clustering algorithm (SOM, k-means). Here, the number of a cluster is equivalent to the number of an explicit state of the model. Then the model represents a product of probability distributions of random variables of two types: observed data and cluster numbers. The joint probability distribution of the random variables is a product of two conditional probability distributions:

$$p(X, H) = \prod_{t=1}^{T} p(x_t \mid h_t) \, p(h_t \mid h_{t-1}), \qquad (1)$$

where $x_t$ is the random variable corresponding to the $t$-th element of the sequence of observations; $h_t$ is the random variable corresponding to the number of the cluster that corresponds to the $t$-th element of the sequence; $T$ is the length of the random sequence of observations. The distribution $p(x_t \mid h_t)$ is specified as a multidimensional Gaussian distribution with a scalar value of dispersion. Here, the dispersion is considered equal for all distributions of product (1). Hence, the probability of observing element $\bar{o}_t$ of the sequence in the cluster with number $\nu$ is computed as

$$p(\bar{o}_t \mid h_t = \nu) = \left(\frac{\beta}{\pi}\right)^{n/2} \exp\!\left(-\beta \,\lVert \bar{o}_t - \bar{c}_\nu \rVert^2\right),$$

where $n$ is the size of the attribute space; $\beta$ is a distribution parameter (not estimated); $\bar{c}_\nu$ is the value of the center of the cluster with number $\nu$; $\bar{o}_t$ is the $t$-th element of the sequence of observations. The distribution $p(h_t \mid h_{t-1})$ is specified as the distribution of a Markov chain with regularization in the form of an added Dirichlet distribution.
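As a concrete illustration of the emission distribution above, the following is a minimal Python sketch (the paper's own listings are in Matlab/Octave; the function name `emission_prob` is ours, not the paper's):

```python
import numpy as np

def emission_prob(o_t, c_nu, beta, n):
    """Probability of observing element o_t in the cluster with center c_nu:
    a multidimensional Gaussian with a single scalar dispersion parameter beta,
    p(o | nu) = (beta/pi)^(n/2) * exp(-beta * ||o - c_nu||^2)."""
    diff = np.asarray(o_t, dtype=float) - np.asarray(c_nu, dtype=float)
    return (beta / np.pi) ** (n / 2) * np.exp(-beta * (diff @ diff))
```

Note that, per the model description, β is a fixed hyperparameter shared by all clusters rather than a value estimated from the data; only the cluster centers come from the clustering algorithm.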
Then the probabilities of transitions between clusters are computed as

$$\lambda_{i,j} = \frac{a_{i,j} + \xi_j}{\sum_{k=1}^{N} \left(a_{i,k} + \xi_k\right)},$$

where $a_{i,j}$ is the number of transitions from the cluster with number $i$ to the cluster with number $j$, $i = 0 \ldots N$, $j = 1 \ldots N$ ($N$ is the number of clusters); $\xi_j$ is a Dirichlet distribution parameter. The maximum likelihood estimate of the transition probabilities of a Markov chain is biased; as the training set grows, the bias of the estimate vanishes. Since there are few training data, the estimate is made according to the maximum a posteriori method. The Dirichlet distribution is the conjugate prior of the distribution of transition probabilities in a Markov chain. Adding the conjugate prior acts as regularization, preventing the model from overfitting when there are few training data.
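The MAP estimate with the Dirichlet prior amounts to additive smoothing of the transition counts. A minimal Python sketch (the function name `transition_probs` is ours):

```python
import numpy as np

def transition_probs(counts, xi):
    """MAP estimate of Markov-chain transition probabilities with a Dirichlet prior:
    lambda[i, j] = (a[i, j] + xi[j]) / sum_k (a[i, k] + xi[k]).
    counts: matrix of observed transition counts a[i, j] (per the paper, the row
    index i runs from 0, so the first row may hold initial-cluster counts).
    xi: vector of Dirichlet parameters, one per cluster."""
    smoothed = np.asarray(counts, dtype=float) + np.asarray(xi, dtype=float)
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

With ξ > 0 every transition keeps nonzero probability even if it never occurred in the small training set, which is exactly the regularizing effect described above.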

Training of MC&CL model
The probability contributed by each element of a sequence is computed as follows:
- for the first element of the sequence (t = 1): $\mu_j(\bar{o}_1) \cdot \lambda_{0,j}$;
- for the other elements of the sequence (t ≠ 1): $\mu_j(\bar{o}_t) \cdot \lambda_{i,j}$,

where $\mu_j(\bar{o}_t)$ is the probability of observing the $t$-th element $\bar{o}_t$ of the sequence in cluster $j$; $\lambda_{i,j}$ is the probability of transition from cluster $i$ to cluster $j$.
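Since the states of MC&CL are explicit, a sequence can be scored by assigning each element to a cluster and multiplying the emission and transition probabilities along the sequence; classification then picks the class model with the highest likelihood. The following Python sketch makes this concrete under two assumptions of ours (the paper's description of this step is fragmentary): each element is assigned to the nearest cluster center, and the initial-state term is omitted for simplicity.

```python
import numpy as np

def score_sequence(seq, centers, trans, beta):
    """Log-likelihood of a sequence under one MC&CL class model.
    Assumption: the explicit state of each element is the nearest cluster center;
    log emission and log transition probabilities are summed along the sequence."""
    seq = np.asarray(seq, dtype=float)
    centers = np.asarray(centers, dtype=float)
    n = seq.shape[1]
    # squared distances from every element to every cluster center
    d2 = ((seq[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    states = d2.argmin(axis=1)  # explicit state sequence
    log_mu = (n / 2) * np.log(beta / np.pi) - beta * d2[np.arange(len(seq)), states]
    log_lam = np.log(trans[states[:-1], states[1:]])
    return log_mu.sum() + log_lam.sum()

def classify(seq, models):
    """Pick the class whose model gives the highest log-likelihood.
    models: dict mapping class label -> (centers, trans, beta)."""
    return max(models, key=lambda k: score_sequence(seq, *models[k]))
```

Working in log space avoids numerical underflow for long sequences, which would otherwise occur when multiplying many probabilities smaller than one.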

Experimental results
A random synthetic dataset is generated using the software package pmtk3 [27] to carry out a comparative assessment of classification quality. Sequences of observations for the training and test datasets are generated from a hidden Markov model with random distribution parameters. Gaussian noise (SNR = 0.1 dB) is added to the training samples so as to simulate differences between training and test data. Two functions from the pmtk3 package are used: mkRndGaussHmm(. . .), which creates a random hidden Markov model with a Gaussian probability distribution of the observed data in each hidden state of the model, and hmmSample(. . .), which generates a sequence of observations with the specified parameters. Gaussian noise is added to the data through the function awgn(). The parameters of the generated sequences of observations are shown in Table 1. The Matlab/Octave code for generating the test and training sequences is presented in Listing 1. For MC&CL, the source code repository is [29]; the number of states (number of clusters) is 6; MC&CL(SOM) uses the Kohonen self-organizing map as the clustering algorithm, and MC&CL(k-means) uses the k-means method.
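For readers without Matlab/Octave, the data-generation pipeline can be sketched in Python. This is our rough analogue of the pmtk3 calls, not the paper's Listing 1: `make_random_gauss_hmm`, `hmm_sample`, and `add_awgn` are hypothetical stand-ins for mkRndGaussHmm(. . .), hmmSample(. . .), and awgn(), and the chosen parameter values (e.g. unit emission variance) are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_gauss_hmm(n_states, dim):
    """Random HMM with Gaussian emissions (rough analogue of pmtk3's mkRndGaussHmm)."""
    trans = rng.random((n_states, n_states))
    trans /= trans.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    init = rng.random(n_states)
    init /= init.sum()                           # initial state distribution
    means = rng.normal(0.0, 3.0, (n_states, dim))
    return init, trans, means

def hmm_sample(model, T):
    """Sample one length-T observation sequence (rough analogue of hmmSample)."""
    init, trans, means = model
    state = rng.choice(len(init), p=init)
    obs = []
    for _ in range(T):
        obs.append(means[state] + rng.normal(0.0, 1.0, means.shape[1]))
        state = rng.choice(len(init), p=trans[state])
    return np.array(obs)

def add_awgn(x, snr_db):
    """Add white Gaussian noise at the given SNR in dB (rough analogue of awgn)."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), x.shape)
```

As in the paper's protocol, noise would be applied only to the training sequences, so that the training and test sets differ.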
To assess the quality of classification, the classification error is used, computed as the fraction of wrong answers: the number of mismatches between the answer of the classifier and the actual class label, divided by the total number of answers. The dependence of classification quality on the size of the training set for the above-stated models is presented in Fig. 2 and Table 2. Training is performed sequentially on 20, 30, 40, and so on, up to 110 samples from the training set. Testing is always performed on 200 samples from the test dataset.
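The error metric just described is straightforward; a one-function Python sketch (the name `classification_error` is ours):

```python
def classification_error(predicted, actual):
    """Fraction of wrong answers: the number of mismatches between the
    classifier's answers and the actual class labels, divided by the
    total number of answers."""
    assert len(predicted) == len(actual)
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)
```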

Discussion
Analyzing Figure 2, it can be concluded that the suggested MC&CL model demonstrates the best classification result when there are few training data (dozens of samples). This is due to the fact that the MC&CL model has fewer free parameters than the other models, which hinders the recovery of noise components from the data. From Table 2 it can be inferred that the choice of clustering algorithm (SOM or k-means) has a minor effect on the final classification error. The result closest to that of the MC&CL model is shown by the HMM model. As was supposed at the beginning of this work, the HMM model was successfully outperformed on few training data due to the reduced number of free parameters of the model, particularly the number of parameters in the probability density function of the observed data:
1. HMM model. The distribution of the data observed in each state of the model is specified by a multidimensional Gaussian function with parameters $\bar{\mu}$ (the vector of mathematical expectations) and $\Sigma$ (the covariance matrix):

$$p(\bar{x}) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\bar{x} - \bar{\mu})^{\mathsf T} \Sigma^{-1} (\bar{x} - \bar{\mu})\right),$$

where $n$ is the size of the attribute space. Number of parameters: $n + \frac{n(n+1)}{2}$.
2. MC&CL model. The distribution of the data observed in each state of the model is specified by a multidimensional Gaussian function with parameters $\bar{c}$ (the vector of mathematical expectations, i.e., the centers of clusters) and $\beta^{-1}$ (a scalar value of dispersion). Number of parameters: $n$, since $\beta$ is not estimated.
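The gap between these per-state parameter counts grows quadratically with the dimension of the attribute space, which is why the reduction matters on small training sets. A trivial Python check of the arithmetic:

```python
def hmm_emission_params(n):
    """Per-state parameters of a full-covariance Gaussian emission in HMM:
    n for the mean vector plus n*(n+1)/2 for the symmetric covariance matrix."""
    return n + n * (n + 1) // 2

def mccl_emission_params(n):
    """Per-state parameters of the MC&CL emission: only the n components of
    the cluster center; the scalar dispersion beta is fixed, not estimated."""
    return n
```

For example, at n = 10 attributes the HMM estimates 65 parameters per state where MC&CL estimates 10, a 6.5x reduction that widens as n grows.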

Conclusion
This paper has described the MC&CL probability model we developed, generalized to an arbitrary clustering algorithm, and presented the original results of a comparative analysis between the MC&CL model and the HMM, HCRF, LSTM, and kNN+DTW models, using synthetic data generated on the basis of a hidden Markov model with noise added to the training samples. It is shown that the developed MC&CL model successfully solves the classification problem when the amount of training data is low, through efficient leveling of the noise components within the training set. Such a model may be relevant for classification problems where it is impossible to form a large training set due to financial or time restrictions, or because the phenomena of interest (anomalies) as such are rare.