A Hybrid DBN and CRF Model for Spectral-Spatial Classification of Hyperspectral Images

Hyperspectral image classification plays an important role in remote sensing image analysis. Recent techniques have investigated the capabilities of deep learning approaches to tackle hyperspectral image classification. This work shows how to further improve hyperspectral image classification by using both a deep representation and contextual information. To this end, this work proposes a new Conditional Random Field (CRF) model (named DBN-CRF) with potentials defined over the deep features produced by a Deep Belief Network (DBN). The newly formulated DBN-CRF model takes advantage of the strength of DBNs in learning a good representation and the ability of CRFs to model contextual (spatial) information in both the observations and the labels. Within a piecewise training framework, an efficient training method is proposed to train the whole DBN-CRF model end-to-end. This means that the parameters of the DBN and the CRF can be jointly trained, so the proposed method can fully use the strengths of both. Moreover, in the proposed training method, the end-to-end training can be implemented with a standard back-propagation algorithm, avoiding the repeated inference usually involved in CRF training, and is thus computationally efficient. Experiments on real-world hyperspectral data show that our method outperforms the most recent approaches in hyperspectral image classification.


Introduction
Over the past decades, hyperspectral imaging has experienced significant success since it can acquire spatially and spectrally continuous data simultaneously. Hyperspectral images are used in a wide range of real-world applications, such as psychology, urban planning, surveillance, agriculture, and disaster prevention and monitoring [1,2,3,4,5]. In these applications, the core processing can usually be cast as a classification task, i.e., assigning a land-cover label to each pixel. For this reason, hyperspectral image classification is of particular interest in land-cover analysis research. Many popular methods have been developed for hyperspectral image classification in the past several decades. One of the main approaches in this context uses only the spectral information within each pixel with a popular classifier, such as multinomial logistic regression (MLR) [6,7,8], neural networks [9,10], support vector machines (SVMs) [11,12], graph-based methods [13,14], AdaBoost [15], Gaussian processes [16] and random forests [17].
Such hyperspectral image classification methods take into consideration only the spectral variations of pixels, ignoring important spatial correlations. However, spatial correlations have proved to be very useful for image analysis in both the remote sensing and computer vision communities [18]. Thus, in recent years, spectral-spatial classification methods have been proposed that capture both the spectral and spatial information in a pixel and its neighboring pixels. Spectral-spatial methods have shown significant advantages in terms of improving classification performance, and were developed following two main strategies. One is to incorporate contextual information into extracted features [19,20,21,22,23,24,25,26], such as morphological features [21,22], tensor features [23], wavelet features [24], LBP features [25], and co-occurrence texture features [26].
The other is to model contextual information through the intrinsic structure of the classifiers [27,28,29,30,31,32,33,34,35,36]. Probabilistic graphical models have been developed as effective methods to incorporate the spectral and spatial information in a unified framework. In particular, Markov Random Fields (MRFs) and their variant, Conditional Random Fields (CRFs), have obtained widespread success and have become among the most successful graphical models in the remote sensing literature [29,30,31,32,33,34,35,36]. The MRF framework is effective in modeling contextual information in the label image [29,30]. But for computational tractability, the observed spectral vectors are assumed to be conditionally independent, neglecting the contextual information in the observed data of a given class. As a variant of MRFs, CRFs have the intrinsic ability to incorporate the contextual information in both the labels and the observed data in a principled manner. Moreover, the contextual information is captured through the intrinsic CRF structure, without complex modeling of the dependencies between the observations of neighboring sites. With these merits, CRFs usually show significant advantages over MRFs in terms of improving classification performance [31,32,33,34,35,36]. This work will develop a new CRF model to further improve spectral-spatial hyperspectral image classification.
On a separate track from this progress in hyperspectral image classification methods, deep learning methods have been developed as effective means to represent and classify hyperspectral images [37,38,39,40,41]. In fact, most of the popular methods mentioned previously can be deemed shallow methods with only one or two processing layers. Research in computer vision, however, has demonstrated that deep architectures with more layers can potentially extract abstract and invariant features for better image classification [42]. Moreover, deep learning methods have been immensely successful in many vision tasks such as image recognition [43], object detection [44] and semantic segmentation [45,46,47]. This motivates exploring the use of deep learning for hyperspectral image representation and classification.
There are significant challenges, however, in adapting deep learning to hyperspectral image classification. The standard approach to real-world hyperspectral image classification is to select some samples from a given image for classifier training, and then use the learned classifier to classify the remaining test samples in the same image [31]. This means that we usually do not have enough training image patches for a supervised contextual deep model, such as a convolutional neural network (CNN). A few methods have been proposed to partially deal with this problem and make deep learning fit hyperspectral image classification. In [41], a fully unsupervised learning method for CNNs was developed to extract deep sparse features based on the highly efficient enforcing population and lifetime sparsity algorithm. The work [40] relieved the problem by defining a 1-D CNN, which operates over the set of spatially independent spectral vectors, not over image patches as in a 2-D CNN; thus the method cannot use the important spatial information. Another kind of method utilizes deep models with special structures that naturally support unsupervised training. Typical models include the multi-layer stacked autoencoder (SAE) [37,38] and deep belief networks (DBN) [39]. They can be pre-trained in unsupervised ways, and can use the spatial information in the observations by taking hand-crafted contextual features as input. To sum up, the available deep learning methods usually focus on feature representation, and some of them can model the spatial information in the observations. Most of them, however, are unaware of the spatial information lying in the labels (the final objective of the classification task), let alone using the spatial information in the labels to improve deep learning.
As mentioned previously, CRFs naturally have the ability to model the spatial information in both the observations and the labels. Therefore, intuitively, CRFs can be used to improve the hyperspectral image classification results produced by a deep model. One approach to combining CRFs and deep models is to use the result of the deep model as the input to a CRF, so that CRF inference is essentially a post-processing step [46]. In this setup, the two procedures, i.e., the CRF and the deep model, are independent of each other, and thus the deep model cannot be trained to optimally fit the subsequent CRF. Works [45] and [47] moved toward a tighter coupling of deep models and CRFs. Following this direction, this work incorporates the deep features produced by a DBN into the CRF potentials, yielding the proposed DBN-CRF model. The remaining topic is how to train the proposed DBN-CRF model. In this work, we will develop an end-to-end learning solution to jointly train the DBN and the CRF so as to use the spectral and spatial information in both the observations and the labels. Since CRFs for image analysis are large graphical models with loops, training can be computationally expensive. In addition, the deep structure of the DBN model and the characteristics of standard real-world hyperspectral image classification make end-to-end training even harder. Motivated by the work in [31,47], we will develop an efficient method within the piecewise framework to jointly train the DBN and CRF model. We will demonstrate that the training of the proposed DBN-CRF model can finally be implemented as the fine-tuning of two DBN models, corresponding to the unary and pairwise potentials, respectively.
The proposed method has the following main merits. Firstly, it avoids the repeated inference typically needed in CRF training. Secondly, it can directly use the efficient back-propagation method to fine-tune the DBN so that it optimally fits CRF inference. Finally, the method allows using only spatially separated spectral vectors as training samples and is thus feasible in real-world hyperspectral image classification tasks. To the authors' knowledge, this work is the first to propose end-to-end training of a joint DBN and CRF model to improve hyperspectral image classification based on spectral-spatial information. The rest of the paper is arranged as follows. Section 2 proposes the hybrid DBN and CRF model for spectral-spatial hyperspectral image classification. The efficient end-to-end learning algorithm of the proposed DBN-CRF model is developed in Section 3. Section 4 uses real-world hyperspectral image data sets to evaluate the proposed method. Finally, our technique is concluded and discussed in Section 5.

A hybrid DBN and CRF model
Our goal is to develop a systematic approach to assign a label to each terrain site based on the cube of observations, such that the labels are as close to the ground truth as possible. This problem can be formulated naturally within a statistical framework. Specifically, we will incorporate the deep features from a DBN into a CRF to formulate a new hybrid DBN and CRF (DBN-CRF) model, which takes advantage of the strength of DBNs in learning deep representations and of CRFs in modeling contextual (spatial) information in both the observations and the labels.

DBNs for Deep Representation
A hyperspectral image usually has hundreds of spectral bands with a narrow bandwidth and fixed sampling intervals. This abundant information gives the hyperspectral image the potential to discriminate between different land-cover classes. However, simple methods that directly use the spectral signature cannot fully exploit this potential. In order to get a better classification map, it is necessary to extract an informative representation of the original spectral signature. A single-hidden-layer model is usually limited in capturing the features in hyperspectral data, while multiple layers together can demonstrate the real power.
A DBN is such a model, built by stacking a series of Restricted Boltzmann Machines (RBMs). The graphical representation of a DBN is shown in Fig. 1. In a DBN, the output of the previous RBM is used as the input data for the next RBM. Two adjacent layers have a full set of connections between them, but no two units in the same layer are connected. In theory, the output of every layer can be used as the extracted deep features. The output of the j-th hidden unit of the l-th layer of the network with input x is

$$h_j^l(x) = \sigma\Big(\sum_{i=1}^{J_{l-1}} w_{ij}^l\, h_i^{l-1}(x) + b_j^l\Big), \quad (1)$$

where σ(·) is the logistic sigmoid, θ_{1:l} = {w^1, b^1, ..., w^l, b^l} are the weight and bias parameters from the first to the l-th layer of the network, and w^l = {w^l_{ij}; i = 1, 2, ..., J_{l-1}, j = 1, 2, ..., J_l} and b^l = {b^l_j; j = 1, 2, ..., J_l} are the weight and bias parameters of the l-th layer with J_l units.
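The layer-wise feature extraction just described can be sketched in a few lines (a minimal NumPy sketch with toy dimensions and a sigmoid activation; the function and variable names are our own, not from the paper):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dbn_forward(x, weights, biases):
    """Propagate x through the stacked layers:
    h^l_j = sigmoid(sum_i w^l_ij * h^{l-1}_i + b^l_j), with h^0 = x."""
    h = x
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)   # w has shape (J_{l-1}, J_l)
    return h

# toy network: 4 input bands -> 3 hidden units -> 2 hidden units
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
biases = [np.zeros(3), np.zeros(2)]
features = dbn_forward(rng.normal(size=4), weights, biases)
```

Each intermediate activation could equally be kept as a deep feature; here only the last layer's output is returned.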
To extract the deep features effectively, the parameters of the L-layer network should be trained first. DBN training can be implemented by an unsupervised pre-training procedure [48]. Moreover, the pre-trained model can be further fine-tuned by a supervised training procedure. To implement fine-tuning, a softmax layer is usually incorporated into the DBN as the last layer (see Fig. 1). Since the softmax layer performs like a classifier, the fine-tuning procedure can use the semantic labels of the training samples to adjust the model's parameters to fit the final hyperspectral image classification. The output of the m-th unit (class) of the softmax layer is

$$O_m(x) = \frac{\exp\big(u_m^T h^L(x)\big)}{\sum_{m'=1}^{M} \exp\big(u_{m'}^T h^L(x)\big)}, \quad (2)$$

where M is the number of classes, h^L(x) is the output of the L-th (last hidden) layer, and u_m is the parameter vector for the m-th unit of the softmax layer. Equation (2) can also be deemed the probability P(y = m | x, θ) of the input data x belonging to the m-th class.
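Since the softmax output (2) is interpreted as the class probability P(y = m | x, θ), a minimal sketch of that layer may help (toy parameters and names are our own):

```python
import numpy as np

def softmax_layer(h, U):
    """O_m = exp(u_m^T h) / sum_m' exp(u_m'^T h); rows of U are the vectors u_m."""
    s = U @ h
    s = s - s.max()            # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum()

# deep feature h^L of one pixel and two class vectors u_1, u_2
probs = softmax_layer(np.array([0.2, 0.7, 0.1]),
                      np.array([[1.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0]]))
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.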
The usual maximum likelihood (ML) method is used to fine-tune the parameters by minimizing the negative log-likelihood

$$\ell(\theta) = -\frac{1}{K} \sum_{k=1}^{K} \log O_{\hat{y}_k}(\hat{x}_k), \quad (3)$$

where O_{ŷ_k}(x̂_k) is the output of the k-th training sample x̂_k corresponding to the ŷ_k-th class, that is,

$$O_{\hat{y}_k}(\hat{x}_k) = \frac{\exp\big(u_{\hat{y}_k}^T h^L(\hat{x}_k)\big)}{\sum_{m=1}^{M} \exp\big(u_m^T h^L(\hat{x}_k)\big)}, \quad (4)$$

x̂_k = [x̂_{k1}, x̂_{k2}, ..., x̂_{kD}]^T is a spectral signature with D bands, ŷ_k takes its label value from {1, 2, ..., M}, and K is the number of training samples. Stochastic gradient descent (SGD) is usually used to optimize the objective function (3), with the BP algorithm computing the needed gradients. More details can be found in [49].
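The SGD fine-tuning can be sketched as follows (a toy single-sample step on the softmax parameters only, assuming a cross-entropy loss; in full BP the gradient would also propagate to the lower layers — all names are ours):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sgd_step(U, h, y, lr=0.5):
    """One SGD update of the softmax parameters on sample (h, y):
    the gradient of -log O_y with respect to U is (p - onehot(y)) h^T."""
    p = softmax(U @ h)
    grad = np.outer(p - np.eye(len(p))[y], h)
    return U - lr * grad

h = np.array([1.0, 0.5])        # deep feature of one training sample
U = np.zeros((3, 2))            # softmax parameters for M = 3 classes
loss_before = -np.log(softmax(U @ h)[0])
U = sgd_step(U, h, y=0)
loss_after = -np.log(softmax(U @ h)[0])
```

A single step on the true class's negative log-likelihood should already decrease the loss for that sample.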

CRFs for Spectral-Spatial Classification
In the context of hyperspectral image classification, the observed data from an input image x is a set of spectral vectors x = {x_i, i ∈ S}, where x_i = [x_{i1}, x_{i2}, ..., x_{iD}]^T denotes a spectral vector associated with an image site i, S is the set of image sites, and D is the number of bands. The classification task is to assign (class) labels to the image sites using the observed spectral vectors. The resulting label image is denoted by y = {y_i, i ∈ S}, where y_i takes its value in the set {1, 2, ..., M} and M is the number of classes.
Within the Bayesian framework, hyperspectral image classification generally considers the posterior P(y|x). In the classical MRF framework, the posterior is formulated as P(y|x) ∝ P(y)P(x|y), where the prior distribution P(y) is usually formulated as a Gibbs distribution and y is said to be an MRF. P(x|y) is the likelihood and is usually assumed to have the factored form P(x|y) = ∏_i P(x_i|y_i) for computational feasibility. This assumption is equivalent to the conditional independence of the observed spectral values, and thus makes the MRF use only single-site spectral information to estimate the label of each site, neglecting the contextual (spatial) information in the observed data.
In contrast, the CRF framework directly models the posterior of the labels given the observed data as a Gibbs distribution:

$$P(y \mid x, \theta) = \frac{1}{Z(x, \theta)} \exp\Big(\sum_{c \in C} \psi_c(y_c, x, \theta)\Big), \quad (5)$$

where Z(x, θ) = Σ_y exp(Σ_{c∈C} ψ_c(y_c, x, θ)) is the partition function and ψ_c is the potential defined over clique c with parameters θ. The commonly used CRF models are formulated with up to pairwise clique potentials only (assuming the potentials defined over higher-order cliques to be zero), that is,

$$P(y \mid x, \theta) = \frac{1}{Z(x, \theta)} \exp\Big(\sum_{i \in S} \psi_i(y_i, x, \theta_u) + \sum_{i \in S} \sum_{j \in \eta_i} \psi_{ij}(y_i, y_j, x, \theta_v)\Big), \quad (6)$$

where η_i is the set of neighbors of site i, and ψ_i(·) and ψ_ij(·) are the unary and pairwise clique potentials with parameters θ_u and θ_v, respectively. The model parameters are then θ = {θ_u, θ_v}. With the formulation of (5) or (6), the CRF model avoids the explicit modeling of the likelihood required in the MRF framework and thus has advantages over MRFs, particularly in the flexible modeling of contextual information. We can observe from (5) that the potential ψ_c(y_c, x, θ) in a CRF is defined over the labels of a clique c and the whole input observation x; thus, in theory, the CRF model has the ability to capture the contextual information in both the labels and the observed data.
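To make the two potential types concrete, the following toy sketch (our own example, not from the paper) evaluates the exponent of the pairwise formulation (6) for labelings of a 3-site chain; under a Potts-like pairwise table, a smooth labeling scores higher than a noisy one:

```python
import numpy as np

def log_posterior_unnorm(y, unary, pairwise, edges):
    """Exponent of (6) for a labeling y: sum_i psi_i(y_i) + sum_(i,j) psi_ij(y_i, y_j).
    unary[i] is a length-M vector; pairwise[(i, j)] is an M x M table."""
    score = sum(unary[i][y[i]] for i in range(len(y)))
    score += sum(pairwise[(i, j)][y[i], y[j]] for (i, j) in edges)
    return score

unary = np.array([[2.0, 0.0], [0.4, 0.5], [2.0, 0.0]])   # 3 sites, M = 2 classes
smooth = np.array([[1.0, 0.0], [0.0, 1.0]])               # reward equal neighbor labels
edges = [(0, 1), (1, 2)]
pairwise = {e: smooth for e in edges}

s_smooth = log_posterior_unnorm([0, 0, 0], unary, pairwise, edges)
s_noisy = log_posterior_unnorm([0, 1, 0], unary, pairwise, edges)
```

The middle site's unary term weakly prefers class 1, but the pairwise terms tip the overall score toward the spatially consistent labeling.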

DBN-CRF for Deep Spectral-Spatial Classification
It is noted from the previous discussions that the merits of CRFs derive mainly from the flexibility of the potentials, and thus defining the potentials is an important issue in formulating a CRF model. We can define the appropriate potentials according to the task. For example, the unary and pairwise clique potentials in CRFs can be viewed as arbitrary local discriminative classifiers. This allows one to use domain-specific discriminative classifiers rather than restrictive generative models [39]. To see why deep features are needed, consider the plots of the spectral-spatial signatures of the different land-cover classes. The spectral-spatial signature of a site was obtained by first performing the PCA transformation over the hyperspectral image, then extracting the first 10 components as the new representation of the hyperspectral image, and finally concatenating this new representation in a 3 × 3 window centered at the site. It can be easily calculated that the spectral-spatial signature is a 90-dimensional vector. We can see from these plots that the curve of each land-cover class has its own visual shape. Although the curves of different classes show some differences from each other, how to design (usually hand-crafted) features to represent these signals, especially their essential specifics and differences, is still a very difficult problem. Fortunately, deep learning has demonstrated in recent years that it can automatically learn representations and mine features in a manner similar to primary human vision. Therefore, the CRF proposed in this work uses the deep features learned by the DBN to capture the complex statistics of the hyperspectral data.
1) Unary Potentials of DBN-CRF. The unary potential is formulated as follows:

$$\psi_i(y_i, x, \theta_u) = \sum_{m=1}^{M} \delta(y_i = m)\, u_m^T h_u^L(x_i). \quad (7)$$

Here δ(·) is the indicator function, which equals 1 if the input is true and 0 otherwise, M is the number of classes, U = {u_m; m = 1, ..., M} is the set of unary parameters, h_u^L(x_i) is the vector of deep features extracted by the DBN of the previous section for site i, i.e., the output of the second-to-last layer (the L-th layer) of the structure shown in Fig. 3, u_m = [u_{m1}, ..., u_{mJ_u^L}]^T is the parameter vector of the CRF for the m-th class, and J_u^L is the number of units of the L-th layer of the DBN for the unary potentials.
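Evaluating the unary potential for every site and class then amounts to one matrix product between the deep features and the unary parameter vectors; a minimal sketch (names and toy numbers are ours):

```python
import numpy as np

def unary_table(H, U):
    """psi_i(y_i = m, x) = u_m^T h^L_u(x_i): row i of the result holds the M
    unary potentials of site i. H stacks the deep features h^L_u(x_i) row-wise,
    and U stacks the class parameter vectors u_m row-wise."""
    return H @ U.T

H = np.array([[1.0, 0.0], [0.0, 1.0]])      # deep features of 2 sites (J_u^L = 2)
U = np.array([[2.0, 0.0], [0.0, 3.0]])      # u_1, u_2 for M = 2 classes
table = unary_table(H, U)                    # shape (num_sites, M)
```

The indicator function in the potential simply selects one entry of this table once a label is assigned to the site.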
2) Pairwise Potentials of DBN-CRF. To define the pairwise potential, we mainly focus on its ability to encode contextual information. There are complex interactions between neighboring spectral vectors: a spectral vector belonging to a type of terrain is highly dependent upon its neighbors, since within a type of terrain the spatial variations of pixel spectral vectors may follow some underlying patterns rather than being random; moreover, for different classes the underlying patterns may be different. Although the classical Ising/Potts models usually used as the pairwise clique potentials in MRFs can model the contextual information in the labels, they do not permit the use of observed data, and thus cannot capture the contextual information in the observed data.
A generalized Ising model is used to model the pairwise potential in this work, i.e.,

$$\psi_{ij}(y_i, y_j, x, \theta_v) = \sum_{m=1}^{M} \sum_{n=1}^{M} \delta(y_i = m)\, \delta(y_j = n)\, v_{mn}^T h_p^L\big(\mu_{ij}(x)\big). \quad (8)$$

Here h_p^L(μ_{ij}(x)) is a feature vector extracted from the observation x based on the DBN for a pair of sites (i, j), θ_p = {w_p^1, b_p^1, ..., w_p^L, b_p^L} are the weight and bias parameters from the first to the L-th layer of the DBN for the pairwise potentials, and w_p^l and b_p^l are the weight and bias parameters of the l-th layer. In this work, the pairwise feature vector h_p^L(μ_{ij}(x)) is obtained by first concatenating all elements of the inputs for sites i and j, and then using the DBN to extract the deep feature vector from the concatenated input μ_{ij}(x). For example, if the input for each site i is simply the D-dimensional spectral vector x_i = [x_{i1}, x_{i2}, ..., x_{iD}]^T, the input to the DBN for the site pair (i, j) is μ_{ij}(x) = [x_{i1}, x_{i2}, ..., x_{iD}, x_{j1}, x_{j2}, ..., x_{jD}]^T. The vector v_{mn} in (8) contains the parameters of the CRF for the class pair (m, n) and has dimension J_p^L, i.e., the number of units in the L-th layer of the DBN for the pairwise potentials.
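The concatenate-first, extract-once construction of the pairwise feature can be sketched as follows (a toy one-layer stand-in for the pairwise DBN; all names are ours):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dbn_forward(x, weights, biases):
    h = x
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)
    return h

def pairwise_feature(x_i, x_j, weights, biases):
    """Concatenate the D-dim inputs of sites i and j into one 2D-dim vector,
    then run the single pairwise DBN over the concatenated input."""
    return dbn_forward(np.concatenate([x_i, x_j]), weights, biases)

rng = np.random.default_rng(1)
D = 5
weights = [rng.normal(size=(2 * D, 4))]    # one layer: 2D inputs -> 4 units
biases = [np.zeros(4)]
h_p = pairwise_feature(rng.normal(size=D), rng.normal(size=D), weights, biases)
```

Because the network sees both sites at once, its hidden units can respond to joint spectral patterns rather than to each site in isolation.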
Our formulation of the pairwise potentials is different from the classical Ising/Potts model, which uses only the contextual information of the labels to enforce neighborhood smoothness. The formulation (8) in this work explicitly depends on the whole observed data x as well as the neighboring labels, and thus can capture both kinds of contextual information, in the labels and in the observed data. Furthermore, the contextual information in the labels is based on the idea of pairwise discrimination of the observed data, making it data-adaptive instead of fixed a priori as in MRFs.
In addition, our formulation of the pairwise potentials is also very different from that used in [47]. The formulation in [47] concatenates two deep features from sites i and j to get the feature vector of the site pair (i, j), while our method first concatenates the inputs of sites i and j, and then uses only one DBN to extract the deep feature for the paired input. Our processing method can fully model and use the contextual information of the paired observations of two classes. In addition, the most obvious difference between the formulations is that ours has the extra CRF parameter vectors v_{mn} in the pairwise potentials and u_m in the unary potentials, which can give the CRF more discriminative ability. Moreover, the extra CRF parameters u_m and v_{mn} give our method the merit of a convenient and efficient end-to-end training procedure: the training of our CRF model can be implemented as two efficient DBN trainings, and thus the CRF and DBN in the proposed DBN-CRF can be jointly learned. This point will be demonstrated in Section 3.
3) Graphical Representation of the Proposed DBN-CRF. Fig. 3 shows the graphical representation of the proposed DBN-CRF for spectral-spatial hyperspectral image classification. In order to make the figure clearer, only a one-dimensional CRF model is presented, while a two-dimensional CRF is actually used for hyperspectral image classification. In the figure, only the definitions over the unary site i and the pair of sites (i, i + 1) are completely illustrated; the other sites have similar structures.

Efficient training of the DBN-CRF model
The parameters to be estimated in the proposed DBN-CRF model are θ = {θ_u, θ_v}. The standard ML estimate maximizes the log-likelihood

$$\ell(\theta) = \sum_{q=1}^{Q} \log P(\hat{y}^q \mid x^q, \theta), \quad (9)$$

where x = {x^q, q = 1, 2, ..., Q} are Q i.i.d. training images and ŷ = {ŷ^q, q = 1, 2, ..., Q} are the corresponding label images. Exact ML estimation is intractable in general due to the combinatorial size of the label space in computing the partition function Z(x^q, θ). In addition, the ML method needs whole training images. However, as aforementioned, the most usual task in hyperspectral image classification is to select some samples from a given image for classifier training, and then use the learned classifier to classify the whole given image. Therefore, ML cannot be directly used for the task at hand. A feasible alternative is to estimate the parameters locally. On the one hand, this approach approximates the partition function Z(x^q, θ) efficiently; on the other hand, it uses only local training samples and is thus suitable for the usual hyperspectral image classification task. Pseudolikelihood estimation is a classical local estimation method, but as demonstrated in [50], its accuracy can be poor on some data. In this paper, we focus on an alternative, the piecewise training framework [50]. Within this framework, an efficient training method will be developed for the proposed DBN-CRF model.

Piecewise Training for DBN-CRF
The intuition of piecewise training is that if each factor ψ_a(y_a, x, θ) of an objective function can independently predict y_a from x accurately, then the prediction of the global factor graph will also be accurate [32,47,50]. Let a ∈ A be a graph factor composed of a set of sites, where A is the set of all graph factors. In this work, the objective function is divided into multiple factors according to the types of cliques. Let C_s be the set of cliques with s sites. Then a ∈ A is a clique c in the set A = {C_s, s = 1, 2, ...} ≡ C. With this special division for the CRF, the objective function in (9) is approximated in the piecewise training framework as

$$\ell_{PW}(\theta) = \sum_{c \in \hat{C}} \log \frac{\exp\big(\psi_c(\hat{y}_c, x, \theta)\big)}{\sum_{y_c} \exp\big(\psi_c(y_c, x, \theta)\big)}, \quad (10)$$

where Ĉ = ∪_s Ĉ_s is the set of all selected cliques, Ĉ_s denotes the set of cliques selected for training, and ŷ_c is the labels of the training samples over clique c. A "piece" corresponds to one term in (10), and that term would be the exact likelihood of the piece if the rest of the graph were omitted.
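A single "piece" of (10) is just a locally normalized log-likelihood over one clique; a small sketch (our own helper, using log-sum-exp for numerical stability):

```python
import numpy as np

def piece_log_likelihood(psi, y_obs):
    """One term of (10): psi(y_obs) - log sum_y exp(psi(y)), i.e. the exact
    log-likelihood of the clique if the rest of the graph were omitted.
    psi is an array over the clique's label configurations (1-D for a unary
    clique, M x M for a pairwise clique)."""
    flat = psi.ravel()
    log_z = flat.max() + np.log(np.exp(flat - flat.max()).sum())
    return psi[y_obs] - log_z

unary_ll = piece_log_likelihood(np.array([1.0, -1.0]), 0)     # unary piece
pair_ll = piece_log_likelihood(np.zeros((2, 2)), (1, 1))      # pairwise piece
```

With uniform pairwise scores, every one of the four label pairs gets probability 1/4, so the pairwise piece evaluates to -log 4.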
In this work, the CRF in (6) with only unary and pairwise potentials is used. Let {(x̂_k, ŷ_k), k = 1, 2, ..., K} be a set of training samples, where x̂_k is a selected training spectral vector, ŷ_k is the corresponding label, and K is the number of training samples. The objective function of the piecewise training framework of the proposed DBN-CRF can be written in the following factorized form:

$$\ell_{PW}(\theta) = \Upsilon_{\theta_u} + \Upsilon_{\theta_p} = \sum_{i \in \hat{C}_1} \log \frac{\exp\big(\psi_i(\hat{y}_i, x, \theta_u)\big)}{\sum_{y_i} \exp\big(\psi_i(y_i, x, \theta_u)\big)} + \sum_{(i,j) \in \hat{C}_2} \log \frac{\exp\big(\psi_{ij}(\hat{y}_i, \hat{y}_j, x, \theta_p)\big)}{\sum_{y_i, y_j} \exp\big(\psi_{ij}(y_i, y_j, x, \theta_p)\big)}, \quad (11)$$

where θ_u = {θ^u_{1:L}, U} and θ_p = {θ^p_{1:L}, V} are the sets of parameters of the unary and pairwise potentials, respectively, and Υ_{θu} and Υ_{θp} denote the first and second terms on the right-hand side of (11). Equation (11) shows that, under the piecewise training framework with this special division, DBN-CRF training can be implemented by independently training local classifiers over each kind of clique. Furthermore, we will demonstrate that, with the potentials defined as in (7) and (8), the local classifiers are exactly two DBN models with extra softmax layers.

Training for Unary Clique Potentials
With the definition of the unary potential in (7), the objective function Υ_{θu} in (11) can be written as

$$\Upsilon_{\theta_u} = \sum_{k=1}^{K} \log O_{\hat{y}_k}(\hat{x}_k), \quad (12)$$

with

$$O_{\hat{y}_k}(\hat{x}_k) = \frac{\exp\big(u_{\hat{y}_k}^T h_u^L(\hat{x}_k)\big)}{\sum_{m=1}^{M} \exp\big(u_m^T h_u^L(\hat{x}_k)\big)}. \quad (13)$$

Compared with (3) and (4), (12) is just the objective function of fine-tuning a DBN whose last-layer weight parameters w_u^{L+1} are the u_m. Therefore, the usual available algorithms can be directly used to learn the "piece" corresponding to the unary potential of the proposed DBN-CRF. In particular, the learning can be implemented by SGD, with the BP algorithm efficiently computing the gradients.

Training for Pairwise Clique Potentials
Based on the formulation of the pairwise clique potential in (8), the second term Υ_{θp} in (11) can be written as

$$\Upsilon_{\theta_p} = \sum_{(i,j) \in \hat{C}_2} \log O_{(\hat{y}_i, \hat{y}_j)}\big(\mu_{ij}(x)\big), \quad (14)$$

with

$$O_{(\hat{y}_i, \hat{y}_j)}\big(\mu_{ij}(x)\big) = \frac{\exp\big(v_{\hat{y}_i \hat{y}_j}^T h_p^L(\mu_{ij}(x))\big)}{\sum_{m=1}^{M} \sum_{n=1}^{M} \exp\big(v_{mn}^T h_p^L(\mu_{ij}(x))\big)}. \quad (15)$$

It can easily be noted, by comparing (4) and (15), that O_{(ŷ_i, ŷ_j)} is just the output of a DBN whose last layer is a softmax classifier over M^2 classes. Therefore, similar to the learning of the unary potential, (14) is also the objective function of fine-tuning a DBN whose last-layer weight parameters w_p^{L+1} are the v_{mn}. As before, SGD with the BP training algorithm can also be used to optimize (14).
We further analyze the usual real-world hyperspectral image classification. It is time-consuming to label the pixels near the spatial borders between different classes. Therefore, we usually do not have enough pair samples with different labels to learn the parameters v_{mn} (m ≠ n). To deal with this problem, and meanwhile to make the training procedure more efficient, as in the setting of [31], the parameter vector v_{mn} is set to 0 if m ≠ n, and we only consider the parameters {v_{mm}, m = 1, 2, ..., M}. Then (15) can be equivalently written as

$$O_{(\hat{y}_i, \hat{y}_j)}\big(\mu_{ij}(x)\big) = \frac{\exp\big(v_{\hat{y}_i \hat{y}_j}^T h_p^L(\mu_{ij}(x))\big)}{\tau + \sum_{m=1}^{M} \exp\big(v_{mm}^T h_p^L(\mu_{ij}(x))\big)}, \quad (16)$$

where τ = M(M − 1) is the constant contributed by the M(M − 1) off-diagonal terms exp(0) = 1. Thus (16) is the output of a DBN whose last layer is a softmax classifier over M + 1 classes, with all mixed label pairs collapsed into one extra class. Consequently, (14) is exactly the objective of fine-tuning a DBN with only M + 1 classes.
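The collapse to M + 1 classes can be checked with a small sketch (our own toy code): every mixed pair shares the score exp(0) = 1, so the outputs in (16), summed over all label pairs, add up to one:

```python
import numpy as np

def pair_output(m, n, h, V_diag):
    """O_(m,n) in (16): diagonal pairs use the score v_mm^T h; the M(M-1)
    mixed pairs each contribute exp(0) = 1, i.e. the constant tau = M(M-1)."""
    M = V_diag.shape[0]
    tau = M * (M - 1)
    diag_scores = np.exp(V_diag @ h)
    denom = tau + diag_scores.sum()
    return diag_scores[m] / denom if m == n else 1.0 / denom

h = np.array([1.0, -1.0])                     # pairwise deep feature h_p^L
V_diag = np.array([[0.5, 0.0], [0.0, 0.5]])   # v_11, v_22 for M = 2
total = (pair_output(0, 0, h, V_diag) + pair_output(1, 1, h, V_diag)
         + pair_output(0, 1, h, V_diag) + pair_output(1, 0, h, V_diag))
```

Only the M diagonal scores depend on the parameters, which is why the fine-tuning reduces to a softmax over M + 1 classes.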
To sum up, this section developed an end-to-end learning solution to jointly train the DBN and the CRF in the proposed DBN-CRF model. The learning of the proposed DBN-CRF model can finally be implemented efficiently as the learning of two DBNs, each with a softmax classifier as the last layer. The graphical representation of the proposed efficient learning method is shown in Fig. 4. We can see from the figure that the training data set in the usual training method consists of spatially continuous samples, while the proposed training method allows selecting samples randomly to construct the training sets for the different pieces. This gives the proposed method more flexibility to fit real-world hyperspectral image classification. In addition, the figure also shows that the DBN-CRF learning, divided into the learning of different DBNs corresponding to different potentials, can be implemented in a parallel way.

Model Combination in Inference
The previous subsections demonstrate that the proposed training method for the DBN-CRF divides the model into two DBNs with softmax classifiers as the last layers, which are trained independently. However, this parallel training method may lead to problems with over-counting during inference [51]. It is difficult to assess analytically the degree of over-counting introduced by the dependences between the different terms in the DBN-CRF model. Instead, as in our previous work [31] and [32], this work introduces scalar powers for each term. Thus, given the input test image x and the learned parameters, the label image is inferred as

$$\hat{y} = \arg\max_{y} \Big( \lambda_1 \sum_{i \in S} \psi_i(y_i, x, \theta_u) + \lambda_2 \sum_{i \in S} \sum_{j \in \eta_i} \psi_{ij}(y_i, y_j, x, \theta_v) \Big), \quad (17)$$

where λ_1 and λ_2 are the fixed powers for the unary and pairwise clique potentials, respectively. Because the fixed powers λ_1 and λ_2 function in the same manner by assigning weights to their corresponding potentials, λ_2 is fixed to one and only λ_1 needs to be adjusted. The optimal selection of the power is an area of active research. As in [51], this paper optimizes the power discriminatively using cross-validation. Then, based on the learned parameters and the selected power, the inference in (17) can be efficiently implemented by loopy belief propagation (LBP) [52].
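The paper implements the inference in (17) with LBP; as a simpler stand-in that still illustrates how the powers weight the two terms, the following sketch uses greedy ICM-style coordinate ascent (our own toy example, not the paper's algorithm):

```python
import numpy as np

def icm_decode(unary, pairwise, edges, lam1=0.9, lam2=1.0, iters=10):
    """Greedy coordinate ascent on lam1 * sum_i psi_i + lam2 * sum_(i,j) psi_ij.
    unary: (n, M) table of unary potentials; pairwise: shared M x M table."""
    n, M = unary.shape
    y = unary.argmax(axis=1)          # initialize from the unary term alone
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(n):
            scores = lam1 * unary[i].copy()
            for j in nbrs[i]:
                scores = scores + lam2 * pairwise[:, y[j]]
            y[i] = scores.argmax()
    return y

unary = np.array([[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]])  # middle site is ambiguous
pairwise = np.eye(2)                                     # reward agreeing neighbors
y_hat = icm_decode(unary, pairwise, [(0, 1), (1, 2)])
```

With a small λ1, the pairwise term dominates and the ambiguous middle site is pulled to agree with its neighbors; increasing λ1 would let the unary evidence override the smoothing.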

Experimental Data sets
In our experiments, two hyperspectral data sets were used to evaluate the proposed method. A hyperspectral image can be considered as a cube of observed pixels, made up of several 2-D arrays. Figs. 5 and 6 show the two data sets. 1) Indian Pines: This data set was acquired by the AVIRIS sensor over the Indian Pines test site in Northwestern Indiana; its original ground truth contains sixteen land-cover classes. 2) University of Pavia: This data set was taken by a sensor known as the reflective optics system imaging spectrometer (ROSIS-3) over the city of Pavia, Italy. The image contains 610 × 340 pixels and 115 bands collected over the 0.43–0.86 µm range of the electromagnetic spectrum. In the data available online, some bands were removed due to noise, and the remaining 103 channels were used for classification in this work. Nine land-cover classes were selected, which are shown in Fig. 6.

Experimental Setup
To evaluate the proposed method, the available labeled samples were randomly divided into a training set and a test set. The Indian Pines data set has sixteen different land-cover classes in the original ground truth. To make the experimental analysis more significant from the statistical viewpoint, following the setting in [40], eight classes were discarded since only few training samples were available for them. The remaining eight classes comprise 8598 labeled samples. For each class, 200 samples were randomly selected as training samples and the remaining samples were used as test samples. For the University of Pavia data set, all nine land-cover classes were used to validate the proposed method. We also randomly selected 200 samples as training samples for each class. Tables I and II list the training and test samples. The DBN structures were set as in [39]. The proposed DBN-CRFs were trained over the given training samples through the efficient training method proposed in Section 3. The learned models were then used to classify the whole hyperspectral images, i.e., solving the optimization in (17) through the LBP algorithm. The power parameter λ_1 in (17) was learned through cross-validation as 0.9.
1) Classification Results: As an example, Fig. 7 shows the learned weight parameters over the University of Pavia data set. Although they are not as distinctive as filters trained over 2-D visual signals, some structure in the filters can still be observed. The learned first-layer weights are localized continuous-structure filters, and the weights in the second and third layers appear as local singular filters, which could correspond to higher-level abstract representations of the input signals. These results are consistent with much prior work, especially in 2-D signal representation. Figs. 5 and 6 show the classification results for visual evaluation. Figs. 5(c)-(d) and 6(c)-(d) give the classification results of the DBN-CRF with only unary potentials (setting the parameters {v_{mm}, m = 1, 2, ..., M} to 0 in inference, named DBN-CRF-U) and with both unary and pairwise potentials. It can be noted that DBN-CRF-U, which does not take into account any neighborhood interactions in either the observed data or the labels, always results in noisy classifications. In contrast, the full DBN-CRF model obtained much better classification results: it preserved the shapes and details of the objects while simultaneously removing isolated classification noise. This clearly demonstrates how combining the unary potentials, which focus on discrimination from deep representations, and the pairwise potentials, which capture contextual information in both labels and deep representations, improves the classification results.
For quantitative evaluation, we computed the confusion matrices and the average values of the overall accuracy (OA), average accuracy (AA), and Kappa statistic (Kappa) over ten runs of training and testing. Tables III and IV present the confusion matrices over the Indian Pines and University of Pavia data sets, respectively. Inspection of the confusion matrices and per-class accuracies confirms that the most difficult classes to separate in the Indian Pines data set are corn-notill, corn-mintill, soybean-notill and soybean-mintill, while in the University of Pavia data set they are asphalt, bitumen and bricks. This was expected, as the spectral behaviors of these classes are quite similar within each data set. Table V presents the classification performance of DBN-CRF-U and the full DBN-CRF. The OA and AA of the full DBN-CRF over the Indian Pines data set are 92.15% and 94.22%, much better than the 88.34% and 91.53% of DBN-CRF-U. Over the University of Pavia data set, the full DBN-CRF obtained 94.02% OA and 94.62% AA, again higher than the 91.24% and 92.43% of DBN-CRF-U. In addition, Table V shows that the full DBN-CRF also obtained better Kappa values than DBN-CRF-U.
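The OA, AA and Kappa measures reported here can all be derived from a confusion matrix. The following is a standard sketch of that computation (`accuracy_metrics` is an illustrative helper, not code from the paper):

```python
import numpy as np

def accuracy_metrics(cm):
    """Overall accuracy (OA), average accuracy (AA) and Cohen's Kappa
    from a confusion matrix cm, where cm[i, j] counts samples of true
    class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                       # fraction classified correctly
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean of per-class accuracies
    # chance agreement from the marginal class distributions
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```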
2) Effects of Training Set Size on Classification Accuracies: These experiments verify that the proposed method is suitable for classifying data sets with limited training samples. Over the Indian Pines data set, eight situations were analyzed: 50, 100, 150, 200, 250, 300, 350 and 400 samples of each class were randomly selected to train the models, and the remaining samples were used to evaluate the classification accuracy. Over the University of Pavia data set, we analyzed seven situations with 100, 200, 300, 400, 500, 600 and 700 training samples per class. Fig. 8 shows the results of the different models. The proposed full DBN-CRF consistently provides higher accuracy than the DBN-CRF-U. Although the classification accuracy of the DBN-CRF decreases as the training set shrinks and the test set grows, it remains relatively stable overall.
3) Computational Performance: Finally, we evaluate the computational performance of the proposed method. Within the piecewise training framework, training the DBN-CRF model parameters is admittedly time-consuming. However, the proposed DBN-CRF shares the usual advantages of deep learning algorithms, such as relatively fast inference and a good representation of the hyperspectral image, and hence good classification performance. Moreover, the efficiency of our method can be greatly improved. Firstly, the proposed training method naturally allows the unary and pairwise DBNs to be trained in parallel. Secondly, the pre-training and fine-tuning of the DBNs can be modified to run on GPUs, which significantly accelerates the training procedure.

Comparison to Other Methods
To thoroughly evaluate the performance of the proposed method, we ran several sets of experiments to compare it with the most recent results in hyperspectral image classification. Firstly, we compared our method with a successful SVM-based method to show the performance difference between our deep method and state-of-the-art 'shallow' methods. Secondly, we compared our method with the recently developed deep CNN model, which does not use contextual (spatial) information; this comparison demonstrates the ability of our method to exploit contextual (spatial) information and the importance of that information in hyperspectral image classification. Finally, the proposed method was compared with the recent DBN with spectral-spatial information in the observations; this comparison shows the advantage of our method in modeling and using the contextual information in both the observations and the labels. We use McNemar's test, based on the standardized normal test statistic [53], to assess the statistical significance of the accuracy differences between methods. The statistic is computed as F_ij = (f_ij - f_ji) / sqrt(f_ij + f_ji), where F_ij measures the pairwise statistical significance of the difference between the accuracies of the i-th and j-th methods, and f_ij is the number of samples classified correctly by the i-th method but wrongly by the j-th. At the 95% level of confidence, the accuracy difference between two methods is statistically significant if |F_ij| > 1.96. Over the Indian Pines data set, the SVM-Poly obtained OA, AA and Kappa of 87.65%, 91.01% and 0.8499, while the proposed method obtained the better result of 92.15%, 94.22% and 0.9044. Over the University of Pavia data set, the proposed method also produced much better results than the SVM-Poly. In addition, the computed |F_ij| between the SVM-Poly and our method
over the Indian Pines and University of Pavia data sets are 15.73 and 19.01, which means that the improvement of our method over the SVM-Poly is statistically significant. Since the SVM-Poly is a typical 'shallow' classifier, this comparison demonstrates that the DBN representations from deep learning and the contextual information captured by the CRF model benefit hyperspectral image classification.
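McNemar's standardized statistic used in these comparisons reduces to a one-line computation (a sketch; `mcnemar_z` is a hypothetical name):

```python
import math

def mcnemar_z(f_ij, f_ji):
    """Standardized McNemar test statistic for comparing two classifiers.
    f_ij: samples classified correctly by method i but wrongly by method j;
    f_ji: the reverse. |z| > 1.96 means the accuracy difference is
    significant at the 95% confidence level."""
    return (f_ij - f_ji) / math.sqrt(f_ij + f_ji)
```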
2) Comparison to shallow CRF: A CRF for hyperspectral image classification was proposed in our previous work [31]. Multinomial logistic regression (MLR) was used to define the unary potentials, while the pairwise potentials were defined as an Ising model. Combined with the proposed efficient training method, this CRF is fit for real-world hyperspectral image classification, and it can capture and use the important contextual information in both the observations and the labels; it therefore obtained promising results over real-world hyperspectral images. However, the MLR used to define the unary potentials is a shallow discriminative classifier, so this CRF is in fact a shallow discriminative model. Moreover, its discriminative nature makes the shallow CRF focus mainly on discriminating the different land-cover classes, so it lacks the typical ability of generative models to use a good description of the observations to improve classification performance. Although the shallow CRF and the proposed DBN-CRF have similar formulations of the potentials, the features used in the DBN-CRF come from deep representation models. The two models therefore have a similar ability to capture contextual information, but the DBN-CRF gains additional benefits from the deep representation, which lead to much better classification results (see the experimental results in Tables VI and VII). Moreover, the computed |F_ij| between the shallow CRF and our method over the Indian Pines and University of Pavia data sets are 13.28 and 16.36, which means that the difference between our method and the shallow CRF is statistically significant. This experimentally validates that deep representations combined with contextual information can significantly benefit classification performance.
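The energy minimized by such a CRF can be sketched generically: class probabilities (e.g., from the MLR) supply the unary terms and an Ising-style penalty supplies the pairwise terms. This is an illustration under simplifying assumptions, not the exact potentials of [31]:

```python
import numpy as np

def crf_energy(unary_probs, labels, v=1.0):
    """Energy of a labeling on a grid CRF: unary terms from per-pixel
    class probabilities plus an Ising-style pairwise penalty that
    charges v for each 4-neighbour pair with different labels.
    unary_probs: (H, W, C) class probabilities; labels: (H, W) ints."""
    h, w = labels.shape
    # unary: negative log-probability of the chosen label at each pixel
    unary = -np.log(unary_probs[np.arange(h)[:, None],
                                np.arange(w)[None, :], labels])
    # pairwise: Ising penalty on horizontal and vertical neighbours
    pair = v * ((labels[:, 1:] != labels[:, :-1]).sum()
                + (labels[1:, :] != labels[:-1, :]).sum())
    return unary.sum() + pair
```

Lower energy corresponds to a labeling that both agrees with the classifier and is spatially smooth, which is what suppresses the isolated noise seen in the unary-only results.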
3) Comparison to spectral CNN: CNNs are biologically inspired, multilayer deep learning models. They have demonstrated excellent performance on various visual tasks, including the classification of common two-dimensional images. Work [40] introduced the CNN into hyperspectral image classification and produced very promising results; therefore, we further compare our method with this CNN. Its architecture contains five layers: the input layer, a convolutional layer, a max-pooling layer, a full-connection layer, and the output layer. Although this CNN can capture some contextual information through the convolutional and pooling layers, it operates over only the spectral domain and thus neglects the important spatial information of the hyperspectral images. We therefore refer to it as the spectral CNN in this paper.
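A forward pass through such a five-layer network can be sketched in a few lines of NumPy. The shapes and the ReLU nonlinearity below are illustrative assumptions; the original network in [40] may differ in these details:

```python
import numpy as np

def spectral_cnn_forward(x, conv_k, fc_w, fc_b, pool=2):
    """Minimal forward pass of a five-layer spectral CNN:
    input -> 1-D convolution over the spectral axis -> max pooling ->
    fully connected layer -> softmax output.
    x: (B,) spectral vector; conv_k: (k,) kernel; fc_w: (C, F) weights."""
    k = len(conv_k)
    # valid 1-D convolution along the spectral axis
    conv = np.array([np.dot(x[i:i + k], conv_k)
                     for i in range(len(x) - k + 1)])
    conv = np.maximum(conv, 0)  # ReLU (assumed nonlinearity)
    # non-overlapping max pooling, truncating any remainder
    pooled = conv[:len(conv) // pool * pool].reshape(-1, pool).max(axis=1)
    logits = fc_w @ pooled + fc_b  # fully connected layer
    e = np.exp(logits - logits.max())
    return e / e.sum()             # softmax class posteriors
```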
For a fair comparison, our method was run under the same experimental setup as in work [40], and we directly used the results reported there. However, only part of the results needed for our evaluation are presented in [40]: for the Indian Pines data set only the OAs are provided, while for the University of Pavia data set, although [40] likewise reported only the OAs, we calculated the AAs and Kappa values from the available results in [40]. The results show that the proposed DBN-CRF model produced better results than the CNN. This means that, besides the deep representation of the spectral observations, the spatial information in the hyperspectral images also plays a very important role in improving hyperspectral image classification.
4) Comparison to spectral-spatial DBN: Finally, the proposed DBN-CRF model is compared with another deep learning method, the DBN with a logistic regression (LR) classifier as its last layer (DBN-LR). Work [39] implemented two DBN-LR classifiers. The first uses only the spectral signatures as input, trains the DBN to extract deep features, and feeds these features into the LR to obtain the final labels. The same training and test data sets, with the sizes presented in Tables I and II, were used to train the DBN-LR, and the experimental settings for training and testing follow [39]. The results in Tables VI and VII show that the DBN-LR produced better results than the 'shallow' SVM, demonstrating the merits of deep representation shared by deep learning methods. Comparing the DBN-LR with the analogous deep model, the spectral CNN [40], shows that the spectral CNN obtained slightly better results. This likely stems from the fact that the spectral CNN can model contextual information across spectral bands through its convolutional and pooling layers, while the DBN-LR cannot sufficiently use such information because the variables within each RBM layer (stacked to form the DBN-LR) are assumed to be independent. Tables VI and VII also show that our method produced better results than both the DBN-LR and the spectral CNN.
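The first DBN-LR pipeline (spectral input, stacked RBMs, LR classifier) can be approximated with off-the-shelf components. The sketch below uses scikit-learn's `BernoulliRBM`; note that this greedy pipeline lacks the global fine-tuning pass of a true DBN, so it is only an illustrative approximation of [39]:

```python
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def make_dbn_lr(n_hidden=(64, 32)):
    """Stacked RBMs extract deep spectral features and a logistic
    regression layer predicts the land-cover label. Each RBM is trained
    greedily on the output of the previous one when the pipeline is fit."""
    steps = [(f"rbm{i}", BernoulliRBM(n_components=h, n_iter=5,
                                      random_state=0))
             for i, h in enumerate(n_hidden)]
    steps.append(("lr", LogisticRegression(max_iter=200)))
    return Pipeline(steps)
```

Calling `fit(X, y)` on spectra scaled to [0, 1] trains the stack layer by layer and then the LR head; `predict` returns per-pixel labels.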
To further improve the performance of the DBN-LR, work [39] proposed a deep architecture that combines spectral-spatial feature extraction and classification. The new method (named DBN-LR-S-S) performs a principal component analysis (PCA) transformation of the hyperspectral image, concatenates the first several principal components of the neighboring sites to form the contextual feature of each site, and finally feeds the extracted contextual features into the DBN and LR to obtain the labels. The results in Tables VI and VII show that the DBN-LR-S-S not only outperformed the DBN-LR but also produced better results than the CNN, confirming that the spatial information in the observations benefits hyperspectral image classification. Our method further captures and uses the spatial information in both the observations and the labels, and thus obtained better results than the DBN-LR-S-S. Moreover, the computed |F_ij| between the DBN-LR-S-S and our method over the Indian Pines and University of Pavia data sets are 6.39 and 11.37, so the superiority of our method over the DBN-LR-S-S is statistically significant. In particular, if the same contextual features used in the DBN-LR-S-S are used in our method (named DBN-CRF-S-S), the classification performance can be further improved, and the improvement is statistically significant (see the computed |F_ij| in Tables VI and VII). These comparisons demonstrate that deep representations and contextual information in both the observations and the labels play an important role in hyperspectral image classification, and that our method can effectively fuse this information through the designed DBN-CRF structure and the proposed end-to-end training method.
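The DBN-LR-S-S feature construction (PCA followed by neighborhood concatenation) can be sketched as follows. The number of components and the window size are illustrative; the exact values in [39] may differ:

```python
import numpy as np

def spatial_pca_features(img, n_pc=3, win=3):
    """Spectral-spatial features in the spirit of DBN-LR-S-S: PCA over
    the spectral bands, then the first n_pc principal components of
    every pixel in a win x win neighbourhood are concatenated per site.
    img: (H, W, B) hyperspectral cube; returns (H, W, n_pc*win*win)."""
    h, w, b = img.shape
    flat = img.reshape(-1, b)
    flat = flat - flat.mean(axis=0)
    # PCA via eigendecomposition of the band covariance matrix;
    # eigh returns eigenvalues in ascending order, so reverse the columns
    _, vecs = np.linalg.eigh(np.cov(flat, rowvar=False))
    pcs = (flat @ vecs[:, ::-1][:, :n_pc]).reshape(h, w, n_pc)
    pad = win // 2
    padded = np.pad(pcs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # stack the shifted copies so each site carries its neighbourhood
    feats = [padded[i:i + h, j:j + w] for i in range(win)
             for j in range(win)]
    return np.concatenate(feats, axis=2)
```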

Conclusion and Discussion
In this paper, we have proposed a novel DBN-CRF model for hyperspectral image classification. The proposed model takes advantage of the strength of DBNs in learning deep representations and of CRFs in modeling contextual (spatial) information in both the observations and the labels. The deep representations and the contextual information in the hyperspectral image are thus combined to improve classification. In addition, the proposed end-to-end training method jointly trains the DBN and the CRF very efficiently through the standard back-propagation algorithm. The experimental results over real-world hyperspectral data validate the efficiency and effectiveness of our method in the classification task.
Our current method uses DBNs to extract the deep features that define the potentials of the CRF. Other deep learning methods, such as deep CNNs, also have great potential for hyperspectral image classification and can, in principle, be directly introduced into our method. In particular, if the last layer of the deep model is a classifier similar to the softmax, the training of the new model can likewise be implemented as the training of several simple models. Another direction for future work is to use more diverse contextual information in the CRF model by defining different potentials. The current pairwise potentials model only the contextual information between sites with the same class label; the contextual information between sites of different land-cover classes can be modeled by allowing non-zero pairwise CRF parameters across classes. A further way to increase the diversity of contextual information is to introduce high-order potentials, which can model high-order statistics and capture long-range contextual information.
Recent works have proposed different end-to-end training methods for joint CNN and CRF models in natural image semantic segmentation. When applied to hyperspectral image classification, however, these methods still face the lack of sufficient training samples for the CNN. Considering the merit of the DBN in unsupervised training, we instead develop a new model that combines the DBN and the CRF to deal with these problems simultaneously. Specifically, this work proposes a new hybrid DBN and CRF model (named DBN-CRF) with the CRF potentials defined over the deep features from the DBN.

Figure 1. Illustration of the graphical representation of the DBN for hyperspectral image representation. A softmax layer is added to the DBN to use the semantic information of the training samples and the back-propagation algorithm to fine-tune the parameters of the DBN.

Figure 3. Graphical representation of the proposed DBN-CRF for spectral-spatial hyperspectral image classification.
The parameters of the DBN-CRF include the weights W and biases B of the unary and pairwise DBNs, and U = {u_m; m = 1, ..., M} and V = {v_mn; m, n = 1, ..., M} in the unary and pairwise potentials of the CRF. Our objective is to develop an end-to-end training method that estimates the parameters of the DBN and the CRF simultaneously. Using the CRF definition in (5), the classical maximum-likelihood parameter estimation chooses the parameter values θ = {W, B, U, V} that minimize the negative log-likelihood.
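Under a standard CRF formulation, the negative log-likelihood to be minimized takes the generic form below, where \(\phi_u\) and \(\phi_p\) stand for the unary and pairwise potentials, \(\mathcal{N}\) for the set of neighboring site pairs, and \(Z\) for the partition function. These are placeholder symbols for illustration, not the paper's exact notation:

```latex
\min_{\theta}\; -\log P(\hat{Y}\mid X;\theta)
  = -\sum_{i} \phi_u\!\left(\hat{y}_i, X; W, B, U\right)
    -\sum_{(i,j)\in\mathcal{N}} \phi_p\!\left(\hat{y}_i, \hat{y}_j, X; W, B, V\right)
    + \log Z(X;\theta)
```

The \(\log Z\) term is what makes exact maximum-likelihood training expensive and motivates the piecewise approximation described next.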

Figure 4. Graphical model of the piecewise training for DBN-CRF. The left of the equal sign is the original DBN-CRF training; the right is the version trained by the proposed piecewise training method. This figure presents factors only up to the pairwise ones.

Figure 5. Indian Pines data set and its example classification results. (a) Original image produced by a mixture of three bands. (b) Ground truth with eight classes. (c) and (d) The classification results of DBN-CRF-U and DBN-CRF. (e) Map color.

Figure 6. University of Pavia data set and its example classification results. (a) Original image produced by a mixture of three bands. (b) Ground truth with nine classes. (c) and (d) The classification results of DBN-CRF-U and DBN-CRF. (e) Map color.

Figure 7. Example results of the learned weight parameters over the University of Pavia data set: (a)-(d) are the learned weight parameters from layer 1 to layer 4.

Figure 8. Classification accuracies versus the number of training samples (per class) for the Indian Pines data set (a) and the University of Pavia data set (b).
Tables VI and VII show the classification results of the different methods over the Indian Pines and University of Pavia data sets, respectively. 1) Comparison to SVM: The SVM-based method can be deemed the benchmark 'shallow' hyperspectral image classification method. The SVM-based method and our CRF were trained and tested on the same training and test data sets, with the sizes presented in Tables I and II.