Classification of Aircraft in Remote Sensing Images Based on Deep Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a class of Deep Learning (DL) models that have recently been exploited in many fields. In this work, we improve the performance of CNN-based multi-class classification of aircraft types in remote sensing images. In previous studies, intensive preprocessing limited the classification rate. To avoid under-fitting and over-fitting, we optimized the network architecture and parameters. To validate our method, we use the recent public image dataset Multi-Type Aircraft Remote Sensing Images (MTARSI). Extensive experiments demonstrate the effectiveness of the proposed method in terms of classification rate.


Introduction
Image classification is one of the most important fields of computer vision and machine learning. Its aim is to automatically assign predefined labels to images. Aircraft type classification is an important problem in remote sensing image processing and is widely used in civil and military applications. To address it, researchers have designed and implemented several image classification methods. Machine learning (ML) has emerged as one of the most successful artificial intelligence techniques and has achieved impressive performance in computer vision and image processing, with applications such as medical image classification [1] [2], remote sensing image scene classification [3], aircraft detection [17], and aircraft classification [5]. Classical ML algorithms rely on handcrafted features extracted from images to perform the classification; these methods are also called handcrafted descriptors. Recently, the use of DL in remote sensing image classification has been growing. It has been applied to diverse study areas, including vegetated areas, urban areas, wetlands, and forests [6]. As a result, the details of ground objects, such as contour, structure, and texture information, can be obtained conveniently. Among the DL algorithms used in classification, CNNs have gained popularity. Since 2012, CNNs have attracted more attention because of increasing computing power, the availability of lower-cost hardware, open-source algorithms, and the rise of big data [7]. Increasing depth is a typical trend in CNN design [8]. With greater depth, a CNN can approximate the target function with increased non-linearity and obtain better descriptor representations. However, the complexity of the network also increases, which makes it more difficult to optimize and more prone to under-fitting or over-fitting.
The main contributions of this work can be summarized as follows:
The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 presents the MTARSI dataset and the CNN model. Section 4 presents and discusses the results. Section 5 concludes the paper.

Related Work
The modern CNN was first presented in [9]. The authors developed a multi-layer artificial neural network called LeNet-5 that could classify handwritten digits [10]. Subsequently, many CNN models have been proposed for image classification and applied to different image-related topics. Previous research on three typical CNN application cases in remote sensing image classification (scene classification, object detection, and object segmentation) is reviewed in [11]. State-of-the-art CNNs for image classification, such as VGG [12], GoogLeNet [8], ResNet [13], DenseNet [14], and EfficientNet [15], have been successfully applied to the ImageNet dataset, which is available online [16]. Aircraft recognition in remote sensing images, based on reinforcement learning and convolutional neural networks, focuses on deciding whether an object is an aircraft or not [17]. Wu and Prasad suggested a recurrent neural network in which some convolutional layers are followed by recurrent layers. Middle-level and locally invariant descriptors are extracted from raw hyperspectral images (HSI), and spectral context features are extracted from the descriptors generated by the convolutional layers [18]. Fu et al. propose a fine-grained aircraft recognition method for remote sensing images. Their multi-class activation mapping uses two subnetworks, the target net and the object net, to fully use the descriptors of discriminative object parts [19]. Zhao et al. address aircraft type recognition by detecting the landmark points of an aircraft using a vanilla network: they designed a CNN-based keypoint detection model and a keypoint matching method to recognize aircraft, transforming the aircraft recognition problem into a landmark detection problem [?]. All of the above works were trained and tested on different datasets. Their only common point is scene classification with intensive preprocessing, which consequently affects the performance of the proposed systems.
Zhi-Ze Wu et al. examined the performance of five state-of-the-art CNN structures, namely VGG, GoogLeNet, ResNet, DenseNet, and EfficientNet, and found that EfficientNet outperforms the others in terms of classification rate [21]. Marmanis et al. used a CNN pretrained on the ImageNet challenge to extract an initial set of representations for Earth observation classification [22]. These methods for identifying object types in remote sensing images have achieved significant results, but many challenges remain. In the literature, there are numerous variants of CNN architectures with a huge number of parameters that need to be adapted to each case studied.

Proposed Approach
This section describes the MTARSI dataset used in this work, followed by a detailed description of the evaluated network architecture.

Dataset
Datasets available in the literature are divided into three groups according to the classification task: scene classification, object detection, and object segmentation. Among these datasets, MTARSI has the advantage that each labeled image includes a single type of aircraft, in different orientations. The MTARSI dataset was used for training and testing our proposed method. It is an open-source dataset for aircraft type recognition from remote sensing images. It contains a total of 9,385 remote sensing images that were taken from Google Earth satellite imagery and manually expanded. It covers 20 different aircraft types of different sizes across 36 airports. Each image contains exactly one complete labeled aircraft. The spatial resolution of the images ranges from 0.3 to 1 m, and the objects appear in various orientations, aspect ratios, and pixel sizes. In addition, the images vary with the altitude and nadir angle of the satellite and with the illumination. Some image patches contain cropped objects, and some examples are black-and-white images. These variations in MTARSI allow the trained aircraft classification architectures to achieve similar performance under different image conditions. The aircraft differ in type and model, as illustrated in Figure 1, which shows sample images of aircraft types extracted from the MTARSI dataset.

Convolutional Neural Network Architecture
A neural network can be viewed as a mathematical function that maps an input x to an output value y. In supervised classification, the function ψ assigns each input to one of a predefined set of output classes. More formally, given a training set, the objective is to learn a function ψ, called a hypothesis, such that ψ(x) is an optimal predictor of the corresponding value of y. From the MTARSI dataset used in this work, we extracted a total of N = 8779 images. Each image is 200 × 200 pixels with three color channels, so each image is represented by D = 200 × 200 × 3 = 120,000 distinct values, and there are C = 20 class labels for aircraft.
$$\psi : x \mapsto y \qquad (1)$$
where $y \in \{c_1, c_2, \ldots, c_{20}\}$ and $x = (x_1, \ldots, x_{120000})$ is the vector of pixel values.
The loss function is the cross-entropy between the predicted probability and the true label y; it measures the error by comparing the target label vector y with the predicted label vector ŷ. For N training samples, the loss function is defined as
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c} \qquad (2)$$
where $y_{i,c}$ is 1 if sample i belongs to class c and 0 otherwise, and $\hat{y}_{i,c}$ is the predicted probability of class c for sample i. The cross-entropy loss is numerically stable in training and converges faster when coupled with softmax normalization [23]. The loss function is used during training to find the parameter values of the proposed model. The loss is reported for both the training and testing processes and indicates how well the model is doing in each.
DL architectures have been developed and applied in different fields and have outperformed several algorithms in visual recognition. The structure of CNNs allows the model to learn highly abstract feature detectors and to map the input descriptors into representations that improve the performance of the subsequent classifier. The AlexNet structure contains filters as large as 11 × 11, whereas the recent trend is towards smaller filters. The ResNet architecture, apart from its 7 × 7 first layer, contains no filters larger than 3 × 3. The advantage of the CNN is its flexibility: layers can be added to or removed from its structure for a given task. Furthermore, many optional techniques can be used to train it. The CNN design is depicted in Figure 3. Generally, a CNN consists mainly of three parts: convolutional layers, pooling layers, and fully connected layers.
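Returning to the cross-entropy loss described earlier in this section, the following is a minimal sketch of how it can be computed, assuming one-hot label vectors and an average over the N samples. The helper name cross_entropy, the clipping constant eps, and the three-class toy data are illustrative assumptions, not taken from the paper.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over N samples.

    y_true: list of one-hot label vectors (length C each)
    y_pred: list of predicted probability vectors (length C each)
    eps:    assumed clipping constant that avoids log(0)
    """
    total = 0.0
    for t_vec, p_vec in zip(y_true, y_pred):
        # Only the true class contributes, because t is 0 for every other class.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_vec, p_vec))
    return total / len(y_true)

# Toy example with C = 3 classes instead of the 20 aircraft classes.
labels = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

loss_confident = cross_entropy(labels, confident)
loss_uncertain = cross_entropy(labels, uncertain)
print(loss_confident, loss_uncertain)
```

Predictions that put more probability on the true class give a lower loss, which is the quantity the training process drives down.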

Convolutional layers
On its own, a convolutional layer is a linear operation: without a subsequent nonlinearity, the network would not be sufficiently expressive to model the response variable (such as a class label). The convolutional layer computes the output feature map
$$h_s = W_s * x + b \qquad (3)$$
where $*$ is the two-dimensional discrete convolution operator, $x$ is the input, $W_s$ is the weight of the s-th filter, and $b$ is a bias parameter. The weights $W_s$ and the bias $b$ are adjusted during training.
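To make the convolution operation concrete, here is a small pure-Python sketch of a two-dimensional discrete convolution of one input channel with one kernel, plus a bias. The function name conv2d, the choice of "valid" borders (no padding), and the toy image are assumptions made for illustration only.

```python
def conv2d(image, kernel, bias=0.0):
    """Valid 2D discrete convolution of one channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    # Flip the kernel in both directions: this is what distinguishes
    # convolution from cross-correlation.
    flipped = [row[::-1] for row in kernel[::-1]]
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = bias
            for u in range(kh):
                for v in range(kw):
                    acc += flipped[u][v] * image[i + u][j + v]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # identity kernel: the output is the image interior plus the bias
feature_map = conv2d(image, kernel, bias=1.0)
print(feature_map)
```

Deep learning libraries usually skip the kernel flip and compute a cross-correlation; because the weights are learned, the two conventions lead to equivalent models.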

Nonlinear layer
In this layer, a nonlinear function is applied to each component of the feature map. The nonlinear layer is added after each convolution operation. Its activation function gives the network its nonlinear property. The Rectified Linear Unit (ReLU) is commonly chosen and is defined as
$$f(x) = \max(0, x) \qquad (4)$$
The output of ReLU is the maximum of zero and the input value. The main advantage of ReLU over other activation functions is that it does not activate all the neurons at the same time, and networks with ReLU train several times faster than their equivalents with other units.
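As a brief illustration of the ReLU activation, the sketch below applies it element-wise to a small feature map. The helper names relu and relu_map and the sample values are hypothetical.

```python
def relu(x):
    """Rectified Linear Unit: the maximum of zero and the input."""
    return max(0.0, x)

def relu_map(feature_map):
    """Apply ReLU to every element of a 2D feature map."""
    return [[relu(v) for v in row] for row in feature_map]

activated = relu_map([[-2.0, 0.5], [3.0, -0.1]])
print(activated)
```

Negative responses are set to zero, so only part of the units are active for a given input.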

Pooling layer
It follows a convolutional layer and is used to reduce the dimensionality of the feature maps and improve the robustness of the extracted features. This reduces the number of parameters, which both decreases the training time and helps avoid over-fitting. It is usually placed between two convolutional layers. The two most commonly used pooling operations are average pooling and max pooling. A detailed theoretical analysis can be found in [7]. Max pooling returns the maximum value from the portion of the feature map covered by the kernel.
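The two pooling operations mentioned above can be sketched as follows, assuming non-overlapping windows (stride equal to the window size). The function name pool2d and the 4 × 4 example are illustrative assumptions.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    reduce_fn = max if mode == "max" else (lambda w: sum(w) / len(w))
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the values covered by the window at position (i, j).
            window = [feature_map[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [0, 1, 3, 4]]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="average")
print(max_pooled, avg_pooled)
```

A 2 × 2 window halves each spatial dimension, so the 4 × 4 map becomes 2 × 2.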

Fully connected layer
In this layer, the output maps of the last convolutional or pooling layer are flattened into vectors, which serve as the inputs to the first fully connected layer. The output of the final fully connected layer is the learned feature extracted from the input image by the convolutional network. During training, the flattened output is fed to a feed-forward neural network, and backpropagation is applied at each iteration [24]. Over a fixed number of epochs, the model learns to distinguish between dominant and certain low-level descriptors in images and to classify them using the softmax function. Softmax is commonly used in multi-class classification tasks and is defined as
$$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}, \quad j = 1, \ldots, C \qquad (5)$$
where $z_j$ is the score of class j produced by the final fully connected layer. The softmax function returns values in the range [0, 1] that sum to 1, so its output can be viewed as a probability distribution. It defines a flexible learning task with adjustable margin [25].
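A minimal sketch of the softmax function follows, assuming the input is a list of raw class scores from the final fully connected layer. Subtracting the maximum score before exponentiating is a common numerical-stability choice and is an assumption here, as are the name softmax and the example scores.

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is the index with the highest probability; with 20 aircraft types the score list would have 20 entries.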

Experiment Results and Discussions
For the implementation, we used an HP PC with an 8th-generation Intel Core i5 CPU at 1.80 GHz and 4 GB of RAM, running 64-bit Windows 10, with Python 3.7 and Keras.

Data Preprocessing
In this study, 8779 images labeled with 20 aircraft classes were extracted from the MTARSI dataset. Because the images have different sizes, all of them were resized to 200 × 200 before training our CNN and normalized using the Keras ImageDataGenerator function. This normalization consists of dividing all pixel values by 255 so that they lie between 0 and 1. Once loaded, the images are fed to the CNN, which performs the classification.
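The preprocessing described above could look roughly like the sketch below. It assumes the TensorFlow-bundled Keras API and a directory layout with one sub-folder per aircraft class; the function names, the data_dir argument, and the batch size default are hypothetical, and the paper does not state how the images were actually loaded.

```python
TARGET_SIZE = (200, 200)  # resize target stated in the text
RESCALE = 1.0 / 255       # maps 8-bit pixel values from [0, 255] to [0, 1]

def normalize_pixels(pixels):
    """Pure-Python equivalent of the rescaling step for a flat list of pixel values."""
    return [p * RESCALE for p in pixels]

def make_generator(data_dir, batch_size=128):
    """Hypothetical loader: data_dir is assumed to hold one sub-folder per class."""
    # Imported lazily so the rest of the sketch runs without TensorFlow installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    datagen = ImageDataGenerator(rescale=RESCALE)
    return datagen.flow_from_directory(
        data_dir,
        target_size=TARGET_SIZE,    # every image is resized to 200 x 200
        batch_size=batch_size,
        class_mode="categorical",   # one-hot labels, one entry per class
    )

scaled = normalize_pixels([0, 128, 255])
print(scaled)
```

Only normalize_pixels is exercised here; make_generator is shown to indicate where the rescale factor and target size would be passed in a Keras pipeline.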

Model adopted
After data preprocessing and several experiments aimed at increasing the classification rate while avoiding over-fitting and under-fitting, the CNN we built contains three convolutional layers, each paired with batch normalization and followed by a max-pooling layer, and then a fully connected part with two hidden layers. Each filter has a kernel size of 3 × 3. There are 20 output nodes at the end of the CNN, each representing an aircraft type; the output is a vector of scores, one per class. The proposed CNN model is shown in Figure 4. Our CNN architecture has more than 8 million parameters.
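The architecture described above can be sketched in Keras as follows. The text does not give the number of filters, the hidden layer widths, the activations, the padding, or the pooling size, so the values used here (32, 64, 128 filters; hidden layers of 128 and 64 units; ReLU; "valid" padding; 2 × 2 pooling; two hidden layers plus a separate output layer) are assumptions chosen only for illustration. The helper count_params reproduces the parameter arithmetic without requiring Keras.

```python
def build_model(filters=(32, 64, 128), hidden=(128, 64), num_classes=20):
    """Hypothetical Keras version of the described CNN (layer sizes are assumed)."""
    # Imported lazily so count_params below runs without TensorFlow installed.
    from tensorflow.keras import layers, models
    model = models.Sequential()
    model.add(layers.Input(shape=(200, 200, 3)))
    for f in filters:
        model.add(layers.Conv2D(f, (3, 3), activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    for h in hidden:
        model.add(layers.Dense(h, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

def count_params(input_size=200, channels=3, filters=(32, 64, 128),
                 hidden=(128, 64), num_classes=20):
    """Count the parameters of the assumed architecture by hand."""
    total, size, c_in = 0, input_size, channels
    for f in filters:
        total += (3 * 3 * c_in + 1) * f  # conv weights + biases
        total += 4 * f                   # batch norm: gamma, beta, moving mean, moving variance
        size = (size - 2) // 2           # 3x3 'valid' conv, then 2x2 max pooling
        c_in = f
    units_in = size * size * c_in        # length of the flattened feature vector
    for h in hidden + (num_classes,):
        total += (units_in + 1) * h      # dense weights + biases
        units_in = h
    return total

total_params = count_params()
print(total_params)
```

With these assumed sizes the count is 8,770,964, which is compatible with the "more than 8 million parameters" stated above but does not confirm the actual layer sizes. Most of the parameters sit in the first fully connected layer.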

Training and Testing Process
After building the CNN, we trained it on 8,779 images for 20 epochs with a batch size of 128. It was compiled with the categorical cross-entropy loss function and the RMSprop optimizer with a learning rate of 0.0001. RMSprop is a gradient-based optimization technique; it was developed as a stochastic technique for mini-batch learning [26]. The classification rate, or accuracy, is one metric used to measure how often the algorithm classifies an image correctly, and is defined as
$$\text{Accuracy} = \frac{\text{Correct predictions}}{\text{All predictions}} \qquad (6)$$
Figure 4. Structure of CNN adopted.
After training, our system was tested on 2,122 images selected randomly from the training image dataset. The graphs in Figure 5 show the accuracy and loss versus the number of epochs for training and testing. As can be seen in Figure 5, the accuracy plots at each epoch show that our model suffers little from over-fitting. Beyond epoch 10, the test accuracy is slightly lower than the training accuracy. This implies that the proposed model obtains better performance with less complexity. The highest test accuracy is reached at epoch 10 and does not increase afterwards, while the training and test losses continue to decrease beyond epoch 10. The highest test accuracy over all epochs, 99.90%, is reported as the best score. A comparison with other algorithms developed by researchers using the same MTARSI dataset is presented in Table 1 in terms of average accuracy over all epochs.
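To illustrate the two ingredients of the training setup, the sketch below shows one RMSprop update for a single scalar weight and the accuracy metric. The learning rate 0.0001 comes from the text; the decay rate rho, the constant eps, the function names, and the toy objective f(w) = w^2 are assumptions.

```python
import math

def rmsprop_step(w, grad, cache, lr=1e-4, rho=0.9, eps=1e-7):
    """One RMSprop update for a single scalar weight (rho and eps are assumed values)."""
    cache = rho * cache + (1.0 - rho) * grad * grad  # running mean of squared gradients
    w = w - lr * grad / (math.sqrt(cache) + eps)     # step scaled by the gradient magnitude
    return w, cache

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Minimise f(w) = w^2 (gradient 2w) starting from w = 1.0.
w, cache = 1.0, 0.0
for _ in range(100):
    w, cache = rmsprop_step(w, 2.0 * w, cache)

acc = accuracy([0, 1, 2, 2], [0, 1, 2, 1])
print(w, acc)
```

Because each step is divided by the running gradient magnitude, the step size stays close to the learning rate, so a small learning rate such as 0.0001 moves the weights slowly but steadily.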

Conclusion
An approach based on convolutional neural networks for the multi-class classification of aircraft types is presented. The empirical results indicate that our approach provides superior results on the MTARSI dataset. The classification rate (90.66%) is a good result, obtained with only normalization as data augmentation and without model regularization. CNNs have been shown to be very powerful for image analysis and classification tasks in other fields. In the future,