Anomaly Detection in Big Data Based on Clustering

Selecting the right tool for anomaly (outlier) detection in Big data is a pressing task. In this paper, algorithms for data clustering and outlier detection that take into account the compactness and separation of clusters are proposed, and the features of their use in this capacity are considered. Numerical experiments on real data of different sizes demonstrate the effectiveness of the proposed algorithms.


Introduction
When analyzing data, information quality is of paramount importance. This task is complicated by the ever-growing volume of collected information: working with Big data requires large computational resources. In this regard, researchers pay special attention to the development of effective methods for anomaly (outlier) detection.
The importance of the task being solved has led to the emergence of a whole range of methods in this area. The methods differ from each other in ease of implementation, suitability for data processing, and the basic principles underlying them.
Among them are clustering methods. Clustering technology is used in many areas: medicine, archeology, information systems, etc. Data clustering is also often used as an initial step in data analytics.
The aim of this paper is to develop a clustering approach for anomaly detection in real Big data. The paper develops algorithms that minimize the compactness of clusters and maximize the separation of clusters from each other, according to the distances between their centers and the remoteness of the cluster centers from the selected common center of points in the dataset.
The number of clusters is usually not known in advance and is established according to some subjective criterion. In this paper, the number of clusters is therefore determined according to [1].
To illustrate the viability of the developed anomaly detection algorithms, the results of examples of small, medium and large real data sets are presented.
The rest of the paper is organized as follows. Section 2 gives a literature review of existing works on clustering of large amounts of data and outlier detection. The proposed clustering algorithms are described in Section 3. In Section 4, datasets and clustering evaluation metrics are presented. The experimental results and discussion are given in Section 5, followed by conclusions in Section 6.

Proposed Algorithms
This section describes the proposed algorithms for anomaly detection.
Let us introduce the following notation: x_i (i = 1, …, n) is a point from the dataset, where n is the total number of points in the input dataset; c_p (p = 1, …, k) is the number of the p-th cluster, where k is the number of clusters; S_W is the compactness of the clusters; S_BW is the separation of the clusters from each other; and S_B is the measure of the remoteness of each cluster center O_p from the center O of all points in the input dataset. The first proposed algorithm for anomaly detection is as follows:
Step 1: Find the center O of all points of the dataset.
Step 2: Set s = 0.
Step 3: Calculate the compactness S_W according to (1).
Step 4: Calculate the separation of clusters S_BW according to (2).
Step 5: Calculate the remoteness S_B according to (3).
Step 6: Calculate the value of the objective function taking into account (1)-(4).
Step 7: Set s = s + 1.
Step 8: Repeat Steps 3-7 until the convergence condition is met, where s is the number of iteration steps.
Step 9: Return the values of IDX.
End

In the second algorithm, the task is to maximize an objective function in order to detect anomalies in the dataset. The steps of the algorithm are as follows:
Step 1: Find the center O of all points of the dataset.
Step 2: Set s = 0.
Step 3: Calculate the compactness S_W according to (1).
Step 4: Calculate the separation of clusters S_BW according to (2).
Step 5: Calculate the remoteness S_B according to (3).
Step 6: Calculate the value of the objective function taking into account (1)-(4).
Step 7: Set s = s + 1.
Step 8: Repeat Steps 3-7 until the convergence condition is met, where s is the number of iteration steps.
Step 9: Return the values of IDX.
End

In the third algorithm, the task consists in maximizing the objective function with a regularization parameter α (0 ≤ α ≤ 1), which is determined experimentally. The steps of the algorithm are as follows:
Step 1: Find the center O of all points of the dataset.
Step 2: Set s = 0.
Step 3: Calculate the compactness S_W according to (1).
Step 4: Calculate the separation of clusters S_BW according to (2).
Step 5: Calculate the remoteness S_B according to (3).
Step 6: Calculate the value of the objective function taking into account (1)-(4).
Step 7: Set s = s + 1.
Step 8: Repeat Steps 3-7 until the convergence condition is met, where s is the number of iteration steps.
Step 9: Return the values of IDX.
End
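The three variants share the same iterative skeleton, which can be sketched in Python as follows. Since equations (1)-(4) are not reproduced here, the code uses plausible stand-ins: within-cluster sum of squares for S_W, pairwise squared distances between centers for S_BW, and squared distances of the centers to the global center for S_B. The combined objective, the function name, and the convergence test on the change in the objective value are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def cluster_anomaly_detect(X, k, alpha=0.5, max_iter=100, tol=1e-6, seed=0):
    """Iterative clustering sketch: each pass reassigns points, updates
    centers, and recomputes an objective built from compactness (S_W),
    separation (S_BW) and remoteness (S_B)."""
    rng = np.random.default_rng(seed)
    O = X.mean(axis=0)                                   # Step 1: center of all points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_obj = np.inf
    for s in range(max_iter):                            # Steps 2-8
        # Assign each point to its nearest center (cluster labels IDX)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        idx = d.argmin(axis=1)
        # Update centers, keeping the old center for an empty cluster
        centers = np.array([X[idx == p].mean(axis=0) if np.any(idx == p)
                            else centers[p] for p in range(k)])
        # Stand-ins for (1)-(3): compactness, separation, remoteness
        S_W = sum(((X[idx == p] - centers[p]) ** 2).sum() for p in range(k))
        S_BW = sum(np.linalg.norm(centers[p] - centers[q]) ** 2
                   for p in range(k) for q in range(p + 1, k))
        S_B = sum(np.linalg.norm(centers[p] - O) ** 2 for p in range(k))
        obj = S_W - alpha * (S_BW + S_B)                 # hypothetical combined objective
        if abs(prev_obj - obj) < tol:                    # Step 8: convergence condition
            break
        prev_obj = obj
    return idx                                           # Step 9: return IDX
```

Points assigned to small or distant clusters can then be flagged as anomalies; the paper's decision rule operates on the returned IDX labels.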

Datasets and Evaluation Metrics
This section describes the datasets used in the experiments and the evaluation metrics.

Datasets
The experiments were performed on six datasets from the UCI repository [15,16]: the Diabetic Retinopathy Debrecen dataset (Diabetic), the Phishing dataset, Banknote authentication, the Forest CoverType dataset (Covertype), the NSL-KDD dataset [17] and the Spambase dataset (Spam). The Diabetic Retinopathy (DR) Debrecen Dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not [18,19]. It contains 19 features (the Euclidean distance between the center of the macula and the center of the optic disc, the binary result of the AM/FM-based classification, etc.) with 1151 samples.
The Banknote Authentication Dataset was extracted from images taken from genuine and forged banknote-like specimens [16]. The dataset contains four features (variance of the WT image, skewness of the WT image, kurtosis of the WT image, and entropy of the image) with 1372 samples.
NSL-KDD Dataset of attack signatures was constructed based on KDD-99 database [20]. The database contains training (125973 samples) and test (22544 samples) sets. Labels are assigned to each instance either as an "attack" type or as "normal" behavior. The total number of samples (148517) was considered in this paper.
The Spambase Dataset contains spam and non-spam e-mails [16]. It includes features that indicate whether a particular word or character occurs frequently in the e-mail and that measure the length of sequences of consecutive capital letters. The dataset contains two classes: spam (1) or non-spam (0).
Phishing Dataset contains 11055 phishing websites [22]. It includes 30 attributes (using the IP address, URL length, abnormal URL, website forwarding, etc.). This data belongs to one of the two classes labeled as Phishy (-1) and Legitimate (1).
A comparison of clustering solutions is based on counting pairs of points. Based on the results, a decision is made: "normal" or "abnormal" behavior. The most well-known clustering distance metrics based on pairs of data points are purity [24,25], the Mirkin metric [26], the partition coefficient [27], the variation of information [28], the F-measure [29] and the V-measure [29].
Purity. The purity of a cluster C_p is the ratio of the size of the dominant class in the cluster to the size of the cluster itself [24,25,30]. The value of purity always lies in the interval [0, 1]. The purity of the entire collection of clusters is evaluated as a weighted sum of the individual cluster purities, where k+ is the initial number of classes and k is the number of clusters to be found. A higher purity value indicates a better clustering solution.
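As a sketch, the weighted purity described above can be computed as follows; the function name is ours, and the dominant-class counting follows the standard definition rather than a formula taken from the paper.

```python
import numpy as np
from collections import Counter

def purity(labels_true, labels_pred):
    """Weighted purity: for each predicted cluster, take the count of its
    dominant true class, sum over clusters, and divide by the dataset size."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    total = 0
    for p in np.unique(labels_pred):
        members = labels_true[labels_pred == p]
        # size of the dominant class inside cluster p
        total += Counter(members.tolist()).most_common(1)[0][1]
    return total / n
```

A perfect clustering yields purity 1; assigning points at random drives it toward the share of the largest class.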
Mirkin metric. The Mirkin metric is a pair-counting distance between clusterings [26]. The smaller the value of the metric, the better the clustering.
F-measure. Another evaluation measure, also known as clustering accuracy, is based on the F-value of the cluster C_p and the class C+_p+, that is, the harmonic mean of the precision and the recall. Precision is the fraction of the cluster that belongs to the class, and recall is the fraction of the class that is contained in the cluster [23]. The F-measure of the cluster C_p is the maximum F-value attained for any class in the entire set of classes. The F-measure of the entire dataset is the weighted sum of the individual cluster F-measures. The higher the F-measure, the better the clustering solution.
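The weighted F-measure can be sketched as follows, assuming the standard pairings of precision (overlap divided by cluster size) and recall (overlap divided by class size); the function name is illustrative.

```python
import numpy as np

def f_measure(labels_true, labels_pred):
    """Weighted F-measure: each cluster gets the best F-value over all
    classes, and clusters are weighted by their size."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    classes = np.unique(labels_true)
    total = 0.0
    for p in np.unique(labels_pred):
        in_cluster = labels_pred == p
        best = 0.0
        for j in classes:
            in_class = labels_true == j
            overlap = np.logical_and(in_cluster, in_class).sum()
            if overlap == 0:
                continue
            precision = overlap / in_cluster.sum()   # cluster-side share
            recall = overlap / in_class.sum()        # class-side share
            best = max(best, 2 * precision * recall / (precision + recall))
        total += in_cluster.sum() / n * best
    return total
```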
Partition coefficient (PC). This coefficient is used to compare the distributions C = (C_1, ..., C_k) and C+ [27], and is calculated according to [31]. A higher value of PC(C, C+) indicates a better clustering solution.
Variation of information (VI). This metric measures the amount of information that is gained and lost when going from the clustering C to another clustering C+ [28,31].
In general, the smaller the VI, the better the clustering solution.
V-measure. The V-measure is an entropy-based measure that explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied [25]. Homogeneity is defined as 1 − H(C+|C)/H(C+), where the conditional entropy H(C+|C) is equal to 0 when each cluster contains only members of a single class, a perfectly homogeneous clustering. In the degenerate case when H(C+) is equal to 0, i.e., when there is only a single class, homogeneity is defined to be 1.
Completeness is symmetric to homogeneity and is defined as 1 − H(C|C+)/H(C). The V-measure of the clustering solution is calculated as the harmonic mean of homogeneity and completeness. The computation of homogeneity, completeness, and the V-measure is completely independent of the number of classes and clusters, the size of the dataset, and the clustering algorithm.
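Assuming the standard entropy-based definitions of Rosenberg and Hirschberg cited above, homogeneity, completeness and the V-measure can be computed as follows; the helper names are ours.

```python
import numpy as np

def _entropy(labels):
    """Shannon entropy of a label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def _cond_entropy(a, b):
    """H(a | b): entropy of a within each cluster of b, weighted by size."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    return sum((b == v).sum() / n * _entropy(a[b == v]) for v in np.unique(b))

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness, with the degenerate
    single-class / single-cluster cases defined as 1."""
    h_c, h_k = _entropy(classes), _entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - _cond_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - _cond_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

Note that the measure is invariant to a permutation of cluster labels, which is exactly the independence from the clustering algorithm mentioned above.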

Experimental Results and Discussion
A number of experiments were implemented in Matlab 2016a on a 64-bit Windows machine with an Intel Core i7 2.5 GHz processor and 8 GB of RAM to evaluate the performance of the proposed algorithms. The Diabetic, Phishing, NSL-KDD, Banknote authentication, Spam, and Covertype datasets were used as initial data. The characteristics of the datasets are presented in Table 1. Six quality metrics of different nature were selected for the analysis. The datasets were divided into two classes: C+_1 and C+_2. During preprocessing, the values in the datasets were standardized. Samples in the C+_1 class were taken as anomalies.
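The standardization step can be sketched as z-score scaling per feature, a common choice; the paper does not specify the exact scheme, so the function below is an assumption.

```python
import numpy as np

def standardize(X):
    """Z-score standardization: zero mean and unit variance per feature;
    constant features are left unscaled to avoid division by zero."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0
    return (X - mu) / sigma
```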

The results of the proposed algorithms based on six metrics are presented below. Purity, the Mirkin metric, the F-measure, PC, VI, and the V-measure were considered as evaluation metrics. The best results are marked in bold.
The proposed approaches are compared with the k-means algorithm. Table 2 shows the experimental results on all datasets for the k-means algorithm. The best results according to the Purity, Mirkin, VI, and PC metrics were obtained for the Covertype dataset: 96.47%, 43.15%, 5.8% and 46.25%, respectively. The F-measure showed the best result on the NSL-KDD dataset. According to the Purity (53.08%), Mirkin (50%) and VI (17.83%) metrics, the lowest values were achieved for the Diabetic dataset.
The best results for the first proposed algorithm were obtained for the Covertype dataset: Purity = 96.47%, Mirkin metric = 8.04%, F-measure = 97.23%, VI = 1.46%, PC = 47.55% and V-measure = 1.0000 (Table 3). The analysis reveals that the second proposed approach yields a high quality of clustering for NSL-KDD according to the Mirkin and F-measure metrics, and for the Covertype dataset according to the Purity, VI and PC metrics (Table 4). The worst results were again obtained for the Diabetic dataset. The influence of the regularization parameter α on the performance of the proposed algorithm on different datasets was also considered (Table 5). In addition, at α = 0.6, 0.7, 0.8, 0.9 and 1, the best values of the Mirkin and F-measure metrics were achieved for the Banknote authentication dataset.
The V-measure does not have discriminating ability: its value is almost the same for all methods on the different datasets. It can therefore be concluded that the V-measure is not helpful for evaluating the clustering results, and it was not considered in the following comparisons.
To illustrate the viability of the developed anomaly detection algorithms, the results are presented in Figures 1-5. Figure 1 shows the results of the third implemented clustering algorithm for the purity metric. Based on the experimental results, it can be concluded that the proposed approach shows the best results for the Covertype and Banknote authentication datasets. The lowest values were observed for NSL-KDD, but at α = 0.9 an improvement can be seen.
The lowest results for the Mirkin metric for all α values were obtained for Covertype and Banknote authentication datasets (Figure 2). For the Spam dataset, the value of the metric practically did not change and was ∼41.8%.
In Figure 3, it can be seen that the highest results were obtained for the Covertype dataset at α from 0 to 0.9, while at α = 1 the value dropped sharply to 7.87%. The worst results according to the F-measure metric were obtained for the Diabetic dataset. The best results, i.e., the minimum values of the VI metric, were obtained for the Covertype, Banknote authentication and NSL-KDD datasets (Figure 4). The values of VI worsened with increasing α for the Diabetic dataset.
From Figure 5, it was found that the Covertype and Banknote authentication datasets achieve the best results for almost all values of α. For the Spam dataset, the metric value was practically constant. The values of the PC metric fall sharply at α from 0.6 to 1 for the Phishing dataset, and at α = 1 for NSL-KDD. A comparison of the evaluation metrics' values for the first (dark blue bars), second (blue bars) and third (yellow bars) proposed algorithms and the k-means algorithm (red bars) on the six datasets is illustrated more clearly in Figure 6. For the third algorithm, the best result for each metric over all values of α was selected. Based on the experimental results, it can be concluded that the first and third proposed approaches are superior to the k-means algorithm in all evaluation metrics on the Covertype dataset. The Purity, Mirkin, F-measure and PC metrics showed good results for the NSL-KDD dataset when applying the second algorithm. According to all metrics, the Banknote authentication dataset achieved the best result for the third algorithm, the Phishing dataset for the second algorithm, and the Diabetic dataset for the first and third algorithms.
The Purity, Mirkin, and F-measure metrics showed the best results for the second and third algorithms, while the PC metric was best only for the second algorithm on the Spam dataset. It can be concluded that the third algorithm works well on small and large datasets, while the second algorithm showed the best results on medium-sized datasets.

Conclusion
In this paper, new clustering algorithms were proposed for anomaly detection in Big data. The aim of the presented algorithms is to improve anomaly detection. Algorithms that minimize the compactness of clusters and maximize the separation of clusters from each other, according to the distances between their centers and the remoteness of the cluster centers from the selected common center of points in the dataset, were presented. The comparison was made using six datasets containing anomalous values. The quality of the clustering results was estimated using six evaluation metrics. An important feature of the proposed approaches is that they increase the accuracy of anomaly detection based on clustering. The performance of the proposed algorithms was compared with that of the k-means algorithm. It can be concluded that the proposed algorithms work efficiently on real datasets of different sizes.
It is important that the proposed approaches can be applied in various research fields. Future research will focus on the development and application of ensembles of clustering algorithms to anomaly detection.