Weighted Clustering for Anomaly Detection in Big Data

In this paper, a new method for anomaly detection based on weighted clustering is proposed. Each cluster is assigned a weight obtained by summing the weights of the points it contains. The proposed approach is compared with the k-means algorithm on seven datasets of medium and large size. The proposed approach increases the reliability of partitioning data into groups. Experimental results show that the proposed approach becomes more efficient as the size of the analysed dataset increases.


Introduction
With emerging computer technology, the amount of data a person can work with has increased significantly. For a long time, scientists have been developing algorithms aimed at simplifying work with data and at identifying new, previously unknown knowledge stored in the data. Computing systems are limited in the storage, analysis, and processing of Big data due to its volume, velocity, or variety. Big data can be further characterized by adding veracity and value to volume, variety, and velocity [1]. Thus, with the help of these five Vs, it is possible to gain a new understanding of existing Big data. However, today the amount of stored data is becoming too large to be processed by traditional algorithms. Researchers have to resort to various tricks, such as working with parts of the available data or using a priori knowledge about it. Thus, working with Big data raises the need to formalize new methods and to create new algorithms and software tools that use the power of previously created tools to work with data of large volume and dimensionality. Anomaly (outlier) detection is one of the problems of Big data analysis. It is widely used in the following areas: intrusion detection in computer networks, fraud detection in banking transactions, medicine, monitoring the movement of trains, failure detection in spacecraft systems, etc. [2].
One of the most convenient and understandable approaches to Big data processing is clustering. The problem of clustering is becoming more and more relevant in many areas. Many different clustering algorithms have been developed to date. Their complexity depends on the dimensionality of the data, the size of the set being clustered, the scope of application, and so on.
In 2003, three requirements for data stream clustering algorithms were formulated [3]: 1) compression of the data and representation of the compressed data; 2) fast, incremental processing of new data points; 3) quick and clear identification of outliers. Many works are devoted to the development and application of clustering algorithms (mainly various modifications of the k-means algorithm) to data streams [4,5]. Cluster analysis can be used to get an idea of the data, generate hypotheses, detect anomalies, and perform classification. Applications are often defined in terms of outliers (for example, fraud detection, anomaly detection in networks, etc.), in which case a direct approach is likely to be more effective [6,7]. The difficulty of anomaly (outlier) detection is that labeled patterns of normal and abnormal behavior are not easy to obtain [8,9]. Moreover, a chosen anomaly detection approach may be suitable only for a certain range of tasks, not for all [10].
The aim of this paper is to develop a clustering approach for anomaly detection in real Big data. Working with Big data requires large computational resources. For a number of clusters known in advance, following [11], we propose an algorithm based on weighted clustering.
The rest of the paper is organized as follows. Section 2 gives a literature review of existing works on clustering large amounts of data. The proposed weighted clustering method is described in Section 3. In Section 4, datasets and clustering evaluation metrics are presented. The experimental results and discussion are given in Section 5, followed by conclusions in Section 6.

Related Work
In [12], a new clustering algorithm was proposed. It shows good results according to three clustering metrics (volume, variety, and veracity). This approach works with large amounts of data and achieves a compromise between clustering quality and runtime.
In order to work with Big data, an approach combining principal component analysis (PCA) and the k-means algorithm was proposed [13]. Due to the randomized preconditioning transformation, it is possible to achieve accurate and reliable estimates in the data sparsification process. The proposed sparsified K-means algorithm returns both assignments and cluster centers in a single pass over the data, while the state-of-the-art feature-based algorithms require at least two passes. The preconditioning and sampling technique could be used either to speed up computation for in-core memory problems or to create one-pass variants for out-of-core or streaming problems.
A new approach combining K-means and tree-based classification [14] was proposed in order to analyse and visualize high-dimensional time series. The Hellinger distance between density functions is heavily used in the proposed analysis. Sensible results were obtained in experiments on real datasets.
The approach in [15] solves the problem of initialization in the clustering algorithm. At the training stage, a min-max problem was solved iteratively. At each iteration, the weights were updated. The experimental results have shown its robustness to bad initializations and its efficacy compared to k-means, k-means++ [16] and K-Harmonic Means [17].
A modified k-means algorithm (KMOR) was proposed for data clustering and outlier detection in [18]. The experiments were performed on synthetic and real datasets. The proposed algorithm was compared with the ODC (Outlier Detection and Clustering) algorithm [19]. The results showed the superiority of the KMOR algorithm in accuracy and run-time.
In [20], it is emphasized that the developed solutions allow for estimating not only the temporal data centroid but also its weighting vector, which indicates the representativeness of the centroid elements. The impact of the isotropy and isolation of clusters on the effectiveness of the clustering methods was also discussed. The proposed solutions are directly applicable to any other variation of k-means.
An alternative clustering approach was proposed in [21]. It is quite simple and consists in fitting the data to the clustering model. The method is designed as a clustering algorithm in which the initial structure is not important.
The main task of [22] is to increase the performance of the k-means algorithm for large datasets. An experimental comparison was made with other clustering methods. The results have shown high performance of the proposed hierarchical k-means (H-K-means) algorithm.

Proposed Approach
In this section, we propose a clustering method for anomaly detection and describe the details of the algorithm.
Let us introduce the following notation: $X = (x_1, x_2, \ldots, x_n)$ is the set of points in the dataset, where $n$ is the total number of data points; $x_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \in \mathbb{R}^m$ is the $i$th point, where $m$ is the dimension of the data points; $C = (C_1, C_2, \ldots, C_k)$ are the clusters, where $C_p$ ($p = 1, \ldots, k$) is the $p$th cluster and $k$ is the number of clusters. To detect anomalies in the dataset, the task is to minimize the objective function (1), in which $|C_p|_W$ denotes the weight of the $p$th cluster.
The weight of a cluster is determined as the sum of the weights of all points in the cluster (2). The weights of the points are calculated on the basis of their distance from the center of all points in the dataset (3), and $O_p$ denotes the center of the $p$th cluster.
The algorithm of the proposed method for anomaly detection is as follows:
Step 1: Find the center $O$ of all points of the dataset.
Step 2: Calculate the weights of all points $x_i$ according to (3).
Step 3: Set $s = 0$.
Step 4: Calculate the value of the function according to (1), taking into account (2).
Step 5: Set $s = s + 1$.
Step 6: Repeat Steps 4-5 until the convergence condition is met, where $s$ is the number of iterations.
In the algorithm, each cluster is represented by its center, and the goal is to find a solution that minimizes the distance between each point and the center of the cluster to which it is assigned [23].
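The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the experiments were run in Matlab), and since the exact forms of (1)-(3) are not reproduced here, several choices are assumptions made only for illustration: the point weight is taken as the distance to the global center $O$ normalized to sum to 1, the cluster centers are weighted means, the objective is a weighted sum of squared distances, and the names `point_weights` and `weighted_clustering` are hypothetical. In effect this is a weighted k-means variant used as a stand-in.

```python
import math
import random

def centroid(points, weights=None):
    """Weighted mean of a list of equal-length tuples (unweighted if weights is None)."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    dim = len(points[0])
    return tuple(sum(w * p[j] for p, w in zip(points, weights)) / total
                 for j in range(dim))

def point_weights(points):
    """ASSUMPTION: weight = distance to the global center O, normalized to sum to 1."""
    center = centroid(points)                      # Step 1: center of all points
    dists = [math.dist(p, center) for p in points]
    total = sum(dists) or 1.0                      # all points identical: avoid /0
    return [d / total for d in dists]              # Step 2: point weights

def weighted_clustering(points, k, max_iter=100, tol=1e-9, seed=0):
    """Illustrative weighted k-means; initialization and objective are assumptions."""
    rng = random.Random(seed)
    w = point_weights(points)
    centers = rng.sample(points, k)                # hypothetical initialization
    previous = math.inf
    for _ in range(max_iter):                      # Steps 3-6: iterate to convergence
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda p: math.dist(x, centers[p]))
                  for x in points]
        # Move each center to the weighted mean of its members.
        for p in range(k):
            idx = [i for i, lab in enumerate(labels) if lab == p]
            if idx and sum(w[i] for i in idx) > 0:
                centers[p] = centroid([points[i] for i in idx],
                                      [w[i] for i in idx])
        # Assumed objective: weighted sum of squared point-to-center distances.
        objective = sum(w[i] * math.dist(x, centers[labels[i]]) ** 2
                        for i, x in enumerate(points))
        if abs(previous - objective) < tol:
            break
        previous = objective
    return labels, centers, objective
```

With a different weight formula, only `point_weights` would need to change; the rest of the loop is ordinary Lloyd-style alternation between assignment and center update.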
Figure 1 shows a comparison of the proposed approach and the k-means algorithm using the Banknote authentication dataset as an example. The following characteristics of the dataset were considered: the variance of the Wavelet Transformed (WT) image, the kurtosis of the WT image and the entropy of the image. The dataset contains two clusters: blue filled circles ("normal" values) and red stars (anomalies). Figure 1 visually shows the effectiveness of the proposed approach.

Datasets and Evaluation Metrics
This section describes the datasets and the evaluation metrics used to compare the performance of the proposed method and the k-means algorithm. First, we describe the datasets that were used to conduct the experiments.

Datasets
The experiments were performed on six datasets from the UCI repository [24,25], namely the Diabetic Retinopathy Debrecen dataset (Diabetic), the MAGIC Gamma Telescope dataset (Magic04), Banknote authentication, Credit card clients, the Forest CoverType dataset (Covertype) and the Phishing dataset, as well as on the NSL-KDD dataset [26]. These datasets are medium and large in size and are used in many research areas.
Diabetic Retinopathy (DR) Debrecen Dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not [27,28]. It contains 19 features (the Euclidean distance between the center of the macula and the center of the optic disc, the binary result of the AM/FM-based classification, etc.) with 1151 samples. The 20th attribute is a class label, i.e. 1 (contains signs of DR) or 0 (no signs of DR). In this paper, samples with the class label equal to 1 were considered as anomalous values.
MAGIC 04 Dataset was generated to simulate the registration of high-energy gamma particles in an atmospheric Cherenkov telescope, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas and developing in the atmosphere. The dataset was generated by the Corsika program [29]. It consists of 19020 samples. It contains 10 features (axis of the ellipse, the ratio of the highest pixel, projected onto the major axis, etc.). The 11th attribute is a class label, i.e. gamma (signal) or hadron (background). In the paper, samples with the gamma class label were considered as anomalies.
Banknote Authentication Dataset was extracted from images that were taken from genuine and forged banknote-like specimens [25]. The images have a size of 400 x 400 pixels. The WT tool was used to extract features from the images. The dataset contains four features (variance of the WT image, skewness of the WT image, kurtosis of the WT image, and entropy of the image) with 1372 samples. Samples labeled as counterfeit banknotes were taken as anomalies.
Credit Card Clients Dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005 [30]. It contains 30000 samples. This research employed a binary variable, default payment (Yes=1, No=0), as the response variable. The dataset contains 23 features: the amount of the given credit, gender (male/female), education, marital status, age (years), history of past payment, etc.
NSL-KDD Dataset of attack signatures [26] was constructed based on the KDD-99 database [31]. To support research in the field of intrusion detection, a set of communication data was compiled that covers a wide range of intrusions simulated in an environment that mimics the US Air Force network. The database contains training (125973 samples) and test (22544 samples) sets. Each instance has 42 attributes. Each instance is labeled either as an "attack" type or as "normal" behavior. The total number of samples was 148517 (NSL-KDD All).
Covertype Dataset includes information about four wilderness areas located in the Roosevelt National Forest of northern Colorado (USA) [32]. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes than of forest management practices. This dataset contains 54 features (elevation in meters, distance to nearest surface water features, soil types, etc.) with 581012 samples. Dataset classes include Spruce/Fir (1), Lodgepole Pine (2), Ponderosa Pine (3), Cottonwood/Willow (4), Aspen (5), Douglas-fir (6) and Krummholz (7). In this paper, classes 1-6 were considered as normal values, and samples with the class label Krummholz (7) as anomalies.
Phishing Dataset contains 11055 phishing websites [33]. It includes 30 attributes (using the IP address, URL length, abnormal URL, website forwarding, etc.). Each sample belongs to one of two classes, labeled Phishy (-1) and Legitimate (1).
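As a hedged illustration of how such datasets can be prepared, the following Python sketch maps multi-class labels to a normal/anomaly indicator (using the Covertype convention described above, where class 7 is the anomaly) and standardizes each feature to zero mean and unit standard deviation. The function names are hypothetical, and the use of the population standard deviation is an assumption; the preprocessing code used in the experiments is not given in the paper.

```python
import statistics

def standardize(rows):
    """Column-wise z-score: each feature gets mean 0 and standard deviation 1.

    ASSUMPTION: the population standard deviation is used; a constant column
    is left at 0 instead of dividing by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def binarize_labels(labels, anomalous):
    """Return 1 for samples whose class is in `anomalous`, 0 otherwise."""
    return [1 if label in anomalous else 0 for label in labels]

# Covertype convention from the text: classes 1-6 are normal, class 7 is anomalous.
covertype_labels = [1, 2, 7, 6, 7]
is_anomaly = binarize_labels(covertype_labels, anomalous={7})

# Tiny made-up feature matrix, used only to exercise the function.
features = standardize([(1.0, 10.0), (3.0, 10.0)])
```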

Clustering Evaluation Metrics
After clustering algorithms have been applied, the quality of the obtained partitions must be estimated. To do this, quality assessment indices were considered. Six quality metrics of different nature were selected for the analysis [34].
Assume that the dataset of $n$ points is divided into classes $C^+ = (C^+_1, \ldots, C^+_{k^+})$ (true clustering) and that, using the clustering procedure, clusters $C = (C_1, \ldots, C_k)$ can be found in the dataset, where $k^+$ is the initial number of classes and $k$ is the number of clusters that need to be found [35].
A comparison of the clustering solutions is based on counting pairs of points. A decision ("normal" or "abnormal" behavior) is made based on the results. The most well-known clustering distance metrics based on data point pairs are the purity [36,37], the Mirkin metric [38], the partition coefficient [39], the variation of information [40], the F-measure [41] and the V-measure [41].
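Purity. The purity of a cluster is the ratio of the size of the dominant class in the cluster to the size of the cluster itself [36,37,42]: $\mathrm{purity}(C_p) = \frac{1}{|C_p|} \max_{p^+ = 1, \ldots, k^+} |C_p \cap C^+_{p^+}|$. The value of the purity is always in the interval $[1/k^+, 1]$. The purity of the entire collection of clusters can be evaluated as a weighted sum of the individual cluster purities: $\mathrm{purity}(C) = \sum_{p=1}^{k} \frac{|C_p|}{n} \, \mathrm{purity}(C_p) = \frac{1}{n} \sum_{p=1}^{k} \max_{p^+ = 1, \ldots, k^+} |C_p \cap C^+_{p^+}|$. A higher purity value indicates a better clustering solution.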
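A short Python sketch of the purity computation follows, assuming the cluster assignments and the true class labels are given as two equal-length lists. The function name `purity` and the toy labels are illustrative only.

```python
from collections import Counter

def purity(clusters, classes):
    """Weighted sum of per-cluster purities.

    For every cluster, count how many of its points belong to its dominant
    class, sum these counts over all clusters and divide by n.
    """
    per_cluster = {}
    for cluster, cls in zip(clusters, classes):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    dominant = sum(max(counts.values()) for counts in per_cluster.values())
    return dominant / len(clusters)

# Toy example: cluster 0 holds two points of class 0 and one of class 1;
# cluster 1 holds three points of class 1.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = [0, 0, 1, 1, 1, 1]
toy_purity = purity(toy_clusters, toy_classes)   # (2 + 3) / 6
```

Note that the cluster identifiers themselves do not matter: a clustering that matches the classes up to a relabeling still has purity 1.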
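Mirkin metric. The Mirkin metric is defined as follows [38]: $M(C, C^+) = \frac{1}{n^2} \left( \sum_{p=1}^{k} |C_p|^2 + \sum_{p^+=1}^{k^+} |C^+_{p^+}|^2 - 2 \sum_{p=1}^{k} \sum_{p^+=1}^{k^+} |C_p \cap C^+_{p^+}|^2 \right)$. The smaller the metric value, the better the clustering.
F-measure. Another evaluation measure, also known as the "clustering accuracy", is based on the F value of the cluster $C_p$ and the class $C^+_{p^+}$, that is, the harmonic mean of the precision and the recall. Precision and recall are computed as follows [35]: $\mathrm{prec}(C_p, C^+_{p^+}) = \frac{|C_p \cap C^+_{p^+}|}{|C_p|}$ and $\mathrm{rec}(C_p, C^+_{p^+}) = \frac{|C_p \cap C^+_{p^+}|}{|C^+_{p^+}|}$. Thus, the F-measure has the following form: $F(C_p, C^+_{p^+}) = \frac{2 \, |C_p \cap C^+_{p^+}|}{|C_p| + |C^+_{p^+}|}$. The F-measure of the cluster $C_p$ is the maximum F-value attained at any class in the entire set of classes $C^+ = (C^+_1, \ldots, C^+_{k^+})$: $F(C_p) = \max_{p^+ = 1, \ldots, k^+} F(C_p, C^+_{p^+})$. The F-measure of the entire dataset is considered to be the weighted sum of the individual cluster F-measures, that is, $F(C) = \sum_{p=1}^{k} \frac{|C_p|}{n} F(C_p)$. The higher the F-measure, the better the clustering solution.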
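The two measures can be sketched in Python as follows. The sketch assumes the standard normalized form of the Mirkin metric (sums of squared cluster sizes and class sizes minus twice the sum of squared overlaps, divided by $n^2$) and takes the F-measure of a clustering as the size-weighted sum, over clusters, of the best harmonic mean of precision and recall against any class. Function names and toy labels are illustrative.

```python
from collections import Counter

def mirkin(clusters, classes):
    """Normalized Mirkin metric; 0 means the two partitions are identical."""
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    overlaps = Counter(zip(clusters, classes))
    total = (sum(v * v for v in cluster_sizes.values())
             + sum(v * v for v in class_sizes.values())
             - 2 * sum(v * v for v in overlaps.values()))
    return total / n ** 2

def f_measure(clusters, classes):
    """Size-weighted sum of the best per-cluster F value."""
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    overlaps = Counter(zip(clusters, classes))   # missing pairs count as 0
    score = 0.0
    for cluster, n_cluster in cluster_sizes.items():
        # Harmonic mean of precision and recall: 2*|overlap| / (|C_p| + |C+_q|).
        best = max(2 * overlaps[(cluster, cls)] / (n_cluster + n_class)
                   for cls, n_class in class_sizes.items())
        score += n_cluster / n * best
    return score

toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = [0, 0, 1, 1, 1, 1]
toy_mirkin = mirkin(toy_clusters, toy_classes)   # (18 + 20 - 28) / 36
toy_f = f_measure(toy_clusters, toy_classes)     # 0.5 * 4/5 + 0.5 * 6/7
```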
Partition coefficient (PC). This coefficient is used to compare the distributions $C = (C_1, \ldots, C_k)$ and $C^+ = (C^+_1, \ldots, C^+_{k^+})$ [39]. PC is calculated according to [42]. A higher value of $PC(C, C^+)$ indicates a better clustering solution.

Variation of information (VI). This metric measures the amount of information that is gained and lost when going from the clustering $C$ to another clustering $C^+$ [40,42]. In general, the smaller the VI, the better the clustering solution.
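For illustration, the variation of information can be computed with the standard definition $VI(C, C^+) = H(C) + H(C^+) - 2 I(C, C^+)$, which equals $H(C|C^+) + H(C^+|C)$. The Python sketch below uses this form with natural logarithms; the logarithm base and any normalization applied to the values reported in the experiments are not stated in the text, so both are assumptions here.

```python
import math
from collections import Counter

def variation_of_information(clusters, classes):
    """VI = H(C | C+) + H(C+ | C), computed from the joint label counts.

    ASSUMPTION: natural logarithm, no normalization.
    """
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    overlaps = Counter(zip(clusters, classes))
    vi = 0.0
    for (cluster, cls), count in overlaps.items():
        joint = count / n
        vi -= joint * (math.log(joint / (cluster_sizes[cluster] / n))
                       + math.log(joint / (class_sizes[cls] / n)))
    return vi

# Identical partitions (up to relabeling) lose and gain no information.
vi_same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
# A single cluster against two balanced classes: VI equals H(C+) = log 2.
vi_merged = variation_of_information([0, 0, 0, 0], [0, 0, 1, 1])
```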
V-measure. The V-measure is an entropy-based measure that explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied [37]. The homogeneity can be defined as $\mathrm{hom} = 1 - \frac{H(C^+|C)}{H(C^+)}$, where $H(C^+|C)$ is equal to 0 when each cluster contains only members of a single class, i.e. a perfectly homogeneous clustering. In the degenerate case when $H(C^+)$ is equal to 0, i.e. when there is only a single class, the homogeneity is defined to be 1.
Completeness is symmetric to homogeneity and can be defined as $\mathrm{comp} = 1 - \frac{H(C|C^+)}{H(C)}$. The V-measure of the clustering solution is calculated as the harmonic mean of homogeneity and completeness: $V = \frac{2 \cdot \mathrm{hom} \cdot \mathrm{comp}}{\mathrm{hom} + \mathrm{comp}}$. The computation of the homogeneity, the completeness, and the V-measure is completely independent of the number of classes and clusters, the size of the dataset, and the clustering algorithm.
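The definitions of homogeneity, completeness and V-measure can be sketched in Python as below. The entropies are computed from label counts with natural logarithms (the base cancels in the ratios); the degenerate single-class and single-cluster cases are set to 1 as described in the text. Function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a labeling."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given) from the joint counts of the two labelings."""
    n = len(target)
    given_sizes = Counter(given)
    joint = Counter(zip(given, target))
    return -sum(c / n * math.log(c / given_sizes[g])
                for (g, _), c in joint.items())

def v_measure(clusters, classes):
    """Return (homogeneity, completeness, V-measure)."""
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    hom = 1.0 if h_classes == 0 else \
        1.0 - conditional_entropy(classes, clusters) / h_classes
    comp = 1.0 if h_clusters == 0 else \
        1.0 - conditional_entropy(clusters, classes) / h_clusters
    v = 0.0 if hom + comp == 0 else 2 * hom * comp / (hom + comp)
    return hom, comp, v

# Perfect clustering (up to relabeling) and a fully merged clustering.
perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
merged = v_measure([0, 0, 0, 0], [0, 0, 1, 1])
```

In the merged case every class sits in one cluster, so completeness is 1 while homogeneity, and with it the V-measure, drops to 0.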

Experimental Results and Discussion
To evaluate the performance of the proposed approach, a number of experiments were carried out in Matlab 2016a on a 64-bit Windows-based system with an Intel Core i7 2.5 GHz processor and 8 GB of RAM.
The Diabetic, Magic04, Banknote authentication, Credit card clients, NSL-KDD All, Covertype, and Phishing datasets were used as initial data. The characteristics of the datasets are presented in Table 1. The results of the proposed approach (PA) and the k-means (KM) algorithm based on six metrics of different nature are presented in Table 2. Purity, Mirkin metric, PC, VI, F-measure, and V-measure were considered as evaluation metrics.
In the proposed approach, each cluster is represented by its center, and the task is to find a solution that minimizes the distance between each point and the center of the cluster to which it is assigned, taking into account the weights of the clusters [43].
The proposed approach showed its best results for all metrics on the Covertype dataset: Purity = 96.47%, Mirkin metric = 36.69%, PC = 47.12%, VI = 5.06%, F-measure = 69.24% and V-measure = 1.0000. For the k-means algorithm, the best purity was also obtained on the Covertype dataset (96.47%), which coincides with the value of the same metric on this dataset for the proposed approach.
According to the Mirkin metric, the lowest value (42.76%) was achieved for the Credit card clients dataset. The highest result for the NSL-KDD All dataset according to the F-measure is 68.01%. The Covertype dataset showed the best results for k-means clustering based on the VI (5.80%) and PC (46.28%) metrics.
The V-measure has no discriminating ability here, i.e. its value is almost the same for both methods across the datasets. From this, it can be concluded that the V-measure is not useful for evaluating the clustering results in these experiments. Therefore, it was not considered in the following comparisons. A comparison of the performance of the proposed approach with the k-means algorithm is shown in Table 3.
In Table 3, "+" means that the proposed approach outperforms k-means and "-" means the opposite. A comparison of the evaluation metric values for the proposed approach (red bars) and the k-means algorithm (blue bars) on the seven datasets is illustrated more clearly in Fig. 2. Based on the experimental results, it can be concluded that the proposed approach is superior to the k-means algorithm in four metrics (Mirkin metric, F-measure, PC and VI) for the Covertype dataset. The Purity, Mirkin, F-measure and VI metrics showed good results for the Diabetic dataset, and the Purity, Mirkin, F-measure and PC metrics for the Banknote authentication and NSL-KDD All datasets. The Purity values were the same for the two approaches on the Magic04, Credit card clients, and Covertype datasets.
High evaluation results for the Phishing dataset were obtained based on the Purity, Mirkin, PC and VI metrics. For the Magic04 and Credit card clients datasets, the improvement was obtained based on the PC (0.33%) and F-measure (53.09%) metrics, respectively.
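The "+"/"-" comparison described above depends on the direction of each metric: Mirkin and VI are better when smaller, while Purity, PC and F-measure are better when larger. The following Python sketch makes this explicit using only the Covertype values quoted in the text for which both methods are reported. The "=" mark for ties and the function name `compare` are assumptions added for illustration.

```python
# Direction of each metric as stated in the text.
HIGHER_IS_BETTER = {
    "Purity": True,
    "Mirkin": False,
    "PC": True,
    "VI": False,
    "F-measure": True,
}

def compare(pa, km):
    """Mark each shared metric: '+' if PA is better, '-' if worse, '=' if tied.

    ASSUMPTION: '=' for ties is not part of the original table legend.
    """
    marks = {}
    for name in pa.keys() & km.keys():
        if pa[name] == km[name]:
            marks[name] = "="
        else:
            pa_wins = (pa[name] > km[name]) == HIGHER_IS_BETTER[name]
            marks[name] = "+" if pa_wins else "-"
    return marks

# Covertype values (in %) quoted in the text for both methods.
covertype_pa = {"Purity": 96.47, "PC": 47.12, "VI": 5.06}
covertype_km = {"Purity": 96.47, "PC": 46.28, "VI": 5.80}
covertype_marks = compare(covertype_pa, covertype_km)
```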

Conclusion
In this paper, a new method for anomaly detection based on weighted clustering was proposed. The aim of the algorithm presented in the paper was to improve the process of anomaly detection in large datasets. Each cluster was assigned a weight obtained by summing the weights of the points it contains.
In the paper, the weight of each point was determined by its position relative to the center of the entire dataset. The results indicate that the weighting improves the clustering solution. The proposed approach was compared with the k-means algorithm on seven datasets of medium and large size. The quality of the clustering result was estimated using six evaluation metrics. An important feature of the proposed approach is that it increases the accuracy of anomaly detection based on clustering. The experimental results showed that the proposed algorithm detects anomalies more accurately than k-means and has practical significance.
By applying the proposed approach to data clustering, it is possible to increase the reliability of partitioning data into groups. It can be concluded that the proposed approach becomes more efficient as the size of the analysed dataset increases. We investigated the effect of cluster weights on performance and on the accuracy of finding anomalies in Big data. It is important that this approach can be applied in various research fields. Future research will focus on the development and application of ensembles of clustering algorithms to anomaly detection.

Figure 2. The comparison of the methods based on evaluation metrics.
Stat., Optim. Inf. Comput., Vol. 6, June 2018. R. Alguliyev, R. Aliguliyev, Y. Imamverdiyev and L. Sukhostat.

Table 1. Summary of the datasets. During preprocessing, the values in the datasets were standardized to have a mean of 0 and a standard deviation of 1. All datasets contain two classes: $C^+_1$ and $C^+_2$. Samples in the $C^+_1$ class are taken as anomalies.

Table 2. Comparison of the proposed approach and k-means on different datasets.

Table 3. Performance comparison between the proposed approach and the k-means algorithm.