PSO+K-means Algorithm for Anomaly Detection in Big Data

Clustering is considered an effective approach to anomaly detection. In the well-known k-means and other classic clustering algorithms, the choice of the initial cluster centers and convergence to a local optimum are major problems that prevent accurate anomaly detection. In this paper, to improve the accuracy of anomaly detection, a new weighted clustering method based on the combination of PSO (particle swarm optimization) and k-means algorithms is proposed. The proposed method is tested on the Yahoo! S5 dataset, and the results are compared with those of the k-means algorithm. The experiments show that, compared with the k-means algorithm, the proposed method is more robust and produces more accurate results.


Introduction
Data clustering is an unsupervised classification method whose purpose is to divide a dataset into clusters by using a similarity measure between objects. Cluster analysis has become an important tool in data analysis, pattern recognition, machine learning, image segmentation, neural computation and other areas of science [1]. The purpose of a clustering algorithm is to maximize the inter-cluster distance and minimize the intra-cluster distance. The well-known k-means algorithm has been applied to many practical clustering problems [2,3,4,5,6,7,8]. It automatically divides the dataset into k groups and can generate results quickly and effectively.
Clustering in the k-means algorithm is performed by minimizing the average squared distance between each point and the center closest to that point. Although k-means is well known, it has several drawbacks: the result depends on the selection of the initial cluster centers, and the cluster centers must be determined in advance.
Since the objective function of the k-means algorithm is not convex, it has many local minima. Evolutionary algorithms are widely used to overcome these problems [9,10]. PSO is one such algorithm and is modeled on the thinking and behavior of a swarm. Literature analysis shows that PSO-based clustering achieves better results than existing clustering methods [11].
Anomaly detection is a major field of research, and researchers continue to look for new algorithms for detecting anomalies. Clustering techniques from data mining are an active area of research for detecting possible intrusions and attacks. This paper presents a new clustering approach for anomaly detection based on the hybridization of the k-means and PSO methods. Numerous clustering methods based on the hybridization of PSO and k-means have been proposed to detect anomalies [12,13,14,15,16,17]. The existing PSO-based clustering algorithms differ in their objective functions. In [9], a data clustering method that applies the PSO algorithm is proposed. The experiments reported there show that, because the criteria carry no weight values, many instances are identified incorrectly. In addition, because current algorithms give equal weight to the criteria minimized in data clustering, the optimality of the objective function cannot be tuned. Therefore, to reflect the importance of each criterion in the clustering process, appropriate weight coefficients must be assigned to the criteria [18]. In this paper, weight coefficients are used for this purpose; assigning weights to the criteria yields a better solution.
In this paper, to address the above-mentioned clustering problems (clustering accuracy, the presence of many local minima, predefined cluster centers), a multi-criterion optimization method based on the weighted PSO algorithm is proposed. In this method, minimization of the intra-cluster distance and maximization of the inter-cluster distance are selected as the optimization criteria.
The main contributions of this work are:
• A weighted method for detecting anomalies that occur in the cloud environment is proposed. Optimization is achieved through the minimization of two criteria: the objective function is the sum of the minimum value of the intra-cluster distance and the maximum value of the inter-cluster distance.
• Based on the proposed method, a PSO algorithm for data clustering is constructed.
• The capabilities of the proposed method are evaluated in the Matlab software package.
• The ability of the standard PSO algorithm to cluster data of any size is demonstrated.
• A new clustering algorithm based on the PSO algorithm is proposed, in which the k-means algorithm is used to form the initial population.

Related works
Hybridizing algorithms to improve clustering performance is not a novel idea. The contribution of our proposed method is the use of a PSO-based swarm intelligence algorithm together with k-means to optimize clustering results based on two contemporary criteria: well-separated clusters, characterized by low intra-cluster and high inter-cluster distances. Another aim of this paper is to achieve local optimization by means of validity indexes. We apply a new fitness function in the PSO algorithm to select the best solution, and then propose an anomaly detection method that combines PSO with the traditional k-means algorithm. We design this hybrid system for a cloud computing environment to find the optimal number of clusters with high separation from neighboring clusters and low compactness of local data points, while increasing the detection rate and decreasing the false positive rate.
PSO combined with k-means, with various parameters, has been applied in different domains. A comparison of these methods is summarized in Table 1.

Optimization method based on swarm intelligence
The particle swarm optimization (PSO) algorithm, proposed by Eberhart and Kennedy in 1995, is a population-based stochastic search algorithm modeled on the social behavior of birds [20,21]. The algorithm is built on a population of particles, called a swarm, in which each particle is a possible solution to the optimization problem. The purpose of the PSO algorithm is to find the best solution with respect to the objective function.
Each particle is a position in an N-dimensional space and moves by adjusting its position in the multidimensional search space according to the following:
• The best position of the particle;
• The best position of the particle's neighbors.
Each i-th particle consists of:
• $x_i$: the current position of the i-th particle;
• $v_i$: the current velocity of the i-th particle;
• $y_i$: the personal best position of the i-th particle;
• $g$: the best position of the swarm.
The velocity and position of the particle are updated according to Eqs. (1) and (2):
$$v_{i,l}(t+1) = \omega\, v_{i,l}(t) + \gamma_1 r_1(t)\,[y_{i,l}(t) - x_{i,l}(t)] + \gamma_2 r_2(t)\,[g_l(t) - x_{i,l}(t)] \qquad (1)$$
$$x_{i,l}(t+1) = x_{i,l}(t) + v_{i,l}(t+1) \qquad (2)$$
where $\omega$ is the inertia weight ($\omega = 0.7298$), $\gamma_1$ (cognitive coefficient, the weight of local information) and $\gamma_2$ (social coefficient, the weight of global information) are acceleration constants, $r_1(t), r_2(t) \sim U(0, 1)$ are random variables uniformly distributed over the interval (0, 1), $l = 1, ..., N$, $i = 1, ..., P$, and P is the swarm size.
The personal best position of the i-th particle is calculated as follows:
$$y_i(t+1) = \begin{cases} y_i(t), & f(x_i(t+1)) \ge f(y_i(t)) \\ x_i(t+1), & f(x_i(t+1)) < f(y_i(t)) \end{cases}$$
where f is the objective function to be minimized.

Table 1 (recovered entries, in source order):
• Objective function: [formula not recovered], where $x_n$ is the data point, $z_{ij}$ refers to the j-th cluster centroid of the i-th particle, and d is the position of the particles. The function is the sum of all intra-cluster distances; lower is better. Description: multidimensional asynchronism and a stochastic disturbance method are incorporated into the velocity update of PSO, which preserves population diversity and the ability to search for the global optimum. This method is called MSPSO-K.
• [12]. Dataset: city coordinates of the Hopfield-10 TSP and the Iris plants database.
• Objective function: [formula not recovered], the maximum value of the mean of distances within the same classes. Description: a combination of the core idea of k-means with PSO, which yields a clustering algorithm with a lower error rate than k-means.
• [13]. Dataset: KDDCup 1999.
• Objective function: [formula not recovered], a distance measure (Euclidean distance is chosen in that work) between a data point $X_i$ and the cluster center $Z_j$. Description: an anomaly intrusion detection system based on the combination of PSO (for initializing the K cluster centroids) and k-means (for its local search ability to stabilize the centroids). The results show a false positive rate of 2.8% and a detection rate of 86%.
• Inertia weight is [remainder not recovered]. Description: a hybrid PSO-k-means anomaly detection algorithm that optimizes clustering results based on two contemporary criteria: well-separated clusters with low intra-cluster and high inter-cluster distances, and local optimization by validity indexes.
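The velocity and position updates of Eqs. (1) and (2), together with the personal-best rule, can be illustrated with a minimal Python sketch. It assumes minimization and draws $r_1$, $r_2$ independently for each dimension, which is a common convention rather than something the text specifies; the function names `pso_step` and `update_personal_best` are illustrative only, and the default $\gamma_1 = \gamma_2 = 1.4962$ is the value reported in the Experiments section.

```python
import random

def pso_step(x, v, y, g, omega=0.7298, gamma1=1.4962, gamma2=1.4962, rng=random):
    """One update of a single particle following Eqs. (1) and (2).

    x, v, y: current position, velocity and personal best of the particle.
    g: best position found by the swarm. All are lists of equal length N.
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        # ASSUMPTION: r1 and r2 are redrawn for every dimension.
        r1, r2 = rng.random(), rng.random()
        vel = (omega * v[d]
               + gamma1 * r1 * (y[d] - x[d])    # cognitive (local) term
               + gamma2 * r2 * (g[d] - x[d]))   # social (global) term
        new_v.append(vel)
        new_x.append(x[d] + vel)                # Eq. (2)
    return new_x, new_v

def update_personal_best(x_new, y, f):
    """Keep y unless the new position has a strictly lower cost f."""
    return x_new if f(x_new) < f(y) else y
```

When a particle already sits at both its personal best and the swarm best, only the inertia term contributes, so a particle at rest stays where it is.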

Statement of the clustering problem
In the space $R^n$, the clustering problem can be stated as follows: given a set of n points $\{x_1, x_2, ..., x_n\}$, divide these points into K (a known constant) sets $G_1, G_2, ..., G_K$ according to their similarity.
During partitioning, the following conditions must be satisfied:
$$G_i \neq \emptyset,\; i = 1, \dots, K; \qquad G_i \cap G_j = \emptyset,\; i \neq j; \qquad \bigcup_{i=1}^{K} G_i = \{x_1, x_2, \dots, x_n\}$$
Clustering is a process that groups objects into multiple clusters so that similar objects are assigned to the same cluster. K-means is the most popular and widely used clustering algorithm. The k-means algorithm tries to find the cluster centers $c_1, c_2, ..., c_k$ for which the sum of the squared distances from each point to its nearest cluster center is minimized:
$$\min_{c_1, \dots, c_k} \sum_{i=1}^{n} \min_{k} d^2(x_i, c_k)$$
where $d(x_i, c_k)$ is the distance between $x_i$ and $c_k$. In this paper the Euclidean distance is used. The standard k-means algorithm is summarized as follows:
Step 1. Determine the number of clusters k and initialize the centroids $c_1^{(0)}, ..., c_k^{(0)}$, one for each cluster. Each cluster center is an m-dimensional vector, for example $c_k = (c_{k1}, ..., c_{km})$.
Step 2. Calculate the distance $d_{ki}^{(t-1)}$ between the i-th data object (a point in m-dimensional space) and the k-th cluster center. The Euclidean distance is calculated as follows:
$$d_{ki}^{(t-1)} = \sqrt{\sum_{j=1}^{m} \left(x_{ij} - c_{kj}^{(t-1)}\right)^2} \qquad (8)$$
Step 3. Assign each data object to the nearest cluster center.
Step 4. Update each cluster center $c_k^{(t)}$ by Eq. (9), which calculates the average of all points assigned to that cluster center:
$$c_k^{(t)} = \frac{1}{n_k} \sum_{x_i \in C_k} x_i \qquad (9)$$
where $C_k$ is the set of points assigned to the k-th cluster and $n_k$ is the number of points in it.
Step 6. If the stopping condition on the value of D is satisfied, choose the final cluster centers. Otherwise, set t = t + 1 and return to Step 2.
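The k-means steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the stopping rule (largest center shift below `tol`) stands in for the condition on D, which the text does not spell out; empty clusters keep their previous center; and the names `kmeans`, `assign`, `init`, `tol` and `seed` are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two m-dimensional points (Eq. (8))."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def assign(points, centers):
    """Steps 2-3: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points]

def kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: initial centers, either supplied or sampled from the data.
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Step 4: mean of the points assigned to cluster j (Eq. (9)).
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # ASSUMPTION: keep empty cluster's center
        # ASSUMPTION: stop when no center moves by more than tol.
        shift = max(euclidean(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, assign(points, centers)
```

Passing `init` makes a run reproducible and also makes the dependence on the initial centers, discussed in the Introduction, easy to explore.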

Proposed optimization function for PSO clustering
In the PSO clustering algorithm, each particle $Y_i = (y_1, y_2, ..., y_k)$ represents the centers of the k classes. The particle swarm thus consists of several candidate classification plans (candidate solutions), and the optimization algorithm uses a fitness function to select, among these plans, the one that best satisfies the stated condition. For this purpose, the following fitness function is proposed in this paper:
$$f = (1 - a)\, J_1 + a\, J_2 \qquad (10)$$
The goal is to minimize the value of the cost function given by Eq. (10); that is, clustering is assumed to be most effective at the minimum value of this function. $(1 - a)$ and $a$ are the weights of the criteria $J_1$ and $J_2$, respectively, and represent the impact of each criterion on the evaluation. In a series of experiments, the clustering result was relatively stable and better at a = 0.731. For this reason, a = 0.731 is used in this paper.
The minimum value of the function f corresponds to a partition in which the distance between points in the same class is small and the distance between classes is large. The classification plan with the lowest value of f is considered the best. The two criteria that form the fitness evaluation are:
a) Intra-cluster distance: the distance between the data vectors and their corresponding cluster center within a cluster, where the objective is to minimize the intra-cluster distance. This criterion is calculated by Eq. (11), where $c_j$ is the j-th cluster center and $x_i$ denotes the data points belonging to $C_j$.
b) Inter-cluster distance: the distance between all cluster centers, where the objective is to maximize the distance between clusters. This criterion is calculated by Eq. (12), where $c_k$ and $c_j$ are the k-th and j-th cluster centers, respectively, and K is the number of clusters.
Based on this structure, the PSO-based optimization algorithm for anomaly detection consists of the following steps:

Proposed optimization model for anomaly detection
Step 1. Initialize the population size P, the learning factors $\gamma_1$ and $\gamma_2$, the number of iterations, a population of particles with small random positions $x_p$, the inertia weight ($\omega$), the number of clusters k, and a dataset consisting of n points. The position of each particle ($x_i$) corresponds to a set of cluster centers of size k × m (k centers of dimension m), and the velocity represents the rate of change of the particle's position.
Step 2. For every particle in the population, run the following k-means steps: 1. Calculate the Euclidean distance $d_{pi}$ between the p-th cluster center (particle) and the i-th data point using Eq. (8). 2. Assign each data object $x_i$ to the nearest cluster center.
Step 3. After grouping the data objects by the minimum distance criterion, evaluate the fitness function given by Eq. (10), whose optimization is intended to maximize classification accuracy.
Step 4. Compare the particle's fitness value with its previous best value $P_{best}$. If the current position (cluster center location) is better than $P_{best}$, assign the current position to $P_{best}$; otherwise retain the old $P_{best}$. This is carried out for each particle in the population.
Step 5. After updating $P_{best}$, choose the particle with the best fitness value (the lowest value of f in Eq. (10)) among the $P_{best}$ positions and assign it to $G_{best}$. $G_{best}$ is a single particle of dimension k × m, where k is the number of clusters identified in partitioning the dataset.
Step 6. Update the velocity and position of each particle using Eqs. (1) and (2), respectively.
Step 7. Check the convergence criterion, which may be a good fitness value or a maximum number of iterations. If converged, return $G_{best}$ as the optimal cluster centers; otherwise increment the iteration count, t = t + 1, and loop to Step 2.
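Steps 1 to 7 can be sketched as a small Python program. Several choices below are assumptions made only for illustration, since the exact forms of $J_1$ and $J_2$ are not given in the recovered text: $J_1$ is taken as the mean distance from each point to its assigned center, and $J_2$ as the reciprocal of the mean pairwise distance between centers, so that minimizing $f = (1 - a) J_1 + a J_2$ rewards compact and well-separated clusters. Particles are initialized at randomly chosen data points, and a fixed iteration budget serves as the convergence criterion. All function and parameter names are hypothetical.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def assign(points, centers):
    """Step 2: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j])) for p in points]

def fitness(points, centers, a=0.731):
    """Step 3: f = (1 - a) * J1 + a * J2, lower is better (requires k >= 2).

    ASSUMPTION: J1 = mean point-to-center distance,
                J2 = 1 / (mean pairwise distance between centers).
    """
    labels = assign(points, centers)
    j1 = sum(dist(p, centers[j]) for p, j in zip(points, labels)) / len(points)
    pairs = [(i, j) for i in range(len(centers)) for j in range(i + 1, len(centers))]
    mean_sep = sum(dist(centers[i], centers[j]) for i, j in pairs) / len(pairs)
    j2 = 1.0 / (mean_sep + 1e-12)   # small constant avoids division by zero
    return (1 - a) * j1 + a * j2

def pso_kmeans(points, k, swarm=20, iters=100,
               omega=0.7298, g1=1.4962, g2=1.4962, seed=0):
    rng = random.Random(seed)
    m = len(points[0])
    # Step 1: each particle holds k centers (ASSUMPTION: started at random data points).
    xs = [[list(rng.choice(points)) for _ in range(k)] for _ in range(swarm)]
    vs = [[[0.0] * m for _ in range(k)] for _ in range(swarm)]
    pbest = [[c[:] for c in x] for x in xs]
    pcost = [fitness(points, x) for x in xs]
    best = min(range(swarm), key=lambda i: pcost[i])
    gbest, gcost = [c[:] for c in pbest[best]], pcost[best]
    for _ in range(iters):                      # Step 7: fixed iteration budget
        for i in range(swarm):
            for j in range(k):                  # Step 6: Eqs. (1) and (2)
                for d in range(m):
                    r1, r2 = rng.random(), rng.random()
                    vs[i][j][d] = (omega * vs[i][j][d]
                                   + g1 * r1 * (pbest[i][j][d] - xs[i][j][d])
                                   + g2 * r2 * (gbest[j][d] - xs[i][j][d]))
                    xs[i][j][d] += vs[i][j][d]
            cost = fitness(points, xs[i])       # Steps 2-3
            if cost < pcost[i]:                 # Step 4: personal best
                pbest[i], pcost[i] = [c[:] for c in xs[i]], cost
                if cost < gcost:                # Step 5: global best
                    gbest, gcost = [c[:] for c in xs[i]], cost
    return gbest, gcost
```

Calling `assign(points, gbest)` on the returned centers gives the final cluster labels. Because the weighted sum mixes two quantities with different scales, the balance implied by a = 0.731 in this sketch need not match the balance in the paper's own formulation.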

Metrics for performance evaluation
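Assume that the dataset D is composed of the classes $C^+ = (C_1^+, ..., C_{K^+}^+)$, and that a clustering technique is applied to find the clusters $C = (C_1, ..., C_K)$ in this dataset. There are various indices for comparing the two partitions $C = (C_1, ..., C_K)$ and $C^+ = (C_1^+, ..., C_{K^+}^+)$ [22].
Purity. The purity of the cluster $C_p$ is defined as follows:
$$\mathrm{purity}(C_p) = \frac{1}{|C_p|} \max_{p^+ = 1, \dots, K^+} |C_p \cap C_{p^+}^+|$$
The value of the purity is always in the interval $[1/K^+, 1]$. A large purity value implies that the cluster is a "pure" subset of the dominant class. The purity of the entire collection of clusters is evaluated as a weighted sum of the individual cluster purities:
$$\mathrm{purity}(C) = \sum_{p=1}^{K} \frac{|C_p|}{n}\, \mathrm{purity}(C_p)$$
where n is the total number of points. According to this measure, a higher purity value indicates a better clustering solution.
Entropy. An entropy measure based on information-theoretic considerations can also be used. The entropy of the cluster $C_p$ is defined as
$$E(C_p) = -\frac{1}{\log K^+} \sum_{p^+ = 1}^{K^+} \frac{|C_p \cap C_{p^+}^+|}{|C_p|} \log \frac{|C_p \cap C_{p^+}^+|}{|C_p|}$$
Note that $x \log x$ tends to 0 as x tends to 0, so $0 \log 0$ is taken to be 0.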

Since the entropy considers the distribution of semantic classes in a cluster, it is a more comprehensive measure than the purity. Note that the entropy is normalized to take values between 0 and 1. If almost all the points of cluster $C_p$ belong to a single class $C_q^+$, then the fraction $|C_p \cap C_{p^+}^+| / |C_p|$ of its points in class $C_{p^+}^+$ is close to 1 for $p^+ = q$ and close to 0 for $p^+ \neq q$, so the entropy $E(C_p)$ is close to 0 (since $0 \log 0 = 0$ and $1 \log 1 = 0$). On the other hand, if the points of cluster $C_p$ are randomly divided among all the classes of $C^+$, then each fraction is close to $1/K^+$ and the entropy of cluster $C_p$ is close to 1. In contrast to the purity measure, an entropy value of 0 means that the cluster consists entirely of one class, while an entropy value near 1 implies that the cluster contains a uniform mixture of all classes.
The global clustering entropy of the entire collection is defined as the sum of the individual cluster entropies weighted according to the cluster size:
$$E(C) = \sum_{p=1}^{K} \frac{|C_p|}{n}\, E(C_p)$$
The global entropy also takes values between 0 and 1. A perfect clustering solution is one whose clusters each contain objects from only a single class, in which case the entropy is zero. In general, the smaller the entropy, the better the quality of the clustering.
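The purity and entropy measures described in this section can be computed directly from the true class labels of the points in each cluster. The following Python sketch assumes each cluster is given as a list of class labels and that the number of classes $K^+$ is the number of distinct labels present; the function names are illustrative.

```python
import math
from collections import Counter

def cluster_purity(labels_in_cluster):
    """Share of the dominant class among the points of one cluster."""
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)

def cluster_entropy(labels_in_cluster, n_classes):
    """Entropy of one cluster, normalized by log(n_classes) to lie in [0, 1].

    Classes absent from the cluster never appear in the Counter,
    which implements the convention 0 log 0 = 0.
    """
    if n_classes < 2:
        return 0.0
    size = len(labels_in_cluster)
    h = -sum((c / size) * math.log(c / size)
             for c in Counter(labels_in_cluster).values())
    return h / math.log(n_classes)

def global_scores(clusters):
    """clusters: one list of true class labels per cluster.

    Returns (purity, entropy), each weighted by cluster size.
    """
    n = sum(len(c) for c in clusters)
    n_classes = len({lab for c in clusters for lab in c})
    purity = sum(len(c) / n * cluster_purity(c) for c in clusters)
    entropy = sum(len(c) / n * cluster_entropy(c, n_classes) for c in clusters)
    return purity, entropy
```

For example, two clusters of four points each, one evenly mixed between two classes and one pure, give a global purity of 0.75 and a global entropy of 0.5.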
Dunn index. The Dunn index, one of the most cited indices, was proposed in [23]. It identifies clusters that are compact and well separated; the goal is therefore to maximize the inter-cluster distance while minimizing the intra-cluster distance. The Dunn index for k clusters is defined by Eq. (17) [24]:
$$DI = \frac{\min_{1 \le i < j \le k} \mathrm{diss}(c_i, c_j)}{\max_{1 \le m \le k} \mathrm{diam}(c_m)} \qquad (17)$$
where $\mathrm{diss}(c_i, c_j) = \min_{x \in c_i, y \in c_j} \|x - y\|$ is the dissimilarity between clusters $c_i$ and $c_j$, and $\mathrm{diam}(c) = \max_{x, y \in c} \|x - y\|$ is the intra-cluster function (or diameter) of the cluster. A large Dunn index means that compact and well-separated clusters exist.
Silhouette index. The Silhouette index (SI) is another well-known way of estimating the number of groups in a dataset. The SI computes for each point a width that depends on its membership in a cluster. The silhouette width is then averaged over all observations, which leads to Eq. (18):
$$SI = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (18)$$
where n is the total number of points, $a_i$ is the average distance between point i and all other points in its own cluster, and $b_i$ is the minimum of the average dissimilarities between i and the points in other clusters. The partition with the highest SI is taken to be optimal.
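Both indices can be computed by brute force on small datasets. The sketch below follows the definitions of diss, diam, $a_i$ and $b_i$ given above, assuming Euclidean distance and at least two clusters; assigning a silhouette width of 0 to points in singleton clusters is a common convention and an assumption here, and the function names are illustrative.

```python
import math

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dunn_index(clusters):
    """clusters: list of clusters, each a list of points (Eq. (17)).

    Requires at least one cluster with two distinct points,
    otherwise the largest diameter is zero.
    """
    min_sep = min(dist(x, y)
                  for i, ci in enumerate(clusters)
                  for cj in clusters[i + 1:]
                  for x in ci for y in cj)
    max_diam = max(dist(x, y) for c in clusters for x in c for y in c)
    return min_sep / max_diam

def silhouette_index(clusters):
    """Mean silhouette width over all points (Eq. (18))."""
    widths = []
    for i, ci in enumerate(clusters):
        for idx, x in enumerate(ci):
            if len(ci) == 1:
                widths.append(0.0)  # ASSUMPTION: singleton clusters score 0
                continue
            # a: mean distance to the other points of the same cluster
            a = sum(dist(x, y) for j, y in enumerate(ci) if j != idx) / (len(ci) - 1)
            # b: smallest mean distance to the points of any other cluster
            b = min(sum(dist(x, y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i)
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)
```

Both functions compare every pair of points, so their cost grows quadratically with the number of points; that is acceptable for an 84-point sample but not for large datasets.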

Experiments
In this section, to evaluate its effectiveness, the clustering results of the proposed method are compared with those of the k-means algorithm on the real Yahoo! S5 dataset. For the PSO-based clustering algorithm, the acceleration constants are set to $\gamma_1 = \gamma_2 = 1.4962$.
In the tests, the population size was set to 50, and the value of the fitness function was calculated by Eq. (10).
The real_2.csv file of the A1Benchmark in the Yahoo! S5 dataset contains 1440 rows. The feature vector of this dataset contains two parameters, timestamp and value, and each point is also labeled as normal or anomalous. In other words, the dataset is labeled. The real state of the A1Benchmark dataset (without clustering) is shown in Figure 2. For the experiments, 84 points in total were taken from the Yahoo! S5 dataset, of which 68 are normal (blue dots) and 16 are anomalous.
Figure 3 illustrates point anomalies, that is, anomalies in a single value of the time series, in the A1Benchmark dataset.
The k-means algorithm identified 79 points of this dataset as normal and 5 points as anomalous (Figure 4). Here, 11 points were identified incorrectly.
The proposed PSO algorithm identified 72 points as normal and 12 points as anomalous (Figure 5). In this case, 4 points were identified incorrectly. In the figure, the black circles show the cluster centers.
The effectiveness of the clustering was evaluated with four metrics: the Dunn index, the Silhouette index, the purity index and the entropy index (Table 2). According to these metrics, the Dunn index is 0.0510 for the k-means algorithm and 0.3847 for the PSO algorithm. Note that the larger the Dunn index, the more effective the algorithm is considered to be. The Silhouette index is 0.3899 for the k-means algorithm and 0.8722 for the PSO algorithm. The proposed method also performs well on the other metrics: the purity index is 0.8690 for the k-means algorithm and 0.9524 for the PSO algorithm, while the entropy is 0.5821 for the k-means algorithm and 0.3096 for the PSO algorithm. Lower entropy indicates a more effective clustering method.
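The reported purity values can be cross-checked against the point counts given above. The short Python sketch below assumes that every misidentified point is an anomaly placed in the normal cluster, which is consistent with the reported cluster sizes (79 and 5 for k-means, 72 and 12 for PSO) given 68 normal and 16 anomalous points; the helper name `purity_from_counts` is illustrative.

```python
def purity_from_counts(cluster_counts):
    """cluster_counts: one dict per cluster mapping class name -> number of points."""
    n = sum(sum(c.values()) for c in cluster_counts)
    return sum(max(c.values()) for c in cluster_counts) / n

# ASSUMPTION: all errors are anomalies assigned to the normal cluster.
kmeans_clusters = [{"normal": 68, "anomaly": 11}, {"anomaly": 5}]
pso_clusters = [{"normal": 68, "anomaly": 4}, {"anomaly": 12}]

print(round(purity_from_counts(kmeans_clusters), 4))  # 0.869  (73 / 84)
print(round(purity_from_counts(pso_clusters), 4))     # 0.9524 (80 / 84)
```

Under this assumption the counts reproduce the purity values of 0.8690 and 0.9524 quoted for the two algorithms.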
The proposed PSO algorithm was run for 200 iterations; the results improved gradually until the optimal solution (BestCost) was found (Figure 6).

Conclusion
Investigations have shown that the PSO algorithm, its modifications, and its hybridization with various algorithms provide excellent results in terms of effectiveness and accuracy in solving optimization problems. Applying this algorithm to data clustering allows accurate creation of clusters and accurate forecasting and clustering of data.
In this paper, a PSO-based multi-criterion optimization method for data clustering was proposed. The proposed method was compared with the k-means algorithm on the Yahoo! S5 dataset. The experimental results showed that the clustering method based on the PSO algorithm outperforms the k-means algorithm.

Figure 1 shows the proposed optimization model for anomaly detection. As shown in the figure, the model is based on combining PSO with the traditional k-means algorithm. Here the class labels are not used directly in the clustering but are used in the fitness function of PSO to improve the performance of the traditional k-means algorithm.

Figure 1. Optimization model for anomaly detection

Figure 2. The real state of the A1Benchmark dataset

Figure 3. Point anomalies in a single value in the time series of the A1Benchmark dataset
Figure 4.
Figure 5.

Figure 6. Optimal solution dynamics based on PSO

Table 1. Hybridization of PSO with k-means for clustering

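Table 2. Clustering results of the PSO algorithm for a = 0.731
Metric | k-means | PSO
Dunn index | 0.0510 | 0.3847
Silhouette index | 0.3899 | 0.8722
Purity | 0.8690 | 0.9524
Entropy | 0.5821 | 0.3096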