# Weighted Clustering for Anomaly Detection in Big Data

### Abstract

This paper proposes a new method for anomaly detection based on weighted clustering. Each cluster is assigned a weight obtained by summing the weights of its data points. The method is compared with the k-means algorithm on seven datasets of large dimensions. The proposed approach increases the reliability of partitioning the data into groups, and experimental results show that it becomes more efficient as the size of the analysed dataset grows.


*Statistics, Optimization & Information Computing*, *6*(2), 178-188. https://doi.org/10.19139/soic.v6i2.404