Weighted Clustering for Anomaly Detection in Big Data

  • Rasim Alguliyev Institute of Information Technology, Azerbaijan National Academy of Sciences
  • Ramiz Aliguliyev Institute of Information Technology, Azerbaijan National Academy of Sciences
  • Yadigar Imamverdiyev Institute of Information Technology, Azerbaijan National Academy of Sciences
  • Lyudmila Sukhostat Institute of Information Technology, Azerbaijan National Academy of Sciences
Keywords: Clustering, weighted clustering, clustering evaluation metrics, Big data, anomaly detection, k-means.

Abstract

In this paper, a new method for anomaly detection based on weighted clustering is proposed. Each cluster is assigned a weight obtained by summing the weights of its data points. The method is compared with the k-means algorithm on seven datasets of large dimensions. The proposed approach increases the reliability of partitioning data into groups. Experimental results show that the proposed approach becomes more efficient as the size of the analysed dataset increases.
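
The cluster-weighting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the point weights are computed or how the cluster weights are used to flag anomalies, so the point weights below, the function name `cluster_weights`, and the rule that treats the lightest cluster as a candidate anomaly are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict


def cluster_weights(labels, point_weights):
    """Sum the per-point weights within each cluster (hypothetical helper)."""
    if len(labels) != len(point_weights):
        raise ValueError("labels and point_weights must have the same length")
    totals = defaultdict(float)
    for label, weight in zip(labels, point_weights):
        totals[label] += weight
    return dict(totals)


# Hypothetical input: six points already assigned to three clusters,
# e.g. by k-means. The point weights are assumed to be given.
labels = [0, 0, 0, 1, 1, 2]
point_weights = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]

weights = cluster_weights(labels, point_weights)

# Assumption for illustration only: the cluster with the smallest total
# weight is treated as the most likely to contain anomalies.
lightest = min(weights, key=weights.get)
print(weights, lightest)
```

Keeping the aggregation separate from the clustering step means the same helper can be applied to labels produced by any partitioning algorithm.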

Published
2018-06-24
How to Cite
Alguliyev, R., Aliguliyev, R., Imamverdiyev, Y., & Sukhostat, L. (2018). Weighted Clustering for Anomaly Detection in Big Data. Statistics, Optimization & Information Computing, 6(2), 178-188. https://doi.org/10.19139/soic.v6i2.404
Section
Research Articles
