### Anomaly Detection in Big Data based on Clustering

#### Abstract

Selecting the right tool for anomaly (outlier) detection in Big Data is a pressing task. This paper proposes algorithms for data clustering and outlier detection that take into account the compactness and separation of clusters, and examines the particulars of using them for anomaly detection. Numerical experiments on real data sets of different sizes demonstrate the effectiveness of the proposed algorithms.
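The abstract does not spell out the proposed algorithms, so the following is only a minimal sketch of the general idea of clustering-based outlier detection, not the authors' method. It assumes plain k-means with a deterministic initialization, mean point-to-centroid distance as a stand-in for compactness, minimum centroid-to-centroid distance as a stand-in for separation, and a hypothetical rule that flags points lying more than `factor` times the median distance from their centroid. All function names, the toy data, and the threshold are illustrative assumptions.

```python
import math
from statistics import median


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=20):
    """Plain k-means. Assumption: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign(points, centroids)


def compactness(points, centroids, labels):
    """Mean distance from each point to its own centroid (lower is tighter)."""
    return sum(dist(p, centroids[lab]) for p, lab in zip(points, labels)) / len(points)


def separation(centroids):
    """Smallest distance between any two centroids (higher is better separated)."""
    return min(dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:])


def flag_outliers(points, centroids, labels, factor=3.0):
    """Hypothetical rule: flag points farther than factor * median distance."""
    d = [dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    cutoff = factor * median(d)
    return [i for i, di in enumerate(d) if di > cutoff]


# Toy data (made up): two tight groups plus one distant point at index 8.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (1, 1),
          (10, 11), (11, 10), (11, 11), (30, 30)]
centroids, labels = kmeans(points, k=2)
outliers = flag_outliers(points, centroids, labels)

print("compactness:", round(compactness(points, centroids, labels), 2))
print("separation:", round(separation(centroids), 2))
print("outlier indices:", outliers)
```

In this toy run the distant point also drags its cluster's centroid away from the other members, which illustrates why treating clustering and outlier detection together, rather than as separate steps, can matter.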

DOI: 10.19139/soic.v5i4.365
