Comparative Evaluation of Imbalanced Data Management Techniques for Solving Classification Problems on Imbalanced Datasets

Keywords: Imbalanced data handling, Oversampling Technique, Machine Learning, Classification, Synthetic Minority Over-sampling Technique (SMOTE)

Abstract

Handling imbalanced data is a crucial and challenging step in building effective machine-learning models for classification. Without proper treatment, class imbalance significantly degrades a classifier's performance and leads to suboptimal results, and many methods have been developed to restore balance to the data. In this paper, we conduct a comparative study that uses a ranking technique to evaluate the effectiveness of 66 traditional methods for handling imbalanced data. Three classifiers, namely Decision Tree, Random Forest, and XGBoost, serve as the classification models. The experiments are divided into two parts: the first evaluates the performance of the imbalanced-data handling methods, while the second compares the top four oversampling methods in more detail. The study covers 50 datasets, 20 retrieved from the UCI repository and 30 sourced from the OpenML repository. Evaluation is based on the F-measure together with statistical methods, namely the Kruskal-Wallis test and the Borda Count, which rank the 66 methods by their ability to handle class imbalance. SMOTE serves as the benchmark for comparison because of its popularity in handling imbalanced data. Based on the experimental results, MCT, Polynom-fit-SMOTE, and CBSO were identified as the top three performers, demonstrating superior effectiveness on imbalanced datasets. This research can serve as a practical guide for practitioners choosing suitable techniques for imbalanced-data management.
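To make the evaluation protocol concrete, the sketch below pairs one oversampling method with one classifier and scores the pair by cross-validated F-measure, then illustrates how per-dataset scores can be compared with the Kruskal-Wallis test and aggregated into a ranking with the Borda Count. It is a minimal sketch, assuming the imbalanced-learn, scikit-learn, and SciPy packages; the synthetic data, the helper names f_measure_scores and borda_count, and the placeholder score table are illustrative and do not reproduce the paper's exact 66-method, 50-dataset configuration.

```python
# Minimal sketch of the evaluation protocol described in the abstract.
# Assumes imbalanced-learn, scikit-learn, and SciPy; the dataset,
# hyper-parameters, and score table below are illustrative placeholders.
import numpy as np
from scipy.stats import kruskal
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def f_measure_scores(X, y, sampler, classifier, folds=5):
    """Cross-validated F-measure for one sampler/classifier pair.

    The sampler runs inside the pipeline, so oversampling is applied
    to each training fold only; held-out folds keep their original
    class distribution.
    """
    pipe = Pipeline([("sampler", sampler), ("clf", classifier)])
    return cross_val_score(pipe, X, y, scoring="f1", cv=folds)


def borda_count(score_table):
    """Aggregate per-dataset scores into one ranking via Borda Count.

    score_table has shape (n_datasets, n_methods). On each dataset the
    best-scoring method earns n_methods - 1 points and the worst earns
    0; points are summed over all datasets.
    """
    points = np.zeros(score_table.shape[1])
    for row in score_table:
        for pts, method in enumerate(np.argsort(row)):  # ascending order
            points[method] += pts
    return points


# Toy run on one synthetic imbalanced dataset (9:1 class ratio).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           random_state=0)
scores = f_measure_scores(X, y, SMOTE(random_state=0),
                          RandomForestClassifier(random_state=0))
print("mean F-measure with SMOTE:", scores.mean())

# Across many datasets, collect one F-measure per (dataset, method),
# test whether the methods differ, then rank them by Borda Count.
score_table = np.random.default_rng(0).random((50, 4))  # placeholder
stat, p = kruskal(*score_table.T)  # Kruskal-Wallis across methods
print("Kruskal-Wallis p-value:", p)
print("Borda points per method:", borda_count(score_table))
```

Keeping the oversampler inside the cross-validation pipeline is the key design choice: it prevents synthetic minority samples from leaking into the test folds, which would otherwise inflate the measured F-measure of every resampling method.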

References

1. Haixiang G, Yijing L, Shang J, et al (2017) Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl 73:220–239. https://doi.org/10.1016/j.eswa.2016.12.035
2. Badrouchi S, Ahmed A, Mongi Bacha M, et al (2021) A machine learning framework for predicting long-term graft survival after kidney transplantation. Expert Syst Appl 182:115235. https://doi.org/10.1016/j.eswa.2021.115235
3. Moghadam P, Ahmadi A (2022) A machine learning framework to predict kidney graft failure with class imbalance using Red Deer algorithm. Expert Syst Appl 210:118515. https://doi.org/10.1016/j.eswa.2022.118515
4. He H, Garcia EA (2009) Learning from Imbalanced Data. IEEE Trans Knowl Data Eng 21:1263–1284. https://doi.org/10.1109/TKDE.2008.239
5. Kovács G (2019) An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl Soft Comput 83:105662. https://doi.org/10.1016/j.asoc.2019.105662
6. Liu R (2023) A novel synthetic minority oversampling technique based on relative and absolute densities for imbalanced classification. Appl Intell 53:786–803. https://doi.org/10.1007/s10489-022-03512-5
7. Barua S, Islam MdM, Murase K (2011) A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning. In: Lu B-L, Zhang L, Kwok J (eds) Neural Information Processing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 735–744
8. Sandhan T, Choi JY (2014) Handling Imbalanced Datasets by Partially Guided Hybrid Sampling for Pattern Recognition. In: 2014 22nd International Conference on Pattern Recognition. pp 1449–1453
9. Elreedy D, Atiya AF (2019) A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Inf Sci (N Y) 505:32–64. https://doi.org/10.1016/j.ins.2019.07.070
10. Raghuwanshi BS, Shukla S (2020) SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl Based Syst 187:104814. https://doi.org/10.1016/j.knosys.2019.06.022
11. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res 16:321–357. https://doi.org/10.1613/jair.953
12. Cieslak DA, Chawla NV, Striegel A (2006) Combating imbalance in network intrusion datasets. In: 2006 IEEE International Conference on Granular Computing. pp 732–737
13. Gazzah S, Amara NEB (2008) New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets. In: 2008 The Eighth IAPR International Workshop on Document Analysis Systems. pp 677–684
14. Koto F (2014) SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level. In: 2014 International Conference on Advanced Computer Science and Information System. pp 280–284
15. Fernández-Navarro F, Hervás-Martínez C, Antonio Gutiérrez P (2011) A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognit 44:1821–1833. https://doi.org/10.1016/j.patcog.2011.02.019
16. Farquad MAH, Bose I (2012) Preprocessing unbalanced data using support vector machine. Decis Support Syst 53:226–233. https://doi.org/10.1016/j.dss.2012.01.016
17. Jiang L, Qiu C, Li C (2015) A Novel Minority Cloning Technique for Cost-Sensitive Learning. Int J Pattern Recognit Artif Intell 29:1551004. https://doi.org/10.1142/S0218001415510040
18. Amirruddin AD, Muharam FM, Ismail MH, et al (2022) Synthetic Minority Over-sampling TEchnique (SMOTE) and Logistic Model Tree (LMT)-Adaptive Boosting algorithms for classifying imbalanced datasets of nutrient and chlorophyll sufficiency levels of oil palm (Elaeis guineensis) using spectroradiometers and unmanned aerial vehicles. Comput Electron Agric 193:106646. https://doi.org/10.1016/j.compag.2021.106646
19. Prasojo RA, Putra MAA, Ekojono, et al (2023) Precise transformer fault diagnosis via random forest model enhanced by synthetic minority over-sampling technique. Electr Power Syst Res 220:109361. https://doi.org/10.1016/j.epsr.2023.109361
20. Imakura A, Kihira M, Okada Y, Sakurai T (2023) Another use of SMOTE for interpretable data collaboration analysis. Expert Syst Appl 228:120385. https://doi.org/10.1016/j.eswa.2023.120385
21. Kruskal WH, Wallis WA (1952) Use of Ranks in One-Criterion Variance Analysis. J Am Stat Assoc 47:583–621
22. Borda JC (1784) Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, Paris
23. Zhu R, Wang Z, Ma Z, et al (2018) LRID: A new metric of multi-class imbalance degree based on likelihood-ratio test. Pattern Recognit Lett 116:36–42. https://doi.org/10.1016/j.patrec.2018.09.012
24. García-Lapresta JL, Martínez-Panero M, Meneses LC (2009) Defining the Borda count in a linguistic decision making context. Inf Sci (N Y) 179:2309–2316. https://doi.org/10.1016/j.ins.2008.12.021
25. Reel PS, Reel S, Van Kralingen JC, et al (2022) Machine learning for classification of hypertension subtypes using multi-omics: A multi-centre, retrospective, data-driven study Articles. EBioMedicine 84:104276. https://doi.org/10.1016/j.ebiom.2022.104276
26. Guan H, Zhang Y, Xian M, et al (2021) SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling. Appl Intell 51:1394–1409. https://doi.org/10.1007/s10489-020-01852-8
27. Jain V, Phophalia A, Bhatt JS (2018) Investigation of a Joint Splitting Criteria for Decision Tree Classifier Use of Information Gain and Gini Index. In: TENCON 2018 - 2018 IEEE Region 10 Conference. pp 2187–2192
28. Sheng C, Yu H (2022) An optimized prediction algorithm based on XGBoost. In: 2022 International Conference on Networking and Network Applications (NaNA). pp 1–6
29. Breiman L (2001) Random Forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
30. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J Mach Learn Res 15:3133–3181
31. Sun Z, Wang G, Li P, et al (2024) An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Syst Appl 237:121549. https://doi.org/10.1016/j.eswa.2023.121549
32. Wangchamhan T, Chiewchanwattana S, Sunat K (2017) Efficient algorithms based on the k-means and Chaotic League Championship Algorithm for numeric, categorical, and mixed-type data clustering. Expert Syst Appl 90:146–167. https://doi.org/10.1016/j.eswa.2017.08.004
33. Wu W-W (2011) Beyond Travel & Tourism competitiveness ranking using DEA, GST, ANN and Borda count. Expert Syst Appl 38:12974–12982. https://doi.org/10.1016/j.eswa.2011.04.096
34. Wang Z, Xie Z (2014) Infrared face recognition based on local binary patterns and Kruskal-Wallis test. In: 2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS). pp 185–188
35. Kelly M, Longjohn R, Nottingham K (2023) The UCI Machine Learning Repository. https://archive.ics.uci.edu. Accessed 24 Sep 2023
36. Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013) OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15:49–60. https://doi.org/10.1145/2641190.2641198
Published
2024-02-18
How to Cite
Watthaisong, T., Sunat, K., & Muangkote, N. (2024). Comparative Evaluation of Imbalanced Data Management Techniques for Solving Classification Problems on Imbalanced Datasets. Statistics, Optimization & Information Computing, 12(2), 547-570. https://doi.org/10.19139/soic-2310-5070-1890
Section
Research Articles