On a new stacked ensemble framework for imputing missing data in the presence of outliers

Authors

  • Mahmoud A. Abdel-Fattah Department of Applied Statistics and Econometrics, Faculty of Graduate Studies for Statistical Research, 5 Ahmed Zewail St., Cairo University, Giza 12613, Egypt. https://orcid.org/0009-0000-6603-6207
  • Mai A. Mohsen Department of Applied Statistics and Econometrics, Faculty of Graduate Studies for Statistical Research, 5 Ahmed Zewail St., Cairo University, Giza 12613, Egypt.
  • Amany M. Mousa Department of Applied Statistics and Econometrics, Faculty of Graduate Studies for Statistical Research, 5 Ahmed Zewail St., Cairo University, Giza 12613, Egypt.

DOI:

https://doi.org/10.19139/soic-2310-5070-2894

Keywords:

Missing value imputation, Ensemble, Stacking, MissForest, IRMI, EM, Ridge

Abstract

Missing value imputation (MVI) is challenging, and it becomes more complicated when the data contain outliers. Although ensemble techniques such as bagging and boosting have been applied to MVI with promising results, stacking has not been investigated in this area, despite its effectiveness in prediction tasks. To address this gap, two robust stacking frameworks, RKSF-IM and RESF-IM, are proposed for imputing missing data in the presence of outliers. Both frameworks begin by adding an outlier indicator variable. They then employ two different stacking configurations in which MissForest, IRMI, and EM are the base learners, and the base learners' predicted values are used as inputs to a ridge regression, which acts as the meta learner in the second layer. The RMSE, MAE, and Wasserstein distance of the proposed frameworks are compared with those of the mean, median, XGBoost, EM, IRMI, KNN, MissForest, and SVM imputation methods in a simulation study and two real data applications. The simulation study considers different scenarios for missing rates and outliers. The study also investigates how adding an outlier indicator affects the performance of the different imputation methods. Under the simulation settings, the proposed stacking configurations outperform the competing methods in most scenarios. In addition, many existing imputation methods improve further when an outlier indicator variable is included.
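
The second-layer step described in the abstract, in which ridge regression combines the base learners' predicted values, can be illustrated with a minimal sketch. This is not the authors' RKSF-IM or RESF-IM implementation: the abstract does not give the stacking configurations, the outlier-indicator rule, or the ridge penalty, so the toy numbers, the penalty value `lam`, the no-intercept ridge form, and all function names below are assumptions made only for illustration. The three columns stand in for predictions from three base imputers on cells whose true values are known (for example, artificially masked cells).

```python
# Illustrative sketch only: a ridge meta learner over base-imputer predictions.
# All data and names here are hypothetical, not taken from the paper.

def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ridge(X, y, lam):
    """Ridge weights w = (X'X + lam * I)^(-1) X'y, without an intercept (an assumption)."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict(X, w):
    """Combine the base predictions in each row using the meta-learner weights."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]


def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5


def mae(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


# Toy layer-one output: each row holds three base-imputer predictions for one cell.
base_preds = [
    [1.0, 1.2, 0.8],
    [2.1, 1.9, 2.3],
    [2.9, 3.2, 3.0],
    [4.2, 3.8, 4.1],
    [5.0, 5.1, 4.7],
]
true_values = [1.1, 2.0, 3.0, 4.0, 5.0]  # hypothetical known values for those cells

lam = 0.1  # hypothetical ridge penalty
weights = fit_ridge(base_preds, true_values, lam)
stacked = predict(base_preds, weights)

print("meta-learner weights:", weights)
print("stacked RMSE:", rmse(stacked, true_values))
print("stacked MAE:", mae(stacked, true_values))
```

In practice the meta learner would be fitted on predictions for held-out cells rather than on the same cells used to judge it, and the fitted weights would then be applied to the base learners' predictions for the cells that are actually missing.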

Published

2025-10-20

Section

Research Articles

How to Cite

On a new stacked ensemble framework for imputing missing data in the presence of outliers. (2025). Statistics, Optimization & Information Computing, 14(6), 3526-3545. https://doi.org/10.19139/soic-2310-5070-2894