Missing Value Management: Weighted Heuristic Similarity Estimation for Numeric Values

Document Type: Original Research Article

Authors

Faculty of Computers and Information, Computer Science Dept., Mansoura University, Egypt

Abstract

Businesses and technologies that handle massive volumes of data, such as the Internet of Things (IoT) and digital banking, need every processed data value to be recorded accurately; values that are not recorded must be replaced using a reliable imputation method. Missing value imputation is especially important in big data applications, where data volumes tend to grow exponentially and data structures change rapidly. This study proposes a distance function that is more effective in determining the best replacement values for missing data before a classifier is applied to the target dataset. In essence, the Weighted Heuristic Similarity Estimation (WHSE) mechanism consumes substantial effort in practical application fields. The WHSE method was benchmarked using the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) metrics. The evaluation was conducted using three distinct classifiers: Nearest-Neighbor (NN), Linear-Regression (LR), and Multi-Layer Perceptron (MLP). The WHSE method was applied to two datasets, Iris and Forest Fires, to estimate its impact in replacing missing values. Consequently, the WHSE formula can direct the applied classifier to achieve at least similar, if not ideal, performance regardless of the characteristics of the imputed data. The WHSE method is expected to be scalable, stable, and applicable in big data analytics.
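To make the evaluation idea concrete, the following is a minimal sketch of distance-based imputation scored with MAE and RMSE. It is not the WHSE formula, which the abstract does not state: the weighted Euclidean distance, the function names (`weighted_distance`, `impute_nearest`), the equal weights, and the toy rows are all assumptions chosen for illustration.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def weighted_distance(row_a, row_b, weights):
    """Weighted Euclidean distance over features observed in both rows.

    Assumption: this generic distance stands in for the paper's similarity
    function, which is not given in the abstract.
    """
    total = 0.0
    shared = 0
    for a, b, w in zip(row_a, row_b, weights):
        if a is not None and b is not None:
            total += w * (a - b) ** 2
            shared += 1
    # Rows with no shared observed feature are treated as infinitely far apart.
    return math.sqrt(total) if shared else math.inf


def impute_nearest(rows, weights):
    """Replace each None with the value from the closest row that has it."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [r for k, r in enumerate(rows) if k != i and r[j] is not None]
            if donors:
                best = min(donors, key=lambda r: weighted_distance(row, r, weights))
                imputed[i][j] = best[j]
    return imputed


# Toy numeric rows (hypothetical values); None marks a missing entry.
rows = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, None],
    [6.3, 3.3, 6.0],
]
weights = [1.0, 1.0, 1.0]  # hypothetical equal feature weights

filled = impute_nearest(rows, weights)

# Score the imputed value against a held-out "true" value (also hypothetical).
true_values = [1.4]
imputed_values = [filled[1][2]]
print(mae(true_values, imputed_values), rmse(true_values, imputed_values))
```

In this sketch the second row borrows its missing value from the first row, its nearest neighbor under the assumed distance. Changing the weights changes which row counts as nearest, which is the kind of effect a weighted similarity measure is meant to control.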

Keywords

Main Subjects