Enhancing Fraud Detection in Imbalanced Datasets: A Comparative Study of Machine Learning and Deep Learning Algorithms with SMOTE Preprocessing

Document Type: Original Research Articles

Authors

Information Systems Department, Faculty of Computer & Information Sciences - Mansoura University

Abstract

Fraud detection has become a critical challenge, particularly with the growth of e-commerce. Financial institutions are under increasing pressure to build robust systems that limit the significant economic losses caused by fraudulent activity. A key difficulty in detecting credit card fraud is class imbalance: fraudulent transactions are far fewer than legitimate ones. As a result, models trained on such data often struggle to recognize fraud.
To address this issue, various techniques have been developed. The Synthetic Minority Oversampling Technique (SMOTE) is widely used to create synthetic minority-class instances and balance the data set. Other strategies include under-sampling, which reduces the number of legitimate transactions, and cost-sensitive learning, which assigns different costs to different misclassifications so that missed fraud is penalized more heavily. Advanced SMOTE variants, such as Borderline-SMOTE and ADASYN, refine this approach by concentrating synthetic samples on minority instances that are hard to classify, such as those near the class boundary.
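The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbours, can be illustrated with a minimal sketch. This is not the implementation used in the study; the function name `smote`, the toy data, and the parameter values are assumptions chosen for illustration only.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (illustrative sketch)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical toy data: 100 legitimate points, 4 fraudulent points.
legit = [(float(i % 10), float(i // 10)) for i in range(100)]
fraud = [(20.0, 20.0), (21.0, 20.5), (20.5, 22.0), (22.0, 21.0)]

# Oversample the fraud class until both classes have the same size.
new_fraud = smote(fraud, n_synthetic=len(legit) - len(fraud), k=3)
balanced_fraud = fraud + new_fraud
```

Because every synthetic point lies on a segment between two real fraud samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing records.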
This paper examines how data preprocessing affects the performance of several machine learning and deep learning algorithms. Key preprocessing steps include data cleaning, normalization, feature selection, and SMOTE application. Cleaning and normalization ensure data quality and put features on comparable scales, while feature selection reduces dimensionality. The application of SMOTE directly addresses the class imbalance.
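As an illustration of the normalization step, the sketch below applies min-max scaling. The abstract does not state which normalization method was used, so min-max scaling is an assumption, as are the helper names `fit_min_max` and `apply_min_max` and the toy values. Fitting the scaling bounds on the training split only is a common precaution against information leaking from the test data, not something the abstract specifies.

```python
def fit_min_max(rows):
    """Return (min, max) per feature column, computed from the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(rows, bounds):
    """Scale each feature to [0, 1] using previously fitted bounds.
    Constant columns (max == min) are mapped to 0.0."""
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, bounds)
        )
        for row in rows
    ]

# Hypothetical features: (transaction hour, transaction amount).
train = [(10.0, 200.0), (20.0, 400.0), (30.0, 1000.0)]
test = [(25.0, 600.0)]

bounds = fit_min_max(train)            # fitted on training data only
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)
```

Without scaling, a feature such as transaction amount would dominate the distance computations that both SMOTE and SVM rely on.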
The preprocessed data are evaluated using Support Vector Machines (SVM), Random Forests (RF), Convolutional Neural Networks (CNN), and Long Short-Term Memory networks (LSTM). These algorithms are assessed for their ability to detect fraud after preprocessing. Comparative analyses confirm the effectiveness of SMOTE, showing improved performance across all algorithms. Accuracy, precision, recall, and F1 score are high overall, with CNN performing best (95% accuracy and 94% F1 score), followed by RF, LSTM, and SVM. Although SMOTE improved SVM performance, SVM did not reach the levels of CNN or RF. These findings highlight the significant improvements that data preprocessing can yield, providing valuable insights for improving fraud detection systems.
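The four metrics named above can be computed from the confusion-matrix counts, as in the following sketch. The function name `classification_metrics` and the toy labels are assumptions for illustration and do not reproduce the study's data or results. The example also shows why accuracy alone is misleading on imbalanced data: a model that never flags fraud still scores high accuracy while its recall is zero.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test set: 1000 transactions, 10 of them fraudulent (label 1).
y_true = [1] * 10 + [0] * 990

# Model A never predicts fraud.
always_legit = [0] * 1000
baseline = classification_metrics(y_true, always_legit)

# Model B catches 8 of the 10 frauds and raises 4 false alarms.
detector_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986
detector = classification_metrics(y_true, detector_pred)
```

In this toy setting the two models differ by less than one percentage point in accuracy, yet only the second one detects any fraud, which is why precision, recall, and F1 score are reported alongside accuracy.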
