Systematic Literature Review: Machine Learning Algorithm Performance Evaluation of Extract-Transform-Load

Muhammad Faisal Ashshidiq; Mohamad Nurkamal Fauzan

doi:10.37278/sisinfo.v8i1.1340

Authors

Muhammad Faisal Ashshidiq, Informatics Engineering, Vocational School, Universitas Logistik dan Bisnis Internasional
Mohamad Nurkamal Fauzan, Informatics Engineering, Vocational School, Universitas Logistik dan Bisnis Internasional

DOI:

https://doi.org/10.37278/sisinfo.v8i1.1340

Keywords:

Machine Learning, Anomaly Detection, Nested Array, SLR

Abstract

The exponential growth of data in the digital era poses significant challenges for effective data utilization. The Extract, Transform, Load (ETL) process is the foundation for preparing large-scale, unstructured data from various sources, such as NoSQL databases and log files, for analysis in a data warehouse. However, handling complex data structures such as nested arrays in MongoDB is a major obstacle during the transformation phase. The transformation phase must also maintain data quality and integrity, which requires a robust anomaly detection mechanism to identify unusual patterns or events that indicate data corruption or system errors. Handling such errors in turn requires analyzing nested array data structures with machine learning algorithms suited to anomaly detection. This literature study aims to provide insights into anomaly detection on data after the ETL process and to identify the algorithms relevant to it.
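As a rough illustration of the two steps the abstract describes, the sketch below flattens a nested array from MongoDB-style documents into flat rows and then screens one numeric column for outliers. It is a minimal sketch under stated assumptions: the `orders` documents, the field names, and both helper functions are hypothetical, and the median-based rule (modified z-score) is a simple stand-in, not one of the machine learning algorithms covered by the review.

```python
from statistics import median


def flatten_nested_array(doc, array_field):
    """Return one flat row per element of doc[array_field].

    The parent's other fields are copied into every row, and the keys of
    each array element are prefixed with the array field name.
    """
    parent = {k: v for k, v in doc.items() if k != array_field}
    rows = []
    for item in doc.get(array_field, []):
        row = dict(parent)
        for key, value in item.items():
            row[f"{array_field}.{key}"] = value
        rows.append(row)
    return rows


def flag_anomalies(rows, column, threshold=3.5):
    """Return rows whose value lies far from the median of the column.

    Uses the modified z-score based on the median absolute deviation (MAD).
    This is an illustrative stand-in for a learned anomaly detector.
    """
    values = [row[column] for row in rows]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread in the data, so nothing can be ranked as unusual.
        return []
    return [
        row for row in rows
        if 0.6745 * abs(row[column] - med) / mad > threshold
    ]


# Hypothetical documents with a nested "items" array.
orders = [
    {"_id": 1, "customer": "A", "items": [{"sku": "x1", "qty": 2}, {"sku": "x2", "qty": 3}]},
    {"_id": 2, "customer": "B", "items": [{"sku": "x1", "qty": 2}, {"sku": "x3", "qty": 4}]},
    {"_id": 3, "customer": "C", "items": [{"sku": "x2", "qty": 3}, {"sku": "x4", "qty": 500}]},
]

# Transform step: one row per array element.
rows = [row for doc in orders for row in flatten_nested_array(doc, "items")]

# Quality check after the transform step.
anomalies = flag_anomalies(rows, "items.qty")
```

Flattening first keeps the anomaly check independent of the document shape, so the screening rule could later be swapped for a learned detector without touching the transform step.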

