Performance Metrics in Machine Learning: A Comprehensive Narrative Review
Akolgo Eric Ayintareba *
Department of Computer Science, Regentropfen University College, Bongo, Ghana.
Dennis Redeemer Korda
Department of Computing & Information Technology, Bolgatanga Technical University, Sumbrungu, Ghana.
Obeng Owusu-Boateng
Department of Mathematics & ICT, E. P College of Education, Bimbilla, Ghana.
*Author to whom correspondence should be addressed.
Abstract
Machine learning has become a cornerstone of data-driven decision-making across scientific, industrial, and social domains. Central to the rigorous development and deployment of machine learning models is the selection and interpretation of appropriate performance metrics. Yet the field lacks a unified framework that addresses the breadth of metric choices across diverse learning paradigms, from supervised classification and regression to unsupervised clustering, reinforcement learning, and federated learning. This review synthesises the current state of knowledge on machine learning performance metrics, examining their theoretical foundations, practical applications, and limitations. Drawing on peer-reviewed literature published between 2000 and 2026, the review covers classification metrics (accuracy, precision, recall, F1-score, AUC-ROC, Matthews correlation coefficient), regression metrics (mean absolute error, root mean squared error, R²), clustering validity indices, uncertainty and calibration metrics, fairness and bias measures, and emerging evaluation paradigms in deep learning, adversarial robustness, and federated learning. The review further addresses the challenge of metric selection under class imbalance, in multi-class settings, and in high-stakes application domains such as medicine and criminal justice. The findings reveal persistent tensions between simplicity and informativeness in metric design and underscore the need for domain-sensitive, multi-metric evaluation frameworks. Recommendations are offered for researchers and practitioners seeking to align metric choice with task objectives, data characteristics, and ethical constraints.
Keywords: Machine learning evaluation, performance metrics, clustering validity, fairness metrics, model calibration, deep learning evaluation.