ML evaluation metrics
What are ML evaluation metrics?
Evaluation metrics are quantitative measures used to assess the accuracy, performance, and robustness of machine learning models. Different tasks (e.g., classification vs. regression) require different metrics.
Why it matters in AI/ML
Using the wrong evaluation metric can mislead teams about model performance and lead to incorrect decisions. A model that looks “accurate” may still fail if it performs poorly on minority classes, edge cases, or high-risk predictions.
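To make this concrete, here is a minimal Python sketch (using scikit-learn and made-up labels) in which a model that always predicts the majority class reaches 95% accuracy while recalling none of the positive cases:

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks strong
print(recall_score(y_true, y_pred))    # 0.0  -- misses every positive case
```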
Common metrics by task
1. Classification metrics
- Accuracy: Overall proportion of correct predictions
- Precision: How many predicted positives are actually positive
- Recall: How many actual positives were correctly predicted
- F1 Score: Harmonic mean of precision and recall
- AUC-ROC: Measures the model’s ability to distinguish between classes across decision thresholds
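A minimal scikit-learn sketch of these classification metrics, assuming a binary task with toy labels and predicted scores:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities for AUC-ROC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))  # uses scores, not hard labels
```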
2. Regression metrics
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual values
- Mean Squared Error (MSE): Average of squared differences (penalizes large errors more heavily)
- R² (Coefficient of Determination): Proportion of variance in the target values explained by the predictions
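A short sketch of the regression metrics above, again using scikit-learn and made-up values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]   # actual target values
y_pred = [2.5, 5.5, 3.0, 6.0]   # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute error
print("MSE:", mean_squared_error(y_true, y_pred))   # squared errors punish large misses
print("R² :", r2_score(y_true, y_pred))             # share of target variance explained
```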
3. Ranking / recommendation metrics
- Precision@K: How many of the top-K recommended items are relevant
- NDCG (Normalized Discounted Cumulative Gain): Evaluates the quality of ranked lists, rewarding relevant items placed nearer the top
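A small illustration, assuming toy relevance judgments, a hand-rolled Precision@K helper, and scikit-learn's ndcg_score:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical relevance judgments for 6 candidate items (1 = relevant, 0 = not)
relevance = np.array([[1, 0, 1, 1, 0, 0]])
# Model-assigned ranking scores for the same items
scores    = np.array([[0.9, 0.8, 0.7, 0.3, 0.6, 0.1]])

def precision_at_k(rel, sc, k):
    """Fraction of the top-k scored items that are relevant."""
    top_k = np.argsort(sc[0])[::-1][:k]
    return rel[0][top_k].mean()

print("Precision@3:", precision_at_k(relevance, scores, k=3))
print("NDCG@3     :", ndcg_score(relevance, scores, k=3))
```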
Advanced and fairness-oriented metrics
- Confusion Matrix: Tabulates true positives, false positives, true negatives, and false negatives
- Bias Metrics: Evaluate disparities in outcomes across demographic groups
- Coverage/Error Tradeoffs: Balancing how many inputs the model predicts on confidently against the error rate of those predictions
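A brief sketch of a confusion matrix plus a toy group-disparity check; the group labels here are purely hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# Toy demographic-parity check: compare positive-prediction rates across
# two hypothetical groups ("A"/"B" are made up for illustration)
group = np.array(["A", "B", "A", "A", "B", "B", "A", "B"])
for g in ("A", "B"):
    print(f"Positive prediction rate, group {g}: {y_pred[group == g].mean():.2f}")
```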
Best practices
- Align metrics with business or user goals
- Track multiple metrics (not just accuracy)
- Evaluate on both validation and production data
- Monitor metrics over time to catch regressions
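One way to act on these practices is to compute several metrics in a single evaluation helper and compare the results across validation and production runs; the helper below is an illustrative sketch, not a prescribed API:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute several classification metrics at once so no single number dominates."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Run the same report on validation data now and on production samples later,
# then compare the two dictionaries over time to catch regressions.
print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```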
Related
Explore how evaluation fits into the broader model lifecycle in entries like ML testing or Model monitoring.