ML evaluation metrics
What are ML evaluation metrics?
Evaluation metrics are quantitative measures used to assess the accuracy, performance, and robustness of machine learning models. Different tasks (e.g., classification vs. regression) require different metrics.
Why it matters in AI/ML
Using the wrong evaluation metric can mislead teams about model performance and lead to incorrect decisions. A model that looks “accurate” may still fail if it performs poorly on minority classes, edge cases, or high-risk predictions.
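To make this concrete, here is a minimal Python sketch (using scikit-learn and made-up labels) in which a model that always predicts the majority class reaches 95% accuracy while recalling none of the positive cases:

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks strong
print(recall_score(y_true, y_pred))    # 0.0  -- misses every positive case
```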
Common metrics by task
1. Classification metrics
- Accuracy: Overall proportion of correct predictions
- Precision: How many predicted positives are actually positive
- Recall: How many actual positives were correctly predicted
- F1 Score: Harmonic mean of precision and recall
- AUC-ROC: Measures the model’s ability to distinguish between classes across decision thresholds
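A minimal scikit-learn sketch of these classification metrics, assuming a binary task with toy labels and predicted scores:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities for AUC-ROC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))  # uses scores, not hard labels
```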
2. Regression metrics
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual values
- Mean Squared Error (MSE): Average of squared differences (penalizes large errors more heavily)
- R² (Coefficient of Determination): Proportion of variance in the target values explained by the predictions
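A short sketch of the regression metrics above, again using scikit-learn and made-up values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]   # actual target values
y_pred = [2.5, 5.5, 3.0, 6.0]   # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute error
print("MSE:", mean_squared_error(y_true, y_pred))   # squared errors punish large misses
print("R² :", r2_score(y_true, y_pred))             # share of target variance explained
```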
3. Ranking / recommendation metrics
- Precision@K: How many of the top-K recommended items are relevant
- NDCG (Normalized Discounted Cumulative Gain): Evaluates the quality of ranked lists, rewarding relevant items placed nearer the top
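A small illustration, assuming toy relevance judgments, a hand-rolled Precision@K helper, and scikit-learn's ndcg_score:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical relevance judgments for 6 candidate items (1 = relevant, 0 = not)
relevance = np.array([[1, 0, 1, 1, 0, 0]])
# Model-assigned ranking scores for the same items
scores    = np.array([[0.9, 0.8, 0.7, 0.3, 0.6, 0.1]])

def precision_at_k(rel, sc, k):
    """Fraction of the top-k scored items that are relevant."""
    top_k = np.argsort(sc[0])[::-1][:k]
    return rel[0][top_k].mean()

print("Precision@3:", precision_at_k(relevance, scores, k=3))
print("NDCG@3     :", ndcg_score(relevance, scores, k=3))
```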
Advanced and fairness-oriented metrics
- Confusion Matrix: Tabulates true positives, false positives, true negatives, and false negatives
- Bias Metrics: Evaluate disparities in outcomes across demographic groups
- Coverage/Error Tradeoffs: Balancing how many inputs the model predicts on confidently against the error rate of those predictions
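A brief sketch of a confusion matrix plus a toy group-disparity check; the group labels here are purely hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# Toy demographic-parity check: compare positive-prediction rates across
# two hypothetical groups ("A"/"B" are made up for illustration)
group = np.array(["A", "B", "A", "A", "B", "B", "A", "B"])
for g in ("A", "B"):
    print(f"Positive prediction rate, group {g}: {y_pred[group == g].mean():.2f}")
```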
Best practices
- Align metrics with business or user goals
- Track multiple metrics (not just accuracy)
- Evaluate on both validation and production data
- Monitor metrics over time to catch regressions
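One way to act on these practices is to compute several metrics in a single evaluation helper and compare the results across validation and production runs; the helper below is an illustrative sketch, not a prescribed API:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute several classification metrics at once so no single number dominates."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Run the same report on validation data now and on production samples later,
# then compare the two dictionaries over time to catch regressions.
print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```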
Related
Explore how evaluation fits into the broader model lifecycle in entries like ML testing or Model monitoring.