A beginner’s guide to evaluating machine learning models beyond aggregate metrics

A high accuracy is not enough

How do you decide if you have a high-quality machine learning (ML) model in your hands?

The insights that help you answer this question should come from the model evaluation stage. However, more often than not, people treat “model evaluation” as a synonym for computing overall aggregate metrics (such as accuracy or F1) on a validation set.

The problem is that seeing aggregate metrics as proxies for model quality can be very misleading. In this post, we will show three ways you can evaluate models beyond aggregate metrics.

The problem with aggregate metrics

We could spend all day listing the ways in which relying too much on overall aggregate metrics can deceive you. Model evaluation should assess model quality, and the quality of a model goes way beyond a single number. In this post, however, we will focus on three main problems and how each can be overcome.

The first problem of relying too much on aggregate metrics is that they are hard to interpret in isolation. This often leads to misinterpretation of model capabilities and makes it harder to get management buy-in, since aggregate metrics are usually not directly tied to business value.

The second is that overall aggregate metrics provide a very low-resolution image of model performance. Consequently, an evaluation that leans too heavily on them is prone to selecting models that later fail silently in production.

Finally, aggregate metrics tell us nothing about why our models perform the way they do. A team that only looks at aggregate metrics might be deceived by a model that appears to perform well but, in fact, relies on spurious patterns in the training set to make its predictions.

To overcome these problems, one must expand their model evaluation process. That’s what we will explore in the next section.

Model evaluation beyond aggregate metrics


Benchmarks

Imagine you are building an ML model that predicts house prices. After training the model, you obtain a mean squared error (MSE) of 120,000 dollars^2 on the validation set.

Now you need to decide: is such an MSE good or not?

Note how answering one of the main questions from model evaluation is not as trivial as it seems. Even for a standard metric such as the MSE, it is impossible to answer this question by just staring at a single number. More context is needed.

Benchmarks are one piece of this additional context. To adequately evaluate models, we should strive to iteratively compare their performance to benchmarks, which serve as goalposts along the way. They help us ground the evaluation process and ensure we are moving in the right direction.

Back to our example, if the model you are developing is expected to replace an existing system (ML powered or not), the natural benchmark would be the performance of such a system. As you progress in the model evaluation process, you should often ask yourself if the current model surpasses your benchmark. Otherwise, you risk pouring energy into something that ultimately won’t be used.

Alternatively, if you are developing the model from scratch, it is important to start with simple benchmarks and progressively increase their complexity over time. Usually, a good place to start is the performance of a random (chance) model. From there, you can move to rule-based estimates, then to simple models that use a subset of the features, and so on. This is similar to the approach we explored in detail in our baseline models post.
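To make the idea of a benchmark ladder concrete, here is a minimal sketch for the house price example. Everything in it is an assumption made for illustration: the data is synthetic, a mean-price predictor stands in for the naive starting point, the 1,100 dollars per square meter rule of thumb is invented, and the "model" is a one-feature least-squares fit.

```python
import random
import statistics


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Synthetic data (an assumption): price grows with area, plus noise.
random.seed(0)
areas = [random.uniform(50, 250) for _ in range(200)]
prices = [1_000 * area + random.gauss(0, 20_000) for area in areas]
train_x, val_x = areas[:150], areas[150:]
train_y, val_y = prices[:150], prices[150:]

slope, intercept = fit_line(train_x, train_y)
mean_price = statistics.fmean(train_y)

# Each entry is one rung of the ladder, scored on the same validation set.
results = {
    "mean price (naive)": mse(val_y, [mean_price] * len(val_y)),
    "rule of thumb (1,100 dollars per m^2)": mse(val_y, [1_100 * x for x in val_x]),
    "linear model on area": mse(val_y, [slope * x + intercept for x in val_x]),
}

# Print from worst to best so each number can be read against the previous one.
for name, value in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:40s} MSE = {value:,.0f} dollars^2")
```

Read on its own, none of these MSE values says much. Read side by side, they show whether each extra bit of complexity is actually buying something.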

Benchmarks are also useful during the last mile of model development. At that stage, it is common to discuss the results from model evaluation with people from a business background. Standard ML aggregate metrics might not be the best way to convince them of how great your model is. First, most aggregate metrics are not intuitive. Second, they might not be directly tied to business metrics, which is what a company cares about at the end of the day.

In these situations, well-crafted benchmarks that are easy to compute and understand are extremely valuable. How these benchmarks look will vary greatly between problems, and they can even be the human performance for the task at hand. After all, the argument in favor of a model that surpasses the performance of real estate agents is much stronger than any MSE score in our example.

Data cohorts

Aggregate metrics are convenient because they summarize a lot of information into a single number. However, as a consequence, they provide a low-resolution picture of what’s going on with our model performance.

Hidden behind a high accuracy, for example, are usually many data cohorts where our model performance is not so stellar. Model evaluation based mostly on overall aggregate metrics is prone to choosing models that fail silently on some data pockets.

Let’s look at a real example.

Consider a churn classifier that predicts whether a user is likely to churn from demographic and usage data. The code for this example is available on GitHub. The resulting model’s accuracy on the validation set is equal to 0.71.

Now, if we start to dig deeper and explore different data cohorts, we see that there are underperforming subpopulations. For example, if we evaluate our model only on the rows of the dataset where the feature Gender is Female, the accuracy drops to 0.52. This is well below the overall accuracy of 0.71.
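The cohort check described above boils down to grouping validation rows by a feature value and computing the metric per group. Below is a minimal sketch with a handful of made-up rows. The column names `label` and `prediction` and the data itself are assumptions, not the churn dataset from the example.

```python
from collections import defaultdict


def accuracy_by_cohort(rows, feature):
    """Return {feature value: (accuracy, number of rows)} for one feature."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        key = row[feature]
        totals[key] += 1
        hits[key] += int(row["label"] == row["prediction"])
    return {key: (hits[key] / totals[key], totals[key]) for key in totals}


def make_rows(gender, pairs):
    """Build rows from (label, prediction) pairs for one gender value."""
    return [{"Gender": gender, "label": y, "prediction": p} for y, p in pairs]


# Hypothetical validation rows: 4 for Female, 6 for Male.
rows = make_rows("Female", [(1, 0), (0, 0), (1, 1), (0, 1)]) + make_rows(
    "Male", [(0, 0), (1, 1), (0, 0), (1, 1), (0, 0), (1, 0)]
)

overall = sum(r["label"] == r["prediction"] for r in rows) / len(rows)
cohorts = accuracy_by_cohort(rows, "Gender")

print(f"overall accuracy: {overall:.2f}")  # 0.70
for value, (acc, size) in cohorts.items():
    print(f"Gender={value}: accuracy {acc:.2f} on {size} rows")
```

In this toy data the overall accuracy of 0.70 hides a cohort at 0.50. Reporting the cohort size next to the accuracy matters too, since a bad number on a handful of rows may just be noise.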

(Screenshot from Openlayer.)

There are also many other data cohorts defined by feature value combinations that result in high error rates, as shown in the image below. In this case, we are using an algorithm to automatically find the data cohorts that might be problematic for the model.
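We are not describing the exact algorithm behind the screenshots here. As a rough sketch of the idea, one simple approach is to enumerate cohorts defined by combinations of feature values and rank them by error rate, ignoring cohorts that are too small to be meaningful. The function name, the `min_size` threshold, and the data below are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def worst_cohorts(rows, features, max_order=2, min_size=2):
    """Rank cohorts (conjunctions of feature=value) by error rate, worst first."""
    stats = defaultdict(lambda: [0, 0])  # cohort -> [errors, size]
    for row in rows:
        wrong = int(row["label"] != row["prediction"])
        for order in range(1, max_order + 1):
            for combo in combinations(features, order):
                cohort = tuple((name, row[name]) for name in combo)
                stats[cohort][0] += wrong
                stats[cohort][1] += 1
    ranked = [
        (errors / size, size, cohort)
        for cohort, (errors, size) in stats.items()
        if size >= min_size  # skip cohorts too small to trust
    ]
    # Highest error rate first; larger cohorts first among ties.
    return sorted(ranked, key=lambda item: (-item[0], -item[1]))


def row(gender, plan, label, prediction):
    return {"Gender": gender, "Plan": plan, "label": label, "prediction": prediction}


# Hypothetical validation rows with two categorical features.
rows = [
    row("Female", "Basic", 1, 0), row("Female", "Basic", 1, 0),
    row("Female", "Premium", 0, 0), row("Female", "Premium", 1, 1),
    row("Male", "Basic", 0, 0), row("Male", "Basic", 1, 1),
    row("Male", "Premium", 0, 0), row("Male", "Premium", 0, 1),
]

ranked = worst_cohorts(rows, ["Gender", "Plan"])
for error_rate, size, cohort in ranked[:3]:
    description = " and ".join(f"{name}={value}" for name, value in cohort)
    print(f"{description}: error rate {error_rate:.2f} on {size} rows")
```

Brute-force enumeration like this grows quickly with the number of features and distinct values, so in practice you would cap `max_order` or use a smarter search.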

(Screenshot from Openlayer.)

Identifying such subpopulations is critical to evaluating models. Otherwise, our view of model performance can be myopic.


Explainability

ML models are trained to pick up patterns in the training data. The problem is that datasets contain both good and bad patterns: signal as well as noise. Thus, if we are not thorough during model evaluation, we can end up with models that rely on spurious information.

Models that seem to be performing well on the surface might be overfitting to noise in the training data. This happens because aggregate metrics tell us nothing about why the models behave the way they do.

Filling this gap is one of the key roles of explainability.

Using post-hoc explainability techniques, such as LIME or SHAP, we can start to shed light on black box models. They help us identify which features contributed the most to the (mis)predictions made by the model. Therefore, we can ensure that our models are making predictions based on reasonable information by inspecting the feature importance scores.
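To give a feel for what such feature attributions are, the sketch below computes exact Shapley values, the quantity that SHAP approximates, for a single prediction of a tiny made-up churn scorer. This is not the `shap` or `lime` library API. The scoring function, the feature names, and the choice of replacing "missing" features with values from a baseline row are assumptions, and the brute-force loop over orderings is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction, averaging each feature's
    marginal contribution over every possible feature ordering."""
    features = list(instance)
    totals = dict.fromkeys(features, 0.0)
    orderings = list(permutations(features))
    for ordering in orderings:
        current = dict(baseline)  # start from the reference input
        previous = predict(current)
        for name in ordering:
            current[name] = instance[name]  # reveal one feature at a time
            score = predict(current)
            totals[name] += score - previous
            previous = score
    return {name: total / len(orderings) for name, total in totals.items()}


def churn_score(x):
    """Hypothetical churn scorer with an age/gender interaction term."""
    return (
        0.01 * x["age"]
        + 0.2 * x["is_female"]
        - 0.02 * x["monthly_logins"]
        + 0.005 * x["age"] * x["is_female"]
    )


instance = {"age": 60, "is_female": 1, "monthly_logins": 2}
baseline = {"age": 40, "is_female": 0, "monthly_logins": 10}

attributions = shapley_values(churn_score, instance, baseline)
for name, value in sorted(attributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15s} {value:+.3f}")
```

The attributions add up to the difference between the score for this user and the score for the baseline row, which is what makes them readable as "how much each feature pushed the prediction".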

Quoting the original LIME paper, “understanding the reasons behind predictions is quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model.”

For instance, in the image below, we note how the churn classifier we mentioned in the previous section seems to attribute a lot of importance to the features Age and Gender. Using domain expertise, we can decide if this is expected or if the model is biased.

(Screenshot from Openlayer.)

Tabular data is not the only kind of data that benefits from explainability techniques during model evaluation. They are just as important in natural language processing (NLP) problems. Consider a sentiment analysis task, where a phrase is categorized as positive or negative. If a model predicts that the sentence “I’m having a great day” is positive because of the words “having” and “day” and not because of the word “great”, is it really a good model?

Not really. Ideally, the model should be predicting a positive label because of the word “great,” which is a strong positive word. Predicting the right label is only half of the story.
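One simple way to check which words drive a prediction is occlusion: remove one word at a time and see how much the score changes. The sketch below applies this to the sentence above with a toy lexicon-based scorer. The scorer and its word weights are invented for illustration, and occlusion is a simpler stand-in for LIME or SHAP, not an implementation of either.

```python
def sentiment_score(tokens):
    """Toy scorer (an assumption): sums hand-picked word weights.
    Positive totals mean positive sentiment."""
    weights = {"great": 2.0, "terrible": -2.0, "day": 0.25}
    return sum(weights.get(token.lower(), 0.0) for token in tokens)


def word_importance(score_fn, sentence):
    """Leave-one-word-out importance: the drop in score when a word is removed."""
    tokens = sentence.split()
    full_score = score_fn(tokens)
    return [
        (token, full_score - score_fn(tokens[:i] + tokens[i + 1:]))
        for i, token in enumerate(tokens)
    ]


importances = word_importance(sentiment_score, "I'm having a great day")
for token, importance in importances:
    print(f"{token:8s} {importance:+.2f}")

most_important = max(importances, key=lambda pair: pair[1])[0]
print("most important word:", most_important)  # great
```

For this toy scorer the word "great" carries the prediction, as we would hope. If the same check on a real model pointed at "having" or "day" instead, that would be a red flag even when the predicted label is correct.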


The process followed during model evaluation can make or break an ML application. In this post, we explored some of the potential problems of evaluating models mostly using overall aggregate metrics. By incorporating benchmarks, data cohort analysis, and explainability into your evaluation stack, you can detect many issues that otherwise would go unnoticed. These three are just the first step of model evaluation beyond aggregate metrics.
