Testing and its many guises

Understand the need for testing and learn three ML testing frameworks to help you ship with confidence

Tests are all around us and appear wearing many guises. In a manufacturing facility, tests might look like checklists that ensure that a product meets the quality standards. In a university, they may look like a few sheets of paper with questions that try to measure students’ understanding. In software development, they usually look like scripts that make a piece of software go through specific scenarios and check its behavior.

Regardless of whether we are in the world of bits or atoms, tests are a way to ensure that something works as expected.

Tests can never be perfect. So what makes them so ubiquitous?

Despite their imperfections, tests are one of the best ways we have to catch potential flaws before they become much more expensive. After all, it is cheaper to send a product through a manufacturing process again than to recall it, to make a student relearn a concept than to deal with a professional who lacks the basic knowledge their occupation requires, and to fix a bug before the code is deployed to production than after.

In software engineering, a lot of research has been dedicated to developing systematic and rigorous testing frameworks, to the point where test-driven development has become a common practice. In machine learning (ML), a field not that far away, tests are not as popular as they should be.

Testing in ML (if done at all) usually consists of a single engineer writing a script to test a few cases that came up during a sloppy error analysis procedure. The consequence of not taking testing seriously is that, more often than not, models are shipped making obvious mistakes or, worse, exhibiting biases and behaving in unethical ways.

In this post, we explore what makes testing in ML different from traditional software testing; then, we clarify the differences between model evaluation and model testing; finally, we present three ML model testing possibilities.

If you are already comfortable with ML testing, feel free to skip this post and start testing your models with Openlayer right away!

The quirks of ML testing

Traditionally, software tests are broadly categorized into three different buckets:

  • Unit tests focus on atomic pieces of the code with a single responsibility, such as a function;
  • Integration tests verify the combined functionality of multiple atomic pieces;
  • Regression tests reproduce bugs that were encountered and fixed during the development process, to ensure they are not reintroduced by new versions.

These three buckets do not exhaust all the possible tests that a piece of software can be put through, but they encompass a lot of them.
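To make the three buckets concrete, here is a minimal sketch in Python. The functions `normalize` and `tokenize`, and the past bug mentioned in the regression test, are hypothetical and exist only for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def tokenize(text: str) -> list:
    """Split already-normalized text into tokens."""
    return text.split(" ") if text else []


def test_unit_normalize():
    # Unit test: one function with a single responsibility.
    assert normalize("  Hello   World ") == "hello world"


def test_integration_pipeline():
    # Integration test: two atomic pieces working together.
    assert tokenize(normalize("  Hello   World ")) == ["hello", "world"]


def test_regression_blank_input():
    # Regression test: pins a (hypothetical) past bug in which blank
    # input produced [""] instead of an empty list.
    assert tokenize(normalize("   ")) == []
```

A test runner such as pytest would collect the `test_*` functions automatically; they can also be called directly.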

When we talk about ML, there are a couple of fundamental distinctions from traditional software that are important to take into account in the context of testing.

The first one is that while the logic behind traditional software systems is written explicitly by a human, ML models learn the logic that dictates their behavior from examples. Thus, traditional software tests are more direct and verify aspects of the programmed logic; ML tests, in contrast, are more indirect and focus on ensuring that the model learns an appropriate logic.

The second difference is that a lot of the popular ML models have a significant stochastic component to their behavior. For example, a plethora of models benefit from a dose of randomness during the learning stages to produce better results. As a consequence, ML tests will generally focus on the deterministic components of the data and the model.
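One way to picture this, as a rough sketch: fix the random seed so that the behavior under test becomes reproducible, and use loose bounds rather than exact values for anything that still varies. The `train` function below is a made-up stand-in for a real training routine.

```python
import random


def train(data, seed):
    """Toy 'training': estimate a mean from a random mini-batch (hypothetical)."""
    rng = random.Random(seed)        # local generator, so the seed fully controls it
    batch = rng.sample(data, k=5)    # the stochastic step
    return sum(batch) / len(batch)


data = list(range(100))

# Deterministic component: the same seed must reproduce the same result.
assert train(data, seed=0) == train(data, seed=0)

# Stochastic component: across seeds the result varies, so a test can only
# assert a bound (here, a deliberately loose one), not an exact value.
estimates = [train(data, seed=s) for s in range(20)]
assert all(0 <= estimate <= 99 for estimate in estimates)
```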

ML evaluation x ML testing

If you’ve read our blog post on model evaluation, you might be wondering: why do we need testing when we have already evaluated a model?

Unfortunately, model evaluation does not solve all of our problems. When performing model evaluation, we are mainly interested in estimating our model’s generalization capacity, i.e., its performance on data it has not seen during training. The generalization capacity is a quantity that every stakeholder deeply cares about, and there are reliable ways to estimate it, such as with a holdout dataset or via cross-validation.

The problem is that aggregate metrics, such as accuracy or precision, whether computed on a holdout set or via cross-validation, tell us little about what the model has actually learned or how that performance translates to different subsets of the data. Furthermore, such a metric offers only a glimpse of how the model will behave in the wild, where it will encounter a long tail of edge cases.

Model evaluation alone cannot address all of these concerns. The way to increase trust in a model and ship with confidence is through error analysis. Error analysis is an umbrella term that encompasses many activities, one of which is ML testing, which borrows ideas from traditional software testing and applies them to ML as a way to ensure model quality.
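As a rough illustration of cross-validation, the sketch below splits a tiny made-up label list into k folds, fits a toy majority-class "model" on each training split, and averages the per-fold accuracy. Both `k_fold_indices` and `fit_majority` are illustrative helpers, not part of any particular library.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_majority(labels):
    """Toy 'model': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


# Made-up binary labels; a real dataset would also carry features.
labels = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

scores = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    prediction = fit_majority([labels[i] for i in train_idx])
    correct = sum(labels[i] == prediction for i in test_idx)
    scores.append(correct / len(test_idx))

# A single aggregate number summarizing all folds.
cv_accuracy = sum(scores) / len(scores)
```

The result is one number for the whole dataset, which is exactly why it says so little about individual subsets.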

Three model testing frameworks

As we mentioned earlier, the focus of ML testing is ensuring that the model learns an appropriate logic from data. Furthermore, the tests are generally centered on the deterministic components of the data and of the model. With that in mind, many authors divide ML testing into two parts: data testing and model testing.

In this section, we go through three model testing possibilities.

1. Confidence tests

It is easy to be misled by aggregate metrics, such as accuracy or precision, calculated over whole datasets. The model performance might not be uniform over all the cohorts of the data, and you may even find data pockets with specific failure modes. In this context, building on the ideas from error cohort analysis, it is possible to define confidence tests.

  • Objective: assert that the model’s performance surpasses a particular threshold for different subgroups of the data;
  • Recipe: separate various subgroups of the data that are of interest and evaluate the model on them; alternatively, randomly sample data instances from a larger dataset and evaluate the model on them;
  • Insights: creates a higher-resolution picture of model performance and avoids deploying a model with highly non-uniform performance.

In one of our previous posts, we explored a hypothetical example using a model that predicts whether a user will churn or not based on a set of features, such as age, gender, geography, and others. As is often the case, a high accuracy on the validation set (in this case, equal to 90%) is masking a very non-uniform performance across the different cohorts of the data, as shown below.

Testing such a model using the confidence test framework presented might help reveal, before shipping the model, that its performance is not satisfactory in all of the data cohorts of interest.
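A confidence test along these lines could be sketched as follows. The rows, the `geography` values, and the 0.8 threshold are all made up for illustration; in practice the predictions would come from your own model and validation set.

```python
from collections import defaultdict


def accuracy_by_cohort(rows, cohort_key):
    """Compute accuracy separately for each value of `cohort_key`."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        cohort = row[cohort_key]
        totals[cohort] += 1
        hits[cohort] += int(row["label"] == row["prediction"])
    return {cohort: hits[cohort] / totals[cohort] for cohort in totals}


def failing_cohorts(rows, cohort_key, threshold):
    """Return the cohorts whose accuracy falls below `threshold`."""
    scores = accuracy_by_cohort(rows, cohort_key)
    return {cohort: score for cohort, score in scores.items() if score < threshold}


# Made-up predictions: 8 rows from geography "A", 2 rows from geography "B".
rows = (
    [{"geography": "A", "label": 1, "prediction": 1}] * 8
    + [{"geography": "B", "label": 1, "prediction": 1}]
    + [{"geography": "B", "label": 1, "prediction": 0}]
)

# Aggregate accuracy looks healthy...
overall = sum(r["label"] == r["prediction"] for r in rows) / len(rows)

# ...but the per-cohort view shows that geography "B" is below the bar.
failures = failing_cohorts(rows, "geography", threshold=0.8)
```

In a test suite, the assertion would simply be that `failures` is empty; here it is not, which is the signal to investigate cohort "B" before shipping.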

2. Invariance tests

ML models should remain invariant under certain scenarios, i.e., their predictions should not change when an input is altered in ways that are irrelevant to the task. Invariance tests leverage the power of synthetic data to verify model robustness.

  • Objective: assert that the model’s predictions remain invariant for particular data samples;
  • Recipe: generate synthetic data that looks like the original samples available and check if the model’s predictions remain the same;
  • Insights: allows practitioners to identify edge cases before shipping and also use the generated synthetic data to retrain the model if needed.

The crucial step in invariance tests is generating synthetic data that manifests the kind of invariance that the model should exhibit. In computer vision applications, for example, to ensure that the model’s predictions are invariant to translations and rotations, it is common to perturb the original data samples to augment the dataset. An invariance test in this case could check whether the model identifies that all of the images below are images of a cat. After all, a cat at the corner of the image or upside down is still a cat.

There are many ways to generate synthetic data, with varying degrees of complexity. It can be as simple as creating data samples from a template or as complex as using generative adversarial networks (GANs). An example from natural language processing (NLP) could be testing invariance to first names by generating synthetic data from a template and asserting whether the model’s predictions remain the same.
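The template-based idea can be sketched in a few lines. The keyword-based `toy_sentiment_model` is a hypothetical stand-in for a real NLP model, and the template and names are invented for this example.

```python
def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for a real model: keyword-based sentiment."""
    negative_words = {"terrible", "awful", "bad"}
    words = text.lower().replace(".", "").split()
    return "negative" if any(word in negative_words for word in words) else "positive"


def invariance_failures(model, template, slot_values):
    """Fill the template with each value and return the values whose
    prediction differs from the prediction for the first value."""
    texts = [template.format(name=value) for value in slot_values]
    predictions = [model(text) for text in texts]
    reference = predictions[0]
    return [value for value, pred in zip(slot_values, predictions) if pred != reference]


template = "{name} said the service was terrible."
names = ["Alice", "Bruno", "Chen", "Fatima"]

# The first name should not influence the predicted sentiment.
failures = invariance_failures(toy_sentiment_model, template, names)
assert failures == []
```

Any name that ends up in `failures` is both a bug report and a candidate sample for retraining.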

Borrowing insights from software engineering, Marco Tulio Ribeiro et al. proposed CheckList, a new testing methodology for NLP models. The template example shown above is just one of the many testing methods proposed in the paper. This work shows that “although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models”. Moreover, “NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it”.

3. Counterfactual and adversarial tests

This is probably the most sophisticated test among the ones presented in this post. The idea is to build on research results from counterfactual and adversarial analysis and create tests that strive to flip the predictions made by a model by manipulating the feature values.

  • Objective: flip the predictions made by a model by manipulating the input values;
  • Recipe: there are different adversarial attack recipes proposed in the literature, which are highly dependent on the data types;
  • Insights: further understand the model’s predictions, increase trust in the model, and uncover possible biases.

Consider a model that either approves or rejects a loan based on a set of applicant features, such as income, current debt, and gender. While it makes sense for the model to change its prediction if we vary the applicant’s income, it should not change its prediction when only the gender changes, all other features being equal.

Source: modified from R. K. Mothilal et al., “Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations”, FAT, 2020.

Not all situations are as straightforward as the one presented here. That’s why it can be worthwhile to take advantage of the attack recipes presented in the literature and uncover possible biases while there is still time to take corrective action.
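For the simple loan scenario, a counterfactual test might look like the sketch below. The `toy_loan_model`, its scoring rule, and the applicant values are all invented, and the model is deliberately biased so that the test has something to catch. Real adversarial recipes from the literature are considerably more involved.

```python
def toy_loan_model(applicant: dict) -> str:
    """Hypothetical model with a deliberately injected bias, for illustration only."""
    score = applicant["income"] - 2 * applicant["debt"]
    if applicant["gender"] == "female":
        score -= 10_000  # the injected bias the test should reveal
    return "approved" if score >= 30_000 else "rejected"


def counterfactual_flips(model, applicant, feature, alternatives):
    """Return the alternative values of `feature` that flip the prediction,
    keeping every other feature fixed."""
    baseline = model(applicant)
    flips = []
    for value in alternatives:
        if value == applicant[feature]:
            continue  # not a counterfactual, same as the original
        counterfactual = {**applicant, feature: value}
        if model(counterfactual) != baseline:
            flips.append(value)
    return flips


applicant = {"income": 50_000, "debt": 8_000, "gender": "male"}

# Changing income is expected to be able to flip the decision.
income_flips = counterfactual_flips(toy_loan_model, applicant, "income", [20_000])

# Changing gender alone should never flip it; any entry here signals bias.
gender_flips = counterfactual_flips(toy_loan_model, applicant, "gender", ["male", "female"])
```

With this toy model, `gender_flips` is not empty, so a test asserting that it is empty would fail and flag the bias before the model ships.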

Systematic testing with Openlayer

The only way to ship with confidence is through extensive testing procedures. With Openlayer, it is possible to create tests that evaluate your ML model across multiple dimensions.

All of the tests presented in this post (and many more) are available at Openlayer for models that work with tabular or language data, and testing them with state-of-the-art frameworks is just a few clicks away.

For instance, with invariance tests in the context of NLP, it is possible to check model invariance to typos, paraphrases, changes in names, changes in locations, and more, all of which are based on the research advances presented in the CheckList paper.

We are constantly implementing new testing frameworks to ensure that ML engineers and data scientists ship high-quality models. If you want to start testing your ML models right away, upload a model and a dataset to Openlayer and have fun!

* A previous version of this article listed the company name as Unbox, which has since been rebranded to Openlayer.
