Building the future of ML

The path towards performant and explainable machine learning


When we look back in history and see how the foundations of the technologies we use today were developed, we are often in awe that things worked out at all. For example, think about what the software development process looked like as recently as a few decades ago. Isn’t it impressive that whole companies were built by engineers editing source code with partly manual version control?

A lot of great software was built like this, but today, no one would dare to say they preferred the old way, particularly considering the widespread availability of tools like git that make the development process much smoother.

The tipping point

Machine learning (ML) is at an interesting point in time. On the one hand, the world as a whole is convinced of its potential and its applications are already influencing billions of lives. On the other, the way ML models are developed in the industry is often far from ideal, to say the least.

ML engineers and data scientists usually receive datasets as CSV files, which they store or access locally on their machines; they train different models in a few Jupyter notebooks and then run tests with scripts they came up with haphazardly.

Not only that, but engineers are often siloed. It is hard for them to share their work with others, and we cannot help but note the similarities with the early days of software development when we see Jupyter notebooks and error analysis docs bouncing back and forth with names such as training.ipynb, training_v1.ipynb, error_analysis_final.csv, ..., training_please_work_final_7.ipynb.

How can we expect ML to live up to its full potential if the development process still looks like this?

It is not surprising that, from time to time, we see pieces in the media about products powered by ML making obvious mistakes, exhibiting biases, and even behaving in unethical ways. For the teams shipping such products, the result in the best cases is bad PR and upset users; in the worst cases, it can run them out of business or permanently harm people’s lives.

Given how evident the problem is, many companies have been founded to address it. Most of them, though, end up focusing on the tail end of the ML development process. They target managers and executives and provide monitoring solutions that check whether the ML models that are already deployed are still working.

Instead of spending heavily on monitoring solutions alone and enduring the never-ending anxiety of waiting for the moment a model breaks in production, wouldn’t it be better if engineers could catch errors proactively rather than retroactively?

Error analysis

To be able to do so, engineers must conduct rigorous and systematic error analysis.

Error analysis is the attempt to analyze when, how, and why models fail. It involves isolating, observing, and diagnosing erroneous ML predictions, thereby helping teams understand their model’s pockets of high and low performance.

Error analysis is an umbrella term that, in fact, encompasses various activities. Constructing a comprehensive view of model quality is akin to assembling a puzzle, and each activity under the umbrella of error analysis gives us a piece.
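As a minimal sketch of the most basic of these activities, consider isolating erroneous predictions and slicing them by a feature to surface pockets of low performance. The dataset and the `age_group` column below are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical evaluation results: true labels, model predictions,
# and a feature we suspect might define a failure mode.
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-40", "26-40", "41-65", "41-65"],
    "label":     [1, 0, 1, 0, 1, 0],
    "pred":      [0, 1, 1, 0, 1, 0],
})

df["correct"] = df["label"] == df["pred"]

# The aggregate metric looks tolerable...
overall_accuracy = df["correct"].mean()

# ...but slicing by a feature exposes a pocket of low performance.
per_slice_accuracy = df.groupby("age_group")["correct"].mean()

print(f"overall accuracy: {overall_accuracy:.2f}")  # 0.67
print(per_slice_accuracy)  # 18-25 slice has accuracy 0.0
```

Here the overall accuracy of 0.67 hides the fact that the model is wrong on every example in the 18-25 slice; real error analysis repeats this kind of slicing systematically, across many features and their combinations.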

Inspired by this blog post, we can categorize the set of activities an organization conducts into five maturity layers, called L0 to L4. The idea is that this construct can help you assess how mature your organization is at dealing with ML models and point you toward possible next steps.

Most teams are still in L0. They train their models on a training set, perform model selection on a validation set, and assess their models on a test set using aggregate metrics such as accuracy, precision, recall, and F1. To get to L4, teams must incorporate various other activities into their workflows, such as using global and local explanations, taking advantage of adversarial analysis, performing model diffs, and unit testing their data.
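To make two of the L4 activities above concrete, here is a hedged sketch, with hypothetical labels and predictions, of a minimal model diff (flagging examples where a new model version regresses relative to the old one) and a minimal unit test for the data itself:

```python
# Hypothetical labels and predictions from two model versions.
labels   = [1, 0, 1, 1, 0]
preds_v1 = [1, 0, 0, 1, 0]
preds_v2 = [1, 1, 1, 1, 0]

# Model diff: indices where v1 was right but v2 regressed,
# and indices where v2 fixed a v1 mistake.
regressions = [i for i, (y, a, b) in enumerate(zip(labels, preds_v1, preds_v2))
               if a == y and b != y]
fixes = [i for i, (y, a, b) in enumerate(zip(labels, preds_v1, preds_v2))
         if a != y and b == y]

# A minimal "unit test" for the data: labels must be binary and
# every example must have a prediction from both versions.
assert all(y in (0, 1) for y in labels)
assert len(preds_v1) == len(labels) and len(preds_v2) == len(labels)

print(regressions)  # [1]
print(fixes)        # [2]
```

Even this toy diff shows why aggregate metrics are not enough: v2 fixes one mistake and introduces another, so its accuracy is unchanged while its behavior is not.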

ML-mature organizations understand that systematic error analysis should lie at the center of all their efforts. As Stanford professor Andrew Ng puts it, “if you do error analysis well, it will tell you what’s the most efficient use of your time to improve performance”.

For example, using the identified failure modes to guide the efforts of data collection and labeling can greatly save resources. Being able to pinpoint exactly what kind of data is required to boost the model’s performance is a competitive advantage in an age where most organizations collect data in an almost arbitrary way.

This trend is not confined to academia. Data scientists on the front lines are already raising the error analysis flag and increasing the community’s awareness. For instance, this great Twitter thread by senior data scientist Mark Tenenholtz summarizes important principles that should guide practitioners’ work.


If you’ve read this far, you now understand that the ML development pipeline in most companies is far from ideal and that the consequences of shipping ML models that fail silently in production can be dire. Furthermore, error analysis is the light at the end of the tunnel.

Now, if you call a meeting first thing tomorrow morning and excitedly say “Hey team! Let’s start conducting rigorous and systematic error analysis! That’s the way to improve our models,” you will likely get only blank stares from the data scientists and ML engineers.

Currently, conducting error analysis is very challenging and time-consuming, as engineers are left to their own devices to dig through unorganized CSVs and guess why their models aren’t behaving as expected. Today, it is closer to an art than a science. Even though there is indeed a beauty to it, the process must be scientific, with reproducibility and rigor in the leading roles.

This is why we are working so hard to develop a set of tools that helps put error analysis at the heart of the ML development pipeline. We want to take organizations from L0 to L4!

Openlayer is the debugging workspace for machine learning, where companies are able to track and version models, uncover errors, and make informed decisions on data collection and model re-training.

We build on the impressive research results obtained over the last few years in areas such as interpretability and counterfactual analysis to help practitioners construct a much more comprehensive view of model quality.

We take error analysis very seriously at Openlayer. In this blog, we will explore many aspects related to it to help data scientists and ML engineers incorporate the best practices into their workflows. Also, feel free to check out our whitepaper, where we introduce the theme from a broader perspective.

One thing is clear: the path towards performant and explainable machine learning starts with proper error analysis. If you are serious about it, you know where to find us.

* A previous version of this article listed the company name as Unbox, which has since been rebranded to Openlayer.
