3 differences between ML in production and in academia

Going beyond ML models and algorithms


Regardless of whether you took classes at a university or learned from resources online, your ML learning journey likely prepared you to deal with ML models and the algorithms that surround them. However, it only takes a few weeks at a job as an ML engineer or as a data scientist to make you realize that the necessities around ML in production go far beyond training models and performing k-fold cross-validation.

In production, what gets developed are ML systems, and, as systems, their various parts need to interact and work together harmoniously. ML models are just one cog inside this larger machine. Albeit an important one, by themselves they don’t get an organization very far.

For instance, ML systems generally comprise an interface, with which users interact; adequate infrastructure, to support running inference with the ML models; and data management tools, to handle all the data being produced for monitoring and re-training purposes, among many other components.

Notice that if one part of the system is not working correctly, the system as a whole is in danger of not achieving its purpose.

Another realization is that even when we zoom into the core ML bit of the system, the incentives and dynamics that dictate the development processes are very different from what is often taught in the academic world. In this post, we will highlight three of those differences, namely the objectives, the importance of interpretability, and the need for fairness.

And hey — if you are already comfortable with the three differences we’ll talk about, feel free to skip this post and head straight to Openlayer!

1. Objectives

When we think about ML the way it is usually presented in the academic environment, there is a single objective to optimize for: model performance. In contrast, ML in production involves many stakeholders and, as is often the case, each one has a distinct objective when it comes to the ML system. To increase the challenge even further, some objectives conflict and pull in different directions, but despite the conflicts, teams must be able to come up with a solution that suits everyone.

In the face of the nightmare that is dealing with all of the stakeholders’ objectives, it might feel comforting to develop models with only performance in mind. However, the solutions that arise from the single-objective scenario are often too complex to be useful in real-world systems.

For example, in recent years, there was a spike in the popularity of deep learning models due to the amazing results they achieved on many academic benchmarks. These famous models have millions or even billions of parameters, and most organizations simply don’t have the resources to deploy models this large, let alone fine-tune them for their specific needs.

Dealing with the multifaceted nature of ML systems is a big challenge ML engineers and data scientists need to be prepared to encounter. There is a good example illustrating the different objectives in a real-world setting in the lecture notes from Stanford’s course CS329S, which we slightly adapt and reproduce below.

Imagine there is a project to construct an ML system that recommends restaurants to users. Among the project’s stakeholders, there are ML engineers, salespeople, product managers, and infrastructure engineers.

Each stakeholder might want something very different:

  • ML engineers might want to develop the model with the greatest performance, which may be complex and require a lot of data;
  • salespeople might push for a model that recommends restaurants that pay the highest advertising fee to be shown in the app;
  • product managers may want a model that is very fast for inference, as low latency is often associated with more orders in the app;
  • the infrastructure team may want to hold off on deploying the new model until they have updated the existing infrastructure, due to problems they encountered previously.

How do you find a solution that makes everyone happy?

I’ll leave that one up to you.

2. Interpretability

When learning about ML, interpretability is often pushed aside. Add to that the fact that we live in the golden age of benchmarks, where each new state-of-the-art accuracy is read as a synonym for progress, and it is not surprising that solutions favoring predictive performance over interpretability are among the most popular approaches.

For example, ensemble models are very popular methods that often achieve good results in ML competitions, such as those hosted on Kaggle. But, depending on the ensemble method being used, it can be hard to interpret their underlying mechanisms. What has the model actually learned? How can we know the model is picking up useful information and not over-indexing on certain features? One way to start answering these questions is sketched below.
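
To make this concrete, here is a minimal, illustrative sketch (not Openlayer’s API) that uses scikit-learn’s permutation importance on a toy dataset; the dataset, model, and hyperparameters are placeholders, not a prescription. The idea is to measure how much held-out performance drops when each feature is shuffled, which hints at which features the ensemble actually relies on.

```python
# A minimal, illustrative sketch: probing which features a tree ensemble
# relies on via permutation importance in scikit-learn.
# The dataset, model, and hyperparameters below are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score:
# the larger the drop, the more the model leans on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:5]
for name, importance in top:
    print(f"{name}: {importance:.3f}")
```

If a feature you expected to be irrelevant dominates this list, that is exactly the kind of over-indexing worth investigating before the model reaches production.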

Interpretability needs to be a first-class citizen in ML systems in production.

Users are more likely to act on a model’s outputs if they trust it; in many fields, being able to justify a model’s predictions is a must and might even be required by law; and ML practitioners can triangulate the root causes of their models’ mistakes more quickly if they understand why the models behave the way they do.

Interpretability lies at the heart of trustworthy ML. If this is something you would like to explore further, check out our white paper and our blog post about LIME!
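
If you want a taste of how a tool like LIME works before diving into those resources, the sketch below, assuming the open-source `lime` package and the same kind of toy setup as above, fits a simple local surrogate around a single prediction and lists the features that pushed it the most.

```python
# A minimal sketch, assuming the open-source `lime` package (pip install lime):
# fitting a local surrogate to explain one prediction of a tree ensemble.
# The dataset and model are toy placeholders.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# Perturb the neighborhood of one test sample, fit a simple linear surrogate,
# and list the features that pushed this particular prediction the most.
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())
```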

3. Fairness and bias

Virtually everyone knows about the importance of fairness and the dangers of bias. However, how do you measure them?

Since they are difficult quantities to measure objectively, fairness and bias, like interpretability, often assume a secondary role when ML is being taught.
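
Still, even a rough number is better than nothing. As a purely illustrative sketch, assuming binary predictions and a single binary sensitive attribute (both hypothetical here), one of the simplest group-fairness metrics, the demographic parity difference, can be computed by hand:

```python
# A purely illustrative sketch: demographic parity difference computed by hand.
# The predictions and the sensitive attribute below are hypothetical.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # binary model predictions
group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 0 = group A, 1 = group B

# Rate of positive predictions in each group.
rate_a = y_pred[group == 0].mean()
rate_b = y_pred[group == 1].mean()

# A difference of 0 means both groups receive positive outcomes at the same
# rate; larger gaps are a warning sign worth investigating.
print(f"Group A rate: {rate_a:.2f}, group B rate: {rate_b:.2f}")
print(f"Demographic parity difference: {abs(rate_a - rate_b):.2f}")
```

This is only one of many possible metrics, and which one is appropriate depends heavily on the context and on who is affected by the model’s decisions.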

In production, the stakes are different. The consequences of deploying an unfair ML model are never positive. In the best case, the result is bad PR and upset users. In the worst case, it can put organizations out of business or permanently harm people’s lives.

Let’s look at an example.

Google currently relies on complex ML models to answer users’ queries. From time to time, some of the search engine’s odd results end up gathering media attention. Recently, for example, there was considerable controversy over the image search results that come up after searching for “school boy” and “school girl”.

On the one hand, the image results for “school boy” show young boys dressed for school, as one would expect. On the other, the results for “school girl” show sexualized images of women wearing school uniforms. The difference between the two queries is a single word: the gender.

Unfortunately, Google’s example is not the only one. In the book “Weapons of Math Destruction”, Cathy O'Neil explores many different instances that depict how currently deployed algorithms might be reinforcing discrimination and negatively affecting people’s lives in a myriad of contexts.

The objectives, the need for interpretability, and the importance of fairness are just a few of the differences that appear when we compare the necessities of ML systems in production with the version of ML that is often taught in universities and online resources. It is fundamental that we, as practitioners, remember that ML models are part of the solution provided by ML systems, but they are not all of it. Keeping these differences in mind is an important step towards performant, explainable, and ethical ML. Check out the tools we’re building at Openlayer that might help you on your journey!

* A previous version of this article listed the company name as Unbox, which has since been rebranded to Openlayer.
