Navigating the chaos: why you don’t need another MLOps tool

And how to build trustworthy AI

When we look back on history and see how the foundations of today's technologies were developed, we are often in awe that things worked out at all. For example, think about what the software development process looked like as recently as a few decades ago. Isn’t it impressive that whole companies were built by engineers editing source code with partly manual version control?

A lot of great software was built like this, but today, no one would dare to say they preferred the old way — particularly considering the widespread availability of tools like Git that make development much smoother.


What’s wrong with AI/ML development

AI is at an interesting point in time. On the one hand, the world as a whole is convinced of its potential and its applications are already influencing billions of lives. On the other, the way models are developed and deployed in the industry is often far from ideal.

Engineers and data scientists usually receive datasets as CSV files, which they store and access locally on their machines. They experiment with different models in a few Jupyter notebooks and then run tests with scripts they put together haphazardly.

Not only that, but engineers are often siloed. It is hard for them to share their work with others, and we cannot help but note the similarities with the early days of software development when we see Jupyter notebooks and analysis docs bouncing back and forth with names such as training.ipynb, prompt_change_v1.ipynb, error_analysis_final.csv, ..., please_work_in_production_final_7.ipynb.

How can we expect AI to live up to its full potential if the development process still looks like this?

It is not surprising that, from time to time, we see pieces in the media about AI-powered products making obvious mistakes, exhibiting biases, and even behaving in unethical ways. For teams shipping such products, the best case is bad PR and upset users; the worst case is going out of business.

Making development systematic

To overcome such issues, teams must be systematic with their processes.

The challenge is that the complexity and black-box nature of AI/ML have made rigorous evaluation a lot harder than it is in most software development. There is no toolkit that reliably gives developers insight into how and why their models fail.

Openlayer is here to simplify this for both common and long-tail failure scenarios. We are a testing tool that fits into your development and production pipelines to help you ship high-quality models with confidence.

For example, you can promptly detect sudden changes in the data run through your model caused by a change in user behavior. You can also keep track of hallucination scores, to ensure your model responses are always grounded. And that's not all — our platform lets you choose from a comprehensive suite of tests that your model (or agent) needs to clear.
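To make the idea of detecting a sudden change in the data concrete, here is a minimal sketch in plain Python. It is not Openlayer's API: the function name `mean_shift`, the sample values, and the threshold of 3 standard deviations are all assumptions chosen for illustration. It simply measures how far the mean of a recent window has moved from a reference window.

```python
from statistics import mean, stdev


def mean_shift(reference, current):
    """Shift of the current mean, in units of the reference standard deviation."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference window has zero variance")
    return abs(mean(current) - mean(reference)) / ref_std


# Hypothetical values of one numeric feature (illustrative data only).
reference = [10.0, 12.0, 11.0, 13.0, 9.0]   # e.g. values seen during development
current = [18.0, 20.0, 19.0, 21.0, 17.0]    # e.g. values seen in production

shift = mean_shift(reference, current)
drifted = shift > 3.0  # assumed threshold; tune it for your own data

if drifted:
    print(f"Possible drift: mean moved by {shift:.1f} reference standard deviations")
```

A real drift test would likely use a proper statistical test and cover every feature, but the shape is the same: compare a reference window with a live one and flag the difference when it crosses a limit you chose in advance.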

We support seamlessly switching between (1) development mode, which lets you track, version, and compare your tests before you deploy them to production, and (2) monitoring mode, which lets you run tests live in production and receive alerts when things go sideways.

Say you're using an LLM for RAG and want to make sure the output is always relevant to the question. You can set up hallucination tests, and we'll buzz you when the average score dips below your comfort zone.
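As a rough illustration of what such a test boils down to, the sketch below checks the average of a batch of per-response scores against a threshold. It is a hand-rolled example, not Openlayer's API: `average_score_test`, the score values, and the 0.8 threshold are hypothetical, and the scores are assumed to lie between 0 and 1, with higher meaning better grounded.

```python
from statistics import mean


def average_score_test(scores, threshold):
    """Return (passed, average) for a batch of per-response scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    avg = mean(scores)
    return avg >= threshold, avg


# Hypothetical groundedness scores, one per model response.
batch = [0.92, 0.88, 0.41, 0.95]

passed, avg = average_score_test(batch, threshold=0.8)
if not passed:
    print(f"ALERT: average score {avg:.2f} is below the 0.80 threshold")
```

One poorly grounded response (0.41) is enough to pull this batch's average under the threshold, which is exactly the kind of dip you would want to hear about.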

Or say you're working with a fraud prediction model and are losing sleep over false negatives. With Openlayer, you can quickly write granular tests to measure and monitor the performance of specific cohorts of data and understand why your model made the choices it did on individual data points.
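A cohort-level test of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration rather than Openlayer's API: the row fields (`amount`, `label`, `prediction`), the high-value cohort, and the 20% limit on the false negative rate.

```python
def false_negative_rate(rows):
    """Missed frauds divided by actual frauds; None if there are no fraud cases."""
    positives = [r for r in rows if r["label"] == 1]
    if not positives:
        return None
    missed = sum(1 for r in positives if r["prediction"] == 0)
    return missed / len(positives)


def cohort_fnr_test(rows, cohort_filter, max_fnr):
    """Return (passed, fnr) for the subset of rows selected by cohort_filter."""
    cohort = [r for r in rows if cohort_filter(r)]
    fnr = false_negative_rate(cohort)
    # A cohort with no fraud cases cannot fail this test.
    passed = fnr is None or fnr <= max_fnr
    return passed, fnr


# Hypothetical labeled transactions: label 1 = fraud, prediction 1 = flagged.
rows = [
    {"amount": 1200, "label": 1, "prediction": 0},
    {"amount": 1500, "label": 1, "prediction": 1},
    {"amount": 30, "label": 1, "prediction": 1},
    {"amount": 45, "label": 0, "prediction": 0},
    {"amount": 2000, "label": 0, "prediction": 0},
]

# Cohort: high-value transactions (assumed cutoff of 1000).
passed, fnr = cohort_fnr_test(rows, lambda r: r["amount"] >= 1000, max_fnr=0.2)
print(f"high-value cohort: FNR={fnr:.2f}, passed={passed}")
```

Across all five rows the model misses one fraud case out of three, but within the high-value cohort it misses one out of two. Slicing by cohort is what surfaces that gap.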

Unified approach

The MLOps landscape is currently fragmented. We’ve seen countless data and ML teams glue together a ton of bespoke and third-party tools to meet basic needs: one for experiment tracking, another for monitoring, and another for CI automation and version control. With LLMOps now thrown into the mix, it can feel like you need yet another set of entirely new tools.

We don’t think you should, so we're building Openlayer to condense and simplify AI evaluation. It’s a collaborative platform that solves long-standing ML problems, while tackling the new crop of challenges presented by Generative AI and foundation models (e.g. prompt versioning, quality control).

We address these problems in a single, consistent way that doesn't require you to learn a new approach. We’ve spent a lot of time ensuring our evaluation methodology remains robust even as the boundaries of AI continue to be redrawn.

We are confident that the way to harness the true potential of AI lies in adopting systematic evaluation methodologies. If you’re interested in building high-quality and impactful AI solutions, you can try Openlayer for free.
