No ill-formed inputs
No SSN
No credit card information
Is valid JSON
Is valid Python code
No column drift
High GPT-evaluation score
High answer relevancy
Response is concise
Response is factual
High BLEU-1 score on protected subpopulation
High METEOR score
No ill-formed sentences
No new tokens
No disparity between gender-related pronouns
No rows from the training set present in validation
No new labels in the validation set
No label drift
No significant label drift
No rows from the training set present in validation
Expect high performance on sentences containing "help"
High precision on sentences containing key tokens
High confidence and accuracy on must-pass cases
High precision on "urgent" predictions
No more than 10 rows with nulls
No more than 10 duplicate rows
No significant difference in accuracy between gender feature values
No rows from the training set present in validation
No new labels in the validation set
No null columns
No significant label drift
No rows from the training set present in validation
Expect high performance on young adult females in South Africa
High precision on "fraudulent" predictions
High confidence and accuracy on must-pass cases
High precision on "urgent" predictions
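Each test name above describes a check on your data or model. As a rough illustration of what a few of the data-integrity checks compute, here is a stdlib-only Python sketch. It is not Openlayer's API or implementation. The function names are hypothetical, and the sketch assumes each row is a tuple of feature values whose last element is the label, with `None` marking a missing value.

```python
from collections import Counter

# Hypothetical data layout: each row is a tuple of feature values whose last
# element is the label; None marks a missing value.


def rows_with_nulls(rows):
    """Count rows that contain at least one missing (None) value."""
    return sum(1 for row in rows if any(value is None for value in row))


def duplicate_rows(rows):
    """Count rows that repeat an earlier, identical row."""
    return sum(count - 1 for count in Counter(rows).values())


def leaked_rows(train, validation):
    """Count validation rows that also appear verbatim in the training set."""
    train_set = set(train)
    return sum(1 for row in validation if row in train_set)


def new_labels(train, validation):
    """Return labels present in validation but never seen in training."""
    return {row[-1] for row in validation} - {row[-1] for row in train}


# Hypothetical example data.
train = [("a", 1, "spam"), ("b", None, "ham"), ("a", 1, "spam")]
validation = [("a", 1, "spam"), ("c", 3, "urgent")]

results = {
    "No more than 10 rows with nulls": rows_with_nulls(train) <= 10,
    "No more than 10 duplicate rows": duplicate_rows(train) <= 10,
    "No rows from the training set present in validation": leaked_rows(train, validation) == 0,
    "No new labels in the validation set": not new_labels(train, validation),
}
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

With this toy data, the null and duplicate checks pass, while the leakage and new-label checks fail: one validation row is copied from training, and the label `"urgent"` never appears in training.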
We support a diverse range of task types so that all of your use cases are covered.
Make sure your AI is at peak performance when it gets into the hands of your users. Tests are a great way to track all of the constraints that are important for your model's performance.
Start by improving the foundation of your AI system: the data.
Make sure data stays consistent between different datasets.
Identify underperforming subpopulations. Choose from the most advanced metrics.
Guard against biases and ensure equal treatment of sensitive groups.
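To make "underperforming subpopulations" and "equal treatment of sensitive groups" concrete, the sketch below computes accuracy per group, flags groups below a threshold, and reports the largest accuracy gap between groups. It is a hypothetical illustration, not Openlayer's implementation. Note that a test such as "No significant difference in accuracy between gender feature values" implies some notion of significance, whereas this sketch uses only a plain gap that you would compare against a threshold of your choosing.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Accuracy of the model's predictions within each value of `group_key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        correct[group] += record["prediction"] == record["label"]
    return {group: correct[group] / total[group] for group in total}


def underperforming_groups(records, group_key, min_accuracy):
    """Groups whose accuracy falls below `min_accuracy`."""
    scores = accuracy_by_group(records, group_key)
    return {group: acc for group, acc in scores.items() if acc < min_accuracy}


def accuracy_gap(records, group_key):
    """Largest accuracy difference between any two groups (0.0 means parity)."""
    scores = accuracy_by_group(records, group_key)
    return max(scores.values()) - min(scores.values())


# Hypothetical predictions, for illustration only.
records = [
    {"gender": "female", "label": 1, "prediction": 1},
    {"gender": "female", "label": 0, "prediction": 1},
    {"gender": "male", "label": 1, "prediction": 1},
    {"gender": "male", "label": 0, "prediction": 0},
]
print(underperforming_groups(records, "gender", min_accuracy=0.8))
print(accuracy_gap(records, "gender"))
```

The same functions work for any slicing feature (for example, age bracket or country), so one helper covers both subpopulation analysis and group-fairness checks.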
Probe for edge cases not captured by your data that you may encounter in the wild. See how your AI performs under adversarial attack.
Add monitoring to your pipeline to set production-specific tests and keep a close eye on your model's behavior in the wild.
Something went wrong in production? Be the first to know with real-time pings on email, Slack, or in-app.
Openlayer notification: @sophia commented on No duplicate rows: "Thoughts on changing the threshold?"
Openlayer notification: Test status updated for No output drift, from 🟢 Passing to 🔴 Failing.
Different tests require different windows of data. Set custom evaluation windows to determine when to run each test.
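One way to picture per-test evaluation windows is a mapping from each test to a time span, with each test only evaluating the records that fall inside its span. The sketch below is a hypothetical illustration. The window lengths, record layout, and function name are assumptions, not Openlayer settings.

```python
from datetime import datetime, timedelta

# Hypothetical per-test evaluation windows: each test only looks at records
# from the most recent span of time assigned to it.
TEST_WINDOWS = {
    "No more than 10 rows with nulls": timedelta(hours=1),
    "No significant label drift": timedelta(days=7),
}


def records_in_window(records, now, window):
    """Keep records whose timestamp falls within [now - window, now]."""
    start = now - window
    return [r for r in records if start <= r["timestamp"] <= now]


# Hypothetical production records.
now = datetime(2024, 1, 8, 12, 0)
records = [
    {"timestamp": datetime(2024, 1, 8, 11, 30)},
    {"timestamp": datetime(2024, 1, 8, 9, 0)},
    {"timestamp": datetime(2024, 1, 2, 12, 0)},
]

for test_name, window in TEST_WINDOWS.items():
    selected = records_in_window(records, now, window)
    print(f"{test_name}: evaluating {len(selected)} record(s) from the last {window}")
```

A short window keeps fast-moving checks such as null counts responsive. A longer window gives distribution-level checks such as label drift enough data to be meaningful.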
The monitoring dashboard offers a comprehensive view of current test results. Click on any individual test to dive deeper into its performance history.
Breeze through the "why?"s behind failed tests. The information you need to find the root cause of issues is at your fingertips.
Dive deeper into every test to understand why it is failing. Stop questioning what to do next to improve your model.
Understand which features are driving model performance on a particular data slice or on the whole dataset.
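A common, generic way to estimate which features a model relies on is permutation importance: shuffle one feature column and measure how much accuracy drops. The sketch below shows that standard technique in plain Python. It is not necessarily how Openlayer computes its explainability scores. The toy model and data are hypothetical. To score a data slice instead of the whole dataset, pass only that slice's rows. This version uses a single shuffle per feature, whereas practical implementations usually average over several.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, seed=0):
    """Accuracy drop when each feature column is shuffled on its own.

    A large drop suggests the model leans on that feature; a drop near zero
    suggests it barely uses it. Pass a subset of rows to score a data slice.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:j] + (value,) + row[j + 1:] for row, value in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


# Hypothetical model that only looks at feature 0.
def model(row):
    return int(row[0] > 0.5)


rows = [(0.9, 0.1), (0.1, 0.7), (0.8, 0.3), (0.2, 0.9), (0.7, 0.5), (0.3, 0.2)]
labels = [model(row) for row in rows]
print(permutation_importance(model, rows, labels))
```

Because the toy model ignores feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero.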
Perturb individual model inputs and see how the prediction changes. Compare the explainability scores and model predictions side-by-side to validate your hypotheses.
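Perturbing a single input amounts to nudging one feature at a time and recording how the prediction moves relative to the unperturbed baseline. The sketch below shows that idea with a hypothetical linear scoring model. The function name and the size of each nudge are illustrative assumptions, not Openlayer parameters.

```python
def perturbation_effects(model, row, deltas):
    """Shift one feature at a time by deltas[j] and record the change in the model's score."""
    base = model(row)
    effects = {}
    for j, delta in deltas.items():
        perturbed = list(row)
        perturbed[j] += delta
        effects[j] = model(tuple(perturbed)) - base
    return base, effects


# Hypothetical scoring model: a fixed linear function of two features.
def model(row):
    return 2.0 * row[0] - 1.0 * row[1]


base, effects = perturbation_effects(model, (1.0, 3.0), {0: 0.5, 1: 0.5})
print(base, effects)  # -1.0 {0: 1.0, 1: -0.5}
```

Placing the per-feature effects next to the baseline prediction lets you check a hypothesis directly: here, raising feature 0 pushes the score up, and raising feature 1 pulls it down.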
Track and version your models, prompts, and datasets. Compare performance across versions, and systematically choose the best AI stack.
Experiment with different models, prompts, and parameters and generate test cases.
Keep track of and easily switch between model and dataset versions.
Quickly compare versions to pick a winner or revert ones that introduce regressions.
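Comparing versions to pick a winner or revert a regression can be reduced to comparing metric dictionaries. The sketch below is a hypothetical illustration. It assumes higher is better for every metric and uses a tolerance so that small noise does not count as a regression. The metric names and values are made up.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Metrics where `candidate` scores worse than `baseline` by more than `tolerance`.

    Assumes higher is better for every metric and compares only shared metrics.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }


def pick_version(baseline, candidate, tolerance=0.0):
    """Keep the candidate only if it introduces no regressions; otherwise revert."""
    return "baseline" if find_regressions(baseline, candidate, tolerance) else "candidate"


# Hypothetical metrics for two model versions.
v1 = {"accuracy": 0.91, "precision_urgent": 0.88}
v2 = {"accuracy": 0.93, "precision_urgent": 0.80}
print(find_regressions(v1, v2, tolerance=0.01))
print(pick_version(v1, v2, tolerance=0.01))
```

Here v2 improves overall accuracy but loses precision on "urgent" predictions, so this strict rule keeps v1. A real selection policy might instead weight metrics by importance.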
Bring the whole team in on the development of your AI. Work with others to diagnose issues and identify next steps. Keep everyone in the loop.
Increase visibility with everyone in one place. Everyone stays up to date on the versions and progress towards deployment.
Work together to pass tests and improve your model by leaving comments. Add context by tagging data, features, and more.
Get your free account up and running in 60 seconds
Our team will give you a demo of our platform and answer any questions
We'll set up accounts for you and your team according to your specific needs