Changelog

A log of all the changes and improvements made to our app

February 1st, 2024

Go deep on test result history and add multiple criteria to GPT evaluation tests

You can now click on any test to dive deep into the test result history. Select specific date ranges to see the requests from that time period, scrub through the graph to spot patterns over time, and get a full picture of performance.

We’ve also added the ability to add multiple criteria to GPT evaluation tests. Let’s say you’re using an LLM to parse customer support tickets and want to make sure every output contains the correct name, email address and account ID for the customer. You can now set a unique threshold for each of these criteria in one test.

New features

Scrub through the entire history of test results in the individual test pages
- See which requests were evaluated in each evaluation window
- See results and requests from a specific time period
Improved GPT evaluation test
- Add multiple criteria to a single test
- Choose how you want each row to be scored against the criteria: on a range from 0 to 1, or as a binary 0 or 1
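
To make the multi-criteria idea concrete, here is a minimal Python sketch of one test that carries several criteria, each with its own threshold and scoring mode. It is an illustration only, not the app's API: the `Criterion` class, the `evaluate` function, the field names, and the rule that a criterion passes when its mean row score meets the threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Criterion:
    """One criterion in a hypothetical multi-criteria evaluation test."""
    name: str         # hypothetical label, e.g. "email"
    threshold: float  # minimum mean score required to pass (assumed rule)
    binary: bool      # True: rows score 0 or 1; False: rows score in [0, 1]


def validate_score(score: float, criterion: Criterion) -> float:
    """Reject row scores that are illegal for the criterion's scoring mode."""
    if not 0 <= score <= 1:
        raise ValueError(f"{criterion.name}: score {score} is outside [0, 1]")
    if criterion.binary and score not in (0, 1):
        raise ValueError(f"{criterion.name}: binary score must be 0 or 1")
    return float(score)


def evaluate(row_scores: list[dict[str, float]],
             criteria: list[Criterion]) -> dict[str, bool]:
    """Return pass/fail per criterion by comparing mean score to threshold."""
    results = {}
    for criterion in criteria:
        scores = [validate_score(row[criterion.name], criterion)
                  for row in row_scores]
        results[criterion.name] = mean(scores) >= criterion.threshold
    return results


# Example data: two parsed support tickets, scored on three criteria.
criteria = [
    Criterion(name="name", threshold=1.0, binary=True),
    Criterion(name="email", threshold=1.0, binary=True),
    Criterion(name="account_id", threshold=0.6, binary=False),
]
row_scores = [
    {"name": 1, "email": 1, "account_id": 0.75},
    {"name": 1, "email": 0, "account_id": 0.5},
]
results = evaluate(row_scores, criteria)
print(results)
```

In this sketch the "email" criterion fails because one of the two rows scored 0 against a threshold of 1.0, while the other two criteria pass. The real product may aggregate row scores differently.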

Improvements

Bolstered backend server to handle higher loads
Table headers no longer wrap
Null columns hidden in data table
Test metadata moved to the side panel so that the test results graph and data are viewed more easily
Skipped test results are rendered with the most recent result value
Test results graph height increased in the page for an individual test
Date labels in test results graph improved
Only render rows that were evaluated for GPT metric threshold tests
Test card graphs no longer flash empty state before loading

Bug fixes

Results graph was not sorted correctly
Test results graph did not overflow properly
Test results graph did not render all data points
Empty and quasi-constant features test creation was broken
Undefined column values now rendered as null
Most recent rows were not being shown by default
Label chip in results graphs for string validation tests was not inline
Test results were not rendering properly in development mode
Plan name label overflowed navigation
Buttons for exploring subpopulations were active even when no subpopulations existed
Results graph rendered a loading indicator even after network requests completed for skipped evaluations
Rows for the current evaluation window were not rendered in test modals
Commit, metrics, and other tables were not rendering rows
Duplicate loading and empty placeholders rendered in monitoring mode