A log of all the changes and improvements made to our app

February 1st, 2024

📈 Go deep on test result history and add multiple criteria to GPT evaluation tests

You can now click on any test to dive deep into the test result history. Select specific date ranges to see the requests from that time period, scrub through the graph to spot patterns over time, and get a full picture of performance.

We’ve also added the ability to add multiple criteria to GPT evaluation tests. Let’s say you’re using an LLM to parse customer support tickets and want to make sure every output contains the correct name, email address and account ID for the customer. You can now set a unique threshold for each of these criteria in one test.
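To illustrate how multi-criteria evaluation works, here is a minimal sketch in Python. The configuration shape, field names, and `passes` helper below are hypothetical, invented for illustration only; they do not reflect the product's actual API. The key idea from above is that each criterion carries its own threshold, and a row passes the test only when every criterion meets its threshold.

```python
# Hypothetical sketch of a multi-criteria GPT evaluation test.
# All field names here are illustrative, not the real API.
test_config = {
    "name": "support-ticket-parsing",
    "criteria": [
        {"description": "Output contains the customer's correct name",
         "scoring": "binary",   # scored as 0 or 1
         "threshold": 1.0},
        {"description": "Output contains the correct email address",
         "scoring": "binary",
         "threshold": 1.0},
        {"description": "Output contains the correct account ID",
         "scoring": "range",    # scored on a 0-1 range
         "threshold": 0.8},
    ],
}

def passes(scores: dict, config: dict) -> bool:
    """A row passes only if every criterion meets its own threshold."""
    return all(
        scores[c["description"]] >= c["threshold"]
        for c in config["criteria"]
    )

# All criteria fully satisfied -> the row passes.
perfect = {c["description"]: 1.0 for c in test_config["criteria"]}
print(passes(perfect, test_config))  # True

# Account ID criterion scores 0.5, below its 0.8 threshold -> fails.
partial = dict(perfect)
partial["Output contains the correct account ID"] = 0.5
print(passes(partial, test_config))  # False
```

The point of the per-criterion threshold is that strict fields (name, email) can require an exact binary match while fuzzier criteria can pass on a partial score.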

New features

  • Scrub through the entire history of test results in the individual test pages
    • See which requests were evaluated in each evaluation window
    • See results and requests from a specific time period
  • Improved GPT evaluation test
    • Add multiple criteria to a single test
    • Choose how you want each row to be scored against the criteria: on a continuous range from 0 to 1, or as a binary 0 or 1

Improvements

  • Bolstered the backend server to handle higher loads
  • Table headers no longer wrap
  • Null columns hidden in data table
  • Test metadata moved to the side panel so that the test results graph and data are viewed more easily
  • Skipped test results are rendered with the most recent result value
  • Test results graph height increased in the page for an individual test
  • Date labels in the test results graph improved
  • Only render rows that were evaluated for GPT metric threshold tests
  • Test card graphs no longer flash empty state before loading

Bug fixes

  • Results graph was not sorted correctly
  • Test results graph did not overflow properly
  • Test results graph did not render all data points
  • Empty and quasi-constant features test creation was broken
  • Undefined column values now rendered as null
  • Most recent rows were not being shown by default
  • Label chip in results graphs for string validation tests was not inline
  • Test results were not rendering properly in development mode
  • Plan name label overflowed navigation
  • Buttons for exploring subpopulations were active even when no subpopulations existed
  • Results graph rendered loading indicator even after networking completed for skipped evaluations
  • Rows for the current evaluation window were not rendered in test modals
  • Commit, metrics, and other tables were not rendering rows
  • Duplicate loading and empty placeholders rendered in monitoring mode