Skip to main content

Definition

The session guideline adherence test evaluates whether the assistant followed one or more customer-supplied behavioural guidelines across a conversation. You supply each guideline as an object with a criteria rule (e.g., “always refer to the user by their first name”) and a scoring mode ("Yes or No" or "0-1"). An LLM-as-a-judge evaluates the full session against each guideline independently and reports per-guideline scores plus aggregate adherence metrics. This is the session-level metric most suited to codifying product-specific behaviour that generic catalog tests don’t cover.

Taxonomy

  • Task types: LLM.
  • Availability: and .
  • Evaluation level: session.
  • Polarity: higher score = better (guideline followed).

Why it matters

  • Almost every product has a handful of custom behavioural rules that don’t map to the generic catalog — this metric is the knob for those.
  • Because the guideline is customer-supplied, you can track behaviour unique to your product, brand voice, or regulatory context.
  • Multi-guideline support lets you bundle a policy (brand voice + tone + forbidden topics) into a single insight and see aggregate adherence across a session.

Required columns

  • Input: The user’s message in each turn.
  • Output: The assistant’s response in each turn.
  • Session ID: Groups turns belonging to the same conversation.
  • Timestamp: Used to reconstruct turn order within a session.

Insight parameters

  • criteria_list (list of objects, required): One or more guidelines to evaluate. Each object contains:
    • criteria (string): The guideline text in plain English.
    • scoring (string): Either "Yes or No" (binary judgment — adherent or not) or "0-1" (a continuous 0–1 float where 1 = perfect adherence).
All guidelines in a single criteria_list should use the same scoring mode — the default adherence threshold is derived from the first entry’s mode.
This metric relies on an LLM evaluator. On Openlayer you can configure the underlying LLM used to compute it. Check out the OpenAI or Anthropic integration guides for details.

Default adherence threshold

A session is counted as “adherent” when its aggregate score is at or above:
  • 0.8 when scoring is "0-1"
  • 0.5 when scoring is "Yes or No" (Yes is encoded as 1.0, No as 0.0)
This threshold drives the adherentSessionsPercent and violationRate measurements below.

Available measurements

MeasurementWhat it means
meanGuidelineAdherenceAverage adherence score across sessions in the window
medianGuidelineAdherenceMedian adherence score across sessions
stdGuidelineAdherenceStandard deviation of per-session adherence scores
minGuidelineAdherenceLowest per-session adherence score
maxGuidelineAdherenceHighest per-session adherence score
adherentSessionsPercent% of sessions meeting the default adherence threshold
violationRate% of sessions below the default adherence threshold
totalEvaluatedSessionsCount of sessions the judge evaluated
erroredSessionsCountCount of sessions that errored during evaluation
skippedSessionsCountCount of sessions skipped (e.g., insufficient turns)

Test configuration examples

[
  {
    "name": "Brand-voice guideline adherence above 0.7",
    "description": "Ensure the assistant follows the brand-voice guideline across sessions",
    "type": "performance",
    "subtype": "sessionGuidelineAdherence",
    "thresholds": [
      {
        "insightName": "sessionGuidelineAdherence",
        "insightParameters": [
          {
            "name": "criteria_list",
            "value": [
              {
                "criteria": "Always refer to the user by their first name and never recommend a competitor product.",
                "scoring": "0-1"
              }
            ]
          }
        ],
        "measurement": "meanGuidelineAdherence",
        "operator": ">=",
        "value": 0.7
      }
    ],
    "subpopulationFilters": null,
    "mode": "monitoring",
    "usesProductionData": true,
    "evaluationWindow": 3600,
    "delayWindow": 0
  }
]