Session guideline adherence

Definition

The session guideline adherence test evaluates whether the assistant followed one or more customer-supplied behavioural guidelines across a conversation. You supply each guideline as an object with a criteria rule (e.g., “always refer to the user by their first name”) and a scoring mode ("Yes or No" or "0-1"). An LLM-as-a-judge evaluates the full session against each guideline independently and reports per-guideline scores plus aggregate adherence metrics. This is the session-level metric most suited to codifying product-specific behaviour that generic catalog tests don’t cover.

Taxonomy

Task types: LLM.
Availability: and .
Evaluation level: session.
Polarity: higher score = better (guideline followed).

Why it matters

Almost every product has a handful of custom behavioural rules that don’t map to the generic catalog — this metric is the knob for those.
Because the guideline is customer-supplied, you can track behaviour unique to your product, brand voice, or regulatory context.
Multi-guideline support lets you bundle a policy (brand voice + tone + forbidden topics) into a single insight and see aggregate adherence across a session.

Required columns

Input: The user’s message in each turn.
Output: The assistant’s response in each turn.
Session ID: Groups turns belonging to the same conversation.
Timestamp: Used to reconstruct turn order within a session.

Insight parameters

criteria_list (list of objects, required): One or more guidelines to evaluate. Each object contains:
- criteria (string): The guideline text in plain English.
- scoring (string): Either "Yes or No" (binary judgment — adherent or not) or "0-1" (a continuous 0–1 float where 1 = perfect adherence).

All guidelines in a single criteria_list should use the same scoring mode — the default adherence threshold is derived from the first entry’s mode.

This metric relies on an LLM evaluator. On Openlayer you can configure the underlying LLM used to compute it. Check out the OpenAI or Anthropic integration guides for details.

Default adherence threshold

A session is counted as “adherent” when its aggregate score is at or above:

0.8 when scoring is "0-1"
0.5 when scoring is "Yes or No" (Yes is encoded as 1.0, No as 0.0)

This threshold drives the adherentSessionsPercent and violationRate measurements below.

Available measurements

Measurement	What it means
`meanGuidelineAdherence`	Average adherence score across sessions in the window
`medianGuidelineAdherence`	Median adherence score across sessions
`stdGuidelineAdherence`	Standard deviation of per-session adherence scores
`minGuidelineAdherence`	Lowest per-session adherence score
`maxGuidelineAdherence`	Highest per-session adherence score
`adherentSessionsPercent`	% of sessions meeting the default adherence threshold
`violationRate`	% of sessions below the default adherence threshold
`totalEvaluatedSessions`	Count of sessions the judge evaluated
`erroredSessionsCount`	Count of sessions that errored during evaluation
`skippedSessionsCount`	Count of sessions skipped (e.g., insufficient turns)

Test configuration examples

[
  {
    "name": "Brand-voice guideline adherence above 0.7",
    "description": "Ensure the assistant follows the brand-voice guideline across sessions",
    "type": "performance",
    "subtype": "sessionGuidelineAdherence",
    "thresholds": [
      {
        "insightName": "sessionGuidelineAdherence",
        "insightParameters": [
          {
            "name": "criteria_list",
            "value": [
              {
                "criteria": "Always refer to the user by their first name and never recommend a competitor product.",
                "scoring": "0-1"
              }
            ]
          }
        ],
        "measurement": "meanGuidelineAdherence",
        "operator": ">=",
        "value": 0.7
      }
    ],
    "subpopulationFilters": null,
    "mode": "monitoring",
    "usesProductionData": true,
    "evaluationWindow": 3600,
    "delayWindow": 0
  }
]

Session role adherence — related signal focused on staying in a defined role.

Session goal achievement

Session latency

⌘I

Get started

Workspace setup

Governance

Observability

Offline testing

Tests

Gateway

Data quality monitoring

Administration

Notifications

Other resources

Session guideline adherence

Definition

Taxonomy

Why it matters

Required columns

Insight parameters

Default adherence threshold

Available measurements

Test configuration examples

​Definition

​Taxonomy

​Why it matters

​Required columns

​Insight parameters

​Default adherence threshold

​Available measurements

​Test configuration examples

​Related

Definition

Taxonomy

Why it matters

Required columns

Insight parameters

Default adherence threshold

Available measurements

Test configuration examples

Related