Skip to main content

Definition

The session turn relevancy test evaluates whether each turn is relevant to the ongoing conversation. An LLM-as-a-judge reads the full session and scores it against four criteria:
  • Each response directly addresses the user’s question or request
  • Responses avoid irrelevant information or padding
  • Off-topic tangents are handled appropriately (redirected rather than indulged)
  • The conversation stays focused on the thread the user is pursuing

Taxonomy

  • Task types: LLM.
  • Availability: and .
  • Evaluation level: session.
  • Polarity: higher score = better. 0 = completely irrelevant turns, 1 = all turns highly relevant.

Why it matters

  • Even assistants that answer factually correctly sometimes respond to adjacent but not directly-requested questions — a subtle quality degradation.
  • Turn-level relevancy aggregated across a session catches patterns where the assistant consistently half-misunderstands the user.

Required columns

  • Input: The user’s message in each turn.
  • Output: The assistant’s response in each turn.
  • Session ID: Groups turns belonging to the same conversation.
  • Timestamp: Used to reconstruct turn order within a session.
This metric relies on an LLM evaluator. On Openlayer you can configure the underlying LLM used to compute it. Check out the OpenAI or Anthropic integration guides for details.

Test configuration examples

[
  {
    "name": "Session turn relevancy above 0.7",
    "description": "Ensure each turn is relevant to the user's question",
    "type": "performance",
    "subtype": "sessionTurnRelevancy",
    "thresholds": [
      {
        "insightName": "sessionTurnRelevancy",
        "measurement": "meanScore",
        "operator": ">=",
        "value": 0.7
      }
    ],
    "subpopulationFilters": null,
    "mode": "monitoring",
    "usesProductionData": true,
    "evaluationWindow": 3600,
    "delayWindow": 0
  }
]