Definition

The session coherence test evaluates the logical flow and consistency of a multi-turn conversation. An LLM-as-a-judge reads the full conversation and scores it against four criteria:
  • Responses logically follow from the user’s messages
  • The overall trajectory of the conversation is easy to follow
  • Individual responses are well-structured and internally consistent
  • Transitions between topics feel smooth
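
To make the judging step concrete, here is a minimal sketch of how a full conversation and the four criteria above could be rendered into a single judge prompt, and how a numeric reply could be clamped to the 0 to 1 range. The prompt wording and the names `build_judge_prompt` and `parse_score` are illustrative assumptions, not the prompt or code Openlayer actually uses.

```python
from typing import List, Tuple

# The four criteria from the definition above.
CRITERIA = [
    "Responses logically follow from the user's messages",
    "The overall trajectory of the conversation is easy to follow",
    "Individual responses are well-structured and internally consistent",
    "Transitions between topics feel smooth",
]


def build_judge_prompt(turns: List[Tuple[str, str]]) -> str:
    """Render (user, assistant) turns plus the criteria into one prompt.

    Hypothetical prompt format, for illustration only.
    """
    transcript = "\n".join(
        f"Turn {i}\nUser: {user}\nAssistant: {assistant}"
        for i, (user, assistant) in enumerate(turns, start=1)
    )
    rubric = "\n".join(f"- {criterion}" for criterion in CRITERIA)
    return (
        "Rate the coherence of the conversation below from 0 (completely "
        "incoherent) to 1 (perfectly coherent), using these criteria:\n"
        f"{rubric}\n\nConversation:\n{transcript}\n\nScore:"
    )


def parse_score(reply: str) -> float:
    """Parse the judge's reply as a float and clamp it to [0, 1]."""
    return min(1.0, max(0.0, float(reply.strip())))
```

The judge sees every turn at once because the metric is defined at the session level; scoring turns in isolation could not capture contradictions or abrupt topic changes across turns.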

Taxonomy

  • Task types: LLM.
  • Availability: and .
  • Evaluation level: session.
  • Polarity: higher score = better. 0 = completely incoherent, 1 = perfectly coherent.

Why it matters

  • Coherence is a distinct quality from correctness: an assistant can give factually correct answers that still feel disjointed or contradictory across turns.
  • Low coherence is a strong leading indicator of user dissatisfaction even when task outcomes look fine.

Required columns

  • Input: The user’s message in each turn.
  • Output: The assistant’s response in each turn.
  • Session ID: Groups turns belonging to the same conversation.
  • Timestamp: Used to reconstruct turn order within a session.

This metric relies on an LLM evaluator. On Openlayer you can configure the underlying LLM used to compute it. Check out the OpenAI or Anthropic integration guides for details.
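
As a rough illustration of how the Session ID and Timestamp columns work together, the sketch below groups rows by session and sorts each group by timestamp to recover the turn order. The dictionary keys `session_id`, `timestamp`, `input`, and `output` are hypothetical column names chosen for this example, and timestamps are assumed to be directly comparable values such as Unix seconds.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_sessions(rows: List[dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Group rows by session and order each session's turns by timestamp.

    Returns a mapping from session ID to a list of (input, output) turns.
    Column names are assumptions made for this sketch.
    """
    by_session = defaultdict(list)
    for row in rows:
        by_session[row["session_id"]].append(row)

    return {
        session_id: [
            (r["input"], r["output"])
            for r in sorted(session_rows, key=lambda r: r["timestamp"])
        ]
        for session_id, session_rows in by_session.items()
    }


# Rows may arrive out of order; the timestamp restores the conversation order.
rows = [
    {"session_id": "a", "timestamp": 20, "input": "And in Python?", "output": "Use sorted()."},
    {"session_id": "b", "timestamp": 5, "input": "Hello", "output": "Hi there."},
    {"session_id": "a", "timestamp": 10, "input": "How do I sort a list?", "output": "Which language?"},
]
sessions = group_sessions(rows)
```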

Test configuration examples

[
  {
    "name": "Session coherence above 0.7",
    "description": "Ensure conversations maintain logical flow across turns",
    "type": "performance",
    "subtype": "sessionCoherence",
    "thresholds": [
      {
        "insightName": "sessionCoherence",
        "measurement": "meanScore",
        "operator": ">=",
        "value": 0.7
      }
    ],
    "subpopulationFilters": null,
    "mode": "monitoring",
    "usesProductionData": true,
    "evaluationWindow": 3600,
    "delayWindow": 0
  }
]
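
One way to read the threshold in this configuration: the mean of the per-session coherence scores must be at least 0.7. The sketch below checks a list of scores against a threshold object shaped like the one above. It is an interpretation of the `measurement`, `operator`, and `value` fields for illustration, not Openlayer's evaluation code; the function name `threshold_passes` is hypothetical, and fields such as `evaluationWindow` and `delayWindow` are not modeled.

```python
import operator
from statistics import mean
from typing import Dict, List

# Comparison operators this sketch understands (an assumed subset).
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def threshold_passes(scores: List[float], threshold: Dict) -> bool:
    """Check per-session scores against a threshold dict.

    Only the "meanScore" measurement is handled in this sketch.
    """
    if threshold["measurement"] != "meanScore":
        raise ValueError(f"unsupported measurement: {threshold['measurement']}")
    compare = OPERATORS[threshold["operator"]]
    return compare(mean(scores), threshold["value"])


threshold = {
    "insightName": "sessionCoherence",
    "measurement": "meanScore",
    "operator": ">=",
    "value": 0.7,
}
passing = threshold_passes([0.9, 0.6, 0.8], threshold)  # mean is about 0.77
failing = threshold_passes([0.5, 0.6], threshold)       # mean is 0.55
```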