Response Relevance Evaluator

The ResponseRelevanceEvaluator evaluates whether an agent’s response is relevant to the user’s question. It assesses whether the response addresses what was actually asked, rather than drifting off-topic or providing unrelated information.

  • Trace-Level Evaluation: Evaluates the most recent turn in the conversation
  • Five-Level Scoring: Granular scale from “Not At All” to “Completely Yes”
  • Async Support: Supports both synchronous and asynchronous evaluation
  • Structured Reasoning: Provides step-by-step reasoning for each evaluation

Use the ResponseRelevanceEvaluator when you need to:

  • Detect off-topic or tangential responses
  • Ensure agents stay focused on the user’s question
  • Identify cases where agents misinterpret user intent
  • Measure response alignment with user queries

This evaluator operates at the TRACE_LEVEL, evaluating the most recent turn in the conversation.

Configuration parameters (a hedged configuration sketch follows this list):

Judge model
  • Type: Union[Model, str, None]
  • Default: None (uses the default Bedrock model)
  • Description: The model to use as the judge.

System prompt
  • Type: str | None
  • Default: None (uses the built-in template)
  • Description: Custom system prompt for the judge model.

Include inputs
  • Type: bool
  • Default: True
  • Description: Whether to include the input prompt in the evaluation context.

Template version
  • Type: str
  • Default: "v0"
  • Description: Prompt template version.
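
Taken together, a minimal configuration sketch looks like the following. The keyword names here (judge_model, system_prompt, include_inputs, template_version) are assumptions inferred from the parameter descriptions above, not confirmed identifiers; the values shown mirror the documented defaults.

from strands_evals.evaluators import ResponseRelevanceEvaluator

# Keyword names below are assumptions inferred from the parameter
# descriptions; check the library's API reference for the exact identifiers.
evaluator = ResponseRelevanceEvaluator(
    judge_model=None,        # None -> default Bedrock model
    system_prompt=None,      # None -> built-in judge template
    include_inputs=True,     # include the input prompt in the judge's context
    template_version="v0",   # prompt template version
)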

Rating          Score  Description
Not At All      0.0    Response is completely unrelated to the question
Not Generally   0.25   Response is mostly off-topic with minor relevance
Neutral/Mixed   0.5    Response partially addresses the question
Generally Yes   0.75   Response is mostly relevant with minor tangents
Completely Yes  1.0    Response directly and fully addresses the question

A response passes the evaluation if the score is >= 0.5.
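
Because each rating maps to a fixed score, the pass/fail decision is a simple threshold check. The snippet below is an illustrative restatement of the table above, not the library's internal code:

# Illustrative restatement of the rating scale; not the library's internals.
RATING_SCORES = {
    "Not At All": 0.0,
    "Not Generally": 0.25,
    "Neutral/Mixed": 0.5,
    "Generally Yes": 0.75,
    "Completely Yes": 1.0,
}

def passes(rating: str) -> bool:
    # A response passes when its score reaches the 0.5 threshold.
    return RATING_SCORES[rating] >= 0.5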

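The example below runs the evaluator end to end: it captures the agent’s trace with the in-memory exporter, maps the finished spans to a session, and returns both the output and the trajectory for evaluation.
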
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import ResponseRelevanceEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Collect agent traces in memory so the evaluator can inspect the trajectory.
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    # Clear spans from earlier cases so each run sees only its own trace.
    telemetry.in_memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)
    # Map the finished spans to a session the evaluator can consume.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    return {"output": str(response), "trajectory": session}

cases = [
    Case(name="relevance-check", input="How do I reset my password?"),
]

experiment = Experiment(cases=cases, evaluators=[ResponseRelevanceEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
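
The evaluator also advertises async support. A hedged sketch of the asynchronous path follows, assuming an awaitable counterpart to run_evaluations exists; the method name run_evaluations_async is an assumption, not confirmed by this page.

import asyncio

async def main() -> None:
    # Hypothetical async entry point: run_evaluations_async is an assumed
    # name. Consult the strands_evals API reference for the real method.
    reports = await experiment.run_evaluations_async(task_function)
    reports[0].run_display()

asyncio.run(main())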