# Response Relevance Evaluator
## Overview

The `ResponseRelevanceEvaluator` evaluates whether an agent's response is relevant to the user's question. It assesses whether the response addresses what was actually asked, rather than going off-topic or providing unrelated information.
## Key Features

- Trace-Level Evaluation: Evaluates the most recent turn in the conversation
- Five-Level Scoring: Granular scale from “Not At All” to “Completely Yes”
- Async Support: Supports both synchronous and asynchronous evaluation
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
## When to Use

Use the `ResponseRelevanceEvaluator` when you need to:
- Detect off-topic or tangential responses
- Ensure agents stay focused on the user’s question
- Identify cases where agents misinterpret user intent
- Measure response alignment with user queries
## Evaluation Level

This evaluator operates at the `TRACE_LEVEL`, evaluating the most recent turn in the conversation.
## Parameters

### model (optional)

- Type: `Union[Model, str, None]`
- Default: `None` (uses default Bedrock model)
- Description: The model to use as the judge.

### system_prompt (optional)

- Type: `str | None`
- Default: `None` (uses built-in template)
- Description: Custom system prompt for the judge model.

### include_inputs (optional)

- Type: `bool`
- Default: `True`
- Description: Whether to include the input prompt in the evaluation context.

### version (optional)

- Type: `str`
- Default: `"v0"`
- Description: Prompt template version.
## Scoring System

| Rating | Score | Description |
|---|---|---|
| Not At All | 0.0 | Response is completely unrelated to the question |
| Not Generally | 0.25 | Response is mostly off-topic with minor relevance |
| Neutral/Mixed | 0.5 | Response partially addresses the question |
| Generally Yes | 0.75 | Response is mostly relevant with minor tangents |
| Completely Yes | 1.0 | Response directly and fully addresses the question |
A response passes the evaluation if the score is >= 0.5.
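For illustration, the table's rating-to-score mapping and the pass threshold can be expressed as follows. The names `RATING_SCORES` and `passes` are demonstration-only and are not part of the strands_evals API.

```python
# Illustrative mapping from rating labels to scores, per the table above.
RATING_SCORES = {
    "Not At All": 0.0,
    "Not Generally": 0.25,
    "Neutral/Mixed": 0.5,
    "Generally Yes": 0.75,
    "Completely Yes": 1.0,
}

def passes(rating: str, threshold: float = 0.5) -> bool:
    """A response passes when its relevance score meets the 0.5 threshold."""
    return RATING_SCORES[rating] >= threshold
```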
## Basic Usage

```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import ResponseRelevanceEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Capture agent spans in memory so each case's trajectory can be reconstructed.
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    # Clear spans from previous cases so each session is isolated.
    telemetry.in_memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)
    # Map the finished spans into a session object for trace-level evaluation.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    return {"output": str(response), "trajectory": session}

cases = [
    Case(name="relevance-check", input="How do I reset my password?"),
]

experiment = Experiment(cases=cases, evaluators=[ResponseRelevanceEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
```

## Related Evaluators
- `CoherenceEvaluator`: Evaluates logical consistency
- `CorrectnessEvaluator`: Evaluates factual accuracy
- `FaithfulnessEvaluator`: Checks if responses are grounded in conversation history