Session Diagnosis

diagnose_session runs the full detection-and-analysis pipeline in a single call: it detects failures, performs root cause analysis on any failures found, and returns a combined DiagnosisResult with deduplicated fix recommendations. It also integrates directly into the Experiment class via DiagnosisConfig for automatic diagnosis of failing evaluation cases.

  • Single-call pipeline: Runs detect_failures then analyze_root_cause in sequence
  • Deduplicated recommendations: The .recommendations property returns unique fix suggestions across all root causes
  • Experiment integration: Wire into Experiment with DiagnosisConfig for automatic diagnosis
  • Configurable triggers: Run diagnosis on every case (DiagnosisTrigger.ALWAYS) or only on failing cases (DiagnosisTrigger.ON_FAILURE)

Use diagnose_session when you need to:

  • Run the full pipeline without managing individual detector calls
  • Integrate diagnosis into experiments for automatic debugging of failing cases
  • Get a single result object with failures, root causes, and recommendations

Use the individual detectors (detect_failures, analyze_root_cause) when you need finer control — for example, running failure detection with different confidence thresholds before deciding whether to proceed with RCA.
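For illustration, here is a minimal sketch of that finer-grained flow. The exact detect_failures and analyze_root_cause signatures are assumptions modeled on the diagnose_session parameters documented below:

from strands_evals.detectors import ConfidenceLevel, analyze_root_cause, detect_failures

# Detect with a stricter threshold first (signature assumed).
failures = detect_failures(session, confidence_threshold=ConfidenceLevel.HIGH)

# Only spend LLM calls on root cause analysis if something was found.
if failures:
    root_causes = analyze_root_cause(session, failures)  # signature assumed
    for rc in root_causes:
        print(rc.root_cause_explanation)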

diagnose_session accepts the following parameters:

session
  • Type: Session
  • Description: The Session object containing traces and spans to diagnose.

model
  • Type: Model | str | None
  • Default: None (uses Claude Sonnet via Bedrock)
  • Description: The model for both failure detection and root cause analysis.

confidence_threshold
  • Type: ConfidenceLevel
  • Default: ConfidenceLevel.LOW
  • Description: Minimum confidence level for failure detection.
from strands_evals.detectors import diagnose_session

result = diagnose_session(session)

# Failures found
print(f"Failures: {len(result.failures)}")
for f in result.failures:
    print(f"  [{f.category[0]}] at span {f.span_id}")

# Root causes
print(f"\nRoot causes: {len(result.root_causes)}")
for rc in result.root_causes:
    print(f"  {rc.causality} at {rc.location}: {rc.root_cause_explanation}")

# Deduplicated recommendations
print("\nRecommendations:")
for rec in result.recommendations:
    print(f"  - {rec}")

diagnose_session returns a DiagnosisResult:

class DiagnosisResult(BaseModel):
    session_id: str
    failures: list[FailureItem]
    root_causes: list[RCAItem]

    @property
    def recommendations(self) -> list[str]:
        """Deduplicated fix recommendations from all root causes."""

If no failures are detected, root_causes will be empty and recommendations will return an empty list.
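As a quick sanity check (a sketch; passing_session stands in for any Session with no detectable failures):

from strands_evals.detectors import diagnose_session

result = diagnose_session(passing_session)  # hypothetical clean session
assert result.failures == []
assert result.root_causes == []
assert result.recommendations == []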

The most powerful way to use diagnosis is through the Experiment class. Pass a DiagnosisConfig to automatically diagnose cases during evaluation.

from strands_evals import DiagnosisConfig
from strands_evals.detectors import ConfidenceLevel, DiagnosisTrigger

class DiagnosisConfig(BaseModel):
    trigger: DiagnosisTrigger = DiagnosisTrigger.ON_FAILURE
    model: Model | str | None = None
    confidence_threshold: ConfidenceLevel = ConfidenceLevel.MEDIUM
  • trigger (default: DiagnosisTrigger.ON_FAILURE): When to run diagnosis. ON_FAILURE runs only when at least one evaluator fails; ALWAYS runs on every case.
  • model (default: None): Model for the detectors. None uses the default.
  • confidence_threshold (default: ConfidenceLevel.MEDIUM): Minimum confidence for failure detection.
from strands import Agent
from strands_evals import Case, Experiment, DiagnosisConfig, eval_task, TracedHandler
from strands_evals.detectors import ConfidenceLevel, DiagnosisTrigger
from strands_evals.evaluators import GoalSuccessRateEvaluator

@eval_task(TracedHandler())
def my_agent_task():
    return Agent(
        system_prompt="You are a helpful travel booking assistant.",
        callback_handler=None,
    )

cases = [
    Case(
        name="booking-1",
        input="Book me a flight from NYC to London for next Friday.",
        metadata={"task_description": "Flight booked with confirmation number"},
    ),
    Case(
        name="booking-2",
        input="I need to cancel my reservation ABC123.",
        metadata={"task_description": "Reservation cancelled successfully"},
    ),
]

experiment = Experiment(
    cases=cases,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM,
    ),
)

reports = experiment.run_evaluations(my_agent_task)

Display recommendations in the evaluation report:

# Display with recommendations column
reports[0].display(include_recommendations=True)

Or access them programmatically:

report = reports[0]
for i, rec in enumerate(report.recommendations):
    if rec is not None:
        case_name = report.cases[i].get("name", f"case_{i}")
        passed = report.test_passes[i]
        print(f"[{'PASS' if passed else 'FAIL'}] {case_name}")
        print(f"  Recommendation: {rec}")

The full diagnosis dict (failures + root causes) is available per case:

report = reports[0]
for i, diagnosis in enumerate(report.diagnoses):
    if diagnosis is not None:
        case_name = report.cases[i].get("name", f"case_{i}")
        n_failures = len(diagnosis.get("failures", []))
        n_rca = len(diagnosis.get("root_causes", []))
        print(f"{case_name}: {n_failures} failures, {n_rca} root causes")
        for rc in diagnosis.get("root_causes", []):
            print(f"  [{rc['fix_type']}] {rc['fix_recommendation']}")

DiagnosisTrigger.ON_FAILURE (default): Diagnosis runs only when at least one evaluator returns test_pass=False for the case. This is the most efficient option — no LLM calls are spent diagnosing passing cases.

DiagnosisConfig(trigger=DiagnosisTrigger.ON_FAILURE)

DiagnosisTrigger.ALWAYS: Diagnosis runs on every case regardless of evaluator results. Useful for deep analysis or when you want to detect latent issues in passing cases (e.g., the agent succeeded but through a suboptimal path).

DiagnosisConfig(trigger=DiagnosisTrigger.ALWAYS)

Diagnosis requires the task function to return a Session object as the trajectory. This means using one of the following:

  • The @eval_task(TracedHandler()) decorator (recommended)
  • A manual task function that collects spans and maps them to a Session via StrandsInMemorySessionMapper
  • A trace provider that returns Session objects

If the trajectory is not a Session (e.g., it’s a plain list of tool names), diagnosis is silently skipped for that case.
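If you call diagnose_session outside an Experiment, you can mirror that guard yourself. A minimal sketch; the import path for Session is an assumption:

from strands_evals.detectors import diagnose_session
from strands_evals.types import Session  # import path assumed

def maybe_diagnose(trajectory):
    # Mirror the Experiment's behavior: only Session trajectories are diagnosable.
    if isinstance(trajectory, Session):
        return diagnose_session(trajectory)
    return None  # e.g., a plain list of tool names is skipped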

This example shows the complete workflow: run evaluations with diagnosis, then analyze the results.

from strands import Agent
from strands_evals import Case, Experiment, DiagnosisConfig, eval_task, TracedHandler
from strands_evals.detectors import ConfidenceLevel, DiagnosisTrigger
from strands_evals.evaluators import OutputEvaluator, GoalSuccessRateEvaluator
from strands_evals.types.evaluation_report import EvaluationReport
from strands_tools import calculator

@eval_task(TracedHandler())
def math_agent():
    return Agent(
        tools=[calculator],
        system_prompt="You are a math assistant. Use the calculator tool for computations.",
        callback_handler=None,
    )

cases = [
    Case(
        name="basic-math",
        input="What is 15% of 230?",
        expected_output="34.5",
        metadata={"task_description": "Correct calculation provided"},
    ),
]

experiment = Experiment(
    cases=cases,
    evaluators=[
        OutputEvaluator(rubric="Score 1.0 if the answer is numerically correct. Score 0.0 otherwise."),
        GoalSuccessRateEvaluator(),
    ],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM,
    ),
)

reports = experiment.run_evaluations(math_agent)

# Flatten all evaluator reports into one and display
combined = EvaluationReport.flatten(reports)
combined.display(include_recommendations=True)
  1. Start with DiagnosisTrigger.ON_FAILURE to minimize LLM costs — only diagnose cases that actually fail
  2. Use ConfidenceLevel.MEDIUM for diagnosis to balance signal and noise
  3. Use TracedHandler with the @eval_task decorator to automatically collect Session objects
  4. Group recommendations across cases to identify systematic issues vs. one-off problems (see the sketch after this list)
  5. Use DiagnosisTrigger.ALWAYS sparingly — it’s useful for deep analysis but doubles the LLM cost per case
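For tip 4, one way to group recommendations is to count identical recommendation strings across a report (a sketch using the report fields shown earlier):

from collections import Counter

# Recommendations that recur across many cases point at systematic issues;
# one-off recommendations point at case-specific problems.
counts = Counter(rec for rec in report.recommendations if rec is not None)
for rec, n in counts.most_common():
    print(f"{n}x: {rec}")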