Command-Line Interface
Overview
Installing `strands-agents-evals` also installs the `strands-evals` console script, a thin wrapper over the public Python API for CI gates and one-off use. It exposes five subcommands that map directly to library calls, so behavior in CI matches what you get from a Python script.

| Command | Purpose |
|---|---|
| `strands-evals run` | Execute an `Experiment` against an `--agent` factory or `--task` callable, or run a single ad-hoc case via `--input` + `--evaluator`/`--expected-output`/`--rubric`. |
| `strands-evals validate` | Schema-check a serialized `Experiment` JSON file. Useful as a CI gate before `run`. |
| `strands-evals report` | Render an existing `EvaluationReport` JSON via Rich, or dump it as JSON. |
| `strands-evals diagnose` | Run `detect_failures`, `analyze_root_cause`, or the full `diagnose_session` pipeline on a `Session` JSON file. |
| `strands-evals generate` | Synthesize an `Experiment` via `ExperimentGenerator` from a free-form `--context` or an existing `--experiment` file. |

Run any subcommand with `--help` for the full flag set.
Installation

    pip install strands-agents-evals

The `strands-evals` script is registered as a console entry point and is on your `PATH` after installation.
Entry Point Convention
`--agent`, `--task`, `--evaluator`, and `--custom-evaluator` all accept a `MODULE:ATTR` reference. The same convention is used by pytest --pyargs, gunicorn, and inspect-ai eval.
Two forms are accepted:

- Dotted module: `pkg.module:attr`, resolved via `importlib.import_module`. The current working directory is added to `sys.path` so a sibling file like `agent.py` works as `agent:build_agent` without `PYTHONPATH=.`.
- Path-like: `./agent.py:build_agent`, `../sibling/agent:build_agent`, or `/abs/path/agent.py:build_agent`. Anything that contains a path separator, starts with `./`, `../`, or `~`, or ends in `.py` is treated as path-like.
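
The two forms above can be sketched with the standard library alone. This is a minimal illustration of the convention, not the resolver that ships with `strands-evals`; the names `looks_like_path` and `resolve_entry_point` are hypothetical, and error handling in the real CLI may differ.

```python
import importlib
import importlib.util
import os
import sys
from pathlib import Path


def looks_like_path(module_part: str) -> bool:
    """Apply the path-like heuristics listed above (hypothetical helper)."""
    return (
        os.sep in module_part
        or "/" in module_part
        or module_part.startswith(("./", "../", "~"))
        or module_part.endswith(".py")
    )


def resolve_entry_point(ref: str):
    """Resolve a MODULE:ATTR reference to the object it names."""
    # Split on the last colon so a drive letter such as C:\ stays in the module part.
    module_part, sep, attr = ref.rpartition(":")
    if not sep or not module_part or not attr:
        raise ValueError(f"expected MODULE:ATTR, got {ref!r}")

    if looks_like_path(module_part):
        path = Path(module_part).expanduser()
        if path.suffix != ".py":
            # Allow references like ../sibling/agent that omit the extension.
            path = path.parent / (path.name + ".py")
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Make sibling files importable without PYTHONPATH=.
        cwd = os.getcwd()
        if cwd not in sys.path:
            sys.path.insert(0, cwd)
        module = importlib.import_module(module_part)

    return getattr(module, attr)
```

With this sketch, `resolve_entry_point("json:dumps")` returns the standard library's `json.dumps`, and a reference such as `./agent.py:build_agent` loads the file directly without touching `sys.path`.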
run — Execute an Experiment
Two modes:

- Experiment file mode: pass an `EXPERIMENT_FILE` (a JSON document produced by `Experiment.to_file`).
- Ad-hoc mode: omit the file and provide `--input` plus at least one of `--evaluator`, `--expected-output`, or `--rubric` for a single-case run without authoring an experiment.

The two modes are mutually exclusive — argparse rejects mixing them.
Choosing --agent vs --task
`--agent` is the standard path. It expects a factory callable that returns a fresh `strands.Agent` per invocation:

    from strands import Agent
    from strands_tools import calculator


    def build_agent():
        # Return a new Agent on every call so no state is shared between cases.
        return Agent(tools=[calculator], callback_handler=None)

The CLI synthesizes the standard task wrapper around it: telemetry setup → per-case OTel context (`session.id`, `gen_ai.conversation.id`) → factory call → invoke with `case.input` → map spans to a `Session` → return `{"output", "trajectory"}`. Trace-based evaluators (`HelpfulnessEvaluator`, `FaithfulnessEvaluator`, `GoalSuccessRateEvaluator`, etc.) read the trajectory directly.
The factory may also take a single `Case` argument for per-case customization:

    from strands import Agent
    from strands_tools import calculator


    def build_agent(case):
        # Enable the calculator only for cases whose metadata asks for it.
        tools = [calculator] if (case.metadata or {}).get("use_calc") else []
        return Agent(tools=tools, callback_handler=None)

A prebuilt `strands.Agent` instance or an `Agent` subclass is rejected because the conversation state would leak across cases.
`--task` is the escape hatch for non-standard task shapes such as multi-turn loops or custom session mapping. It expects a `Callable[[Case], dict|str]`. When `--task` is used, the user owns agent instantiation; `--trace-attributes` has no effect and logs a warning.
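
To make the `--task` shape concrete, here is a minimal sketch of a multi-turn callable. Everything in it is illustrative: the `Case` dataclass is a stand-in for the library's own class (only `input` and `metadata` are modeled), `echo_agent` is a placeholder for a real agent call, and the `turns` metadata key is made up for this example. The `"output"` key mirrors the return shape described for the `--agent` wrapper.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Case:
    """Stand-in for the library's Case; only the fields used here (assumption)."""
    input: str
    metadata: Optional[Dict[str, Any]] = None


def echo_agent(prompt: str) -> str:
    """Hypothetical placeholder for a real agent invocation."""
    return f"echo: {prompt}"


def multi_turn_task(case: Case) -> dict:
    """A --task style callable: takes a Case and returns a dict."""
    # The "turns" key is invented for this sketch; always run at least one turn.
    turns = max(1, (case.metadata or {}).get("turns", 1))
    prompt = case.input
    reply = ""
    for _ in range(turns):
        reply = echo_agent(prompt)
        prompt = reply  # feed the reply back in as the next turn
    return {"output": reply}
```

A callable like this would be referenced as `--task my_module:multi_turn_task`, using the same `MODULE:ATTR` convention as `--agent`.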
Experiment file run

    # Schema-check first, then run against a factory
    strands-evals validate experiments/customer_service.json
    strands-evals run experiments/customer_service.json \
      --agent my_pkg.agents:build_agent \
      --display

`--display` renders a Rich table on stdout with input, expected output, actual output, and per-evaluator scores.
`run` is the one subcommand whose primary stdout output does not follow the global `--rich`/`--json` TTY auto-detection. Building the Rich table eagerly walks every case row, which is wasteful on large experiments where output is typically piped to `strands-evals report` or written via `-o`. Concretely, without `--display`:

- `--json` (or stdout is a pipe) → flattened report JSON on stdout.
- TTY with no `--json`/`--rich` → silent on stdout (pass `--display` to see results).
- `-o PATH` → JSON written to the file; nothing on stdout.

Ad-hoc run
For a one-off check with no experiment file:

    # Substring match against the agent's response
    strands-evals run \
      --input "What is the capital of France?" \
      --expected-output "Paris" \
      --agent my_pkg.agents:build_agent

    # LLM-as-judge with a rubric
    strands-evals run \
      --input "Explain recursion in one paragraph." \
      --rubric "Score 1.0 if accurate and one paragraph. Score 0.0 otherwise." \
      --agent my_pkg.agents:build_agent

    # Built-in shortname evaluator
    strands-evals run \
      --input "Is 17 prime?" \
      --evaluator helpfulness \
      --agent my_pkg.agents:build_agent

Auto-wiring rules in ad-hoc mode:

- `--expected-output TEXT` (without `--evaluator`) → `Contains(value=TEXT)`.
- `--rubric TEXT` (without `--evaluator`) → `OutputEvaluator(rubric=TEXT)`.
- `--expected-output` and `--rubric` compose: both auto-evaluators are appended.
- An explicit `--evaluator` disables the auto-wiring; pass it again to add more (`--evaluator` is repeatable).

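
The auto-wiring rules can be read as a small decision function. The sketch below is an illustration only: `auto_wire` is a hypothetical name, evaluators are represented as descriptive strings rather than real instances, and the order in which the two auto-evaluators are appended is an assumption.

```python
from typing import List, Optional


def auto_wire(expected_output: Optional[str] = None,
              rubric: Optional[str] = None,
              evaluators: Optional[List[str]] = None) -> List[str]:
    """Return evaluator descriptors for an ad-hoc run, following the rules above."""
    # Explicit --evaluator values switch auto-wiring off entirely.
    if evaluators:
        return list(evaluators)

    wired = []
    if expected_output is not None:
        wired.append(f"Contains(value={expected_output!r})")
    if rubric is not None:
        wired.append(f"OutputEvaluator(rubric={rubric!r})")
    if not wired:
        # Ad-hoc mode needs at least one of the three flags.
        raise ValueError("provide --evaluator, --expected-output, or --rubric")
    return wired
```

For example, `auto_wire(expected_output="Paris", rubric="Be brief.")` yields both auto-evaluators, while adding `evaluators=["helpfulness"]` returns only the explicit one.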
`--evaluator` accepts either a built-in shortname or `MODULE:CLASS` for a custom `Evaluator` subclass. Built-in shortnames instantiate with no arguments; richer config (custom rubrics, judge models, target tool names) belongs in an experiment file.

Built-in shortnames: `coherence`, `conciseness`, `correctness`, `equals`, `faithfulness`, `goal-success-rate`, `harmfulness`, `helpfulness`, `instruction-following`, `refusal`, `response-relevance`, `stereotyping`, `tool-parameter-accuracy`, `tool-selection-accuracy`.
Concurrency, caching, and exit codes

    strands-evals run experiments/regression.json \
      --agent my_pkg.agents:build_agent \
      --max-workers 8 \
      --data-store ./.cache/regression \
      --fail-on threshold:0.8 \
      -o reports/regression.json

- `--max-workers` controls parallelism for `run_evaluations_async` (default `1`).
- `--data-store DIR` enables `LocalFileTaskResultStore` so cached task outputs short-circuit reruns. See Result Caching for details.
- `--fail-on` chooses the exit-code rule: `any` (the default; exit non-zero on any case failure), `none` (always exit 0 on completion), or `threshold:0.X` (exit non-zero when the report's overall score falls below the threshold).
- `--exit-zero` overrides `--fail-on` and always returns 0. Useful when you want to record the report without breaking the build.
- `-o PATH` writes the flattened report JSON to a file. Without `-o`, the JSON goes to stdout when `--json` is passed or stdout is a pipe.

Exit codes:
| Code | Meaning |
|---|---|
| `0` | Success (all cases passed, or `--fail-on=none` / `--exit-zero`). |
| `1` | Evaluation failures triggered by `--fail-on`. |
| `2` | Bad input (invalid flags, missing entry point, schema error). |
| `3` | Unexpected runtime error. |

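
The interaction between `--fail-on`, `--exit-zero`, and exit codes `0` and `1` can be sketched as a single function. This is an illustration of the documented rules under simple assumptions, not the CLI's code: `exit_code` is a hypothetical name, and per-case results are modeled as a plain list of booleans.

```python
from typing import List


def exit_code(fail_on: str, case_passed: List[bool], overall_score: float,
              exit_zero: bool = False) -> int:
    """Map a finished run to 0 or 1 using the documented --fail-on rules."""
    # --exit-zero and --fail-on none both force success on completion.
    if exit_zero or fail_on == "none":
        return 0
    if fail_on == "any":
        return 0 if all(case_passed) else 1
    if fail_on.startswith("threshold:"):
        threshold = float(fail_on.split(":", 1)[1])
        # Fail only when the overall score is strictly below the threshold.
        return 1 if overall_score < threshold else 0
    raise ValueError(f"unknown --fail-on rule: {fail_on!r}")
```

Under this reading, a run with one failing case and an overall score of 0.85 fails with `--fail-on any` but passes with `--fail-on threshold:0.8`.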
Diagnosis during a run
Combine `run` with on-failure diagnosis to capture root causes alongside scores:

    strands-evals run experiments/regression.json \
      --agent my_pkg.agents:build_agent \
      --diagnose on_failure \
      --confidence medium \
      --display

`--diagnose` accepts `on_failure` or `always`. Diagnosis requires `Session` trajectories, which only `--agent` produces. With `--display`, recommendations render in the Rich table.
Trace attributes and custom evaluators

    strands-evals run experiments/regression.json \
      --agent my_pkg.agents:build_agent \
      --trace-attributes service.name=billing \
      --trace-attributes deployment.env=staging \
      --custom-evaluator my_pkg.evaluators:DomainSafetyEvaluator

- `--trace-attributes KEY=VALUE` is repeatable. The pairs are set as W3C Baggage on the per-case context and stamped on every span the agent emits. `session.id` and `gen_ai.conversation.id` are always set from the case; `--trace-attributes` is for additional keys. No-op when `--task` is used.
- `--custom-evaluator MODULE:CLASS` registers a custom `Evaluator` subclass before `Experiment.from_file` so the deserializer can rehydrate it. Repeatable. Ignored in ad-hoc mode (pass `MODULE:CLASS` directly to `--evaluator` instead).

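
As an illustration of the repeatable `KEY=VALUE` form, the sketch below collects the pairs into a dictionary. It assumes the value may itself contain `=` and that a later duplicate key overwrites an earlier one; how the real CLI treats malformed or duplicate pairs is not documented here, and `parse_trace_attributes` is a hypothetical name.

```python
from typing import Dict, List


def parse_trace_attributes(pairs: List[str]) -> Dict[str, str]:
    """Turn repeated KEY=VALUE strings into a dict; later duplicates win."""
    attributes: Dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        attributes[key] = value
    return attributes
```

For the command above, the input list would be `["service.name=billing", "deployment.env=staging"]`.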
validate — Schema-check an Experiment

    strands-evals validate experiments/customer_service.json
    # valid: 12 case(s), 3 evaluator(s) [OutputEvaluator, TrajectoryEvaluator, HelpfulnessEvaluator]

`validate` loads the file via `Experiment.from_file` and reports case and evaluator counts. It exits non-zero on schema or I/O errors, making it a fast CI gate before `run`. Use `--custom-evaluator MODULE:CLASS` (repeatable) when the experiment references custom evaluators.
report — Render an existing report

    # Static Rich rendering on stdout (Rich on a TTY, JSON when piped; pass --rich to force)
    strands-evals report reports/regression.json --rich

    # Interactive Rich table (expand/collapse rows)
    strands-evals report reports/regression.json --interactive

    # Include diagnosis recommendations
    strands-evals report reports/regression.json --recommendations

    # Re-emit as JSON
    strands-evals report reports/regression.json --json

`report` accepts `-` to read from stdin, so it composes with `run`:

    strands-evals run experiments/regression.json --agent my_pkg.agents:build_agent \
      | strands-evals report - --recommendations

`-o PATH` always writes JSON regardless of `--interactive`/`--rich`, so you can pipe through `report` to persist a stable on-disk format.
diagnose — Detect failures and analyze root causes
`diagnose` operates on a serialized `Session` (the same `Session` object trace-based evaluators consume). Three modes:

    # Full pipeline: detect failures and analyze root causes
    strands-evals diagnose session.json --confidence medium

    # Detection only
    strands-evals diagnose session.json --detect-only --confidence high

    # Root cause analysis only
    strands-evals diagnose session.json --rca-only

    # Read from stdin, write JSON to a file
    cat session.json | strands-evals diagnose - --output diagnosis.json

- `--confidence` is the minimum confidence threshold for failure detection (`low`|`medium`|`high`, default `low`).
- `--model MODEL_ID` overrides the judge model used for detection and RCA.
- `--detect-only` and `--rca-only` are mutually exclusive; omit both for the full pipeline.
- A one-line summary is always written to stderr (`diagnosis: N failure(s), M root cause(s)`), so the command is scriptable even when the rich output goes to a TTY.

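
Because the summary line has a fixed shape, a script can pull the two counts out of captured stderr. The sketch below assumes the wording matches the pattern quoted above exactly; `parse_diagnosis_summary` is a hypothetical helper and should be adjusted if the real output differs.

```python
import re
from typing import Optional, Tuple

# Pattern taken from the documented summary: "diagnosis: N failure(s), M root cause(s)".
SUMMARY_RE = re.compile(r"diagnosis: (\d+) failure\(s\), (\d+) root cause\(s\)")


def parse_diagnosis_summary(stderr_text: str) -> Optional[Tuple[int, int]]:
    """Return (failures, root_causes) from the stderr summary, or None if absent."""
    match = SUMMARY_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The text to parse could come from `subprocess.run([...], capture_output=True, text=True).stderr` in a wrapper script.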
See Detectors for the underlying API.
generate — Synthesize an Experiment
`generate` wraps `ExperimentGenerator` to produce a starter experiment from either a free-form context or an existing experiment file. The two source flags are mutually exclusive.
From a context description

    strands-evals generate \
      --context "$(cat tools.txt)" \
      --num-cases 10 \
      --evaluator TrajectoryEvaluator \
      --task-description "Calculation and time-aware assistant" \
      --num-topics 3 \
      -o experiments/generated.json

- `--context` accepts free-form text. Use shell substitution for file contents.
- `--num-cases` (default `5`) is the number of test cases to generate.
- `--evaluator` (context mode only) attaches a default evaluator with a generated rubric. Choices: `OutputEvaluator`, `TrajectoryEvaluator`, `InteractionsEvaluator`. Omit to produce an experiment with a placeholder `Evaluator`.
- `--num-topics` (context mode only) splits generation across N topic-specific prompts for diverse coverage.

From an existing experiment

    strands-evals generate \
      --experiment experiments/baseline.json \
      --num-cases 20 \
      --extra-information "Focus on edge cases involving timezone handling." \
      -o experiments/expanded.json

- New cases are inspired by the source; evaluators are inherited from the source's defaults (so `--evaluator` and `--num-topics` are rejected).
- `--custom-evaluator MODULE:CLASS` (experiment mode only, repeatable) registers custom evaluators before loading the source.
- `--extra-information` (experiment mode only) is extra context for the new cases and rubric.

`--model MODEL_ID` overrides the judge model used by the generator. With `-o`, the experiment is written via `Experiment.to_file` (a `.json` extension is enforced). Without `-o`, the JSON document is written to stdout. A one-line summary on stderr reports the case and evaluator counts.
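
The mode-specific flag rules for `generate` can be pictured with a small `argparse` sketch. It is not the parser `strands-evals` ships: the function names are hypothetical, only a subset of flags is modeled, and requiring exactly one of `--context`/`--experiment` is an assumption.

```python
import argparse
from typing import List


def build_generate_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser for a subset of the generate flags."""
    parser = argparse.ArgumentParser(prog="generate-sketch")
    # The two source flags cannot be combined (requiring one is an assumption).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--context")
    source.add_argument("--experiment")
    parser.add_argument("--num-cases", type=int, default=5)
    parser.add_argument("--evaluator")
    parser.add_argument("--num-topics", type=int)
    parser.add_argument("--extra-information")
    return parser


def parse_generate_args(argv: List[str]) -> argparse.Namespace:
    """Parse argv and reject flags that belong to the other mode."""
    parser = build_generate_parser()
    args = parser.parse_args(argv)
    if args.experiment and (args.evaluator or args.num_topics is not None):
        parser.error("--evaluator and --num-topics are context mode only")
    if args.context and args.extra_information:
        parser.error("--extra-information is experiment mode only")
    return args
```

`parser.error` exits with status 2, which lines up with the "bad input" exit code listed for `run`.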
See Experiment Generator for the underlying API.
Global flags
Every subcommand accepts the same global flags from the parent parser:

| Flag | Purpose |
|---|---|
| `--json` | Emit machine-readable JSON to stdout. |
| `--rich` | Emit Rich-rendered output to stdout. Default when stdout is a TTY. |
| `-v`, `--verbose` | Increase log verbosity. Repeat (`-vv`) for DEBUG. |
| `--debug` | DEBUG logging plus full tracebacks on errors. |

`--json` and `--rich` are mutually exclusive; without either, the format is auto-detected from whether stdout is a TTY.
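
The format selection described above maps naturally onto an `argparse` parent parser with a mutually exclusive group. The sketch below is a hedged illustration of that pattern, with hypothetical function names, and is not taken from the `strands-evals` source.

```python
import argparse


def build_parent_parser() -> argparse.ArgumentParser:
    """Parent parser carrying the shared output-format flags."""
    # add_help=False lets subcommand parsers inherit this via parents=[...].
    parent = argparse.ArgumentParser(add_help=False)
    fmt = parent.add_mutually_exclusive_group()
    fmt.add_argument("--json", action="store_true")
    fmt.add_argument("--rich", action="store_true")
    return parent


def resolve_format(args: argparse.Namespace, stdout_is_tty: bool) -> str:
    """Pick the output format: explicit flag first, then TTY auto-detection."""
    if args.json:
        return "json"
    if args.rich:
        return "rich"
    return "rich" if stdout_is_tty else "json"
```

In a real entry point the second argument would typically be `sys.stdout.isatty()`; passing it in explicitly keeps the function easy to test.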
CI Integration
A typical CI flow combines `validate` (fast schema gate) with `run` (the actual evaluation):

    # .github/workflows/evals.yml (excerpt)
    - name: Validate experiments
      run: strands-evals validate experiments/regression.json

    - name: Run evaluations
      run: |
        strands-evals run experiments/regression.json \
          --agent my_pkg.agents:build_agent \
          --max-workers 8 \
          --data-store ./.cache/regression \
          --fail-on threshold:0.85 \
          -o regression-report.json

    - name: Upload report
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: eval-report
        path: regression-report.json

`validate` exits non-zero on schema errors before any agent calls, and `run` exits non-zero on evaluation failures via `--fail-on`. The cached results from `--data-store` make reruns cheap when only the evaluators or the agent change.
Next Steps
- Task Decorator: the Python equivalent of `--agent`'s synthesized task wrapper, for use in scripts.
- Result Caching: what `--data-store` writes and how cache hits work.
- Serialization: the on-disk shapes consumed by `validate`, `report`, and `generate --experiment`.
- Experiment Generator: the API behind `strands-evals generate`.
- Detectors: the API behind `strands-evals diagnose`.