# Deploy Strands Evaluators as AgentCore Code-Based Evaluators
This guide is for a Strands developer who has deployed agents to Amazon Bedrock
AgentCore and wants to run Strands evaluators against those agents. It walks
through wrapping an evaluator from the Python Strands Evals SDK
(strands-agents-evals) as an AgentCore code-based evaluator — an AWS Lambda
function that AgentCore Evaluations invokes to score sessions, traces, or tool
calls.
## Prerequisites

Before following this guide, make sure you have:

- Python 3.10+
- `strands-agents` installed
- `strands-agents-evals` installed
- An AWS account with access to AWS Lambda
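If either package is missing, both install from PyPI (assuming the package names above are the published distribution names):

```bash
pip install strands-agents strands-agents-evals
```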
## How it fits together

The diagram below shows the control flow inside the Lambda handler at runtime, from the moment AgentCore Evaluations invokes it to the schema-valid response the handler returns.
```mermaid
sequenceDiagram
    participant AC as AgentCore Evaluations
    participant L as Lambda (sample handler)
    participant M as StrandsInMemory\nSessionMapper
    participant E as Strands Evaluator\n(HelpfulnessEvaluator)

    AC->>L: event (schemaVersion, evaluationLevel,\nsessionSpans, referenceInputs, target)
    L->>L: parse_payload(event)
    L->>M: map sessionSpans → Session
    M-->>L: Session
    L->>E: evaluator.evaluate(EvaluationData)
    alt evaluator supports level
        E-->>L: EvaluationOutput(score, reason)
        L->>L: build_response(score, reason,\nlevel, evaluatorId)
        L-->>AC: schema-valid response JSON
    else unsupported level
        L->>L: error_handling: build "not run" response
        L-->>AC: schema-valid response JSON\n(score=None, reason=explanation)
    else evaluator raised
        L->>L: error_handling: catch, log,\nbuild failure response
        L-->>AC: schema-valid response JSON\n(score=None, reason=error)
    end
```

## AgentCore request payload

AgentCore Evaluations invokes your Lambda with a JSON event defined by the AWS code-based evaluators contract.
The handler reads every field listed below:
- `schemaVersion` — contract version; currently `"1.0"`.
- `evaluatorId` — ARN of the registered code-based evaluator. Example: `"arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/strands-helpfulness"`.
- `evaluatorName` — human-readable label. Example: `"strands-helpfulness"`.
- `evaluationLevel` — one of `TRACE`, `TOOL_CALL`, or `SESSION`. Selects which Strands evaluator level the handler can answer.
- `evaluationInput.sessionSpans` — array of OTel-style span dicts the handler feeds to `StrandsInMemorySessionMapper` to build a `Session`.
- `evaluationReferenceInputs` — optional reference data for evaluators that need it (for example `FaithfulnessEvaluator`). Example: `{"reference_output": "Paris"}`.
- `evaluationTarget` — metadata about the target agent or session. Example: `{"agentId": "my-agent", "sessionId": "abc-123"}`.
A complete request looks like this:
{ "schemaVersion": "1.0", "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/...", "evaluatorName": "strands-helpfulness", "evaluationLevel": "TRACE", "evaluationInput": { "sessionSpans": [ /* array of OTel-style span dicts */ ] }, "evaluationReferenceInputs": { /* optional reference data */ }, "evaluationTarget": { /* metadata about the target agent/session */ }}Evaluation-level mapping
Section titled “Evaluation-level mapping”AgentCore’s evaluationLevel must line up with the Strands evaluator’s declared
level for the handler to score the request. The table below shows the
one-to-one mapping the handler relies on:
| AgentCore `evaluationLevel` | Strands evaluator level | Typical Strands evaluators |
|---|---|---|
| `TOOL_CALL` | `OUTPUT_LEVEL` | `OutputEvaluator`, deterministic evaluators |
| `TRACE` | `TRACE_LEVEL` | `HelpfulnessEvaluator`, `FaithfulnessEvaluator` |
| `SESSION` | `SESSION_LEVEL` | `TrajectoryEvaluator`, `GoalSuccessRateEvaluator` |
When the incoming level doesn’t match the wrapped evaluator, the handler returns a schema-valid “not-run” response instead of raising — the details of that response shape are covered in the error-handling section later on this page.
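For example, a handler wrapping the TRACE-level `HelpfulnessEvaluator` that receives a `SESSION` request returns a result like this (values abridged; the exact shape comes from the `_error_response` helper shown later):

```json
{
  "schemaVersion": "1.0",
  "results": [
    {
      "evaluatorId": "arn:aws:bedrock-agentcore:...",
      "evaluatorName": "strands-helpfulness",
      "evaluationLevel": "SESSION",
      "score": null,
      "passed": false,
      "label": "NOT_RUN",
      "reason": "LevelMismatchError: Evaluator level 'TRACE_LEVEL' does not support AgentCore evaluationLevel 'SESSION'"
    }
  ]
}
```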
## Writing the Lambda handler

The rest of this section walks through the sample handler region by region.
Every code block is pulled verbatim from
agentcore_code_evaluator_handler.py via MkDocs snippet
syntax, so the docs and the runnable sample can’t drift apart.
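For orientation, pymdownx-style snippet includes look roughly like this; the file path and region name here are illustrative:

```markdown
--8<-- "examples/agentcore_code_evaluator_handler.py:imports"
```

The referenced region is delimited in the Python file by `# --8<-- [start:imports]` and `# --8<-- [end:imports]` comments.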
### Imports and constants

The imports region pulls in the stdlib utilities, the Strands Evals SDK
building blocks the handler needs, and the module-level constants echoed back
in every AgentCore response. The evaluator instance is created at module scope
rather than inside lambda_handler so warm Lambda invocations reuse the same
HelpfulnessEvaluator — and therefore the same judge-model client —
instead of re-initializing it on every call.
SCHEMA_VERSION is the AgentCore code-based evaluator contract version the
handler echoes in responses. PASS_THRESHOLD_ENV and DEFAULT_PASS_THRESHOLD
give deployers a way to override the numeric cutoff that maps evaluator scores
to PASS / FAIL without editing code. The AGENTCORE_LEVEL_* constants are
the exact string values AgentCore sends for evaluationLevel, defined once so
the rest of the handler can reference them symbolically.
```python
import json
import logging
import os
from typing import Any

from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.types import EvaluationData, EvaluationOutput

logger = logging.getLogger(__name__)

# AgentCore code-based evaluator contract version echoed in responses.
SCHEMA_VERSION = "1.0"

# Env var used to override the pass/fail threshold at deploy time.
PASS_THRESHOLD_ENV = "STRANDS_EVAL_PASS_THRESHOLD"

# Default pass threshold when the evaluator does not set ``test_pass``.
DEFAULT_PASS_THRESHOLD = 0.5

# AgentCore ``evaluationLevel`` values, per the AgentCore contract:
# https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/code-based-evaluators.html
AGENTCORE_LEVEL_TOOL_CALL = "TOOL_CALL"
AGENTCORE_LEVEL_TRACE = "TRACE"
AGENTCORE_LEVEL_SESSION = "SESSION"
```

### Entry point

`lambda_handler` is intentionally small: parse the event, invoke the
evaluator, build the response. Each step lives in its own helper so the docs
can introduce them one at a time, and so unit tests can exercise each stage
independently.
The outer try/except is the key part. AgentCore treats any Lambda error as
an evaluator failure that aborts the evaluation run, so the handler converts
every failure — level mismatches, validation errors, evaluator exceptions —
into a schema-valid response. The two helpers referenced here,
LevelMismatchError and _error_response, are defined alongside the other
error-handling code further down.
```python
# Module-level evaluator instance reused across Lambda invocations so
# warm starts don't re-initialize the judge-model client on every call.
_EVALUATOR = HelpfulnessEvaluator()


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point AgentCore Evaluations invokes for each scored unit.

    Parses the AgentCore request, runs the configured Strands evaluator,
    and returns a schema-valid AgentCore response. All failure modes are
    converted into schema-valid responses so AgentCore never observes an
    unhandled Lambda error. ``LevelMismatchError`` and ``_error_response``
    are defined alongside the other error-handling helpers below.
    """
    try:
        evaluation_data, level = _parse_payload(event)
        evaluation_output = _invoke_evaluator(
            _EVALUATOR, evaluation_data, level
        )
        return _build_response(event, evaluation_output, status="OK")
    except LevelMismatchError as exc:
        return _error_response(event, exc, label="NOT_RUN")
    except Exception as exc:  # noqa: BLE001 - graceful failure response
        logger.exception("Evaluator invocation failed")
        return _error_response(event, exc, label="ERROR")
```

### Parsing the request

`_parse_payload` translates the AgentCore request into a Strands
EvaluationData. It starts with a soft version check: an unexpected or
missing schemaVersion logs a warning but doesn’t abort, because the rest of
the payload may still be readable. evaluationLevel is different — it’s
required to pick the right evaluator path, so a missing value raises a
ValueError that the outer try/except turns into an ERROR response.
The OTel-style sessionSpans are fed to StrandsInMemorySessionMapper to
produce the Session object that Strands evaluators accept. Optional
passthroughs (evaluationReferenceInputs and evaluationTarget) are read
only when the evaluator needs them.
The _recover_prompt and _recover_final_output helpers are illustrative.
They pull the user prompt from the root span’s attributes and the final
assistant message from the session’s message list, which works for the shape
this sample assumes. Adapt both helpers to match the span and message schema
your agent actually records.
```python
def _parse_payload(
    event: dict[str, Any],
) -> tuple[EvaluationData, str]:
    """Translate an AgentCore request into Strands evaluator input.

    Returns the ``EvaluationData`` the evaluator will score and the
    ``evaluationLevel`` string from the request so the caller can
    compare it against the evaluator's declared level.
    """
    # Version check. Unknown versions are logged but not fatal; the
    # request may still carry fields we know how to read.
    schema_version = event.get("schemaVersion")
    if schema_version is None:
        logger.warning("AgentCore request is missing schemaVersion")
    elif schema_version != SCHEMA_VERSION:
        logger.warning(
            "Unexpected schemaVersion %r (handler expects %r)",
            schema_version,
            SCHEMA_VERSION,
        )

    # ``evaluationLevel`` is required to pick the right evaluator path.
    # The outer handler converts ValueError into an ERROR response that
    # AgentCore can surface to the reader.
    evaluation_level = event.get("evaluationLevel")
    if not evaluation_level:
        raise ValueError(
            "AgentCore request is missing required 'evaluationLevel'"
        )

    # Convert the OTel-style session spans into a Strands ``Session``.
    evaluation_input = event.get("evaluationInput") or {}
    session_spans = evaluation_input.get("sessionSpans") or []
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map(session_spans)

    # Optional passthroughs. AgentCore only sends these fields when the
    # evaluation configuration references them.
    reference_inputs = event.get("evaluationReferenceInputs")
    evaluation_target = event.get("evaluationTarget")

    # Illustrative defaults: recover the prompt from the root span's
    # attributes and treat the final assistant message as the output.
    # Adjust these helpers to match the span and message shape your
    # agent actually records.
    input_text = _recover_prompt(session)
    actual_output = _recover_final_output(session)

    evaluation_data = EvaluationData(
        input=input_text,
        actual_output=actual_output,
        trajectory=session,
        reference_input=reference_inputs,
    )

    # ``evaluation_target`` is accepted for future use (for example
    # logging the target agent version alongside the score) but is not
    # consumed by ``EvaluationData`` today.
    _ = evaluation_target

    return evaluation_data, evaluation_level


def _recover_prompt(session: Any) -> str:
    """Pull the user prompt out of the mapped session.

    Illustrative: reads the root span's attributes. Real deployments
    should adapt this to match the span schema their agent emits.
    """
    root_span = getattr(session, "root_span", None)
    attributes = getattr(root_span, "attributes", {}) or {}
    return str(
        attributes.get("input")
        or attributes.get("prompt")
        or ""
    )


def _recover_final_output(session: Any) -> str:
    """Pull the final assistant message out of the mapped session.

    Illustrative: reads the last message on the session. Real
    deployments should adapt this to match the message shape their
    agent records.
    """
    messages = getattr(session, "messages", []) or []
    if not messages:
        return ""
    last = messages[-1]
    return str(getattr(last, "content", last) or "")
```

### Invoking the evaluator

`_invoke_evaluator` enforces the level-compatibility check before calling the
evaluator. It reads the evaluator’s declared Strands level (for
HelpfulnessEvaluator that’s TRACE_LEVEL) and compares it to the expected
Strands level for the incoming AgentCore level via
_AGENTCORE_TO_STRANDS_LEVEL, the single mapping table the handler uses for
this translation.
On a mismatch, the function raises `LevelMismatchError`. Python looks the
name up at call time, so the class can be defined in the error-handling
region further down the same module. The outer handler catches the
mismatch and converts it into the `NOT_RUN` response.
```python
# Strands evaluators expose their native level as a string. The sample
# wraps ``HelpfulnessEvaluator``, which is a ``TRACE_LEVEL`` evaluator.
STRANDS_OUTPUT_LEVEL = "OUTPUT_LEVEL"
STRANDS_TRACE_LEVEL = "TRACE_LEVEL"
STRANDS_SESSION_LEVEL = "SESSION_LEVEL"

# How each AgentCore ``evaluationLevel`` maps to a Strands level.
_AGENTCORE_TO_STRANDS_LEVEL = {
    AGENTCORE_LEVEL_TOOL_CALL: STRANDS_OUTPUT_LEVEL,
    AGENTCORE_LEVEL_TRACE: STRANDS_TRACE_LEVEL,
    AGENTCORE_LEVEL_SESSION: STRANDS_SESSION_LEVEL,
}


def _invoke_evaluator(
    evaluator: Any,
    evaluation_data: EvaluationData,
    level: str,
) -> EvaluationOutput:
    """Run the evaluator when the request level matches its Strands level.

    Raises ``LevelMismatchError`` when the incoming AgentCore level does
    not correspond to the evaluator's declared Strands level. The outer
    handler converts that into a schema-valid "not-run" response so
    AgentCore never observes an unhandled Lambda error. The
    ``LevelMismatchError`` class is defined in the error-handling region
    below; Python resolves the forward reference at call time.
    """
    # Every Strands evaluator declares the level at which it operates.
    # ``HelpfulnessEvaluator`` is ``TRACE_LEVEL``; ``OutputEvaluator`` is
    # ``OUTPUT_LEVEL``; ``TrajectoryEvaluator`` is ``SESSION_LEVEL``.
    # The fallback matches this sample's default evaluator.
    evaluator_level = getattr(evaluator, "level", STRANDS_TRACE_LEVEL)

    # ``.get(...)`` returns ``None`` for unknown AgentCore levels, which
    # never matches ``evaluator_level`` and so raises the mismatch error.
    expected_strands_level = _AGENTCORE_TO_STRANDS_LEVEL.get(level)
    if expected_strands_level != evaluator_level:
        raise LevelMismatchError(
            f"Evaluator level {evaluator_level!r} does not support "
            f"AgentCore evaluationLevel {level!r}"
        )

    return evaluator.evaluate(evaluation_data)
```

### Building the response

`_build_response` shapes the AgentCore response the Lambda returns on the
happy path. The pass threshold is read per-invocation from
STRANDS_EVAL_PASS_THRESHOLD (falling back to DEFAULT_PASS_THRESHOLD when
the env var is missing or non-numeric), so tests and deployments can override
it without touching code.
passed and label follow an explicit precedence. If the evaluator sets
test_pass directly — deterministic evaluators often do — that value wins;
otherwise passed is derived by comparing score to the threshold. The
label preserves an evaluator-provided value when one exists, and otherwise
falls back to the AgentCore PASS / FAIL vocabulary. schemaVersion,
evaluatorId, evaluatorName, and evaluationLevel are echoed straight
from the request so AgentCore can correlate the response with the evaluation
it triggered.
```python
def _build_response(
    event: dict[str, Any],
    evaluation_output: EvaluationOutput,
    status: str,
) -> dict[str, Any]:
    """Assemble an AgentCore response from a Strands evaluator output.

    ``status`` is accepted for symmetry with ``_error_response(label=...)``
    so the docs can explain the two response helpers in parallel. The
    happy path always derives ``label`` from ``passed`` (or an explicit
    evaluator label), so the flag is not consumed here.
    """
    _ = status  # reserved for parity with _error_response

    # Pass threshold for turning numeric scores into pass/fail booleans.
    # Read per-invocation so tests and deploys can override via env var.
    try:
        threshold = float(
            os.environ.get(PASS_THRESHOLD_ENV, DEFAULT_PASS_THRESHOLD)
        )
    except (TypeError, ValueError):
        threshold = DEFAULT_PASS_THRESHOLD

    score = getattr(evaluation_output, "score", None)
    reason = getattr(evaluation_output, "reason", None)

    # Strands evaluators may set ``test_pass`` directly (deterministic
    # evaluators often do); otherwise fall back to the threshold compare.
    explicit_pass = getattr(evaluation_output, "test_pass", None)
    if explicit_pass is not None:
        passed = bool(explicit_pass)
    else:
        passed = score is not None and score >= threshold

    # Prefer an evaluator-provided label when one exists; otherwise map
    # the boolean into the AgentCore ``PASS`` / ``FAIL`` vocabulary.
    explicit_label = getattr(evaluation_output, "label", None)
    label = explicit_label if explicit_label else (
        "PASS" if passed else "FAIL"
    )

    return {
        "schemaVersion": event.get("schemaVersion", SCHEMA_VERSION),
        "results": [
            {
                "evaluatorId": event.get("evaluatorId"),
                "evaluatorName": event.get("evaluatorName"),
                "evaluationLevel": event.get("evaluationLevel"),
                "score": score,
                "passed": passed,
                "label": label,
                "reason": reason,
            }
        ],
    }
```

### Error handling

The error_handling region collects the three failure modes the handler is
prepared for:
- **Level mismatch.** `LevelMismatchError` is raised by `_invoke_evaluator` when the incoming `evaluationLevel` doesn't map to the evaluator's declared Strands level. The outer handler catches it and calls `_error_response` with `label="NOT_RUN"`, signalling that the evaluator is healthy but cannot score this request.
- **Unexpected exceptions.** Any other `Exception` — an evaluator crash, a Bedrock error, a bug in the mapping helpers — is caught by the generic `except Exception` in `lambda_handler`, logged with `logger.exception(...)` so the full traceback lands in CloudWatch, and converted into an `ERROR` response.
- **Missing required fields.** When `_parse_payload` encounters a missing `evaluationLevel`, it raises `ValueError`. That also falls into the generic catch above and surfaces as an `ERROR` response.
All three failure modes return schema-valid responses because AgentCore
treats an uncaught Lambda error as an evaluator failure that aborts the
evaluation run — surfacing a graceful score with a clear reason is almost
always more useful than taking the whole run down.
```python
class LevelMismatchError(Exception):
    """Raised when AgentCore's ``evaluationLevel`` does not map to the
    wrapped evaluator's declared Strands level.

    The outer ``try/except`` in ``lambda_handler`` catches this and
    converts it into a schema-valid "not-run" response (``score=None``,
    ``passed=False``, ``label="NOT_RUN"``) so AgentCore surfaces a
    graceful result instead of an unhandled Lambda error.
    """


def _error_response(
    event: dict[str, Any],
    exc: Exception,
    label: str,
) -> dict[str, Any]:
    """Build a schema-valid AgentCore response for a failed invocation.

    ``label`` is ``"NOT_RUN"`` for level-mismatch failures (the
    evaluator is healthy but cannot score this ``evaluationLevel``) and
    ``"ERROR"`` for everything else (validation failure, evaluator
    raise, Bedrock error). ``score`` is ``None`` and ``passed`` is
    ``False`` in both cases so AgentCore can distinguish a missing score
    from a real ``0.0``. ``reason`` carries a short summary (exception
    class and message); full tracebacks go to CloudWatch via
    ``logger.exception(...)`` in ``lambda_handler``.
    """
    reason = f"{type(exc).__name__}: {exc}"
    return {
        "schemaVersion": event.get("schemaVersion", SCHEMA_VERSION),
        "results": [
            {
                "evaluatorId": event.get("evaluatorId"),
                "evaluatorName": event.get("evaluatorName"),
                "evaluationLevel": event.get("evaluationLevel"),
                "score": None,
                "passed": False,
                "label": label,
                "reason": reason,
            }
        ],
    }
```

Swapping in a different built-in evaluator is a one-line change. Replace the
module-level _EVALUATOR = HelpfulnessEvaluator() with, for example,
OutputEvaluator() for TOOL_CALL-level requests or TrajectoryEvaluator()
for SESSION-level requests — the surrounding parse, invoke, and build-response
logic stays the same.
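A minimal sketch of the SESSION-level variant, assuming `TrajectoryEvaluator` is exported from the same `strands_evals.evaluators` module as `HelpfulnessEvaluator`:

```python
from strands_evals.evaluators import TrajectoryEvaluator  # assumed export

# SESSION-level evaluator; _invoke_evaluator will now only accept
# requests whose evaluationLevel is "SESSION".
_EVALUATOR = TrajectoryEvaluator()
```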
## AgentCore response schema

The handler returns JSON shaped to the AgentCore code-based evaluator response contract — AgentCore uses this payload to record the score and attach it back to the evaluation run.
The response carries these fields:
- `schemaVersion` — echoed from the request so AgentCore can confirm the contract version the handler answered.
- `results[0].evaluatorId` — echoed from the request; the ARN of the registered code-based evaluator.
- `results[0].evaluatorName` — echoed from the request; the human-readable evaluator label.
- `results[0].evaluationLevel` — echoed from the request (`TRACE`, `TOOL_CALL`, or `SESSION`).
- `results[0].score` — numeric score produced by the Strands evaluator, or `null` on the `NOT_RUN` / `ERROR` paths.
- `results[0].passed` — boolean pass indicator. See the defaults table below for how the sample handler derives it.
- `results[0].label` — short string label (`PASS` / `FAIL` / `NOT_RUN` / `ERROR`) that AgentCore surfaces in the dashboard.
- `results[0].reason` — human-readable rationale taken from the evaluator's `EvaluationOutput.reason`.
A successful response looks like this:
{ "schemaVersion": "1.0", "results": [ { "evaluatorId": "arn:aws:bedrock-agentcore:...", "evaluatorName": "strands-helpfulness", "evaluationLevel": "TRACE", "score": 0.83, "passed": true, "label": "PASS", "reason": "Response directly answered the user's question with ..." } ]}AgentCore requires a handful of fields that Strands evaluators don’t produce on their own. The sample handler fills them in as follows:
| Field | Default source in sample handler |
|---|---|
| `schemaVersion` | Echoed from the request |
| `evaluatorId` / `evaluatorName` | Echoed from the request |
| `evaluationLevel` | Echoed from the request |
| `passed` | `test_pass`, else `score >= threshold` (env: `STRANDS_EVAL_PASS_THRESHOLD`) |
| `label` | `PASS` / `FAIL` from `passed` unless the evaluator sets an explicit label |
| `reason` | Taken from `evaluation_output.reason` |
## Round-trip translation

The sample handler preserves a simple round-trip property: parsing a valid
AgentCore payload, running the wrapped Strands evaluator, and serializing the
result produces an AgentCore response whose score equals the evaluator’s
score and whose evaluationLevel equals the request’s evaluationLevel. When
the evaluator doesn’t support the incoming level, the response is still
schema-valid — a NOT_RUN result with score: null and a clear reason
explaining that the evaluator did not run.
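A minimal pytest sketch of that property, assuming the handler module is importable as `agentcore_code_evaluator_handler` and substituting a stub for the module-level evaluator so no judge model is called (the `EvaluationOutput` keyword arguments and the span shape the mapper accepts are assumptions to adapt):

```python
import agentcore_code_evaluator_handler as handler
from strands_evals.types import EvaluationOutput


class _StubEvaluator:
    """Trace-level stand-in that returns a fixed score."""

    level = handler.STRANDS_TRACE_LEVEL

    def evaluate(self, data):
        # Assumes EvaluationOutput accepts score/reason keyword
        # arguments, matching how the handler reads them via getattr.
        return EvaluationOutput(score=0.9, reason="stub reason")


def test_round_trip(monkeypatch):
    monkeypatch.setattr(handler, "_EVALUATOR", _StubEvaluator())
    event = {
        "schemaVersion": "1.0",
        "evaluatorId": "arn:aws:bedrock-agentcore:...:evaluator/example",
        "evaluatorName": "example",
        "evaluationLevel": "TRACE",
        # Span shaped like the worked example below; adjust to whatever
        # StrandsInMemorySessionMapper expects in your SDK version.
        "evaluationInput": {
            "sessionSpans": [
                {
                    "name": "invoke_agent",
                    "attributes": {
                        "gen_ai.prompt": "What's the capital of France?"
                    },
                }
            ]
        },
    }
    response = handler.lambda_handler(event, None)
    result = response["results"][0]
    assert result["score"] == 0.9
    assert result["evaluationLevel"] == event["evaluationLevel"]
```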
### Request → evaluator input

When `_parse_payload` processes a request, each AgentCore field lands in a
specific slot on the Strands EvaluationData the evaluator receives:
- `evaluationInput.sessionSpans` → a `Session` built by `StrandsInMemorySessionMapper`, used as the `trajectory` on `EvaluationData` for `TRACE_LEVEL` and `SESSION_LEVEL` evaluators.
- `input` → the user prompt recovered from the root span's attributes by `_recover_prompt`.
- `actual_output` → the final assistant message in the mapped `Session`, recovered by `_recover_final_output`.
- `reference_input` → taken from `evaluationReferenceInputs` when present, so reference-based evaluators (for example `FaithfulnessEvaluator`) can read the keys they need.
- `evaluationTarget` → read but not forwarded; the handler accepts it for future use (for example logging the target agent version alongside the score) without attaching it to `EvaluationData`.
### Evaluator output → response

When `_build_response` serializes the result, the evaluator's
EvaluationOutput maps back into the AgentCore response as follows:
- `score` → `results[0].score`, echoed verbatim (or `null` on the `NOT_RUN` and `ERROR` paths).
- `reason` → `results[0].reason`, echoed verbatim.
- `test_pass` / `label` → `results[0].passed` and `results[0].label`, with the precedence documented earlier: an evaluator-provided `test_pass` wins over the threshold comparison, and an evaluator-provided `label` wins over the derived `PASS` / `FAIL` vocabulary.
- `schemaVersion`, `evaluatorId`, `evaluatorName`, `evaluationLevel` → echoed straight from the request so AgentCore can correlate the response with the evaluation run that triggered it.
### Worked example

Given a `TRACE`-level request targeting the `HelpfulnessEvaluator`:
{ "schemaVersion": "1.0", "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/helpfulness", "evaluatorName": "strands-helpfulness", "evaluationLevel": "TRACE", "evaluationInput": { "sessionSpans": [ { "name": "invoke_agent", "attributes": { "gen_ai.prompt": "What's the capital of France?" } }, { "name": "assistant.message", "attributes": { "gen_ai.completion": "The capital of France is Paris." } } ] }, "evaluationReferenceInputs": { "reference_output": "Paris" }, "evaluationTarget": { "agentId": "my-agent", "sessionId": "abc-123" }}The sample handler produces this response:
{ "schemaVersion": "1.0", "results": [ { "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/helpfulness", "evaluatorName": "strands-helpfulness", "evaluationLevel": "TRACE", "score": 0.9, "passed": true, "label": "PASS", "reason": "Response directly answered the user's question with Paris." } ]}Deploy the Lambda function
Section titled “Deploy the Lambda function”With the handler code in place, the remaining work is packaging it, creating the Lambda, and wiring up IAM so the function can call Bedrock (for evaluators that need it) and write logs. This section collects the AWS-side steps a Strands reader needs to get a code-based evaluator Lambda running.
### Package the function

Lambda expects a ZIP archive containing the handler module plus its
dependencies. The simplest approach is to pip install the Strands Evals
SDK into a local package/ directory, drop the handler next to it, and zip
everything together:
```bash
mkdir -p package
pip install --target ./package strands-agents-evals
cp agentcore_code_evaluator_handler.py package/
cd package && zip -r ../function.zip . && cd ..
```

### Create the Lambda

Once `function.zip` exists, create the Lambda with the AWS CLI. Use a
Python 3.11 runtime, point --handler at
agentcore_code_evaluator_handler.lambda_handler, and give the function
enough time and memory to load the SDK and invoke the judge model:
```bash
aws lambda create-function \
  --function-name strands-evals-helpfulness \
  --runtime python3.11 \
  --handler agentcore_code_evaluator_handler.lambda_handler \
  --timeout 60 \
  --memory-size 512 \
  --role arn:aws:iam::ACCOUNT_ID:role/StrandsEvalsLambdaRole \
  --zip-file fileb://function.zip
```

Replace `ACCOUNT_ID` with your AWS account ID and substitute the execution
role you create under Execution-role IAM policy below.
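For subsequent handler changes, rebuild `function.zip` and push the new archive without recreating the function:

```bash
aws lambda update-function-code \
  --function-name strands-evals-helpfulness \
  --zip-file fileb://function.zip
```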
### Environment variables

The sample handler reads a small number of environment variables so deployers can tune runtime behavior without editing code:
- `STRANDS_EVAL_PASS_THRESHOLD` — numeric cutoff the handler uses to derive `passed` from `score` when the evaluator doesn't set `test_pass` directly. Defaults to `0.5`. Non-numeric values fall back to the default.
- `AWS_REGION` — region the Strands Evals SDK uses for Bedrock model calls. Lambda sets this automatically to the function's region; override it only when you want the judge model to run in a different region than the Lambda.
Evaluators that instantiate BedrockModel also respect the standard
Bedrock SDK environment variables for overriding the model ID, endpoint,
and credentials — useful if you want to swap the judge model without
redeploying the handler.
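To raise the pass threshold on a deployed function, for example to `0.7`, set the variable through the Lambda configuration:

```bash
aws lambda update-function-configuration \
  --function-name strands-evals-helpfulness \
  --environment "Variables={STRANDS_EVAL_PASS_THRESHOLD=0.7}"
```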
### Execution-role IAM policy

The Lambda execution role needs just two permission families: call the Bedrock models the evaluator uses, and write logs to CloudWatch. The minimal policy below covers both:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "InvokeBedrockModel", "Effect": "Allow", "Action": "bedrock:InvokeModel", "Resource": "*" }, { "Sid": "CloudWatchLogs", "Effect": "Allow", "Action": [ "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents" ], "Resource": "*" } ]}Scope Resource on bedrock:InvokeModel down to the specific model ARNs
your evaluator uses for a tighter policy in production. The wildcard shown
here keeps the example concise.
### Region and additional resources

The Lambda must live in a region that supports both AgentCore Evaluations
and the Bedrock models the chosen evaluator uses. Evaluators that call
Bedrock (for example HelpfulnessEvaluator) fail at runtime if the judge
model isn’t available in the Lambda’s region — pick a region that
satisfies both services before creating the function.
An IAM execution role must exist before aws lambda create-function will
succeed. See the AWS Lambda docs on
creating an execution role
for step-by-step guidance; attach the policy above to the role you create.
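A minimal sketch of creating that role with the AWS CLI, assuming the execution policy above is saved as `policy.json` (the role and policy names are illustrative, matching the `create-function` example earlier):

```bash
# Trust policy letting Lambda assume the role.
aws iam create-role \
  --role-name StrandsEvalsLambdaRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the execution policy from the previous section.
aws iam put-role-policy \
  --role-name StrandsEvalsLambdaRole \
  --policy-name StrandsEvalsLambdaPolicy \
  --policy-document file://policy.json
```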
## Register the code-based evaluator

With the Lambda deployed, the next step is to register it with AgentCore Evaluations so that AgentCore knows which function to invoke for a given evaluator. AgentCore's code-based evaluators dev guide covers the registration procedure in detail; this section summarizes what registration accomplishes and the AWS-side IAM permissions AgentCore needs to call the function.
Registration does three things. AgentCore records the Lambda ARN against a
new evaluator resource, assigns that resource an evaluatorId (the same
value that flows back into the evaluatorId field of every request the
handler receives), and grants itself invoke permissions on the function so
it can dispatch evaluation calls at runtime.
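Before registering, it can be worth smoke-testing the deployed function directly, for example by saving the worked-example request from earlier as `sample_event.json` (an illustrative filename) and invoking the Lambda:

```bash
aws lambda invoke \
  --function-name strands-evals-helpfulness \
  --cli-binary-format raw-in-base64-out \
  --payload file://sample_event.json \
  response.json

cat response.json  # expect a schema-valid results[0] with score and label
```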
### AgentCore-side IAM policy

AgentCore needs permission to invoke and introspect the deployed Lambda. The minimal policy below grants both, scoped to a single function ARN:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "InvokeEvaluatorLambda", "Effect": "Allow", "Action": [ "lambda:InvokeFunction", "lambda:GetFunction" ], "Resource": "arn:aws:lambda:us-east-1:ACCOUNT_ID:function:strands-evals-helpfulness" } ]}AgentCore normally manages this policy automatically when you register the evaluator through the registration console or API — the explicit policy above is shown for reference and for deployments that prefer to attach it manually.
## Run an evaluation

With the Lambda deployed and the evaluator registered, AgentCore Evaluations can invoke the code-based evaluator against the target agent's traces, sessions, or tool calls as part of an evaluation run. The CLI invocation below is a minimal starting point that references the registered evaluator by ARN and points at an agent and session to score:
```bash
aws bedrock-agentcore-control start-evaluation-run \
  --evaluator-id EVALUATOR_ARN \
  --target agentId=AGENT_ID,sessionId=SESSION_ID \
  --region us-east-1
```

The exact command name and arguments follow the AWS dev guide's current code-based evaluators API — if the invocation above doesn't match what your AWS CLI expects, the AWS page linked earlier in this guide is the source of truth.
The evaluator’s scores appear in the AgentCore Evaluations dashboard — see AgentCore Evaluation Dashboard for how to interpret the results.
## TypeScript support

## Related documentation
Section titled “Related documentation”- Evals SDK quickstart — install the SDK and write your first evaluator.
- Evaluators overview — the built-in evaluators available to wrap for AgentCore.
- Custom evaluators — build a domain-specific evaluator to wrap.
- Deploy to AgentCore — how to deploy the agent you’re evaluating.
- AgentCore Evaluation Dashboard — interpret the scores your code-based evaluator produces.