
Deploy Strands Evaluators as AgentCore Code-Based Evaluators

This guide is for a Strands developer who has deployed agents to Amazon Bedrock AgentCore and wants to run Strands evaluators against those agents. It walks through wrapping an evaluator from the Python Strands Evals SDK (strands-agents-evals) as an AgentCore code-based evaluator — an AWS Lambda function that AgentCore Evaluations invokes to score sessions, traces, or tool calls.

Before following this guide, make sure you have:

  • Python 3.10+
  • strands-agents installed
  • strands-agents-evals installed
  • An AWS account with access to AWS Lambda
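
Both Strands packages install from PyPI:

Terminal window
pip install strands-agents strands-agents-evals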

The diagram below shows the control flow inside the Lambda handler at runtime, from the moment AgentCore Evaluations invokes it to the schema-valid response the handler returns.

sequenceDiagram
    participant AC as AgentCore Evaluations
    participant L as Lambda (sample handler)
    participant M as StrandsInMemory\nSessionMapper
    participant E as Strands Evaluator\n(HelpfulnessEvaluator)
    AC->>L: event (schemaVersion, evaluationLevel,\nsessionSpans, referenceInputs, target)
    L->>L: parse_payload(event)
    L->>M: map sessionSpans → Session
    M-->>L: Session
    L->>E: evaluator.evaluate(EvaluationData)
    alt evaluator supports level
        E-->>L: EvaluationOutput(score, reason)
        L->>L: build_response(score, reason,\nlevel, evaluatorId)
        L-->>AC: schema-valid response JSON
    else unsupported level
        L->>L: error_handling: build "not run" response
        L-->>AC: schema-valid response JSON\n(score=None, reason=explanation)
    else evaluator raised
        L->>L: error_handling: catch, log,\nbuild failure response
        L-->>AC: schema-valid response JSON\n(score=None, reason=error)
    end

AgentCore Evaluations invokes your Lambda with a JSON event defined by the AWS code-based evaluators contract.

The handler reads every field listed below:

  • schemaVersion — contract version; currently "1.0".
  • evaluatorId — ARN of the registered code-based evaluator. Example: "arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/strands-helpfulness".
  • evaluatorName — human-readable label. Example: "strands-helpfulness".
  • evaluationLevel — one of TRACE, TOOL_CALL, or SESSION. Determines the Strands evaluator level the handler must match before it can score the request.
  • evaluationInput.sessionSpans — array of OTel-style span dicts the handler feeds to StrandsInMemorySessionMapper to build a Session.
  • evaluationReferenceInputs — optional reference data for evaluators that need it (for example FaithfulnessEvaluator). Example: {"reference_output": "Paris"}.
  • evaluationTarget — metadata about the target agent or session. Example: {"agentId": "my-agent", "sessionId": "abc-123"}.

A complete request looks like this:

{
  "schemaVersion": "1.0",
  "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/...",
  "evaluatorName": "strands-helpfulness",
  "evaluationLevel": "TRACE",
  "evaluationInput": {
    "sessionSpans": [ /* array of OTel-style span dicts */ ]
  },
  "evaluationReferenceInputs": { /* optional reference data */ },
  "evaluationTarget": { /* metadata about the target agent/session */ }
}

AgentCore’s evaluationLevel must line up with the Strands evaluator’s declared level for the handler to score the request. The table below shows the one-to-one mapping the handler relies on:

| AgentCore evaluationLevel | Strands evaluator level | Typical Strands evaluators |
| --- | --- | --- |
| TOOL_CALL | OUTPUT_LEVEL | OutputEvaluator, deterministic evaluators |
| TRACE | TRACE_LEVEL | HelpfulnessEvaluator, FaithfulnessEvaluator |
| SESSION | SESSION_LEVEL | TrajectoryEvaluator, GoalSuccessRateEvaluator |

When the incoming level doesn’t match the wrapped evaluator, the handler returns a schema-valid “not-run” response instead of raising — the details of that response shape are covered in the error-handling section later on this page.

The rest of this section walks through the sample handler region by region. Every code block is pulled verbatim from agentcore_code_evaluator_handler.py via MkDocs snippet syntax, so the docs and the runnable sample can’t drift apart.

The imports region pulls in the stdlib utilities, the Strands Evals SDK building blocks the handler needs, and the module-level constants echoed back in every AgentCore response. The evaluator instance is created at module scope rather than inside lambda_handler so warm Lambda invocations reuse the same HelpfulnessEvaluator — and therefore the same judge-model client — instead of re-initializing it on every call.

SCHEMA_VERSION is the AgentCore code-based evaluator contract version the handler echoes in responses. PASS_THRESHOLD_ENV and DEFAULT_PASS_THRESHOLD give deployers a way to override the numeric cutoff that maps evaluator scores to PASS / FAIL without editing code. The AGENTCORE_LEVEL_* constants are the exact string values AgentCore sends for evaluationLevel, defined once so the rest of the handler can reference them symbolically.

import json
import logging
import os
from typing import Any

from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.types import EvaluationData, EvaluationOutput

logger = logging.getLogger(__name__)

# AgentCore code-based evaluator contract version echoed in responses.
SCHEMA_VERSION = "1.0"

# Env var used to override the pass/fail threshold at deploy time.
PASS_THRESHOLD_ENV = "STRANDS_EVAL_PASS_THRESHOLD"

# Default pass threshold when the evaluator does not set ``test_pass``.
DEFAULT_PASS_THRESHOLD = 0.5

# AgentCore ``evaluationLevel`` values, per the AgentCore contract:
# https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/code-based-evaluators.html
AGENTCORE_LEVEL_TOOL_CALL = "TOOL_CALL"
AGENTCORE_LEVEL_TRACE = "TRACE"
AGENTCORE_LEVEL_SESSION = "SESSION"

lambda_handler is intentionally small: parse the event, invoke the evaluator, build the response. Each step lives in its own helper so the docs can introduce them one at a time, and so unit tests can exercise each stage independently.

The outer try/except is the key part. AgentCore treats any Lambda error as an evaluator failure that aborts the evaluation run, so the handler converts every failure — level mismatches, validation errors, evaluator exceptions — into a schema-valid response. The two helpers referenced here, LevelMismatchError and _error_response, are defined alongside the other error-handling code further down.

# Module-level evaluator instance reused across Lambda invocations so
# warm starts don't re-initialize the judge-model client on every call.
_EVALUATOR = HelpfulnessEvaluator()


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point AgentCore Evaluations invokes for each scored unit.

    Parses the AgentCore request, runs the configured Strands evaluator,
    and returns a schema-valid AgentCore response. All failure modes are
    converted into schema-valid responses so AgentCore never observes an
    unhandled Lambda error. ``LevelMismatchError`` and ``_error_response``
    are defined alongside the other error-handling helpers below.
    """
    try:
        evaluation_data, level = _parse_payload(event)
        evaluation_output = _invoke_evaluator(
            _EVALUATOR, evaluation_data, level
        )
        return _build_response(event, evaluation_output, status="OK")
    except LevelMismatchError as exc:
        return _error_response(event, exc, label="NOT_RUN")
    except Exception as exc:  # noqa: BLE001 - graceful failure response
        logger.exception("Evaluator invocation failed")
        return _error_response(event, exc, label="ERROR")

_parse_payload translates the AgentCore request into a Strands EvaluationData. It starts with a soft version check: an unexpected or missing schemaVersion logs a warning but doesn’t abort, because the rest of the payload may still be readable. evaluationLevel is different — it’s required to pick the right evaluator path, so a missing value raises a ValueError that the outer try/except turns into an ERROR response.

The OTel-style sessionSpans are fed to StrandsInMemorySessionMapper to produce the Session object that Strands evaluators accept. Optional passthroughs (evaluationReferenceInputs and evaluationTarget) are read only when the evaluator needs them.

The _recover_prompt and _recover_final_output helpers are illustrative. They pull the user prompt from the root span’s attributes and the final assistant message from the session’s message list, which works for the shape this sample assumes. Adapt both helpers to match the span and message schema your agent actually records.

def _parse_payload(
    event: dict[str, Any],
) -> tuple[EvaluationData, str]:
    """Translate an AgentCore request into Strands evaluator input.

    Returns the ``EvaluationData`` the evaluator will score and the
    ``evaluationLevel`` string from the request so the caller can
    compare it against the evaluator's declared level.
    """
    # Version check. Unknown versions are logged but not fatal; the
    # request may still carry fields we know how to read.
    schema_version = event.get("schemaVersion")
    if schema_version is None:
        logger.warning("AgentCore request is missing schemaVersion")
    elif schema_version != SCHEMA_VERSION:
        logger.warning(
            "Unexpected schemaVersion %r (handler expects %r)",
            schema_version,
            SCHEMA_VERSION,
        )

    # ``evaluationLevel`` is required to pick the right evaluator path.
    # The outer handler converts ValueError into an ERROR response that
    # AgentCore can surface to the reader.
    evaluation_level = event.get("evaluationLevel")
    if not evaluation_level:
        raise ValueError(
            "AgentCore request is missing required 'evaluationLevel'"
        )

    # Convert the OTel-style session spans into a Strands ``Session``.
    evaluation_input = event.get("evaluationInput") or {}
    session_spans = evaluation_input.get("sessionSpans") or []
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map(session_spans)

    # Optional passthroughs. AgentCore only sends these fields when the
    # evaluation configuration references them.
    reference_inputs = event.get("evaluationReferenceInputs")
    evaluation_target = event.get("evaluationTarget")

    # Illustrative defaults: recover the prompt from the root span's
    # attributes and treat the final assistant message as the output.
    # Adjust these helpers to match the span and message shape your
    # agent actually records.
    input_text = _recover_prompt(session)
    actual_output = _recover_final_output(session)
    evaluation_data = EvaluationData(
        input=input_text,
        actual_output=actual_output,
        trajectory=session,
        reference_input=reference_inputs,
    )

    # ``evaluation_target`` is accepted for future use (for example
    # logging the target agent version alongside the score) but is not
    # consumed by ``EvaluationData`` today.
    _ = evaluation_target
    return evaluation_data, evaluation_level


def _recover_prompt(session: Any) -> str:
    """Pull the user prompt out of the mapped session.

    Illustrative: reads the root span's attributes. Real deployments
    should adapt this to match the span schema their agent emits.
    """
    root_span = getattr(session, "root_span", None)
    attributes = getattr(root_span, "attributes", {}) or {}
    return str(
        attributes.get("input") or attributes.get("prompt") or ""
    )


def _recover_final_output(session: Any) -> str:
    """Pull the final assistant message out of the mapped session.

    Illustrative: reads the last message on the session. Real
    deployments should adapt this to match the message shape their
    agent records.
    """
    messages = getattr(session, "messages", []) or []
    if not messages:
        return ""
    last = messages[-1]
    return str(getattr(last, "content", last) or "")

_invoke_evaluator enforces the level-compatibility check before calling the evaluator. It reads the evaluator’s declared Strands level (for HelpfulnessEvaluator that’s TRACE_LEVEL) and compares it to the expected Strands level for the incoming AgentCore level via _AGENTCORE_TO_STRANDS_LEVEL, the single mapping table the handler uses for this translation.

On a mismatch, the function raises LevelMismatchError. Python resolves the name at call time, so the class can be defined later in the module, in the error-handling region below. The outer handler catches the mismatch and converts it into the NOT_RUN response.

# Strands evaluators expose their native level as a string. The sample
# wraps ``HelpfulnessEvaluator``, which is a ``TRACE_LEVEL`` evaluator.
STRANDS_OUTPUT_LEVEL = "OUTPUT_LEVEL"
STRANDS_TRACE_LEVEL = "TRACE_LEVEL"
STRANDS_SESSION_LEVEL = "SESSION_LEVEL"

# How each AgentCore ``evaluationLevel`` maps to a Strands level.
_AGENTCORE_TO_STRANDS_LEVEL = {
    AGENTCORE_LEVEL_TOOL_CALL: STRANDS_OUTPUT_LEVEL,
    AGENTCORE_LEVEL_TRACE: STRANDS_TRACE_LEVEL,
    AGENTCORE_LEVEL_SESSION: STRANDS_SESSION_LEVEL,
}


def _invoke_evaluator(
    evaluator: Any,
    evaluation_data: EvaluationData,
    level: str,
) -> EvaluationOutput:
    """Run the evaluator when the request level matches its Strands level.

    Raises ``LevelMismatchError`` when the incoming AgentCore level does
    not correspond to the evaluator's declared Strands level. The outer
    handler converts that into a schema-valid "not-run" response so
    AgentCore never observes an unhandled Lambda error. The
    ``LevelMismatchError`` class is defined in the error-handling region
    below; Python resolves the forward reference at call time.
    """
    # Every Strands evaluator declares the level at which it operates.
    # ``HelpfulnessEvaluator`` is ``TRACE_LEVEL``; ``OutputEvaluator`` is
    # ``OUTPUT_LEVEL``; ``TrajectoryEvaluator`` is ``SESSION_LEVEL``.
    # The fallback matches this sample's default evaluator.
    evaluator_level = getattr(evaluator, "level", STRANDS_TRACE_LEVEL)

    # ``.get(...)`` returns ``None`` for unknown AgentCore levels, which
    # never matches ``evaluator_level`` and so raises the mismatch error.
    expected_strands_level = _AGENTCORE_TO_STRANDS_LEVEL.get(level)
    if expected_strands_level != evaluator_level:
        raise LevelMismatchError(
            f"Evaluator level {evaluator_level!r} does not support "
            f"AgentCore evaluationLevel {level!r}"
        )
    return evaluator.evaluate(evaluation_data)

_build_response shapes the AgentCore response the Lambda returns on the happy path. The pass threshold is read per-invocation from STRANDS_EVAL_PASS_THRESHOLD (falling back to DEFAULT_PASS_THRESHOLD when the env var is missing or non-numeric), so tests and deployments can override it without touching code.

passed and label follow an explicit precedence. If the evaluator sets test_pass directly — deterministic evaluators often do — that value wins; otherwise passed is derived by comparing score to the threshold. The label preserves an evaluator-provided value when one exists, and otherwise falls back to the AgentCore PASS / FAIL vocabulary. schemaVersion, evaluatorId, evaluatorName, and evaluationLevel are echoed straight from the request so AgentCore can correlate the response with the evaluation it triggered.

def _build_response(
    event: dict[str, Any],
    evaluation_output: EvaluationOutput,
    status: str,
) -> dict[str, Any]:
    """Assemble an AgentCore response from a Strands evaluator output.

    ``status`` is accepted for symmetry with ``_error_response(label=...)``
    so the docs can explain the two response helpers in parallel. The
    happy path always derives ``label`` from ``passed`` (or an explicit
    evaluator label), so the flag is not consumed here.
    """
    _ = status  # reserved for parity with _error_response

    # Pass threshold for turning numeric scores into pass/fail booleans.
    # Read per-invocation so tests and deploys can override via env var.
    try:
        threshold = float(
            os.environ.get(PASS_THRESHOLD_ENV, DEFAULT_PASS_THRESHOLD)
        )
    except (TypeError, ValueError):
        threshold = DEFAULT_PASS_THRESHOLD

    score = getattr(evaluation_output, "score", None)
    reason = getattr(evaluation_output, "reason", None)

    # Strands evaluators may set ``test_pass`` directly (deterministic
    # evaluators often do); otherwise fall back to the threshold compare.
    explicit_pass = getattr(evaluation_output, "test_pass", None)
    if explicit_pass is not None:
        passed = bool(explicit_pass)
    else:
        passed = score is not None and score >= threshold

    # Prefer an evaluator-provided label when one exists; otherwise map
    # the boolean into the AgentCore ``PASS`` / ``FAIL`` vocabulary.
    explicit_label = getattr(evaluation_output, "label", None)
    label = explicit_label if explicit_label else (
        "PASS" if passed else "FAIL"
    )

    return {
        "schemaVersion": event.get("schemaVersion", SCHEMA_VERSION),
        "results": [
            {
                "evaluatorId": event.get("evaluatorId"),
                "evaluatorName": event.get("evaluatorName"),
                "evaluationLevel": event.get("evaluationLevel"),
                "score": score,
                "passed": passed,
                "label": label,
                "reason": reason,
            }
        ],
    }

The error_handling region collects the three failure modes the handler is prepared for:

  • Level mismatch. LevelMismatchError is raised by _invoke_evaluator when the incoming evaluationLevel doesn’t map to the evaluator’s declared Strands level. The outer handler catches it and calls _error_response with label="NOT_RUN", signaling that the evaluator is healthy but cannot score this request.
  • Unexpected exceptions. Any other Exception — an evaluator crash, a Bedrock error, a bug in the mapping helpers — is caught by the generic except Exception in lambda_handler, logged with logger.exception(...) so the full traceback lands in CloudWatch, and converted into an ERROR response.
  • Missing required fields. When _parse_payload encounters a missing evaluationLevel, it raises ValueError. That also falls into the generic catch above and surfaces as an ERROR response.

All three failure modes return schema-valid responses because AgentCore treats an uncaught Lambda error as an evaluator failure that aborts the evaluation run — surfacing a graceful score with a clear reason is almost always more useful than taking the whole run down.

class LevelMismatchError(Exception):
    """Raised when AgentCore's ``evaluationLevel`` does not map to the
    wrapped evaluator's declared Strands level.

    The outer ``try/except`` in ``lambda_handler`` catches this and
    converts it into a schema-valid "not-run" response (``score=None``,
    ``passed=False``, ``label="NOT_RUN"``) so AgentCore surfaces a
    graceful result instead of an unhandled Lambda error.
    """


def _error_response(
    event: dict[str, Any],
    exc: Exception,
    label: str,
) -> dict[str, Any]:
    """Build a schema-valid AgentCore response for a failed invocation.

    ``label`` is ``"NOT_RUN"`` for level-mismatch failures (the
    evaluator is healthy but cannot score this ``evaluationLevel``) and
    ``"ERROR"`` for everything else (validation failure, evaluator
    raise, Bedrock error). ``score`` is ``None`` and ``passed`` is
    ``False`` in both cases so AgentCore can distinguish a missing
    score from a real ``0.0``. ``reason`` carries a short summary
    (exception class and message); full tracebacks go to CloudWatch
    via ``logger.exception(...)`` in ``lambda_handler``.
    """
    reason = f"{type(exc).__name__}: {exc}"
    return {
        "schemaVersion": event.get("schemaVersion", SCHEMA_VERSION),
        "results": [
            {
                "evaluatorId": event.get("evaluatorId"),
                "evaluatorName": event.get("evaluatorName"),
                "evaluationLevel": event.get("evaluationLevel"),
                "score": None,
                "passed": False,
                "label": label,
                "reason": reason,
            }
        ],
    }

Swapping in a different built-in evaluator is a one-line change. Replace the module-level _EVALUATOR = HelpfulnessEvaluator() with, for example, OutputEvaluator() for TOOL_CALL-level requests or TrajectoryEvaluator() for SESSION-level requests — the surrounding parse, invoke, and build-response logic stays the same.
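
As a sketch, assuming TrajectoryEvaluator is importable from strands_evals.evaluators the same way HelpfulnessEvaluator is, the SESSION-level variant changes only the evaluator line:

# Hypothetical SESSION-level variant of the sample handler. Assumes
# TrajectoryEvaluator lives in strands_evals.evaluators alongside
# HelpfulnessEvaluator; adjust the import to match your SDK version.
from strands_evals.evaluators import TrajectoryEvaluator

# Only SESSION-level requests now pass the level-compatibility check.
_EVALUATOR = TrajectoryEvaluator()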

The handler returns JSON shaped to the AgentCore code-based evaluator response contract — AgentCore uses this payload to record the score and attach it back to the evaluation run.

The response carries these fields:

  • schemaVersion — echoed from the request so AgentCore can confirm the contract version the handler answered.
  • results[0].evaluatorId — echoed from the request; the ARN of the registered code-based evaluator.
  • results[0].evaluatorName — echoed from the request; the human-readable evaluator label.
  • results[0].evaluationLevel — echoed from the request (TRACE, TOOL_CALL, or SESSION).
  • results[0].score — numeric score produced by the Strands evaluator, or null on the NOT_RUN / ERROR paths.
  • results[0].passed — boolean pass indicator. See the defaults table below for how the sample handler derives it.
  • results[0].label — short string label (PASS / FAIL / NOT_RUN / ERROR) that AgentCore surfaces in the dashboard.
  • results[0].reason — human-readable rationale taken from the evaluator’s EvaluationOutput.reason.

A successful response looks like this:

{
  "schemaVersion": "1.0",
  "results": [
    {
      "evaluatorId": "arn:aws:bedrock-agentcore:...",
      "evaluatorName": "strands-helpfulness",
      "evaluationLevel": "TRACE",
      "score": 0.83,
      "passed": true,
      "label": "PASS",
      "reason": "Response directly answered the user's question with ..."
    }
  ]
}

AgentCore requires a handful of fields that Strands evaluators don’t produce on their own. The sample handler fills them in as follows:

| Field | Default source in sample handler |
| --- | --- |
| schemaVersion | Echoed from the request |
| evaluatorId / evaluatorName | Echoed from the request |
| evaluationLevel | Echoed from the request |
| passed | test_pass when the evaluator sets it, else score >= threshold (env: STRANDS_EVAL_PASS_THRESHOLD) |
| label | PASS / FAIL derived from passed, unless the evaluator sets an explicit label |
| reason | Taken from evaluation_output.reason |

The sample handler preserves a simple round-trip property: parsing a valid AgentCore payload, running the wrapped Strands evaluator, and serializing the result produces an AgentCore response whose score equals the evaluator’s score and whose evaluationLevel equals the request’s evaluationLevel. When the evaluator doesn’t support the incoming level, the response is still schema-valid — a NOT_RUN result with score: null and a clear reason explaining that the evaluator did not run.
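
A quick local check of that property is a unit test that swaps the module-level evaluator for a stub, so no Bedrock call is needed. The sketch below makes two assumptions worth flagging: pytest's monkeypatch fixture is available, and StrandsInMemorySessionMapper tolerates an empty sessionSpans list; the stub's EvaluationOutput fields follow the shapes used throughout this page.

# Round-trip sketch (pytest). A stub evaluator stands in for the real
# HelpfulnessEvaluator so the handler runs without Bedrock access.
import agentcore_code_evaluator_handler as handler
from strands_evals.types import EvaluationOutput


class StubEvaluator:
    level = handler.STRANDS_TRACE_LEVEL  # matches a TRACE request

    def evaluate(self, data):
        return EvaluationOutput(score=0.9, reason="stub reason")


def test_round_trip(monkeypatch):
    monkeypatch.setattr(handler, "_EVALUATOR", StubEvaluator())
    event = {
        "schemaVersion": "1.0",
        "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/x",
        "evaluatorName": "strands-helpfulness",
        "evaluationLevel": "TRACE",
        "evaluationInput": {"sessionSpans": []},  # assumes the mapper accepts []
    }
    response = handler.lambda_handler(event, None)
    result = response["results"][0]
    assert result["score"] == 0.9  # evaluator's score echoed verbatim
    assert result["evaluationLevel"] == "TRACE"  # request level echoed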

When _parse_payload processes a request, each AgentCore field lands in a specific slot on the Strands EvaluationData the evaluator receives:

  • evaluationInput.sessionSpans → EvaluationData.trajectory, a Session built by StrandsInMemorySessionMapper and used by TRACE_LEVEL and SESSION_LEVEL evaluators.
  • root span attributes → EvaluationData.input, the user prompt recovered by _recover_prompt.
  • final assistant message → EvaluationData.actual_output, recovered from the mapped Session by _recover_final_output.
  • evaluationReferenceInputs → EvaluationData.reference_input, passed through when present so reference-based evaluators (for example FaithfulnessEvaluator) can read the keys they need.
  • evaluationTarget → read but not mapped onto EvaluationData; the handler accepts it for future use (for example, logging the target agent version alongside the score) and does not consume it today.

When _build_response serializes the result, the evaluator’s EvaluationOutput maps back into the AgentCore response as follows:

  • score → results[0].score, echoed verbatim (or null on the NOT_RUN and ERROR paths).
  • reason → results[0].reason, echoed verbatim.
  • test_pass / label → results[0].passed and results[0].label, with the precedence documented earlier: an evaluator-provided test_pass wins over the threshold comparison, and an evaluator-provided label wins over the derived PASS / FAIL vocabulary.
  • schemaVersion, evaluatorId, evaluatorName, evaluationLevel → echoed straight from the request so AgentCore can correlate the response with the evaluation run that triggered it.

Given a TRACE-level request targeting the HelpfulnessEvaluator:

{
  "schemaVersion": "1.0",
  "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/helpfulness",
  "evaluatorName": "strands-helpfulness",
  "evaluationLevel": "TRACE",
  "evaluationInput": {
    "sessionSpans": [
      {
        "name": "invoke_agent",
        "attributes": {
          "gen_ai.prompt": "What's the capital of France?"
        }
      },
      {
        "name": "assistant.message",
        "attributes": {
          "gen_ai.completion": "The capital of France is Paris."
        }
      }
    ]
  },
  "evaluationReferenceInputs": {
    "reference_output": "Paris"
  },
  "evaluationTarget": {
    "agentId": "my-agent",
    "sessionId": "abc-123"
  }
}

The sample handler produces a response shaped like this (the exact score and reason vary because they come from the judge model):

{
  "schemaVersion": "1.0",
  "results": [
    {
      "evaluatorId": "arn:aws:bedrock-agentcore:us-east-1:...:evaluator/helpfulness",
      "evaluatorName": "strands-helpfulness",
      "evaluationLevel": "TRACE",
      "score": 0.9,
      "passed": true,
      "label": "PASS",
      "reason": "Response directly answered the user's question with Paris."
    }
  ]
}

With the handler code in place, the remaining work is packaging it, creating the Lambda, and wiring up IAM so the function can call Bedrock (for evaluators that need it) and write logs. This section collects the AWS-side steps a Strands reader needs to get a code-based evaluator Lambda running.

Lambda expects a ZIP archive containing the handler module plus its dependencies. The simplest approach is to pip install the Strands Evals SDK into a local package/ directory, drop the handler next to it, and zip everything together:

Terminal window
mkdir -p package
pip install --target ./package strands-agents-evals
cp agentcore_code_evaluator_handler.py package/
cd package && zip -r ../function.zip . && cd ..

Once function.zip exists, create the Lambda with the AWS CLI. Use a Python 3.11 runtime, point --handler at agentcore_code_evaluator_handler.lambda_handler, and give the function enough time and memory to load the SDK and invoke the judge model:

Terminal window
aws lambda create-function \
  --function-name strands-evals-helpfulness \
  --runtime python3.11 \
  --handler agentcore_code_evaluator_handler.lambda_handler \
  --timeout 60 \
  --memory-size 512 \
  --role arn:aws:iam::ACCOUNT_ID:role/StrandsEvalsLambdaRole \
  --zip-file fileb://function.zip

Replace ACCOUNT_ID with your AWS account ID and substitute the execution role you create under Execution-role IAM policy below.

The sample handler reads a small number of environment variables so deployers can tune runtime behavior without editing code:

  • STRANDS_EVAL_PASS_THRESHOLD — numeric cutoff the handler uses to derive passed from score when the evaluator doesn’t set test_pass directly. Defaults to 0.5. Non-numeric values fall back to the default.
  • AWS_REGION — region the Strands Evals SDK uses for Bedrock model calls. Lambda sets this automatically to the function’s region; override it only when you want the judge model to run in a different region than the Lambda.

Evaluators that instantiate BedrockModel also respect the standard Bedrock SDK environment variables for overriding the model ID, endpoint, and credentials — useful if you want to swap the judge model without redeploying the handler.
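
For example, raising the pass threshold on the deployed function with the AWS CLI (function name from the create-function command above):

Terminal window
aws lambda update-function-configuration \
  --function-name strands-evals-helpfulness \
  --environment "Variables={STRANDS_EVAL_PASS_THRESHOLD=0.7}"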

The Lambda execution role needs just two permission families: invoking the Bedrock models the evaluator uses, and writing logs to CloudWatch. The minimal policy below covers both:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeBedrockModel",
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "*"
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}

Scope Resource on bedrock:InvokeModel down to the specific model ARNs your evaluator uses for a tighter policy in production. The wildcard shown here keeps the example concise.
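
As an illustration, a scoped statement for a single judge model might look like the following. The model ID is a placeholder; substitute the model your evaluator actually calls (cross-region inference profiles use a different ARN shape):

{
  "Sid": "InvokeBedrockModel",
  "Effect": "Allow",
  "Action": "bedrock:InvokeModel",
  "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
}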

The Lambda must live in a region that supports both AgentCore Evaluations and the Bedrock models the chosen evaluator uses. Evaluators that call Bedrock (for example HelpfulnessEvaluator) fail at runtime if the judge model isn’t available in the Lambda’s region — pick a region that satisfies both services before creating the function.

An IAM execution role must exist before aws lambda create-function will succeed. See the AWS Lambda docs on creating an execution role for step-by-step guidance; attach the policy above to the role you create.
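
Under that standard flow, creating the role and attaching the policy from the CLI looks roughly like this (the role and policy names match the create-function example above; policy.json is the policy document shown earlier, saved locally):

Terminal window
aws iam create-role \
  --role-name StrandsEvalsLambdaRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'
aws iam put-role-policy \
  --role-name StrandsEvalsLambdaRole \
  --policy-name StrandsEvalsBedrockAndLogs \
  --policy-document file://policy.json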

With the Lambda deployed, the next step is to register it with AgentCore Evaluations so that AgentCore knows which function to invoke for a given evaluator. AgentCore’s code-based evaluators dev guide covers the registration procedure in detail; this section summarizes what registration accomplishes and the AWS-side IAM permissions AgentCore needs to call the function.

Registration does three things. AgentCore records the Lambda ARN against a new evaluator resource, assigns that resource an evaluatorId (the same value that flows back into the evaluatorId field of every request the handler receives), and grants itself invoke permissions on the function so it can dispatch evaluation calls at runtime.

AgentCore needs permission to invoke and introspect the deployed Lambda. The minimal policy below grants both, scoped to a single function ARN:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeEvaluatorLambda",
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction",
        "lambda:GetFunction"
      ],
      "Resource": "arn:aws:lambda:us-east-1:ACCOUNT_ID:function:strands-evals-helpfulness"
    }
  ]
}

AgentCore normally manages this policy automatically when you register the evaluator through the registration console or API — the explicit policy above is shown for reference and for deployments that prefer to attach it manually.

With the Lambda deployed and the evaluator registered, AgentCore Evaluations can invoke the code-based evaluator against the target agent’s traces, sessions, or tool calls as part of an evaluation run. The CLI invocation below is a minimal starting point that references the registered evaluator by ARN and points at an agent and session to score:

Terminal window
aws bedrock-agentcore-control start-evaluation-run \
  --evaluator-id EVALUATOR_ARN \
  --target agentId=AGENT_ID,sessionId=SESSION_ID \
  --region us-east-1

The exact command name and arguments follow the AWS dev guide’s current code-based evaluators API — if the invocation above doesn’t match what your AWS CLI expects, the AWS page linked earlier in this guide is the source of truth.

The evaluator’s scores appear in the AgentCore Evaluations dashboard — see AgentCore Evaluation Dashboard for how to interpret the results.