
AWS Neuron

strands-neuron is a vLLM on AWS Neuron model provider for Strands Agents SDK. It connects to vLLM servers running on AWS AI Chips (Trainium and Inferentia) via an OpenAI-compatible API, enabling high-performance LLM inference on AWS Neuron hardware.

Features:

  • OpenAI-Compatible API: Works with any OpenAI-compatible vLLM server
  • Full Streaming Support: Async generators for real-time token streaming
  • Tool/Function Calling: Native support for function calling and tool use
  • Structured Output: Generate structured data via tool calls
  • Neuron-Optimized: Designed for AWS Neuron hardware acceleration
  • Flexible Configuration: Extensive configuration options for model behavior

Install strands-neuron along with the Strands Agents SDK:

```sh
pip install strands-neuron strands-agents
```

Prerequisites:

  • AWS EC2 instance with Neuron hardware (inf2, trn1, trn2, or trn3)
  • AWS Neuron Deep Learning AMI (DLAMI) for Ubuntu 22.04
  • A running vLLM Neuron server accessible via HTTP

Set up and start your vLLM Neuron server on your AWS Neuron instance. The server should expose an OpenAI-compatible endpoint (default: http://localhost:8080/v1).

For tool calling support, start vLLM with the appropriate flags:

```sh
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8080 \
  --enable-auto-tool-choice \
  --tool-call-parser <PARSER>  # e.g., llama3_json, mistral, etc.
```
Once the server is up, point a NeuronModel at it and create an agent:

```python
from strands import Agent
from strands_neuron import NeuronModel

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",  # Not required for local servers
    }
)

agent = Agent(
    system_prompt="You are a helpful assistant.",
    model=model,
)

response = agent("What is machine learning?")
print(response)
```

The NeuronModel accepts a config dictionary with the following parameters:

| Parameter | Description | Example | Required |
| --- | --- | --- | --- |
| `model_id` | Model identifier | `"mistralai/Mistral-7B-Instruct-v0.3"` | Yes |
| `base_url` | Base URL for the OpenAI-compatible API | `"http://localhost:8080/v1"` | No (default: `"http://localhost:8080/v1"`) |
| `api_key` | API key for authentication | `"EMPTY"` | No (default: `"EMPTY"`) |
| `support_tool_choice_auto` | Set `True` if vLLM was started with the `--enable-auto-tool-choice` and `--tool-call-parser` flags | `True` | No (default: `False`) |
| `temperature` | Sampling temperature (0.0 to 2.0) | `0.7` | No |
| `top_p` | Nucleus sampling parameter | `0.9` | No |
| `max_completion_tokens` | Maximum tokens to generate | `1000` | No |
| `stop` | Sequences that stop generation | `["\n\n"]` | No |
| `frequency_penalty` | Penalize tokens based on frequency (-2.0 to 2.0) | `0.0` | No |
| `presence_penalty` | Penalize tokens based on presence (-2.0 to 2.0) | `0.0` | No |
| `additional_args` | Additional arguments passed to the API request | `{}` | No |
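As a rough mental model (this is an illustration, not the actual strands-neuron internals), the config fields map onto the JSON body of an OpenAI-compatible `/v1/chat/completions` request. The `build_payload` helper below is hypothetical:

```python
# Hypothetical sketch: how NeuronModel config fields might map onto an
# OpenAI-compatible /v1/chat/completions request body. Not the actual
# strands-neuron implementation.

def build_payload(config: dict, messages: list) -> dict:
    payload = {
        "model": config["model_id"],
        "messages": messages,
        "stream": True,  # illustrative; the provider supports streaming
    }
    # Optional sampling parameters are only sent when set.
    for key in ("temperature", "top_p", "max_completion_tokens",
                "stop", "frequency_penalty", "presence_penalty"):
        if key in config:
            payload[key] = config[key]
    # additional_args are merged last, so they can override anything above.
    payload.update(config.get("additional_args", {}))
    return payload

payload = build_payload(
    {"model_id": "mistralai/Mistral-7B-Instruct-v0.3", "temperature": 0.7},
    [{"role": "user", "content": "What is machine learning?"}],
)
```

Note that unset parameters are omitted entirely rather than sent as `null`, so the server falls back to its own defaults.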
For example:

```python
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",
        "temperature": 0.7,
        "top_p": 0.9,
        "max_completion_tokens": 1000,
        "support_tool_choice_auto": True,
    }
)
```

Ensure your vLLM Neuron server is running and accessible:

```sh
curl http://localhost:8080/health
```
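The same check can be scripted, which is handy when the server takes a while to compile and load the model onto Neuron cores. A minimal stdlib-only sketch that polls `/health` before you create the agent:

```python
# Sketch: poll the vLLM server's /health endpoint until it responds,
# or give up after a timeout. Standard library only.
import time
import urllib.request
import urllib.error

def wait_for_server(base: str = "http://localhost:8080",
                    timeout: float = 30.0) -> bool:
    """Return True once <base>/health answers HTTP 200, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1.0)  # server not up yet; retry
    return False
```

Call `wait_for_server()` at startup and fail fast with a clear error instead of letting the first agent invocation time out.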

If you see "This model only supports single tool-calls at once!", this is a model-level constraint. Switch to a model that supports parallel tool calls (Llama 4, Granite 3.1, xLAM), or use structured_output() for single-tool workflows.

If the model never emits tool calls, ensure the vLLM server was started with the --enable-auto-tool-choice and --tool-call-parser flags, and that "support_tool_choice_auto": True is set in the model config.
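When debugging, it also helps to confirm the server is reachable and serving the model id you configured. The `/v1/models` endpoint is part of the standard OpenAI-compatible API that vLLM exposes; `list_models` below is a small stdlib-only sketch:

```python
# Sketch: query the OpenAI-compatible /v1/models endpoint to see which
# model ids the server is actually serving. Returns None on any failure.
import json
import urllib.request
import urllib.error

def list_models(base_url: str = "http://localhost:8080/v1"):
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

If the id returned here does not match the `model_id` in your NeuronModel config, requests will fail with a model-not-found error.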