# AWS Neuron

`strands-neuron` is a vLLM-on-AWS-Neuron model provider for the Strands Agents SDK. It connects to vLLM servers running on AWS AI chips (Trainium and Inferentia) via an OpenAI-compatible API, enabling high-performance LLM inference on AWS Neuron hardware.
Features:
- OpenAI-Compatible API: Works with any OpenAI-compatible vLLM server
- Full Streaming Support: Async generators for real-time token streaming
- Tool/Function Calling: Native support for function calling and tool use
- Structured Output: Generate structured data via tool calls
- Neuron-Optimized: Designed for AWS Neuron hardware acceleration
- Flexible Configuration: Extensive configuration options for model behavior
## Installation

Install strands-neuron along with the Strands Agents SDK:

```shell
pip install strands-neuron strands-agents
```

## Requirements

- AWS EC2 instance with Neuron hardware (inf2, trn1, trn2, or trn3)
- AWS Neuron Deep Learning AMI (DLAMI) for Ubuntu 22.04
- Running vLLM Neuron server accessible via HTTP
## Start the vLLM Neuron Server

Set up and start your vLLM Neuron server on your AWS Neuron instance. The server should expose an OpenAI-compatible endpoint (default: `http://localhost:8080/v1`).

For tool calling support, start vLLM with the appropriate flags:

```shell
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8080 \
  --enable-auto-tool-choice \
  --tool-call-parser <PARSER>  # e.g., llama3_json, mistral, etc.
```

## Basic Agent
```python
from strands import Agent
from strands_neuron import NeuronModel

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",  # Not required for local servers
    }
)

agent = Agent(
    system_prompt="You are a helpful assistant.",
    model=model,
)

response = agent("What is machine learning?")
print(response)
```
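Under the hood, `NeuronModel` talks to the server's OpenAI-compatible chat-completions endpoint. The sketch below shows the assumed request-body shape; `chat_payload` is a hypothetical helper for illustration, not part of strands-neuron:

```python
import json

def chat_payload(model_id: str, prompt: str, **sampling) -> dict:
    # Assumed OpenAI-compatible /chat/completions request body shape;
    # sampling parameters (temperature, top_p, ...) ride along verbatim.
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }

body = chat_payload(
    "mistralai/Mistral-7B-Instruct-v0.3",
    "What is machine learning?",
    temperature=0.7,
)
print(json.dumps(body, indent=2))
```

Any OpenAI-compatible client (including the official `openai` package pointed at `base_url`) can send this same payload, which is why the provider works with any conforming vLLM server.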
## Configuration

The `NeuronModel` accepts a `config` dictionary with the following parameters:
| Parameter | Description | Example | Required |
|---|---|---|---|
| `model_id` | Model identifier | `"mistralai/Mistral-7B-Instruct-v0.3"` | Yes |
| `base_url` | Base URL for the OpenAI-compatible API | `"http://localhost:8080/v1"` | No (default: `"http://localhost:8080/v1"`) |
| `api_key` | API key for authentication | `"EMPTY"` | No (default: `"EMPTY"`) |
| `support_tool_choice_auto` | Set `True` if vLLM was started with the `--enable-auto-tool-choice` and `--tool-call-parser` flags | `True` | No (default: `False`) |
| `temperature` | Sampling temperature (0.0 to 2.0) | `0.7` | No |
| `top_p` | Nucleus sampling parameter | `0.9` | No |
| `max_completion_tokens` | Maximum tokens to generate | `1000` | No |
| `stop` | Sequences that stop generation | `["\n\n"]` | No |
| `frequency_penalty` | Penalize tokens based on frequency (-2.0 to 2.0) | `0.0` | No |
| `presence_penalty` | Penalize tokens based on presence (-2.0 to 2.0) | `0.0` | No |
| `additional_args` | Additional arguments passed to the API request | `{}` | No |
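Several of these parameters have documented ranges. A small validation helper (hypothetical, not part of strands-neuron) can catch out-of-range values before they reach the server:

```python
def validate_config(config: dict) -> dict:
    # Ranges taken from the parameter table above; model_id is the only
    # required key. The top_p bound is an assumption (nucleus sampling
    # operates on a probability mass). Illustrative only.
    if "model_id" not in config:
        raise ValueError("model_id is required")
    bounds = {
        "temperature": (0.0, 2.0),
        "top_p": (0.0, 1.0),
        "frequency_penalty": (-2.0, 2.0),
        "presence_penalty": (-2.0, 2.0),
    }
    for key, (lo, hi) in bounds.items():
        if key in config and not lo <= config[key] <= hi:
            raise ValueError(f"{key} must be between {lo} and {hi}")
    return config

validate_config({"model_id": "mistralai/Mistral-7B-Instruct-v0.3", "temperature": 0.7})
```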
### Example Configuration

```python
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",
        "temperature": 0.7,
        "top_p": 0.9,
        "max_completion_tokens": 1000,
        "support_tool_choice_auto": True,
    }
)
```

## Troubleshooting
### Connection errors to vLLM server

Ensure your vLLM Neuron server is running and accessible:

```shell
curl http://localhost:8080/health
```
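Note that the `/health` endpoint sits at the server root, not under the `/v1` prefix used by the API. A small helper (illustrative, not part of strands-neuron) derives it from a configured `base_url`:

```python
from urllib.parse import urlparse

def health_url(base_url: str) -> str:
    # vLLM serves /health at the server root, while the
    # OpenAI-compatible API lives under the /v1 path.
    parts = urlparse(base_url)
    return f"{parts.scheme}://{parts.netloc}/health"

print(health_url("http://localhost:8080/v1"))  # http://localhost:8080/health
```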
### Model only supports single tool calls

If you see `"This model only supports single tool-calls at once!"`, this is a model-level constraint. Switch to a model that supports parallel tool calls (Llama 4, Granite 3.1, xLAM), or use `structured_output()` for single-tool workflows.
### Tool calling not working

Ensure the vLLM server was started with the `--enable-auto-tool-choice` and `--tool-call-parser` flags, and set `"support_tool_choice_auto": True` in the model config.
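When those flags are active, tool definitions reach the server in the OpenAI function-calling wire format. A minimal sketch of that shape (the tool name and schema here are invented for illustration):

```python
import json

# Assumed OpenAI-style function-calling tool definition; this is the wire
# format an OpenAI-compatible server expects, not a strands-neuron API.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

print(json.dumps(weather_tool, indent=2))
```

The `--tool-call-parser` flag tells vLLM how to extract such calls from the model's raw output, which is why the parser must match the model family being served.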