vLLM

vLLM serves open-source (and some custom) models through an OpenAI-compatible HTTP API. OpenClaw connects using the openai-completions API and can auto-discover models when you opt in by setting VLLM_API_KEY.

Property	Value
Provider ID	`vllm`
API	`openai-completions` (OpenAI-compatible)
Auth	`VLLM_API_KEY` environment variable
Default base URL	`http://127.0.0.1:8000/v1`
Streaming usage	Supported (`stream_options.include_usage`)

Getting started

Start vLLM with an OpenAI-compatible server

Your base URL must expose /v1 endpoints (/v1/models, /v1/chat/completions). vLLM commonly runs on:

text

http://127.0.0.1:8000/v1

Set the API key environment variable

Any non-empty value works if your server does not enforce auth:

bash

export VLLM_API_KEY="vllm-local"

Select a model

Replace your-model-id with one of your vLLM model IDs:

json5

{
  agents: {
    defaults: {
      model: { primary: "vllm/your-model-id" },
    },
  },
}

Verify the model is available

bash

openclaw models list --provider vllm

Tip

For non-interactive setup (CI, scripting), pass the base URL, key, and model directly:

bash

openclaw onboard --non-interactive \
  --mode local \
  --auth-choice vllm \
  --custom-base-url "http://127.0.0.1:8000/v1" \
  --custom-api-key "vllm-local" \
  --custom-model-id "your-model-id"

Model discovery (implicit provider)

When VLLM_API_KEY is set (or an auth profile exists) and models.providers.vllm is not defined, OpenClaw queries GET http://127.0.0.1:8000/v1/models and converts the returned IDs into model entries.
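The conversion step can be pictured with a minimal Python sketch. It assumes the server returns the usual OpenAI-style list shape (a `data` array of objects with an `id` field). The helper name `to_model_refs` and the sample payload are hypothetical, and OpenClaw's real conversion may attach more metadata than this.

```python
import json

PROVIDER_ID = "vllm"  # provider ID from the property table


def to_model_refs(models_response: dict) -> list[str]:
    """Turn an OpenAI-style /v1/models payload into provider-qualified refs.

    Hypothetical helper for illustration only, not OpenClaw's implementation.
    """
    refs = []
    for entry in models_response.get("data", []):
        model_id = entry.get("id")
        if model_id:  # skip malformed rows that have no id
            refs.append(f"{PROVIDER_ID}/{model_id}")
    return refs


# Made-up sample in the shape returned by GET /v1/models.
sample = json.loads(
    '{"object": "list", "data": [{"id": "your-model-id", "object": "model"}]}'
)
print(to_model_refs(sample))  # ['vllm/your-model-id']
```

The resulting refs have the same `vllm/<id>` form used for `agents.defaults.model.primary`.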

Explicit configuration

Configure explicitly when vLLM runs on a different host or port, you want to pin contextWindow/maxTokens, your server requires a real API key, or you connect to a trusted loopback, LAN, or Tailscale endpoint:

json5

{
  models: {
    providers: {
      vllm: {
        baseUrl: "http://127.0.0.1:8000/v1",
        apiKey: "${VLLM_API_KEY}",
        api: "openai-completions",
        timeoutSeconds: 300, // Optional: extend request timeout for slow local models
        models: [
          {
            id: "your-model-id",
            name: "Local vLLM Model",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000,
            maxTokens: 8192,
          },
        ],
      },
    },
  },
}

To keep the provider dynamic without listing every model, add a wildcard to the visible model catalog:

json5

{
  agents: {
    defaults: {
      models: {
        "vllm/*": {},
      },
    },
  },
}

Advanced configuration

Proxy-style behavior

vLLM is treated as a proxy-style OpenAI-compatible /v1 backend, not a native OpenAI endpoint:

Behavior	Applied?
Native OpenAI request shaping	No
`service_tier`	Not sent
Responses `store`	Not sent
Prompt-cache hints	Not sent
OpenAI reasoning-compat payload shaping	Not applied
Hidden OpenClaw attribution headers	Not injected on custom base URLs

Qwen thinking controls

For Qwen models, set compat.thinkingFormat: "qwen-chat-template" on the model row when the server expects Qwen chat-template kwargs. These models expose a binary /think profile (off, on) because Qwen chat-template thinking is an on/off flag, not an OpenAI-style effort ladder.

json5

{
  models: {
    providers: {
      vllm: {
        models: [
          {
            id: "Qwen/Qwen3-8B",
            name: "Qwen3 8B",
            reasoning: true,
            compat: { thinkingFormat: "qwen-chat-template" },
          },
        ],
      },
    },
  },
}

OpenClaw maps /think off to:

json

{
  "chat_template_kwargs": {
    "enable_thinking": false,
    "preserve_thinking": true
  }
}

Non-off thinking levels send enable_thinking: true. If your endpoint expects DashScope-style top-level flags instead, use compat.thinkingFormat: "qwen" to send enable_thinking at the request root.
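To make the two formats concrete, here is a small Python sketch of the request fields each one would add. The function name `qwen_thinking_fields` is hypothetical. The sketch also assumes `preserve_thinking: true` is sent for non-off levels, which this page only shows for the off case, so treat that detail as a guess rather than documented behavior.

```python
def qwen_thinking_fields(level: str, thinking_format: str) -> dict:
    """Return the extra request-body fields for a Qwen thinking level.

    Illustrative only; the names and structure are not OpenClaw's code.
    """
    enabled = level != "off"  # binary profile: anything but "off" enables thinking
    if thinking_format == "qwen-chat-template":
        return {
            "chat_template_kwargs": {
                "enable_thinking": enabled,
                # Assumption: kept for every level; documented only for "off".
                "preserve_thinking": True,
            }
        }
    if thinking_format == "qwen":
        # DashScope-style: the flag sits at the request root.
        return {"enable_thinking": enabled}
    raise ValueError(f"unknown thinkingFormat: {thinking_format}")


print(qwen_thinking_fields("off", "qwen-chat-template"))
print(qwen_thinking_fields("on", "qwen"))
```

Because the profile is binary, every non-off level produces the same payload in this sketch.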

Nemotron 3 thinking controls

For vllm/nemotron-3-* models with thinking off, the bundled plugin sends:

json

{
  "chat_template_kwargs": {
    "enable_thinking": false,
    "force_nonempty_content": true
  }
}

To customize these values, set chat_template_kwargs under the model params. If you also set params.extra_body.chat_template_kwargs, that value wins because extra_body is the last request-body override.

json5

{
  agents: {
    defaults: {
      models: {
        "vllm/nemotron-3-super": {
          params: {
            chat_template_kwargs: {
              enable_thinking: false,
              force_nonempty_content: true,
            },
          },
        },
      },
    },
  },
}
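The precedence between the three sources can be sketched in Python as follows. The helper name `resolve_chat_template_kwargs` is hypothetical, and the sketch assumes each later source replaces the whole `chat_template_kwargs` object rather than deep-merging into it. This page does not say which of the two the plugin does.

```python
# Defaults the bundled plugin sends for Nemotron 3 with thinking off.
NEMOTRON_OFF_DEFAULTS = {"enable_thinking": False, "force_nonempty_content": True}


def resolve_chat_template_kwargs(params: dict) -> dict:
    """Pick the chat_template_kwargs that would end up in the request body.

    Order, lowest to highest priority: plugin defaults, then
    params.chat_template_kwargs, then params.extra_body.chat_template_kwargs.
    Assumption: whole-object replacement, not a deep merge.
    """
    extra_body = params.get("extra_body", {})
    if "chat_template_kwargs" in extra_body:
        return dict(extra_body["chat_template_kwargs"])  # last override wins
    if "chat_template_kwargs" in params:
        return dict(params["chat_template_kwargs"])
    return dict(NEMOTRON_OFF_DEFAULTS)


params = {
    "chat_template_kwargs": {"enable_thinking": False},
    "extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
}
print(resolve_chat_template_kwargs(params))  # {'enable_thinking': True}
```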

Qwen tool calls appear as text

First confirm vLLM was started with the right tool-call parser and chat template for the model. vLLM documents hermes for Qwen2.5 models and qwen3_xml for Qwen3-Coder models.

Symptoms: skills/tools never run, the assistant prints raw JSON/XML such as {"name":"read","arguments":...}, or vLLM returns an empty tool_calls array when OpenClaw sends tool_choice: "auto".

Some Qwen/vLLM combinations return structured tool calls only when the request uses tool_choice: "required". Force it per model with params.extra_body:

json5

{
  agents: {
    defaults: {
      models: {
        "vllm/Qwen-Qwen2.5-Coder-32B-Instruct": {
          params: {
            extra_body: {
              tool_choice: "required",
            },
          },
        },
      },
    },
  },
}

Replace the model id with the exact id from openclaw models list --provider vllm, or apply the same override from the CLI:

bash

openclaw config set agents.defaults.models '{"vllm/Qwen-Qwen2.5-Coder-32B-Instruct":{"params":{"extra_body":{"tool_choice":"required"}}}}' --strict-json --merge

This is an opt-in workaround: it forces every turn with tools to make a tool call, so use it only for a dedicated model entry where that is acceptable. Do not set it as a global default for all vLLM models, and do not pair it with a proxy that converts arbitrary assistant text into executable tool calls.

Custom base URL

If your vLLM server runs on a non-default host or port, set baseUrl in the explicit provider config:

json5

{
  models: {
    providers: {
      vllm: {
        baseUrl: "http://192.168.1.50:9000/v1",
        apiKey: "${VLLM_API_KEY}",
        api: "openai-completions",
        timeoutSeconds: 300,
        models: [
          {
            id: "my-custom-model",
            name: "Remote vLLM Model",
            reasoning: false,
            input: ["text"],
            contextWindow: 64000,
            maxTokens: 4096,
          },
        ],
      },
    },
  },
}

Troubleshooting

Slow first response or remote server timeout

For large local models, remote LAN hosts, or tailnet links, set a provider-scoped request timeout:

json5

{
  models: {
    providers: {
      vllm: {
        baseUrl: "http://192.168.1.50:8000/v1",
        apiKey: "${VLLM_API_KEY}",
        api: "openai-completions",
        timeoutSeconds: 300,
        models: [{ id: "your-model-id", name: "Local vLLM Model" }],
      },
    },
  },
}

timeoutSeconds applies to vLLM model HTTP requests only: connection setup, response headers, body streaming, and the total guarded-fetch abort. It also raises the LLM idle/stream watchdog ceiling above the implicit ~120s default for this provider. Prefer this over increasing agents.defaults.timeoutSeconds, which controls the whole agent run.

Server not reachable

Check that the vLLM server is running and accessible:

bash

curl http://127.0.0.1:8000/v1/models

If you see a connection error, verify the host and port, and confirm that vLLM started in OpenAI-compatible server mode.

OpenClaw trusts the exact configured models.providers.vllm.baseUrl origin for guarded model requests on loopback, LAN, and Tailscale endpoints. Metadata/link-local origins remain blocked without explicit opt-in. Set models.providers.vllm.request.allowPrivateNetwork to true only when vLLM requests must reach another private origin, or to false to opt out of exact-origin trust.

Auth errors on requests

If requests fail with auth errors, set a real VLLM_API_KEY that matches your server configuration, or configure the provider explicitly under models.providers.vllm.

No models discovered

Auto-discovery requires VLLM_API_KEY to be set (or an auth profile to exist) and models.providers.vllm to be undefined. If you have defined models.providers.vllm, OpenClaw uses only your declared models unless agents.defaults.models includes "vllm/*": {}.
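The rules above reduce to a small decision table, sketched here in Python. The function name `vllm_model_source` and its return labels are hypothetical shorthand for the behaviors described on this page, not values OpenClaw reports.

```python
def vllm_model_source(
    api_key_set: bool,
    provider_defined: bool,
    wildcard_visible: bool,
    auth_profile: bool = False,
) -> str:
    """Summarize where vLLM model entries come from under the documented rules.

    Labels are illustrative shorthand, not OpenClaw output.
    """
    if provider_defined:
        # Explicit config: declared models only, unless "vllm/*" is in the catalog.
        return "declared+dynamic" if wildcard_visible else "declared-only"
    if api_key_set or auth_profile:
        # Implicit provider: OpenClaw queries /v1/models on the default base URL.
        return "auto-discovery"
    return "none"  # not opted in, so nothing is discovered


print(vllm_model_source(api_key_set=False, provider_defined=False, wildcard_visible=False))
```

If this prints `none` for your setup, set VLLM_API_KEY or define the provider explicitly.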

Tools render as raw text

If a Qwen model prints JSON/XML tool syntax instead of executing a skill:

Start vLLM with the correct parser/template for that model.
Confirm the exact model id with openclaw models list --provider vllm.
Add a dedicated per-model params.extra_body.tool_choice: "required" override only if tool_choice: "auto" still returns empty or text-only tool calls.
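To tell the failure modes apart before applying the override, you can inspect a raw chat-completions response. The Python sketch below assumes the standard OpenAI-compatible shape (`choices[0].message` with optional `tool_calls` and `content`). The helper name `classify_tool_output` and its labels are hypothetical, and it detects only JSON-style tool text, not XML.

```python
import json


def classify_tool_output(response: dict) -> str:
    """Classify a chat-completions response as structured, raw tool text, or text."""
    message = response["choices"][0]["message"]
    if message.get("tool_calls"):
        return "structured"  # parser worked: tool_calls is a non-empty list
    content = (message.get("content") or "").strip()
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return "text"  # ordinary prose (or XML, which this sketch ignores)
    if isinstance(parsed, dict) and "name" in parsed and "arguments" in parsed:
        return "raw-tool-text"  # tool call printed as JSON instead of executed
    return "text"


# Made-up response showing the symptom: empty tool_calls plus JSON in content.
symptom = {
    "choices": [
        {
            "message": {
                "content": '{"name": "read", "arguments": {"path": "a.txt"}}',
                "tool_calls": [],
            }
        }
    ]
}
print(classify_tool_output(symptom))  # raw-tool-text
```

A `raw-tool-text` result with `tool_choice: "auto"` is the case where the per-model `tool_choice: "required"` override is worth trying.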

Related

Model selection

Choosing providers, model refs, and failover behavior.

OpenAI

Native OpenAI provider and OpenAI-compatible route behavior.

OAuth and auth

Auth details and credential reuse rules.

Troubleshooting

Common issues and how to resolve them.
