Bridging vLLM and VS Code: A Proxy for Running Local Qwen Models

I explain how to use a proxy as a protocol adapter and behavior shim so that certain model families can function properly with GitHub Copilot in VS Code.


If you've tried running local models through VS Code and GitHub Copilot, you've probably hit the same wall: the built-in customoai provider path is finicky, model ids don't match, and reasoning output can behave strangely.
I ran into this head-on while trying to get Qwen models working smoothly in VS Code through vLLM. Here's what we built to fix it.

(Btw, this technique can work with any inference backend, but I tend to use vLLM because of its speed and compatibility with certain model families, such as Qwen.)


The Problem

vLLM is an excellent inference server. It speaks OpenAI-compatible APIs, supports streaming, and handles reasoning models well. VS Code's customoai vendor is supposed to work with any OpenAI-compatible endpoint.

In theory, you just point VS Code at your vLLM server and everything works.

In practice, there are a few mismatches:

  • Model ids: vLLM exposes models with path-style ids like /root/models/Qwopus3.6-27B-v1-preview. VS Code's config expects something friendlier.
  • Parameter shapes: VS Code sends camelCase parameters (topP, maxOutputTokens). vLLM expects snake_case (top_p, max_tokens).
  • Reasoning output: Some models spend their entire completion budget in internal reasoning before producing visible text. With a small token budget, you get a response that looks like it stopped early.
  • Streaming format: The SSE (Server-Sent Events) stream needs to be clean and consistent for VS Code to render it properly.

These aren't deal-breakers, but they add up. You end up with models that sometimes work, sometimes don't, and are hard to debug.
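
To make the streaming point concrete, here is a rough sketch of the kind of SSE re-framing an adapter can do: buffer whatever arrives from upstream and only pass along complete events. The name createSseSplitter is hypothetical and this is not the proxy's actual code; it assumes the upstream sends standard "data:" lines separated by blank lines.

```javascript
// Hypothetical helper: turn arbitrary network chunks into complete SSE payloads.
function createSseSplitter() {
  let buffer = "";

  return function push(text) {
    // Normalize CRLF after concatenating, so a "\r\n" split across chunks is handled.
    buffer = (buffer + text).replace(/\r\n/g, "\n");
    const events = [];
    let end;

    // A blank line ("\n\n") terminates one SSE event.
    while ((end = buffer.indexOf("\n\n")) !== -1) {
      const block = buffer.slice(0, end);
      buffer = buffer.slice(end + 2);

      // Keep only data lines and drop the single optional space after "data:".
      const data = block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n");

      if (data) events.push(data);
    }
    return events; // Incomplete trailing text stays in the buffer.
  };
}
```

Each returned string is either a JSON payload or the literal [DONE] marker, so the caller can parse, adjust, and re-emit one event at a time.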


The Solution: A Thin Local Proxy

I built a small Node.js proxy that sits between VS Code and vLLM. It's not a full API gateway; it's a focused adapter that handles the mismatches so the model can function through GitHub Copilot.

The proxy listens on a local port (we use 11434) and exposes two surfaces:

  • /v1/* — OpenAI-compatible endpoints for customoai clients and direct testing
  • /api/* — Ollama-compatible endpoints (more on that below)

When VS Code makes a request, the proxy translates it into what vLLM expects, forwards it, then translates the response back.
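
As a small illustration of the two surfaces, the routing decision could look something like this. classifyRoute is a hypothetical name used only for this sketch; the real proxy may organize its routing differently.

```javascript
// Hypothetical router: decide which API surface a request path belongs to.
function classifyRoute(pathname) {
  // Match the prefix as a whole path segment, so "/v10/..." is not treated as "/v1".
  if (pathname === "/v1" || pathname.startsWith("/v1/")) return "openai";
  if (pathname === "/api" || pathname.startsWith("/api/")) return "ollama";
  return "unknown";
}

console.log(classifyRoute("/v1/chat/completions")); // "openai"
console.log(classifyRoute("/api/tags")); // "ollama"
```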

What the proxy handles

Model id resolution: You can configure VS Code with a friendly model name like Qwopus3.6-27B-v1-preview, and the proxy resolves it to the real upstream id /root/models/Qwopus3.6-27B-v1-preview before forwarding.
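
A minimal sketch of that resolution step, assuming the list of upstream ids has already been fetched from vLLM's model listing. resolveModelId is a hypothetical helper, and matching on the last path segment is one plausible strategy, not necessarily the one the proxy uses.

```javascript
// Hypothetical helper: map a friendly model name to the id vLLM actually serves.
function resolveModelId(requested, upstreamIds) {
  // An exact match means the caller already used the raw upstream id.
  if (upstreamIds.includes(requested)) return requested;

  // Otherwise compare against the last path segment of each upstream id.
  const basename = (id) => id.split("/").filter(Boolean).pop() ?? id;
  const wanted = basename(requested).toLowerCase();
  const match = upstreamIds.find((id) => basename(id).toLowerCase() === wanted);

  // Fall back to the requested name so the upstream can report its own error.
  return match ?? requested;
}

const upstream = ["/root/models/Qwopus3.6-27B-v1-preview"];
console.log(resolveModelId("Qwopus3.6-27B-v1-preview", upstream));
// "/root/models/Qwopus3.6-27B-v1-preview"
```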

Parameter normalization: CamelCase parameters from VS Code get converted to snake_case for vLLM. The proxy handles topP → top_p, maxOutputTokens → max_tokens, and other common mappings.
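
Here is one way the renaming could be written. Only the topP and maxOutputTokens mappings come from the description above; the other entries and the normalizeParams name are assumptions added for illustration.

```javascript
// camelCase name sent by the client -> snake_case name expected upstream.
const PARAM_ALIASES = new Map([
  ["topP", "top_p"],
  ["maxOutputTokens", "max_tokens"],
  // Assumed extra mappings, not confirmed by the proxy's source:
  ["presencePenalty", "presence_penalty"],
  ["frequencyPenalty", "frequency_penalty"],
]);

function normalizeParams(body) {
  const out = {};
  for (const [key, value] of Object.entries(body)) {
    const target = PARAM_ALIASES.get(key) ?? key;
    // If both spellings are present, the explicit snake_case value wins.
    if (target !== key && target in body) continue;
    out[target] = value;
  }
  return out;
}

console.log(normalizeParams({ model: "Qwen3.6-27B", topP: 0.9, maxOutputTokens: 128 }));
// { model: 'Qwen3.6-27B', top_p: 0.9, max_tokens: 128 }
```

Unknown keys pass through untouched, which keeps the shim from silently dropping parameters it doesn't know about.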

Reasoning translation: When vLLM returns reasoning content (whether in delta.reasoning fields or inline thinking tags), the proxy normalizes it into the aliases VS Code expects: reasoning_content, thinking, reasoning_text. This makes reasoning blocks render correctly in the UI.
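
A sketch of the field-based half of this, applied to one parsed streaming chunk. It assumes the reasoning text arrives as a string in delta.reasoning (or already in delta.reasoning_content) and simply mirrors it into all three aliases. addReasoningAliases is a hypothetical name, and inline thinking tags are deliberately left out of this example.

```javascript
const REASONING_ALIASES = ["reasoning_content", "thinking", "reasoning_text"];

// Hypothetical helper: copy reasoning text into every alias a client might read.
function addReasoningAliases(chunk) {
  if (!Array.isArray(chunk.choices)) return chunk;

  const choices = chunk.choices.map((choice) => {
    const delta = choice.delta;
    const text = delta?.reasoning ?? delta?.reasoning_content;
    // Chunks without reasoning text are passed through unchanged.
    if (typeof text !== "string") return choice;

    const aliased = { ...delta };
    for (const name of REASONING_ALIASES) aliased[name] = text;
    return { ...choice, delta: aliased };
  });

  return { ...chunk, choices };
}
```

The function returns new objects instead of mutating the input, so the original upstream chunk is still available for logging.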

Model-specific behavior fixes: Some models have quirks. Qwopus, for example, was spending its entire 128-token completion budget in reasoning and never reaching visible output. The proxy now enforces a minimum `max_tokens` floor for Qwopus, so visible replies survive even when VS Code sends a small budget.
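
The floor itself is simple to express. In this sketch the rule table, the 1024 value, and the applyMaxTokensFloor name are all illustrative assumptions; the real proxy's floor value isn't stated here.

```javascript
// Hypothetical per-model rules. The floor value is illustrative only.
const MAX_TOKENS_FLOORS = [{ pattern: /qwopus/i, floor: 1024 }];

function applyMaxTokensFloor(body) {
  const rule = MAX_TOKENS_FLOORS.find((r) => r.pattern.test(String(body.model ?? "")));
  if (!rule) return body;

  // Leave the request alone if no budget was sent or it is already large enough.
  if (typeof body.max_tokens !== "number" || body.max_tokens >= rule.floor) return body;

  return { ...body, max_tokens: rule.floor };
}

console.log(applyMaxTokensFloor({ model: "/root/models/Qwopus3.6-27B-v1-preview", max_tokens: 128 }).max_tokens);
// 1024
```

Raising only budgets that are too small, and never lowering anything, keeps the fix from interfering with requests that already leave room for visible output.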

Stale port recovery: If you restart the proxy and the old instance is still bound to the port, the new one detects and reclaims it automatically. No manual cleanup needed.

What doesn't work

The context window indicator (the circle graph in the bottom-right corner of the Copilot Chat window) doesn't track usage correctly. When the context eventually fills up, you will see errors like "Recovered from a request error" until you manually compact the conversation. I'll see if I can patch this soon.


Setting Up chatLanguageModels.json

The proxy gives you a stable local endpoint. Here's how to configure VS Code to use it.

The config file

VS Code stores custom language model configurations in chatLanguageModels.json, located in your user settings folder. On Windows with VS Code Insiders, that is:

%APPDATA%\Code - Insiders\User\chatLanguageModels.json

Adjust the path for stable VS Code or other platforms.

A working customoai configuration

Here's a configuration that works with the proxy using Qwen/Qwen3.6-27B:

{
  "name": "vLLM",
  "vendor": "customoai",
  "models": [
    {
      "id": "Qwen3.6-27B",
      "name": "Qwen3.6-27B",
      "url": "http://localhost:11434",
      "model": "/root/models/Qwen3.6-27B",
      "toolCalling": true,
      "vision": true,
      "maxInputTokens": 262144,
      "maxOutputTokens": 32000,
      "streaming": true,
      "temperature": 0.7,
      "topP": 0.9
    }
  ]
}

A few notes on the configuration:

  • url: Points to the proxy, not directly to vLLM. The proxy handles the translation.
  • model: You can use the raw vLLM id (like /root/models/...) or a friendly name. The proxy resolves either form.
  • toolCalling and vision: These tell VS Code what capabilities the model supports. Set them based on what your actual model can do.
  • maxInputTokens and maxOutputTokens: Match these to your vLLM server's configuration.

Starting the proxy


Before you can use the configuration, start the proxy:

node proxy.js --port 11434 --vllm-url http://localhost:30001

The proxy expects vLLM to be running on port 30001 (or whatever port you configured). It will listen on 11434 and forward requests to vLLM.
Once the proxy is running, reload VS Code's custom model settings, and your models should appear in the picker.
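
If the models don't show up, it can help to test the proxy directly before involving VS Code. The sketch below builds a standard OpenAI-style chat completion request against the /v1 surface. buildSmokeTest is a hypothetical helper, and the prompt and token budget are arbitrary choices.

```javascript
// Hypothetical helper: build a one-shot, non-streaming chat request for the proxy.
function buildSmokeTest(baseUrl, model) {
  return {
    url: new URL("/v1/chat/completions", baseUrl).href,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: "Reply with the single word: ready" }],
        max_tokens: 256,
        stream: false,
      }),
    },
  };
}

// Usage (needs Node 18+ for the global fetch, and a running proxy):
// const { url, init } = buildSmokeTest("http://localhost:11434", "Qwen3.6-27B");
// const res = await fetch(url, init);
// console.log(res.status, await res.json());
```

A successful response here means the proxy and vLLM are talking to each other, which narrows any remaining problem down to the VS Code configuration.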


What About the Ollama Path?

We also experimented with making the proxy impersonate Ollama on port 11434. The idea was to use VS Code's built-in ollama vendor path instead of customoai, since the Ollama integration is more mature.
The proxy does expose Ollama-compatible endpoints (/api/version, /api/tags, /api/show, /api/chat), and the basic handshake works — VS Code can discover models and start conversations.

However, we hit capability-gating issues. VS Code's model picker and agent modes check for specific capability metadata (tool calling, vision support), and the Ollama path doesn't always report those correctly through the proxy. The models appear in some places but not others.

This isn't a fundamental blocker — it's a matter of getting the capability metadata right in the /api/show response. But the customoai path worked well enough that we focused our energy there instead.
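
For anyone picking this up, here is a rough starting point for the capability metadata. It assumes the Ollama-style /api/show reply carries a capabilities array (with values such as "completion", "tools", and "vision") plus a model_info object with an architecture-prefixed context length. Which of these fields VS Code actually checks is not something I have verified, and buildShowResponse and its arguments are hypothetical.

```javascript
// Hypothetical builder for an Ollama-style /api/show reply.
// `entry` reuses the flags from the customoai config; `arch` is a placeholder label.
function buildShowResponse(entry, arch = "unknown") {
  const capabilities = ["completion"];
  if (entry.toolCalling) capabilities.push("tools");
  if (entry.vision) capabilities.push("vision");

  return {
    capabilities,
    details: { family: arch },
    model_info: {
      "general.architecture": arch,
      // Assumed key shape: "<architecture>.context_length".
      [`${arch}.context_length`]: entry.maxInputTokens,
    },
  };
}

console.log(buildShowResponse({ toolCalling: true, vision: true, maxInputTokens: 262144 }).capabilities);
// [ 'completion', 'tools', 'vision' ]
```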

If you want to explore the Ollama path, the proxy already has the endpoints. You'd just need to refine the capability reporting. I left the code as is in case it might be useful for someone else to extend.


Why This Matters

Local models are getting better every month. But the tooling around them — especially editor integration — hasn't caught up. You shouldn't have to choose between cloud models and a smooth development experience.

This proxy is a small piece of infrastructure that makes local models feel like first-class citizens in VS Code. It's not a replacement for better upstream support, but it's a practical workaround that works today.

The code is lightweight (a single Node.js file, no external dependencies), and it's designed to be easy to extend. If you have a model with different quirks, you can add model-specific defaults. If VS Code changes its API, you can update the translation layer.


Getting Started

If you want to try this:

  1. Make sure vLLM is running with your model loaded
  2. Clone or download the proxy code
  3. Start the proxy: node proxy.js --port 11434 --vllm-url http://localhost:30001
  4. Configure chatLanguageModels.json as shown above
  5. Reload VS Code and select your model

The proxy logs incoming requests, so you can see what's happening if something goes wrong.


This is a practical solution for a real problem. The script is not meant to be a long-term project. In fact, the goal is for it to become irrelevant, because that would mean the GitHub Copilot extension has been fixed (unfortunately, that's not the case yet). If you run into issues or have improvements, the code is open and easy to modify.