Develop AI apps on Windows with near-native Linux performance. WSL2 runs a real Linux kernel inside Windows and supports NVIDIA GPU passthrough, so your RTX GPU runs CUDA in WSL2 at near-native speed. Ollama Herd routes AI requests across WSL2 instances and native Windows machines.
# PowerShell (admin)
wsl --install -d Ubuntu
wsl --set-default-version 2
Verify WSL2 NVIDIA GPU access:
# Inside WSL2
nvidia-smi # should show your RTX GPU
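The same check can be scripted. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH inside WSL2; it uses the tool's `--query-gpu` CSV output, and the helper names `parse_gpu_csv` and `query_gpus` are illustrative, not part of Ollama Herd.

```python
import subprocess


def parse_gpu_csv(csv_text):
    """Parse `name, memory.total` CSV rows (no header, no units) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total_mib = line.rsplit(",", 1)
        gpus.append({"name": name.strip(), "memory_total_mib": int(total_mib)})
    return gpus


def query_gpus():
    """Run nvidia-smi inside WSL2 and return the parsed GPU list."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(f"{gpu['name']}: {gpu['memory_total_mib']} MiB")
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
        # Missing driver, failed passthrough, or unexpected output all end up here.
        print(f"GPU not visible from this shell: {exc}")
```

If the script reports that the GPU is not visible, the Windows NVIDIA driver or the WSL2 kernel is the usual place to look before installing Ollama.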
# Inside WSL2
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &
# Inside WSL2
pip install ollama-herd
herd # start WSL2 AI router on port 11435
herd-node # register WSL2 as a node
Your WSL2 AI endpoint is accessible from Windows at http://localhost:11435 because WSL2 forwards localhost ports to Windows by default.
# From Windows PowerShell
curl.exe http://localhost:11435/api/tags # see WSL2 AI models (curl.exe avoids the Invoke-WebRequest alias in Windows PowerShell)
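The same check works from Python on either side of the WSL2 boundary. This is a minimal sketch, assuming the router answers `/api/tags` with the Ollama-style body `{"models": [{"name": ...}]}`; `model_names` and `list_models` are hypothetical helper names.

```python
import json
import urllib.request

HERD_URL = "http://localhost:11435"  # router address used throughout this guide


def model_names(tags_payload):
    """Extract model names from an Ollama-style /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_models(base_url=HERD_URL, timeout=5.0):
    """Fetch /api/tags and return the model names (needs a running router)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling `list_models()` raises `urllib.error.URLError` when nothing is listening on the port, which is a quick way to tell a forwarding problem from an empty model list.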
from openai import OpenAI

# Same URL works from Windows and WSL2
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

# WSL2 handles the inference via NVIDIA GPU
response = client.chat.completions.create(
    model="qwen3.5:32b",
    messages=[{"role": "user", "content": "Write a Docker Compose file for a Python API"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
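Some OpenAI-compatible servers emit stream chunks whose `choices` list is empty (a trailing usage chunk, for example), and indexing `choices[0]` then fails. The sketch below guards against that and collects the streamed text. Whether this router ever sends such chunks is an assumption, and `delta_text`, `collect_stream`, and `_fake_chunk` are hypothetical helpers; the fake chunks stand in for real SDK objects so the example runs offline.

```python
from types import SimpleNamespace


def delta_text(chunk):
    """Return the text carried by one streamed chunk, or "" if it has none."""
    if not chunk.choices:
        return ""
    return chunk.choices[0].delta.content or ""


def collect_stream(chunks):
    """Join the text of every chunk in a streamed chat completion."""
    return "".join(delta_text(c) for c in chunks)


def _fake_chunk(text):
    # Hypothetical stand-in that mimics the attribute layout of an SDK chunk.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [_fake_chunk("Hello"), _fake_chunk(None), SimpleNamespace(choices=[]), _fake_chunk(" WSL2")]
print(collect_stream(demo))  # prints "Hello WSL2"
```

With a real client, pass the `response` iterator from `client.chat.completions.create(..., stream=True)` to `collect_stream`.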
// .vscode/settings.json — Continue.dev configuration
{
  "continue.models": [{
    "title": "WSL2 Local",
    "provider": "openai",
    "model": "codestral",
    "apiBase": "http://localhost:11435/v1",
    "apiKey": "not-needed"
  }]
}
# WSL2 inference
curl http://localhost:11435/api/chat -d '{
"model": "codestral",
"messages": [{"role": "user", "content": "Refactor this Python function"}],
"stream": false
}'
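The same request can be sent from Python with only the standard library. This is a sketch, assuming the router passes through Ollama's non-streaming `/api/chat` reply, where the text sits under `message.content`; `build_chat_request` and `chat` are illustrative names.

```python
import json
import urllib.request


def build_chat_request(prompt, model="codestral", base_url="http://localhost:11435"):
    """Build the POST request for a single-turn, non-streaming chat call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the reply text (needs a running router)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=300) as resp:
        reply = json.load(resp)
    # Assumed reply shape: Ollama nests the generated text under message.content.
    return reply["message"]["content"]
```

Splitting request construction from the network call keeps the payload easy to inspect or log before anything is sent.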
Run Ollama in Docker on WSL2 for containerized AI:
# WSL2 Docker + Ollama
docker run -d --gpus all -p 11434:11434 ollama/ollama
# Herd routes between Docker Ollama and native Ollama
pip install ollama-herd
herd &
herd-node
| Windows PC | GPU | WSL2 AI models |
|---|---|---|
| RTX 4090 desktop | 24GB shared with WSL2 | llama3.3:70b, qwen3.5:32b |
| RTX 4080 desktop | 16GB shared with WSL2 | phi4, codestral, qwen3.5:14b |
| RTX 4060 laptop | 8GB shared with WSL2 | phi4-mini, gemma3:4b |
> WSL2 shares GPU memory with Windows. Close GPU-heavy Windows apps to free more VRAM for WSL2 models.
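A rough way to judge whether a model fits in the VRAM left over after Windows takes its share is to estimate the weight size from the parameter count. The sketch below is back-of-the-envelope arithmetic only: the 4 bits per weight and the 2 GiB headroom for context cache and runtime overhead are assumptions, and real quantized files are usually somewhat larger.

```python
def weights_gib(params_billion, bits_per_weight=4.0):
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion, vram_gib, bits_per_weight=4.0, headroom_gib=2.0):
    """True if weights plus an assumed headroom fit in the given VRAM."""
    return weights_gib(params_billion, bits_per_weight) + headroom_gib <= vram_gib


print(round(weights_gib(32), 1))  # 14.9 GiB of weights for a 32B model at 4 bits
print(fits(32, 24))               # True on a 24GB card
print(fits(14, 8))                # False on an 8GB card
```

A model that does not fit can still run in Ollama with some layers offloaded to the CPU, at a noticeable cost in speed.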
# WSL2 Ollama optimization (set these before starting `ollama serve`; a server that is already running will not pick them up)
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_MAX_LOADED_MODELS=-1
# Add to ~/.bashrc for persistence in WSL2
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.bashrc
echo 'export OLLAMA_MAX_LOADED_MODELS=-1' >> ~/.bashrc
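Running the `echo ... >> ~/.bashrc` lines more than once appends duplicates. An idempotent variant is sketched below; `ensure_line` is a hypothetical helper, and it only compares whole lines, so it will not notice the same variable exported with a different value.

```python
from pathlib import Path


def ensure_line(path, line):
    """Append `line` to the file unless it is already present. Returns True if written."""
    path = Path(path).expanduser()
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as fh:
        fh.write(prefix + line + "\n")
    return True


# Example usage (left commented out so importing this file changes nothing):
# ensure_line("~/.bashrc", "export OLLAMA_KEEP_ALIVE=-1")
# ensure_line("~/.bashrc", "export OLLAMA_MAX_LOADED_MODELS=-1")
```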
# WSL2 fleet status
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# WSL2 health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
The dashboard is at http://localhost:11435/dashboard and is accessible from both a Windows browser and WSL2.
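For scripted monitoring, the two status endpoints above can be polled from Python. The response schema is not documented here, so this sketch only pretty-prints whatever JSON comes back; `ENDPOINTS`, `endpoint_url`, and `fetch_pretty` are illustrative names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435"
# Paths taken from the curl commands above.
ENDPOINTS = {"fleet": "/fleet/status", "health": "/dashboard/api/health"}


def endpoint_url(name, base_url=BASE_URL):
    """Build the full URL for a named status endpoint."""
    return base_url.rstrip("/") + ENDPOINTS[name]


def fetch_pretty(name, base_url=BASE_URL, timeout=5.0):
    """Fetch one endpoint and return its JSON body, indented (needs a running router)."""
    with urllib.request.urlopen(endpoint_url(name, base_url), timeout=timeout) as resp:
        return json.dumps(json.load(resp), indent=2, sort_keys=True)
```

Calling `print(fetch_pretty("health"))` gives the same output as the `python3 -m json.tool` pipeline, with keys sorted.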
curl http://localhost:11435/api/generate-image \
-d '{"model": "z-image-turbo", "prompt": "developer workspace", "width": 1024, "height": 1024}'
curl http://localhost:11435/api/embed \
-d '{"model": "nomic-embed-text", "input": "WSL2 Windows development AI"}'
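Embeddings are usually compared with cosine similarity. The sketch below assumes the router returns Ollama's `/api/embed` shape, one vector per input under `"embeddings"`; `embed` and `cosine_similarity` are hypothetical helper names.

```python
import json
import math
import urllib.request


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11435"):
    """Return one embedding vector per input string (needs a running router)."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        # Assumed reply shape: {"embeddings": [[...], [...]]}
        return json.load(resp)["embeddings"]
```

With a running router, `a, b = embed(["WSL2 development", "Linux on Windows"])` followed by `cosine_similarity(a, b)` scores how closely the two phrases are related.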
Ollama Herd is open source (MIT). WSL2 developers welcome:
~/.fleet-manager/.