
Ollama — Herd Your LLMs Into One Smart Endpoint

Ollama fleet router — herd your Ollama LLMs into one smart endpoint. Route Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices with 7-sign...
Author: twinsgeeks
Category: Uncategorized | Source: clawhub | Version: v1.0.0 (1 version) | 100000 | API key: not required
Stars: 2 | Downloads: 304 | Installs: 0
Tags: #apple-silicon #deepseek #fleet #gemma #inference #latest #llama #llm #load-balancer #mistral #multimodal #ollama #phi #qwen #routing

Overview

Ollama — Herd Your LLMs Into One Endpoint

You have Ollama running on multiple machines. This skill gives you one endpoint that routes every request to the best available device automatically. No more hardcoding IPs, no more manual load balancing, no more "which machine has that model loaded?"

Setup

pip install ollama-herd
herd              # start the router on port 11435
herd-node         # run on each machine with Ollama

Now point everything at http://localhost:11435 instead of http://localhost:11434. Same Ollama API, same models, smarter routing.

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Use your Ollama models through the fleet

OpenAI SDK (drop-in)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Ollama API (same as before, different port)

# Chat
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3:235b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}'

# List all models across all machines
curl http://localhost:11435/api/tags

# Models currently in GPU memory
curl http://localhost:11435/api/ps

# Embeddings
curl http://localhost:11435/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "search query"
}'

What the router does

When a request comes in, the router scores every online node on 7 signals:

  1. Thermal — is the model already loaded in GPU memory? (+50 for hot)
  2. Memory fit — how much headroom does the node have?
  3. Queue depth — how many requests are waiting?
  4. Wait time — estimated latency based on history
  5. Role affinity — large models prefer big machines
  6. Availability — is the node reliably available?
  7. Context fit — does the loaded context window fit the request?

The highest-scoring node handles the request. If it fails, the router retries on the next best node automatically.
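
The scoring step can be pictured with a small sketch. Only the +50 bonus for a hot model comes from the list above; every other weight, the `Node` fields, and the function names `score` and `rank` are hypothetical, chosen only to make the example run. It illustrates the idea and is not the router's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical snapshot of one machine; field names are illustrative."""
    name: str
    loaded_models: set[str]   # models already in GPU memory
    free_memory_gb: float     # memory headroom
    queue_depth: int          # requests waiting
    est_wait_s: float         # latency estimate from history
    is_large: bool            # "big machine" flag for role affinity
    availability: float       # 0.0-1.0 share of time the node was reachable
    loaded_ctx: int           # context window of the loaded model


def score(node: Node, model: str, model_size_gb: float, needed_ctx: int) -> float:
    """Combine the seven signals into one number (higher is better)."""
    s = 0.0
    hot = model in node.loaded_models
    if hot:
        s += 50.0                                         # 1. thermal (from the text)
    s += min(node.free_memory_gb - model_size_gb, 20.0)   # 2. memory fit (assumed cap)
    s -= 5.0 * node.queue_depth                           # 3. queue depth (assumed weight)
    s -= node.est_wait_s                                  # 4. wait time (assumed weight)
    if model_size_gb >= 40 and node.is_large:
        s += 10.0                                         # 5. role affinity (assumed)
    s += 10.0 * node.availability                         # 6. availability (assumed)
    if hot and node.loaded_ctx >= needed_ctx:
        s += 5.0                                          # 7. context fit (assumed)
    return s


def rank(nodes: list[Node], model: str, model_size_gb: float, needed_ctx: int) -> list[Node]:
    """Best node first; later entries double as retry candidates."""
    return sorted(nodes, key=lambda n: score(n, model, model_size_gb, needed_ctx), reverse=True)


nodes = [
    Node("studio", {"llama3.3:70b"}, 60.0, 2, 3.0, True, 0.99, 8192),
    Node("laptop", set(), 10.0, 0, 1.0, False, 0.80, 0),
]
best = rank(nodes, "llama3.3:70b", 43.0, 4096)[0]
print(best.name)  # prints "studio"
```

Returning the full ranking rather than a single winner keeps the retry path simple: if the first node fails, the next entry in the list is already the next best candidate.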

Supported Ollama models

Any model that runs on Ollama works through the fleet. Popular ones:

| Model | Sizes | Best for |
|-------|-------|----------|
| llama3.3 | 8B, 70B | General purpose |
| qwen3 | 0.6B–235B | Multilingual, reasoning |
| qwen3.5 | 0.8B–397B | Latest generation |
| deepseek-v3 | 671B (37B active) | Matches GPT-4o |
| deepseek-r1 | 1.5B–671B | Reasoning (like o3) |
| phi4 | 14B | Small, fast, capable |
| mistral | 7B | Fast, European languages |
| gemma3 | 1B–27B | Google's open model |
| codestral | 22B | Code generation |
| qwen3-coder | 30B (3.3B active) | Agentic coding |
| nomic-embed-text | 137M | Embeddings for RAG |

Resilience features

  • Auto-retry — re-routes to next best node on failure (before first chunk)
  • VRAM-aware fallback — routes to a loaded model in the same category instead of cold-loading
  • Context protection — prevents num_ctx from triggering expensive model reloads
  • Zombie reaper — cleans up stuck in-flight requests
  • Auto-pull — downloads missing models to the best node automatically
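
The "before first chunk" qualifier on auto-retry matters for streaming: once any output has reached the client, switching nodes would duplicate or garble the response. Below is a minimal sketch of that rule, assuming each node is represented by a hypothetical callable that yields text chunks. The names `stream_with_failover` and `NodeFn` are illustrative and not part of ollama-herd.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator, Optional

# Hypothetical: a node is any callable that takes a prompt and yields text chunks.
NodeFn = Callable[[str], Iterable[str]]


def stream_with_failover(ranked_nodes: list[NodeFn], prompt: str) -> Iterator[str]:
    """Try nodes in ranked order; fail over only if nothing has been emitted yet."""
    last_error: Optional[Exception] = None
    for node in ranked_nodes:
        emitted = False
        try:
            for chunk in node(prompt):
                emitted = True
                yield chunk
            return  # stream finished cleanly
        except Exception as exc:
            if emitted:
                raise  # mid-stream failure: the client already saw partial output
            last_error = exc  # failed before the first chunk: safe to try the next node
    raise RuntimeError("all nodes failed before producing output") from last_error


def flaky(prompt: str) -> Iterator[str]:
    """Stand-in for an offline node: fails before producing anything."""
    raise ConnectionError("node offline")
    yield  # unreachable; makes this function a generator


def healthy(prompt: str) -> Iterator[str]:
    """Stand-in for a working node."""
    yield from ("Hello", ", ", "world")


print("".join(stream_with_failover([flaky, healthy], "hi")))  # prints "Hello, world"
```

A failure after the first chunk is re-raised instead of retried, which is one plausible reading of why the feature is limited to the pre-output window.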

Also available

The same fleet router handles three more workloads:

Image generation

curl -o image.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model":"z-image-turbo","prompt":"a sunset","width":1024,"height":1024,"steps":4}'

Enable: curl -X POST .../dashboard/api/settings -d '{"image_generation":true}'

Speech-to-text

curl http://localhost:11435/api/transcribe -F "audio=@recording.wav"

Enable: curl -X POST .../dashboard/api/settings -d '{"transcription":true}'

Embeddings

curl http://localhost:11435/api/embeddings -d '{"model":"nomic-embed-text","prompt":"text"}'

Already enabled — routes through Ollama automatically.
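
For callers working in Python rather than curl, the same request can be sketched with the standard library alone. This assumes the router passes through Ollama's `/api/embeddings` response unchanged, with the vector in an `embedding` field. The helper names `embed` and `cosine` are illustrative, and `cosine` is included only to show a typical next step when comparing a query against a document.

```python
from __future__ import annotations

import json
import math
import urllib.request

ROUTER_URL = "http://localhost:11435"  # router address from the setup section


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """POST to /api/embeddings and return the vector from the "embedding" field.

    Assumes the router returns Ollama's response shape unchanged.
    """
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{ROUTER_URL}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Usage (needs a running router with nomic-embed-text available):
#   similarity = cosine(embed("search query"), embed("a document about searching"))
```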

Dashboard

http://localhost:11435/dashboard — 8 tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. Real-time queue visibility with [TEXT], [IMAGE], [STT], [EMBED] badges.

Request tagging

Track per-project usage:

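# Reuses the `client` created in the OpenAI SDK example above.
messages = [{"role": "user", "content": "Hello"}]
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=messages,
    # extra_body merges these fields into the JSON request body
    extra_body={"metadata": {"tags": ["my-project", "reasoning"]}},
)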

Full documentation

Agent Setup Guide

Guardrails

  • Never restart the router or node agents without user confirmation.
  • Never delete or modify files in ~/.fleet-manager/.
  • Never pull or delete models without user confirmation.

Version history

1 version in total

  • v1.0.0 (current)
    2026-05-07 18:02 | Safe | Safe

Security checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
