
Smart Router for Ollama

Intelligent task routing between local and cloud Ollama LLM instances. Use when the user wants cost-efficient AI responses by routing simple tasks to a local...
Author: simoncatbot
Uncategorized · clawhub · v1.0.0 (#latest) · 1 version · 100000 · Key: not required
Stars: 0 · Downloads: 349 · Installs: 0

Overview

Smart Router

Routes tasks between a local Ollama instance (fast, cheap) and a remote/cloud Ollama instance (more capable) based on task complexity classification and system capabilities.

Quick Start

# 1. Profile your system
python scripts/system_profiler.py

# 2. Check endpoints are healthy
python scripts/health_check.py

# 3. Route a task
python scripts/route.py "What is quantum computing?"

How It Works

User Request
    ↓
System Profiler (detects compatible models)
    ↓
Health Check (verifies endpoints are up)
    ↓
Classify Task (1-5 complexity score)
    ↓
├─ Score 1-2 → Local Ollama (fast, cheap)
├─ Score 3-5 → Cloud Ollama (powerful)
└─ Match specialist → Dedicated model
    ↓
Verify model available (fallback if not)
    ↓
Stream Response

Classification Scale

| Score | Complexity | Examples | Routed To |
|-------|------------|----------|-----------|
| 1 | Simple | "What is 2+2?", "Define entropy" | Local |
| 2 | Basic | "Write hello world in Python" | Local |
| 3 | Complex | "Debug this error", "Compare X vs Y" | Cloud |
| 4 | Deep | "Design a system", "Research topic" | Cloud |
| 5 | Expert | "Build from scratch", "Multi-file project" | Cloud |

File Structure

smart-router/
├── SKILL.md                          # This file
├── __init__.py                       # Python package interface
├── requirements.txt                    # Dependencies
│
├── config/
│   ├── router.yaml                   # Main configuration
│   └── system_profile.json            # Auto-generated system specs
│
├── scripts/
│   ├── classify.py                   # Task complexity classifier
│   ├── execute.py                    # Ollama API client
│   ├── route.py                      # Main routing logic
│   ├── system_profiler.py            # Hardware detection
│   └── health_check.py               # Endpoint health verification
│
├── tests/
│   └── test_classifier.py            # Test suite
│
└── references/
    └── classifier-prompt.txt         # LLM fallback prompt

Configuration

Edit config/router.yaml:

# Local Ollama (your machine)
local:
  model: "llama3.2"
  base_url: "http://localhost:11434"

# Cloud Ollama (remote server)
cloud:
  model: "qwen2.5:14b"
  base_url: "http://192.168.1.100:11434"

# Tasks scoring >= this go to cloud
threshold: 3

# Domain specialists (checked first)
specialists:
  code:
    model: "codellama:34b"
    base_url: "http://192.168.1.100:11434"
    triggers: ["code review", "refactor"]

# Performance settings
performance:
  timeout_seconds: 60
  stream_responses: true
  retry_attempts: 2

# Caching
cache:
  enabled: true
  db_path: "cache/router.db"
  ttl_seconds: 86400
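
The cache settings above imply a SQLite-backed response cache with a time-to-live. The source does not show the cache schema, so the following is only a minimal sketch of how such a cache could work: the `responses` table and the helpers `open_cache`, `cache_put`, and `cache_get` are hypothetical names, not part of the skill.

```python
import sqlite3
import time


def open_cache(db_path=":memory:"):
    """Open (or create) a hypothetical cache table keyed by task text."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses ("
        "task TEXT PRIMARY KEY, response TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    return conn


def cache_put(conn, task, response, now=None):
    """Store a response, overwriting any earlier entry for the same task."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO responses (task, response, created_at) "
        "VALUES (?, ?, ?)",
        (task, response, now),
    )
    conn.commit()


def cache_get(conn, task, ttl_seconds=86400, now=None):
    """Return the cached response, or None if it is missing or older than the TTL."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT response, created_at FROM responses WHERE task = ?", (task,)
    ).fetchone()
    if row is None:
        return None
    response, created_at = row
    if now - created_at > ttl_seconds:
        return None  # expired entry is treated as a miss
    return response
```

With `ttl_seconds: 86400`, an entry written at time `t` would be served until `t + 86400` seconds (24 hours) and treated as a miss afterwards.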

Usage

CLI

# Basic routing
python scripts/route.py "What is the capital of France?"

# With profiling (updates system profile)
python scripts/route.py "Debug this error" --profile

# Custom config
python scripts/route.py "Design a system" --config config/my-router.yaml

# No streaming (wait for full response)
python scripts/route.py "Summarize this" --no-stream

# Health check all endpoints
python scripts/health_check.py

# Manual classification
python scripts/classify.py "Write a function"
# Output: "2:basic-task"

Python API

from smart_router import SmartRouter

# Initialize
router = SmartRouter()

# Route with streaming
for chunk in router.route("Explain quantum computing"):
    print(chunk, end='')

# Classify only
score, reason = router.classify("Debug this code")
print(f"Complexity: {score}/5, Reason: {reason}")

# Get configuration
config = router.get_config()
print(f"Local model: {config['local']['model']}")

Workflow

1. System Profiling

Run once (or when hardware changes):

python scripts/system_profiler.py

This creates config/system_profile.json with:

  • Total/available RAM
  • GPU detection (VRAM, name)
  • CPU cores
  • Compatible model list
  • Recommended local model

2. Health Check

Verify endpoints before use:

python scripts/health_check.py

Checks:

  • Ollama version
  • Available models
  • Response latency
  • Connection status
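
The checks listed above map onto two read-only Ollama HTTP endpoints, `GET /api/version` and `GET /api/tags`. The sketch below shows one way a probe could collect them. It is an illustration under assumptions, not the contents of scripts/health_check.py: `fetch_json` and `check_endpoint` are hypothetical names, and the shape of the returned dictionary is invented for this example.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5):
    """Fetch a URL and decode its JSON body (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_endpoint(name, base_url, fetch=fetch_json):
    """Probe one Ollama endpoint; never raises, always returns a status dict."""
    start = time.perf_counter()
    try:
        version = fetch(f"{base_url}/api/version").get("version")
        tags = fetch(f"{base_url}/api/tags")
        models = [m["name"] for m in tags.get("models", [])]
    except Exception as exc:  # connection refused, timeout, malformed JSON, ...
        return {"name": name, "status": "unreachable", "error": str(exc)}
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {
        "name": name,
        "status": "healthy",
        "version": version,
        "models": models,
        "latency_ms": latency_ms,
    }
```

Passing `fetch` as a parameter keeps the probe testable without a running server; `check_endpoint("local", "http://localhost:11434")` would use the real HTTP call.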

3. Routing

When you submit a task:

  1. Specialist check — Match against specialist triggers
  2. Classification — Pattern-based scoring (1-5)
  3. Model selection — Local (1-2) or Cloud (3-5)
  4. Availability check — Verify model exists in Ollama
  5. Fallback — Use compatible model if preferred unavailable
  6. Execution — Stream response from selected model
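
Steps 1 to 5 above can be sketched as a single selection function. This is a minimal illustration rather than the code in scripts/route.py: `select_model`, the `available` mapping, and the "first model alphabetically" fallback are assumptions made for the example, and the score is taken as an input instead of being computed.

```python
CONFIG = {
    "local": {"model": "llama3.2"},
    "cloud": {"model": "qwen2.5:14b"},
    "threshold": 3,
    "specialists": {
        "code": {"model": "codellama:34b", "triggers": ["code review", "refactor"]},
    },
}


def select_model(task, score, config, available):
    """Pick (target, model) for a task.

    `available` maps a target name to the set of model names its endpoint
    reports (a hypothetical structure for this sketch).
    """
    text = task.lower()
    # Step 1: specialist check. The first specialist with a matching trigger wins.
    for name, spec in config.get("specialists", {}).items():
        if any(trigger in text for trigger in spec["triggers"]):
            target, model = name, spec["model"]
            break
    else:
        # Steps 2-3: scores at or above the threshold go to cloud.
        target = "cloud" if score >= config["threshold"] else "local"
        model = config[target]["model"]
    # Step 4: availability check on the chosen endpoint.
    models_here = available.get(target, set())
    if model in models_here:
        return target, model
    # Step 5: fallback. Any model on the same endpoint, else the local default.
    if models_here:
        return target, sorted(models_here)[0]
    return "local", config["local"]["model"]
```

A real fallback would presumably rank alternatives by compatibility with the system profile; sorting alphabetically is only a deterministic stand-in.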

Features

Pattern-Based Classification

Uses regex patterns (not LLM calls) for speed:

  • About 30 ms per classification
  • No token cost
  • Handles false positives ("zip code" ≠ code task)
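
A regex classifier of this kind might look like the sketch below. The patterns are hypothetical and far smaller than a usable set; the actual `COMPLEXITY_PATTERNS` in scripts/classify.py may be organized differently, and this version returns only the score rather than the score and reason shown in the Python API section.

```python
import re

# Hypothetical patterns, keyed by complexity score (score 1 is the default).
COMPLEXITY_PATTERNS = {
    5: [r"\bfrom\s+scratch\b", r"\bmulti-file\b"],
    4: [r"\bdesign\s+a\b", r"\bresearch\b"],
    3: [r"\bdebug\b", r"\bcompare\b"],
    2: [r"\bwrite\b", r"\bcode\b"],
}

# Phrases that contain a trigger word but do not indicate that kind of task.
FALSE_POSITIVES = [r"\bzip\s+code\b"]


def classify(task):
    """Return a 1-5 complexity score using regex matching only."""
    text = task.lower()
    # Blank out known false positives before any scoring pattern runs.
    for pattern in FALSE_POSITIVES:
        text = re.sub(pattern, " ", text)
    # Check the highest scores first so the most demanding match wins.
    for score in sorted(COMPLEXITY_PATTERNS, reverse=True):
        if any(re.search(p, text) for p in COMPLEXITY_PATTERNS[score]):
            return score
    return 1
```

Removing false-positive phrases before scoring is one simple way to keep "zip code" from matching the `code` pattern; negative lookbehinds inside each pattern would be an alternative.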

System-Aware Model Selection

Automatically detects what your system can run:

  • No GPU → Filters to CPU-compatible models
  • 8GB RAM → Excludes 70B models
  • GPU available → Prioritizes GPU-accelerated models
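
The filtering rules above can be illustrated with a small lookup of memory requirements. The figures in `MIN_RAM_GB` are rough assumptions chosen for the example, not measured values or numbers from the skill, and `compatible_models` is a hypothetical helper.

```python
# Assumed minimum memory (GB) per model; illustrative only.
MIN_RAM_GB = {
    "llama3.2": 4,
    "qwen2.5:14b": 16,
    "codellama:34b": 32,
    "llama3.1:70b": 48,
}


def compatible_models(ram_gb, vram_gb=0.0, requirements=MIN_RAM_GB):
    """Return models that fit in memory, with GPU-resident ones listed first."""
    budget = max(ram_gb, vram_gb)
    fits = [m for m, need in requirements.items() if need <= budget]
    # Models that fit entirely in VRAM sort first; within each group,
    # larger (more capable) models come before smaller ones.
    return sorted(fits, key=lambda m: (requirements[m] > vram_gb, -requirements[m]))
```

Under these assumed figures, a machine with 8 GB of RAM and no GPU is left with only the smallest model, which matches the "8GB RAM → Excludes 70B models" rule.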

Health Monitoring

Pre-flight checks prevent routing to dead endpoints:

✓ local     | Status: healthy | Latency: 45ms | Models: 5
✗ cloud     | Status: unreachable | Error: Connection refused

Automatic Fallbacks

  1. Model fallback — If configured model unavailable, picks compatible alternative
  2. Endpoint fallback — If cloud fails, retries with local
  3. Error handling — Never crashes, always returns something
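
Endpoint fallback and the "never crashes" guarantee can be combined in one loop. The sketch below assumes a hypothetical `call(target, task)` function that queries one endpoint and raises on failure; neither it nor `run_with_fallback` is taken from the skill's code.

```python
def run_with_fallback(task, targets, call):
    """Try each target in order and return (target, response).

    If every target fails, return (None, message) instead of raising.
    """
    errors = []
    for target in targets:
        try:
            return target, call(target, task)
        except Exception as exc:  # connection errors, timeouts, bad responses
            errors.append(f"{target}: {exc}")
    return None, "All endpoints failed (" + "; ".join(errors) + ")"
```

Calling it with `targets=["cloud", "local"]` gives the cloud-then-local order described above.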

Cost Tracking

Ollama has no per-token fees, so the logs track latency rather than cost:

[2024-01-15T10:30:00] task: '...' -> local | model: llama3.2 | latency: 0.85s
[2024-01-15T10:30:45] task: '...' -> cloud | model: qwen2.5:14b | latency: 3.2s
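
A line in this format could be produced by a small formatter such as the one below. `format_log_line` is a hypothetical helper written to reproduce the sample lines; how the router actually writes its logs is not shown in the source.

```python
from datetime import datetime


def format_log_line(task, target, model, latency_s, when=None):
    """Build one log line in the format shown above."""
    when = when or datetime.now()
    stamp = when.replace(microsecond=0).isoformat()
    # Round to two decimals and drop trailing zeros (0.85 -> "0.85", 3.2 -> "3.2").
    latency = f"{round(latency_s, 2):g}"
    return f"[{stamp}] task: {task!r} -> {target} | model: {model} | latency: {latency}s"
```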

Testing

# Run classifier tests
python tests/test_classifier.py

# Expected output:
# ✓ PASS [1] Simple factual question
# ✓ PASS [1] Zip code (not code)
# ✓ PASS [3] Debugging
# ...
# Results: X passed, Y failed

Troubleshooting

"Cannot connect to Ollama"

# Start the Ollama server if it is not already running
ollama serve

# Verify endpoint
curl http://localhost:11434/api/tags

"Model not found"

# Pull the model
ollama pull llama3.2

# Or let router auto-fallback to available model

"Classification seems wrong"

Check the patterns in scripts/classify.py and add one for the misclassified case:

# Add new pattern
COMPLEXITY_PATTERNS[2].append(r'your\s+pattern\s+here')

"Cloud endpoint slow"

# In config/router.yaml
performance:
  timeout_seconds: 30  # Reduce timeout

Requirements

  • Python 3.8+
  • Ollama (local or remote)
  • pip install -r requirements.txt

Architecture Decision Records

Why Pattern Matching vs LLM?

| Approach | Latency | Cost | Accuracy | Verdict |
|----------|---------|------|----------|---------|
| Pattern matching | 30ms | 0 tokens | 90% | ✅ Used |
| LLM classification | 500ms | 50 tokens | 95% | Optional (--llm) |

Pattern matching wins on speed/cost. Accuracy is good enough for routing.

Why Not Cloud APIs (Claude, GPT-4)?

Ollama-only keeps everything:

  • Private — No data leaves your infrastructure
  • Free — Server costs only, no per-token fees
  • Customizable — Run fine-tuned models

Future Enhancements

  • [ ] Adaptive threshold learning from feedback
  • [ ] Conversation context (multi-turn routing)
  • [ ] Cost/latency budget enforcement
  • [ ] Automatic model downloading
  • [ ] Metrics dashboard

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 13:54 · Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
