Benchmark LLM model throughput (tokens/s). Two modes available:
openclaw infer model run, no API key requiredUse when: User explicitly requests a model throughput test.
Trigger words:
Do NOT trigger: Broad performance discussion terms (e.g. "model performance", standalone "benchmark") should not auto-trigger execution.
Auto Mode (no API key):
python3 throughput.py --auto --model "<current session model>"
Auto-detects the current session model and benchmarks throughput, zero configuration needed.
python3 throughput.py --auto
Test a specific model:
python3 throughput.py --auto --model "zai/glm-5-turbo"
python3 throughput.py \
--url https://api.example.com/v1 \
--key sk-xxx \
--models gpt-4o-mini,gpt-4o
| Parameter | Default | Description |
|---|---|---|
| ----------- | --------- | ------------- |
--iterations | 3 | Test iterations per model |
--max-tokens | 512 | Max output tokens |
--test-prompt | English prose (summer field) | Test prompt |
--timeout | 60 | Single request timeout (seconds) |
--output | throughput-report.md | Output report filename |
--csv | false | Also generate CSV |
1. Read current session model from openclaw.json (provider/model)
2. Send test prompt via openclaw infer model run
3. Timer: command start → output complete
4. Estimate token count from response text (English: 0.75 word/token, Chinese: 1.5 chars/token)
5. Calculate tokens/s
6. Generate summary report
1. Build /v1/chat/completions request
2. Timer: request start → last token received
3. Extract usage.completion_tokens from response (precise)
4. Calculate tokens/s, error rate
5. Generate summary report
| Metric | Description |
|---|---|
| -------- | ------------- |
| Tokens/s | Throughput = Output Tokens / Elapsed Time |
| Avg Latency | Average single request latency |
| Avg Output Tokens | Average output token count |
| Error Rate | Failed request ratio |
# Model Throughput Report
**Mode:** Auto (openclaw infer)
**Iterations:** 3
## Summary
| Model | Avg Tokens/s | Avg Latency(s) | Avg Output Tokens | Error Rate |
|-------|-------------|----------------|-------------------|------------|
| zai/glm-5-turbo | 57.9 | 20.6 | 979.0 | 0.0% |
## Detail
### zai/glm-5-turbo
| Iter | Latency(s) | Output Tokens | Tokens/s | Status |
|------|------------|--------------|---------|--------|
| 1 | 19.5 | 950 | 48.7 | ✅ |
| 2 | 21.3 | 1010 | 47.4 | ✅ |
| 3 | 20.9 | 977 | 46.7 | ✅ |
| Scenario | Auto Mode | API Mode |
|---|---|---|
| ---------- | ----------- | ---------- |
| openclaw not installed | cli_error | — |
| Model not found | api_error | http_404 |
| Network timeout | timeout | timeout |
| Token estimation | English 0.75 word/token, Chinese 1.5 chars/token | Precise from API |
python3 ~/.openclaw/workspace/skills/model-throughput-tester/throughput.py --auto --model "<current session model>"
# Or auto-detect (may not match session override)
python3 ~/.openclaw/workspace/skills/model-throughput-tester/throughput.py --auto
python3 throughput.py \
--url "https://api.openai.com/v1" \
--key "sk-xxx" \
--models "gpt-4o-mini,gpt-4o" \
--iterations 5
python3 throughput.py --auto \
--test-prompt "Explain quantum computing in detail." \
--iterations 5
openclaw infer model run --json, Python subprocess callurllib (Python built-in), OpenAI-compatible /v1/chat/completionstime.perf_counter() nanosecond-levelusage.completion_tokens (precise), Auto mode estimates by character count/v1, /v4, /chat/completions paths共 2 个版本