Detect intelligence degradation in AI models with a standardized benchmark: 30 curated Chinese-language questions across Math, Reasoning, and Code, designed around real degradation patterns reported by the Chinese developer community.
"Claude/GPT 降智" ("Claude/GPT intelligence degradation") was a top-3 hot topic in scans of Chinese developer communities during April-May 2026:
- claudecode node: 12-reply hot thread on Claude Code behavior changes
- deepseek node: 4 posts on frequent service disruptions

pip install claude-intel-monitor
# Test a model
claude-intel-monitor test --model claude-sonnet-4 --provider anthropic
# Set baseline for degradation detection
claude-intel-monitor baseline --model claude-sonnet-4
# View history
claude-intel-monitor history
# Continuous watch mode
claude-intel-monitor watch --model claude-sonnet-4 --provider anthropic --interval 6h
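
The baseline command above implies a comparison between a stored score and later runs. A minimal sketch of that idea follows. The `RunResult` record, the `is_degraded` helper, and the 5-point tolerance are all hypothetical and chosen for illustration; they are not taken from the tool's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical record of one benchmark run (not the tool's real schema)."""
    model: str
    score: float  # overall score in percent, 0-100


def is_degraded(baseline: RunResult, current: RunResult, tolerance: float = 5.0) -> bool:
    """Return True if the current score fell more than `tolerance` points below baseline."""
    if baseline.model != current.model:
        raise ValueError("baseline and current run must be for the same model")
    return (baseline.score - current.score) > tolerance


baseline = RunResult(model="claude-sonnet-4", score=90.0)
print(is_degraded(baseline, RunResult("claude-sonnet-4", 88.0)))  # small dip, within tolerance
print(is_degraded(baseline, RunResult("claude-sonnet-4", 80.0)))  # 10-point drop, flagged
```

A fixed tolerance keeps ordinary run-to-run noise from raising alerts. A real monitor would probably want to average several runs before flagging a drop.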
30 questions, 3 dimensions:
| Dimension | Count | Weight | Detection Target |
|---|---|---|---|
| Math | 10 | 1.0x | Mathematical reasoning, hallucination tendency |
| Reasoning | 10 | 1.2x | Logical reasoning, reduced safety awareness |
| Code | 10 | 1.3x | Code quality, architectural degradation |
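
One plausible way to combine the weights in the table above is a weighted average of per-dimension pass counts. The sketch below assumes exactly that; the aggregation rule and the `weighted_score` function are illustrative assumptions, and the tool may compute its overall score differently (for example, with partial credit).

```python
# Weights copied from the table above; the aggregation rule itself is an assumption.
WEIGHTS = {"math": 1.0, "reasoning": 1.2, "code": 1.3}


def weighted_score(passed: dict[str, int], totals: dict[str, int]) -> float:
    """Return the weighted pass rate in percent across all dimensions."""
    earned = sum(WEIGHTS[d] * passed[d] for d in WEIGHTS)
    possible = sum(WEIGHTS[d] * totals[d] for d in WEIGHTS)
    return 100.0 * earned / possible


totals = {"math": 10, "reasoning": 10, "code": 10}

# Missing one Code question costs more than missing one Math question.
one_math_miss = weighted_score({"math": 9, "reasoning": 10, "code": 10}, totals)
one_code_miss = weighted_score({"math": 10, "reasoning": 10, "code": 9}, totals)
print(round(one_math_miss, 2), round(one_code_miss, 2))
```

Under this rule, a regression in Code pulls the overall score down more than the same number of misses in Math, which matches the intent of the higher weight.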
All questions are in Chinese. Each answer is validated by a deterministic check function, so no AI grader (and no AI grading bias) is involved.
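
To illustrate what a deterministic check can look like, here is a minimal sketch for a math question with a single numeric answer. The function name `check_numeric_answer`, the "last number in the response" rule, and the sample responses are assumptions made for this example, not the project's actual checks.

```python
import re


def check_numeric_answer(response: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the last number appearing in the response equals the expected value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return False  # no number at all counts as a failure
    return abs(float(numbers[-1]) - expected) <= tol


# "So the answer is 42"
print(check_numeric_answer("所以答案是 42", 42))
# "The answer might be 41 or 43"
print(check_numeric_answer("答案可能是 41 或 43", 42))
```

Because the check is a pure function of the response text, the same response always receives the same verdict, which is what makes scores comparable across runs.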
🧠 Testing deepseek-chat via deepseek — 30 questions
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 91.1% ██████████████████░░ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
📊 DeepSeek first live baseline: 27/30 (91.1%)