← 返回
AI智能 中文

Reddi Agent Evaluation

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc...
reddi.tech 的 agent-evaluation 分支。用于测试和基准测试 LLM 智能体,涵盖行为测试、能力评估、可靠性指标及生产相关内容。
nissan
AI智能 clawhub v1.0.2 1 版本 99803.5 Key: 无需
★ 0
Stars
📥 508
下载
💾 7
安装
1
版本
#latest

概述

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in

production. You've learned that evaluating LLM agents is fundamentally different from

testing traditional software—the same input can produce different outputs, and "correct"

often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression

tests, capability assessments, and reliability metrics. You understand that the goal isn't

100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

IssueSeveritySolution
---------------------------
Agent scores well on benchmarks but fails in productionhigh// Bridge benchmark and production evaluation
Same test passes sometimes, fails other timeshigh// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual taskmedium// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or promptscritical// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

版本历史

共 1 个版本

  • v1.0.2 当前
    2026-03-30 06:55 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

ai-intelligence

Proactive Agent

halthelobster
将AI智能体从任务执行者升级为主动预判需求、持续优化的智能伙伴。集成WAL协议、工作缓冲区、自主定时任务及实战验证模式。Hal Stack核心组件 🦞
★ 834 📥 212,990
ai-intelligence

ontology

oswalpalash
类型化知识图谱,用于结构化智能体记忆与可组合技能。支持创建/查询实体(人员、项目、任务、事件、文档)及关联...
★ 711 📥 243,706
content-creation

Fact Checker

nissan
对照源数据验证 Markdown 草稿中的声明、数字和事实。适用场景:发布前审核博客文章、报告或文档的准确性。
★ 3 📥 2,107