
LLM Evaluation

Deep LLM evaluation workflow: quality dimensions, golden sets, human vs. automatic metrics, regression suites, offline/online signals, and safe rollout gates.
Author: codenova58
Source: clawhub, v1.0.0 (1 version). Key: not required.

Overview

LLM Evaluation (Deep Workflow)

Evaluation turns “it feels better” into reproducible evidence. Design around failure modes your product cares about—not only aggregate scores.

When to Offer This Workflow

Trigger conditions:

  • Prompt or model change; need before/after proof
  • Building CI for LLM outputs; flaky quality in production
  • RAG/agents: grounding, tool use, safety regressions

Initial offer:

Use six stages: (1) define quality & constraints, (2) build datasets & rubrics, (3) automatic metrics, (4) human evaluation, (5) regression & gates, (6) online validation & iteration. Confirm latency/cost budgets and risk (PII, safety).


Stage 1: Define Quality & Constraints

Goal: Name dimensions that map to user harm if they fail.

Typical dimensions (pick what matters)

  • Correctness / task success; groundedness (RAG); faithfulness to sources
  • Safety: policy violations, jailbreaks, PII leakage
  • Style: tone, brevity, format (when product-critical)
  • Robustness: paraphrase, multilingual, edge inputs

Constraints

  • Max tokens, latency p95, cost per request; locale requirements

Exit condition: Weighted priority of dimensions; non-goals stated.
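
One way to make the exit condition concrete is to write the weighted dimensions and constraints down as data. The sketch below is illustrative only: the `EvalSpec` class, its field names, and every weight and budget value are assumptions, not part of the workflow itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSpec:
    """Quality dimensions with priority weights, plus hard constraints."""

    weights: dict[str, float]  # dimension -> priority weight, must sum to 1.0
    max_output_tokens: int
    latency_p95_ms: float
    cost_per_request_usd: float
    non_goals: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        total = sum(self.weights.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"weights must sum to 1.0, got {total}")

    def weighted_score(self, scores: dict[str, float]) -> float:
        """Combine per-dimension scores in [0, 1] into one weighted number."""
        missing = set(self.weights) - set(scores)
        if missing:
            raise KeyError(f"missing dimension scores: {sorted(missing)}")
        return sum(w * scores[dim] for dim, w in self.weights.items())


# Hypothetical values; real weights and budgets come from the product.
spec = EvalSpec(
    weights={"correctness": 0.5, "safety": 0.3, "style": 0.2},
    max_output_tokens=512,
    latency_p95_ms=2000.0,
    cost_per_request_usd=0.01,
    non_goals=("creative writing quality",),
)
overall = spec.weighted_score({"correctness": 0.9, "safety": 1.0, "style": 0.5})
```

A single weighted number is convenient for trend lines, but it should sit next to the per-dimension scores rather than replace them.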


Stage 2: Datasets & Rubrics

Goal: Fixed eval sets + clear scoring rules.

Practices

  • Stratify by intent: easy/medium/hard; adversarial slice separate
  • Rubrics: 1–5 scales with anchors; binary checks for safety
  • Version datasets (git or table); no silent edits without changelog
  • Privacy: synthetic or redacted real examples per policy
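
The versioning and stratification practices above can be sketched with a content fingerprint and a per-stratum count. This is a minimal illustration, assuming each example is a JSON-serializable dict with hypothetical `intent` and `difficulty` fields; the helper names are invented for the example.

```python
import hashlib
import json
from collections import Counter


def dataset_fingerprint(examples: list[dict]) -> str:
    """Return a short, order-independent SHA-256 fingerprint of an eval set."""
    canonical = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]


def slice_counts(examples: list[dict]) -> Counter:
    """Count examples per (intent, difficulty) stratum."""
    return Counter((ex["intent"], ex["difficulty"]) for ex in examples)


# Hypothetical golden-set rows.
golden = [
    {"id": "q1", "intent": "refund", "difficulty": "easy",
     "input": "How do I get a refund?"},
    {"id": "q2", "intent": "refund", "difficulty": "hard",
     "input": "Refund for a gift bought abroad?"},
    {"id": "q3", "intent": "billing", "difficulty": "easy",
     "input": "Why was I charged twice?"},
]
fingerprint = dataset_fingerprint(golden)
counts = slice_counts(golden)
```

Recording the fingerprint in every eval report makes a silent edit visible: any changed, added, or removed example produces a different value.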

Exit condition: Golden set size justified; inter-rater plan if human scoring.
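
For the inter-rater plan, one common agreement statistic is Cohen's kappa, which corrects raw agreement for chance. The sketch below is an unweighted version that treats labels as nominal categories; for 1–5 rubric scores a weighted variant may fit better. The function name and sample labels are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary safety labels from two raters.
rater_1 = ["pass", "pass", "fail", "fail"]
rater_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(rater_1, rater_2)
```

What counts as acceptable agreement is a project decision; the plan should state the threshold before rating starts.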


Stage 3: Automatic Metrics

Goal: Fast signals—know limitations.

Options

  • Reference-based: BLEU/ROUGE—often weak for assistants
  • Model-as-judge: fast but biased (for example toward answer position, length, or its own outputs); calibrate against human labels
  • Task-specific: exact match, JSON schema validity, tool-call args match
  • RAG: citation overlap, nugget recall, entailment models (use carefully)
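
The task-specific checks in the list above are usually the cheapest to automate. A minimal sketch follows, assuming tool calls are represented as dicts with hypothetical `name` and `arguments` keys. The JSON check only verifies required top-level keys; it is not full JSON Schema validation.

```python
import json


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return _normalize(prediction) == _normalize(reference)


def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """True if output parses as a JSON object containing all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def tool_call_matches(actual: dict, expected: dict) -> bool:
    """Compare tool name and arguments exactly; dict key order is ignored."""
    return (
        actual.get("name") == expected.get("name")
        and actual.get("arguments") == expected.get("arguments")
    )


# Hypothetical tool call emitted by the model and the expected reference.
actual_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
expected_call = {"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}}
ok = tool_call_matches(actual_call, expected_call)
```

Exact argument matching is strict by design; if equivalent values such as "C" and "celsius" should both pass, that normalization has to be added explicitly.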

Hygiene

  • Never train, fine-tune, or tune prompts on the test set; check that eval items have not leaked into prompts or few-shot examples
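
A simple leakage check is to look for eval inputs that appear verbatim in the prompt template or its few-shot examples. The sketch below catches only literal copies after case and whitespace normalization, not paraphrases; the function name, the `min_chars` threshold, and the sample strings are assumptions.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def find_leaked_examples(
    prompt_text: str, eval_inputs: list[str], min_chars: int = 20
) -> list[str]:
    """Return eval inputs that appear verbatim (after normalization) in the prompt."""
    haystack = _normalize(prompt_text)
    leaked = []
    for item in eval_inputs:
        needle = _normalize(item)
        # Skip very short inputs, which match by coincidence too easily.
        if len(needle) >= min_chars and needle in haystack:
            leaked.append(item)
    return leaked


# Hypothetical prompt with one few-shot example copied from the eval set.
prompt = """You are a support assistant.
Q: How do I reset my password on the mobile app?
A: Open Settings, then choose Reset password."""
eval_inputs = [
    "How do I reset my  password on the mobile app?",
    "Can I change my billing date to the 15th?",
]
leaked = find_leaked_examples(prompt, eval_inputs)
```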

Exit condition: Each auto metric has known blind spots documented.


Stage 4: Human Evaluation

Goal: Authoritative judgment where automatic metrics are unreliable.

Design

  • Sample size for confidence; blind A/B when possible
  • Guidelines + examples; adjudication for disagreements
  • Locale-native raters when language quality matters
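
To see whether a blind A/B sample is large enough, a confidence interval on the win rate helps. The sketch below uses the Wilson score interval, one standard choice for a binomial proportion, and assumes ties are excluded before counting; the counts are made up for illustration.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical results: the candidate wins 60% of comparisons in both cases.
small_low, small_high = wilson_interval(wins=30, n=50)
large_low, large_high = wilson_interval(wins=60, n=100)
```

With the same 60% win rate, the interval for 50 comparisons still includes 0.5, while the interval for 100 comparisons sits just above it. That difference is the practical meaning of "sample size for confidence".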

Exit condition: Human scores agree with automatic metrics closely enough to trust the automatic metrics for ongoing monitoring; otherwise, human review remains the release gate.


Stage 5: Regression & Gates

Goal: Block bad deploys in CI or release pipeline.

Gates

  • Must-pass suites: safety, critical user journeys
  • Trend tracking: not only point-in-time
  • Canary with online metrics (see Stage 6)

Artifacts

  • Report: model/prompt id, dataset versions, scores, diff

Exit condition: Rollback criteria defined before rollout.
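
A release gate of this kind can be expressed as a small function that a CI job calls with suite scores. This is a sketch under assumptions: the `release_gate` function, the `GateResult` type, the suite names, and the 0.02 regression tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def release_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    must_pass: dict[str, float],
    max_regression: float = 0.02,
) -> GateResult:
    """Fail if a must-pass suite is below its minimum or any suite regresses."""
    reasons: list[str] = []
    for suite in sorted(set(must_pass) | set(baseline)):
        score = candidate.get(suite)
        if score is None:
            reasons.append(f"{suite}: no score reported")
            continue
        minimum = must_pass.get(suite)
        if minimum is not None and score < minimum:
            reasons.append(f"{suite}: {score:.3f} is below the required {minimum:.3f}")
        previous = baseline.get(suite)
        if previous is not None and previous - score > max_regression:
            reasons.append(f"{suite}: dropped {previous - score:.3f} vs baseline")
    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical suite scores in [0, 1].
baseline = {"safety": 1.0, "checkout_journey": 0.95, "style": 0.80}
candidate = {"safety": 1.0, "checkout_journey": 0.96, "style": 0.75}
result = release_gate(
    candidate, baseline, must_pass={"safety": 1.0, "checkout_journey": 0.95}
)
```

Here the must-pass suites hold, but the gate still fails because `style` dropped by more than the tolerance. The `reasons` list can go straight into the release report.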


Stage 6: Online Validation

Goal: Production truth—shadow, A/B, or gradual ramp.

Signals

  • Implicit: edits, retries, task completion, support tickets
  • Explicit: thumbs up/down, user ratings (sparse)

Causality

  • Confounds: seasonality, cohort—control where possible
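
For a randomized A/B test on a binary signal such as task completion, a two-proportion z-test is one standard way to check whether the difference is larger than noise. The sketch below assumes users were randomly assigned to arms; it does nothing to correct for the confounds listed above. The counts are invented.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms share one success rate."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical task-completion counts: control (A) vs. new prompt (B).
z, p_value = two_proportion_z_test(success_a=400, n_a=1000, success_b=450, n_b=1000)
```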

Final Review Checklist

  • [ ] Quality dimensions prioritized for the product
  • [ ] Versioned eval sets and rubrics
  • [ ] Auto + human roles explicit; limitations documented
  • [ ] Release gates and rollback tied to metrics
  • [ ] Plan for online feedback loop

Tips for Effective Guidance

  • Slice metrics—averages hide regressions on critical intents.
  • For agents, evaluate trajectories, not only final text.
  • Never claim objective truth—evaluation is operationalized judgment.
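
The first tip can be illustrated with a small example in which the overall pass rate is unchanged while one critical slice collapses. The result format (dicts with `slice` and `passed` keys) and the slice names are assumptions for this sketch.

```python
from collections import defaultdict


def pass_rate_by_slice(results: list[dict]) -> dict[str, float]:
    """Compute the pass rate for each slice of an eval run."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for r in results:
        totals[r["slice"]][0] += int(r["passed"])
        totals[r["slice"]][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}


def regressed_slices(
    baseline: list[dict], candidate: list[dict], tolerance: float = 0.0
) -> list[str]:
    """List slices whose pass rate dropped by more than `tolerance`."""
    base, cand = pass_rate_by_slice(baseline), pass_rate_by_slice(candidate)
    return sorted(s for s in base if s in cand and base[s] - cand[s] > tolerance)


def _rows(slice_name: str, passed: int, failed: int) -> list[dict]:
    """Build hypothetical result rows for one slice."""
    return [{"slice": slice_name, "passed": True}] * passed + [
        {"slice": slice_name, "passed": False}
    ] * failed


# Both runs pass 8 of 10 items overall, but the refund slice goes from 2/2 to 0/2.
baseline_run = _rows("chitchat", 6, 2) + _rows("refund", 2, 0)
candidate_run = _rows("chitchat", 8, 0) + _rows("refund", 0, 2)
overall_before = sum(r["passed"] for r in baseline_run) / len(baseline_run)
overall_after = sum(r["passed"] for r in candidate_run) / len(candidate_run)
regressed = regressed_slices(baseline_run, candidate_run)
```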

Handling Deviations

  • No labels: start with a small pairwise comparison set plus spot human review.
  • High-stakes (medical/legal): human-in-the-loop gate; disclaim limits of auto eval.

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-31 06:12, Safe

Security Checks

Tencent Cloud Security (Keen): Safe, no risk

Tencent Cloud Security (Sanbu): Safe, no risk
