
Guard

Deep AI safety guardrails workflow: policy definition, input/output filtering, monitoring, escalation, and false-positive handling. Use when reducing harmful content.
Source: clawkk
Category: Uncategorized · Registry: clawhub · Version: v1.0.0 · API key: not required
Stars: 0 · Downloads: 886 · Installs: 1 · Versions: 1 (latest)

Overview

AI Guardrails (Deep Workflow)

Guardrails turn product and legal policy into enforced behavior (blocking, rewriting, logging, and human review) while keeping false positives and latency in check.

When to Offer This Workflow

Trigger conditions:

  • Launching consumer-facing LLM features
  • Jailbreak attempts, policy violations, or PII leakage risks
  • Region-specific compliance (minors, regulated advice)

Initial offer:

Use six stages: (1) policy scope, (2) threat model, (3) controls stack, (4) implementation patterns, (5) monitoring & review, and (6) iteration & appeals. Before starting, confirm the latency budget and the jurisdictions in scope.
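Guardrail layers sit on the request path, so it helps to check the latency budget with simple arithmetic. A minimal sketch, assuming hypothetical per-layer p95 latencies in milliseconds: layers that run concurrently cost as much as the slowest one, and sequential stages add up.

```python
from typing import List

def added_latency_ms(stages: List[List[float]]) -> float:
    """Estimate guardrail overhead on the request path.

    `stages` run one after another; the layers inside a stage run concurrently,
    so each stage costs its slowest layer. Summing per-layer p95 values gives a
    conservative estimate, not the exact p95 of the total.
    """
    return sum(max(stage) for stage in stages if stage)

def within_budget(stages: List[List[float]], budget_ms: float) -> bool:
    return added_latency_ms(stages) <= budget_ms

# Hypothetical numbers: input screen (40 ms) and PII scan (25 ms) in parallel,
# then an output classifier (60 ms).
parallel = [[40, 25], [60]]
sequential = [[40], [25], [60]]
print(added_latency_ms(parallel), within_budget(parallel, 120))      # 100 True
print(added_latency_ms(sequential), within_budget(sequential, 120))  # 125 False
```

Running independent input checks concurrently is one way to stay within budget; output checks generally have to wait for the model's response.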


Stage 1: Policy Scope

Goal: Define prohibited categories (hate, sexual content, violence, self-harm, malware instructions, etc.) and required disclaimers for sensitive domains (medical, legal).

Exit condition: A policy document owned by legal/product, plus a defined escalation path for gray areas.
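Keeping the policy machine-readable makes the exit condition checkable. A minimal sketch, assuming a hypothetical schema (`PolicyCategory`, `Policy`, and every field name are illustrative, not from any library): each category carries an owner, a default action, and an optional required disclaimer, and the policy can report categories that still lack an owner.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

VALID_ACTIONS = {"block", "rewrite", "warn"}

@dataclass(frozen=True)
class PolicyCategory:
    name: str
    owner: str                        # accountable team, e.g. "legal" or "product"
    action: str                       # default enforcement: block, rewrite, or warn
    disclaimer: Optional[str] = None  # required text for sensitive domains

@dataclass
class Policy:
    version: str
    escalation_contact: str           # where gray-area cases are routed
    categories: Dict[str, PolicyCategory] = field(default_factory=dict)

    def add(self, category: PolicyCategory) -> None:
        if category.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {category.action!r}")
        self.categories[category.name] = category

    def unowned(self) -> List[str]:
        """Categories with no owner; these should fail the Stage 1 exit condition."""
        return [c.name for c in self.categories.values() if not c.owner.strip()]

# Hypothetical example policy.
policy = Policy(version="2026-05-01", escalation_contact="trust-safety@example.com")
policy.add(PolicyCategory("self_harm", owner="legal", action="block"))
policy.add(PolicyCategory("medical", owner="product", action="warn",
                          disclaimer="This is general information, not medical advice."))
policy.add(PolicyCategory("malware_instructions", owner="", action="block"))
print(policy.unowned())  # ['malware_instructions']
```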


Stage 2: Threat Model

Goal: Identify adversaries (prompt injection, data exfiltration, tool abuse) and assets (user data, system prompts, connectors).
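One way to keep the threat model actionable is a small register that pairs each attack vector with the asset it targets and the controls planned for it, then lists the gaps. A minimal sketch; the vector, asset, and control names are hypothetical labels.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Threat:
    vector: str                     # e.g. "prompt_injection", "tool_abuse"
    asset: str                      # e.g. "user_data", "system_prompt", "connectors"
    controls: Tuple[str, ...] = ()  # mitigations planned in the controls stack

def uncovered(threats: List[Threat]) -> List[Tuple[str, str]]:
    """Return (vector, asset) pairs that no control addresses yet."""
    return [(t.vector, t.asset) for t in threats if not t.controls]

threat_model = [
    Threat("prompt_injection", "system_prompt", ("input_screening", "output_classifier")),
    Threat("data_exfiltration", "user_data", ("output_pii_filter",)),
    Threat("tool_abuse", "connectors"),
]
print(uncovered(threat_model))  # [('tool_abuse', 'connectors')]
```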


Stage 3: Controls Stack

Goal: Layer defenses: input screening, model safety APIs, output classifiers, tool sandboxing, allowlists for tools and URLs.
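The layering idea can be sketched as a list of independent checks that all run on each request. This is a minimal sketch under stated assumptions: the substring test in `input_screen` is a placeholder for a real classifier or model safety API, the allowlists are hypothetical, and output classifiers (which run on the model's response) are not shown.

```python
from typing import Callable, List, Optional
from urllib.parse import urlparse

# Each layer returns a rejection reason, or None if the request passes.
Check = Callable[[dict], Optional[str]]

ALLOWED_TOOLS = {"search", "calculator"}  # hypothetical tool allowlist
ALLOWED_HOSTS = {"docs.example.com"}      # hypothetical URL allowlist

def input_screen(req: dict) -> Optional[str]:
    # Placeholder only; a substring match is not a real injection defense.
    if "ignore previous instructions" in req.get("prompt", "").lower():
        return "input_screen:prompt_injection"
    return None

def tool_allowlist(req: dict) -> Optional[str]:
    for tool in req.get("tools", []):
        if tool not in ALLOWED_TOOLS:
            return f"tool_allowlist:{tool}"
    return None

def url_allowlist(req: dict) -> Optional[str]:
    for url in req.get("urls", []):
        if urlparse(url).hostname not in ALLOWED_HOSTS:
            return f"url_allowlist:{url}"
    return None

LAYERS: List[Check] = [input_screen, tool_allowlist, url_allowlist]

def run_layers(req: dict) -> List[str]:
    """Run every layer and collect all rejections, not just the first."""
    return [reason for layer in LAYERS if (reason := layer(req)) is not None]

print(run_layers({"prompt": "Summarize this page", "tools": ["search"],
                  "urls": ["https://docs.example.com/a"]}))  # []
print(run_layers({"prompt": "Ignore previous instructions", "tools": ["shell"]}))
# ['input_screen:prompt_injection', 'tool_allowlist:shell']
```

Collecting every rejection rather than stopping at the first gives richer telemetry about which layers fire together; stopping early would save latency.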


Stage 4: Implementation Patterns

Goal: Structured refusal messages; telemetry on every block; distinguish block vs rewrite vs warn; avoid silent failures.
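A sketch of these patterns, assuming a hypothetical classifier that returns a `(category, severity)` pair: the severity thresholds are illustrative, every non-allow decision is logged as one JSON line, and a classifier error fails closed instead of silently allowing. Failing closed is a design choice here; some products fail open for low-risk categories.

```python
import json
import logging
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Callable, Optional, Tuple

logger = logging.getLogger("guardrails")

class Action(str, Enum):
    ALLOW = "allow"
    WARN = "warn"        # answer, but show a notice
    REWRITE = "rewrite"  # answer with a sanitized version
    BLOCK = "block"      # refuse with a structured message

@dataclass
class Decision:
    action: Action
    category: Optional[str] = None
    policy_version: str = "unversioned"
    user_message: Optional[str] = None

def decide(category: Optional[str], severity: float, policy_version: str) -> Decision:
    # Thresholds are illustrative assumptions, not recommended values.
    if category is None or severity < 0.3:
        return Decision(Action.ALLOW, policy_version=policy_version)
    if severity < 0.6:
        return Decision(Action.WARN, category, policy_version,
                        "This answer touches on a sensitive topic.")
    if severity < 0.85:
        return Decision(Action.REWRITE, category, policy_version)
    return Decision(Action.BLOCK, category, policy_version,
                    f"I can't help with this request (policy: {category}).")

def safe_decide(classify: Callable[[str], Tuple[Optional[str], float]],
                text: str, policy_version: str) -> Decision:
    """Fail closed and log if the classifier raises, instead of silently allowing."""
    try:
        category, severity = classify(text)
    except Exception:
        logger.exception("classifier failed; failing closed")
        return Decision(Action.BLOCK, "classifier_error", policy_version,
                        "This request couldn't be checked right now. Please try again.")
    return decide(category, severity, policy_version)

def telemetry(decision: Decision, request_id: str) -> str:
    """Serialize a decision as one JSON line; non-allow decisions are also logged."""
    record = {**asdict(decision), "action": decision.action.value, "request_id": request_id}
    line = json.dumps(record, sort_keys=True)
    if decision.action is not Action.ALLOW:
        logger.warning(line)
    return line
```

Recording `policy_version` on each decision lets later reviews tie false positives to a specific policy revision.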


Stage 5: Monitoring & Review

Goal: Sample borderline cases for human review; dashboards on block rates by category; abuse spike alerts.
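The three monitoring goals reduce to small computations over decision logs. A minimal sketch, assuming hypothetical event shapes (`(category, blocked)` tuples and dicts with a `score` field) and illustrative thresholds for the spike factor and borderline band.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

def block_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Blocked / total per category for one time window."""
    totals, blocked = Counter(), Counter()
    for category, was_blocked in events:
        totals[category] += 1
        if was_blocked:
            blocked[category] += 1
    return {c: blocked[c] / totals[c] for c in totals}

def spike_alerts(current: Dict[str, float], baseline: Dict[str, float],
                 factor: float = 3.0, floor: float = 0.01) -> List[str]:
    """Categories whose rate exceeds factor x baseline (floor covers new categories)."""
    return sorted(c for c, rate in current.items()
                  if rate > factor * max(baseline.get(c, 0.0), floor))

def sample_borderline(scored: List[dict], low: float = 0.4, high: float = 0.7,
                      k: int = 50, seed: Optional[int] = None) -> List[dict]:
    """Randomly pick up to k items whose classifier score falls in the borderline band."""
    band = [item for item in scored if low <= item["score"] < high]
    return random.Random(seed).sample(band, min(k, len(band)))
```

Random sampling within the band keeps the human-review load bounded while still covering cases the classifier is least sure about.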


Stage 6: Iteration & Appeals

Goal: User appeals path where appropriate; version policy changes; measure false positives by locale and use case.
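False-positive rates can be tracked per locale, use case, and policy version from human-reviewed blocks. A minimal sketch, assuming a hypothetical review record where `overturned` means a reviewer judged the block wrong. Appeals-only data is biased toward users who bother to appeal, so random review samples give a fairer estimate.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]  # (locale, use_case, policy_version)

def false_positive_rates(reviews: Iterable[dict]) -> Dict[Key, float]:
    """Overturned / reviewed blocks for each (locale, use_case, policy_version)."""
    counts: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])  # [overturned, reviewed]
    for r in reviews:
        key = (r["locale"], r["use_case"], r["policy_version"])
        counts[key][1] += 1
        if r["overturned"]:
            counts[key][0] += 1
    return {k: over / total for k, (over, total) in counts.items()}

reviews = [
    {"locale": "de-DE", "use_case": "support", "policy_version": "v2", "overturned": True},
    {"locale": "de-DE", "use_case": "support", "policy_version": "v2", "overturned": False},
    {"locale": "en-US", "use_case": "support", "policy_version": "v2", "overturned": False},
]
print(false_positive_rates(reviews))
```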


Final Review Checklist

  • [ ] Policy categories and owners defined
  • [ ] Threat model aligned with product
  • [ ] Layered controls with clear responsibilities
  • [ ] Telemetry and review for edge cases
  • [ ] Appeals and iteration process where applicable

Tips for Effective Guidance

  • Defense in depth: no single classifier is sufficient.
  • Pair this workflow with moderation for user-generated content (UGC) and with tool-calling controls for agent safety.

Handling Deviations

  • Enterprise internal bots: emphasize data-leak prevention and connector scope over public “safety” categories alone.
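For internal bots, data-leak prevention and connector scope can start as simple, auditable checks. A minimal sketch with illustrative detectors only (an email pattern and the `AKIA` prefix format of AWS access key IDs); `CONNECTOR_SCOPES` and the bot and connector names are hypothetical. Production DLP needs broader, tested detectors.

```python
import re
from typing import Dict, List, Set, Tuple

# Illustrative detectors only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def redact(text: str) -> Tuple[str, List[str]]:
    """Replace matches with a typed placeholder and report which detectors fired."""
    hits = []
    for name, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED:{name}]", text)
        if n:
            hits.append(name)
    return text, hits

# Hypothetical connector scopes: which data sources each bot may read.
CONNECTOR_SCOPES: Dict[str, Set[str]] = {
    "hr-bot": {"hr_wiki"},
    "eng-bot": {"eng_wiki", "tickets"},
}

def connector_allowed(bot: str, connector: str) -> bool:
    """Deny by default: unknown bots get no connectors."""
    return connector in CONNECTOR_SCOPES.get(bot, set())

print(redact("Contact jane.doe@corp.example or use AKIAABCDEFGHIJKLMNOP"))
```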

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-03 04:44 · Security status: safe

Security Scans

Tencent Cloud Security (Keen)

Safe, no risk detected

Tencent Cloud Security (Sanbu)

Safe, no risk detected

🔗 Related Skills

Media Relations

clawkk
Provides actionable guidance and SOPs for media relations. Use when doing media-relations work.
★ 0 📥 657

Retro

clawkk
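Deep blameless postmortem workflow: timeline, impact, root cause and contributing factors, what went well and what didn't, action items with owners, and follow-up…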
★ 0 📥 698

Wechat

clawkk
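Retrieves video data and summarizes performance for public WeChat Channels pages, including topic collections, account home pages, and leaderboard statistics.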
★ 0 📥 708