Sensitive File Anonymization Skill

Use when anonymizing, desensitizing, de-identifying, masking, replacing, or removing sensitive information from PDF, Word DOCX, Markdown, text, contracts, reports, resumes, customer files, or documents containing PII, credentials, financial data, business secrets, names, phones, emails, IDs, addresses, tokens, companies, projects, or custom terms; also use for Chinese requests about 脱敏, 匿名化, 敏感信息, 个人信息, 身份证, 手机号, 邮箱, 地址, 银行卡, 合同, 客户, 公司, 项目, 密钥, or 自定义敏感词.
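This is an AI Agent Skill for document anonymization. It scans for and replaces names, mobile numbers, email addresses, ID card numbers, unified social credit codes, bank card numbers, bank routing codes (联行号), addresses, contract numbers, company/organization names, custom sensitive terms, and similar information, while preserving the readable structure of the original document as far as possible.

It suits contracts, customer records, procurement files, financial files, internal reports, resumes, business documents, and other material that must have sensitive information removed before sharing.

Main capabilities

  • Scans for sensitive information and generates a JSON report.
  • Generates anonymized copies without overwriting the original files.
  • Replaces original values with realistic fake data rather than simply writing `[REDACTED]`.
  • Within a single run, identical original values are replaced with the same fake value.
  • Supports a custom sensitive-term list.
  • Supports adjacent fields in Markdown tables, DOCX tables, and Excel worksheets, for example `A1=法定代表人` (legal representative), `B1=某姓名` (a name); Excel also supports field rules with the header above and the value below, for example `A1=姓名` (name), `A2=某姓名`.
  • Excel processing covers workbook properties, worksheet names, string cells, and hyperlink text/targets, and preserves the workbook, styles, hidden sheets, and macros as far as possible; comments, images, charts, text boxes, pivot tables, embedded objects, and macro code need additional review.
  • `.xlsm` files keep their VBA project, but macro code is neither scanned nor modified.
  • Formula expressions are scanned in scan/verify, but anonymize does not rewrite formulas; if a sensitive value is detected in a formula, the command refuses to produce an output file that might retain sensitive values.
  • Numeric, date, and boolean cells are left unchanged by default to avoid breaking types and formats; if an adjacent-field rule detects a sensitive value in such a cell, anonymize fails instead of saving a workbook with residual sensitive values. To have such values replaced automatically, first convert the target column to text or export it to CSV/TSV.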
user_c0e4b78f
Uncategorized community v1.0.1 2 versions 100000 Key: not required

Anonymize Sensitive Files

Overview

Use this skill to replace sensitive document data with realistic fake values while preserving readable structure. Prefer the bundled runner so the workflow works in Codex and in other skill-loading tools without manual dependency setup.

Workflow

  1. Identify the input files and file types.
  2. Use the CLI's bundled global field rules for every supported document type; they are loaded by default from references/sensitive-field-rules.yaml.
  3. Run scan first for unfamiliar or high-risk files.
  4. Run anonymize to create new output files; the CLI refuses outputs that overwrite originals.
  5. Run verify on anonymized outputs when the source values or terms file are available; verification ignores the CLI's built-in fake values but still reports custom terms. Residual findings make the CLI exit non-zero.
  6. For field-rule misses, update references/sensitive-field-rules.yaml and add or update tests before rerunning anonymization.
  7. Review the JSON report for residual findings, skipped inputs, unsupported content, and warnings.
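The scan, anonymize, and verify steps above can be chained from a small script. The following is a minimal sketch, not part of the skill: `build_commands` and `run_workflow` are hypothetical helper names, the caller is assumed to know the path of the anonymized output, and only flags shown in the CLI section (`--mode`, `--terms`) are used. Because the source does not say how scan signals findings through its exit code, the sketch records that code but does not treat it as fatal.

```python
import subprocess
import sys
from typing import Dict, List, Optional

RUNNER = "scripts/run_anonymize.py"  # runner path as given in the CLI section


def build_commands(source: str, anonymized: str,
                   terms: Optional[str] = None) -> Dict[str, List[str]]:
    """Build the scan, anonymize, and verify invocations for one file."""
    base = [sys.executable, RUNNER]
    extra = ["--terms", terms] if terms else []
    return {
        "scan": base + [source, "--mode", "scan"],
        "anonymize": base + [source] + extra,
        "verify": base + [anonymized, "--mode", "verify"] + extra,
    }


def run_workflow(source: str, anonymized: str,
                 terms: Optional[str] = None) -> Dict[str, int]:
    """Run the steps in order and return each step's exit code."""
    codes: Dict[str, int] = {}
    for step, cmd in build_commands(source, anonymized, terms).items():
        codes[step] = subprocess.run(cmd).returncode
        # Without an anonymized output there is nothing to verify.
        if step == "anonymize" and codes[step] != 0:
            break
    return codes
```

A non-zero `verify` code in the returned dictionary corresponds to the residual-findings case described in step 5.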

CLI

From the skill directory:

python3 scripts/run_anonymize.py input.md
python3 scripts/run_anonymize.py input.docx --report report.json
python3 scripts/run_anonymize.py input.xlsx --mode scan
python3 scripts/run_anonymize.py input.pdf --mode scan
python3 scripts/run_anonymize.py ./docs --recursive --output-dir ./anonymized
python3 scripts/run_anonymize.py contract.md --terms sensitive_terms.txt --seed 20260603
python3 scripts/run_anonymize.py input.docx --field-rules custom_field_rules.yaml
python3 scripts/run_anonymize.py input.docx --no-field-rules
python3 scripts/run_anonymize.py output.anonymized.md --mode verify --terms sensitive_terms.txt

scripts/run_anonymize.py automatically creates .venv, installs scripts/requirements.txt, and runs the real CLI through the virtual environment Python. This avoids polluting global Python installs and works even when the caller cannot activate a shell environment.
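For readers curious how a runner of this kind can work, here is a minimal sketch of the general pattern using only the standard library. It is not the bundled script's actual code: the function names are hypothetical, and the choice to install dependencies only when the environment is first created is an assumption.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import List


def venv_python(venv_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def ensure_venv(venv_dir: Path, requirements: Path) -> Path:
    """Create the environment and install requirements on first use."""
    python = venv_python(venv_dir)
    if not python.exists():
        venv.create(venv_dir, with_pip=True)
        subprocess.run(
            [str(python), "-m", "pip", "install", "-r", str(requirements)],
            check=True,
        )
    return python


def run_cli(skill_dir: Path, args: List[str]) -> int:
    """Run the lower-level CLI with the environment's own interpreter."""
    python = ensure_venv(skill_dir / ".venv",
                         skill_dir / "scripts" / "requirements.txt")
    cli = skill_dir / "scripts" / "anonymize_files.py"
    return subprocess.run([str(python), str(cli), *args]).returncode
```

Calling the interpreter inside `.venv` directly is what removes the need to activate a shell environment.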

For text-only work in an environment that already has PyYAML installed, skip installation:

python3 scripts/run_anonymize.py input.md --no-install

If the caller already manages dependencies, call the lower-level CLI directly:

python3 scripts/anonymize_files.py input.md

The lower-level CLI loads the bundled field rules by default, so PyYAML is required unless --no-field-rules is used. Use --field-rules to replace the bundled rules, or --no-field-rules only for debugging false positives.

Replacement Policy

Use fake data, not [REDACTED_*] placeholders. The CLI keeps the same real value mapped to the same fake value within one run, so repeated names, phone numbers, organizations, and custom terms remain internally consistent.

If a source value already matches the first fake candidate for its category, the CLI chooses a different fake value so the original is not preserved by accident.
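The two behaviours above, a stable mapping within one run and skipping a candidate that equals the original, can be illustrated with a small sketch. `FakeMapper` and its candidate pools are hypothetical and do not reflect the CLI's internals; numbering candidates once the pool is exhausted is also an assumption made to keep the example terminating.

```python
from typing import Dict, List, Tuple


class FakeMapper:
    """Map each (category, original) pair to one fake value per run."""

    def __init__(self, pools: Dict[str, List[str]]) -> None:
        self.pools = pools
        self.mapping: Dict[Tuple[str, str], str] = {}
        self.next_index: Dict[str, int] = {}

    def _candidate(self, category: str, index: int) -> str:
        pool = self.pools[category]
        base = pool[index % len(pool)]
        round_number = index // len(pool)
        # After the pool is used up, append a number to stay unique.
        return base if round_number == 0 else f"{base}{round_number}"

    def fake_for(self, category: str, original: str) -> str:
        key = (category, original)
        if key in self.mapping:  # same real value, same fake value
            return self.mapping[key]
        index = self.next_index.get(category, 0)
        candidate = self._candidate(category, index)
        while candidate == original:  # never hand the original back
            index += 1
            candidate = self._candidate(category, index)
        self.next_index[category] = index + 1
        self.mapping[key] = candidate
        return candidate


mapper = FakeMapper({"name": ["张三", "李四", "王五"]})
first = mapper.fake_for("name", "李雷")
again = mapper.fake_for("name", "李雷")
collision = mapper.fake_for("name", "李四")
```

Here `first` and `again` are the same value, and `collision` skips `李四` because it matches the original.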

Default examples:

  • Names: 张三, 李四, 王五
  • Phones: 19999999999, 18888888888
  • Emails: zhangsan@example.com
  • IDs: structurally valid fake Chinese ID numbers
  • Organizations: 北京星河科技有限公司
  • Projects: 星云迁移项目
  • Secrets and tokens: non-production fake values such as fake_00000000000000000000000000000000
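As an illustration of what "structurally valid" can mean for an 18-digit Chinese ID number, the sketch below computes the check character defined in GB 11643-1999. The source does not say how the CLI builds its fake IDs, so `make_fake_id` and its default region, birth date, and sequence values are hypothetical; the validity check here covers only length and check character, not whether the region code or date is real.

```python
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"  # indexed by (weighted sum mod 11)


def check_char(first17: str) -> str:
    """Return the check character for the first 17 digits."""
    if len(first17) != 17 or not first17.isdigit():
        raise ValueError("expected exactly 17 digits")
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def is_structurally_valid(id_number: str) -> bool:
    """Check length, digit body, and check character only."""
    return (
        len(id_number) == 18
        and id_number[:17].isdigit()
        and check_char(id_number[:17]) == id_number[17].upper()
    )


def make_fake_id(region: str = "110105", birth: str = "19900101",
                 seq: int = 1) -> str:
    """Assemble region + birth date + sequence, then append the check char."""
    body = f"{region}{birth}{seq:03d}"
    return body + check_char(body)
```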

For detailed categories and term-file syntax, read references/anonymization-rules.md.

Custom Terms

Use --terms for one-off names, organizations, customer names, project names, contract numbers, or other business-specific values that regexes and bundled field rules cannot infer.

Term file format:

name:李雷
org:星河集团
project:天枢计划
customer:华东重点客户
plain sensitive phrase

Lines without a category use the generic custom fake-value pool.
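The term-file format above is simple enough to parse in a few lines. This is a sketch under assumptions: the category set is limited to the four prefixes shown in the example (the full list lives in references/anonymization-rules.md), `"custom"` is a hypothetical label for the generic pool, and a line whose prefix is not a known category is kept whole as a plain phrase.

```python
from typing import List, Tuple

KNOWN_CATEGORIES = {"name", "org", "project", "customer"}  # from the example only
GENERIC = "custom"  # hypothetical label for uncategorized terms


def parse_terms(text: str) -> List[Tuple[str, str]]:
    """Turn term-file text into (category, value) pairs."""
    terms: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        prefix, sep, value = line.partition(":")
        if sep and prefix.strip() in KNOWN_CATEGORIES and value.strip():
            terms.append((prefix.strip(), value.strip()))
        else:
            # No recognized category: the whole line is the term.
            terms.append((GENERIC, line))
    return terms


sample = """name:李雷
org:星河集团
project:天枢计划
customer:华东重点客户
plain sensitive phrase
"""
parsed = parse_terms(sample)
```

Restricting prefixes to a known set keeps a phrase that merely contains a colon, such as a URL, from being split by mistake.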

Global Field Rules

The bundled rules in references/sensitive-field-rules.yaml apply to all supported file types, not only contracts. They cover common label/value fields such as emergency contacts, finance contacts, recipients, phone/email fields, addresses, bank accounts, bank routing codes, organizations, and common identifiers.

For Markdown, DOCX, and Excel tables, the CLI also checks adjacent cells: when a cell contains a configured label such as 财务联系人, 收件人, or 联行号, the cell immediately to its right is scanned with that rule. Excel also applies field rules to values below header-like rows, such as A1=姓名, A2=某姓名. Field labels themselves are preserved; only values are replaced. Excel reports use safe coordinates such as sheet 1!B2 rather than raw worksheet titles.
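The adjacent-cell behaviour for a Markdown table row can be sketched as follows. This is an illustration only: `replace_adjacent`, the label-to-category dictionary, and the `fake_for` callback are hypothetical, the real rules come from the YAML file, and escaped pipes inside cells are not handled.

```python
from typing import Callable, Dict, List, Optional


def split_row(line: str) -> List[str]:
    """Split one Markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def match_label(cell: str, label_rules: Dict[str, str]) -> Optional[str]:
    """Return the category of the first configured label found in the cell."""
    for label, category in label_rules.items():
        if label in cell:
            return category
    return None


def replace_adjacent(line: str, label_rules: Dict[str, str],
                     fake_for: Callable[[str, str], str]) -> str:
    """Replace the value to the right of each label cell; keep labels."""
    cells = split_row(line)
    i = 0
    while i < len(cells) - 1:
        category = match_label(cells[i], label_rules)
        if category and cells[i + 1]:
            cells[i + 1] = fake_for(category, cells[i + 1])
            i += 2  # skip the value just replaced
        else:
            i += 1
    return "| " + " | ".join(cells) + " |"


rules = {"财务联系人": "name", "联行号": "bank_routing"}
fakes = {"name": "张三", "bank_routing": "000000000000"}
row = "| 财务联系人 | 李雷 | 联行号 | 102100099996 |"
result = replace_adjacent(row, rules, lambda category, value: fakes[category])
```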

Safety Rules

  • Do not overwrite original files.
  • Do not show full original sensitive values in chat unless the user explicitly provided them and needs them referenced.
  • Do not export raw mappings unless the user asks; raw mappings are sensitive artifacts and are written only to --mapping, never to stdout or reports.
  • Prefer reports with category counts, safe path references, locations, and warnings over reports containing original values or raw file paths.
  • Treat skipped_inputs as incomplete processing. Explicit unsupported or missing input paths make the CLI exit non-zero.
  • For Excel, scan workbook properties, worksheet titles, string cells, formula expressions, hyperlink text, and non-string field-rule values. Anonymization rewrites string workbook properties, worksheet titles, string cells, and hyperlink text, but refuses to save output when formula expressions or non-string field-rule values still contain detected sensitive values.
  • For PDF, require true redaction application, not visual black boxes. The bundled CLI uses PyMuPDF redaction annotations and apply_redactions(), and fails if detected text cannot be matched to redaction rectangles.
  • For scanned PDFs, images, forms, annotations, comments, tracked changes, footnotes, and text boxes, report the limitation and ask for OCR/manual review when needed.
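The rule about `skipped_inputs` and exit codes can be turned into a small post-run check. This sketch assumes the report is a JSON object with a top-level `skipped_inputs` list; the report schema is not documented here, so treat that layout and the `incomplete_reasons` helper as hypothetical. It reports counts only, in keeping with the preference for avoiding raw paths.

```python
import json
from typing import Any, Dict, List


def incomplete_reasons(report: Dict[str, Any], exit_code: int) -> List[str]:
    """List the reasons a run should be treated as incomplete."""
    reasons: List[str] = []
    if exit_code != 0:
        reasons.append(f"CLI exited with code {exit_code}")
    skipped = report.get("skipped_inputs") or []  # assumed top-level list
    if skipped:
        reasons.append(f"{len(skipped)} skipped input(s)")
    return reasons


report = json.loads('{"skipped_inputs": ["scan-001.png"]}')
reasons = incomplete_reasons(report, exit_code=0)
```

An empty list means neither condition was triggered; it does not replace reading the report's warnings.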

Format Notes

Read references/supported-formats.md when processing DOCX/Excel/PDF files, when preserving layout matters, or when the report contains warnings.

Version History

2 versions in total

  • v1.0.1 Updated support for xlsx files (current)
    2026-06-04 13:33, security scans: safe
  • v1.0.0 Initial release
    2026-06-04 10:39, security scans: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
