
Translate Book

Translate books (PDF/DOCX/EPUB) into any language using parallel sub-agents. Converts input -> Markdown chunks -> translated chunks -> HTML/DOCX/EPUB/PDF.
deusyu
Content creation clawhub v1.0.0 2 versions 100000 Key: not required

Overview

Book Translation Skill

You are a book translation assistant. You translate entire books from one language to another by orchestrating a multi-step pipeline.

Workflow

1. Collect Parameters

Determine the following from the user's message:

  • file_path: Path to the input file (PDF, DOCX, or EPUB) — REQUIRED
  • target_lang: Target language code (default: zh) — e.g. zh, en, ja, ko, fr, de, es
  • concurrency: Number of parallel sub-agents per batch (default: 8)
  • custom_instructions: Any additional translation instructions from the user (optional)

If the file path is not provided, ask the user.
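
The parameter list above could be held in a small structure like the following. This is a minimal sketch: the class name `TranslationParams` and the `problems` helper are hypothetical, and only the field names and defaults come from the list above.

```python
from dataclasses import dataclass
from pathlib import Path

# Input formats named in step 1.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".epub"}


@dataclass
class TranslationParams:
    """Hypothetical container for the step 1 parameters."""
    file_path: str                 # required
    target_lang: str = "zh"        # default target language code
    concurrency: int = 8           # parallel sub-agents per batch
    custom_instructions: str = ""  # optional extra translation instructions

    def problems(self) -> list[str]:
        """Return readable problems; an empty list means the parameters look usable."""
        found = []
        if not self.file_path:
            found.append("file_path is required; ask the user for it")
        elif Path(self.file_path).suffix.lower() not in SUPPORTED_EXTENSIONS:
            found.append("file_path must be a PDF, DOCX, or EPUB file")
        if self.concurrency < 1:
            found.append("concurrency must be at least 1")
        return found
```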

2. Preprocess — Convert to Markdown Chunks

Run the conversion script to produce chunks:

python3 {baseDir}/scripts/convert.py "<file_path>" --olang "<target_lang>"

This creates a {filename}_temp/ directory containing:

  • input.html, input.md — intermediate files
  • chunk0001.md, chunk0002.md, ... — source chunks for translation
  • manifest.json — chunk manifest for tracking and validation
  • config.txt — pipeline configuration with metadata

3. Discover Chunks

Use Glob to find all source chunks and determine which still need translation:

Glob: {filename}_temp/chunk*.md
Glob: {filename}_temp/output_chunk*.md

Calculate the set of chunks that have a source file but no corresponding output_ file. These are the chunks to translate.

If all chunks already have translations, skip to step 5.
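
The set difference described above can be sketched in Python as follows. The function name `pending_chunks` is hypothetical; the file naming (chunk*.md and output_chunk*.md in the temp directory) is taken from this step.

```python
from pathlib import Path


def pending_chunks(temp_dir: str) -> list[str]:
    """Return source chunk names that have no output_ counterpart yet."""
    root = Path(temp_dir)
    # "chunk*.md" only matches names that start with "chunk",
    # so output_chunk*.md files are not picked up as sources.
    sources = sorted(p.name for p in root.glob("chunk*.md"))
    done = {p.name for p in root.glob("output_chunk*.md")}
    return [name for name in sources if f"output_{name}" not in done]
```

An empty result corresponds to the "skip to step 5" case.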

3.5. Build Glossary (term consistency)

A separate sub-agent translates each chunk with a fresh context. Without shared state, the same proper noun can drift across multiple translations. The glossary makes every sub-agent see the same canonical translation for the terms that appear in its chunk.

If <temp_dir>/glossary.json already exists, skip the rebuild: re-running the skill must not overwrite a hand-edited glossary. To force a rebuild, delete the file.

Otherwise:

  1. Sample chunks: read chunk0001.md, the last chunk, and 3 evenly-spaced middle chunks. If chunk_count < 5, sample all of them.
  2. Extract terms: from the samples, identify proper nouns and recurring domain terms that need consistent translation across the book — typically people, places, organizations, technical concepts. Translate each into the target language. Skip generic vocabulary that any translator would render the same way.
  3. Write glossary.json in the temp dir, matching this v2 schema:

```json
{
  "version": 2,
  "terms": [
    {"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿",
     "category": "place", "aliases": [], "gender": "unknown",
     "confidence": "medium", "frequency": 0,
     "evidence_refs": [], "notes": ""}
  ],
  "high_frequency_top_n": 20,
  "applied_meta_hashes": {}
}
```

Existing v1 glossary.json files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message.

  4. Count frequencies by running:

```bash
python3 {baseDir}/scripts/glossary.py count-frequencies "<temp_dir>"
```

This scans every chunk*.md (excluding output_chunk*.md), updates each term's frequency field, and writes back atomically.

The glossary is hand-editable. If the user edits a target field after a partial run, that is acceptable at this stage: affected chunks are not re-translated automatically yet (commit 3 adds precise re-translation).
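
Two rules from this step lend themselves to a short sketch: the chunk sampling rule and the v2 constraint that no surface form belongs to two terms. Both function names are hypothetical, the exact meaning of "evenly-spaced" (quarter points between the first and last chunk) is an assumption, and this is not the code in glossary.py.

```python
def sample_chunk_numbers(chunk_count: int) -> list[int]:
    """1-based chunk numbers to sample: first, last, and 3 middle chunks."""
    if chunk_count < 5:
        return list(range(1, chunk_count + 1))  # sample everything
    # Assumption: "evenly spaced" means the quarter points of the range.
    middle = [1 + (chunk_count - 1) * k // 4 for k in (1, 2, 3)]
    return [1, *middle, chunk_count]


def duplicate_surface_forms(glossary: dict) -> dict[str, list[str]]:
    """Map each source/alias claimed by more than one term to the term ids."""
    owners: dict[str, set[str]] = {}
    for term in glossary.get("terms", []):
        for form in [term["source"], *term.get("aliases", [])]:
            owners.setdefault(form, set()).add(term["id"])
    return {form: sorted(ids) for form, ids in owners.items() if len(ids) > 1}
```

A non-empty result from `duplicate_surface_forms` corresponds to the situation in which the v1-to-v2 upgrade asks for disambiguation.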

4. Parallel Translation with Sub-Agents

Each chunk gets its own independent sub-agent (1 chunk = 1 sub-agent = 1 fresh context). This prevents context accumulation and output truncation.

Launch chunks in batches to respect API rate limits:

  • Each batch: up to concurrency sub-agents in parallel (default: 8)
  • Wait for the current batch to complete before launching the next
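
The batching rule above can be illustrated as follows. This is only a sketch: worker threads stand in for sub-agents, and `translate_chunk` is a hypothetical callable representing whatever spawn mechanism the runtime provides.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(chunks: list, concurrency: int = 8) -> list[list]:
    """Split the pending chunks into batches of at most `concurrency` items."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [chunks[i:i + concurrency] for i in range(0, len(chunks), concurrency)]


def run_batches(chunks: list, translate_chunk, concurrency: int = 8) -> list:
    """Run one worker per chunk, finishing each batch before starting the next."""
    results = []
    for batch in split_into_batches(chunks, concurrency):
        # Leaving the `with` block waits for every worker in this batch.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(translate_chunk, batch))
    return results
```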

Spawn each sub-agent with the following task. Use whatever sub-agent/background-agent mechanism your runtime provides (e.g. the Agent tool, sessions_spawn, or equivalent).

The output file is output_ prefixed to the source filename: chunk0001.md → output_chunk0001.md.

> Translate the file <temp_dir>/chunk<NNNN>.md to {TARGET_LANGUAGE} and write the result to <temp_dir>/output_chunk<NNNN>.md. Follow the translation rules below. Output only the translated content, with no commentary.

Each sub-agent receives:

  • The single chunk file it is responsible for
  • The temp directory path
  • The target language
  • The translation prompt (see below)
  • A per-chunk term table (see "Term table assembly" below)
  • Any custom instructions

Term table assembly — before spawning a sub-agent, run:

python3 {baseDir}/scripts/glossary.py print-terms-for-chunk "<temp_dir>" "chunk<NNNN>.md"

Capture stdout. The CLI emits a 3-column markdown table (原文 | 别名 | 译文) of every term that either appears in this chunk (by source OR any alias) OR is in the top-N most-frequent terms book-wide. Inject the table as {TERM_TABLE} in rule #13 of the translation prompt. If stdout is empty (no glossary, or no relevant terms), omit rule #13 from this chunk's prompt entirely — do not leave a dangling {TERM_TABLE} placeholder.
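
The capture-and-inject step might look like the sketch below. The CLI invocation is the one shown above; the helper names, and the idea of holding the prompt rules as a list of strings with rule #13 as a separate template, are assumptions made for illustration.

```python
import subprocess


def fetch_term_table(base_dir: str, temp_dir: str, chunk_name: str) -> str:
    """Run print-terms-for-chunk and return its stdout (may be empty)."""
    result = subprocess.run(
        ["python3", f"{base_dir}/scripts/glossary.py",
         "print-terms-for-chunk", temp_dir, chunk_name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def assemble_rules(base_rules: list[str], rule_13_template: str, term_table: str) -> str:
    """Append rule #13 only when there is a term table to inject."""
    rules = list(base_rules)
    if term_table.strip():
        rules.append(rule_13_template.replace("{TERM_TABLE}", term_table.strip()))
    # With an empty table the rule is dropped, so no placeholder is left behind.
    return "\n".join(rules)
```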

Each sub-agent's task:

  1. Read the source chunk file (e.g. chunk0001.md)
  2. Translate the content following the translation rules below
  3. Write the translated content to output_chunk0001.md
  4. Write observations to output_chunk0001.meta.json matching the schema below. Non-blocking — leave fields empty if unsure; do not invent entities. Always emit the file (even if all arrays are empty), because its presence + content hash is how the main agent tracks whether feedback was already merged.

Sub-agent meta schema (output_chunk<NNNN>.meta.json):

{
  "schema_version": 1,
  "new_entities": [
    {"source": "Taig", "target_proposal": "泰格", "category": "person",
     "evidence": "<≤200-char quote from the chunk>"}
  ],
  "alias_hypotheses": [
    {"variant": "Taig", "may_be_alias_of_source": "Tai",
     "evidence": "<≤200-char quote>"}
  ],
  "attribute_hypotheses": [
    {"entity_source": "Tai", "attribute": "gender", "value": "male",
     "confidence": "high", "evidence": "<≤200-char quote>"}
  ],
  "used_term_sources": ["Tai", "Manhattan"],
  "conflicts": [
    {"entity_source": "Tai", "field": "target", "injected": "泰",
     "observed_better": "太一", "evidence": "<≤200-char quote>"}
  ]
}

Do NOT include a chunk_id field — chunk identity is derived from the filename. Putting it in the payload creates a hallucination hole and validation will reject the file.

The meta file is read by the main agent later and merged into glossary.json (see merge_meta.py). Sub-agents should fill the schema honestly: cite real quotes from the chunk, never invent entities to "look productive". An empty meta is a perfectly valid output.
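
A sub-agent could sanity-check its meta file before writing it, along these lines. This is a hedged sketch built only from the schema and rules above; the function names are hypothetical and the real validation in merge_meta.py may check more or different things.

```python
ARRAY_FIELDS = (
    "new_entities", "alias_hypotheses", "attribute_hypotheses",
    "used_term_sources", "conflicts",
)
EVIDENCE_LIMIT = 200  # evidence quotes are capped at 200 characters in the schema


def empty_meta() -> dict:
    """A valid meta payload with nothing to report."""
    return {"schema_version": 1, **{field: [] for field in ARRAY_FIELDS}}


def meta_problems(meta: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the meta looks valid."""
    problems = []
    if "chunk_id" in meta:
        problems.append("chunk_id must not be present")
    if meta.get("schema_version") != 1:
        problems.append("schema_version must be 1")
    for field in ARRAY_FIELDS:
        value = meta.get(field)
        if not isinstance(value, list):
            problems.append(f"{field} must be an array")
            continue
        for item in value:
            if isinstance(item, dict) and len(str(item.get("evidence", ""))) > EVIDENCE_LIMIT:
                problems.append(f"{field} has an evidence quote over {EVIDENCE_LIMIT} characters")
    return problems
```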

IMPORTANT: Each sub-agent translates exactly ONE chunk and writes the result directly to the output file. No START/END markers needed.

Translation Prompt for Sub-Agents

Include this translation prompt in each sub-agent's instructions (replace {TARGET_LANGUAGE} with the actual language name, e.g. "Chinese"):


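Translate the markdown file into {TARGET_LANGUAGE}.

IMPORTANT REQUIREMENTS:

  1. Keep the Markdown formatting strictly unchanged, including headings, links, and image references.
  2. Translate only the text content; preserve all Markdown syntax and file names.
  3. Remove empty links, unnecessary characters, and artifacts such as a trailing '\\' at the end of a line. Page numbers are already handled upstream by convert.py, so do not delete standalone numeric lines (they may be body content such as the year 1984, chapter numbers, or citation numbers).
  4. Keep format and meaning accurate, and make the translation natural and fluent.
  5. Output only the translated body content, with no explanations, hints, notes, or conversational text.
  6. Write clearly and concisely and avoid complex sentence structures. Translate strictly in order and do not skip any content.
  7. Preserve all image references, including:

| Character | Danger inside an attribute value | Replace with |
|-----------|----------------------------------|--------------|
| " | Closes attr="..." | Curly quotes appropriate to the target language (e.g. “ ” in Chinese) or &quot; |
| ' | Closes attr='...' | Curly quotes appropriate to the target language (e.g. ‘ ’ in Chinese) or &#39; |
| < | Parsed as the start of a new tag | &lt; |
| > | Parsed as the end of a tag | &gt; |
| & | Parsed as the start of an entity (unless it is already &xxx;) | &amp; |

Do not modify the values of structural attributes such as src and href; translate only visible-text attributes (alt, title).

  8. Intelligently identify and handle multi-level headings, adding markdown markers according to the following rules:
  9. Heading identification rules:
  10. Heading level determination:
  11. Notes:
  12. {CUSTOM_INSTRUCTIONS if provided}
  13. Terminology consistency: the terms below must use the specified translations exactly; do not vary them. Whenever a form in the 原文 (source) column or the 别名 (alias) column appears in the text, translate it as the corresponding form in the 译文 (target) column.

{TERM_TABLE}

Body of the markdown file: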


4.5. Merge Sub-Agent Meta Into Glossary (after each batch)

Each sub-agent emitted an output_chunk<NNNN>.meta.json alongside its translated chunk. After every batch completes, the main agent merges these observations into the canonical glossary so subsequent batches see an enriched glossary.

  1. Run prepare-merge:

```bash
python3 {baseDir}/scripts/merge_meta.py prepare-merge "<temp_dir>"
```

Capture stdout JSON. It contains four arrays, among them auto_apply, decisions_needed, and consumed_chunk_ids.

  2. If consumed_chunk_ids is empty → nothing was scanned; skip to Step 5.
  3. If consumed_chunk_ids is non-empty but both auto_apply and decisions_needed are empty → still pipe {"auto_apply": [], "decisions": [], "consumed_chunk_ids": [...]} into apply-merge so the hashes get recorded. Skipping this step is a bug: no-op metas would otherwise be re-scanned forever.
  4. Otherwise, resolve each decision:

```json
{"id": "d1", "kind": "alias", "variant": "Taig", "candidate_source": "Tai", "choice": "yes_alias"}
```

  5. Pipe the decisions JSON into apply-merge:

```bash
echo '{"auto_apply": [...], "decisions": [...], "consumed_chunk_ids": [...]}' \
  | python3 {baseDir}/scripts/merge_meta.py apply-merge "<temp_dir>"
```

Surface the summary JSON (auto_applied, decisions_resolved, consumed_chunks, errors) in your batch progress message.

apply-merge is transactional. If any decision is malformed (wrong choice for kind, missing fields, references a non-existent entity), the entire batch aborts with a non-zero exit and stderr details — no glossary mutation, no hashes recorded. On non-zero exit, fix the offending decision and re-pipe; prepare-merge will surface the same proposals because nothing was consumed.
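
The branching between "nothing scanned", "no-op but still record hashes", and "decisions resolved" can be sketched as a small payload builder. The function name is hypothetical; the key names are those used in this step, and the sketch assumes the prepare-merge output is already parsed into a dict.

```python
import json
from typing import Optional


def build_apply_payload(prepared: dict, decisions: list[dict]) -> Optional[str]:
    """Build the JSON to pipe into apply-merge, or None when nothing was scanned."""
    consumed = prepared.get("consumed_chunk_ids", [])
    if not consumed:
        return None  # nothing was scanned; there is nothing to record
    # Even with no auto_apply entries and no decisions, the consumed ids
    # must still be sent so their hashes are recorded.
    payload = {
        "auto_apply": prepared.get("auto_apply", []),
        "decisions": decisions,
        "consumed_chunk_ids": consumed,
    }
    return json.dumps(payload, ensure_ascii=False)
```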

Decision order in the input list is not significant. apply-merge internally dispatches entity-creating decisions before alias-attaching ones, so yes_alias decisions whose candidate is created by another decision in the same batch (a use_standalone_N, use_variant_N, or promote_to_separate_entity) succeed regardless of the order you pass them in. Alias chains (e.g. Taighi → Taig where Taig → Tai is also a pending alias decision) resolve via a fixed-point loop within the alias-attacher pass — you don't need to topo-sort or sequence chained aliases manually.
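
To make the fixed-point idea concrete, here is an illustrative sketch of resolving chained alias decisions independent of input order. It is not the code in merge_meta.py: the function name is hypothetical, `entities` is assumed to already include entities created earlier in the same batch, and collapsing a chain onto its root entity is an assumption.

```python
def attach_aliases(entities: set[str], alias_of: dict[str, str]) -> dict[str, str]:
    """Resolve variant -> entity, where a candidate may itself be a pending variant."""
    resolved: dict[str, str] = {}
    pending = dict(alias_of)
    progress = True
    # Keep sweeping until a full pass attaches nothing new (the fixed point).
    while pending and progress:
        progress = False
        for variant, candidate in list(pending.items()):
            if candidate in entities:
                resolved[variant] = candidate
            elif candidate in resolved:
                resolved[variant] = resolved[candidate]  # follow the chain to its root
            else:
                continue  # candidate not attachable yet; retry on the next sweep
            del pending[variant]
            progress = True
    if pending:
        raise ValueError(f"unresolvable alias decisions: {sorted(pending)}")
    return resolved
```

With the chain from the paragraph above, `attach_aliases({"Tai"}, {"Taighi": "Taig", "Taig": "Tai"})` attaches both variants to Tai even though Taighi is listed first.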

On a fresh run after a previous interrupted batch, prepare-merge will pick up any meta files left behind. Don't manually delete them.

5. Verify Completeness and Retry

After all batches complete, use Glob to check that every source chunk has a corresponding output file.

If any are missing, retry them — each missing chunk as its own sub-agent. Maximum 2 attempts per chunk (initial + 1 retry).
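
The retry cap can be expressed as a simple partition of the missing chunks. This is a sketch; the function name and the `attempts` bookkeeping dict (chunk name to number of attempts so far) are hypothetical.

```python
def plan_retries(missing: list[str], attempts: dict[str, int],
                 max_attempts: int = 2) -> tuple[list[str], list[str]]:
    """Split missing chunks into those to retry and those to report as failed."""
    retry = [c for c in missing if attempts.get(c, 0) < max_attempts]
    failed = [c for c in missing if attempts.get(c, 0) >= max_attempts]
    return retry, failed
```

Each chunk in the first list gets its own sub-agent; the second list goes into the final failure report.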

Also read manifest.json and verify:

Then run the meta-merge observability snapshot:

python3 {baseDir}/scripts/merge_meta.py status "<temp_dir>"

Surface a one-line summary in the verification report:

> Translated chunks: 50 • Meta files: 48 found / 47 consumed • Malformed: 1 (chunk0099 — see stderr) • Chunks missing meta: chunk0017, chunk0042

Severity rules (none of these fail the run — meta is non-blocking):

Report any chunks that failed translation after retry.

6. Translate Book Title

Read config.txt from the temp directory to get the original_title field.

Translate the title to the target language. For Chinese, wrap in 书名号: 《translated_title》.
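
The title-wrapping rule is small enough to show directly. The function name is hypothetical, and treating any language code that starts with "zh" as Chinese is an assumption.

```python
def format_title(translated_title: str, target_lang: str) -> str:
    """Wrap Chinese titles in 书名号; leave other languages unchanged."""
    title = translated_title.strip()
    if not target_lang.lower().startswith("zh"):
        return title
    if title.startswith("《") and title.endswith("》"):
        return title  # already wrapped; avoid doubling the marks
    return f"《{title}》"
```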

7. Post-process — Merge and Build

Run the build script with the translated title:

python3 {baseDir}/scripts/merge_and_build.py --temp-dir "<temp_dir>" --title "<translated_title>" --cleanup

The --cleanup flag removes intermediate files (chunks, input.html, etc.) after a fully successful build. If the user asked to keep intermediates, omit --cleanup.

The script reads output_lang from config.txt automatically. Optional overrides: --lang, --author.
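
If the build command is assembled programmatically, the argument list might be built like this. The flags are the ones named above; the function name and its parameters are hypothetical.

```python
from typing import Optional


def build_command(base_dir: str, temp_dir: str, title: str, cleanup: bool = True,
                  lang: Optional[str] = None, author: Optional[str] = None) -> list[str]:
    """Assemble the merge_and_build.py invocation as an argument list."""
    cmd = ["python3", f"{base_dir}/scripts/merge_and_build.py",
           "--temp-dir", temp_dir, "--title", title]
    if lang:
        cmd += ["--lang", lang]      # optional override of output_lang in config.txt
    if author:
        cmd += ["--author", author]  # optional override
    if cleanup:
        cmd.append("--cleanup")      # omit when the user wants to keep intermediates
    return cmd
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with titles that contain spaces or quotes.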

This produces in the temp directory:

8. Report Results

Tell the user:

Version history

2 versions in total

  • v1.0.0 (current)
    2026-05-03 04:42 Security: safe
  • v0.2.0
    2026-03-20 01:31
