Overview

Story Illustrated Video Skill

Create a complete illustrated story video: AI-generated images + TTS narration + FFmpeg assembly.

Critical Implementation Notes (Lessons Learned)

⚠️ Common mistakes to avoid:

DO NOT generate one long TTS clip and split it afterwards: the API truncates input at ~10k chars, so the ending is lost
DO NOT use a uniform duration per image — each narration segment has different length; images must match their segment's actual audio duration
DO NOT use Arabic numerals like "80后" — TTS reads them as "八十后". Use Chinese numerals: "八零后"
DO NOT generate all images in one parallel batch and then all audio: segments and images end up misaligned

✅ Correct approach:

Split story into N segments (≤150 chars each)
Generate TTS for segment → then generate its image, sequentially per segment
Assemble: each segment becomes one video clip (image + its own audio)
Concatenate all clips in order

Workflow (Multi-Turn Conversation)

Step 1: Collect Story

Trigger: User provides a story idea or narrative.

Save the story to the state file, then ask the user for an image style.

Step 2: Collect Style & Generate Plan

Trigger: User provides image style.

Split story into N segments (~80-150 chars each). Output:

旁白总字数：XXX 字
预估语音时长：约 XX 秒（X 分 X 秒）
建议配图数量：N 张

Ask for confirmation.
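
The splitting step can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `split_story` and `plan_summary` are hypothetical names, and the sketch assumes sentences end with common Chinese or Western punctuation. It packs whole sentences greedily up to the 150-character limit and hard-cuts any single sentence longer than that. The estimated-duration line of the plan is left out because the source does not state a speaking rate.

```python
import re

MAX_CHARS = 150  # per-segment limit stated in this skill


def split_story(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_chars."""
    # Split right after sentence-ending punctuation, keeping it attached.
    parts = re.split(r"(?<=[。！？!?；;])", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]

    segments: list[str] = []
    current = ""
    for sentence in sentences:
        # A single overlong sentence is hard-cut so no segment exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        segments.append(current)
    return segments


def plan_summary(segments: list[str]) -> str:
    """Build the character-count and image-count lines of the plan."""
    total = sum(len(s) for s in segments)
    return f"旁白总字数：{total} 字\n建议配图数量：{len(segments)} 张"
```

Greedy packing keeps sentence boundaries intact, so no narration clip starts mid-sentence unless a single sentence exceeds the limit on its own.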

Step 3: Confirm Image Count

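Trigger: User confirms the suggested image count or asks for a different one. If the count changes, re-split the story to match and show the updated plan before continuing.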

Step 4: Execute (No More Questions)

Trigger: User confirms.

Execute pipeline per segment, sequentially:

for each segment i (0 to N-1):
    1. Generate TTS: mmx speech synthesize --text "segment_i" --out seg_i.mp3
    2. Generate image: mmx image generate --prompt "style, segment_i scene" --out-dir images/ --out-prefix i

Then assemble each segment into its own video clip, concatenate all clips.
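
As a rough sketch, the per-segment loop above could be driven from Python like this. It uses only the `mmx` flags that appear in this document. The function names, the directory layout (`seg_NN.mp3` next to an `images/` folder), and the use of the raw segment text as the scene description are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def tts_command(index: int, text: str, out_dir: str,
                voice: str = "male-qn-badao") -> list[str]:
    """Build the TTS command for one segment (flags as shown in this document)."""
    audio = Path(out_dir) / f"seg_{index:02d}.mp3"
    return ["mmx", "speech", "synthesize", "--text", text,
            "--out", str(audio), "--voice", voice, "--non-interactive"]


def image_command(index: int, text: str, style: str, out_dir: str) -> list[str]:
    """Build the image command for one segment; the prompt wording is an assumption."""
    return ["mmx", "image", "generate", "--prompt", f"{style}, {text}",
            "--aspect-ratio", "16:9",
            "--out-dir", str(Path(out_dir) / "images"),
            "--out-prefix", f"{index:02d}", "--non-interactive"]


def run_pipeline(segments: list[str], style: str, out_dir: str) -> None:
    """Process segments strictly in order: audio first, then the matching image."""
    Path(out_dir, "images").mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(segments):
        subprocess.run(tts_command(i, text, out_dir), check=True)
        subprocess.run(image_command(i, text, style, out_dir), check=True)
```

Passing arguments as a list avoids shell quoting problems with narration text, and `check=True` stops the run at the first failed segment instead of continuing with a gap.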

Number Formatting Rule

Always use Chinese numerals for generational references in narration:

❌ "80后、90后" → TTS reads "八十后、九十后"
✅ "八零后、九零后"
✅ "1999年" can stay as written (digits in years are fine as-is)
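
One possible way to apply this rule automatically before sending text to TTS is sketched below. The helper name `generations_to_chinese` is hypothetical. It rewrites, digit by digit, any run of ASCII digits that sits directly before 后 and leaves everything else, including years, untouched.

```python
import re

DIGITS = "零一二三四五六七八九"


def generations_to_chinese(text: str) -> str:
    """Rewrite digit runs directly before 后 (e.g. 80后) one digit at a time."""
    def repl(match: re.Match) -> str:
        return "".join(DIGITS[int(d)] for d in match.group(1)) + "后"

    return re.sub(r"([0-9]+)后", repl, text)


print(generations_to_chinese("80后、90后"))  # 八零后、九零后
print(generations_to_chinese("1999年"))      # 1999年
```

The pattern is deliberately narrow. It converts any digits that immediately precede 后, so narration where that combination means something other than a generation should be reviewed by hand.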

Per-Segment Assembly Script

See scripts/make_video.py — it handles per-segment clip creation and concatenation automatically.
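
For orientation, per-segment assembly with FFmpeg might look like the sketch below. This is not the content of scripts/make_video.py, which is not reproduced here; the function names, the `clip_NN.mp4` and `clips.txt` file names, and the 1280x720 output size are assumptions. Each clip loops one still image for exactly as long as its own audio lasts (`-shortest`), and the clips are then joined with FFmpeg's concat demuxer.

```python
import subprocess
from pathlib import Path


def clip_command(image, audio, clip, size: str = "1280:720") -> list[str]:
    """Build the FFmpeg command for one clip: a still image plus its own audio."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(image),  # repeat the still image as video frames
        "-i", str(audio),
        "-vf", f"scale={size},format=yuv420p",  # same size/pixel format for every clip
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-shortest",  # end the clip when the audio ends
        str(clip),
    ]


def write_concat_list(clips, list_path) -> None:
    """Write the file list read by the concat demuxer (paths must not contain ')."""
    lines = [f"file '{Path(c).resolve().as_posix()}'" for c in clips]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def concat_command(list_path, output) -> list[str]:
    """Join the clips without re-encoding; they must share codec parameters."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(output)]


def assemble(pairs, work_dir, output) -> None:
    """pairs is an ordered list of (image_path, audio_path), one per segment."""
    clips = []
    for i, (image, audio) in enumerate(pairs):
        clip = Path(work_dir) / f"clip_{i:02d}.mp4"
        subprocess.run(clip_command(image, audio, clip), check=True)
        clips.append(clip)
    list_path = Path(work_dir) / "clips.txt"
    write_concat_list(clips, list_path)
    subprocess.run(concat_command(list_path, output), check=True)
```

Because every clip is encoded with the same settings, the final concatenation can use stream copy, which keeps the join fast and avoids a second lossy encode.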

State File

{
  "story": "original story text",
  "style": "confirmed style",
  "imageCount": 8,
  "segments": ["segment 1 text", "segment 2 text", ...],
  "outputDir": "/tmp/story-video-XXXX"
}
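
A small helper for reading and updating this state between conversation turns could look like the following. It is only a sketch: the function names are hypothetical, and the location of the state file is not specified in this document, so the path is passed in by the caller.

```python
import json
from pathlib import Path


def load_state(path) -> dict:
    """Return the saved state, or an empty dict if no state file exists yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def save_state(path, state: dict) -> None:
    """Write the state as UTF-8 JSON, keeping Chinese text readable."""
    Path(path).write_text(
        json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def update_state(path, **fields) -> dict:
    """Merge new fields (e.g. style, segments) into the saved state."""
    state = load_state(path)
    state.update(fields)
    save_state(path, state)
    return state
```

For example, Step 1 might call `update_state(path, story=...)` and Step 2 `update_state(path, style=..., segments=[...])`, so each turn only adds what it has just learned.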

Execution Commands

Per-segment TTS (sequential — one segment at a time)

mmx speech synthesize --text "SEGMENT_TEXT" --out seg_00.mp3 --voice male-qn-badao --non-interactive

Per-segment Image

mmx image generate --prompt "STYLE, scene description" --aspect-ratio 16:9 --out-dir images/ --out-prefix 00 --non-interactive

Assembly

python3 scripts/make_video.py --segments-dir /path/to/segments --images /path/to/images --output final.mp4

Output

Final video: {outputDir}/story-video.mp4

Voice Selection

Use mmx speech voices to list available voices. Recommended for dramatic stories:

male-qn-badao — deep, dramatic male voice
male-qn-jingying — mature male
female-chengshu — mature female

Important Notes

Image style is applied uniformly to ALL images
Narration must be vivid story narration, not summary
Each segment: ≤150 Chinese characters for TTS (avoid truncation)
Aspect ratio: 16:9 for all images and video
Use Chinese numerals for generational references (八零后 not 80后)
After completion, report file location and duration to user
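
To report the duration at the end, one option is to read it from the finished file with `ffprobe`, as sketched here. The helper names are hypothetical; the output format mirrors the "X 分 X 秒" wording of the plan in Step 2.

```python
import subprocess


def probe_duration(path) -> float:
    """Return the media duration in seconds as reported by ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def format_duration(seconds: float) -> str:
    """Format seconds as minutes and seconds, rounded to the nearest second."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} 分 {secs} 秒"


print(format_duration(125.4))  # 2 分 5 秒
```

The same `probe_duration` call also works on an individual `seg_NN.mp3` file if a per-segment duration is needed.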

Version History

2 versions

v1.0.1 (current): Added scripts, 2026-05-25 14:40
v1.0.0: Initial release, 2026-05-22 20:13
