
vision-skill

Use this skill for computer vision tasks including image recognition (OCR, object detection) and image generation (text-to-image, image-to-image). Supports a...
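Vision Skill is a vision extension plugin for AI agents such as Trae, Claude, and OpenClaw. It gives text-only models visual understanding (recognition) and image generation capabilities, supports multiple OpenAI-compatible vision models, offers two image upload modes (COS or BASE64), and runs vision tasks asynchronously.

🆕 What's new

  • v2.0: adds a library of 832 curated GPT Image 2 cases (quality-filtered from 1,322 raw cases across 3 community projects, with 490 low-quality cases removed) and an intelligent prompt engine that generates image prompts automatically.
  • v2.1: adds a PPT generation engine with 30+ built-in styles, Markdown-to-PPTX conversion, precise slide_spec editing, template cloning, external image insertion, and version rollback.
  • v2.2: improves reference-image roles. Each of several reference images can be given its own role (background, product, style, and so on), and the roles are blended naturally into the prompt instead of a generic "the reference image is xxx".

🌟 Highlights

  • Flexible model configuration: custom base URL, API key, and model name. Any OpenAI-format API works, so the skill can connect to vision models such as 豆包 (Doubao), 通义千问 (Qwen), and DeepSeek.
  • Two image handling modes: COS cloud storage or direct BASE64 transfer, to suit different scenarios.
  • Visual recognition: OCR text extraction, scene description, and detailed question answering, which the author describes as highly accurate.
  • WeChat screenshot recognition: a tuned wechat_chat preset that distinguishes PC from mobile automatically and identifies the sender of each message.
  • Image generation:
    • Text-to-Image: interprets the prompt to generate high-quality images.
    • Image-to-Image: style transfer or detail edits based on a reference image.
    • Sequential Generation: coherent story illustrations or storyboards.
  • 🧠 Intelligent prompt engine (prompt_builder.py): matches cases automatically and distills inspiration signals. By default it produces a concise creative brief and avoids piling up case text or photography jargon.
  • 📚 GPT Image 2 case library (832 cases): curated prompts from 3 community projects (quality-filtered from 1,322 raw cases, 490 removed), 10 unified categories, stored in a single file (cases/index.json), searchable by keyword.
  • 📊 PPT generation engine (ppt_generator.py): 30+ built-in styles, a full Markdown-to-PPTX pipeline, precise slide_spec editing, template cloning, external image insertion, and version rollback.
  • 🚨 Mandatory confirmation workflow: before generating an image or a PPT, the task must pass requirement confirmation, user approval, and a smoke test, to prevent low-quality output.
  • 🚫 Reference-image necessity detection: recognizes scenarios that need a reference image (photo optimization, style transfer, image-to-image, and so on) and blocks when none is provided.
  • 🖼️ Multi-reference role blending: each reference image can take one of 7 role types (background scene, product element, style reference, and so on), blended naturally into the prompt.
  • Scenario presets: built-in generation styles such as ppt, business_flat, tech_isometric, and hand_drawn, plus recognition output formats such as invoice, contract, form, slide, whiteboard, table, json, and wechat_chat.
  • High availability: an optional fallback model (FALLBACK_MODEL) takes over automatically when the primary model fails.
  • Batch processing and stability: recognizes several images in one call, with built-in retry on failure and an optional downgrade model.
  • Observable task metadata: task status includes started_at, ended_at, duration_ms, api_attempts, upload_mode, and other fields.
  • Flexible task model: asynchronous polling (for large batch jobs) or synchronous waiting (for immediate feedback).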
Author: 吴亮 (lgwanai)
Uncategorized · clawhub · v1.0.1 · 2 versions · 100000 · API key: required
★ 0 stars · 📥 879 downloads · 💾 3 installs · #latest


Vision Skill

Overview

This skill provides capabilities for visual recognition, image generation, and PPT generation. It supports configurable vision models (OpenAI-compatible APIs), flexible image upload modes (COS or BASE64), and executes tasks asynchronously. Includes a quality-filtered GPT Image 2 case library (832 curated cases) for prompt engineering reference, bilingual prompt output (Chinese + English) for ChatGPT and 豆包 compatibility, and 30+ built-in PPT styles for slide generation.

Capabilities

1. Vision Recognition

Analyze images to describe content, extract text (OCR), or answer questions about the image.

  • Input: Local image path or URL, optional prompt.
  • Process: Based on IMAGE_UPLOAD_MODE, uploads to COS (URL mode) or converts to base64.
  • Output: Text description or answer.
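
The upload-mode branch described above can be sketched roughly as follows. This is an illustration under assumptions, not the skill's actual code: `build_image_part` is a hypothetical helper, the returned dictionary follows the `image_url` content-part shape used by OpenAI-compatible chat APIs, and the COS upload is stood in for by an injected `upload` callable rather than a real SDK call.

```python
import base64
import mimetypes
from typing import Callable, Optional


def build_image_part(path_or_url: str, mode: str = "cos",
                     upload: Optional[Callable[[str], str]] = None) -> dict:
    """Return an OpenAI-style image content part for a local path or URL."""
    # Remote images need no upload or encoding in either mode.
    if path_or_url.startswith(("http://", "https://")):
        url = path_or_url
    elif mode == "base64":
        mime = mimetypes.guess_type(path_or_url)[0] or "application/octet-stream"
        with open(path_or_url, "rb") as fh:
            encoded = base64.b64encode(fh.read()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    elif mode == "cos":
        if upload is None:
            raise ValueError("cos mode needs an upload callable")
        url = upload(path_or_url)  # hypothetical: returns a reachable COS URL
    else:
        raise ValueError(f"unknown IMAGE_UPLOAD_MODE: {mode}")
    return {"type": "image_url", "image_url": {"url": url}}
```

Passing the uploader in as a parameter keeps the sketch independent of any particular storage SDK; in BASE64 mode the image travels inside the request itself, which avoids storage setup at the cost of a larger payload.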

2. Image Generation (⚠️ Mandatory Confirmation Workflow)

Generate images from text prompts, optionally using reference images. CRITICAL: The workflow below is mandatory — never skip steps.

🚨 Mandatory Image Generation Workflow

┌──────────────────────────────────────────────────────────────┐
│  STEP 0: DETECT — Check if reference image is REQUIRED       │
│  STEP 1: UNDERSTAND — Confirm requirements with user         │
│  STEP 1.5: FIDELITY — Choose strict/normal/creative mode     │
│  STEP 2: SEARCH — Find 1-2 inspiration cases, do not copy    │
│  STEP 3: DRAFT — Compose a concise bilingual creative brief  │
│  STEP 4: CONFIRM — User MUST approve before generation       │
│  STEP 5: ITERATE — Refine until user says "generate"         │
│  STEP 6: EXECUTE — Generate image via vision_cli.py          │
└──────────────────────────────────────────────────────────────┘

🚫 STEP 0 — REFERENCE IMAGE NECESSITY DETECTION (参考图必要性检测)

Before any requirement gathering, first determine if this task REQUIRES a reference image. If it does and the user hasn't provided one, BLOCK immediately and ask for it.

Detection Matrix: When is a reference image MANDATORY?

| Scenario | Requirement | Examples |
|:---|:---:|:---|
| Photo Optimization/Enhancement | 🔴 REQUIRED | "优化这张照片", "enhance this photo", "improve image quality" |
| Photo Style Transfer | 🔴 REQUIRED | "把这张照片变成动漫风格", "make this photo look like oil painting" |
| Photo Editing/Retouching | 🔴 REQUIRED | "去掉背景", "remove background", "修复这张图", "fix this image" |
| Face/Person Swap or Edit | 🔴 REQUIRED | "换脸", "change the person's hair", "add sunglasses to this person" |
| Product Photo Variation | 🔴 REQUIRED | "给这个产品换一个背景", "change this product's packaging color" |
| Image-to-Image (图生图) | 🔴 REQUIRED | "基于这张图生成", "use this as reference", "--ref" |
| Style Reference ("like this") | 🔴 REQUIRED | "像这张图一样的风格", "similar style to this image" |
| Screenshot/UI Modification | 🔴 REQUIRED | "改一下这个界面", "redesign this screen" |
| Image Restoration | 🔴 REQUIRED | "修复老照片", "restore this old photo", "去水印" |
| Sequential Character Consistency | 🟡 OPTIONAL | "生成同一个角色的多张图" (reference helps consistency) |
| Pure Text-to-Image Creation | 🟢 NOT REQUIRED | "画一只猫", "generate a cyberpunk city", "create a poster" |
| Brand/Logo Design from Scratch | 🟢 NOT REQUIRED | "设计一个logo", "create a brand identity" |
| UI Mockup from Description | 🟢 NOT REQUIRED | "设计一个APP界面", "create a dashboard mockup" |
| Infographic from Data | 🟢 NOT REQUIRED | "生成一张信息图", "make a chart about..." |

Detection Logic (check in this order):
  1. User provided a file path/URL? → --ref mode
    • Single image: proceed normally (role confirmation happens in STEP 1)
    • 2+ images without roles specified: → BLOCK with MULTIPLE Images WITHOUT Roles template below
    • 2+ images WITH roles specified: proceed normally
  2. User's request contains reference-required keywords? → BLOCK and ask for reference image:
    • 优化/增强/修复/enhance/optimize/improve/fix/restore/retouch + 照片/图片/photo/image/picture
    • 改成/变成/换成/change/convert/transform/transfer/make this + 风格/样式/style/look
    • 去掉/删除/添加/修改/remove/add/modify/edit/change + 背景/颜色/元素/background/color/element
    • 换脸/修图/美颜/P图/face swap/retouch/beautify
    • 老照片/水印/模糊/watermark/blurry/old photo
    • 基于/参考/参照/according to/based on/refer to + 图/照片/图片/image/photo/picture
  3. User says "like this image" or "similar to this"? → BLOCK and ask for reference image
  4. Pure creation from imagination? → No reference needed, proceed to STEP 1
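
The ordered checks above can be approximated with plain keyword matching. The sketch below is only an illustration: `detect_reference` and its return labels are hypothetical names, the keyword groups are an abridged copy of the lists above, and the multi-image role check from rule 1 is left out. Substring matching is also crude (for example, "prefix" contains "fix"), so a real detector would need word-boundary handling.

```python
import re

# Abridged (action keywords, object keywords) pairs from rule 2.
ACTION_OBJECT_RULES = [
    (("优化", "增强", "修复", "enhance", "optimize", "improve", "fix", "restore", "retouch"),
     ("照片", "图片", "photo", "image", "picture")),
    (("基于", "参考", "参照", "according to", "based on", "refer to"),
     ("图", "照片", "图片", "image", "photo", "picture")),
]
# Phrases that trigger a block on their own (rules 2 and 3).
STANDALONE = ("换脸", "修图", "美颜", "p图", "face swap", "beautify",
              "老照片", "水印", "watermark", "old photo",
              "like this image", "similar to this")
PATH_OR_URL = re.compile(r"(https?://\S+|\S+\.(?:png|jpe?g|webp|gif))", re.I)


def detect_reference(request: str) -> str:
    """Classify a request as 'ref_mode', 'blocked', or 'not_required'."""
    text = request.lower()
    if PATH_OR_URL.search(request):
        return "ref_mode"          # rule 1: the user already supplied an image
    for actions, objects in ACTION_OBJECT_RULES:
        if any(a in text for a in actions) and any(o in text for o in objects):
            return "blocked"       # rule 2: action + object keyword pair
    if any(phrase in text for phrase in STANDALONE):
        return "blocked"           # rules 2-3: standalone trigger phrases
    return "not_required"          # rule 4: pure creation
```

For instance, `detect_reference("优化这张照片")` returns `"blocked"`, while `detect_reference("画一只猫")` returns `"not_required"`.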
When BLOCKED — Response Template:
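⚠️ This task needs a reference image before it can continue.

Your request falls under [specific scenario, e.g. photo style optimization], which requires a reference image.

Please provide the file path of the reference image, for example:
- Local file: /path/to/your/image.jpg
- URL: https://example.com/image.png

I will continue once it is provided.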
When MULTIPLE Images WITHOUT Roles — Response Template:

When user provides 2+ reference images but not their roles, BLOCK and ask:

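⚠️ I noticed you provided several reference images. To blend them well, please tell me the role of each one:

- Image 1 (file: xxx.jpg): what should this image be used as?
- Image 2 (file: yyy.png): what should this image be used as?

Common role types:
• 🏞️ Background scene: keep the composition and lighting as the backdrop
• 📦 Product element: keep the shape and color as the main subject
• 🎨 Style reference: keep the tones and brushwork as the style baseline
• 🌈 Color palette reference: keep the color scheme and light-dark relationships
• 🧍 Character pose: keep the action, angle, and proportions
• 🖼️ Composition template: keep the layout, layering, and visual flow
• 🪵 Material reference: keep the texture, grain, and gloss

You can also describe how each image should be used, for example: "Image 1 as a warm autumn street background, keeping its composition and lighting; Image 2 as the product element, keeping its shape and material".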

Do NOT proceed past STEP 0 until:

  • A reference image path/URL is provided (for 🔴 REQUIRED scenarios), OR
  • Each of 2+ reference images has its role confirmed by the user, OR
  • It's confirmed as 🟢 NOT REQUIRED (pure text-to-image)

STEP 1 — UNDERSTAND (需求确认):

Once reference requirements are resolved, confirm the following. Use AskUserQuestion when multiple options exist:

  1. Subject (主体/内容): What exactly should be in the image? (person, product, scene, abstract concept)
  2. Style (风格): What visual style?
    • 📷 Photography (realistic, cinematic, editorial, snapshot)
    • 🎨 Illustration (anime, watercolor, oil painting, sketch, vector)
    • 🏗️ 3D Render (isometric, CGI, clay, pixel art)
    • 📰 Poster/Typography (editorial, minimal, vintage, cyberpunk)
    • 📱 UI/Screenshot (app mockup, social media, dashboard)
    • 📊 Infographic (chart, diagram, recipe card, explainer)
  3. Purpose (用途): What is this for? (social media, e-commerce, poster, avatar, thumbnail, ad)
  4. Format (规格): Aspect ratio? (1:1, 9:16, 16:9, 4:5, 3:4)
  5. Reference Image Details (if --ref mode):
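    • ⚠️ CRITICAL PRINCIPLE: Do not guess what the user has not said. Never invent keep/change/role values yourself.
      • User gave only an image path → ask what it is for; do not fill in defaults such as "keep the composition"
      • User gave a role but no keep/change → describe the role only; do not add keep/change
      • User said "whatever" / "you decide" → only then may you suggest sensible defaults
      • ❌ Bad: the user supplies one image and you say "keep the composition and tones, change the subject". That is invented and will mislead.
      • ✅ Good: the user says "Image 1 is the background, keep the lighting, change the tones" → put exactly that into the prompt
    • Single image: Confirm what to KEEP and CHANGE
      • What to KEEP from the reference? (composition, subject, colors, style, structure)
      • What to CHANGE? (style, background, colors, details, mood)
    • Multiple images (2+): Additionally confirm each image's ROLE
      • ⚠️ If roles were already gathered in STEP 0, skip this step and do not ask again.
      • CRITICAL: If user provides 2+ reference images without specifying what each is for → ASK before proceeding
      • Ask: "What role does each reference image play? (for example: Image 1 as the background scene, Image 2 as the product element, Image 3 as the style reference)"
      • If user provides roles for only some images: ask about the remaining ones specifically
      • ⚠️ If the user gives a role but not what to keep or change, do not invent it; describe the role only
      • Common role categories:
        • 🏞️ Background scene: keep the composition, lighting, and spatial relationships as the backdrop
        • 📦 Product element: keep the product's shape, color, and material as the main subject
        • 🎨 Style reference: keep the tones, brushwork, and rendering approach as the style baseline
        • 🌈 Color palette reference: keep the color scheme and light-dark relationships
        • 🧍 Character pose: keep the character's action, angle, and proportions
        • 🖼️ Composition template: keep the layout, layering, and visual flow
        • 🪵 Material reference: keep the texture, grain, and gloss
      • Role description principles: describe roles naturally rather than as mechanical labels, saying what each image contributes to the final picture
        • ✅ Good: "Image 1 as a warm autumn street background, keeping the composition and lighting" vs ❌ Bad: "Image 1 = background"
    • Reference image path: the user-provided file path or URL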
  6. Text/Content (文字): Any specific text, labels, or copy that must appear?
  7. Constraints (约束): Any must-avoid elements? (no watermarks, no text, specific colors)

⚖️ STEP 1.5 — FIDELITY MODE (忠实度模式)

Ask the user to choose a fidelity mode. This is critical — the mode determines how much AI creativity is allowed and whether cases are referenced.

Use AskUserQuestion to present the three options:

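| Mode | Label | Behavior | Best For |
|---|---|---|---|
| 🔒 Strict | Strict mode | Text follows the user's original wording exactly; the AI must not modify, invent, or rewrite any text. Lighting, composition, image quality, and other visual effects are optimized as usual. The case library is referenced as usual. | Brand copy, UI screens, poster text, ad slogans, and other cases that need exact text. Chinese content is especially prone to being altered by the AI. |
| 📋 Normal | Normal mode | The user's requirements lead; cases serve only as inspiration, and photography/rendering jargon is not piled on. | The default choice for most image generation scenarios. |
| 🎨 Creative | Creative mode | Treats the requirements as a direction rather than a checklist to execute mechanically; the AI may rearrange composition, materials, lighting, and narrative relationships. Cases serve only as inspiration. | Art illustration, concept design, abstract expression, style exploration, and other creative scenarios. |
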
Mode Selection Logic:
  1. Default = Normal unless the user explicitly requests otherwise
  2. Auto-detect strict: If the user's request contains specific text/copy that must appear (slogans, brand names, UI labels, headings), suggest strict mode to prevent AI from altering it
  3. Auto-detect creative: If the user asks for "artistic", "conceptual", "abstract", "experimental", "unique style", "创意", "艺术" — suggest creative mode
  4. For UI/text-heavy tasks: Always suggest strict mode to ensure text accuracy (no hallucinated copy)
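
As a rough sketch, the selection rules above could be encoded like this. `suggest_fidelity` and its parameters are hypothetical names, and letting strict win over creative when both apply is an assumption made here, since the rules do not state a precedence.

```python
CREATIVE_HINTS = ("artistic", "conceptual", "abstract", "experimental",
                  "unique style", "创意", "艺术")


def suggest_fidelity(request: str, required_text: str = "",
                     text_heavy: bool = False, explicit: str = "") -> str:
    """Suggest a --fidelity value following the selection rules above."""
    if explicit in ("strict", "normal", "creative"):
        return explicit                      # the user's explicit choice wins
    if required_text.strip() or text_heavy:
        return "strict"                      # rules 2 and 4: protect exact copy
    if any(hint in request.lower() for hint in CREATIVE_HINTS):
        return "creative"                    # rule 3
    return "normal"                          # rule 1: default
```

The result is only a suggestion to present to the user; the workflow still asks the user to make the final choice.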
CLI flags:
--fidelity strict    # Text exactly as specified, visual quality optimized
--fidelity normal    # Balanced, case-referenced (default)
--fidelity creative  # Maximum artistic freedom
Why this matters:

The key insight: AI image generators often hallucinate or alter text content (Chinese text especially). Strict mode enforces text accuracy while still optimizing visual quality. Conversely, creative mode unleashes full artistic potential when text precision is not a priority.


STEP 2 — SEARCH (案例检索):

Once requirements are confirmed, search the unified case library (832 quality-filtered cases):

python3 scripts/search_cases.py "<keywords>" --category "<category>" --style "<style>" --limit 5

Use 1-2 relevant cases only as inspiration signals. Present case titles/categories if useful, but do not paste long case prompts into the final generation prompt and do not force the model to reproduce a case. The user's brief and any supplied reference image take priority over the case library.

STEP 3 — DRAFT (智能提示词生成):

Use the intelligent prompt builder engine that integrates three-project methodology. Outputs bilingual (Chinese + English) by default for ChatGPT and 豆包 compatibility:

python3 scripts/prompt_builder.py \
  --subject "<subject>" \
  --style "<style>" \
  --purpose "<purpose>" \
  --category "<category>" \
  --ratio "<ratio>" \
  --mood "<mood>" \
  --composition "<composition>" \
  --colors "<palette>" \
  --text "<text_content>" \
  --constraints "<constraints>" \
  --ref "<reference_image_path_or_url>" \
  --lang "<zh/en/ja>" \
  --fidelity "<strict|normal|creative>" \
  --json-output
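
When an agent drives this command programmatically, assembling the argument list in code avoids shell-quoting problems with multilingual text. The helper below is a hypothetical sketch: the flag names come from the command above, but whether prompt_builder.py accepts calls that omit unused flags is an assumption.

```python
import sys
from typing import Dict, List

# Flag names taken from the command above; values are supplied by the caller.
FLAGS = ("subject", "style", "purpose", "category", "ratio", "mood",
         "composition", "colors", "text", "constraints", "ref", "lang", "fidelity")


def build_prompt_builder_argv(fields: Dict[str, str]) -> List[str]:
    """Assemble an argv list for scripts/prompt_builder.py, skipping empty fields."""
    argv = [sys.executable, "scripts/prompt_builder.py"]
    for name in FLAGS:
        value = fields.get(name, "").strip()
        if value:
            argv += [f"--{name}", value]
    argv.append("--json-output")
    return argv
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)` without going through a shell.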

The engine automatically:

  • Selects the optimal approach (JSON-structured / natural language / photography spec / platform-specific)
  • Outputs both Chinese and English prompts — use the Chinese version for 豆包/国内模型, English for ChatGPT/DALL·E
  • Respects fidelity mode: strict (text accuracy) / normal (concise, user-led) / creative (open-ended, high creative freedom)
  • Extracts light inspiration signals from the quality-filtered case library without copying full case prompts
  • Avoids automatic camera/lens/lighting keyword injection unless the user explicitly asks for those controls
  • Integrates negative constraints from the constraint library
  • References top-matched cases for pattern extraction (normal mode only)

Prompt Quality Principles (from quality-filtered case analysis):

  1. Brief first — keep the final prompt short enough for the image model to interpret creatively
  2. Structured sections for complex layouts — panels, grids, rows, columns with explicit numbering
  3. Negative constraints — include only user constraints or 1-3 essential defaults
  4. Aspect ratio — always specify, as 21.8% of quality cases do
  5. Text specs — for UI/posters, explicitly list every text string that must appear
  6. Color system — 4-6 color palette for posters/brand; dominant + accent for photos
  7. Raycast — use {argument} for flexible parameters when user needs variants

STEP 4 — CONFIRM (用户确认):

Present the draft prompt to the user with:

  • The full prompt text
  • Reference to the cases that inspired it, clearly marked as inspiration only
  • For --ref mode: confirm the reference image path and transformation details
  • Ask: "Does this look good? Would you like to adjust anything, or shall I generate?"

⚠️ NEVER generate until the user explicitly confirms (e.g., "生成", "generate", "ok", "yes").

STEP 5 — ITERATE (迭代优化):

If the user requests changes:

  • Adjust specific elements as requested
  • Re-present the updated prompt
  • Continue until user approves

STEP 6 — EXECUTE (执行生图):

Only after user confirmation, execute. For --ref mode, the reference image is uploaded via COS (returns URL) or converted to BASE64 based on IMAGE_UPLOAD_MODE config.

# Pure Text-to-Image (no reference)
python3 scripts/vision_cli.py generate "<final_prompt>" --style <preset> --wait --output ./output.png

# Image-to-Image with reference (COS URL or BASE64 handled automatically by vision_cli.py)
python3 scripts/vision_cli.py generate "<final_prompt>" --ref <user_provided_image_path> --wait --output ./output.png

The vision_cli.py script handles reference image upload automatically:

  • COS mode (IMAGE_UPLOAD_MODE=cos): Uploads the reference image to COS and passes the COS URL to the API
  • BASE64 mode (IMAGE_UPLOAD_MODE=base64): Converts the reference image to base64 and embeds it in the API request

Text-to-Image

Generate images from a text description. STEP 0 check: 🟢 reference NOT required unless user explicitly asks for style reference.

Image-to-Image

Generate images based on a reference image. STEP 0 check: 🔴 reference REQUIRED. In STEP 1, also confirm: what to keep from the reference, what to change. Reference image is uploaded via COS or BASE64 in STEP 6.

Sequential Generation

Generate a series of consistent images (e.g., storyboards). Each image follows the same workflow. Reference image for character consistency is 🟡 OPTIONAL but recommended.

3. WeChat Chat Screenshot Recognition

Specialized recognition for WeChat chat screenshots:

  • Platform Detection: Distinguish between PC and mobile WeChat
  • Sender Identification: Green bubbles = self, gray/white bubbles = others
  • Structured Output: Returns JSON with platform, chat title, and message list
  • Message Types: Supports text, image, voice, video, link detection
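
A consumer of the structured output might look like the sketch below. The key names (`platform`, `chat_title`, `messages`, `sender`, `type`, `content`) are assumptions chosen for illustration; the source only states that the JSON carries the platform, chat title, and message list.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    sender: str   # assumed values: "self" (green bubble) or "other" (gray/white)
    type: str     # text, image, voice, video, or link
    content: str


def parse_wechat_result(raw: str) -> dict:
    """Parse a wechat_chat result whose key names are assumed, not documented."""
    data = json.loads(raw)
    messages: List[Message] = [
        Message(m["sender"], m.get("type", "text"), m.get("content", ""))
        for m in data.get("messages", [])
    ]
    return {"platform": data.get("platform"),
            "chat_title": data.get("chat_title"),
            "messages": messages}
```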

4. Unified Case Library (832 quality-filtered cases)

Searchable library of GPT Image 2 prompts from 3 community projects, deduplicated and quality-filtered. Organized in 10 unified categories:

  • Portrait & People (193 cases)
  • Poster & Typography (180 cases)
  • Social Media & UI (107 cases)
  • Photography & Realism (102 cases)
  • Illustration & Storytelling (87 cases)
  • Product & E-commerce (54 cases)
  • Infographic & Charts (42 cases)
  • Brand & Identity (20 cases)
  • Architecture & Scenes (15 cases)
  • Other (32 cases)

> Quality filter: Removed 490 low-quality cases (empty prompts, duplicates, near-duplicates). Original data preserved in references/data/.
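
As a quick consistency check, the per-category counts listed above add up to the stated library size, and that size matches the 1,322 raw cases cited in the project summary minus the 490 removed ones:

```python
category_counts = {
    "Portrait & People": 193,
    "Poster & Typography": 180,
    "Social Media & UI": 107,
    "Photography & Realism": 102,
    "Illustration & Storytelling": 87,
    "Product & E-commerce": 54,
    "Infographic & Charts": 42,
    "Brand & Identity": 20,
    "Architecture & Scenes": 15,
    "Other": 32,
}

total = sum(category_counts.values())
raw_cases, removed = 1322, 490

print(total)                # 832
print(raw_cases - removed)  # 832
```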

> Deduplication script: python3 scripts/deduplicate_cases.py to regenerate after updating reference data.

Search cases: python3 scripts/search_cases.py "<keywords>" --limit 5

Browse by category: cases/<category>/README.md

Master index: cases/INDEX.md

Usage

The skill is exposed via a CLI script scripts/vision_cli.py.

Prerequisites

Environment variables must be set in config.txt (or system environment):

  • VISION_API_BASE_URL, VISION_API_KEY, VISION_MODEL (for vision recognition)
  • IMAGE_API_BASE_URL, IMAGE_API_KEY, IMAGE_MODEL (for image generation)
  • IMAGE_UPLOAD_MODE: Choose cos or base64 (default: cos)
  • If IMAGE_UPLOAD_MODE=cos, also set: COS_SECRET_ID, COS_SECRET_KEY, COS_REGION, COS_BUCKET_NAME
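
A pre-flight check for these settings could look like the sketch below. `missing_settings` is a hypothetical helper; it inspects only a mapping such as `os.environ` (parsing config.txt is not shown because its format is not described here), and it checks both the recognition and generation variables even though a single task needs only one set.

```python
import os
from typing import List, Mapping

BASE_VARS = ("VISION_API_BASE_URL", "VISION_API_KEY", "VISION_MODEL",
             "IMAGE_API_BASE_URL", "IMAGE_API_KEY", "IMAGE_MODEL")
COS_VARS = ("COS_SECRET_ID", "COS_SECRET_KEY", "COS_REGION", "COS_BUCKET_NAME")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required settings that are unset or empty."""
    mode = env.get("IMAGE_UPLOAD_MODE", "cos").lower()  # documented default: cos
    if mode not in ("cos", "base64"):
        return ["IMAGE_UPLOAD_MODE"]
    required = BASE_VARS + (COS_VARS if mode == "cos" else ())
    return [name for name in required if not env.get(name)]
```

An empty list means every setting needed for the chosen upload mode is present.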

Commands

Vision Recognition

# Basic Usage
python3 scripts/vision_cli.py recognize <image_path> --prompt "Describe this image"

# Using Presets (--format)
# Available formats: invoice, contract, form, slide, whiteboard, table, json, key_value, markdown_note, qa_pairs, code, ocr, analysis, wechat_chat
python3 scripts/vision_cli.py recognize ./invoice.jpg --format json
python3 scripts/vision_cli.py recognize ./screenshot.png --format code

# Batch recognition
python3 scripts/vision_cli.py recognize ./a.jpg ./b.jpg ./c.jpg --format table --wait --output ./batch_result.json

# Quality mode and retry
python3 scripts/vision_cli.py recognize ./contract.png --format contract --quality high --retry 3 --wait

# Wait for result and save to file
python3 scripts/vision_cli.py recognize ./doc.jpg --format ocr --wait --output ./result.txt

# WeChat chat screenshot recognition
python3 scripts/vision_cli.py recognize ./wechat_screenshot.png --format wechat_chat --wait

Image Generation

# Text to Image with Style Presets (--style)
# Available styles: ppt, business_flat, cartoon, tech_isometric, hand_drawn, icon, photo, anime, sketch
python3 scripts/vision_cli.py generate "A cyberpunk city" --style anime

# Image to Image
python3 scripts/vision_cli.py generate "Make it snowy" --ref <image_path>

# Sequential Generation
python3 scripts/vision_cli.py generate "A story about a cat" --seq 4 --style cartoon

# Wait for result and save image
python3 scripts/vision_cli.py generate "App icon for a camera" --style icon --wait --output ./icon.png

# Quality mode and retry
python3 scripts/vision_cli.py generate "A SaaS architecture illustration" --style tech_isometric --quality high --retry 3 --wait

Check Status

python3 scripts/vision_cli.py status <task_id>
# Or save result if completed
python3 scripts/vision_cli.py status <task_id> --output ./final_result.png

5. PPT Generation (⚠️ Interactive Confirmation Workflow)

Generate visually striking 16:9 PPT slides from a Markdown outline + visual style, then package into a .pptx file. Uses the existing IMAGE_API_* model (does NOT require gpt-image-2).

30+ built-in styles in styles/ covering tech, business, academic, creative, and industry-specific aesthetics. See docs/distilled-styles.md for visual previews.

🚨 Mandatory Interactive Workflow

┌──────────────────────────────────────────────────────────────┐
│  STEP 1: GATHER — Ask user: content/audience/page count?     │
│  STEP 2: RECOMMEND — Suggest 1-2 styles based on scenario    │
│  STEP 3: DRAFT — Write slides_plan.md, show user for review  │
│  STEP 4: CONFIRM PLAN — User MUST approve content            │
│  STEP 5: SMOKE TEST — Generate 1 slide (cover) to verify     │
│  STEP 6: CONFIRM STYLE — User approves style → generate all  │
│  STEP 7: DELIVER — Package PPTX, tell user the path          │
└──────────────────────────────────────────────────────────────┘

⚠️ NEVER skip confirmation steps. Information insufficiency = ASK, never guess.


STEP 1 — GATHER (信息收集)

Ask the user these questions. Use AskUserQuestion when choices are involved:

  1. Content/Topic: What is this presentation about?
  2. Audience: Who will see it? (executives, investors, students, general public)
  3. Page Count: How many slides approximately?
  4. Style Preference: Any visual preference? Tech-dark / business-clean / academic / creative / hand-drawn?
  5. Template: Does the user have an existing .pptx template to mimic? (optional)

If any of items 1-4 are unclear, ASK before proceeding.


STEP 2 — RECOMMEND (风格推荐)

Based on the topic and audience, recommend 2-3 styles:

python3 scripts/vision_cli.py ppt list-styles --format json

Style selection guide:

  • Tech/AI/DevTools: dark-aurora, gradient-glass, data-science-consulting
  • Business/Pitch/Strategy: clean-tech-blue, editorial-mono, investment-company-business-plan, eco-green-business-plan
  • Academic/Thesis/Report: swiss-grid, geometric-duotone-thesis, final-year-project-thesis-defense
  • Creative/Brand/Culture: creative-agency, flowery, japanese-wabi, vector-illustration
  • Workshop/Training: hand-sketch, mind-maps-workshop-professional

STEP 3 — DRAFT (草拟大纲)

Write slides_plan.md following this format:

---
title: <Presentation Title>
---

## 1. [cover] Title Line
Subtitle or tagline

## 2. [content] Section Title
- Key point 1
- Key point 2

## 3. [data] Key Metrics
- Metric A: 85%
- Metric B: 3.2x

Rules:

  • Each ## N. heading = one slide
  • [page_type]: cover / content / data (default: content)
  • Heading line text becomes the slide title; body becomes content
  • Present the md to user for review. Do NOT proceed until confirmed.
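
To make the format rules concrete, here is a minimal parser sketch for slides_plan.md. It is not the implementation behind `ppt plan`: `parse_slides_plan` and the dictionary keys it returns are hypothetical, and the JSON schema that the real command emits is not described here.

```python
import re
from typing import Dict, List

# "## N. [page_type] Title", where the [page_type] tag is optional.
HEADING = re.compile(r"^##\s+(\d+)\.\s*(?:\[(\w+)\]\s*)?(.*)$")


def parse_slides_plan(markdown: str) -> List[Dict[str, object]]:
    """Split a slides_plan.md document into one dict per slide."""
    slides: List[Dict[str, object]] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            number, page_type, title = match.groups()
            slides.append({"index": int(number),
                           "page_type": page_type or "content",  # default type
                           "title": title.strip(),
                           "body": []})
        elif slides and line.strip():
            # Non-empty lines after a heading belong to that slide's body.
            slides[-1]["body"].append(line.strip())
    return slides
```

Lines before the first `## N.` heading, such as the title front matter, are ignored by this sketch.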

STEP 4 — CONFIRM PLAN (确认大纲)

Show the user the slides_plan.md content and ask:

  • "Content looks correct? Any changes to titles, body, or slide order?"
  • Make edits as requested until user says OK.

Then convert to JSON:

python3 scripts/vision_cli.py ppt plan slides_plan.md -o slides_plan.json

STEP 5 — SMOKE TEST (冒烟测试)

Generate ONLY the first slide (cover) to verify style before full generation:

python3 scripts/vision_cli.py ppt generate \
  --plan slides_plan.json \
  --style dark-aurora \
  --slides 1

Review the output with the user:

  • "Does this visual style work? Adjust style or proceed with all slides?"

STEP 6 — CONFIRM STYLE & GENERATE (确认风格→全量生成)

Once user approves, generate all remaining slides:

python3 scripts/vision_cli.py ppt generate \
  --plan slides_plan.json \
  --style dark-aurora

STEP 7 — DELIVER (交付)

Tell the user:

✅ PPT 已生成!
- 输出目录: outputs/<timestamp>/
- PPTX 文件: outputs/<timestamp>/<title>.pptx
- 每页图片: outputs/<timestamp>/images/

PPT Commands Summary

| Command | Description |
|---|---|
| vision_cli.py ppt list-styles | List all available styles |
| vision_cli.py ppt plan <slides_plan.md> | Convert slides_plan.md → slides_plan.json |
| vision_cli.py ppt generate --plan <plan.json> --style <style> | Generate slides |
| vision_cli.py ppt generate ... --slides 1 | Smoke test (first slide only) |
| vision_cli.py ppt generate ... --slides 1,3,5 | Generate specific slides only |
| vision_cli.py ppt sessions | List generation history |

Style Preview

To see visual previews of all styles, read docs/distilled-styles.md. Each style's full prompt template is in styles/<style>.md.

Task Management

All tasks are executed asynchronously by default.

  • Use --wait flag to block until completion (useful for Agent workflow).
  • Use --output flag to automatically save text or download images.
  • Task data is stored in .tasks/ directory.
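
For callers that poll instead of using --wait, the loop might look like the following sketch. `wait_for_task` is a hypothetical helper; the `status` key and its `completed` / `failed` values are assumptions, and `get_status` stands in for whatever reads the task state (for example, a wrapper around the `status <task_id>` command).

```python
import time
from typing import Callable, Dict


def wait_for_task(get_status: Callable[[], Dict], interval: float = 2.0,
                  timeout: float = 300.0, sleep=time.sleep) -> Dict:
    """Poll get_status() until the task reaches an assumed terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)
```

Injecting `sleep` keeps the loop easy to exercise in tests without real delays.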

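Version history

2 versions in total

  • v1.0.1 (current): v2.1 adds a PPT generation engine with 30+ built-in styles, Markdown-to-PPTX conversion, precise slide_spec editing, template cloning, external image insertion, and version rollback. v2.2 improves reference-image roles: each of several reference images can be given its own role (background, product, style, and so on), blended naturally into the prompt instead of a generic "the reference image is xxx".
    2026-06-27 08:50 · security scans: safe, safe
  • v1.0.0
    2026-05-03 03:55 · security scans: safe, safe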

Security checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
