This skill provides capabilities for visual recognition, image generation, and PPT generation. It supports configurable vision models (OpenAI-compatible APIs), flexible image upload modes (COS or BASE64), and executes tasks asynchronously. It also includes a quality-filtered GPT Image 2 case library (832 curated cases) for prompt engineering reference, bilingual prompt output (Chinese + English) for ChatGPT and 豆包 compatibility, and 30+ built-in PPT styles for slide generation.
Analyze images to describe content, extract text (OCR), or answer questions about the image.
Depending on IMAGE_UPLOAD_MODE, the image is uploaded to COS (URL mode) or converted to base64.
Generate images from text prompts, optionally using reference images. CRITICAL: The workflow below is mandatory. Never skip steps.
┌──────────────────────────────────────────────────────────────┐
│ STEP 0: DETECT — Check if reference image is REQUIRED │
│ STEP 1: UNDERSTAND — Confirm requirements with user │
│ STEP 1.5: FIDELITY — Choose strict/normal/creative mode │
│ STEP 2: SEARCH — Find 1-2 inspiration cases, do not copy │
│ STEP 3: DRAFT — Compose a concise bilingual creative brief │
│ STEP 4: CONFIRM — User MUST approve before generation │
│ STEP 5: ITERATE — Refine until user says "generate" │
│ STEP 6: EXECUTE — Generate image via vision_cli.py │
└──────────────────────────────────────────────────────────────┘
Before any requirement gathering, first determine if this task REQUIRES a reference image. If it does and the user hasn't provided one, BLOCK immediately and ask for it.
| Scenario | Requirement | Examples |
| ---------- | :-----------: | ---------- |
| Photo Optimization/Enhancement | 🔴 REQUIRED | "优化这张照片", "enhance this photo", "improve image quality" |
| Photo Style Transfer | 🔴 REQUIRED | "把这张照片变成动漫风格", "make this photo look like oil painting" |
| Photo Editing/Retouching | 🔴 REQUIRED | "去掉背景", "remove background", "修复这张图", "fix this image" |
| Face/Person Swap or Edit | 🔴 REQUIRED | "换脸", "change the person's hair", "add sunglasses to this person" |
| Product Photo Variation | 🔴 REQUIRED | "给这个产品换一个背景", "change this product's packaging color" |
| Image-to-Image (图生图) | 🔴 REQUIRED | "基于这张图生成", "use this as reference", "--ref" |
| Style Reference ("like this") | 🔴 REQUIRED | "像这张图一样的风格", "similar style to this image" |
| Screenshot/UI Modification | 🔴 REQUIRED | "改一下这个界面", "redesign this screen" |
| Image Restoration | 🔴 REQUIRED | "修复老照片", "restore this old photo", "去水印" |
| Sequential Character Consistency | 🟡 OPTIONAL | "生成同一个角色的多张图" (reference helps consistency) |
| Pure Text-to-Image Creation | 🟢 NOT REQUIRED | "画一只猫", "generate a cyberpunk city", "create a poster" |
| Brand/Logo Design from Scratch | 🟢 NOT REQUIRED | "设计一个logo", "create a brand identity" |
| UI Mockup from Description | 🟢 NOT REQUIRED | "设计一个APP界面", "create a dashboard mockup" |
| Infographic from Data | 🟢 NOT REQUIRED | "生成一张信息图", "make a chart about..." |
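Block message for --ref mode:
⚠️ This task needs a reference image before it can continue.
Your request falls under [specific scenario, e.g. photo style optimization], which requires a reference image.
Please provide the path to the reference image, for example:
- Local file: /path/to/your/image.jpg
- URL: https://example.com/image.png
Once you provide it, I will continue.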
When user provides 2+ reference images but not their roles, BLOCK and ask:
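⚠️ I noticed you provided multiple reference images. To get a better blended result, please tell me the role of each image:
- Image 1 (file: xxx.jpg): what should this image be a reference for?
- Image 2 (file: yyy.png): what should this image be a reference for?
Common role types:
• 🏞️ Background scene: keep the composition and lighting as the backdrop
• 📦 Product element: keep the shape and color as the main subject
• 🎨 Style reference: keep the tones and brushwork as the style baseline
• 🌈 Color reference: keep the color pairing and light/dark relationships
• 🧍 Character pose: keep the action, angle, and proportions
• 🖼️ Composition template: keep the layout, layering, and visual flow
• 🪵 Material reference: keep the texture, grain, and gloss
You can also describe each image's purpose directly, for example: "Image 1 is the warm autumn street background, keeping its composition and lighting; Image 2 is the product element, keeping its shape and material".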
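Do NOT proceed past STEP 0 until every required reference image has been provided and, when there are 2+ reference images, the role of each one is confirmed.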
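The STEP 0 checks above can be sketched as a small gate function. This is only an illustration of the decision order, not code from the skill: the function name `step0_gate`, the requirement constants, and the returned action strings are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical labels mirroring the 🔴 / 🟡 / 🟢 column of the table.
REQUIRED, OPTIONAL, NOT_REQUIRED = "required", "optional", "not_required"


def step0_gate(requirement: str, refs: List[str],
               roles: Optional[Dict[str, str]] = None) -> str:
    """Return the next action for STEP 0: a blocking question or 'proceed'."""
    roles = roles or {}
    # A required reference that is missing blocks everything else.
    if requirement == REQUIRED and not refs:
        return "ask_for_reference"
    # Two or more references need an explicit role for each image.
    if len(refs) >= 2 and any(not roles.get(path) for path in refs):
        return "ask_for_roles"
    return "proceed"


print(step0_gate(REQUIRED, []))                  # ask_for_reference
print(step0_gate(OPTIONAL, ["a.jpg", "b.png"]))  # ask_for_roles
print(step0_gate(NOT_REQUIRED, []))              # proceed
```

Checking for a missing reference first keeps the user from being asked about image roles before any image exists.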
STEP 1 — UNDERSTAND (需求确认):
Once reference requirements are resolved, confirm the following. Use AskUserQuestion when multiple options exist:
For --ref mode, also confirm what to keep from the reference image and what to change.
Next (STEP 1.5), ask the user to choose a fidelity mode. This is critical: the mode determines how much AI creativity is allowed and whether cases are referenced.
Use AskUserQuestion to present the three options:
| Mode | Label | Behavior | Best For |
| ------ | ------- | ---------- | ---------- |
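| 🔒 Strict | 严格模式 | Text content strictly follows the user's original wording; the AI must not modify, invent, or rewrite any text. Visual aspects such as lighting, composition, and image quality are optimized as usual. The case library is referenced as usual. | Brand copy, UI screens, poster text, ad slogans, and other cases that need exact text, especially Chinese content, which AI tends to alter. |
| 📋 Normal | 普通模式 | The user's requirements come first; cases serve only as inspiration; avoid piling up photography/rendering jargon. | The default choice for most image generation scenarios. |
| 🎨 Creative | 创意模式 | Treats the requirements as a direction rather than a checklist to execute mechanically; the AI may rework composition, materials, lighting, and narrative relationships. Cases serve only as inspiration. | Art illustration, concept design, abstract expression, style exploration, and other creative scenarios. |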
--fidelity strict # Text exactly as specified, visual quality optimized
--fidelity normal # Balanced, case-referenced (default)
--fidelity creative # Maximum artistic freedom
The key insight: AI image generators often hallucinate or alter text content (Chinese text especially). Strict mode enforces text accuracy while still optimizing visual quality. Conversely, creative mode unleashes full artistic potential when text precision is not a priority.
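One way to make the strict-mode guarantee checkable before generation is to verify that every user-supplied string still appears verbatim in the drafted prompt. The sketch below assumes the draft is available as plain text; the helper name `missing_verbatim_text` is hypothetical and is not part of the skill's scripts.

```python
from typing import List


def missing_verbatim_text(user_texts: List[str], draft_prompt: str) -> List[str]:
    """Return the user-supplied strings that do not appear verbatim in the draft."""
    return [text for text in user_texts if text not in draft_prompt]


draft = 'Poster with the headline "新品上市" and the tagline "Fresh every day".'
print(missing_verbatim_text(["新品上市", "Fresh every day"], draft))  # []
print(missing_verbatim_text(["新品发布"], draft))  # ['新品发布']
```

A non-empty result means the draft altered or dropped required text and should be revised before it is shown to the user.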
STEP 2 — SEARCH (案例检索):
Once requirements are confirmed, search the unified case library (832 quality-filtered cases):
python3 scripts/search_cases.py "<keywords>" --category "<category>" --style "<style>" --limit 5
Use 1-2 relevant cases only as inspiration signals. Present case titles/categories if useful, but do not paste long case prompts into the final generation prompt and do not force the model to reproduce a case. The user's brief and any supplied reference image take priority over the case library.
STEP 3 — DRAFT (智能提示词生成):
Use the intelligent prompt builder engine, which integrates the methodology of the three source projects. It outputs bilingual (Chinese + English) prompts by default for ChatGPT and 豆包 compatibility:
python3 scripts/prompt_builder.py \
--subject "<subject>" \
--style "<style>" \
--purpose "<purpose>" \
--category "<category>" \
--ratio "<ratio>" \
--mood "<mood>" \
--composition "<composition>" \
--colors "<palette>" \
--text "<text_content>" \
--constraints "<constraints>" \
--ref "<reference_image_path_or_url>" \
--lang "<zh/en/ja>" \
--fidelity "<strict|normal|creative>" \
--json-output
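When the command above is assembled programmatically, options the user never specified are probably best left out rather than passed as empty strings. The following is a minimal sketch of that idea; the helper `prompt_builder_argv` is hypothetical, the option values are illustrative, and only flag names shown in the command above are used.

```python
import shlex
from typing import Dict, List, Optional


def prompt_builder_argv(options: Dict[str, Optional[str]],
                        json_output: bool = True) -> List[str]:
    """Build an argument list for scripts/prompt_builder.py, skipping unset options."""
    argv = ["python3", "scripts/prompt_builder.py"]
    for name, value in options.items():
        if value:  # omit options the user did not specify
            argv += [f"--{name}", value]
    if json_output:
        argv.append("--json-output")
    return argv


argv = prompt_builder_argv({
    "subject": "a ginger cat on a windowsill",
    "style": "hand_drawn",
    "ratio": "16:9",
    "mood": None,  # left out of the command
    "fidelity": "normal",
})
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` instead of joining it into one string avoids shell quoting problems with prompts that contain quotes or spaces.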
The engine automatically:
Prompt Quality Principles (from quality-filtered case analysis):
- Use {argument} for flexible parameters when the user needs variants
STEP 4 — CONFIRM (用户确认):
Present the draft prompt to the user with:
- For --ref mode: confirm the reference image path and transformation details
⚠️ NEVER generate until the user explicitly confirms (e.g., "生成", "generate", "ok", "yes").
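The explicit-confirmation rule can be illustrated with a deliberately strict check that accepts only a reply consisting of a confirmation word. This is a sketch under stated assumptions: the word set contains just the four examples given above, and the function name `is_explicit_confirmation` is hypothetical.

```python
# Only the examples listed in the text; a real deployment may accept more.
CONFIRM_WORDS = {"生成", "generate", "ok", "yes"}


def is_explicit_confirmation(reply: str) -> bool:
    """True only when the whole reply is one of the confirmation words."""
    return reply.strip().strip("!.。!").lower() in CONFIRM_WORDS


print(is_explicit_confirmation("Generate!"))         # True
print(is_explicit_confirmation("make it brighter"))  # False
```

Matching the whole reply rather than a substring errs on the side of asking again: a message like "ok, but change the color" is treated as a change request, not as approval.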
STEP 5 — ITERATE (迭代优化):
If the user requests changes:
STEP 6 — EXECUTE (执行生图):
Only after user confirmation, execute. For --ref mode, the reference image is uploaded via COS (returns URL) or converted to BASE64 based on IMAGE_UPLOAD_MODE config.
# Pure Text-to-Image (no reference)
python3 scripts/vision_cli.py generate "<final_prompt>" --style <preset> --wait --output ./output.png
# Image-to-Image with reference (COS URL or BASE64 handled automatically by vision_cli.py)
python3 scripts/vision_cli.py generate "<final_prompt>" --ref <user_provided_image_path> --wait --output ./output.png
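For orientation, the BASE64 branch of the reference handling amounts to reading the file and encoding its bytes. The sketch below is not taken from vision_cli.py: the function names are hypothetical, it reads IMAGE_UPLOAD_MODE from the process environment only (not from config.txt), and it leaves out the COS upload and the API request format, neither of which is described here.

```python
import base64
import os
import tempfile


def upload_mode() -> str:
    """Read IMAGE_UPLOAD_MODE, falling back to the documented default 'cos'."""
    mode = os.environ.get("IMAGE_UPLOAD_MODE", "cos").lower()
    if mode not in ("cos", "base64"):
        raise ValueError(f"unsupported IMAGE_UPLOAD_MODE: {mode}")
    return mode


def encode_reference(path: str) -> str:
    """Return the file's bytes as a base64 string (the BASE64 branch)."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


# Demo with a throwaway file standing in for a reference image.
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"fake image bytes")
encoded = encode_reference(tmp.name)
os.unlink(tmp.name)
print(encoded)
```

Base64 output is roughly a third larger than the input, so COS mode may be the better fit for large reference images.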
The vision_cli.py script handles reference image upload automatically:
- COS mode (IMAGE_UPLOAD_MODE=cos): Uploads the reference image to COS and passes the COS URL to the API
- BASE64 mode (IMAGE_UPLOAD_MODE=base64): Converts the reference image to base64 and embeds it in the API request
Generate images from a text description. STEP 0 check: 🟢 reference NOT required unless user explicitly asks for style reference.
Generate images based on a reference image. STEP 0 check: 🔴 reference REQUIRED. In STEP 1, also confirm: what to keep from the reference, what to change. Reference image is uploaded via COS or BASE64 in STEP 6.
Generate a series of consistent images (e.g., storyboards). Each image follows the same workflow. Reference image for character consistency is 🟡 OPTIONAL but recommended.
Specialized recognition for WeChat chat screenshots:
Searchable library of GPT Image 2 prompts from 3 community projects, deduplicated and quality-filtered. Organized in 10 unified categories:
> Quality filter: Removed 490 low-quality cases (empty prompts, duplicates, near-duplicates). Original data preserved in references/data/.
> Deduplication script: python3 scripts/deduplicate_cases.py to regenerate after updating reference data.
Search cases: python3 scripts/search_cases.py "
Browse by category: cases/
Master index: cases/INDEX.md
The skill is exposed via a CLI script scripts/vision_cli.py.
Environment variables must be set in config.txt (or system environment):
- VISION_API_BASE_URL, VISION_API_KEY, VISION_MODEL (for vision recognition)
- IMAGE_API_BASE_URL, IMAGE_API_KEY, IMAGE_MODEL (for image generation)
- IMAGE_UPLOAD_MODE: Choose cos or base64 (default: cos)
- When IMAGE_UPLOAD_MODE=cos, also set: COS_SECRET_ID, COS_SECRET_KEY, COS_REGION, COS_BUCKET_NAME
# Basic Usage
python3 scripts/vision_cli.py recognize <image_path> --prompt "Describe this image"
# Using Presets (--format)
# Available formats: invoice, contract, form, slide, whiteboard, table, json, key_value, markdown_note, qa_pairs, code, ocr, analysis, wechat_chat
python3 scripts/vision_cli.py recognize ./invoice.jpg --format json
python3 scripts/vision_cli.py recognize ./screenshot.png --format code
# Batch recognition
python3 scripts/vision_cli.py recognize ./a.jpg ./b.jpg ./c.jpg --format table --wait --output ./batch_result.json
# Quality mode and retry
python3 scripts/vision_cli.py recognize ./contract.png --format contract --quality high --retry 3 --wait
# Wait for result and save to file
python3 scripts/vision_cli.py recognize ./doc.jpg --format ocr --wait --output ./result.txt
# WeChat chat screenshot recognition
python3 scripts/vision_cli.py recognize ./wechat_screenshot.png --format wechat_chat --wait
# Text to Image with Style Presets (--style)
# Available styles: ppt, business_flat, cartoon, tech_isometric, hand_drawn, icon, photo, anime, sketch
python3 scripts/vision_cli.py generate "A cyberpunk city" --style anime
# Image to Image
python3 scripts/vision_cli.py generate "Make it snowy" --ref <image_path>
# Sequential Generation
python3 scripts/vision_cli.py generate "A story about a cat" --seq 4 --style cartoon
# Wait for result and save image
python3 scripts/vision_cli.py generate "App icon for a camera" --style icon --wait --output ./icon.png
# Quality mode and retry
python3 scripts/vision_cli.py generate "A SaaS architecture illustration" --style tech_isometric --quality high --retry 3 --wait
python3 scripts/vision_cli.py status <task_id>
# Or save result if completed
python3 scripts/vision_cli.py status <task_id> --output ./final_result.png
Generate visually striking 16:9 PPT slides from a Markdown outline + visual style, then package into a .pptx file. Uses the existing IMAGE_API_* model (does NOT require gpt-image-2).
30+ built-in styles in styles/ covering tech, business, academic, creative, and industry-specific aesthetics. See docs/distilled-styles.md for visual previews.
┌──────────────────────────────────────────────────────────────┐
│ STEP 1: GATHER — Ask user: content/audience/page count? │
│ STEP 2: RECOMMEND — Suggest 1-2 styles based on scenario │
│ STEP 3: DRAFT — Write slides_plan.md, show user for review │
│ STEP 4: CONFIRM PLAN — User MUST approve content │
│ STEP 5: SMOKE TEST — Generate 1 slide (cover) to verify │
│ STEP 6: CONFIRM STYLE — User approves style → generate all │
│ STEP 7: DELIVER — Package PPTX, tell user the path │
└──────────────────────────────────────────────────────────────┘
⚠️ NEVER skip confirmation steps. Information insufficiency = ASK, never guess.
Ask the user these questions. Use AskUserQuestion when choices are involved:
If any of items 1-4 are unclear, ASK before proceeding.
Based on the topic and audience, recommend 2-3 styles:
python3 scripts/vision_cli.py ppt list-styles --format json
Style selection guide:
- dark-aurora, gradient-glass, data-science-consulting
- clean-tech-blue, editorial-mono, investment-company-business-plan, eco-green-business-plan
- swiss-grid, geometric-duotone-thesis, final-year-project-thesis-defense
- creative-agency, flowery, japanese-wabi, vector-illustration
- hand-sketch, mind-maps-workshop-professional
Write slides_plan.md following this format:
---
title: <Presentation Title>
---
## 1. [cover] Title Line
Subtitle or tagline
## 2. [content] Section Title
- Key point 1
- Key point 2
## 3. [data] Key Metrics
- Metric A: 85%
- Metric B: 3.2x
Rules:
- Each ## N. heading = one slide
- [page_type]: cover / content / data (default: content)
Show the user the slides_plan.md content and ask:
Then convert to JSON:
python3 scripts/vision_cli.py ppt plan slides_plan.md -o slides_plan.json
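To make the plan format concrete, here is a rough sketch of how the slides_plan.md layout could be parsed. It is not the implementation behind `ppt plan`, and the output keys (`title`, `slides`, `index`, `page_type`, `body`) are a hypothetical schema chosen for illustration; the real slides_plan.json layout may differ.

```python
import json
import re

# "## N. [page_type] Title", with the [page_type] tag optional.
HEADING = re.compile(r"^##\s+(\d+)\.\s+(?:\[(\w+)\]\s*)?(.*)$")


def parse_plan(markdown: str) -> dict:
    """Parse the slides_plan.md layout into a dict (hypothetical schema)."""
    plan = {"title": None, "slides": []}
    in_front_matter = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped == "---":
            # Only treat '---' as front matter before the first slide.
            if not plan["slides"]:
                in_front_matter = not in_front_matter
            continue
        if in_front_matter:
            if stripped.startswith("title:"):
                plan["title"] = stripped[len("title:"):].strip()
            continue
        match = HEADING.match(stripped)
        if match:
            number, page_type, title = match.groups()
            plan["slides"].append({
                "index": int(number),
                "page_type": page_type or "content",  # documented default
                "title": title.strip(),
                "body": [],
            })
        elif stripped and plan["slides"]:
            # Strip a leading bullet marker; keep other lines as written.
            plan["slides"][-1]["body"].append(re.sub(r"^-\s+", "", stripped))
    return plan


sample = """---
title: Quarterly Review
---
## 1. [cover] Quarterly Review
Results and outlook
## 2. Highlights
- Revenue up
## 3. [data] Key Metrics
- Metric A: 85%
"""
print(json.dumps(parse_plan(sample), indent=2))
```

The second slide in the sample has no tag, so it falls back to the content page type, matching the default stated in the rules.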
Generate ONLY the first slide (cover) to verify style before full generation:
python3 scripts/vision_cli.py ppt generate \
--plan slides_plan.json \
--style dark-aurora \
--slides 1
Review the output with the user:
Once user approves, generate all remaining slides:
python3 scripts/vision_cli.py ppt generate \
--plan slides_plan.json \
--style dark-aurora
Tell the user:
✅ The PPT has been generated!
- Output directory: outputs/<timestamp>/
- PPTX file: outputs/<timestamp>/<title>.pptx
- Per-slide images: outputs/<timestamp>/images/
| Command | Description |
| --------- | ------------- |
| vision_cli.py ppt list-styles | List all available styles |
| vision_cli.py ppt plan | Convert slides_plan.md → slides_plan.json |
| vision_cli.py ppt generate --plan | Generate slides |
| vision_cli.py ppt generate ... --slides 1 | Smoke test (first slide only) |
| vision_cli.py ppt generate ... --slides 1,3,5 | Generate specific slides only |
| vision_cli.py ppt sessions | List generation history |
To see visual previews of all styles, read docs/distilled-styles.md. Each style's full prompt template is in styles/.
All tasks are executed asynchronously by default.
- Use the --wait flag to block until completion (useful for Agent workflow).
- Use the --output flag to automatically save text or download images.
- Task records are kept in the .tasks/ directory.
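Without --wait, a caller has to poll for the result. The sketch below shows one generic polling loop with a timeout. It is an assumption-laden illustration: `wait_for_task` and its `check` callable are hypothetical, and because the output format of the status command is not described here, the status lookup is represented by a stand-in.

```python
import time
from typing import Callable, Optional


def wait_for_task(check: Callable[[], Optional[str]],
                  interval: float = 2.0, timeout: float = 60.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Call check() until it returns a result or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)


# Stand-in for a real status lookup: finishes on the third poll.
polls = iter([None, None, "done"])
print(wait_for_task(lambda: next(polls), sleep=lambda _: None))  # done
```

In practice `check` would wrap a call to the status command and return None while the task is still running.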