← 返回
未分类

Vision Helper — AI Image Analysis

Analyze images using local or cloud vision models via Ollama to identify content, UI elements, screenshots, or extract text with OCR support.
使用 Ollama 本地或云端视觉模型分析图像,识别内容、UI 元素和截图,支持 OCR 提取文本。
ravenquasar ravenquasar 来源
未分类 clawhub v1.0.0 1 版本 99882.6 Key: 无需
★ 0
Stars
📥 851
下载
💾 0
安装
1
版本
#latest

概述

📸 Vision Helper — Image Analysis

Analyze images using vision models via Ollama, with extended timeout support for cloud-based models.

Why Not Use the Built-in image Tool?

The built-in image tool has limited timeout settings that cause failures with cloud vision models (which often need 40–120 seconds). This skill calls the Ollama API directly with a 180-second timeout, supporting both local and cloud models reliably.

It also bypasses the built-in tool's file path restrictions, allowing analysis of images from any readable directory.

Usage

Basic

# Analyze an image (default: English description)
python3 <skill-dir>/scripts/analyze_image.py <image_path>

# With a custom prompt
python3 <skill-dir>/scripts/analyze_image.py <image_path> "Is this a chess game? Describe the board state"

# With a specific model
python3 <skill-dir>/scripts/analyze_image.py <image_path> "Describe content" kimi-k2.5:cloud

> resolves to your OpenClaw skill installation directory, typically ~/.openclaw/workspace/skills/vision-helper/.

In Conversation

When you need to analyze an image, use the exec tool:

exec: python3 <skill-dir>/scripts/analyze_image.py /path/to/image.png "What do you see?"

Important: Set exec timeout to 120–180 seconds, as cloud vision models are slow.

Screenshot + Analysis Workflow

Option A: Browser screenshot → analyze

1. browser(action="screenshot") → get screenshot path (MEDIA: xxx)
2. exec("<skill-dir>/scripts/analyze_image.py <screenshot_path> 'Describe this UI'")
3. Act on the analysis result

Option B: Desktop screenshot → analyze

macOS:

1. exec("screencapture -x /tmp/screen.png")
2. exec("<skill-dir>/scripts/analyze_image.py /tmp/screen.png 'Describe the desktop'")

Linux:

1. exec("gnome-screenshot -f /tmp/screen.png")
   — or —
   exec("import /tmp/screen.png")  # ImageMagick
   — or —
   exec("scrot /tmp/screen.png")
2. exec("<skill-dir>/scripts/analyze_image.py /tmp/screen.png 'Describe the desktop'")

Option C: Game/App UI → analyze → act

1. Screenshot the current screen
2. Use vision-helper to identify UI elements, buttons, text
3. Execute clicks/input based on the analysis

Environment Variables

VariableDefaultDescription
--------------------------------
VISION_MODELgemma4:31bDefault vision model
VISION_TIMEOUT180Request timeout in seconds
OLLAMA_API_URLhttp://localhost:11434/api/chatOllama API endpoint

Supported Models

ModelVisionSpeedRecommendation
--------------------------------------
gemma4:31bLocal, fastPrimary (privacy, no API needed)
kimi-k2.6:cloud40–120s🔬 Advanced (high quality, cloud)
kimi-k2.5:cloud40–90sAlternative cloud option
qwen3.5:cloud30–60sFast cloud recognition
qwen3.5:397b-cloud40–90sHigh quality cloud
gemma4:31bLocal, fastPrivacy-first (runs offline)

Note: Cloud models require the model to be available in your Ollama instance. Use VISION_MODEL env var to switch.

FAQ

Q: Can I use the built-in image tool instead?

A: It works for local models but will time out on cloud vision models. Always prefer this skill's script for reliable results.

Q: What image formats are supported?

A: PNG, JPG, JPEG, GIF, WebP, BMP, TIFF, SVG. Maximum file size: 20 MB.

Q: Where should I save screenshots?

A: Any readable directory works — /tmp/, your workspace, etc. This script has no path restrictions.

Q: How do I use a Chinese prompt?

A: Pass it as the second argument: python3 /scripts/analyze_image.py /tmp/img.png "请描述这张图片的内容"

Automation Ideas

  • Game automation: Screenshot → analyze game state → decide next action
  • Browser verification: Screenshot → verify page loaded correctly
  • Desktop monitoring: Periodic screenshots → detect changes
  • UI testing: Screenshot → verify rendered output
  • OCR: Extract text content from images

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-05-07 04:10 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

design-media

Openai Whisper

steipete
使用 Whisper CLI 进行本地语音转文字(无需 API 密钥)
★ 329 📥 92,815
design-media

Nano Banana Pro

steipete
使用 Nano Banana Pro (Gemini 3 Pro Image) 生成或编辑图像。支持文生图、图生图及 1K/2K/4K 分辨率,适用于图像创建、修改及编辑请求,使用 --input-image 指定输入图像。
★ 424 📥 116,138
design-media

UI/UX Pro Max

xobi667
提供 UI/UX 设计智能与实现指导,帮助打造精美界面。适用于 UI 设计、UX 流程、信息架构、视觉风格、设计系统/标记、组件规格、文案/微文案、无障碍及前端 UI(HTML/CSS/JS、React、Next.js、Vue、Svelte
★ 216 📥 46,492