← 返回
未分类 Key

Screen Vision

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an...
AI屏幕视觉与桌面控制技能,让AI智能体能够看到屏幕、理解界面元素,并自主执行鼠标和键盘操作。
guitu917 guitu917 来源
未分类 clawhub v1.1.0 1 版本 100000 Key: 需要
★ 0
Stars
📥 709
下载
💾 3
安装
1
版本
#latest

概述

Screen Vision

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials.

You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL.

See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

  1. Take screenshotbash scripts/platform/screenshot.sh [output_path] [display]
  2. Analyze with AIpython3 scripts/vision/analyze.py --image --task ""
  3. Execute actionpython3 scripts/platform/execute.py --action [options]
  4. Full task looppython3 scripts/core/run_task.py --task ""

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

PlatformScreenshotMouse/KeyboardNotes
--------------------------------------------
LinuxscrotxdotoolHeadless: XFCE4 + VNC
macOSscreencapturecliclickNeeds Accessibility permission
WindowspyautoguipyautoguiNo extra setup needed

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

ModelProviderCost/TaskQuality
-------------------------------------
Qwen3-VL-32BSiliconFlowLow★★★★
GLM-4V-PlusZhipu BigModelLow★★★★
GPT-5.4-MiniOpenAI / relaysMedium★★★★★
GPT-5.4 CUAOpenAIHigh★★★★★
Llama 3.2 VisionOllama (local)Free★★

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

  • click — Click at (x, y). Supports left/right/double-click.
  • type — Type text string.
  • key — Press a key (Return, Tab, Escape, etc.).
  • scroll — Scroll up or down.
  • drag — Drag from (x1,y1) to (x2,y2).
  • wait — Wait for screen to update.
  • done — Task complete.
  • failed — Cannot complete task.

Safety

  • Blocked: rm -rf, format disk, shutdown, drop database, etc.
  • Confirmation required: delete, sudo, payment-related operations
  • Limits: max 5 minutes, max 100 actions per task
  • Logging: all screenshots saved to /tmp/screen-vision/logs/
  • Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

VariableDefaultDescription
--------------------------------
SV_VISION_API_KEYVision API key
SV_VISION_BASE_URLAPI endpoint (required)
SV_VISION_MODELVision model name (required)
SV_DISPLAY:1X11 display (Linux)
SV_MAX_DURATION5Max task duration (min)
SV_MAX_ACTIONS100Max actions per task
SV_SCREENSHOT_INTERVAL1.0Seconds between screenshots

版本历史

共 1 个版本

  • v1.1.0 当前
    2026-05-03 08:13 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

ai-agent

Self-Improving + Proactive Agent

ivangdavila
自我反思+自我批评+自我学习+自组织记忆。智能体评估自身工作、发现错误并持续改进。
★ 1,418 📥 325,871
ai-agent

Agent Browser

rez0
用于 AI 代理的浏览器自动化 CLI。当用户需要与网站交互(包括浏览页面、填写表单、点击按钮、截图等)时使用。
★ 849 📥 329,378
design-media

Bian16 Wallpaper Downloader

guitu917
从彼岸图网(pic.netbian.com)下载4K手机壁纸,包括微信二维码登录、基于令牌的原图下载、3分钟限流...
★ 0 📥 433