Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.
Detect platform and install dependencies:
bash scripts/setup/setup-linux.sh --headless # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop # Linux with desktop
bash scripts/setup/setup-mac.sh # macOS
python scripts/setup/setup-win.py # Windows
Copy config.example.json to config.json and fill in your vision API credentials.
You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.
{
"vision": {
"baseUrl": "https://api.siliconflow.cn/v1",
"apiKey": "sk-your-key",
"model": "Qwen/Qwen3-VL-32B"
}
}
Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL.
See references/API_CONFIG.md for all supported providers and detailed setup.
The skill operates through a screenshot-analyze-action loop:
bash scripts/platform/screenshot.sh [output_path] [display]python3 scripts/vision/analyze.py --image --task "" python3 scripts/platform/execute.py --action [options] python3 scripts/core/run_task.py --task "" User task → run_task.py (orchestrator)
├── screenshot.sh (capture screen)
├── diff_check.py (detect changes, skip if unchanged → saves tokens)
├── analyze.py (send screenshot + task to vision API)
├── safety_check.py (block dangerous operations)
├── execute.py (xdotool/cliclick/pyautogui)
└── loop until done or timeout
| Platform | Screenshot | Mouse/Keyboard | Notes |
|---|---|---|---|
| ---------- | ----------- | ---------------- | ------- |
| Linux | scrot | xdotool | Headless: XFCE4 + VNC |
| macOS | screencapture | cliclick | Needs Accessibility permission |
| Windows | pyautogui | pyautogui | No extra setup needed |
See references/PLATFORM_GUIDE.md for platform-specific commands.
Supports any OpenAI-compatible vision API. You choose the provider and model.
| Model | Provider | Cost/Task | Quality |
|---|---|---|---|
| ------- | ---------- | ----------- | --------- |
| Qwen3-VL-32B | SiliconFlow | Low | ★★★★ |
| GLM-4V-Plus | Zhipu BigModel | Low | ★★★★ |
| GPT-5.4-Mini | OpenAI / relays | Medium | ★★★★★ |
| GPT-5.4 CUA | OpenAI | High | ★★★★★ |
| Llama 3.2 Vision | Ollama (local) | Free | ★★ |
See references/API_CONFIG.md for per-provider configuration examples.
No defaults are hardcoded — you must configure your own API credentials before use.
click — Click at (x, y). Supports left/right/double-click.type — Type text string.key — Press a key (Return, Tab, Escape, etc.).scroll — Scroll up or down.drag — Drag from (x1,y1) to (x2,y2).wait — Wait for screen to update.done — Task complete.failed — Cannot complete task./tmp/screen-vision/logs/See references/EXAMPLES.md for usage examples.
| Variable | Default | Description |
|---|---|---|
| ---------- | --------- | ------------- |
SV_VISION_API_KEY | — | Vision API key |
SV_VISION_BASE_URL | — | API endpoint (required) |
SV_VISION_MODEL | — | Vision model name (required) |
SV_DISPLAY | :1 | X11 display (Linux) |
SV_MAX_DURATION | 5 | Max task duration (min) |
SV_MAX_ACTIONS | 100 | Max actions per task |
SV_SCREENSHOT_INTERVAL | 1.0 | Seconds between screenshots |
共 1 个版本