Use this skill as a main-agent operating manual for desktop GUI tasks.
python3 <SKILL_DIR>/scripts/first_run_setup.py --check
If "ready": false, run setup (installs EVERYTHING automatically):
python3 <SKILL_DIR>/scripts/first_run_setup.py
Auto-installs on first run:
cliclick + tesseract (macOS via brew; Linux guide printed)
After setup, set $PY for ALL subsequent calls:
PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>
Do NOT proceed if setup is not ready.
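The gate above can be sketched in Python. This is a minimal sketch, assuming the `--check` output is a JSON object with a `ready` flag and an `env` object that holds `DESKTOP_AGENT_OPS_PYTHON` (inferred from the `PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>` line, not confirmed). `parse_setup_check` is a hypothetical helper, not part of the skill's scripts.

```python
import json


def parse_setup_check(output_text):
    """Return the interpreter path from the setup check output, or raise if not ready."""
    report = json.loads(output_text)
    if report.get("ready") is not True:
        raise RuntimeError("setup is not ready; run first_run_setup.py without --check")
    # Assumed location of the interpreter path (see PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>).
    python_path = report.get("env", {}).get("DESKTOP_AGENT_OPS_PYTHON")
    if not python_path:
        raise RuntimeError("setup reported ready but gave no DESKTOP_AGENT_OPS_PYTHON")
    return python_path


# Illustrative sample only; the real output may carry more fields.
sample = '{"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/opt/venv/bin/python3"}}'
print(parse_setup_check(sample))
```

Raising instead of returning a default keeps the "do not proceed" rule explicit: the caller cannot continue with an unset interpreter by accident.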
Every desktop task follows this loop. No exceptions.
1. auto-setup gate ← run once per session
2. init task context ← create isolated temp directory
3. FOCUS the target app ← bring app to front, confirm frontmost
4. GET window bounds ← know exact position and size
5. CAPTURE that window ← screenshot ONLY the target window
6. ANALYZE the capture ← read screenshot or run OCR
7. LOCATE target via OCR ← find text/button within window bounds
8. VERIFY before acting ← move cursor, screenshot with cursor, confirm
9. EXECUTE one action ← click, type, scroll, press key
10. CAPTURE again ← screenshot to see result
11. VERIFY the result ← did the UI change as expected?
12. → if more steps, go to 5
13. CLEANUP ← remove task temp directory
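The inner part of the loop (capture, execute one action, capture again, verify) could be driven by a small function like the one below. This is a sketch under stated assumptions: `run_task` and its `capture`, `execute` and `verify` callables are hypothetical stand-ins for the real script calls, injected so the control flow can be read and tested on its own.

```python
def run_task(actions, capture, execute, verify):
    """Run actions one at a time and stop at the first one whose result fails verification.

    Returns (completed_actions, failed_action); failed_action is None on success.
    """
    completed = []
    for action in actions:
        before = capture()                      # step 5: capture the target window
        execute(action)                         # step 9: exactly one action
        after = capture()                       # step 10: capture again
        if not verify(action, before, after):   # step 11: did the UI change as expected?
            return completed, action
        completed.append(action)
    return completed, None


# Fake callables for illustration: every capture returns a new value,
# so "the screen changed" is always true here.
screens = iter(range(100))
done, failed = run_task(
    ["click OK", "type hello"],
    capture=lambda: next(screens),
    execute=lambda action: None,
    verify=lambda action, before, after: after != before,
)
print(done, failed)
```

Stopping at the first failed check, rather than continuing, matches the rule that every action is followed by verification before the next one starts.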
Key principles:
NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.
┌─────────────────────────────────────────────────────────┐
│ Step 1: FOCUS the target app │
│ $PY desktop_ops.py focus-app --name "AppName" │
│ → brings app to front │
├─────────────────────────────────────────────────────────┤
│ Step 2: GET window bounds │
│ $PY desktop_ops.py front-window-bounds --app "AppName"│
│ → {x, y, width, height} in logical coordinates │
├─────────────────────────────────────────────────────────┤
│ Step 3: CAPTURE only that window │
│ $PY desktop_ops.py capture-region --x X --y Y │
│ --width W --height H --output /tmp/window.png │
├─────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window │
│ $PY ocr_text.py --app "AppName" --python $PY │
│ → abs_box coordinates are INSIDE the window │
├─────────────────────────────────────────────────────────┤
│ Step 5: VERIFY before clicking │
│ $PY desktop_ops.py move --x TX --y TY │
│ $PY desktop_ops.py screenshot --with-cursor │
│ → confirm cursor is on the right element │
├─────────────────────────────────────────────────────────┤
│ Step 6: CLICK only if verified │
│ $PY desktop_ops.py click --x TX --y TY │
│ $PY desktop_ops.py screenshot → verify result │
└─────────────────────────────────────────────────────────┘
$PY scripts/target_resolver.py --app "AppName" --text "button text" --python $PY
This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.
| Approach | Risk |
|---|---|
| ❌ Full-screen OCR | "搜索" (Search) in WeChat AND Chrome → clicks wrong app |
| ✅ Window-scoped | "搜索" (Search) ONLY in WeChat window → correct click |
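The difference between the two rows comes down to a bounds check. The sketch below assumes each OCR hit is a dict with a `text` and a logical `box` laid out as `(x, y, width, height)`, and that window bounds use the `{x, y, width, height}` keys shown earlier. The hit layout and the helper names are assumptions for illustration, not the actual output of `ocr_text.py`.

```python
def center_of(box):
    """Return the center point of an (x, y, width, height) box."""
    x, y, width, height = box
    return x + width / 2, y + height / 2


def within_window(point, bounds):
    """True if the point lies inside the window rectangle."""
    point_x, point_y = point
    return (bounds["x"] <= point_x < bounds["x"] + bounds["width"]
            and bounds["y"] <= point_y < bounds["y"] + bounds["height"])


def hits_in_window(hits, bounds):
    """Keep only OCR hits whose center falls inside the target window."""
    return [hit for hit in hits if within_window(center_of(hit["box"]), bounds)]


wechat_bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
full_screen_hits = [
    {"text": "搜索", "box": (150, 70, 60, 20)},    # inside the WeChat window
    {"text": "搜索", "box": (1200, 90, 60, 20)},   # belongs to another app
]
print(hits_in_window(full_screen_hits, wechat_bounds))
```

With full-screen OCR both hits are candidates; after scoping, only the one inside the WeChat window remains.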
When something fails, follow these rules:
- Re-focus the app: focus-app --name "AppName"
- Re-get the bounds: front-window-bounds --app "AppName" (window may have moved/resized)
- Try a different region label (e.g. content_area instead of bottom_input)
- Retry OCR with --min-conf 30

The pipeline works for any desktop application. Here is how to reason about new apps:
target_resolver.py --app "AppName" --text "target text"

| Task | How to do it |
|---|---|
| Click a button | OCR find text → verify → click |
| Type in a field | OCR find field label → click field → type --text |
| Search for something | OCR find search box → click → type query → press return |
| Scroll a list | Get window bounds → scroll at window center with --x --y |
| Switch between apps | focus-app --name "OtherApp" → re-get bounds |
| Handle a dialog | Screenshot → OCR for dialog buttons → click appropriate one |
| Navigate menus | Click menu item → wait → screenshot → OCR new menu → click |
| Select from dropdown | Click dropdown → wait → OCR options → click selection |
| Read screen content | OCR the window → extract all text boxes |
| Verify an action | Screenshot before and after → compare or OCR for expected text |
| App type | Special considerations |
|---|---|
| Chat apps (WeChat, Slack, etc.) | Verify conversation title before typing; use insert-newline for multi-line; verify send mechanism |
| Browsers (Chrome, Safari, etc.) | Address bar at top; content area varies; may need to handle tabs |
| System Settings | Deep navigation; panels change; re-get bounds after each navigation |
| File managers (Finder, Explorer) | Sidebar + content area; double-click to open; path bar for navigation |
| Editors (VS Code, TextEdit, etc.) | Tab bar + editor area; use hotkeys for save/undo; type in editor area |
$PY scripts/desktop_ops.py type --text "your message"
- macOS: set the clipboard to + Cmd+V (single osascript call)
- Windows: Set-Clipboard + Ctrl+V (falls back to clip.exe)
- Linux: xclip + Ctrl+V

$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"
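For longer messages, the same type / insert-newline alternation can be generated rather than written by hand. This is a minimal sketch: `typing_commands` is a hypothetical helper that only builds argument lists for the `type` and `insert-newline` subcommands listed in this manual; it does not run them.

```python
def typing_commands(python_path, message):
    """Split a message on newlines into type and insert-newline invocations."""
    commands = []
    for index, line in enumerate(message.split("\n")):
        if index > 0:
            # A line break is sent as its own command, never as "\n" inside --text.
            commands.append([python_path, "scripts/desktop_ops.py", "insert-newline"])
        if line:
            commands.append([python_path, "scripts/desktop_ops.py", "type", "--text", line])
    return commands


for command in typing_commands("python3", "first line\nsecond line"):
    print(command)
```

Each list can be passed to `subprocess.run` as-is; keeping arguments as a list avoids shell quoting problems with spaces or CJK text.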
- Use insert-newline for literal line breaks.
- Never put \n in type --text; it may trigger send in some apps.
- To send, find the send button (e.g. 发送) via OCR, then click it.
- Use press --key return ONLY when the app is verified to use Enter-to-send.

| Operation | Primary | Fallback |
|---|---|---|
| type | Clipboard paste | cliclick (ASCII only) |
| press | AppleScript key code | cliclick kp: |
| hotkey | cliclick kd:/t:/ku: | pyautogui |
| click | cliclick | pyautogui |
> Important: cliclick kp:return is NOT recognized by WeChat — always use AppleScript for key press.
> Important: cliclick t: silently drops CJK characters — always use clipboard paste for text input.
Handled automatically. No manual DPI work needed.
| Platform | Common scales | Detection method |
|---|---|---|
| macOS Retina | 2.0x | screenshot pixels ÷ logical screen bounds |
| Windows HiDPI | 1.25x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
| Linux X11 | 1.0x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
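The relation between the three fields is a single division. The sketch below shows the arithmetic only, assuming boxes are `(x, y, width, height)` tuples; the tuple layout and the function names are illustrative assumptions, and the scripts already perform this conversion for you.

```python
def dpi_scale(screenshot_width_px, logical_width):
    """Scale factor = screenshot pixels divided by logical screen size."""
    return screenshot_width_px / logical_width


def to_logical(pixel_box, scale):
    """Convert a raw-pixel box to logical coordinates usable for mouse actions."""
    return tuple(round(value / scale) for value in pixel_box)


scale = dpi_scale(2880, 1440)                      # a Retina-style 2.0x display
print(scale, to_logical((900, 1040, 200, 60), scale))
```

Passing a `pixel_box` value straight to `click` on a 2.0x display would land at twice the intended distance from the origin, which is why `box` is the field to use for mouse actions.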
CRITICAL: Use EXACTLY these names. Do NOT guess.
$PY scripts/desktop_ops.py screenshot [--output PATH] [--x X --y Y --width W --height H] [--with-cursor]
$PY scripts/desktop_ops.py capture-region --x X --y Y --width W --height H [--output PATH] [--with-cursor]
$PY scripts/desktop_ops.py frontmost
$PY scripts/desktop_ops.py list-apps
$PY scripts/desktop_ops.py front-window-bounds [--app NAME]
$PY scripts/desktop_ops.py focus-app --name "App Name"
$PY scripts/desktop_ops.py move --x X --y Y [--duration SECONDS]
$PY scripts/desktop_ops.py click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py double-click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py drag --x1 X1 --y1 Y1 --x2 X2 --y2 Y2 [--duration SEC] [--button left]
$PY scripts/desktop_ops.py scroll --amount N [--x X --y Y] [--direction vertical|horizontal]
$PY scripts/desktop_ops.py mouse-position
$PY scripts/desktop_ops.py press --key KEY
$PY scripts/desktop_ops.py type --text "text to type"
$PY scripts/desktop_ops.py insert-newline [--count N]
$PY scripts/desktop_ops.py hotkey --keys cmd c
$PY scripts/desktop_ops.py screen-size
$PY scripts/desktop_ops.py pixel-color --x X --y Y
$PY scripts/ocr_text.py --app "AppName" --python $PY [--region-label LABEL] [--lang auto]
$PY scripts/ocr_text.py --image /path/to/capture.png --python $PY [--lang auto]
$PY scripts/target_resolver.py --app "AppName" --text "text" --python $PY
$PY scripts/target_resolver.py --app "AppName" --template /path/icon.png --python $PY
$PY scripts/target_resolver.py --app "AppName" --text "text" --region-label LABEL --python $PY
$PY scripts/task_context.py init --task-id "my-task" # aliases: create, --name
$PY scripts/task_context.py show --task-id "my-task"
$PY scripts/cleanup_task.py --task-id "my-task"
$PY scripts/window_regions.py --window-x X --window-y Y --window-width W --window-height H [--label LABEL]
Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action
1. $PY first_run_setup.py --check → ready: true
2. $PY task_context.py init --task-id "click-button"
3. $PY desktop_ops.py focus-app --name "AppName"
4. $PY desktop_ops.py front-window-bounds --app "AppName" → {x, y, w, h}
5. $PY target_resolver.py --app "AppName" --text "OK" --python $PY
→ best_candidate: {x:450, y:520, within_window:true}
6. $PY desktop_ops.py move --x 450 --y 520
7. $PY desktop_ops.py screenshot --with-cursor → verify cursor on "OK"
8. $PY desktop_ops.py click --x 450 --y 520
9. $PY desktop_ops.py screenshot → verify result
10. $PY cleanup_task.py --task-id "click-button"
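Steps 5 to 8 hinge on the `within_window` flag of `best_candidate`. A caller could refuse to build the click command when that flag is missing or false, as in this sketch. `click_command` is a hypothetical helper; the candidate keys follow the `{x, y, within_window}` shape shown above.

```python
def click_command(python_path, candidate):
    """Build the click invocation, refusing candidates outside the target window."""
    if not candidate or not candidate.get("within_window"):
        raise ValueError("no candidate inside the target window; do not click")
    return [
        python_path, "scripts/desktop_ops.py", "click",
        "--x", str(candidate["x"]),
        "--y", str(candidate["y"]),
    ]


best_candidate = {"x": 450, "y": 520, "within_window": True}
print(click_command("python3", best_candidate))
```

The move and screenshot --with-cursor steps still come first; this check only guards against acting on a match from another window.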
1. $PY desktop_ops.py focus-app --name "Safari"
2. $PY target_resolver.py --app "Safari" --text "Search" --region-label top_search --python $PY
→ {x:300, y:80, within_window:true}
3. $PY desktop_ops.py click --x 300 --y 80
4. $PY desktop_ops.py type --text "hello world"
5. $PY desktop_ops.py press --key return
6. $PY desktop_ops.py screenshot → verify search results
1. $PY desktop_ops.py focus-app --name "WeChat"
2. $PY desktop_ops.py front-window-bounds --app "WeChat"
3. # Navigate to the right conversation (OCR sidebar or search)
4. $PY target_resolver.py --app "WeChat" --text "ContactName" --region-label left_sidebar --python $PY
5. $PY desktop_ops.py click --x <found_x> --y <found_y>
6. # Verify conversation is open
7. $PY desktop_ops.py screenshot → confirm conversation title
8. # Click the input field
9. $PY target_resolver.py --app "WeChat" --text "" --region-label bottom_input --python $PY
OR: click at the bottom center of the window
10. $PY desktop_ops.py type --text "Hello!"
11. # Send: prefer visible send button; if not available, use press --key return
12. $PY target_resolver.py --app "WeChat" --text "发送" --python $PY
IF found: $PY desktop_ops.py click --x <x> --y <y>
ELSE: $PY desktop_ops.py press --key return
13. $PY desktop_ops.py screenshot → verify message sent
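The send decision in steps 11 and 12 can be written as one function. This sketch adds an explicit `enter_to_send_verified` flag so the Return fallback is only used when the app has been confirmed to send on Enter; that flag and the `send_command` helper are assumptions layered on the steps above, not part of the scripts.

```python
def send_command(python_path, send_button, enter_to_send_verified):
    """Prefer clicking a visible send button; fall back to Return only when verified."""
    if send_button and send_button.get("within_window"):
        return [
            python_path, "scripts/desktop_ops.py", "click",
            "--x", str(send_button["x"]),
            "--y", str(send_button["y"]),
        ]
    if enter_to_send_verified:
        return [python_path, "scripts/desktop_ops.py", "press", "--key", "return"]
    raise RuntimeError("no send button found and Enter-to-send is not verified")


print(send_command("python3", {"x": 820, "y": 610, "within_window": True}, False))
print(send_command("python3", None, True))
```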
1. $PY desktop_ops.py focus-app --name "AppName"
2. $PY desktop_ops.py front-window-bounds --app "AppName" → {x:100, y:50, w:800, h:600}
3. # Scroll down in the window center
$PY desktop_ops.py scroll --amount -5 --x 500 --y 350
4. $PY desktop_ops.py screenshot → check if target visible
5. $PY target_resolver.py --app "AppName" --text "target item" --python $PY
6. IF not found: scroll more and retry (max 5 scrolls)
7. IF found: click it
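The scroll coordinates in step 3 are the window center, and steps 5 to 7 form a bounded retry. Both can be sketched as follows, assuming bounds with `x`, `y`, `width` and `height` keys; `window_center`, `scroll_until_found` and the injected `find` and `scroll` callables are hypothetical names standing in for the real script calls.

```python
def window_center(bounds):
    """Return the center of the window in logical coordinates."""
    return (bounds["x"] + bounds["width"] // 2,
            bounds["y"] + bounds["height"] // 2)


def scroll_until_found(find, scroll, max_scrolls=5):
    """Call find(); while it returns None, scroll and retry up to max_scrolls times.

    Returns (candidate_or_None, number_of_scrolls_performed).
    """
    candidate = find()
    scrolls = 0
    while candidate is None and scrolls < max_scrolls:
        scroll()
        scrolls += 1
        candidate = find()
    return candidate, scrolls


print(window_center({"x": 100, "y": 50, "width": 800, "height": 600}))
```

For the bounds in step 2 the center is (500, 350), matching the `--x 500 --y 350` passed to `scroll`. A `None` result after five scrolls means the target was not found and the task should stop rather than keep scrolling.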
1. # During any operation, if the expected UI doesn't match:
2. $PY desktop_ops.py screenshot → examine what's on screen
3. # If a dialog is visible, OCR it:
$PY ocr_text.py --app "AppName" --python $PY
4. # Find and click the appropriate button (OK, Cancel, Allow, etc.)
$PY target_resolver.py --app "AppName" --text "OK" --python $PY
5. $PY desktop_ops.py click --x <x> --y <y>
6. # After dialog is dismissed, re-get window bounds and continue
$PY desktop_ops.py front-window-bounds --app "AppName"
Load as needed:
| Document | When to read |
|---|---|
| references/workflow.md | Core 8-step closed loop |
| references/platform-macos.md | macOS-specific tools and permissions |
| references/platform-windows.md | Windows setup |
| references/platform-linux.md | Linux X11/Wayland setup |
| references/operation-patterns.md | Reusable task templates |
| references/validation-patterns.md | Two-stage validation |
| references/precise-targeting.md | 5-layer precision targeting |
| references/target-providers.md | Provider ordering and fallback contract |
| references/coordinate-reconstruction.md | Rebuild click coordinates from screenshot evidence |
| references/chat-app-macos.md | Chat app workflow |
| references/app-wechat-desktop.md | Cross-platform WeChat guidance |
| references/cleanup-rules.md | Cleanup timing and scope |
| references/collaboration-rules.md | When multi-agent collaboration is justified |
| references/example-cases.md | Repeatable task examples |
| references/reproducible-setup.md | Host bring-up checklist |
Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.
- Always pass --python $PY to ocr_text.py and target_resolver.py.
- Use insert-newline for line breaks; never use \n in type --text.
- Use press --key return only when verified.