Desktop Agent Ops

Use this skill as a main-agent operating manual for desktop GUI tasks.

MANDATORY: Auto-setup gate (FIRST ACTION, every time)

python3 <SKILL_DIR>/scripts/first_run_setup.py --check

If "ready": false, run setup (installs EVERYTHING automatically):

python3 <SKILL_DIR>/scripts/first_run_setup.py

Auto-installs on first run:

Platform detection (macOS / Windows / Linux)
cliclick + tesseract (macOS via brew; Linux guide printed)
OCR language packs auto-detected from system locale (Chinese → chi_sim, Japanese → jpn, etc.)
Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
Smoke test (screenshot + mouse move verification)

After setup, set $PY for ALL subsequent calls:

PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>

Do NOT proceed if setup is not ready.
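
The gate can be scripted. The sketch below assumes the --check command prints a single JSON object on stdout with a "ready" flag and an "env" mapping that holds DESKTOP_AGENT_OPS_PYTHON. That shape is inferred from the text above and is not confirmed. parse_setup_status is a hypothetical helper name, and the sample string is hand-written, not real script output.

```python
import json


def parse_setup_status(stdout_text):
    """Return the interpreter path if setup is ready, else None.

    Assumption: the setup script prints one JSON object with a "ready"
    flag and an "env" mapping (inferred, not confirmed by the skill).
    """
    status = json.loads(stdout_text)
    if not status.get("ready", False):
        return None
    return status.get("env", {}).get("DESKTOP_AGENT_OPS_PYTHON")


# Hand-written sample, only to show the assumed shape.
sample = '{"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/tmp/venv/bin/python"}}'
print(parse_setup_status(sample))
```

A None result means the gate failed: run the setup command and check again before doing anything else.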

Core Execution Loop

Every desktop task follows this loop. No exceptions.

 1. auto-setup gate           ← run once per session
 2. init task context          ← create isolated temp directory
 3. FOCUS the target app       ← bring app to front, confirm frontmost
 4. GET window bounds          ← know exact position and size
 5. CAPTURE that window        ← screenshot ONLY the target window
 6. ANALYZE the capture        ← read screenshot or run OCR
 7. LOCATE target via OCR      ← find text/button within window bounds
 8. VERIFY before acting       ← move cursor, screenshot with cursor, confirm
 9. EXECUTE one action         ← click, type, scroll, press key
10. CAPTURE again              ← screenshot to see result
11. VERIFY the result          ← did the UI change as expected?
12. → if more steps, go to 5
13. CLEANUP                    ← remove task temp directory

Key principles:

One action at a time. Never chain blind actions.
Always verify after each action. If verification fails, recapture and retry — do NOT guess.
Always work within a specific window. Never click based on full-screen assumptions.

Window-Scoped Targeting (THE CORRECT WAY)

NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.

The 6-Step Pipeline

┌─────────────────────────────────────────────────────────┐
│ Step 1: FOCUS the target app                            │
│   $PY desktop_ops.py focus-app --name "AppName"         │
│   → brings app to front                                 │
├─────────────────────────────────────────────────────────┤
│ Step 2: GET window bounds                               │
│   $PY desktop_ops.py front-window-bounds --app "AppName"│
│   → {x, y, width, height} in logical coordinates        │
├─────────────────────────────────────────────────────────┤
│ Step 3: CAPTURE only that window                        │
│   $PY desktop_ops.py capture-region --x X --y Y         │
│     --width W --height H --output /tmp/window.png       │
├─────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window                           │
│   $PY ocr_text.py --app "AppName" --python $PY          │
│   → abs_box coordinates are INSIDE the window           │
├─────────────────────────────────────────────────────────┤
│ Step 5: VERIFY before clicking                          │
│   $PY desktop_ops.py move --x TX --y TY                 │
│   $PY desktop_ops.py screenshot --with-cursor            │
│   → confirm cursor is on the right element              │
├─────────────────────────────────────────────────────────┤
│ Step 6: CLICK only if verified                          │
│   $PY desktop_ops.py click --x TX --y TY                │
│   $PY desktop_ops.py screenshot → verify result          │
└─────────────────────────────────────────────────────────┘

Shortcut (RECOMMENDED for most targeting):

$PY scripts/target_resolver.py --app "AppName" --text "button text" --python $PY

This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.

Why window-scoped matters:

Approach	Risk
----------	------
❌ Full-screen OCR	"搜索" (Search) appears in WeChat AND Chrome → click may land in the wrong app
✅ Window-scoped	"搜索" (Search) is matched ONLY inside the WeChat window → correct click
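
To make the difference concrete, here is a minimal sketch of filtering OCR hits by window bounds. It assumes each hit is a dict with a text and a centre point in logical coordinates, and that bounds use the {x, y, width, height} shape from Step 2. The hit format and both function names are illustrative, not the actual output of ocr_text.py.

```python
def point_in_window(x, y, bounds):
    """True if logical point (x, y) lies inside the window rectangle."""
    return (bounds["x"] <= x < bounds["x"] + bounds["width"]
            and bounds["y"] <= y < bounds["y"] + bounds["height"])


def hits_in_window(hits, bounds):
    """Keep only OCR hits whose centre falls inside the target window."""
    return [h for h in hits if point_in_window(h["x"], h["y"], bounds)]


# Two apps both show the same label; only the hit inside the
# WeChat window survives the filter.
wechat_bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
full_screen_hits = [
    {"text": "搜索", "x": 180, "y": 90},    # inside the WeChat window
    {"text": "搜索", "x": 1400, "y": 70},   # belongs to another app
]
print(hits_in_window(full_screen_hits, wechat_bounds))
```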

Failure Recovery (CRITICAL)

When something fails, follow these rules:

OCR finds nothing

Re-focus the app: focus-app --name "AppName"
Re-get bounds: front-window-bounds --app "AppName" (window may have moved/resized)
Take a fresh screenshot and read it visually
Try a different region label (e.g. content_area instead of bottom_input)
Try lowering OCR confidence: --min-conf 30

Click doesn't work

Screenshot with cursor to check cursor position
The window may have moved — re-get bounds
Try clicking a few pixels offset from the OCR center
Check if a dialog/popup is blocking the target
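
The "few pixels offset" retry can be generated rather than guessed. This is a sketch under assumptions: the 3-pixel step and the order of offsets are arbitrary choices, and offset_candidates is a hypothetical helper. Every candidate is clamped so that it stays inside the window bounds.

```python
def offset_candidates(x, y, bounds, step=3):
    """Yield the OCR centre first, then nearby points, all inside the window."""
    offsets = [(0, 0), (step, 0), (-step, 0), (0, step), (0, -step)]
    left, top = bounds["x"], bounds["y"]
    right = left + bounds["width"] - 1
    bottom = top + bounds["height"] - 1
    seen = set()
    for dx, dy in offsets:
        # Clamp so a retry can never land outside the target window.
        px = min(max(x + dx, left), right)
        py = min(max(y + dy, top), bottom)
        if (px, py) not in seen:
            seen.add((px, py))
            yield px, py


bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
for candidate in offset_candidates(450, 520, bounds):
    print(candidate)
```

Each candidate should still go through the move, screenshot-with-cursor, and verify steps before any click.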

App state changed (login screen, dialog, etc.)

ALWAYS re-get window bounds after any major UI change
ALWAYS re-run OCR after navigation or state change
Never reuse old coordinates — they may be stale

General retry rule

Maximum 3 retries per action
Each retry must recapture fresh state
If 3 retries fail, report the failure with screenshots and stop
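
The retry rule can be expressed as a small helper. This sketch reads "maximum 3 retries" as one initial attempt plus up to three retries, which is an interpretation. The capture, act, and verify callables are placeholders for whatever wraps the CLI calls, and run_with_retries is a hypothetical name.

```python
def run_with_retries(capture, act, verify, max_retries=3):
    """Perform one action, retrying on failed verification.

    capture() -> fresh state, act(state) -> None, verify(state) -> bool.
    Returns the number of attempts used, or raises RuntimeError.
    """
    attempts = 1 + max_retries  # the first try plus the allowed retries
    for attempt in range(1, attempts + 1):
        state = capture()       # never reuse state from an earlier attempt
        act(state)
        if verify(capture()):   # verify against a new capture
            return attempt
    raise RuntimeError(
        f"action failed after {max_retries} retries; report with screenshots and stop"
    )


# Toy run: verification succeeds once the fake state counter reaches 3.
states = iter(range(100))
print(run_with_retries(lambda: next(states), lambda s: None, lambda s: s >= 3))
```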

Generalization: How to Apply This to ANY App

The pipeline works for any desktop application. Here is how to reason about new apps:

Step-by-step for ANY new app:

Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
Focus and get bounds — this tells you the window's exact position
Screenshot the window — look at what's on screen
Identify the target — what text, button, or area do you need to interact with?
Use OCR to find it — target_resolver.py --app "AppName" --text "target text"
Verify and click

Common patterns across apps:

Task	How to do it
------	-------------
Click a button	OCR find text → verify → click
Type in a field	OCR find field label → click field → `type --text`
Search for something	OCR find search box → click → type query → press return
Scroll a list	Get window bounds → scroll at window center with `--x --y`
Switch between apps	`focus-app --name "OtherApp"` → re-get bounds
Handle a dialog	Screenshot → OCR for dialog buttons → click appropriate one
Navigate menus	Click menu item → wait → screenshot → OCR new menu → click
Select from dropdown	Click dropdown → wait → OCR options → click selection
Read screen content	OCR the window → extract all text boxes
Verify an action	Screenshot before and after → compare or OCR for expected text

App-specific adaptations:

App type	Special considerations
----------	----------------------
Chat apps (WeChat, Slack, etc.)	Verify conversation title before typing; use `insert-newline` for multi-line; verify send mechanism
Browsers (Chrome, Safari, etc.)	Address bar at top; content area varies; may need to handle tabs
System Settings	Deep navigation; panels change; re-get bounds after each navigation
File managers (Finder, Explorer)	Sidebar + content area; double-click to open; path bar for navigation
Editors (VS Code, TextEdit, etc.)	Tab bar + editor area; use hotkeys for save/undo; type in editor area

Text Input and Send Rules

Typing text

$PY scripts/desktop_ops.py type --text "your message"

Uses clipboard paste as primary method on all platforms — reliable for all languages including CJK
macOS: AppleScript `set the clipboard to` + Cmd+V (single osascript call)
Windows: PowerShell Set-Clipboard + Ctrl+V (falls back to clip.exe)
Linux: xclip + Ctrl+V
First click on the input field to focus it before typing

Multi-line messages

$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"

Use insert-newline for literal line breaks
Do NOT use \n in type --text — it may trigger send in some apps
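
A message that already contains line breaks can be split into the command sequence above. This is an illustrative sketch: multiline_commands is a hypothetical helper that returns subcommand argument lists. It uses the --count option of insert-newline from the CLI reference to collapse consecutive breaks.

```python
def multiline_commands(message):
    """Translate a message with line breaks into desktop_ops.py subcommands."""
    commands = []
    pending_newlines = 0
    for index, line in enumerate(message.split("\n")):
        if index > 0:
            pending_newlines += 1   # one break precedes every line but the first
        if line:
            if pending_newlines:
                commands.append(["insert-newline", "--count", str(pending_newlines)])
                pending_newlines = 0
            commands.append(["type", "--text", line])
    if pending_newlines:            # trailing breaks at the end of the message
        commands.append(["insert-newline", "--count", str(pending_newlines)])
    return commands


for command in multiline_commands("first line\nsecond line"):
    print(command)
```

No "\n" ever reaches type --text, so the text itself cannot trigger a send.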

Sending a message

Preferred: Look for a visible send button (e.g., 发送, "Send") via OCR, then click it
Alternative: Use press --key return ONLY when the app is verified to use Enter-to-send
Never guess which send method to use — verify first
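
The send rules reduce to a short decision. In this sketch, send_button stands for a resolver candidate with x and y keys (or None if OCR found no button), and enter_to_send_verified is a flag the agent sets only after confirming the app's behaviour. Both names, and choose_send_action itself, are hypothetical.

```python
def choose_send_action(send_button, enter_to_send_verified):
    """Pick the desktop_ops.py subcommand used to send a message."""
    if send_button is not None:
        # Preferred path: click the visible send button.
        return ["click", "--x", str(send_button["x"]), "--y", str(send_button["y"])]
    if enter_to_send_verified:
        # Alternative path: allowed only when Enter-to-send is confirmed.
        return ["press", "--key", "return"]
    raise RuntimeError("send method not verified; capture and inspect the UI first")


print(choose_send_action({"x": 820, "y": 610}, enter_to_send_verified=False))
print(choose_send_action(None, enter_to_send_verified=True))
```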

Backend priority (macOS)

Operation	Primary	Fallback
-----------	---------	----------
`type`	Clipboard paste	cliclick (ASCII only)
`press`	AppleScript `key code`	cliclick `kp:`
`hotkey`	cliclick `kd:/t:/ku:`	pyautogui
`click`	cliclick	pyautogui

> Important: cliclick kp:return is NOT recognized by WeChat — always use AppleScript for key press.

> Important: cliclick t: silently drops CJK characters — always use clipboard paste for text input.

DPI / HiDPI / Retina (All Platforms)

Handled automatically. No manual DPI work needed.

Platform	Common scales	Detection method
----------	---------------	-----------------
macOS Retina	2.0x	screenshot pixels ÷ logical screen bounds
Windows HiDPI	1.25x, 1.5x, 2.0x	screenshot pixels ÷ pyautogui.size()
Linux X11	1.0x, 1.5x, 2.0x	screenshot pixels ÷ pyautogui.size()

OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
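
The relationship between the three fields can be sketched as follows. The scripts do this conversion already, so the code is only for understanding. It assumes boxes are (x, y, width, height) tuples, which is a guess at the layout, and the helper names are hypothetical.

```python
def dpi_scale(screenshot_width_px, logical_width):
    """Scale factor = screenshot pixels divided by logical screen size."""
    return screenshot_width_px / logical_width


def pixel_to_logical(pixel_box, scale):
    """Convert an assumed (x, y, width, height) pixel box to logical units."""
    return tuple(round(value / scale) for value in pixel_box)


def box_center(box):
    """Centre of a logical box, i.e. the point to pass to move or click."""
    x, y, width, height = box
    return x + width // 2, y + height // 2


scale = dpi_scale(2880, 1440)   # e.g. a Retina display
logical = pixel_to_logical((900, 1040, 120, 40), scale)
print(scale, logical, box_center(logical))
```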

CLI Quick Reference (EXACT parameter names)

CRITICAL: Use EXACTLY these names. Do NOT guess.

desktop_ops.py

$PY scripts/desktop_ops.py screenshot [--output PATH] [--x X --y Y --width W --height H] [--with-cursor]
$PY scripts/desktop_ops.py capture-region --x X --y Y --width W --height H [--output PATH] [--with-cursor]
$PY scripts/desktop_ops.py frontmost
$PY scripts/desktop_ops.py list-apps
$PY scripts/desktop_ops.py front-window-bounds [--app NAME]
$PY scripts/desktop_ops.py focus-app --name "App Name"
$PY scripts/desktop_ops.py move --x X --y Y [--duration SECONDS]
$PY scripts/desktop_ops.py click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py double-click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py drag --x1 X1 --y1 Y1 --x2 X2 --y2 Y2 [--duration SEC] [--button left]
$PY scripts/desktop_ops.py scroll --amount N [--x X --y Y] [--direction vertical|horizontal]
$PY scripts/desktop_ops.py mouse-position
$PY scripts/desktop_ops.py press --key KEY
$PY scripts/desktop_ops.py type --text "text to type"
$PY scripts/desktop_ops.py insert-newline [--count N]
$PY scripts/desktop_ops.py hotkey --keys cmd c
$PY scripts/desktop_ops.py screen-size
$PY scripts/desktop_ops.py pixel-color --x X --y Y

ocr_text.py

$PY scripts/ocr_text.py --app "AppName" --python $PY [--region-label LABEL] [--lang auto]
$PY scripts/ocr_text.py --image /path/to/capture.png --python $PY [--lang auto]

target_resolver.py

$PY scripts/target_resolver.py --app "AppName" --text "text" --python $PY
$PY scripts/target_resolver.py --app "AppName" --template /path/icon.png --python $PY
$PY scripts/target_resolver.py --app "AppName" --text "text" --region-label LABEL --python $PY

task_context.py / cleanup_task.py

$PY scripts/task_context.py init --task-id "my-task"   # aliases: create, --name
$PY scripts/task_context.py show --task-id "my-task"
$PY scripts/cleanup_task.py --task-id "my-task"
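
Because cleanup is mandatory, one option is to pair init and cleanup in a context manager so cleanup runs even when a step raises. This is a sketch only: task_context here is a hypothetical wrapper, and run is any callable that executes an argument list (for example a thin wrapper around subprocess.run with $PY prepended).

```python
from contextlib import contextmanager


@contextmanager
def task_context(task_id, run):
    """Create the task directory on entry and always clean it up on exit."""
    run(["scripts/task_context.py", "init", "--task-id", task_id])
    try:
        yield task_id
    finally:
        # Runs on normal exit and on exceptions alike.
        run(["scripts/cleanup_task.py", "--task-id", task_id])


# Demo with a recorder in place of a real command runner.
recorded = []
with task_context("my-task", recorded.append):
    pass
print(recorded)
```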

window_regions.py

$PY scripts/window_regions.py --window-x X --window-y Y --window-width W --window-height H [--label LABEL]

Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action

Workflow Examples

Example 1: Click a button by text (any app)

1. python3 <SKILL_DIR>/scripts/first_run_setup.py --check   → ready: true, then set $PY
2. $PY task_context.py init --task-id "click-button"
3. $PY desktop_ops.py focus-app --name "AppName"
4. $PY desktop_ops.py front-window-bounds --app "AppName"    → {x, y, w, h}
5. $PY target_resolver.py --app "AppName" --text "OK" --python $PY
   → best_candidate: {x:450, y:520, within_window:true}
6. $PY desktop_ops.py move --x 450 --y 520
7. $PY desktop_ops.py screenshot --with-cursor               → verify cursor on "OK"
8. $PY desktop_ops.py click --x 450 --y 520
9. $PY desktop_ops.py screenshot                             → verify result
10. $PY cleanup_task.py --task-id "click-button"

Example 2: Type and search

1. $PY desktop_ops.py focus-app --name "Safari"
2. $PY target_resolver.py --app "Safari" --text "Search" --region-label top_search --python $PY
   → {x:300, y:80, within_window:true}
3. $PY desktop_ops.py click --x 300 --y 80
4. $PY desktop_ops.py type --text "hello world"
5. $PY desktop_ops.py press --key return
6. $PY desktop_ops.py screenshot                             → verify search results

Example 3: Send a chat message (WeChat, Slack, etc.)

1. $PY desktop_ops.py focus-app --name "WeChat"
2. $PY desktop_ops.py front-window-bounds --app "WeChat"
3. # Navigate to the right conversation (OCR sidebar or search)
4. $PY target_resolver.py --app "WeChat" --text "ContactName" --region-label left_sidebar --python $PY
5. $PY desktop_ops.py click --x <found_x> --y <found_y>
6. # Verify conversation is open
7. $PY desktop_ops.py screenshot → confirm conversation title
8. # Click the input field
9. $PY target_resolver.py --app "WeChat" --text "" --region-label bottom_input --python $PY
   OR: click at the bottom center of the window
10. $PY desktop_ops.py type --text "Hello!"
11. # Send: prefer a visible send button; use press --key return only if Enter-to-send is verified
12. $PY target_resolver.py --app "WeChat" --text "发送" --python $PY
    IF found: $PY desktop_ops.py click --x <x> --y <y>
    ELSE, if the app is verified to use Enter-to-send: $PY desktop_ops.py press --key return
13. $PY desktop_ops.py screenshot → verify message sent

Example 4: Scroll a list and find an item

1. $PY desktop_ops.py focus-app --name "AppName"
2. $PY desktop_ops.py front-window-bounds --app "AppName"   → {x:100, y:50, w:800, h:600}
3. # Scroll down in the window center
   $PY desktop_ops.py scroll --amount -5 --x 500 --y 350
4. $PY desktop_ops.py screenshot                             → check if target visible
5. $PY target_resolver.py --app "AppName" --text "target item" --python $PY
6. IF not found: scroll more and retry (max 5 scrolls)
7. IF found: click it
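
The bounded scroll-and-search in this example can be sketched as a loop. Unlike the steps above, this variant runs OCR once before the first scroll, which is a design choice of the sketch. resolve and scroll are placeholder callables for the target_resolver.py and scroll calls, and scroll_until_found is a hypothetical name.

```python
def scroll_until_found(resolve, scroll, max_scrolls=5):
    """Return the first candidate found, scrolling between attempts.

    resolve() -> candidate dict or None; scroll() -> None.
    Returns None if the target is still missing after max_scrolls.
    """
    candidate = resolve()
    scrolls = 0
    while candidate is None and scrolls < max_scrolls:
        scroll()
        scrolls += 1
        candidate = resolve()   # fresh OCR after every scroll
    return candidate


# Toy run: the target appears after two scrolls.
results = iter([None, None, {"x": 480, "y": 300}])
scrolled = []
print(scroll_until_found(lambda: next(results), lambda: scrolled.append(1)))
print(len(scrolled))
```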

Example 5: Handle an unexpected dialog

1. # During any operation, if the expected UI doesn't match:
2. $PY desktop_ops.py screenshot → examine what's on screen
3. # If a dialog is visible, OCR it:
   $PY ocr_text.py --app "AppName" --python $PY
4. # Find and click the appropriate button (OK, Cancel, Allow, etc.)
   $PY target_resolver.py --app "AppName" --text "OK" --python $PY
5. $PY desktop_ops.py click --x <x> --y <y>
6. # After dialog is dismissed, re-get window bounds and continue
   $PY desktop_ops.py front-window-bounds --app "AppName"

Reference Documents

Load as needed:

Document	When to read
----------	-------------
`references/workflow.md`	Core 8-step closed loop
`references/platform-macos.md`	macOS-specific tools and permissions
`references/platform-windows.md`	Windows setup
`references/platform-linux.md`	Linux X11/Wayland setup
`references/operation-patterns.md`	Reusable task templates
`references/validation-patterns.md`	Two-stage validation
`references/precise-targeting.md`	5-layer precision targeting
`references/target-providers.md`	Provider ordering and fallback contract
`references/coordinate-reconstruction.md`	Rebuild click coordinates from screenshot evidence
`references/chat-app-macos.md`	Chat app workflow
`references/app-wechat-desktop.md`	Cross-platform WeChat guidance
`references/cleanup-rules.md`	Cleanup timing and scope
`references/collaboration-rules.md`	When multi-agent collaboration is justified
`references/example-cases.md`	Repeatable task examples
`references/reproducible-setup.md`	Host bring-up checklist

Scope

Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.

Hard Rules

Always run auto-setup gate first
Always use EXACT parameter names from CLI reference — never guess
Always scope OCR to the target app window — NEVER full-screen OCR
Always: focus-app → front-window-bounds → OCR within window → verify → act
Always pass --python $PY to ocr_text.py and target_resolver.py
Always verify coordinates are within window bounds before clicking
Always re-get window bounds after any UI state change (login, dialog, navigation)
Use insert-newline for line breaks; never use \n in type --text
For send actions: prefer visible send button; use press --key return only when verified
One action at a time; verify after each
Maximum 3 retries per action; each retry must recapture fresh state
Cleanup is mandatory at task end
If verification fails, recapture and rebuild — do not retry blindly
