Desktop Agent Ops

Execute cross-platform desktop tasks through a packaged desktop automation skill that guides the main agent to observe the screen, focus apps and windows, ca...
Author: appergb
Uncategorized · clawhub · v1.0.3 · 1 version · 99829.6 · Key: not required
Overview

Use this skill as a main-agent operating manual for desktop GUI tasks.


MANDATORY: Auto-setup gate (FIRST ACTION, every time)

python3 <SKILL_DIR>/scripts/first_run_setup.py --check

If "ready": false, run setup (installs EVERYTHING automatically):

python3 <SKILL_DIR>/scripts/first_run_setup.py

Auto-installs on first run:

  1. Platform detection (macOS / Windows / Linux)
  2. cliclick + tesseract (macOS via brew; Linux guide printed)
  3. OCR language packs auto-detected from system locale (Chinese → chi_sim, Japanese → jpn, etc.)
  4. Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
  5. OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
  6. Smoke test (screenshot + mouse move verification)

After setup, set $PY for ALL subsequent calls:

PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>

Do NOT proceed if setup is not ready.
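
The gate can be scripted so the agent never continues on a failed check. Below is a minimal sketch, assuming `--check` prints a single JSON object with a top-level `ready` boolean and an `env` map holding `DESKTOP_AGENT_OPS_PYTHON`. That shape is inferred from the `"ready": false` and `<output.env.DESKTOP_AGENT_OPS_PYTHON>` placeholders above and is not confirmed; `parse_setup_output` is a hypothetical helper name.

```python
import json


def parse_setup_output(stdout_text):
    """Return (ready, python_path) from the setup script's JSON output.

    Assumed shape (not confirmed by the skill docs):
    {"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/path/to/python"}}
    """
    data = json.loads(stdout_text)
    ready = bool(data.get("ready", False))
    python_path = (data.get("env") or {}).get("DESKTOP_AGENT_OPS_PYTHON")
    # A missing interpreter path is treated as "not ready" so the caller stops.
    if not python_path:
        ready = False
    return ready, python_path


sample = '{"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/tmp/venv/bin/python"}}'
ready, py = parse_setup_output(sample)
print(ready, py)
```

If `ready` comes back false, the caller would run the setup script without `--check` and parse its output the same way before setting `$PY`.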


Core Execution Loop

Every desktop task follows this loop. No exceptions.

 1. auto-setup gate           ← run once per session
 2. init task context          ← create isolated temp directory
 3. FOCUS the target app       ← bring app to front, confirm frontmost
 4. GET window bounds          ← know exact position and size
 5. CAPTURE that window        ← screenshot ONLY the target window
 6. ANALYZE the capture        ← read screenshot or run OCR
 7. LOCATE target via OCR      ← find text/button within window bounds
 8. VERIFY before acting       ← move cursor, screenshot with cursor, confirm
 9. EXECUTE one action         ← click, type, scroll, press key
10. CAPTURE again              ← screenshot to see result
11. VERIFY the result          ← did the UI change as expected?
12. → if more steps, go to 5
13. CLEANUP                    ← remove task temp directory

Key principles:

  • One action at a time. Never chain blind actions.
  • Always verify after each action. If verification fails, recapture and retry — do NOT guess.
  • Always work within a specific window. Never click based on full-screen assumptions.

Window-Scoped Targeting (THE CORRECT WAY)

NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.

The 6-Step Pipeline

┌─────────────────────────────────────────────────────────┐
│ Step 1: FOCUS the target app                            │
│   $PY desktop_ops.py focus-app --name "AppName"         │
│   → brings app to front                                 │
├─────────────────────────────────────────────────────────┤
│ Step 2: GET window bounds                               │
│   $PY desktop_ops.py front-window-bounds --app "AppName"│
│   → {x, y, width, height} in logical coordinates        │
├─────────────────────────────────────────────────────────┤
│ Step 3: CAPTURE only that window                        │
│   $PY desktop_ops.py capture-region --x X --y Y         │
│     --width W --height H --output /tmp/window.png       │
├─────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window                           │
│   $PY ocr_text.py --app "AppName" --python $PY          │
│   → abs_box coordinates are INSIDE the window           │
├─────────────────────────────────────────────────────────┤
│ Step 5: VERIFY before clicking                          │
│   $PY desktop_ops.py move --x TX --y TY                 │
│   $PY desktop_ops.py screenshot --with-cursor            │
│   → confirm cursor is on the right element              │
├─────────────────────────────────────────────────────────┤
│ Step 6: CLICK only if verified                          │
│   $PY desktop_ops.py click --x TX --y TY                │
│   $PY desktop_ops.py screenshot → verify result          │
└─────────────────────────────────────────────────────────┘

Shortcut (RECOMMENDED for most targeting):

$PY scripts/target_resolver.py --app "AppName" --text "Button text" --python $PY

This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.
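
A caller might wrap the shortcut roughly like this. The sketch assumes the resolver prints JSON containing a `best_candidate` object with `x`, `y`, and `within_window`, as the description above suggests; the exact output format is not confirmed, and `resolver_argv` and `pick_click_point` are hypothetical helper names.

```python
import json


def resolver_argv(py, app, text):
    """Build the argv for the target_resolver.py shortcut shown above."""
    return [py, "scripts/target_resolver.py",
            "--app", app, "--text", text, "--python", py]


def pick_click_point(resolver_stdout):
    """Return (x, y) only when the candidate lies inside the window, else None.

    Assumed output shape (not confirmed):
    {"best_candidate": {"x": 450, "y": 520, "within_window": true}}
    """
    data = json.loads(resolver_stdout)
    candidate = data.get("best_candidate")
    if not candidate or not candidate.get("within_window"):
        return None
    return candidate["x"], candidate["y"]


good = '{"best_candidate": {"x": 450, "y": 520, "within_window": true}}'
print(resolver_argv("/tmp/venv/bin/python", "AppName", "OK"))
print(pick_click_point(good))
```

Returning `None` for a missing or out-of-window candidate forces the caller back to recapture instead of clicking on a guess.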

Why window-scoped matters:

| Approach | Risk |
|---|---|
| ❌ Full-screen OCR | "搜索" (Search) appears in both WeChat AND Chrome → clicks wrong app |
| ✅ Window-scoped | "搜索" (Search) matched ONLY in the WeChat window → correct click |
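
Checking that a click point really falls inside the target window is simple rectangle arithmetic. The sketch below assumes bounds arrive as a dict with the `x`, `y`, `width`, `height` keys listed for `front-window-bounds`; `within_window` here is an illustrative function, not the skill's own implementation.

```python
def within_window(x, y, bounds):
    """True when the logical point (x, y) lies inside the window rectangle.

    The right and bottom edges are treated as exclusive.
    """
    left, top = bounds["x"], bounds["y"]
    right = left + bounds["width"]
    bottom = top + bounds["height"]
    return left <= x < right and top <= y < bottom


bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
print(within_window(450, 520, bounds))   # point inside the window
print(within_window(950, 300, bounds))   # point to the right of the window
```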

Failure Recovery (CRITICAL)

When something fails, follow these rules:

OCR finds nothing

  1. Re-focus the app: focus-app --name "AppName"
  2. Re-get bounds: front-window-bounds --app "AppName" (window may have moved/resized)
  3. Take a fresh screenshot and read it visually
  4. Try a different region label (e.g. content_area instead of bottom_input)
  5. Try lowering OCR confidence: --min-conf 30

Click doesn't work

  1. Screenshot with cursor to check cursor position
  2. The window may have moved — re-get bounds
  3. Try clicking a few pixels offset from the OCR center
  4. Check if a dialog/popup is blocking the target

App state changed (login screen, dialog, etc.)

  1. ALWAYS re-get window bounds after any major UI change
  2. ALWAYS re-run OCR after navigation or state change
  3. Never reuse old coordinates — they may be stale

General retry rule

  • Maximum 3 retries per action
  • Each retry must recapture fresh state
  • If 3 retries fail, report the failure with screenshots and stop
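
The retry rule can be sketched as a small driver loop. This version reads "maximum 3 retries" as at most three attempts in total, which is an interpretation; `capture`, `act`, and `verify` are hypothetical callables supplied by the caller (for example, thin wrappers around the screenshot and click commands).

```python
def run_with_retries(capture, act, verify, max_retries=3):
    """Run one action with fresh state on every attempt.

    Returns (ok, evidence), where evidence lists every capture taken,
    so a failure can be reported together with its screenshots.
    """
    evidence = []
    for _attempt in range(max_retries):
        before = capture()          # recapture fresh state each attempt
        act(before)                 # exactly one action per attempt
        after = capture()           # capture again to see the result
        evidence.extend([before, after])
        if verify(before, after):
            return True, evidence
    return False, evidence          # caller reports the failure and stops


shots = iter(range(100))
ok, evidence = run_with_retries(
    capture=lambda: next(shots),
    act=lambda state: None,
    verify=lambda before, after: after >= 3,
)
print(ok, evidence)
```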

Generalization: How to Apply This to ANY App

The pipeline works for any desktop application. Here is how to reason about new apps:

Step-by-step for ANY new app:

  1. Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
  2. Focus and get bounds — this tells you the window's exact position
  3. Screenshot the window — look at what's on screen
  4. Identify the target — what text, button, or area do you need to interact with?
  5. Use OCR to find it: $PY scripts/target_resolver.py --app "AppName" --text "target text" --python $PY
  6. Verify and click

Common patterns across apps:

| Task | How to do it |
|---|---|
| Click a button | OCR find text → verify → click |
| Type in a field | OCR find field label → click field → type --text |
| Search for something | OCR find search box → click → type query → press return |
| Scroll a list | Get window bounds → scroll at window center with --x --y |
| Switch between apps | focus-app --name "OtherApp" → re-get bounds |
| Handle a dialog | Screenshot → OCR for dialog buttons → click appropriate one |
| Navigate menus | Click menu item → wait → screenshot → OCR new menu → click |
| Select from dropdown | Click dropdown → wait → OCR options → click selection |
| Read screen content | OCR the window → extract all text boxes |
| Verify an action | Screenshot before and after → compare or OCR for expected text |

App-specific adaptations:

| App type | Special considerations |
|---|---|
| Chat apps (WeChat, Slack, etc.) | Verify conversation title before typing; use insert-newline for multi-line; verify send mechanism |
| Browsers (Chrome, Safari, etc.) | Address bar at top; content area varies; may need to handle tabs |
| System Settings | Deep navigation; panels change; re-get bounds after each navigation |
| File managers (Finder, Explorer) | Sidebar + content area; double-click to open; path bar for navigation |
| Editors (VS Code, TextEdit, etc.) | Tab bar + editor area; use hotkeys for save/undo; type in editor area |

Text Input and Send Rules

Typing text

$PY scripts/desktop_ops.py type --text "your message"
  • Uses clipboard paste as primary method on all platforms — reliable for all languages including CJK
  • macOS: AppleScript "set the clipboard to" + Cmd+V (single osascript call)
  • Windows: PowerShell Set-Clipboard + Ctrl+V (falls back to clip.exe)
  • Linux: xclip + Ctrl+V
  • First click on the input field to focus it before typing

Multi-line messages

$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"
  • Use insert-newline for literal line breaks
  • Do NOT use \n in type --text — it may trigger send in some apps
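
One way to honor the no-`\n` rule is to split a message up front and interleave the two subcommands. A minimal sketch; `multiline_commands` is a hypothetical helper that only builds the subcommand arguments and does not run anything.

```python
def multiline_commands(message):
    """Split a message into desktop_ops.py subcommand argument lists.

    Each line becomes a `type --text` call; each line break becomes an
    `insert-newline` call, so no "\n" ever reaches `type --text`.
    """
    commands = []
    for index, line in enumerate(message.split("\n")):
        if index > 0:
            commands.append(["insert-newline"])
        if line:  # skip empty lines: nothing to type between two breaks
            commands.append(["type", "--text", line])
    return commands


for args in multiline_commands("first line\nsecond line"):
    print(args)
```

Each returned list would be appended to `$PY scripts/desktop_ops.py` and executed one at a time.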

Sending a message

  1. Preferred: Look for a visible send button (e.g., "发送" / "Send") via OCR, then click it
  2. Alternative: Use press --key return ONLY when the app is verified to use Enter-to-send
  3. Never guess which send method to use — verify first
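
The send rules amount to a three-way decision. A minimal sketch, assuming the caller already knows whether OCR found a send button and whether Enter-to-send has been verified for the app; `choose_send_action` is a hypothetical helper that returns subcommand arguments or `None`.

```python
def choose_send_action(send_button, enter_to_send_verified):
    """Pick the send method without guessing.

    send_button: (x, y) of a visible send button found via OCR, or None.
    enter_to_send_verified: True only if the app is known to send on Enter.
    Returns subcommand arguments, or None when neither method is verified.
    """
    if send_button is not None:
        x, y = send_button
        return ["click", "--x", str(x), "--y", str(y)]
    if enter_to_send_verified:
        return ["press", "--key", "return"]
    return None  # stop and verify the send mechanism first


print(choose_send_action((880, 640), False))
print(choose_send_action(None, True))
print(choose_send_action(None, False))
```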

Backend priority (macOS)

| Operation | Primary | Fallback |
|---|---|---|
| type | Clipboard paste | cliclick (ASCII only) |
| press | AppleScript key code | cliclick kp: |
| hotkey | cliclick kd:/t:/ku: | pyautogui |
| click | cliclick | pyautogui |

> Important: cliclick kp:return is NOT recognized by WeChat — always use AppleScript for key press.

> Important: cliclick t: silently drops CJK characters — always use clipboard paste for text input.


DPI / HiDPI / Retina (All Platforms)

Handled automatically. No manual DPI work needed.

| Platform | Common scales | Detection method |
|---|---|---|
| macOS Retina | 2.0x | screenshot pixels ÷ logical screen bounds |
| Windows HiDPI | 1.25x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
| Linux X11 | 1.0x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |

OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
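
Although the scripts handle scaling themselves, the arithmetic behind the table is worth seeing once. The sketch below assumes a box is an `(x, y, width, height)` tuple, which is a guess at the layout rather than the documented format; both function names are illustrative.

```python
def detect_scale(screenshot_width_px, logical_width):
    """Scale factor = screenshot pixels divided by logical screen width."""
    return screenshot_width_px / logical_width


def pixel_box_to_logical(pixel_box, dpi_scale):
    """Convert a raw-pixel box to logical coordinates usable for the mouse.

    Assumes pixel_box is (x, y, width, height) in screenshot pixels.
    """
    return tuple(round(value / dpi_scale) for value in pixel_box)


scale = detect_scale(2880, 1440)   # e.g. a Retina display
print(scale)
print(pixel_box_to_logical((900, 1040, 120, 40), scale))
```

Mouse commands such as `move` and `click` would take the logical values, never the raw pixel ones.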


CLI Quick Reference (EXACT parameter names)

CRITICAL: Use EXACTLY these names. Do NOT guess.

desktop_ops.py

$PY scripts/desktop_ops.py screenshot [--output PATH] [--x X --y Y --width W --height H] [--with-cursor]
$PY scripts/desktop_ops.py capture-region --x X --y Y --width W --height H [--output PATH] [--with-cursor]
$PY scripts/desktop_ops.py frontmost
$PY scripts/desktop_ops.py list-apps
$PY scripts/desktop_ops.py front-window-bounds [--app NAME]
$PY scripts/desktop_ops.py focus-app --name "App Name"
$PY scripts/desktop_ops.py move --x X --y Y [--duration SECONDS]
$PY scripts/desktop_ops.py click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py double-click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py drag --x1 X1 --y1 Y1 --x2 X2 --y2 Y2 [--duration SEC] [--button left]
$PY scripts/desktop_ops.py scroll --amount N [--x X --y Y] [--direction vertical|horizontal]
$PY scripts/desktop_ops.py mouse-position
$PY scripts/desktop_ops.py press --key KEY
$PY scripts/desktop_ops.py type --text "text to type"
$PY scripts/desktop_ops.py insert-newline [--count N]
$PY scripts/desktop_ops.py hotkey --keys cmd c
$PY scripts/desktop_ops.py screen-size
$PY scripts/desktop_ops.py pixel-color --x X --y Y
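
Because the parameter names must match exactly, it can help to build every command line through one function instead of formatting strings by hand. A minimal sketch; `desktop_ops_argv` is a hypothetical helper, and the resulting list is meant to be passed to something like `subprocess.run`.

```python
def desktop_ops_argv(py, subcommand, **options):
    """Build an argv list for scripts/desktop_ops.py.

    Keyword names map to flags (with_cursor -> --with-cursor).
    True adds a bare flag, False/None is skipped, and a list or tuple
    adds the flag followed by each value (e.g. --keys cmd c).
    """
    argv = [py, "scripts/desktop_ops.py", subcommand]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        elif isinstance(value, (list, tuple)):
            argv.append(flag)
            argv.extend(str(item) for item in value)
        else:
            argv.extend([flag, str(value)])
    return argv


print(desktop_ops_argv("/tmp/venv/bin/python", "click", x=450, y=520))
print(desktop_ops_argv("/tmp/venv/bin/python", "screenshot", with_cursor=True))
print(desktop_ops_argv("/tmp/venv/bin/python", "hotkey", keys=["cmd", "c"]))
```

The helper does not validate flag names, so the reference above remains the source of truth for what each subcommand accepts.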

ocr_text.py

$PY scripts/ocr_text.py --app "AppName" --python $PY [--region-label LABEL] [--lang auto]
$PY scripts/ocr_text.py --image /path/to/capture.png --python $PY [--lang auto]

target_resolver.py

$PY scripts/target_resolver.py --app "AppName" --text "text" --python $PY
$PY scripts/target_resolver.py --app "AppName" --template /path/icon.png --python $PY
$PY scripts/target_resolver.py --app "AppName" --text "text" --region-label LABEL --python $PY

task_context.py / cleanup_task.py

$PY scripts/task_context.py init --task-id "my-task"   # aliases: create, --name
$PY scripts/task_context.py show --task-id "my-task"
$PY scripts/cleanup_task.py --task-id "my-task"

window_regions.py

$PY scripts/window_regions.py --window-x X --window-y Y --window-width W --window-height H [--label LABEL]

Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action


Workflow Examples

Example 1: Click a button by text (any app)

1. python3 <SKILL_DIR>/scripts/first_run_setup.py --check   → ready: true, then set $PY
2. $PY task_context.py init --task-id "click-button"
3. $PY desktop_ops.py focus-app --name "AppName"
4. $PY desktop_ops.py front-window-bounds --app "AppName"    → {x, y, w, h}
5. $PY target_resolver.py --app "AppName" --text "OK" --python $PY
   → best_candidate: {x:450, y:520, within_window:true}
6. $PY desktop_ops.py move --x 450 --y 520
7. $PY desktop_ops.py screenshot --with-cursor               → verify cursor on "OK"
8. $PY desktop_ops.py click --x 450 --y 520
9. $PY desktop_ops.py screenshot                             → verify result
10. $PY cleanup_task.py --task-id "click-button"

Example 2: Type and search

1. $PY desktop_ops.py focus-app --name "Safari"
2. $PY target_resolver.py --app "Safari" --text "Search" --region-label top_search --python $PY
   → {x:300, y:80, within_window:true}
3. $PY desktop_ops.py click --x 300 --y 80
4. $PY desktop_ops.py type --text "hello world"
5. $PY desktop_ops.py press --key return
6. $PY desktop_ops.py screenshot                             → verify search results

Example 3: Send a chat message (WeChat, Slack, etc.)

1. $PY desktop_ops.py focus-app --name "WeChat"
2. $PY desktop_ops.py front-window-bounds --app "WeChat"
3. # Navigate to the right conversation (OCR sidebar or search)
4. $PY target_resolver.py --app "WeChat" --text "ContactName" --region-label left_sidebar --python $PY
5. $PY desktop_ops.py click --x <found_x> --y <found_y>
6. # Verify conversation is open
7. $PY desktop_ops.py screenshot → confirm conversation title
8. # Click the input field
9. $PY target_resolver.py --app "WeChat" --text "" --region-label bottom_input --python $PY
   OR: click at the bottom center of the window
10. $PY desktop_ops.py type --text "Hello!"
11. # Send: prefer visible send button; if not available, use press --key return
12. $PY target_resolver.py --app "WeChat" --text "发送" --python $PY
    IF found: $PY desktop_ops.py click --x <x> --y <y>
    ELSE: $PY desktop_ops.py press --key return
13. $PY desktop_ops.py screenshot → verify message sent

Example 4: Scroll a list and find an item

1. $PY desktop_ops.py focus-app --name "AppName"
2. $PY desktop_ops.py front-window-bounds --app "AppName"   → {x:100, y:50, w:800, h:600}
3. # Scroll down in the window center
   $PY desktop_ops.py scroll --amount -5 --x 500 --y 350
4. $PY desktop_ops.py screenshot                             → check if target visible
5. $PY target_resolver.py --app "AppName" --text "target item" --python $PY
6. IF not found: scroll more and retry (max 5 scrolls)
7. IF found: click it
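
The scroll point and the bounded retry in this example can be sketched as follows. The bounds dict uses the `x`, `y`, `width`, `height` keys from `front-window-bounds`; `find_target` and `scroll_once` are hypothetical callables standing in for the `target_resolver.py` and `scroll` commands.

```python
def window_center(bounds):
    """Center of the window in logical coordinates."""
    return (bounds["x"] + bounds["width"] // 2,
            bounds["y"] + bounds["height"] // 2)


def scroll_until_found(find_target, scroll_once, max_scrolls=5):
    """Look for the target, scrolling at most max_scrolls times.

    find_target() returns (x, y) or None; scroll_once() performs one scroll.
    Returns the point when found, or None after the scroll budget is spent.
    """
    for scrolls_done in range(max_scrolls + 1):
        point = find_target()       # re-run OCR on fresh state each time
        if point is not None:
            return point
        if scrolls_done < max_scrolls:
            scroll_once()
    return None


print(window_center({"x": 100, "y": 50, "width": 800, "height": 600}))
```

For the bounds in step 2 this yields the same (500, 350) point used by the `scroll` command in step 3.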

Example 5: Handle an unexpected dialog

1. # During any operation, if the expected UI doesn't match:
2. $PY desktop_ops.py screenshot → examine what's on screen
3. # If a dialog is visible, OCR it:
   $PY ocr_text.py --app "AppName" --python $PY
4. # Find and click the appropriate button (OK, Cancel, Allow, etc.)
   $PY target_resolver.py --app "AppName" --text "OK" --python $PY
5. $PY desktop_ops.py click --x <x> --y <y>
6. # After dialog is dismissed, re-get window bounds and continue
   $PY desktop_ops.py front-window-bounds --app "AppName"

Reference Documents

Load as needed:

| Document | When to read |
|---|---|
| references/workflow.md | Core 8-step closed loop |
| references/platform-macos.md | macOS-specific tools and permissions |
| references/platform-windows.md | Windows setup |
| references/platform-linux.md | Linux X11/Wayland setup |
| references/operation-patterns.md | Reusable task templates |
| references/validation-patterns.md | Two-stage validation |
| references/precise-targeting.md | 5-layer precision targeting |
| references/target-providers.md | Provider ordering and fallback contract |
| references/coordinate-reconstruction.md | Rebuild click coordinates from screenshot evidence |
| references/chat-app-macos.md | Chat app workflow |
| references/app-wechat-desktop.md | Cross-platform WeChat guidance |
| references/cleanup-rules.md | Cleanup timing and scope |
| references/collaboration-rules.md | When multi-agent collaboration is justified |
| references/example-cases.md | Repeatable task examples |
| references/reproducible-setup.md | Host bring-up checklist |

Scope

Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.

Hard Rules

  1. Always run auto-setup gate first
  2. Always use EXACT parameter names from CLI reference — never guess
  3. Always scope OCR to the target app window — NEVER full-screen OCR
  4. Always: focus-app → front-window-bounds → OCR within window → verify → act
  5. Always pass --python $PY to ocr_text.py and target_resolver.py
  6. Always verify coordinates are within window bounds before clicking
  7. Always re-get window bounds after any UI state change (login, dialog, navigation)
  8. Use insert-newline for line breaks; never use \n in type --text
  9. For send actions: prefer visible send button; use press --key return only when verified
  10. One action at a time; verify after each
  11. Maximum 3 retries per action; each retry must recapture fresh state
  12. Cleanup is mandatory at task end
  13. If verification fails, recapture and rebuild — do not retry blindly

Version History

1 version in total

  • v1.0.3 (current)
    2026-03-31 04:56 · Safe · Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
