Agent Desktop

Desktop automation via native OS accessibility trees using the agent-desktop CLI. Use when an AI agent needs to observe, interact with, or automate desktop applications.
lahfir
Data Analysis · clawhub · v0.1.13 · 5 versions · 99841.9 · Key: not required
#accessibility #ai-agent #cli #desktop-automation #gui-automation #latest

Overview

agent-desktop

CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.

Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.

Installation

npm install -g agent-desktop
# or
bun install -g --trust agent-desktop

Requires macOS 12+ with Accessibility permission granted to your terminal. Screen Recording permission is also required for screenshots.
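
A calling agent may want to confirm these prerequisites before its first command. The following is a minimal sketch, not part of agent-desktop: the helper names `meets_macos_requirement` and `preflight` are hypothetical. It only checks that the binary is on PATH and that the OS reports version 12 or newer; it does not inspect Accessibility or Screen Recording permission.

```python
import platform
import shutil
from typing import List


def meets_macos_requirement(version: str, minimum_major: int = 12) -> bool:
    """Return True if a macOS version string such as '13.4.1' is >= minimum_major."""
    if not version:
        return False  # platform.mac_ver() yields '' on non-macOS systems
    try:
        major = int(version.split(".")[0])
    except ValueError:
        return False
    return major >= minimum_major


def preflight() -> List[str]:
    """Collect human-readable problems; an empty list means the basics look fine."""
    problems = []
    if shutil.which("agent-desktop") is None:
        problems.append("agent-desktop not found on PATH")
    if not meets_macos_requirement(platform.mac_ver()[0]):
        problems.append("macOS 12 or newer is required")
    return problems
```

An agent could call `preflight()` once at startup and report any returned problems before attempting a snapshot.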

Reference Files

Detailed documentation is split into focused reference files. Read them as needed:

| Reference | Contents |
|---|---|
| references/commands-observation.md | snapshot, find, get, is, screenshot, list-surfaces: all flags, output examples |
| references/commands-interaction.md | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse: choosing the right command |
| references/commands-system.md | launch, close, windows, clipboard, wait, batch, status, permissions, version |
| references/workflows.md | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
| references/macos.md | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, Notification Center, troubleshooting |

The Observe-Act Loop (Progressive Skeleton Traversal)

Use progressive skeleton traversal as the default approach. It reduces token consumption by 78-96% for dense apps by exploring the UI in two phases: a shallow skeleton overview, then targeted drill-downs into regions of interest.

1. SKELETON → agent-desktop snapshot --skeleton --app "App" -i --compact
   Parse the overview. Identify the region containing your target.
   Regions show children_count (e.g., "Sidebar" with children_count: 42).
   Named containers at truncation boundary have refs for drill-down.
   Keep the returned snapshot_id.

2. DRILL    → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
   Expand the target region. Now you see its interactive elements.

3. ACT      → agent-desktop click @e12 --snapshot <snapshot_id>  (or type, select, toggle...)

4. VERIFY   → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
   Re-drill the same region to confirm the state change.
   Scoped invalidation: only @e3's subtree refs are replaced.

5. REPEAT   → Continue drilling other regions or acting as needed.
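
The steps above can be sketched as argument lists that a calling agent would pass to its process runner. This is an illustrative sketch: the function names are hypothetical, the flags are copied from the steps above, and nothing here executes the CLI. Building argv lists rather than shell strings avoids quoting problems with app names that contain spaces.

```python
from typing import List


def skeleton_cmd(app: str) -> List[str]:
    """Step 1: shallow overview of the whole app."""
    return ["agent-desktop", "snapshot", "--skeleton", "--app", app, "-i", "--compact"]


def drill_cmd(region_ref: str, snapshot_id: str) -> List[str]:
    """Steps 2 and 4: expand (or re-expand) one region of the skeleton."""
    return ["agent-desktop", "snapshot", "--root", region_ref,
            "--snapshot", snapshot_id, "-i", "--compact"]


def click_cmd(target_ref: str, snapshot_id: str) -> List[str]:
    """Step 3: act on an element found by the drill."""
    return ["agent-desktop", "click", target_ref, "--snapshot", snapshot_id]


def drill_act_verify(region_ref: str, target_ref: str, snapshot_id: str) -> List[List[str]]:
    """Commands for steps 2-4, in order, once the skeleton's snapshot_id is known."""
    drill = drill_cmd(region_ref, snapshot_id)
    return [drill, click_cmd(target_ref, snapshot_id), list(drill)]
```

In practice the agent reads the output of each command before choosing the next ref, so the three commands would be issued one at a time rather than planned up front.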

When to skip skeleton and use full snapshot instead:

  • Simple apps with few elements (Finder, Calculator, TextEdit)
  • You already know the exact element name — use find instead
  • Surface snapshots (menus, sheets, alerts) — these are already focused

When skeleton shines:

  • Dense Electron apps (Slack, VS Code, Discord, Notion)
  • Any app where full snapshot exceeds ~50 refs
  • Multi-region workflows (sidebar + main content + toolbar)
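
One way a calling agent might encode these guidelines is a small decision function. The sketch below is an assumption-laden illustration: the names `Observation` and `choose_observation` are hypothetical, the threshold of 50 comes from the "~50 refs" guideline above, and the priority order between the rules is a judgment call rather than documented behavior.

```python
from enum import Enum
from typing import Optional


class Observation(Enum):
    FIND = "find"
    SURFACE_SNAPSHOT = "surface snapshot"
    FULL_SNAPSHOT = "full snapshot"
    SKELETON = "skeleton"


DENSE_REF_THRESHOLD = 50  # the "~50 refs" guideline; approximate by nature


def choose_observation(
    *,
    is_surface: bool = False,
    knows_element_name: bool = False,
    estimated_refs: Optional[int] = None,
) -> Observation:
    """Pick an observation strategy; skeleton is the default when nothing else applies."""
    if is_surface:
        return Observation.SURFACE_SNAPSHOT  # menus, sheets, alerts are already focused
    if knows_element_name:
        return Observation.FIND
    if estimated_refs is not None and estimated_refs <= DENSE_REF_THRESHOLD:
        return Observation.FULL_SNAPSHOT  # simple apps with few elements
    return Observation.SKELETON
```

When the agent has no estimate of how dense the app is, the function falls back to the skeleton, matching the default recommended above.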

Ref System

  • Refs assigned depth-first: @e1, @e2, @e3...
  • Only interactive elements get refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell
  • In skeleton mode, named/described containers at truncation boundary also get refs (drill-down targets with empty available_actions)
  • Static text, groups, containers remain in tree for context but have no ref
  • Refs are deterministic within a snapshot but NOT stable across snapshots if the UI has changed
  • Every snapshot returns snapshot_id; ref-consuming commands accept --snapshot
  • last_refmap.json is only a latest-snapshot inspection artifact. The command path uses snapshot-scoped storage.
  • After any action that changes UI, re-drill the affected region or re-snapshot
  • Scoped invalidation: re-drilling --root @e3 only replaces refs from @e3's previous drill — refs from other regions and the skeleton itself are preserved
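
Scoped invalidation can be pictured with a toy model. The class below is not how agent-desktop stores refs (its storage is internal); `RefStore` and its methods are hypothetical names, and the ref numbers used in the usage example are invented. It only demonstrates the rule stated above: re-drilling a root replaces the refs from that root's previous drill and leaves everything else alone.

```python
from typing import Dict, Optional


class RefStore:
    """Toy model of snapshot-scoped refs with scoped invalidation."""

    def __init__(self, skeleton_refs: Dict[str, str]) -> None:
        # Maps ref -> description, e.g. {"@e3": "Sidebar"}.
        self._skeleton = dict(skeleton_refs)
        # Maps a drilled root ref -> the refs produced by its latest drill.
        self._drills: Dict[str, Dict[str, str]] = {}

    def drill(self, root: str, refs: Dict[str, str]) -> None:
        """Record a drill of `root`, replacing only that root's previous drill."""
        if self.resolve(root) is None:
            raise KeyError(f"{root} is not a known ref")
        self._drills[root] = dict(refs)

    def resolve(self, ref: str) -> Optional[str]:
        """Look a ref up in the skeleton first, then in every live drill."""
        if ref in self._skeleton:
            return self._skeleton[ref]
        for refs in self._drills.values():
            if ref in refs:
                return refs[ref]
        return None  # the real CLI would report a structured error instead
```

For example, after drilling `@e3` and `@e4` and then re-drilling `@e3`, refs from the first `@e3` drill no longer resolve, while refs from `@e4` and the skeleton still do.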

JSON Output Contract

Every command returns a JSON envelope on stdout:

Success: { "version": "2.0", "ok": true, "command": "snapshot", "data": { ... } }

Error: { "version": "2.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }

Exit codes: 0 success, 1 structured error, 2 argument error.
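
A calling agent can turn this envelope into either a data dictionary or an exception. The sketch below assumes only the envelope fields shown above; `AgentDesktopError` and `parse_envelope` are hypothetical names, and the `"UNKNOWN"` fallback is a placeholder of this sketch, not a code emitted by the CLI.

```python
import json
from typing import Any, Dict


class AgentDesktopError(Exception):
    """Raised when the envelope reports ok: false."""

    def __init__(self, command: str, code: str, message: str, suggestion: str) -> None:
        super().__init__(f"{command} failed with {code}: {message}")
        self.command = command
        self.code = code
        self.suggestion = suggestion


def parse_envelope(stdout: str) -> Dict[str, Any]:
    """Return the `data` payload on success, raise AgentDesktopError otherwise."""
    envelope = json.loads(stdout)
    if envelope.get("ok") is True:
        return envelope.get("data", {})
    error = envelope.get("error", {})
    raise AgentDesktopError(
        command=envelope.get("command", "?"),
        code=error.get("code", "UNKNOWN"),  # placeholder, not a real CLI code
        message=error.get("message", ""),
        suggestion=error.get("suggestion", ""),
    )
```

Keeping `code` and `suggestion` as attributes lets the caller branch on the error code without re-parsing the message text.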

Error Codes

| Code | Meaning | Recovery |
|---|---|---|
| PERM_DENIED | Accessibility or Screen Recording permission not granted | Grant the named permission in System Settings |
| ELEMENT_NOT_FOUND | Ref cannot be resolved against the live UI | Re-run snapshot, use fresh ref |
| APP_NOT_FOUND | App not running | Launch it first |
| ACTION_FAILED | AX action rejected | Try an explicit alternative command |
| ACTION_NOT_SUPPORTED | Element can't do this | Use different command |
| STALE_REF | Ref from old snapshot | Re-run snapshot |
| SNAPSHOT_NOT_FOUND | Snapshot ID is missing or expired | Run snapshot again and use the returned ID |
| POLICY_DENIED | A physical/headed path was blocked | Use an explicit mouse/focus/keyboard command if physical interaction is intended |
| WINDOW_NOT_FOUND | No matching window | Check app name, use list-windows |
| PLATFORM_NOT_SUPPORTED | Adapter method not implemented on this platform | Use a supported platform adapter |
| TIMEOUT | Wait condition not met | Increase --timeout |
| INVALID_ARGS | Bad arguments | Check command syntax |
| NOTIFICATION_NOT_FOUND | Notification index no longer exists | Re-run list-notifications |
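
Some of these recoveries map directly onto a follow-up command, which a calling agent could look up mechanically. The sketch below is one possible encoding, with hypothetical names (`RESNAPSHOT_CODES`, `recovery_command`); it returns `None` for codes whose recovery needs a human or a fresh decision by the agent, such as PERM_DENIED or ACTION_FAILED.

```python
from typing import List, Optional

# Codes whose listed recovery is to take a fresh snapshot.
RESNAPSHOT_CODES = {"ELEMENT_NOT_FOUND", "STALE_REF", "SNAPSHOT_NOT_FOUND"}


def recovery_command(code: str, app: str) -> Optional[List[str]]:
    """Return the argv of an obvious follow-up command, or None if there is none."""
    if code in RESNAPSHOT_CODES:
        return ["agent-desktop", "snapshot", "--app", app, "-i"]
    if code == "APP_NOT_FOUND":
        return ["agent-desktop", "launch", app]
    if code == "WINDOW_NOT_FOUND":
        return ["agent-desktop", "list-windows", "--app", app]
    if code == "NOTIFICATION_NOT_FOUND":
        return ["agent-desktop", "list-notifications"]
    # PERM_DENIED, ACTION_FAILED, POLICY_DENIED, TIMEOUT, etc. need a decision
    # by the agent or the user rather than a fixed command.
    return None
```

After a re-snapshot the agent still has to pick a fresh ref itself, so this helper only covers the first half of the recovery.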

Command Quick Reference (54 commands)

Observation

agent-desktop snapshot --skeleton --app "App" -i --compact  # Skeleton overview (preferred)
agent-desktop snapshot --root @e3 -i --compact              # Drill into region
agent-desktop snapshot --app "App" -i                       # Full tree (simple apps)
agent-desktop snapshot --app "App" --surface menu -i        # Surface snapshot
agent-desktop screenshot --app "App" out.png                # PNG screenshot
agent-desktop find --app "App" --role button                # Search elements
agent-desktop get @e1 --snapshot <snapshot_id> --property text       # Read element property
agent-desktop is @e1 --snapshot <snapshot_id> --property enabled     # Check element state
agent-desktop list-surfaces --app "App"                     # Available surfaces

Interaction

agent-desktop click @e5 --snapshot <snapshot_id> # AX-first click, no cursor move by default
agent-desktop double-click @e3                  # AXOpen; physical double-click uses mouse-click --count 2
agent-desktop triple-click @e2                  # Physical triple-click uses mouse-click --count 3
agent-desktop right-click @e5                   # Right-click; menu returned when verified
agent-desktop type @e2 --snapshot <snapshot_id> "hello"  # Headless AX text insertion when supported
agent-desktop set-value @e2 "new value"         # Set value directly
agent-desktop clear @e2                         # Clear element value
agent-desktop focus @e2                         # Set keyboard focus
agent-desktop select @e4 "Option B"             # Select dropdown/list option
agent-desktop toggle @e6                        # Toggle checkbox/switch
agent-desktop check @e6                         # Idempotent check
agent-desktop uncheck @e6                       # Idempotent uncheck
agent-desktop expand @e7                        # Expand disclosure
agent-desktop collapse @e7                      # Collapse disclosure
agent-desktop scroll @e1 --direction down       # Scroll element
agent-desktop scroll-to @e8                     # Scroll into view

Keyboard & Mouse

agent-desktop press cmd+c                       # Key combo
agent-desktop press return --app "App"          # Targeted key press
agent-desktop key-down shift                    # Hold key
agent-desktop key-up shift                      # Release key
agent-desktop hover @e5                         # Explicit cursor movement
agent-desktop hover --xy 500,300                # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5          # Drag between elements
agent-desktop mouse-click --xy 500,300          # Click at coordinates
agent-desktop mouse-move --xy 100,200           # Move cursor
agent-desktop mouse-down --xy 100,200           # Press mouse button
agent-desktop mouse-up --xy 300,400             # Release mouse button

App & Window

agent-desktop launch "System Settings"          # Launch and wait
agent-desktop close-app "TextEdit"              # Quit gracefully
agent-desktop close-app "TextEdit" --force      # Force kill
agent-desktop list-windows --app "Finder"       # List windows
agent-desktop list-apps                         # List running GUI apps
agent-desktop focus-window --app "Finder"       # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"

Notifications

agent-desktop list-notifications                # List all notifications
agent-desktop list-notifications --app "Slack"  # Filter by app
agent-desktop list-notifications --text "deploy" --limit 5  # Filter by text
agent-desktop dismiss-notification 1            # Dismiss by index
agent-desktop dismiss-all-notifications         # Dismiss all
agent-desktop dismiss-all-notifications --app "Slack"  # Dismiss all from app
agent-desktop notification-action 1 "Reply" --expected-app Slack   # Click action (with NC reorder guard)

Clipboard

agent-desktop clipboard-get                     # Read clipboard
agent-desktop clipboard-set "text"              # Write to clipboard
agent-desktop clipboard-clear                   # Clear clipboard

Wait

agent-desktop wait 1000                         # Pause 1 second
agent-desktop wait --element @e5 --snapshot <snapshot_id> --timeout 5000 # Wait for element
agent-desktop wait --window "Title"             # Wait for window
agent-desktop wait --text "Done" --app "App"    # Wait for text
agent-desktop wait --menu --app "App"           # Wait for menu surface
agent-desktop wait --menu-closed --app "App"    # Wait for menu dismissal
agent-desktop wait --notification --app "App"   # Wait for new notification

System

agent-desktop status                            # Health check
agent-desktop permissions                       # Check permission
agent-desktop permissions --request             # Trigger permission dialog
agent-desktop version --json                    # Version info
agent-desktop batch '[...]' --stop-on-error     # Batch uses the same typed command path as CLI
agent-desktop skills                            # List bundled skill docs
agent-desktop skills get desktop --full         # Load this skill + all references

Key Principles for Agents

  1. Skeleton first, drill second. Start with --skeleton -i --compact for dense apps. Drill into regions with --root @ref. Full snapshot only for simple apps.
  2. Use -i --compact flags. Filters to interactive elements and collapses empty wrappers, minimizing tokens.
  3. Refs are snapshot-scoped. Keep snapshot_id for deterministic multi-step use; re-drill the affected region after any UI-changing action. Scoped invalidation keeps other refs intact.
  4. Prefer refs over coordinates. click @e5 > mouse-click --xy 500,300.
  5. Use wait for async UI. After launch/dialog triggers, wait for expected state.
  6. Check permissions first. Run permissions on first use; screenshots also need Screen Recording.
  7. Handle errors. Parse error.code and follow error.suggestion.
  8. Use find for targeted searches. Faster than any snapshot when you know role/name.
  9. Use surfaces for overlays. snapshot --surface menu for menus, --surface sheet for dialogs. Never --skeleton for surfaces — they're already focused.
  10. Batch for performance. Multiple commands in one invocation.
  11. Headless by default. Ref actions use semantic AX paths and block silent focus stealing, cursor movement, keyboard synthesis, and pasteboard insertion. Use explicit focus, press, hover, drag, or mouse-* commands only when physical/headed interaction is intended.
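
Several of these principles (permissions first, skeleton then drill, snapshot-scoped refs, error handling) can be combined in one hedged sketch of the calling side. All names below are hypothetical. The code assumes the snapshot ID is exposed as `data.snapshot_id`, which the text above does not confirm, and it leaves both decisions (which region, which target) to callbacks supplied by the agent, since agent-desktop itself makes no choices.

```python
import json
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]
Picker = Callable[[dict], str]


def cli_runner(argv: List[str]) -> str:
    """Run agent-desktop with the given arguments and return its stdout."""
    completed = subprocess.run(["agent-desktop", *argv], capture_output=True, text=True)
    return completed.stdout


def call(runner: Runner, argv: List[str]) -> dict:
    """Parse the JSON envelope; surface error.code and error.suggestion on failure."""
    envelope = json.loads(runner(argv))
    if not envelope["ok"]:
        error = envelope["error"]
        raise RuntimeError(f'{error["code"]}: {error.get("suggestion", "")}')
    return envelope["data"]


def click_in_region(runner: Runner, app: str, pick_region: Picker, pick_target: Picker) -> dict:
    """Permissions check, skeleton, drill, click, then re-drill to verify."""
    call(runner, ["permissions"])
    skeleton = call(runner, ["snapshot", "--skeleton", "--app", app, "-i", "--compact"])
    snapshot_id = skeleton["snapshot_id"]  # ASSUMPTION: field location is not documented here
    drill = ["snapshot", "--root", pick_region(skeleton),
             "--snapshot", snapshot_id, "-i", "--compact"]
    region = call(runner, drill)
    call(runner, ["click", pick_target(region), "--snapshot", snapshot_id])
    return call(runner, drill)  # verification snapshot of the same region
```

Passing the runner as a parameter keeps the flow testable without a desktop session: a fake runner can return canned envelopes, while `cli_runner` is used for real invocations.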

Version History

5 versions in total

  • v0.1.13 (current)
    2026-05-23 22:44 Safe, Safe
  • v0.1.12
    2026-05-21 12:26 Safe, Safe
  • v0.1.11
    2026-05-07 03:29 Safe, Safe
  • v0.1.10
    2026-05-01 00:37 Safe
  • v0.1.8
    2026-03-30 03:22 Safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
