CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.
Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.
npm install -g agent-desktop
# or
bun install -g --trust agent-desktop
Requires macOS 12+ with Accessibility permission granted to your terminal. Screen Recording permission is also required for screenshots.
Detailed documentation is split into focused reference files. Read them as needed:
| Reference | Contents |
|---|---|
| ----------- | ---------- |
references/commands-observation.md | snapshot, find, get, is, screenshot, list-surfaces — all flags, output examples |
references/commands-interaction.md | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse — choosing the right command |
references/commands-system.md | launch, close, windows, clipboard, wait, batch, status, permissions, version |
references/workflows.md | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
references/macos.md | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, Notification Center, troubleshooting |
Use progressive skeleton traversal as the default approach. It reduces token consumption 78-96% for dense apps by exploring the UI in two phases: a shallow skeleton overview, then targeted drill-downs into regions of interest.
1. SKELETON → agent-desktop snapshot --skeleton --app "App" -i --compact
Parse the overview. Identify the region containing your target.
Regions show children_count (e.g., "Sidebar" with children_count: 42).
Named containers at truncation boundary have refs for drill-down.
Keep the returned snapshot_id.
2. DRILL → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
Expand the target region. Now you see its interactive elements.
3. ACT → agent-desktop click @e12 --snapshot <snapshot_id> (or type, select, toggle...)
4. VERIFY → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
Re-drill the same region to confirm the state change.
Scoped invalidation: only @e3's subtree refs are replaced.
5. REPEAT → Continue drilling other regions or acting as needed.
When to skip skeleton and use full snapshot instead:
find insteadWhen skeleton shines:
@e1, @e2, @e3...available_actions)snapshot_id; ref-consuming commands accept --snapshot last_refmap.json is only a latest-snapshot inspection artifact. The command path uses snapshot-scoped storage.--root @e3 only replaces refs from @e3's previous drill — refs from other regions and the skeleton itself are preservedEvery command returns a JSON envelope on stdout:
Success: { "version": "2.0", "ok": true, "command": "snapshot", "data": { ... } }
Error: { "version": "2.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }
Exit codes: 0 success, 1 structured error, 2 argument error.
| Code | Meaning | Recovery |
|---|---|---|
| ------ | --------- | ---------- |
PERM_DENIED | Accessibility or Screen Recording permission not granted | Grant the named permission in System Settings |
ELEMENT_NOT_FOUND | Ref cannot be resolved against the live UI | Re-run snapshot, use fresh ref |
APP_NOT_FOUND | App not running | Launch it first |
ACTION_FAILED | AX action rejected | Try an explicit alternative command |
ACTION_NOT_SUPPORTED | Element can't do this | Use different command |
STALE_REF | Ref from old snapshot | Re-run snapshot |
SNAPSHOT_NOT_FOUND | Snapshot ID is missing or expired | Run snapshot again and use the returned ID |
POLICY_DENIED | A physical/headed path was blocked | Use an explicit mouse/focus/keyboard command if physical interaction is intended |
WINDOW_NOT_FOUND | No matching window | Check app name, use list-windows |
PLATFORM_NOT_SUPPORTED | Adapter method not implemented on this platform | Use a supported platform adapter |
TIMEOUT | Wait condition not met | Increase --timeout |
INVALID_ARGS | Bad arguments | Check command syntax |
NOTIFICATION_NOT_FOUND | Notification index no longer exists | Re-run list-notifications |
agent-desktop snapshot --skeleton --app "App" -i --compact # Skeleton overview (preferred)
agent-desktop snapshot --root @e3 -i --compact # Drill into region
agent-desktop snapshot --app "App" -i # Full tree (simple apps)
agent-desktop snapshot --app "App" --surface menu -i # Surface snapshot
agent-desktop screenshot --app "App" out.png # PNG screenshot
agent-desktop find --app "App" --role button # Search elements
agent-desktop get @e1 --snapshot <snapshot_id> --property text # Read element property
agent-desktop is @e1 --snapshot <snapshot_id> --property enabled # Check element state
agent-desktop list-surfaces --app "App" # Available surfaces
agent-desktop click @e5 --snapshot <snapshot_id> # AX-first click, no cursor move by default
agent-desktop double-click @e3 # AXOpen; physical double-click uses mouse-click --count 2
agent-desktop triple-click @e2 # Physical triple-click uses mouse-click --count 3
agent-desktop right-click @e5 # Right-click; menu returned when verified
agent-desktop type @e2 --snapshot <snapshot_id> "hello" # Headless AX text insertion when supported
agent-desktop set-value @e2 "new value" # Set value directly
agent-desktop clear @e2 # Clear element value
agent-desktop focus @e2 # Set keyboard focus
agent-desktop select @e4 "Option B" # Select dropdown/list option
agent-desktop toggle @e6 # Toggle checkbox/switch
agent-desktop check @e6 # Idempotent check
agent-desktop uncheck @e6 # Idempotent uncheck
agent-desktop expand @e7 # Expand disclosure
agent-desktop collapse @e7 # Collapse disclosure
agent-desktop scroll @e1 --direction down # Scroll element
agent-desktop scroll-to @e8 # Scroll into view
agent-desktop press cmd+c # Key combo
agent-desktop press return --app "App" # Targeted key press
agent-desktop key-down shift # Hold key
agent-desktop key-up shift # Release key
agent-desktop hover @e5 # Explicit cursor movement
agent-desktop hover --xy 500,300 # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5 # Drag between elements
agent-desktop mouse-click --xy 500,300 # Click at coordinates
agent-desktop mouse-move --xy 100,200 # Move cursor
agent-desktop mouse-down --xy 100,200 # Press mouse button
agent-desktop mouse-up --xy 300,400 # Release mouse button
agent-desktop launch "System Settings" # Launch and wait
agent-desktop close-app "TextEdit" # Quit gracefully
agent-desktop close-app "TextEdit" --force # Force kill
agent-desktop list-windows --app "Finder" # List windows
agent-desktop list-apps # List running GUI apps
agent-desktop focus-window --app "Finder" # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"
agent-desktop list-notifications # List all notifications
agent-desktop list-notifications --app "Slack" # Filter by app
agent-desktop list-notifications --text "deploy" --limit 5 # Filter by text
agent-desktop dismiss-notification 1 # Dismiss by index
agent-desktop dismiss-all-notifications # Dismiss all
agent-desktop dismiss-all-notifications --app "Slack" # Dismiss all from app
agent-desktop notification-action 1 "Reply" --expected-app Slack # Click action (with NC reorder guard)
agent-desktop clipboard-get # Read clipboard
agent-desktop clipboard-set "text" # Write to clipboard
agent-desktop clipboard-clear # Clear clipboard
agent-desktop wait 1000 # Pause 1 second
agent-desktop wait --element @e5 --snapshot <snapshot_id> --timeout 5000 # Wait for element
agent-desktop wait --window "Title" # Wait for window
agent-desktop wait --text "Done" --app "App" # Wait for text
agent-desktop wait --menu --app "App" # Wait for menu surface
agent-desktop wait --menu-closed --app "App" # Wait for menu dismissal
agent-desktop wait --notification --app "App" # Wait for new notification
agent-desktop status # Health check
agent-desktop permissions # Check permission
agent-desktop permissions --request # Trigger permission dialog
agent-desktop version --json # Version info
agent-desktop batch '[...]' --stop-on-error # Batch uses the same typed command path as CLI
agent-desktop skills # List bundled skill docs
agent-desktop skills get desktop --full # Load this skill + all references
--skeleton -i --compact for dense apps. Drill into regions with --root @ref. Full snapshot only for simple apps.-i --compact flags. Filters to interactive elements and collapses empty wrappers, minimizing tokens.snapshot_id for deterministic multi-step use; re-drill the affected region after any UI-changing action. Scoped invalidation keeps other refs intact.click @e5 > mouse-click --xy 500,300.wait for async UI. After launch/dialog triggers, wait for expected state.permissions on first use; screenshots also need Screen Recording.error.code and follow error.suggestion.find for targeted searches. Faster than any snapshot when you know role/name.snapshot --surface menu for menus, --surface sheet for dialogs. Never --skeleton for surfaces — they're already focused.focus, press, hover, drag, or mouse-* commands only when physical/headed interaction is intended.共 5 个版本