Vision-based desktop automation skills for OpenClaw agents on macOS. See, learn, click: any app.
# Add to your Claude Code skills
git clone https://github.com/Fzkuji/GUIClaw

You ARE the agent loop. Every GUI task follows this flow, in order:
INTENT MATCH → OBSERVE → ENSURE APP READY → ACT → VERIFY → SAVE WORKFLOW → REPORT
Each step has detailed instructions in its own skill file:
| Step | Skill | When to read |
|------|-------|-------------|
| Observe | skills/gui-observe/SKILL.md | Before any action: screenshot, OCR, identify state |
| Learn | skills/gui-learn/SKILL.md | App not in memory, or match rate < 80% |
| Act | skills/gui-act/SKILL.md | Clicking, typing, sending messages, waiting for UI |
| Memory | skills/gui-memory/SKILL.md | Visual memory: profiles, components, pages, CRUD, cleanup |
| Workflow | skills/gui-workflow/SKILL.md | Intent matching, saving/replaying workflows, meta-workflows |
| Setup | skills/gui-setup/SKILL.md | First-time setup on a new machine |
Read the relevant sub-skill when you reach that step. You don't need to read all of them upfront.
All GUI operations go through scripts/agent.py. Do not call app_memory.py or gui_agent.py directly.
source ~/gui-agent-env/bin/activate
# Core operations
python3 scripts/agent.py open --app AppName
python3 scripts/agent.py learn --app AppName
python3 scripts/agent.py detect --app AppName
python3 scripts/agent.py click --app AppName --component ButtonName
python3 scripts/agent.py list --app AppName
python3 scripts/agent.py read_screen --app AppName
python3 scripts/agent.py wait_for --app AppName --component X
python3 scripts/agent.py cleanup --app AppName
python3 scripts/agent.py navigate --url "https://example.com"
python3 scripts/agent.py workflows --app AppName
python3 scripts/agent.py all_workflows
# Messaging
python3 scripts/agent.py send_message --app WeChat --contact "John" --message "see you tomorrow"
python3 scripts/agent.py read_messages --app WeChat --contact "John"
Also built in: a wait_for command (template-match polling, no blind clicks), mandatory timing & token delta reporting, and a multi-window fix (selects the largest window).

You: "Send a message to John in WeChat saying see you tomorrow"
OBSERVE → Screenshot, identify current state
├── Current app: Finder (not WeChat)
└── Action: need to switch to WeChat
STATE → Check WeChat memory
├── Learned before? Yes (24 components)
├── OCR visible text: ["Chat", "Cowork", "Code", "Search", ...]
├── State identified: "initial" (89% match)
└── Components for this state: 18 → use these for matching
NAVIGATE → Find contact "John"
├── Template match search_bar → found (conf=0.96) → click
├── Paste "John" into search field (clipboard → Cmd+V)
├── OCR search results → found → click
└── New state: "click:John" (chat opened)
VERIFY → Confirm correct chat opened (see the pre-verify sketch after this walkthrough)
├── OCR chat header → "John" ✓
└── Wrong contact? → ABORT
ACT → Send message
├── Click input field (template match)
├── Paste "see you tomorrow" (clipboard → Cmd+V)
└── Press Enter
CONFIRM → Verify message sent
├── OCR chat area → "see you tomorrow" visible ✓
└── Done
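The VERIFY step above is what prevents messaging the wrong person. A minimal sketch of that guard, built on the read_screen command (agent.py's send_message performs its own verification; this only illustrates the idea, and the "John" example is from the walkthrough above):

```python
import subprocess
import sys

def chat_shows_contact(app: str, contact: str) -> bool:
    """OCR the current window and check that the expected contact name is on screen."""
    result = subprocess.run(
        ["python3", "scripts/agent.py", "read_screen", "--app", app],
        capture_output=True, text=True,
    )
    return contact in result.stdout

if not chat_shows_contact("WeChat", "John"):
    sys.exit("Wrong contact open - aborting before anything is sent.")
```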
agent.py automatically handles the low-level mechanics; the loop itself is your job. The steps, each with its own detail file:
INTENT MATCH
→ Details: skills/gui-workflow/SKILL.md
Match the user request to saved workflows before doing anything. If matched, use the workflow steps as the plan. If not, proceed and save after success.

OBSERVE
→ Details: skills/gui-observe/SKILL.md
Screenshot, identify the current state. Record session_status for token reporting.

ENSURE APP READY
→ Details: skills/gui-learn/SKILL.md
Check if the app is in memory. If not → learn. If match rate < 80% → re-learn. This is YOUR responsibility; do not wait for the user. (A decision sketch follows these steps.)

LEARN (when needed)
→ Details: skills/gui-learn/SKILL.md
Detect all components (YOLO + OCR), identify them, filter, and save to memory. Privacy check: delete personal info.

ACT
→ Details: skills/gui-act/SKILL.md
Execute clicks, typing, sending. Pre-verify before every click. Pre-verify the contact before every message send.

VERIFY
→ Details: skills/gui-act/SKILL.md
Screenshot after every action. Did the expected change happen? If not → re-observe.

SAVE WORKFLOW
→ Details: skills/gui-workflow/SKILL.md
Save successful multi-step sequences for future replay.
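A sketch of the ENSURE APP READY decision. The helper names and the output parsing are assumptions; the real match-rate figure comes from agent.py's detection output, whose exact format may differ:

```python
import re
import subprocess

AGENT = "scripts/agent.py"

def run(*args: str) -> str:
    """Run an agent.py subcommand and return its stdout."""
    result = subprocess.run(["python3", AGENT, *args], capture_output=True, text=True)
    return result.stdout

def ensure_app_ready(app: str, threshold: float = 0.80) -> None:
    """Learn the app if it is unknown, or re-learn if the state match rate is too low."""
    listing = run("list", "--app", app)
    if not listing.strip():          # assumption: empty output means no memory yet
        run("learn", "--app", app)
        return
    detect_out = run("detect", "--app", app)
    # Assumption: detection output contains a figure like "89% match".
    m = re.search(r"(\d+)\s*%\s*match", detect_out)
    if m is None or int(m.group(1)) / 100 < threshold:
        run("learn", "--app", app)   # memory is missing or stale: re-learn
```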
REPORT
Every GUI task ends with a report:
⏱ 45.2s | +10k tokens (85k→95k) | 3 screenshots, 2 clicks, 1 learn
Compare the session_status recorded during the initial OBSERVE step with the value now.
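A sketch of how the timing and token-delta line could be assembled. TaskReport is a hypothetical helper, not part of agent.py; session_status is assumed to expose a token count at the start of the task:

```python
import time

class TaskReport:
    """Collects timing, token delta, and tool-call counts for the final report line."""

    def __init__(self, tokens_at_start: int):
        self.t0 = time.time()
        self.tokens_at_start = tokens_at_start
        self.counts: dict[str, int] = {}

    def bump(self, kind: str) -> None:
        """Count a tool call, e.g. bump('screenshots') after every screenshot."""
        self.counts[kind] = self.counts.get(kind, 0) + 1

    def render(self, tokens_now: int) -> str:
        elapsed = time.time() - self.t0
        delta_k = (tokens_now - self.tokens_at_start) // 1000
        tools = ", ".join(f"{n} {k}" for k, n in self.counts.items())
        return (f"{elapsed:.1f}s | +{delta_k}k tokens "
                f"({self.tokens_at_start // 1000}k→{tokens_now // 1000}k) | {tools}")

# report = TaskReport(tokens_at_start=85_000)
# ... report.bump("screenshots") after each screenshot ...
# print(report.render(tokens_now=95_000))
```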
Do not use shell shortcuts (open <url>, osascript "tell app to set URL", CLI tools) to manipulate app state. The only allowed system calls are activate (bring window to front), screencapture (take screenshot), and cliclick (click/type after visual detection provides coordinates). These rules exist because of real bugs.
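All three allowed calls are plain macOS binaries, so thin wrappers suffice. A sketch; activate is shown via AppleScript's activate verb, which is an assumption about how the window is brought forward, and coordinates must always come from visual detection:

```python
import subprocess

def activate(app: str) -> None:
    """Bring the app's window to the front."""
    subprocess.run(["osascript", "-e", f'tell application "{app}" to activate'], check=True)

def screenshot(path: str = "/tmp/screen.png") -> str:
    """Silent full-screen capture with the built-in screencapture tool."""
    subprocess.run(["screencapture", "-x", path], check=True)
    return path

def click(x: int, y: int) -> None:
    """Click at coordinates produced by visual detection (never guessed)."""
    subprocess.run(["cliclick", f"c:{x},{y}"], check=True)
```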
→ Details: skills/gui-memory/SKILL.md
Visual memory stores app profiles, components, page fingerprints, workflows. See gui-memory for directory structure, profile schema, CRUD operations, and cleanup rules.
gui-agent/
├── SKILL.md          # This file (main orchestrator)
├── skills/           # Sub-skills (read on demand)
│   ├── gui-observe/SKILL.md
│   ├── gui-learn/SKILL.md
│   ├── gui-act/SKILL.md
│   ├── gui-memory/SKILL.md
│   ├── gui-workflow/SKILL.md
│   └── gui-setup/SKILL.md
├── scripts/          # Core scripts
│   └── agent.py, ui_detector.py, app_memory.py, gui_agent.py, template_match.py
├── memory/           # Visual memory (gitignored)
│   ├── apps/<appname>/
│   └── meta_workflows/
├── actions/          # Atomic operations
├── docs/
└── README.md
Example: run a malware scan in CleanMyMac X
OBSERVE → Screenshot → CleanMyMac X not in foreground → activate
├── Get main window bounds (largest window, skip status bar panels)
└── OCR window content → identify current state
STATE → Check memory for CleanMyMac X
├── OCR visible text: ["Smart Scan", "Malware Removal", "Privacy", ...]
├── State identified: "initial" (92% match)
└── Know which components to match: 21 components
NAVIGATE → Click "Malware Removal" in sidebar
├── Find element in window (exact match, filter by window bounds)
├── Click → new state: "click:Malware_Removal"
└── OCR confirms new state (87% match)
ACT → Click "Scan" button
├── Find "Scan" (exact match, bottom position → prevents matching "Deep Scan")
└── Click → scan starts
POLL → Wait for completion (event-driven, no fixed sleep; see the polling sketch after this example)
├── Every 2s: screenshot → OCR check for "No threats"
└── Target found → proceed immediately
CONFIRM → "No threats found" ✓
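A minimal sketch of that event-driven wait, in the spirit of the built-in wait_for command, here polling read_screen rather than template matching:

```python
import subprocess
import time

def read_screen(app: str) -> str:
    """OCR the app's window through the agent.py gateway and return the raw text."""
    result = subprocess.run(
        ["python3", "scripts/agent.py", "read_screen", "--app", app],
        capture_output=True, text=True,
    )
    return result.stdout

def wait_for_text(app: str, target: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll every `interval` seconds until `target` appears on screen, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if target in read_screen(app):   # e.g. "No threats" → proceed immediately
            return True
        time.sleep(interval)
    return False

# wait_for_text("CleanMyMac X", "No threats")
```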
Example: check GPU utilization in a JupyterLab terminal (Chrome)
OBSERVE → Screenshot → Chrome is open
└── Identify target: JupyterLab tab
NAVIGATE → Find JupyterLab tab in browser
├── OCR tab bar or use bookmarks
└── Click to switch
EXPLORE → Multiple terminal tabs visible
├── Screenshot terminal area
├── LLM vision analysis → identify which tab has nvitop
└── Click the correct tab
READ → Screenshot terminal content
├── LLM reads GPU utilization table
└── Report: "8 GPUs, 7 at 100% → experiment running" ✓
Example: kill the GlobalProtect process via Activity Monitor
OBSERVE → Screenshot current state
└── Neither GlobalProtect nor Activity Monitor in foreground
ACT → Launch both apps
├── open -a "GlobalProtect"
└── open -a "Activity Monitor"
EXPLORE → Screenshot Activity Monitor window
├── LLM vision → "Network tab active, search field empty at top-right"
└── Decide: click search field first
ACT → Search for process
├── Click search field (identified by explore)
├── Paste "GlobalProtect" (clipboard → Cmd+V, never cliclick type; see the clipboard sketch after this example)
└── Wait for filter results
VERIFY → Process found in list → select it
ACT → Kill process
├── Click stop button (X) in toolbar
└── Confirmation dialog appears
VERIFY → Click "Force Quit"
CONFIRM → Screenshot → process list empty → terminated ✓
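The clipboard paste used above can be done with pbcopy plus a Cmd+V keystroke. A sketch; the cliclick key-combo syntax shown is an assumption about one way to send Cmd+V, not necessarily how agent.py does it:

```python
import subprocess

def paste_text(text: str) -> None:
    """Put text on the clipboard with pbcopy, then press Cmd+V to paste it."""
    subprocess.run(["pbcopy"], input=text.encode("utf-8"), check=True)
    # kd:cmd / ku:cmd hold and release Command around typing "v" (assumed syntax).
    subprocess.run(["cliclick", "kd:cmd", "t:v", "ku:cmd"], check=True)

# Click the target field first (coordinates from template match), then:
# paste_text("GlobalProtect")
```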
1. Clone & install
git clone https://github.com/Fzkuji/GUIClaw.git
cd GUIClaw
bash scripts/setup.sh
2. Grant accessibility permissions
System Settings → Privacy & Security → Accessibility → Add Terminal / OpenClaw
3. Enable in OpenClaw (recommended)
Add to ~/.openclaw/openclaw.json:
{ "skills": { "entries": { "gui-agent": { "enabled": true } } } }
Then just chat with your agent; it reads SKILL.md and handles everything automatically.
First time → YOLO detects everything (~4 seconds):
YOLO: 43 icons, OCR: 34 text elements → 24 fixed UI components saved
Every time after → instant template match (~0.3 seconds):
✓ search_bar_icon (202,70) conf=1.0
✓ emoji_button (354,530) conf=1.0
✓ sidebar_contacts (85,214) conf=1.0
| Detector | Speed | Finds | Why |
|----------|-------|-------|-----|
| GPA-GUI-Detector | 0.3s | Icons, buttons | Finds gray-on-gray icons others miss |
| Apple Vision OCR | 1.6s | Text (CN + EN) | Best Chinese OCR, pixel-accurate |
| Template Match | 0.3s | Known components | 100% accuracy after first learn |
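Template matching is the cheap path: once a component crop is on disk, OpenCV can relocate it in a fresh screenshot. A minimal sketch of the idea (not the project's template_match.py):

```python
import cv2

def find_component(screenshot_path: str, template_path: str, threshold: float = 0.85):
    """Return (x, y, confidence) at the centre of the best match, or None below threshold."""
    screen = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    templ = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    scores = cv2.matchTemplate(screen, templ, cv2.TM_CCOEFF_NORMED)
    _, confidence, _, top_left = cv2.minMaxLoc(scores)
    if confidence < threshold:
        return None
    h, w = templ.shape
    return top_left[0] + w // 2, top_left[1] + h // 2, confidence

# find_component("/tmp/screen.png", "memory/apps/wechat/components/search_bar.png")
```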
Each app gets its own visual memory with a click-graph state model.
memory/apps/
├── wechat/
│   ├── profile.json             # Components + click-graph states
│   ├── components/              # Cropped UI element images
│   │   ├── search_bar.png
│   │   ├── emoji_button.png
│   │   └── ...
│   ├── workflows/               # Saved task sequences
│   │   └── send_message.json
│   └── pages/
│       └── main_annotated.jpg
├── cleanmymac_x/
│   ├── profile.json
│   ├── components/
│   ├── workflows/
│   │   └── smart_scan_cleanup.json
│   └── pages/
├── claude/
│   ├── profile.json
│   ├── components/
│   ├── workflows/
│   │   └── check_usage.json
│   └── pages/
└── google_chrome/
    ├── profile.json
    ├── components/
    └── sites/                   # Per-website memory
        ├── 12306_cn/
        └── github_com/
The UI is modeled as a graph of states. Each state is defined by which components are visible on screen.
profile.json structure:
{
"app": "Claude",
"window_size": [1512, 828],
"components": {
"Search": { "type": "icon", "rel_x": 115, "rel_y": 143, "icon_file": "components/Search.png", ... },
"Settings": { ... }
},
"states": {
"initial": {
"visible": ["Chat_tab", "Cowork_tab", "Code_tab", "Search", "Ideas", ...],
"description": "Main app view when first opened"
},
"click:Settings": {
"trigger": "Settings",
"trigger_pos": [63, 523],
"visible": ["Chat_tab", "Account", "Billing", "Usage", "General", ...],
"disappeared": ["Ideas", "Customize", ...],
"description": "Settings page"
},
"click:Usage": {
"trigger": "Usage",
"visible": ["Chat_tab", "Account", "Billing", "Usage", "Developer", ...],
"description": "Settings > Usage tab"
}
}
}
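A sketch of state identification against a profile like the one above: compare what is currently visible with each state's visible list and keep the best match ratio (the actual scoring in the memory scripts may differ):

```python
import json

def identify_state(profile_path: str, visible_now: set[str]) -> tuple[str, float]:
    """Return the state whose `visible` list best matches what is currently on screen."""
    with open(profile_path) as f:
        profile = json.load(f)

    best_state, best_ratio = "unknown", 0.0
    for name, state in profile["states"].items():
        expected = set(state["visible"])
        ratio = len(expected & visible_now) / len(expected) if expected else 0.0
        if ratio > best_ratio:
            best_state, best_ratio = name, ratio
    return best_state, best_ratio

# identify_state("memory/apps/claude/profile.json",
#                {"Chat_tab", "Account", "Billing", "Usage", "General"})
# → something like ("click:Settings", 0.89)  (illustrative numbers)
```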
How it works:
1. The first visit to an app runs full detection (learn).
2. Each click that changes the UI creates a click:ComponentName state.
3. To identify the current state, the text visible on screen is compared against each state's visible list → the highest match ratio wins.
4. A component can appear in several states (Chat_tab is visible in initial, click:Settings, click:Usage).
Why this works: