by Fzkuji
Vision-based desktop automation skills for OpenClaw agents on macOS. See, learn, click: any app.
```shell
# Add to your Claude Code skills
git clone https://github.com/Fzkuji/GUIClaw
```

You ARE the agent loop. Every GUI task follows this flow, in order:

INTENT MATCH → OBSERVE → ENSURE APP READY → ACT → VERIFY → SAVE WORKFLOW → REPORT
Each step has detailed instructions in its own skill file:
| Step | Skill | When to read |
|------|-------|-------------|
| Observe | skills/gui-observe/SKILL.md | Before any action: screenshot, OCR, identify state |
| Learn | skills/gui-learn/SKILL.md | App not in memory, or match rate < 80% |
| Act | skills/gui-act/SKILL.md | Clicking, typing, sending messages, waiting for UI |
| Memory | skills/gui-memory/SKILL.md | Visual memory: profiles, components, pages, CRUD, cleanup |
| Workflow | skills/gui-workflow/SKILL.md | Intent matching, saving/replaying workflows, meta-workflows |
| Setup | skills/gui-setup/SKILL.md | First-time setup on a new machine |
Read the relevant sub-skill when you reach that step. You don't need to read all of them upfront.
All GUI operations go through `scripts/agent.py`. Do not call `app_memory.py` or `gui_agent.py` directly.

```shell
source ~/gui-agent-env/bin/activate

# Core operations
python3 scripts/agent.py open --app AppName
python3 scripts/agent.py learn --app AppName
python3 scripts/agent.py detect --app AppName
python3 scripts/agent.py click --app AppName --component ButtonName
python3 scripts/agent.py list --app AppName
python3 scripts/agent.py read_screen --app AppName
python3 scripts/agent.py wait_for --app AppName --component X
python3 scripts/agent.py cleanup --app AppName
python3 scripts/agent.py navigate --url "https://example.com"
python3 scripts/agent.py workflows --app AppName
python3 scripts/agent.py all_workflows

# Messaging
python3 scripts/agent.py send_message --app WeChat --contact "John" --message "See you tomorrow"
python3 scripts/agent.py read_messages --app WeChat --contact "John"
```
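Since every operation is a `scripts/agent.py` subcommand, a thin Python wrapper is one way to drive it programmatically. This is a sketch, not part of the repo: the `agent`/`run_agent` helpers and the assumption that failed subcommands exit nonzero are mine; only the CLI surface above comes from the README.

```python
import subprocess

def agent(op, **kwargs):
    """Build the argv list for a scripts/agent.py subcommand."""
    cmd = ["python3", "scripts/agent.py", op]
    for key, value in kwargs.items():
        cmd += [f"--{key}", str(value)]   # e.g. app="WeChat" -> --app WeChat
    return cmd

def run_agent(op, **kwargs):
    """Run a subcommand; raise if it exits nonzero so the loop can re-observe."""
    result = subprocess.run(agent(op, **kwargs), capture_output=True, text=True)
    result.check_returncode()
    return result.stdout
```

For example, `run_agent("click", app="WeChat", component="Search")` shells out to the same command shown in the block above.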
`agent.py` automatically handles:

- `wait_for` command (template-match polling, no blind clicks)
- mandatory timing & token-delta reporting
- multi-window fix (selects the largest window)
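The idea behind `wait_for` polling can be sketched generically. Here `detect` is a stand-in (an assumption, not the repo's API) for the screenshot-plus-template-match step; the real implementation lives in `scripts/agent.py`:

```python
import time

def wait_for(detect, timeout=15.0, interval=0.5):
    """Poll until `detect()` returns coordinates, instead of clicking blindly.

    `detect` is any callable returning (x, y) on a match or None otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        hit = detect()
        if hit is not None:
            return hit                 # component appeared: safe to act on it
        time.sleep(interval)           # re-check the screen, don't guess
    raise TimeoutError("component did not appear before timeout")
```

The point of the design is that a click is only issued once visual detection has confirmed the target exists.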
→ Details: skills/gui-workflow/SKILL.md
Match the user request to saved workflows before doing anything. If one matches, use its steps as the plan. If not, proceed normally and save the sequence after it succeeds.
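A minimal intent matcher could score saved workflows against the request by word overlap. This is purely illustrative; the actual matching logic is defined in gui-workflow, and the `name` field and threshold here are assumptions:

```python
def match_workflow(request, workflows, threshold=0.5):
    """Return the best-matching saved workflow, or None below the threshold."""
    req = set(request.lower().split())
    best, best_score = None, 0.0
    for wf in workflows:
        words = set(wf["name"].lower().split())
        score = len(req & words) / max(len(words), 1)   # fraction of wf words hit
        if score > best_score:
            best, best_score = wf, score
    return best if best_score >= threshold else None
```

A match means the agent replays known-good steps; no match means it falls through to the full observe/learn/act loop.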
→ Details: skills/gui-observe/SKILL.md
Screenshot, identify current state. Record `session_status` for token reporting.
→ Details: skills/gui-learn/SKILL.md
Check whether the app is in memory. If not → learn. If the match rate is < 80% → re-learn. This is YOUR responsibility: do not wait for the user.
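The readiness decision reduces to a small branch. A sketch, assuming the match rate is reported as a 0–1 fraction (the exact shape of `detect`'s output is not specified here):

```python
def readiness_action(in_memory, match_rate=None):
    """Decide learn / re-learn / proceed per the 80% rule above."""
    if not in_memory:
        return "learn"                      # never seen this app: learn it
    if match_rate is not None and match_rate < 0.80:
        return "re-learn"                   # memory is stale or UI changed
    return "proceed"
```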
→ Details: skills/gui-learn/SKILL.md
Detect all components (YOLO + OCR), identify them, filter, save to memory. Privacy check: delete personal info.
→ Details: skills/gui-act/SKILL.md
Execute clicks, typing, and sends. Pre-verify before every click, and pre-verify the contact before every message send.
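The pre-send contact check can be as simple as comparing the OCR'd chat header against the intended contact; the normalisation used here (trim and case-fold) is an assumption, not the repo's rule:

```python
def contact_matches(header_text, contact):
    """True only if the OCR'd chat header names the intended contact."""
    return header_text.strip().lower() == contact.strip().lower()
```

If this returns False, the send must be aborted rather than risk messaging the wrong person.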
→ Details: skills/gui-act/SKILL.md
Screenshot after every action. Did the expected change happen? If not → re-observe.
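The act-then-verify discipline amounts to a bounded retry loop. All names here are illustrative stand-ins (an `act` callable for the click, `observe` for the post-action screenshot, `expected` for the state check):

```python
def act_and_verify(act, observe, expected, retries=2):
    """Act, re-observe, and confirm the expected change; retry a few times."""
    for _ in range(retries + 1):
        act()                          # e.g. a click issued through agent.py
        if expected(observe()):        # e.g. new page fingerprint matched
            return True                # verified: the UI changed as intended
    return False                       # give up; caller should re-plan
```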
→ Details: skills/gui-workflow/SKILL.md
Save successful multi-step sequences for future replay.
Every GUI task ends with a report:

```
45.2s | +10k tokens (85k → 95k) | 3 screenshots, 2 clicks, 1 learn
```

Compare `session_status` from STEP 0 vs now.
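Assembling that report line is pure arithmetic on the two `session_status` snapshots. A sketch; the field names and the exact formatting are assumptions modelled on the example above:

```python
def report(elapsed_s, tokens_start, tokens_now, screenshots, clicks, learns):
    """Format the end-of-task summary from the start/end token counts."""
    delta = tokens_now - tokens_start
    return (f"{elapsed_s:.1f}s | +{delta // 1000}k tokens "
            f"({tokens_start // 1000}k -> {tokens_now // 1000}k) | "
            f"{screenshots} screenshots, {clicks} clicks, {learns} learn")
```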
Do not use shell shortcuts (`open <url>`, `osascript` "tell app" to set a URL, CLI tools) to manipulate app state. The only allowed system calls are: `activate` (bring window to front), `screencapture` (take screenshot), and `cliclick` (click/type after visual detection provides coordinates). These rules exist because of real bugs.
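The three allowed system calls can be expressed as plain argv lists. The flags shown are the standard ones for these macOS tools (verify on your machine); the helper names are mine, not the repo's:

```python
def activate_cmd(app):
    """Bring the app's window to the front (the one allowed use of osascript)."""
    return ["osascript", "-e", f'tell application "{app}" to activate']

def screenshot_cmd(path):
    """Capture the screen to a file; -x mutes the shutter sound."""
    return ["screencapture", "-x", path]

def click_cmd(x, y):
    """Click at coordinates found by visual detection (cliclick c: syntax)."""
    return ["cliclick", f"c:{x},{y}"]
```

Anything beyond these three goes through the visual loop, never through the shell.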
→ Details: skills/gui-memory/SKILL.md
Visual memory stores app profiles, components, page fingerprints, workflows. See gui-memory for directory structure, profile schema, CRUD operations, and cleanup rules.
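Under the `memory/apps/<appname>/` layout shown below, profile CRUD is ordinary JSON-on-disk work. A minimal sketch; the `profile.json` filename, the lower-cased app directory, and the field shown are assumptions (the real schema is in gui-memory):

```python
import json
from pathlib import Path

def load_profile(root, app):
    """Read an app's profile from memory/, or None if it was never learned."""
    path = Path(root) / "apps" / app.lower() / "profile.json"
    return json.loads(path.read_text()) if path.exists() else None

def save_profile(root, app, profile):
    """Write the profile, creating the per-app directory on first learn."""
    path = Path(root) / "apps" / app.lower() / "profile.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(profile, indent=2, ensure_ascii=False))
```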
```
gui-agent/
├── SKILL.md          # This file (main orchestrator)
├── skills/           # Sub-skills (read on demand)
│   ├── gui-observe/SKILL.md
│   ├── gui-learn/SKILL.md
│   ├── gui-act/SKILL.md
│   ├── gui-memory/SKILL.md
│   ├── gui-workflow/SKILL.md
│   └── gui-setup/SKILL.md
├── scripts/          # Core scripts
│   └── agent.py, ui_detector.py, app_memory.py, gui_agent.py, template_match.py
├── memory/           # Visual memory (gitignored)
│   ├── apps/<appname>/
│   └── meta_workflows/
├── actions/          # Atomic operations
├── docs/
└── README.md
```
You: "Send a message to John in WeChat saying see you tomorrow"

```
OBSERVE → Screenshot, identify current state
├── Current app: Finder (not WeChat)
└── Action: need to switch to WeChat

STATE → Check WeChat memory
├── Learned before? Yes (24 components)
├── OCR visible text: ["Chat", "Cowork", "Code", "Search", ...]
├── State identified: "initial" (89% match)
└── Components for this state: 18 → use these for matching

NAVIGATE → Find contact "John"
├── Template match search_bar → found (conf=0.96) → click
├── Paste "John" into search field (clipboard → Cmd+V)
├── OCR search results → found → click
└── New state: "click:John" (chat opened)

VERIFY → Confirm correct chat opened
├── OCR chat header → "John" ✓
└── Wrong contact? → ABORT

ACT → Send messag...
```