by Fzkuji
🦞 Vision-based desktop automation skills for OpenClaw agents on macOS. See, learn, click — any app.
# Add to your Claude Code skills
git clone https://github.com/Fzkuji/GUIClaw
You ARE the agent loop. Every GUI task follows this flow, in order:
INTENT MATCH → OBSERVE → ENSURE APP READY → ACT → VERIFY → SAVE WORKFLOW → REPORT
Each step has detailed instructions in its own skill file:
| Step | Skill | When to read |
|---|---|---|
| Observe | skills/gui-observe/SKILL.md | Before any action: screenshot, OCR, identify state |
| Learn | skills/gui-learn/SKILL.md | App not in memory, or match rate < 80% |
| Act | skills/gui-act/SKILL.md | Clicking, typing, sending messages, waiting for UI |
| Memory | skills/gui-memory/SKILL.md | Visual memory: profiles, components, pages, CRUD, cleanup |
| Workflow | skills/gui-workflow/SKILL.md | Intent matching, saving/replaying workflows, meta-workflows |
| Setup | skills/gui-setup/SKILL.md | First-time setup on a new machine |
Read the relevant sub-skill when you reach that step. You don't need to read all of them upfront.
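The flow above can be sketched as a plain loop over named steps. This is only an illustration: `run_gui_task` and the handler callables are hypothetical names, not part of GUIClaw. In practice the agent itself carries out each step by following the matching sub-skill.

```python
from typing import Callable, Dict, List

# Step order copied from the flow described above.
STEPS: List[str] = [
    "INTENT MATCH", "OBSERVE", "ENSURE APP READY",
    "ACT", "VERIFY", "SAVE WORKFLOW", "REPORT",
]


def run_gui_task(handlers: Dict[str, Callable[[dict], None]]) -> dict:
    """Run every step in order, sharing one context dict (hypothetical helper)."""
    context: dict = {"completed": []}
    for step in STEPS:
        handler = handlers.get(step)
        if handler is None:
            # Refuse to skip a step silently: the flow is meant to run in order.
            raise KeyError(f"no handler registered for step {step!r}")
        handler(context)
        context["completed"].append(step)
    return context
```

Keeping the order in one list makes it harder to skip VERIFY or REPORT by accident.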
All GUI operations go through scripts/agent.py. Do not call app_memory.py or gui_agent.py directly.
source ~/gui-agent-env/bin/activate
# Core operations
python3 scripts/agent.py open --app AppName
python3 scripts/agent.py learn --app AppName
python3 scripts/agent.py detect --app AppName
python3 scripts/agent.py click --app AppName --component ButtonName
python3 scripts/agent.py list --app AppName
python3 scripts/agent.py read_screen --app AppName
python3 scripts/agent.py wait_for --app AppName --component X
python3 scripts/agent.py cleanup --app AppName
python3 scripts/agent.py navigate --url "https://example.com"
python3 scripts/agent.py workflows --app AppName
python3 scripts/agent.py all_workflows
# Messaging
python3 scripts/agent.py send_message --app WeChat --contact "小明" --message "明天见"
python3 scripts/agent.py read_messages --app WeChat --contact "小明"
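If these commands are driven from Python rather than typed by hand, a thin wrapper can build the argument list. The sketch below is an assumption-laden convenience, not part of the repository: `build_agent_command` and `run_agent` are hypothetical helpers, and they assume the virtual environment is already active and the working directory is the repository root.

```python
import subprocess
from typing import List


def build_agent_command(command: str, **options: str) -> List[str]:
    """Build the argv for scripts/agent.py (hypothetical helper)."""
    argv = ["python3", "scripts/agent.py", command]
    for name, value in options.items():
        # Each keyword becomes a --name value pair, in the order given.
        argv += [f"--{name}", value]
    return argv


def run_agent(command: str, **options: str) -> subprocess.CompletedProcess:
    """Run agent.py and capture its output without raising on failure."""
    return subprocess.run(
        build_agent_command(command, **options),
        capture_output=True,
        text=True,
        check=False,
    )
```

For example, `build_agent_command("click", app="WeChat", component="search_bar")` produces the same argument list as the `click` line above. Passing a list instead of a shell string avoids quoting problems with contact names and messages.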
agent.py automatically handles:
→ Details: skills/gui-workflow/SKILL.md
Match user request to saved workflows before doing anything. If matched, use workflow steps as plan. If not, proceed and save after success.
→ Details: skills/gui-observe/SKILL.md
Screenshot, identify current state. Record session_status for token reporting.
→ Details: skills/gui-learn/SKILL.md
Check if app is in memory. If not → learn. If match rate < 80% → re-learn. This is YOUR responsibility — do not wait for the user.
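The readiness rule above amounts to a small decision function. The sketch below assumes the match rate is available as a fraction between 0 and 1. `next_action` and `RELEARN_THRESHOLD` are illustrative names, not identifiers from the project.

```python
from typing import Optional

# 80% threshold taken from the rule above.
RELEARN_THRESHOLD = 0.80


def next_action(in_memory: bool, match_rate: Optional[float]) -> str:
    """Decide whether to learn, re-learn, or proceed (hypothetical helper)."""
    if not in_memory:
        return "learn"
    if match_rate is not None and match_rate < RELEARN_THRESHOLD:
        return "re-learn"
    return "proceed"
```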
→ Details: skills/gui-learn/SKILL.md
Detect all components (YOLO + OCR), identify them, filter, save to memory. Privacy check: delete personal info.
→ Details: skills/gui-act/SKILL.md
Execute clicks, typing, sending. Pre-verify before every click. Pre-verify contact before every message send.
→ Details: skills/gui-act/SKILL.md
Screenshot after every action. Did the expected change happen? If not → re-observe.
→ Details: skills/gui-workflow/SKILL.md
Save successful multi-step sequences for future replay.
Every GUI task ends with a report:
⏱ 45.2s | 📊 +10k tokens (85k→95k) | 🔧 3 screenshots, 2 clicks, 1 learn
Compare session_status from STEP 0 vs now.
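A report line in the format shown above could be assembled as follows. This is a sketch under assumptions: `format_report` is a hypothetical helper, token counts are assumed to be plain integers read from session_status, and rounding to the nearest thousand is a guess at how the example figures were produced.

```python
from typing import Dict


def format_report(elapsed_s: float, tokens_before: int, tokens_after: int,
                  counts: Dict[str, int]) -> str:
    """Format the end-of-task report line (hypothetical helper)."""
    before_k = round(tokens_before / 1000)
    after_k = round(tokens_after / 1000)
    delta_k = round((tokens_after - tokens_before) / 1000)
    # Pluralize each action name unless its count is exactly one.
    actions = ", ".join(
        f"{n} {name}{'' if n == 1 else 's'}" for name, n in counts.items()
    )
    return (
        f"⏱ {elapsed_s:.1f}s | 📊 {delta_k:+d}k tokens "
        f"({before_k}k→{after_k}k) | 🔧 {actions}"
    )
```

Calling `format_report(45.2, 85000, 95000, {"screenshot": 3, "click": 2, "learn": 1})` reproduces the example line above.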
open <url>, osascript tell app to set URL, CLI tools) to manipulate app state. Only allowed system calls: activate (bring window to front), screencapture (take screenshot), cliclick (click/type after visual detection provides coordinates).
These exist because of real bugs:
→ Details: skills/gui-memory/SKILL.md
Visual memory stores app profiles, components, page fingerprints, workflows. See gui-memory for directory structure, profile schema, CRUD operations, and cleanup rules.
gui-agent/
├── SKILL.md # This file (main orchestrator)
├── skills/ # Sub-skills (read on demand)
│ ├── gui-observe/SKILL.md
│ ├── gui-learn/SKILL.md
│ ├── gui-act/SKILL.md
│ ├── gui-memory/SKILL.md
│ ├── gui-workflow/SKILL.md
│ └── gui-setup/SKILL.md
├── scripts/ # Core scripts
│ ├── agent.py, ui_detector.py, app_memory.py, gui_agent.py, template_match.py
├── memory/ # Visual memory (gitignored)
│ ├── apps/<appname>/
│ └── meta_workflows/
├── actions/ # Atomic operations
├── docs/
└── README.md
wait_for command (template-match polling, no blind clicks); mandatory timing & token delta reporting; multi-window fix (selects largest window).
You: "Send a message to John in WeChat saying see you tomorrow"
OBSERVE → Screenshot, identify current state
├── Current app: Finder (not WeChat)
└── Action: need to switch to WeChat
STATE → Check WeChat memory
├── Learned before? Yes (24 components)
├── OCR visible text: ["Chat", "Cowork", "Code", "Search", ...]
├── State identified: "initial" (89% match)
└── Components for this state: 18 → use these for matching
NAVIGATE → Find contact "John"
├── Template match search_bar → found (conf=0.96) → click
├── Paste "John" into search field (clipboard → Cmd+V)
├── OCR search results → found → click
└── New state: "click:John" (chat opened)
VERIFY → Confirm correct chat opened
├── OCR chat header → "John" ✅
└── Wrong contact? → ABORT
ACT → Send message
├── Click input field (template match)
├── Paste "see you tomorrow" (clipboard → Cmd+V)
└── Press Enter
CONFIRM → Verify message sent
├── OCR chat area → "see you tomorrow" visible ✅
└── Done
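The VERIFY step in this trace is the guard that prevents sending to the wrong person. A minimal sketch of that check is shown below, assuming the chat header text has already been extracted by OCR. `verify_chat_header` is a hypothetical name, and the exact comparison rule used by the project is not shown in this document.

```python
def verify_chat_header(ocr_header: str, expected_contact: str) -> None:
    """Abort unless the open chat belongs to the expected contact (hypothetical helper)."""
    # Exact comparison on purpose: a substring test would accept "Johnny" for "John".
    if ocr_header.strip() != expected_contact.strip():
        raise RuntimeError(
            f"wrong chat open: expected {expected_contact!r}, saw {ocr_header!r}"
        )
```

Raising an exception, rather than returning a flag, makes it harder for a caller to ignore a mismatch and send the message anyway.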
OBSERVE → Screenshot → CleanMyMac X not in foreground → activate
├── Get main window bounds (largest window, skip status bar panels)
└── OCR window content → identify current state
STATE → Check memory for CleanMyMac X
├── OCR visible text: ["Smart Scan", "Malware Removal", "Privacy", ...]
├── State identified: "initial" (92% match)
└── Know which components to match: 21 components
NAVIGATE → Click "Malware Removal" in sidebar
├── Find element in window (exact match, filter by window bounds)
├── Click → new state: "click:Malware_Removal"
└── OCR confirms new state (87% match)
ACT → Click "Scan" button
├── Find "Scan" (exact match, bottom position — prevents matching "Deep Scan")
└── Click → scan starts
POLL → Wait for completion (event-driven, no fixed sleep)
├── Every 2s: screenshot → OCR check for "No threats"
└── Target found → proceed immediately
CONFIRM → "No threats found" ✅
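The POLL step above can be sketched as a loop that checks the screen at a fixed interval and returns as soon as the target text appears. In this sketch `wait_for_text` is a hypothetical helper, `read_screen_text` stands in for a screenshot plus OCR call, and the 120-second timeout is an assumption. Only the 2-second interval comes from the trace.

```python
import time
from typing import Callable


def wait_for_text(read_screen_text: Callable[[], str], target: str,
                  interval_s: float = 2.0, timeout_s: float = 120.0,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until target appears on screen; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if target in read_screen_text():
            return True  # proceed immediately, no extra sleep
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time.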
OBSERVE → Screenshot → Chrome is open
└── Identify target: JupyterLab tab
NAVIGATE → Find JupyterLab tab in browser
├── OCR tab bar or use bookmarks
└── Click to switch
EXPLORE → Multiple terminal tabs visible
├── Screenshot terminal area
├── LLM vision analysis → identify which tab has nvitop
└── Click the correct tab
READ → Screenshot terminal content
├── LLM reads GPU utilization table
└── Report: "8 GPUs, 7 at 100% — experiment running" ✅
OBSERVE → Screenshot current state
└── Neither GlobalProtect nor Activity Monitor in foreground
ACT → Launch both apps
├── open -a "GlobalProtect"
└── open -a "Activity Monitor"
EXPLORE → Screenshot Activity Monitor window
├── LLM vision → "Network tab active, search field empty at top-right"
└── Decide: click search field first
ACT → Search for process
├── Click search field (identified by explore)
├── Paste "GlobalProtect" (clipboard → Cmd+V, never cliclick type)
└── Wait for filter results
VERIFY → Process found in list → select it
ACT → Kill process
├── Click stop button (X) in toolbar
└── Confirmation dialog appears
VERIFY → Click "Force Quit"
CONFIRM → Screenshot → process list empty → terminated ✅
1. Clone & install
git clone https://github.com/Fzkuji/GUIClaw.git
cd GUIClaw
bash scripts/setup.sh
2. Grant accessibility permissions
System Settings → Privacy & Security → Accessibility → Add Terminal / OpenClaw
3. Enable in OpenClaw (recommended)
Add to ~/.openclaw/openclaw.json:
{ "skills": { "entries": { "gui-agent": { "enabled": true } } } }
Then just chat with your agent — it reads SKILL.md and handles everything automatically.
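If ~/.openclaw/openclaw.json already contains other settings, the entry can be merged in rather than overwriting the file. The sketch below assumes the file is plain JSON with the nesting shown above. `enable_skill` is a hypothetical helper, not an OpenClaw API.

```python
import json
from pathlib import Path


def enable_skill(config: dict, name: str = "gui-agent") -> dict:
    """Set skills.entries.<name>.enabled = true, keeping existing keys."""
    entries = config.setdefault("skills", {}).setdefault("entries", {})
    entries.setdefault(name, {})["enabled"] = True
    return config


def enable_skill_in_file(path: Path, name: str = "gui-agent") -> None:
    """Read the config file if it exists, merge the entry, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(enable_skill(config, name), indent=2))
```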
First time — YOLO detects everything (~4 seconds):
🔍 YOLO: 43 icons 📝 OCR: 34 text elements 🔗 → 24 fixed UI components saved
Every time after — instant template match (~0.3 seconds):
✅ search_bar_icon (202,70) conf=1.0
✅ emoji_button (354,530) conf=1.0
✅ sidebar_contacts (85,214) conf=1.0
| Detector | Speed | Finds | Why |
|---|---|---|---|
| GPA-GUI-Detector | 0.3s | Icons, buttons | Finds gray-on-gray icons others miss |
| Apple Vision OCR | 1.6s | Text (CN + EN) | Best Chinese OCR, pixel-accurate |
| Template Match | 0.3s | Known components | 100% accuracy after first learn |
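To illustrate what a template match with a confidence score means, here is a toy version on small grayscale grids. It is not the project's implementation (how scripts/template_match.py works is not shown here), and a real matcher would use an image library. `match_template` and the confidence formula are assumptions chosen for clarity.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, 0..255, row-major


def match_template(screen: Image, template: Image) -> Tuple[int, int, float]:
    """Return (x, y, confidence) of the best match by sum of absolute differences."""
    th, tw = len(template), len(template[0])
    sh, sw = len(screen), len(screen[0])
    max_diff = 255 * th * tw  # worst possible difference for this template size
    best = (0, 0, -1.0)
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            diff = sum(
                abs(screen[y + dy][x + dx] - template[dy][dx])
                for dy in range(th)
                for dx in range(tw)
            )
            conf = 1.0 - diff / max_diff  # 1.0 means pixel-identical
            if conf > best[2]:
                best = (x, y, conf)
    return best
```

A pixel-identical crop yields a confidence of 1.0, which is consistent with the conf=1.0 lines shown above for components saved from the same app.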
Each app gets its own visual memory with a click-graph state model.
memory/apps/
├── wechat/
│ ├── profile.json # Components + click-graph states
│ ├── components/ # Cropped UI element images
│ │ ├── search_bar.png
│ │ ├── emoji_button.png
│ │ └── ...
│ ├── workflows/ # Saved task sequences
│ │ └── send_message.json
│ └── pages/
│ └── main_annotated.jpg
├── cleanmymac_x/
│ ├── profile.json
│ ├── components/
│ ├── workflows/
│ │ └── smart_scan_cleanup.json
│ └── pages/
├── claude/
│ ├── profile.json
│ ├── components/
│ ├── workflows/
│ │ └── check_usage.json
│ └── pages/
└── google_chrome/
├── profile.json
├── components/
└── sites/ # Per-website memory
├── 12306_cn/
└── github_com/
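The directory names in this tree look like lowercased app or site names with non-alphanumeric characters replaced by underscores. The sketch below encodes that inferred rule. It is an assumption based only on the examples shown, and `memory_slug` and `app_memory_dir` are hypothetical helpers.

```python
import re
from pathlib import Path


def memory_slug(name: str) -> str:
    """Lowercase and replace runs of non-alphanumerics with '_' (inferred rule)."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def app_memory_dir(app: str, root: Path = Path("memory/apps")) -> Path:
    """Return the memory directory for an app under the given root."""
    return root / memory_slug(app)
```

Under this rule, "CleanMyMac X" maps to cleanmymac_x and "github.com" maps to github_com, matching the tree above.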
The UI is modeled as a graph of states. Each state is defined by which components are visible on screen.
profile.json structure:
{
"app": "Claude",
"window_size": [1512, 828],
"components": {
"Search": { "type": "icon", "rel_x": 115, "rel_y": 143, "icon_file": "components/Search.png", ... },
"Settings": { ... }
},
"states": {
"initial": {
"visible": ["Chat_tab", "Cowork_tab", "Code_tab", "Search", "Ideas", ...],
"description": "Main app view when first opened"
},
"click:Settings": {
"trigger": "Settings",
"trigger_pos": [63, 523],
"visible": ["Chat_tab", "Account", "Billing", "Usage", "General", ...],
"disappeared": ["Ideas", "Customize", ...],
"description": "Settings page"
},
"click:Usage": {
"trigger": "Usage",
"visible": ["Chat_tab", "Account", "Billing", "Usage", "Developer", ...],
"description": "Settings > Usage tab"
}
}
}
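Because each state is defined by its visible components, identifying the current state can be sketched as scoring every state in profile.json against what is on screen. The scoring formula below (the fraction of a state's visible list that is currently detected) is an assumption, and `match_ratio` and `identify_state` are hypothetical names.

```python
from typing import Dict, Iterable, Set, Tuple


def match_ratio(visible_now: Set[str], state_visible: Iterable[str]) -> float:
    """Fraction of a state's expected components that are currently visible."""
    expected = set(state_visible)
    if not expected:
        return 0.0
    return len(expected & visible_now) / len(expected)


def identify_state(visible_now: Set[str], states: Dict[str, dict]) -> Tuple[str, float]:
    """Return the (state name, ratio) pair with the highest match ratio."""
    scored = (
        (name, match_ratio(visible_now, state.get("visible", [])))
        for name, state in states.items()
    )
    return max(scored, key=lambda pair: pair[1])
```

Shared components such as Chat_tab contribute to several states at once, so the components unique to a state are what separate the scores.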
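How it works:
- Components are detected and saved the first time an app is used (learn).
- Clicking a component leads to a click:ComponentName state.
- The current state is identified by comparing what is on screen with each state's visible list → highest match ratio wins.
- A component can belong to several states (e.g. Chat_tab is visible in initial, click:Settings, click:Usage).
Why this works: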