by Fzkuji
Vision-based desktop automation skills for LLM agents on macOS.
# Add to your Claude Code skills
git clone https://github.com/Fzkuji/GUI-Agent-Skills

You ARE the agent loop: Observe → Decide → Act → Verify.
User Intent ("send WeChat message to 宋文涛")
│
▼
┌─────────────────────────────────────┐
│ 1. DETECT (ui_detector.py) │
│ GPA-GUI-Detector (YOLO, 40MB) │ → icons, buttons
│ Apple Vision OCR │ → text elements (Chinese)
│ Accessibility API │ → Dock, menubar, named controls
│ IoU merge + dedup │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. MATCH (app_memory.py) │
│ Template matching vs memory │ → known components (0.3s, conf=1.0)
│ If matched → use stored coords │
│ If unknown → LLM identifies │ → save to memory for next time
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. ACT (gui_agent.py + cliclick) │
│ Relative coords + window pos │ → screen coordinates
│ Pre-action verify │ → correct contact? correct field?
│ Execute click/type/key │
│ Post-action verify (OCR) │
└─────────────────────────────────────┘
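The IoU merge in stage 1 can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names — the actual `ui_detector.py` merge logic and threshold may differ:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def merge_detections(yolo_boxes, ocr_boxes, thresh=0.5):
    """Keep every YOLO box; add an OCR box only if it doesn't
    substantially overlap something already kept."""
    merged = list(yolo_boxes)
    for box in ocr_boxes:
        if all(iou(box, kept) < thresh for kept in merged):
            merged.append(box)
    return merged
```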
| Detector | What it finds | Speed | Best for |
|----------|--------------|-------|----------|
| GPA-GUI-Detector | Icons, buttons, UI elements | 0.3s | WeChat icons, any app's buttons |
| Apple Vision OCR | Text (Chinese + English) | 1.6s | Chat content, labels, menus |
| Accessibility API | Named controls with positions | 0.1s | Dock, Discord, Chrome, menubar |
| Template Match | Previously seen components | 0.3s | Known UI elements (conf=1.0) |
App has good AX? (Discord, Chrome, System Settings)
→ Use AX API directly (fastest, most accurate)
App has bad AX? (WeChat, QQ)
→ GPA-GUI-Detector + OCR → template match against memory
First time using an app?
→ Run `app_memory.py learn --app AppName`
→ Saves all components to memory/apps/appname/
Need to click a known element?
→ Run `app_memory.py click --app AppName --component name`
→ Template match → relative coords → click
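The relative-to-screen conversion behind that click step can be sketched as follows. Function names and the bounds check are illustrative, not the actual `app_memory.py` API:

```python
import subprocess

def to_screen(window_pos, window_size, rel):
    """Convert relative coords stored in memory into absolute screen
    coords, refusing anything that falls outside the window bounds."""
    sx, sy = window_pos[0] + rel[0], window_pos[1] + rel[1]
    if not (window_pos[0] <= sx <= window_pos[0] + window_size[0]
            and window_pos[1] <= sy <= window_pos[1] + window_size[1]):
        raise ValueError(f"({sx},{sy}) is outside the window bounds")
    return int(sx), int(sy)

def click(window_pos, window_size, rel):
    sx, sy = to_screen(window_pos, window_size, rel)
    # cliclick expects integer logical screen coordinates
    subprocess.run(["/opt/homebrew/bin/cliclick", f"c:{sx},{sy}"], check=True)
```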
Each app gets a memory directory with learned components:
memory/apps/
├── wechat/
│ ├── profile.json # Component registry (coords, labels, types)
│ ├── icons/ # Cropped component images (PNG)
│ │ ├── Q_Search.png
│ │ ├── 宋文涛.png
│ │ ├── icon_11_173_225.png (unlabeled → LLM identifies later)
│ │ └── ...
│ └── pages/
│ ├── main_annotated.jpg # Annotated screenshot
│ └── main.json # Page layout
├── discord/
│ └── ...
Window capture: `screencapture -l <windowID>`, not fullscreen.
Icon filename = content description: `chat_button.png`, `search_bar.png`, NOT `icon_0_170_103.png`.
Unlabeled icons: save as `unlabeled_<region>_<x>_<y>.png` temporarily, then `app_memory.py rename --old unlabeled_xxx --new actual_name`.
Dedup: Never save duplicate icons. Before saving, compare against existing icons (similarity > 0.92 = duplicate). Keep ONE copy.
Cleanup: Run app_memory.py cleanup --app AppName to remove duplicates. Dynamic content (chat messages, timestamps, avatars in chat) should be periodically cleaned.
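The similarity check behind dedup might look like this sketch, using plain-NumPy normalized cross-correlation on grayscale crops. The real scripts may use a different metric (e.g. OpenCV template matching), so treat the names and the exact formula as assumptions:

```python
import numpy as np

def similarity(img_a, img_b):
    """Normalized cross-correlation between two equal-sized grayscale crops.
    Returns a value in [-1, 1]; 1.0 means visually identical."""
    a = img_a.astype(np.float64).ravel()
    b = img_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def is_duplicate(candidate, existing_icons, thresh=0.92):
    """True if the candidate crop matches any stored icon of the same size."""
    return any(
        icon.shape == candidate.shape and similarity(candidate, icon) > thresh
        for icon in existing_icons
    )
```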
Per-app, per-page: Each app has its own memory directory. Different pages (main, chat, settings) get separate page layouts.
Important vs unimportant:
What happens when `learn` runs:
1. Capture window screenshot
2. Run GPA-GUI-Detector + Apple Vision OCR
3. For each detected element:
a. Has OCR label? → use label as filename
b. No label? → name as "unlabeled_<region>_<x>_<y>"
c. Check visual dedup (similarity > 0.92) → skip if duplicate
d. Crop and save to icons/
4. After saving all elements:
a. Use vision model (Claude/GPT-4o) to identify ALL unlabeled icons
b. Rename identified icons: app_memory.py rename --old unlabeled_xxx --new actual_name
c. Remove dynamic content (timestamps, message previews, chat text, stickers)
d. Keep ONLY fixed UI elements (buttons, icons, tabs, navigation, input fields)
5. Result: clean profile with ~20-30 named, fixed UI components per page
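The naming rule in steps 3a–3b can be sketched as a small helper (hypothetical function, not the actual `app_memory.py` code):

```python
import re

def component_filename(label, region="main", x=0, y=0):
    """OCR label → safe filename; unlabeled elements get a positional
    name that the LLM is expected to replace later via `rename`."""
    if label:
        # Keep word characters (incl. CJK) and hyphens; squash the rest
        safe = re.sub(r"[^\w\u4e00-\u9fff-]+", "_", label).strip("_")
        return f"{safe}.png"
    return f"unlabeled_{region}_{x}_{y}.png"
```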
The golden rule: Only save things that will look the same next time you open the app. If it changes every session, don't save it.
KEEP (fixed UI elements — same every time): buttons, icons, tabs, navigation, input fields.
REMOVE (dynamic content — different every session): timestamps, message previews, chat text, stickers.
HOW TO JUDGE: Ask yourself: will this element look identical the next time the app opens? If it changes every session, remove it.
After every learn, verify:
No `unlabeled_` files remain (all identified or removed).
No duplicate icons (run `cleanup` if needed).

| Scene | Location | Goal |
|-------|----------|------|
| Atomic Actions | actions/_actions.yaml | click, type, paste, AX scan... |
| WeChat | scenes/wechat/ | Send/read messages, scroll history |
| Discord | scenes/discord.yaml | Send/read messages |
| Telegram | scenes/telegram.yaml | Send/read messages |
| 1Password | scenes/1password.yaml | Retrieve credentials |
| VPN Reconnect | scenes/vpn-reconnect.yaml | Reconnect GlobalProtect VPN |
| App Exploration | scenes/app-explore.yaml | Map an unfamiliar app's UI |
What do you need to do?
│
├── GUI task on any app?
│ ├── App in memory? → app_memory.py detect/click
│ └── New app? → app_memory.py learn → then operate
│
├── WeChat? → read scenes/wechat/index.yaml
├── Discord? → read scenes/discord.yaml
├── Telegram? → read scenes/telegram.yaml
├── VPN/SSH down? → read scenes/vpn-reconnect.yaml
├── Need a password? → read scenes/1password.yaml
├── New app? → read scenes/app-explore.yaml
├── Atomic operation? → read actions/_actions.yaml
└── Principles? → read docs/core.md
These rules exist because of real bugs that caused messages sent to wrong people.
VERIFY BEFORE SENDING — Before typing ANY message, OCR the chat header to confirm the correct contact/group name is displayed. If wrong → ABORT immediately, do NOT send.
ALL OCR/clicks MUST be within target window bounds — Get window bounds first, filter all OCR results by window region. NEVER click coordinates outside the target app's window. Without this, you WILL click on other apps visible behind.
NEVER auto-learn from wrong-app context — If a click landed outside the target app window, do NOT save that location as a template. Validate window bounds before auto_learn.
Reject tiny templates — Templates smaller than 30×30 pixels produce false matches everywhere. Never save them.
Template match ≠ correct target — A template matching "宋文涛" text could be in a group chat name, a forwarded message, or another app. Always verify the CHAT HEADER after navigation, not just the sidebar click.
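The window-bounds rule above amounts to a simple center-in-rectangle check before anything is clicked. A sketch — the dict keys and function name are illustrative, not the actual detector output format:

```python
def filter_to_window(ocr_results, win_bounds):
    """Keep only OCR hits whose center lies inside the target window.

    ocr_results: list of dicts like {"text": ..., "x": ..., "y": ..., "w": ..., "h": ...}
    win_bounds:  (win_x, win_y, win_w, win_h) in screen coordinates
    """
    wx, wy, ww, wh = win_bounds
    kept = []
    for r in ocr_results:
        cx, cy = r["x"] + r["w"] / 2, r["y"] + r["h"] / 2
        if wx <= cx <= wx + ww and wy <= cy <= wy + wh:
            kept.append(r)
    return kept
```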
Every time you interact with a GUI app or website, check memory FIRST:
- No `memory/apps/<appname>/`? → Run `learn` automatically before operating
- No `memory/apps/<browser>/sites/<domain>/`? → Run `learn_site` automatically
- New page in a known app? → Run `learn --page <pagename>` to add it

Do NOT wait for the user to ask you to learn. This is YOUR responsibility.
- Coordinate conversion: `screen_x = window_x + relative_x`, `screen_y = window_y + relative_y`
- Window bounds: `osascript -e 'tell application "System Events" to tell process "AppName" to return {position, size} of window 1'`
- Window capture: `screencapture -x -l <windowID> output.png`
- Activate app: `osascript -e 'tell application "AppName" to activate'`
- Resize window: `tell process "AppName" to set size of window 1 to {900, 650}`
- Click: `/opt/homebrew/bin/cliclick c:<x>,<y>` (logical screen coords, integers)
- Type: `cliclick t:"text"` (ASCII only, special chars may break)
- Paste: `pbcopy` + Cmd+V (MUST set `LANG=en_US.UTF-8`)
- Key press: `cliclick kp:return` (valid keys: return, esc, tab, delete, space, arrow-*, f1-f16)
- Keystroke: `osascript -e 'tell app "System Events" to keystroke "v" using command down'`

Browsers are a two-layer system:
memory/apps/
├── google_chrome/
│ ├── profile.json # Browser chrome UI (tabs, address bar, etc.)
│ ├── icons/ # Browser UI icons
│ └── sites/ # Per-website memory
│ ├── 12306.cn/
│ │ ├── profile.json # Site-specific UI elements
│ │ ├── icons/ # Site buttons, nav items
│ │ └── pages/
│ │ ├── search.json # Train search page layout
│ │ └── results.json # Results page layout
│ ├── google.com/
│ └── ...
Browser operation flow:
1. Learn browser chrome once: app_memory.py learn --app "Google Chrome"
→ Saves: address bar, tab controls, bookmarks bar, etc.
2. Navigate to a website:
a. Click address bar (template match: address_bar)
b. Type URL or search term (paste)
c. Press Enter
3. On a new website:
a. Wait for page load (1-2s)
b. Run detection (YOLO + OCR) on the page content area only
c. Save site-specific UI elements to sites/<domain>/
d. Dynamic content (search results, articles) = DON'T save
e. Fixed UI (nav bar, search box, buttons, filters) = SAVE
4. Operate within website:
a. Template match known site elements first
b. If not found → OCR find text → click
c. For form fields: click field → paste text → verify
What to save per website:
What NOT to save per website:
Autocomplete fields (like 12306 station selector): typing text alone is NOT enough. MUST click the dropdown suggestion item. The field only accepts values selected from the dropdown.
Chinese input in browsers: System IME interferes with website autocomplete.
Workaround: `cliclick t:bjn` with English input → website dropdown shows 北京南 → click it.
Cmd+V paste in web forms: May produce garbled text (encoding issues). Prefer `cliclick t:text` for ASCII/pinyin, and let website autocomplete handle Chinese.
Date pickers: Usually need to click the calendar UI, not just type a date string. Some accept direct input, some don't.
1. PREPARE
a. Activate the app: osascript tell "WeChat" to activate
b. Get window bounds: (win_x, win_y, win_w, win_h)
c. ALL subsequent OCR/clicks MUST be within these bounds
2. NAVIGATE TO CONTACT
a. Check if contact visible in sidebar (OCR within window bounds)
- If found → click it
- If not found → search:
i. Click search bar (template match or OCR within window)
ii. Paste contact name (pbcopy + Cmd+V)
iii. Wait 1s for results
iv. OCR find contact in results (within window bounds) → click
3. ⚠️ VERIFY CONTACT (MANDATORY — DO NOT SKIP)
a. OCR the chat HEADER area (top 120px of main content area)
b. Confirm expected contact name appears in the header
c. If WRONG contact or name NOT found:
→ LOG what chat IS open (for debugging)
→ ABORT immediately, do NOT type anything
→ Return error
d. Only proceed to step 4 if verification PASSES
4. TYPE MESSAGE (only after step 3 passes)
a. Click input field (template match or window_calc)
b. Paste message (pbcopy + Cmd+V, NOT cliclick type)
5. SEND
a. Press Enter (cliclick kp:return)
6. VERIFY SENT
a. OCR the chat area
b. Confirm first 10 chars of message visible
c. If not found → report warning (may still have sent)
WHY step 3 is critical: Template matching "宋文涛" could match a group chat name, a forwarded message, or text in another app window.
Only the chat HEADER reliably shows who you're actually chatting with.
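Step 3's header check can be expressed as a pure function over window-filtered OCR results. The function name and the `content_x`/`header_h` parameters are illustrative assumptions:

```python
def verify_chat_header(ocr_results, expected_name, content_x, header_h=120):
    """Check the top strip of the main content area for the expected contact.

    ocr_results: [(text, x, y), ...] already filtered to the app window,
                 with coords relative to the window's top-left.
    content_x:   left edge of the main content area (right of the sidebar),
                 so sidebar matches can never satisfy the check.
    """
    header_hits = [t for t, x, y in ocr_results if x >= content_x and y <= header_h]
    return any(expected_name in t for t in header_hits)
```

If this returns False, abort before typing anything, per step 3c.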
1. Activate the app, ensure window is reasonably sized (≥800x600)
2. Run: python3 app_memory.py learn --app AppName
3. System automatically:
a. Captures window screenshot
b. Runs GPA-GUI-Detector (YOLO) → finds all icons/buttons
c. Runs Apple Vision OCR → finds all text
d. Merges with IoU dedup
e. Crops each element → saves to memory/apps/appname/icons/
f. Auto-cleans dynamic content (timestamps, message previews)
g. Reports unlabeled icons
4. Agent identifies unlabeled icons (vision model looks at grid)
5. Rename: python3 app_memory.py rename --old unlabeled_xxx --new descriptive_name
6. Clean remaining dynamic content manually if needed
7. Final profile should have ~20-30 fixed UI components
1. Capture window screenshot
2. Template match against saved icon (OpenCV matchTemplate, threshold=0.8)
3. If matched (conf > 0.8):
a. Get relative coords from match
b. Convert to screen coords: screen = window_pos + relative
c. Verify: coords within window bounds? confidence > 0.7?
d. Click: cliclick c:<screen_x>,<screen_y>
4. If not matched:
a. Run full detection (YOLO + OCR)
b. Ask LLM to identify target element
c. Save new component to memory (auto-learn)
d. Click the identified element
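Step 2 can be illustrated with a pure-NumPy stand-in for what `cv2.matchTemplate` with `TM_CCOEFF_NORMED` computes. The real code uses OpenCV; this naive sliding-window version is only for clarity:

```python
import numpy as np

def match_template(screenshot, template):
    """Slide the template over a grayscale screenshot; return the best
    normalized-correlation score and its top-left (x, y) position."""
    H, W = screenshot.shape
    h, w = template.shape
    t = template.astype(np.float64) - template.mean()
    tn = np.linalg.norm(t)
    best, best_xy = -1.0, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = screenshot[y:y + h, x:x + w].astype(np.float64)
            p = patch - patch.mean()
            pn = np.linalg.norm(p)
            score = float((t * p).sum() / (tn * pn)) if tn and pn else 0.0
            if score > best:
                best, best_xy = score, (x, y)
    return best, best_xy
```

A match is accepted only if the score clears the threshold (0.8 in the flow above); otherwise fall back to full detection.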
1. Take screenshot
2. Run detection (YOLO + OCR)
3. Compare against known page layout
4. If new elements found:
a. Crop and save as unlabeled
b. Use LLM to identify
c. Update memory
5. If expected elements missing:
a. Maybe different page/state
b. Try learning as new page: learn --page settings
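The comparison in step 3 is essentially a set difference over component names. A minimal sketch (hypothetical helper, not the actual script):

```python
def diff_page(detected, known_layout):
    """Compare detected component names against a saved page layout.

    Returns (new, missing): new elements worth learning, and expected
    elements that are absent — a hint we may be on a different page/state."""
    detected, known = set(detected), set(known_layout)
    return sorted(detected - known), sorted(known - detected)
```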
Run the setup script on a fresh Mac:
# Clone the repo
git clone https://github.com/Fzkuji/gui-agent-skills.git
cd gui-agent-skills
# Run setup (installs everything)
bash scripts/setup.sh
This will:
- Installs `cliclick` and Python 3.12 via Homebrew
- Creates the Python venv at `~/gui-agent-env/`
- Downloads the detection model to `~/GPA-GUI-Detector/`

After setup, also grant Accessibility permissions: System Settings → Privacy & Security → Accessibility → Add Terminal / OpenClaw
source ~/gui-agent-env/bin/activate
cd scripts/
# Learn an app (captures window, detects all elements, saves to memory)
python3 app_memory.py learn --app WeChat
# After learning, identify unlabeled icons (agent does this automatically)
# Then operate:
python3 app_memory.py click --app WeChat --component search_bar_icon
python3 app_memory.py detect --app WeChat
| Script | Purpose |
|--------|---------|
| setup.sh | Run first on new machine — installs all dependencies |
| ui_detector.py | Unified detection engine (YOLO + OCR + AX) |
| app_memory.py | Per-app visual memory (learn / detect / click / verify) |
| gui_agent.py | Legacy task executor (send_message, read_messages, etc.) |
| template_match.py | Low-level template matching utilities |
| computer_use.py | Claude Computer Use API (experimental) |
All scripts use venv: source ~/gui-agent-env/bin/activate
| Model | Size | Auto-installed by setup.sh | Purpose |
|-------|------|---------------------------|---------|
| GPA-GUI-Detector | 40MB | ✅ ~/GPA-GUI-Detector/model.pt | UI element detection |
Optional (not auto-installed):

| Model | Size | Auto-installed by setup.sh | Purpose |
|-------|------|---------------------------|---------|
| OmniParser V2 | 1.1GB | ❌ | Alt detection (weaker on desktop apps) |
| GUI-Actor 2B | 4.5GB | ❌ | End-to-end grounding (experimental) |
Paths:
- `~/gui-agent-env/` (created by setup.sh)
- `~/GPA-GUI-Detector/model.pt` (downloaded by setup.sh)
- `<skill-dir>/memory/apps/<appname>/` (created on first `learn`)
- Always resolve home via `os.path.expanduser("~")`, NOT hardcoded usernames

gui-agent/
├── SKILL.md # This file
├── actions/ # Atomic operations
│ └── _actions.yaml
├── scenes/ # Per-app operation workflows
│ ├── wechat/
│ ├── discord.yaml
│ └── ...
├── memory/ # Visual memory (gitignored)
│ └── apps/
│ ├── wechat/ # profile.json + icons/ + pages/
│ └── ...
├── scripts/ # Core scripts
│ ├── ui_detector.py # Detection engine
│ ├── app_memory.py # Memory management
│ ├── gui_agent.py # Task executor
│ └── ...
├── apps/ # App UI config (JSON)
├── docs/core.md # Core principles
└── README.md
Desktop GUI automation skill for OpenClaw and Claude Code.
<p align="center"> <img src="https://img.shields.io/badge/Platform-macOS_Apple_Silicon-black?logo=apple" /> <img src="https://img.shields.io/badge/Skill_for-OpenClaw-red?logo=lobster" /> <img src="https://img.shields.io/badge/Works_with-Claude_Code-blueviolet" /> <img src="https://img.shields.io/badge/License-MIT-yellow" /> </p>

Teach your AI assistant to see, learn, and operate any macOS app — WeChat, Chrome, Discord, anything. It learns your UI once, remembers every button, and clicks precisely by name.
GUIClaw is an agent skill — a set of instructions and tools that teach AI assistants how to control your desktop. Instead of writing automation scripts, your AI:
# Clone into your OpenClaw skills directory
cd ~/.openclaw/workspace/skills
git clone https://github.com/Fzkuji/GUIClaw.git gui-agent
# Run setup (installs cliclick, Python env, detection model)
bash gui-agent/scripts/setup.sh
Then add to your OpenClaw config (~/.openclaw/openclaw.json):
{
"skills": {
"load": {
"extraDirs": ["~/.openclaw/workspace/skills"]
},
"entries": {
"gui-agent": { "enabled": true }
}
}
}
That's it. Your OpenClaw agent will now read SKILL.md automatically when you ask it to operate any desktop app.
# Clone anywhere
git clone https://github.com/Fzkuji/GUIClaw.git
cd GUIClaw
bash scripts/setup.sh
# Point Claude Code to the skill
# Add to your CLAUDE.md or project instructions:
# "For GUI automation tasks, read skills/gui-agent/SKILL.md first."
Claude Code reads SKILL.md, understands the architecture, and uses the scripts directly.
Once installed, just talk to your AI:
You: "Send a WeChat message to John saying hi"
AI: (reads SKILL.md → learns WeChat → finds contact → verifies → sends) "✅ Sent to John: hi"
You: "Check the GPU status on my server"
AI: (opens Chrome → finds JupyterLab bookmark → clicks nvitop tab → reads GPU info) "8×H20 GPUs all at 92% utilization, experiment is running."
You: "Open the settings in Discord"
AI: (activates Discord → template matches settings icon → clicks) "✅ Opened Discord settings."
First interaction — AI detects everything (~4 seconds):
🔍 YOLO: 43 icons 📝 OCR: 34 text elements 🔗 Merged → 24 fixed UI components
Every interaction after — instant recognition (~0.3 seconds):
✅ sidebar_contacts (85,214) conf=1.0
✅ emoji_button ...