by Fzkuji
Vision-based desktop automation skills for LLM agents on macOS.
# Add to your Claude Code skills
git clone https://github.com/Fzkuji/GUI-Agent-Skills

You ARE the agent loop: Observe → Decide → Act → Verify.
User Intent ("send WeChat message to 宋文涛")
│
▼
┌─────────────────────────────────────┐
│ 1. DETECT (ui_detector.py) │
│ GPA-GUI-Detector (YOLO, 40MB) │ → icons, buttons
│ Apple Vision OCR │ → text elements (Chinese)
│ Accessibility API │ → Dock, menubar, named controls
│ IoU merge + dedup │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. MATCH (app_memory.py) │
│ Template matching vs memory │ → known components (0.3s, conf=1.0)
│ If matched → use stored coords │
│ If unknown → LLM identifies │ → save to memory for next time
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. ACT (gui_agent.py + cliclick) │
│ Relative coords + window pos │ → screen coordinates
│ Pre-action verify │ → correct contact? correct field?
│ Execute click/type/key │
│ Post-action verify (OCR) │
└─────────────────────────────────────┘
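The IoU merge in stage 1 can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names — the actual `ui_detector.py` merge logic and threshold may differ:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def merge_detections(yolo_boxes, ocr_boxes, thresh=0.5):
    """Keep every YOLO box; add an OCR box only if it doesn't
    substantially overlap something already kept."""
    merged = list(yolo_boxes)
    for box in ocr_boxes:
        if all(iou(box, kept) < thresh for kept in merged):
            merged.append(box)
    return merged
```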
| Detector | What it finds | Speed | Best for |
|----------|--------------|-------|----------|
| GPA-GUI-Detector | Icons, buttons, UI elements | 0.3s | WeChat icons, any app's buttons |
| Apple Vision OCR | Text (Chinese + English) | 1.6s | Chat content, labels, menus |
| Accessibility API | Named controls with positions | 0.1s | Dock, Discord, Chrome, menubar |
| Template Match | Previously seen components | 0.3s | Known UI elements (conf=1.0) |
App has good AX? (Discord, Chrome, System Settings)
→ Use AX API directly (fastest, most accurate)
App has bad AX? (WeChat, QQ)
→ GPA-GUI-Detector + OCR → template match against memory
First time using an app?
→ Run `app_memory.py learn --app AppName`
→ Saves all components to memory/apps/appname/
Need to click a known element?
→ Run `app_memory.py click --app AppName --component name`
→ Template match → relative coords → click
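The relative-to-screen conversion behind that click step can be sketched as follows. Function names and the bounds check are illustrative, not the actual `app_memory.py` API:

```python
import subprocess

def to_screen(window_pos, window_size, rel):
    """Convert relative coords stored in memory into absolute screen
    coords, refusing anything that falls outside the window bounds."""
    sx, sy = window_pos[0] + rel[0], window_pos[1] + rel[1]
    if not (window_pos[0] <= sx <= window_pos[0] + window_size[0]
            and window_pos[1] <= sy <= window_pos[1] + window_size[1]):
        raise ValueError(f"({sx},{sy}) is outside the window bounds")
    return int(sx), int(sy)

def click(window_pos, window_size, rel):
    sx, sy = to_screen(window_pos, window_size, rel)
    # cliclick expects integer logical screen coordinates
    subprocess.run(["/opt/homebrew/bin/cliclick", f"c:{sx},{sy}"], check=True)
```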
Each app gets a memory directory with learned components:
memory/apps/
├── wechat/
│ ├── profile.json # Component registry (coords, labels, types)
│ ├── icons/ # Cropped component images (PNG)
│ │ ├── Q_Search.png
│ │ ├── 宋文涛.png
│ │ ├── icon_11_173_225.png (unlabeled → LLM identifies later)
│ │ └── ...
│ └── pages/
│ ├── main_annotated.jpg # Annotated screenshot
│ └── main.json # Page layout
├── discord/
│ └── ...
Window capture: `screencapture -l <windowID>`, not fullscreen.
Icon filename = content description: `chat_button.png`, `search_bar.png`, NOT `icon_0_170_103.png`.
Unlabeled icons: save as `unlabeled_<region>_<x>_<y>.png` temporarily, then `app_memory.py rename --old unlabeled_xxx --new actual_name`.
Dedup: Never save duplicate icons. Before saving, compare against existing icons (similarity > 0.92 = duplicate). Keep ONE copy.
Cleanup: Run app_memory.py cleanup --app AppName to remove duplicates. Dynamic content (chat messages, timestamps, avatars in chat) should be periodically cleaned.
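The similarity check behind dedup might look like this sketch, using plain-NumPy normalized cross-correlation on grayscale crops. The real scripts may use a different metric (e.g. OpenCV template matching), so treat the names and the exact formula as assumptions:

```python
import numpy as np

def similarity(img_a, img_b):
    """Normalized cross-correlation between two equal-sized grayscale crops.
    Returns a value in [-1, 1]; 1.0 means visually identical."""
    a = img_a.astype(np.float64).ravel()
    b = img_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def is_duplicate(candidate, existing_icons, thresh=0.92):
    """True if the candidate crop matches any stored icon of the same size."""
    return any(
        icon.shape == candidate.shape and similarity(candidate, icon) > thresh
        for icon in existing_icons
    )
```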
Per-app, per-page: Each app has its own memory directory. Different pages (main, chat, settings) get separate page layouts.
Important vs unimportant:
What happens when `learn` runs:
1. Capture window screenshot
2. Run GPA-GUI-Detector + Apple Vision OCR
3. For each detected element:
a. Has OCR label? → use label as filename
b. No label? → name as "unlabeled_<region>_<x>_<y>"
c. Check visual dedup (similarity > 0.92) → skip if duplicate
d. Crop and save to icons/
4. After saving all elements:
a. Use vision model (Claude/GPT-4o) to identify ALL unlabeled icons
b. Rename identified icons: app_memory.py rename --old unlabeled_xxx --new actual_name
c. Remove dynamic content (timestamps, message previews, chat text, stickers)
d. Keep ONLY fixed UI elements (buttons, icons, tabs, navigation, input fields)
5. Result: clean profile with ~20-30 named, fixed UI components per page
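The naming rule in steps 3a–3b can be sketched as a small helper (hypothetical function, not the actual `app_memory.py` code):

```python
import re

def component_filename(label, region="main", x=0, y=0):
    """OCR label → safe filename; unlabeled elements get a positional
    name that the LLM is expected to replace later via `rename`."""
    if label:
        # Keep word characters (incl. CJK) and hyphens; squash the rest
        safe = re.sub(r"[^\w\u4e00-\u9fff-]+", "_", label).strip("_")
        return f"{safe}.png"
    return f"unlabeled_{region}_{x}_{y}.png"
```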
The golden rule: Only save things that will look the same next time you open the app. If it changes every session, don't save it.
KEEP (fixed UI elements — same every time): buttons, icons, tabs, navigation, input fields.
REMOVE (dynamic content — different every session): timestamps, message previews, chat text, stickers.
HOW TO JUDGE: Ask yourself: will this element look identical the next time the app opens? If it changes every session, remove it.
After every learn, verify:
No `unlabeled_` files remain (all identified or removed).
No duplicate icons (run `cleanup` if needed).

| Scene | Location | Goal |
|-------|----------|------|
| Atomic Actions | actions/_actions.yaml | click, type, paste, AX scan... |
| WeChat | scenes/wechat/ | Send/read messages, scroll history |
| Discord | scenes/discord.yaml | Send/read messages |
| Telegram | scenes/telegram.yaml | Send/read messages |
| 1Password | scenes/1password.yaml | Retrieve credentials |
| VPN Reconnect | scenes/vpn-reconnect.yaml | Reconnect GlobalProtect VPN |
| App Exploration | scenes/app-explore.yaml | Map an unfamiliar app's UI |
What do you need to do?
│
├── GUI task on any app?
│ ├── App in memory? → app_memory.py detect/click
│ └── New app? → app_memory.py learn → then operate
│
├── WeChat? → read scenes/wechat/index.yaml
├── Discord? → read scenes/discord.yaml
├── Telegram? → read scenes/telegram.yaml
├── VPN/SSH down? → read scenes/vpn-reconnect.yaml
├── Need a password? → read scenes/1password.yaml
├── New app? → read scenes/app-explore.yaml
├── Atomic operation? → read actions/_actions.yaml
└── Principles? → read docs/core.md
These rules exist because of real bugs that caused messages sent to wrong people.
VERIFY BEFORE SENDING — Before typing ANY message, OCR the chat header to confirm the correct contact/group name is displayed. If wrong → ABORT immediately, do NOT send.
ALL OCR/clicks MUST be within target window bounds — Get window bounds first, filter all OCR results by window region. NEVER click coordinates outside the target app's window. Without this, you WILL click on other apps visible behind.
NEVER auto-learn from wrong-app context — If a click landed outside the target app window, do NOT save that location as a template. Validate window bounds before auto_learn.
Reject tiny templates — Templates smaller than 30×30 pixels produce false matches everywhere. Never save them.
Template match ≠ correct target — A template matching "宋文涛" text could be in a group chat name, a forwarded message, or another app. Always verify the CHAT HEADER after navigation, not just the sidebar click.
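The window-bounds rule above amounts to a simple center-in-rectangle check before anything is clicked. A sketch — the dict keys and function name are illustrative, not the actual detector output format:

```python
def filter_to_window(ocr_results, win_bounds):
    """Keep only OCR hits whose center lies inside the target window.

    ocr_results: list of dicts like {"text": ..., "x": ..., "y": ..., "w": ..., "h": ...}
    win_bounds:  (win_x, win_y, win_w, win_h) in screen coordinates
    """
    wx, wy, ww, wh = win_bounds
    kept = []
    for r in ocr_results:
        cx, cy = r["x"] + r["w"] / 2, r["y"] + r["h"] / 2
        if wx <= cx <= wx + ww and wy <= cy <= wy + wh:
            kept.append(r)
    return kept
```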
Every time you interact with a GUI app or website, check memory FIRST:
- No `memory/apps/<appname>/`? → Run `learn` automatically before operating
- No `memory/apps/<browser>/sites/<domain>/`? → Run `learn_site` automatically
- New page in a known app? → Run `learn --page <pagename>` to add it

Do NOT wait for the user to ask you to learn. This is YOUR responsibility.
- Coordinate conversion: `screen_x = window_x + relative_x`, `screen_y = window_y + relative_y`
- Window bounds: `osascript -e 'tell application "System Events" to tell process "AppName" to return {position, size} of window 1'`
- Window capture: `screencapture -x -l <windowID> output.png`
- Activate app: `osascript -e 'tell application "AppName" to activate'`
- Resize window: `tell process "AppName" to set size of window 1 to {900, 650}`
- Click: `/opt/homebrew/bin/cliclick c:<x>,<y>` (logical screen coords, integers)
- Type: `cliclick t:"text"` (ASCII only, special chars may break)
- Paste: `pbcopy` + Cmd+V (MUST set `LANG=en_US.UTF-8`)
- Key press: `cliclick kp:return` (valid keys: return, esc, tab, delete, space, arrow-*, f1-f16)
- Keystroke: `osascript -e 'tell app "System Events" to keystroke "v" using command down'`

Browsers are a two-layer system:
memory/apps/
├── google_chrome/
│ ├── profile.json # Browser chrome UI (tabs, address bar, etc.)
│ ├── icons/ # Browser UI icons
│ └── sites/ # Per-website memory
│ ├── 12306.cn/
│ │ ├── profile.json # Site-specific UI elements
│ │ ├── icons/ # Site buttons, nav items
│ │ └── pages/
│ │ ├── search.json # Train search page layout
│ │ └── results.json # Results page layout
│ ├── google.com/
│ └── ...
Browser operation flow:
1. Learn browser chrome once: app_memory.py learn --app "Google Chrome"
→ Saves: address bar, tab controls, bookmarks bar, etc.
2. Navigate to a website:
a. Click address bar (template match: address_bar)
b. Type URL or search term (paste)
c. Press Enter
3. On a new website:
a. Wait for page load (1-2s)
b. Run detection (YOLO + OCR) on the page content area only
c. Save site-specific UI elements to sites/<domain>/
d. Dynamic content (search results, articles) = DON'T save
e. Fixed UI (nav bar, search box, buttons, filters) = SAVE
4. Operate within website:
a. Template match known site elements first
b. If not found → OCR find text → click
c. For form fields: click field → paste text → verify
What to save per website:
What NOT to save per website:
Autocomplete fields (like 12306 station selector): typing text alone is NOT enough. MUST click the dropdown suggestion item. The field only accepts values selected from the dropdown.
Chinese input in browsers: System IME interferes with website autocomplete.
Workaround: `cliclick t:bjn` with English input → website dropdown shows 北京南 → click it.
Cmd+V paste in web forms: May produce garbled text (encoding issues). Prefer `cliclick t:text` for ASCII/pinyin, and let website autocomplete handle Chinese.
Date pickers: Usually need to click the calendar UI, not just type a date string. Some accept direct input, some don't.
1. PREPARE
a. Activate the app: osascript tell "WeChat" to activate
b. Get window bounds: (win_x, win_y, win_w, win_h)
c. ALL subsequent OCR/clicks MUST be within these bounds
2. NAVIGATE TO CONTACT
a. Check if contact visible in sidebar (OCR within window bounds)
- If found → click it
- If not found → search:
i. Click search bar (template match or OCR within window)
ii. Paste contact name (pbcopy + Cmd+V)
iii. Wait 1s for results
iv. OCR find contact in results (within window bounds) → click
3. ⚠️ VERIFY CONTACT (MANDATORY — DO NOT SKIP)
a. OCR the chat HEADER area (top 120px of main content area)
b. Confirm expected contact name appears in the header
c. If WRONG contact or name NOT found:
→ LOG what chat IS open (for debugging)
→ ABORT immediately, do NOT type anything
→ Return error
d. Only proceed to step 4 if verification PASSES
4. TYPE MESSAGE (only after step 3 passes)
a. Click input field (template match or window_calc)
b. Paste message (pbcopy + Cmd+V, NOT cliclick type)
5. SEND
a. Press Enter (cliclick kp:return)
6. VERIFY SENT
a. OCR the chat area
b. Confirm first 10 chars of message visible
c. If not found → report warning (may still have sent)
WHY step 3 is critical: Template matching "宋文涛" could match a group chat name, a forwarded message, or text in another app window.
Only the chat HEADER reliably shows who you're actually chatting with.
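Step 3's header check can be expressed as a pure function over window-filtered OCR results. The function name and the `content_x`/`header_h` parameters are illustrative assumptions:

```python
def verify_chat_header(ocr_results, expected_name, content_x, header_h=120):
    """Check the top strip of the main content area for the expected contact.

    ocr_results: [(text, x, y), ...] already filtered to the app window,
                 with coords relative to the window's top-left.
    content_x:   left edge of the main content area (right of the sidebar),
                 so sidebar matches can never satisfy the check.
    """
    header_hits = [t for t, x, y in ocr_results if x >= content_x and y <= header_h]
    return any(expected_name in t for t in header_hits)
```

If this returns False, abort before typing anything, per step 3c.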
1. Activate the app, ensure window is reasonably sized (≥800x600)
2. Run: python3 app_memory.py learn --app AppName
3. System automatically:
a. Captures window screenshot
b. Runs GPA-GUI-Detector (YOLO) → finds all icons/buttons
c. Runs Apple Vision OCR → finds all text
d. Merges with IoU dedup
e. Crops each element → saves to memory/apps/appname/icons/
f. Auto-cleans dynamic content (timestamps, message previews)
g. Reports unlabeled icons
4. Agent identifies unlabeled icons (vision model looks at grid)
5. Rename: python3 app_memory.py rename --old unlabeled_xxx --new descriptive_name
6. Clean remaining dynamic content manually if needed
7. Final profile should have ~20-30 fixed UI components
1. Capture window screenshot
2. Template match against saved icon (OpenCV matchTemplate, threshold=0.8)
3. If matched (conf > 0.8):
a. Get relative coords from match
b. Convert to screen coords: screen = window_pos + relative
c. Verify: coords within window bounds? confidence > 0.7?
d. Click: cliclick c:<screen_x>,<screen_y>
4. If not matched:
a. Run full detection (YOLO + OCR)
b. Ask LLM to identify target element
c. Save new component to memory (auto-learn)
d. Click the identified element
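Step 2 can be illustrated with a pure-NumPy stand-in for what `cv2.matchTemplate` with `TM_CCOEFF_NORMED` computes. The real code uses OpenCV; this naive sliding-window version is only for clarity:

```python
import numpy as np

def match_template(screenshot, template):
    """Slide the template over a grayscale screenshot; return the best
    normalized-correlation score and its top-left (x, y) position."""
    H, W = screenshot.shape
    h, w = template.shape
    t = template.astype(np.float64) - template.mean()
    tn = np.linalg.norm(t)
    best, best_xy = -1.0, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = screenshot[y:y + h, x:x + w].astype(np.float64)
            p = patch - patch.mean()
            pn = np.linalg.norm(p)
            score = float((t * p).sum() / (tn * pn)) if tn and pn else 0.0
            if score > best:
                best, best_xy = score, (x, y)
    return best, best_xy
```

A match is accepted only if the score clears the threshold (0.8 in the flow above); otherwise fall back to full detection.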
1. Take screenshot
2. Run detection (YOLO + OCR)
3. Compare against known page layout
4. If new elements found:
a. Crop and save as unlabeled
b. Use LLM to identify
c. Update memory
5. If expected elements missing:
a. Maybe different page/state
b. Try learning as new page: learn --page settings
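The comparison in step 3 is essentially a set difference over component names. A minimal sketch (hypothetical helper, not the actual script):

```python
def diff_page(detected, known_layout):
    """Compare detected component names against a saved page layout.

    Returns (new, missing): new elements worth learning, and expected
    elements that are absent — a hint we may be on a different page/state."""
    detected, known = set(detected), set(known_layout)
    return sorted(detected - known), sorted(known - detected)
```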
Run the setup script on a fresh Mac:
# Clone the repo
git clone https://github.com/Fzkuji/gui-agent-skills.git
cd gui-agent-skills
# Run setup (installs everything)
bash scripts/setup.sh
This will:
- Installs `cliclick` and Python 3.12 via Homebrew
- Creates the Python venv at `~/gui-agent-env/`
- Downloads the detection model to `~/GPA-GUI-Detector/`

After setup, also grant Accessibility permissions: System Settings → Privacy & Security → Accessibility → Add Terminal / OpenClaw
source ~/gui-agent-env/bin/activate
cd scripts/
# Learn an app (captures window, detects all elements, saves to memory)
python3 app_memory.py learn --app WeChat
# After learning, identify unlabeled icons (agent does this automatically)
# Then operate:
python3 app_memory.py click --app WeChat --component search_bar_icon
python3 app_memory.py detect --app WeChat
| Script | Purpose |
|--------|---------|
| setup.sh | Run first on new machine — installs all dependencies |
| ui_detector.py | Unified detection engine (YOLO + OCR + AX) |
| app_memory.py | Per-app visual memory (learn / detect / click / verify) |
| gui_agent.py | Legacy task executor (send_message, read_messages, etc.) |
| template_match.py | Low-level template matching utilities |
| computer_use.py | Claude Computer Use API (experimental) |
All scripts use venv: source ~/gui-agent-env/bin/activate
| Model | Size | Auto-installed by setup.sh | Purpose |
|-------|------|---------------------------|---------|
| GPA-GUI-Detector | 40MB | ✅ ~/GPA-GUI-Detector/model.pt | UI element detection |
Optional (not auto-installed):

| Model | Size | Auto-installed by setup.sh | Purpose |
|-------|------|---------------------------|---------|
| OmniParser V2 | 1.1GB | ❌ | Alt detection (weaker on desktop apps) |
| GUI-Actor 2B | 4.5GB | ❌ | End-to-end grounding (experimental) |
Paths:
- `~/gui-agent-env/` (created by setup.sh)
- `~/GPA-GUI-Detector/model.pt` (downloaded by setup.sh)
- `<skill-dir>/memory/apps/<appname>/` (created on first `learn`)
- Always resolve home via `os.path.expanduser("~")`, NOT hardcoded usernames

gui-agent/
├── SKILL.md # This file
├── actions/ # Atomic operations
│ └── _actions.yaml
├── scenes/ # Per-app operation workflows
│ ├── wechat/
│ ├── discord.yaml
│ └── ...
├── memory/ # Visual memory (gitignored)
│ └── apps/
│ ├── wechat/ # profile.json + icons/ + pages/
│ └── ...
├── scripts/ # Core scripts
│ ├── ui_detector.py # Detection engine
│ ├── app_memory.py # Memory management
│ ├── gui_agent.py # Task executor
│ └── ...
├── apps/ # App UI config (JSON)
├── docs/core.md # Core principles
└── README.md
Desktop GUI automation skill for OpenClaw and Claude Code.
<p align="center"> <img src="https://img.shields.io/badge/Platform-macOS_Apple_Silicon-black?logo=apple" /> <img src="https://img.shields.io/badge/Skill_for-OpenClaw-red?logo=lobster" /> <img src="https://img.shields.io/badge/Works_with-Claude_Code-blueviolet" /> <img src="https://img.shields.io/badge/License-MIT-yellow" /> </p>

Teach your AI assistant to see, learn, and operate any macOS app — WeChat, Chrome, Discord, anything. It learns your UI once, remembers every button, and clicks precisely by name.
GUIClaw is an agent skill — a set of instructions and tools that teach AI assistants how to control your desktop. Instead of writing automation scripts, your AI:
# Clone into your OpenClaw skills directory
cd ~/.openclaw/workspace/skills
git clone https://github.com/Fzkuji/GUIClaw.git gui-agent
# Run setup (installs cliclick, Python env, detection model)
bash gui-agent/scripts/setup.sh
Then add to your OpenClaw config (~/.openclaw/openclaw.json):
{
"skills": {
"load": {
"extraDirs": ["~/.openclaw/workspace/skills"]
},
"entries": {
"gui-agent": { "enabled": true }
}
}
}
That's it. Your OpenClaw agent will now read SKILL.md automatically when you ask it to operate any desktop app.
# Clone anywhere
git clone https://github.com/Fzkuji/GUIClaw.git
cd GUIClaw
bash scripts/setup.sh
# Point Claude Code to the skill
# Add to your CLAUDE.md or project instructions:
# "For GUI automation tasks, read skills/gui-agent/SKILL.md first."
Claude Code reads SKILL.md, understands the architecture, and uses the scripts directly.
Once installed, just talk to your AI:
You: "Send a WeChat message to John saying hi"
AI: (reads SKILL.md → learns WeChat → finds contact → verifies → sends) "✅ Sent to John: hi"
You: "Check the GPU status on my server"
AI: (opens Chrome → finds JupyterLab bookmark → clicks nvitop tab → reads GPU info) "8×H20 GPUs all at 92% utilization, experiment is running."
You: "Open the settings in Discord"
AI: (activates Discord → template matches settings icon → clicks) "✅ Opened Discord settings."
First interaction — AI detects everything (~4 seconds):
🔍 YOLO: 43 icons 📝 OCR: 34 text elements 🔗 Merged → 24 fixed UI components
Every interaction after — instant recognition (~0.3 seconds):
✅ sidebar_contacts (85,214) conf=1.0
✅ emoji_button ...