by 0xMassi
Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.
# Add to your Claude Code skills
git clone https://github.com/0xMassi/webclaw

High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically.

When to use: web_fetch returns empty/blocked content (403, Cloudflare challenges).

All requests go to https://api.webclaw.io/v1/.
Authentication: Authorization: Bearer $WEBCLAW_API_KEY
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"only_main_content": true
}'
Request fields:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | URL to scrape |
| formats | string[] | ["markdown"] | Output formats: markdown, text, llm, json |
| include_selectors | string[] | [] | CSS selectors to keep (e.g. ["article", ".content"]) |
| exclude_selectors | string[] | [] | CSS selectors to remove (e.g. ["nav", "footer", ".ads"]) |
| only_main_content | bool | false | Extract only the main article/content area |
| no_cache | bool | false | Skip cache, fetch fresh |
| max_cache_age | int | server default | Max acceptable cache age in seconds |
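For programmatic use, the request body shown in the curl example can be assembled with a small helper. This is an illustrative sketch, not part of any webclaw SDK; the field names and defaults come from the table above, and sending the request is left to your HTTP client:

```python
import json

# Documented defaults for POST /v1/scrape (max_cache_age has a server-side default, so it is omitted)
SCRAPE_DEFAULTS = {
    "formats": ["markdown"],
    "include_selectors": [],
    "exclude_selectors": [],
    "only_main_content": False,
    "no_cache": False,
}

def build_scrape_payload(url: str, **overrides) -> str:
    """Build the JSON body for POST /v1/scrape, applying documented defaults."""
    unknown = set(overrides) - set(SCRAPE_DEFAULTS) - {"max_cache_age"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    body = {"url": url, **SCRAPE_DEFAULTS, **overrides}
    return json.dumps(body)

payload = build_scrape_payload("https://example.com", only_main_content=True)
```

Rejecting unknown field names client-side catches typos like `only_main_contnet` before they silently become a default-valued request.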
Most web scraping tools give your agent one of two bad outputs: blocked pages or raw, noisy HTML.
webclaw.io is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.
webclaw turns a URL into clean content your tools can actually use.
webclaw https://example.com --format markdown
# Example Domain
This domain is for use in illustrative examples in documents.
You may use this domain in literature without prior coordination or asking for permission.
Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.
The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:
npx create-webclaw
The installer detects supported clients and configures the MCP server for you.
brew tap 0xMassi/webclaw
brew install webclaw
Download macOS and Linux binaries from GitHub Releases.
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp
Response:
{
"url": "https://example.com",
"metadata": {
"title": "Example",
"description": "...",
"language": "en",
"word_count": 1234
},
"markdown": "# Page Title\n\nContent here...",
"cache": { "status": "miss" }
}
Format options:
- markdown — clean markdown, best for general use
- text — plain text without formatting
- llm — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
- json — full extraction result with all metadata

When antibot bypass activates (automatic, no extra config):
{
"antibot": {
"bypass": true,
"elapsed_ms": 3200
}
}
Starts an async job. Poll for results.
Start crawl:
curl -X POST https://api.webclaw.io/v1/crawl \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"max_depth": 3,
"max_pages": 50,
"use_sitemap": true
}'
Response: { "job_id": "abc-123", "status": "running" }
Poll status:
curl https://api.webclaw.io/v1/crawl/abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response when complete:
{
"job_id": "abc-123",
"status": "completed",
"total": 47,
"completed": 45,
"errors": 2,
"pages": [
{
"url": "https://docs.example.com/intro",
"markdown": "# Introduction\n...",
"metadata": { "title": "Intro", "word_count": 500 }
}
]
}
Request fields:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | Starting URL |
| max_depth | int | 3 | How many links deep to follow |
| max_pages | int | 100 | Maximum pages to crawl |
| use_sitemap | bool | false | Seed URLs from sitemap.xml |
| formats | string[] | ["markdown"] | Output formats per page |
| include_selectors | string[] | [] | CSS selectors to keep |
| exclude_selectors | string[] | [] | CSS selectors to remove |
| only_main_content | bool | false | Main content only |
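The start-then-poll flow above can be sketched as a small client-side loop. `fetch_status` here is a stand-in for the GET /v1/crawl/{job_id} call; the simulated status sequence only exists to make the sketch self-contained:

```python
import time

def wait_for_crawl(job_id: str, fetch_status, poll_interval: float = 2.0,
                   timeout: float = 600.0) -> dict:
    """Poll a crawl job until it leaves the 'running' state.

    fetch_status stands in for GET /v1/crawl/{job_id}; swap in a real
    HTTP call in practice.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status["status"] != "running":
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"crawl {job_id} still running after {timeout}s")

# Simulated responses standing in for the real endpoint:
_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed", "total": 47, "completed": 45, "errors": 2},
])
result = wait_for_crawl("abc-123", lambda _id: next(_responses), poll_interval=0.01)
```

A bounded deadline plus a fixed poll interval keeps a stuck job from hanging your agent indefinitely.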
Fast URL discovery without full content extraction.
curl -X POST https://api.webclaw.io/v1/map \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"url": "https://example.com",
"count": 142,
"urls": [
"https://example.com/about",
"https://example.com/pricing",
"https://example.com/docs/intro"
]
}
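A mapped URL list pairs naturally with crawl: discover everything cheaply, then narrow to the section you care about before spending crawl budget. A minimal filtering sketch (the prefix convention is an assumption about your site's URL layout):

```python
def filter_section(urls: list[str], prefix: str) -> list[str]:
    """Keep only URLs under a given path prefix (e.g. the /docs/ section)."""
    return [u for u in urls if u.startswith(prefix)]

mapped = [
    "https://example.com/about",
    "https://example.com/pricing",
    "https://example.com/docs/intro",
]
docs_only = filter_section(mapped, "https://example.com/docs/")
```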
curl -X POST https://api.webclaw.io/v1/batch \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://a.com",
"https://b.com",
"https://c.com"
],
"formats": ["markdown"],
"concurrency": 5
}'
Response:
{
"total": 3,
"completed": 3,
"errors": 0,
"results": [
{ "url": "https://a.com", "markdown": "...", "metadata": {} },
{ "url": "https://b.com", "markdown": "...", "metadata": {} },
{ "url": "https://c.com", "error": "timeout" }
]
}
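Each entry in `results` carries either content or an `error` field, so downstream code should partition before processing. A sketch:

```python
def split_batch_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate successful scrapes from failures by the presence of 'error'."""
    ok = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return ok, failed

results = [
    {"url": "https://a.com", "markdown": "...", "metadata": {}},
    {"url": "https://b.com", "markdown": "...", "metadata": {}},
    {"url": "https://c.com", "error": "timeout"},
]
ok, failed = split_batch_results(results)
```

Failed URLs can then be retried individually, perhaps with no_cache set, without re-scraping the ones that succeeded.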
Pull structured data from any page using a JSON schema or plain-text prompt.
With JSON schema:
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"schema": {
"type": "object",
"properties": {
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } }
}
}
}
}
}
}'
With prompt:
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"prompt": "Extract all pricing tiers with names, monthly prices, and key features"
}'
Response:
{
"url": "https://example.com/pricing",
"data": {
"plans": [
{ "name": "Starter", "price": "$49/mo", "features": ["10k pages", "Email support"] },
{ "name": "Pro", "price": "$99/mo", "features": ["100k pages", "Priority support", "API access"] }
]
}
}
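Even with a schema, it is cheap to sanity-check the returned `data` before feeding it downstream. This is a hand-rolled shape check for the pricing example above, not full JSON Schema validation (a library like jsonschema would do that properly):

```python
def check_plans(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = []
    plans = data.get("plans")
    if not isinstance(plans, list) or not plans:
        return ["missing or empty 'plans' array"]
    for i, plan in enumerate(plans):
        # Each plan should mirror the schema: string name/price, list of features.
        for key, typ in (("name", str), ("price", str), ("features", list)):
            if not isinstance(plan.get(key), typ):
                problems.append(f"plans[{i}].{key} missing or wrong type")
    return problems

sample = {"plans": [{"name": "Starter", "price": "$49/mo", "features": ["10k pages"]}]}
issues = check_plans(sample)
```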
curl -X POST https://api.webclaw.io/v1/summarize \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/long-article",
"max_sentences": 3
}'
Response:
{
"url": "https://example.com/long-article",
"summary": "The article discusses... Key findings include... The author concludes that..."
}
Compare current page content against a previous snapshot.
curl -X POST https://api.webclaw.io/v1/diff \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"previous": {
"markdown": "# Old content...",
"metadata": { "title": "Old Title" }
}
}'
Response:
{
"url": "https://example.com",
"status": "changed",
"diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
"metadata_changes": [
{ "field": "title", "old": "Old Title", "new": "New Title" }
]
}
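The `diff` field is a standard unified diff, so the changed lines can be pulled out with ordinary string handling. A sketch that separates removals from additions:

```python
def changed_lines(diff: str) -> tuple[list[str], list[str]]:
    """Extract removed and added content lines from a unified diff body."""
    removed, added = [], []
    for line in diff.splitlines():
        if line.startswith("---") or line.startswith("+++"):
            continue  # file headers, not content
        if line.startswith("-"):
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
    return removed, added

diff = "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content"
removed, added = changed_lines(diff)
```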
Analyze a website's visual identity: colors, fonts, logo.
curl -X POST https://api.webclaw.io/v1/brand \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"url": "https://example.com",
"brand": {
"colors": [
{ "hex": "#FF6B35", "usage": "primary" },
{ "hex": "#1A1A2E", "usage": "background" }
],
"fonts": ["Inter", "JetBrains Mono"],
"logo_url": "https://example.com/logo.svg",
"favicon_url": "https://example.com/favicon.ico"
}
}
Search the web and optionally scrape each result page.
curl -X POST https://api.webclaw.io/v1/search \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "best rust web frameworks 2026",
"num_results": 5,
"scrape": true,
"formats": ["markdown"]
}'
Request fields:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| query | string | required | Search query |
| num_results | int | 10 | Number of search results to return |
| scrape | bool | false | Also scrape each result page for full content |
| formats | string[] | ["markdown"] | Output formats when scrape is true |
| country | string | none | Country code for localized results (e.g. "us", "de") |
| lang | string | none | Language code for results (e.g. "en", "fr") |
Response:
{
"query": "best rust web frameworks 2026",
"results": [
{
"title": "Top Rust Web Frameworks in 2026",
"url": "https://blog.example.com/rust-frameworks",
"snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
"position": 1,
"markdown": "# Top Rust Web Frameworks\n\n..."
},
{
"title": "Choosing a Rust Backend Framework",
"url": "https://dev.to/rust-backends",
"snippet": "When starting a new Rust web project...",
"position": 2,
"markdown": "# Choosing a Rust Backend\n\n..."
}
]
}
The markdown field on each result is only present when scrape: true. Without it, you get titles, URLs, snippets, and positions only.
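Because markdown is conditional on scrape: true, downstream code should fall back to the snippet rather than assume full content is present. A minimal sketch:

```python
def result_text(result: dict) -> str:
    """Prefer full scraped markdown; fall back to the search snippet."""
    return result.get("markdown") or result.get("snippet", "")

with_scrape = {"title": "t", "url": "u", "snippet": "short", "markdown": "# Full page"}
without_scrape = {"title": "t", "url": "u", "snippet": "short"}
texts = [result_text(with_scrape), result_text(without_scrape)]
```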
Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.
Start research:
curl -X POST https://api.webclaw.io/v1/research \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
"max_iterations": 5,
"max_sources": 10,
"topic": "security",
"deep": true
}'
Request fields:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| query | string | required | Research question or topic |
| max_iterations | int | server default | Maximum research iterations (search-read-analyze cycles) |
| max_sources | int | server default | Maximum number of sources to consult |
| topic | string | none | Topic hint to guide search strategy (e.g. "security", "finance", "engineering") |
| deep | bool | false | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |
Response: { "id": "res-abc-123", "status": "running" }
Poll results:
curl https://api.webclaw.io/v1/research/res-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response when complete:
{
"id": "res-abc-123",
"status": "completed",
"query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
"report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
"sources": [
{ "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
{ "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
],
"findings": [
"Turnstile uses browser environment signals and proof-of-work challenges",
"Managed mode auto-selects challenge difficulty based on visitor risk score",
"Known bypass approaches include instrumented browser automation"
],
"iterations": 5,
"elapsed_ms": 34200
}
Status values: running, completed, failed
Use an AI agent to navigate and interact with a page to accomplish a specific goal. The agent can click, scroll, fill forms, and extract data across multiple steps.
curl -X POST https://api.webclaw.io/v1/agent-scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products",
"goal": "Find the cheapest laptop with at least 16GB RAM and extract its full specs",
"max_steps": 10
}'
Request fields:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | Starting URL |
| goal | string | required | What the agent should accomplish |
| max_steps | int | server default | Maximum number of actions the agent can take |
Response:
{
"url": "https://example.com/products",
"result": "The cheapest laptop with 16GB+ RAM is the ThinkPad E14 Gen 6 at $649. Specs: AMD Ryzen 5 7535U, 16GB DDR4, 512GB SSD, 14\" FHD IPS display, 57Wh battery.",
"steps": [
{ "action": "navigate", "detail": "Loaded products page" },
{ "action": "click", "detail": "Clicked 'Laptops' category filter" },
{ "action": "click", "detail": "Applied '16GB+' RAM filter" },
{ "action": "click", "detail": "Sorted by price: low to high" },
{ "action": "extract", "detail": "Extracted specs from first matching product" }
]
}
Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.
Create a monitor:
curl -X POST https://api.webclaw.io/v1/watch \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"webhook_url": "https://hooks.example.com/pricing-changed",
"formats": ["markdown"]
}'
Request fields:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | URL to monitor |
| interval | string | required | Check frequency as cron expression or seconds (e.g. "0 */6 * * *" or "3600") |
| webhook_url | string | none | URL to POST when changes are detected |
| formats | string[] | ["markdown"] | Output formats for snapshots |
Response:
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"webhook_url": "https://hooks.example.com/pricing-changed",
"formats": ["markdown"],
"created_at": "2026-03-20T10:00:00Z",
"last_check": null,
"status": "active"
}
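Since `interval` accepts either a cron expression or plain seconds, it can help to classify the value client-side before creating a monitor. This heuristic (all digits means seconds, five whitespace-separated fields means cron) is an assumption about reasonable input, not the server's documented validation logic:

```python
def classify_interval(interval: str) -> tuple:
    """Classify a watch interval as ('seconds', int) or ('cron', str).

    Heuristic sketch: the server's actual validation may differ.
    """
    s = interval.strip()
    if s.isdigit():
        return ("seconds", int(s))
    if len(s.split()) == 5:
        return ("cron", s)
    raise ValueError(f"unrecognized interval: {interval!r}")

kinds = [classify_interval("3600"), classify_interval("0 */6 * * *")]
```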
List all monitors:
curl https://api.webclaw.io/v1/watch \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response:
{
"monitors": [
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"status": "active",
"last_check": "2026-03-20T16:00:00Z",
"checks": 4
}
]
}
Get a monitor with snapshots:
curl https://api.webclaw.io/v1/watch/watch-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response:
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"status": "active",
"snapshots": [
{
"checked_at": "2026-03-20T16:00:00Z",
"status": "changed",
"diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
},
{
"checked_at": "2026-03-20T10:00:00Z",
"status": "baseline"
}
]
}
Trigger an immediate check:
curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Delete a monitor:
curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
| Goal | Format | Why |
|------|--------|-----|
| Read and understand a page | markdown | Clean structure, headings, links preserved |
| Feed content to an AI model | llm | Optimized: includes title + URL header, clean link refs |
| Search or index content | text | Plain text, no formatting noise |
| Programmatic analysis | json | Full metadata, structured data, DOM statistics |
- Use the llm format when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.
- Set only_main_content: true to skip navigation, sidebars, and footers. Reduces noise significantly.
- Use include_selectors/exclude_selectors for fine-grained control when only_main_content isn't enough.
- Run map before crawl to discover the site structure first, then crawl specific sections.
- Use extract with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).
- Use search with scrape: true to get full page content for each search result in one call instead of searching then scraping separately.
- Use research for complex questions that need multiple sources — it handles the search-read-synthesize loop automatically. Enable deep: true for thorough analysis.
- Use agent-scrape for interactive pages where data is behind filters, pagination, or form submissions that a simple scrape cannot reach.
- Use watch for ongoing monitoring — set up a cron schedule and a webhook to get notified when a page changes without polling manually.

The webclaw MCP server uses a local-first approach:
This means:
- Set WEBCLAW_API_KEY to enable cloud fallback

| | webclaw | web_fetch |
|---|---------|-----------|
| Cloudflare bypass | Automatic (cloud fallback) | Fails (403) |
| JS-rendered pages | Automatic fallback | Readability only |
| Output quality | 20-step optimization pipeline | Basic HTML parsing |
| Structured extraction | LLM-powered, schema-based | None |
| Crawling | Full site crawl with sitemap | Single page only |
| Caching | Built-in, configurable TTL | Per-session |
| Rate limiting | Managed server-side | Client responsibility |
Use web_fetch for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.
If building from source fails because native build tools are missing, install the platform prerequisites:
| OS | Command |
| --- | --- |
| Debian / Ubuntu | sudo apt install -y pkg-config libssl-dev cmake clang git build-essential |
| Fedora / RHEL | sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc |
| Arch | sudo pacman -S pkg-config openssl cmake clang git base-devel |
| macOS | xcode-select --install |
webclaw https://stripe.com --format markdown
webclaw https://docs.anthropic.com --format llm
webclaw https://example.com/blog/post --only-main-content
webclaw https://example.com \
--include "article, main, .content" \
--exclude "nav, footer, .sidebar, .ad"
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
webclaw https://github.com --brand
webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
webclaw ships with an MCP server for AI agents.
npx create-webclaw
Manual config:
{
"mcpServers": {
"webclaw": {
"command": "~/.webclaw/webclaw-mcp"
}
}
}
Then ask your agent things like:
Scrape these competitor pricing pages and summarize the differences.
Crawl this documentation site and prepare clean context for a RAG index.
Extract the brand colors, fonts, and logos from this company website.
| Tool | What it does | Local |
| --- | --- | :-: |
| scrape | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| crawl | Follow same-origin links and extract discovered pages | Yes |
| map | Discover URLs without extracting every page | Yes |
| batch | Scrape multiple URLs in parallel | Yes |
| extract | Convert page content into structured data | Yes, with local or configured LLM |
| summarize | Summarize a page | Yes, with local or configured LLM |
| diff | Compare page content snapshots | Yes |
| brand | Extract colors, fonts, logos, and metadata | Yes |
| search | Search the web and scrape results | Hosted API |
| research | Multi-source research workflow | Hosted API |
npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go
import { Webclaw } from "@webclaw/sdk";
const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });
const page = await client.scrape({
url: "https://example.com",
formats: ["markdown"],
only_main_content: true,
});
console.log(page.markdown);
from webclaw import Webclaw
client = Webclaw(api_key="wc_your_key")
page = client.scrape(
"https://example.com",
formats=["markdown"],
only_main_content=True,
)
print(page.markdown)
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"only_main_content": true
}'
| Format | Use it when you need |
| --- | --- |
| markdown | Clean page content with structure preserved |
| llm | Compact context for agents and RAG pipelines |
| text | Plain text with minimal formatting |
| json | Structured metadata, links, images, and extracted fields |
| html | Cleaned HTML for custom processing |
The CLI and MCP server work locally without an account for the core extraction path.
Use the hosted API at webclaw.io when you need:
export WEBCLAW_API_KEY=wc_your_key
webclaw https://example.com --cloud
| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |
webclaw/
crates/
webclaw-core HTML to markdown, text, JSON, and LLM-ready output
webclaw-fetch Fetching, crawling, batching, and mapping
webclaw-llm Local and hosted LLM provider support
webclaw-pdf PDF text extraction
webclaw-mcp MCP server for AI agents
webclaw-cli Command-line interface
webclaw-core is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.
| Variable | Description |
| --- | --- |
| WEBCLAW_API_KEY | Hosted API key |
| OLLAMA_HOST | Ollama URL for local LLM features |
| OPENAI_API_KEY | OpenAI-compatible LLM provider key |
| OPENAI_BASE_URL | OpenAI-compatible base URL |
| ANTHROPIC_API_KEY | Anthropic-compatible LLM provider key |
| ANTHROPIC_BASE_URL | Anthropic-compatible base URL |
| WEBCLAW_PROXY | Single proxy URL |
| WEBCLAW_PROXY_FILE | Proxy pool file |
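The two proxy variables can be resolved into a single list with a sketch like the one below. The precedence (a single WEBCLAW_PROXY wins over WEBCLAW_PROXY_FILE) and the one-URL-per-line file format are assumptions for illustration, not documented behavior; `read_file` is injected here only to keep the sketch self-contained:

```python
def resolve_proxies(env: dict, read_file=None) -> list:
    """Resolve proxy URLs from webclaw-style env vars.

    Assumed precedence (not documented): WEBCLAW_PROXY wins over
    WEBCLAW_PROXY_FILE; the file is assumed to hold one proxy URL per line.
    """
    if env.get("WEBCLAW_PROXY"):
        return [env["WEBCLAW_PROXY"]]
    path = env.get("WEBCLAW_PROXY_FILE")
    if path and read_file is not None:
        return [line.strip() for line in read_file(path).splitlines() if line.strip()]
    return []

proxies = resolve_proxies(
    {"WEBCLAW_PROXY_FILE": "proxies.txt"},
    read_file=lambda _p: "http://p1:8080\nhttp://p2:8080\n",
)
```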
The most useful contributions right now are practical and small:
Good first places to start:
If a page extracts badly, include:
URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:
Please remove secrets, cookies, private tokens, and custo