AI-powered web scraping CLI. Describe what you want, get a production-ready Scrapy spider. Write once, reuse forever.
# Add to your Claude Code skills
```bash
git clone https://github.com/discourselab/scrapai-cli
```
You: "Add https://bbc.co.uk to my news project"
Minutes later you have a tested, production-ready scraper stored in a database. No Python, no CSS selectors, no Scrapy knowledge. The AI agent analyzes the site, writes extraction rules, verifies quality, and saves a reusable config. Run it tomorrow or next year. Same command, no AI costs.
Built by DiscourseLab. Used in production across 500+ websites.
Good fit:
Not a good fit:
See COMPARISON.md for a detailed comparison with Scrapling and crawl4ai.
We needed data for our work. Hundreds of websites, scraped regularly, structured consistently. We got sick of building and maintaining fleets of scrapers.
There are great crawling frameworks out there. Scrapy, crawl4ai, and Scrapling are our favourites, and ScrapAI is built on top of Scrapy. But even with great frameworks, you hit a wall at scale. You still need to write code for every site, monitor for breakage, and fix things when layouts change. 10 scrapers is fine. 100 is a full-time job. 500 is a team.
We looked at three options:
Option 1: Web scraping services. They charge per page, per request, or per API call. Fine for small volumes, but at scale the bills get serious. Stop paying, lose access.
Option 2: AI-powered scraping with LLMs at runtime. Call an LLM on every page to extract data. Clever, but the cost scales linearly with volume. 10,000 pages means 10,000 inference calls. That's wasteful for what is ultimately a pattern-matching problem.
Option 3: AI once, deterministic forever. Use AI at build time to analyze the site and write extraction rules. Then run those rules with Scrapy: no AI in the loop, no per-page costs. The cost is per website, not per page. After that, you own the scraper and run it as many times as you want.
We chose option 3. That's ScrapAI.
Self-hosted, no vendor lock-in. You clone the repo, you own everything. No SaaS, no subscription, no per-page billing. Your scrapers are JSON configs in a database. Export them, share them, move them between projects.
ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
```
You (plain English) → AI Agent → JSON config → Database → Scrapy crawl
                       (once)                              (forever)
```
Why JSON configs instead of AI-generated Python? An agent that writes and executes Python has the same power as an unsupervised developer. If it hallucinates, gets prompt-injected by a malicious page, or loses context, it can do real damage. An agent that writes JSON configs produces data, not code. That data goes through strict validation (Pydantic schemas, SSRF checks, reserved name blocking) before it reaches the database. The worst case is a bad config that extracts wrong fields, caught in the test crawl and trivially fixable. See Security for the full picture.
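To make that validation step concrete, here is a minimal sketch of the kind of schema gate described above, written against Pydantic v2. The field names mirror the example config below, but the reserved-name list and the SSRF check are illustrative assumptions, not ScrapAI's exact rules.

```python
# Minimal sketch of config validation before anything reaches the database.
# Field names follow the example config below; the reserved names and the SSRF
# check are illustrative assumptions, not ScrapAI's exact implementation.
from ipaddress import ip_address
from typing import List, Optional
from urllib.parse import urlparse

from pydantic import BaseModel, field_validator

RESERVED_NAMES = {"default", "all", "settings"}  # hypothetical reserved spider names


class RuleConfig(BaseModel):
    allow: List[str]
    callback: Optional[str] = None
    follow: bool = False


class SpiderConfig(BaseModel):
    name: str
    allowed_domains: List[str]
    start_urls: List[str]
    rules: List[RuleConfig]
    settings: dict = {}

    @field_validator("name")
    @classmethod
    def name_not_reserved(cls, v: str) -> str:
        if v in RESERVED_NAMES:
            raise ValueError(f"{v!r} is a reserved spider name")
        return v

    @field_validator("start_urls")
    @classmethod
    def reject_private_targets(cls, urls: List[str]) -> List[str]:
        # Crude SSRF guard: refuse literal private/loopback IPs. A real check
        # would also resolve hostnames before fetching.
        for url in urls:
            host = urlparse(url).hostname or ""
            try:
                addr = ip_address(host)
            except ValueError:
                continue  # plain hostname, not an IP literal
            if addr.is_private or addr.is_loopback:
                raise ValueError(f"private address not allowed: {url}")
        return urls
```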
Here's what an AI-generated spider config looks like:
```json
{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
```
Adding a new website means adding a new row. See templates/ for complete working examples — news sites, e-commerce, forums, and Cloudflare-protected sites with full analysis and exported data.
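To show how a single generic spider can consume such a row, here is a stripped-down, DatabaseSpider-style sketch. The SQLite table layout, the argument name, and the bare-bones parse_article are assumptions for illustration, not ScrapAI's actual code.

```python
# Illustrative config-driven spider; the table schema and callback body are
# simplified assumptions, not ScrapAI's real DatabaseSpider.
import json
import sqlite3

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def load_config(name, db_path="spiders.db"):
    """Fetch a JSON config from an assumed spiders(name, config) table."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT config FROM spiders WHERE name = ?", (name,)
        ).fetchone()
    if row is None:
        raise KeyError(f"no spider config named {name!r}")
    return json.loads(row[0])


class DatabaseSpider(CrawlSpider):
    name = "database_spider"

    def __init__(self, spider_name, *args, **kwargs):
        cfg = load_config(spider_name)
        self.allowed_domains = cfg["allowed_domains"]
        self.start_urls = cfg["start_urls"]
        # Each JSON rule becomes a standard CrawlSpider Rule; Scrapy resolves the
        # string callback name to a spider method when it compiles the rules.
        self.rules = tuple(
            Rule(
                LinkExtractor(allow=r.get("allow", [])),
                callback=r.get("callback"),
                follow=r.get("follow", False),
            )
            for r in cfg["rules"]
        )
        super().__init__(*args, **kwargs)

    def parse_article(self, response):
        # Real extraction (newspaper/trafilatura, field selectors) is omitted here.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

In a plain Scrapy project this would be invoked as `scrapy crawl database_spider -a spider_name=bbc_co_uk`; ScrapAI wraps the equivalent behind its own CLI.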
ScrapAI is glue. Projects like Scrapy, newspaper, trafilatura, DeltaFetch, and CloakBrowser do the heavy lifting.
Our contribution is the orchestration: the CLI, the database-first spider management, the AI agent workflow, Cloudflare cookie caching, smart proxy escalation, and the glue that holds it together.
Advanced stealth with CloakBrowser. Source-level C++ patches (not JS injection or config flags) achieve 0.9 reCAPTCHA v3 scores and pass 30/30 detection tests including Cloudflare Turnstile (non-interactive auto-pass, managed single-click), FingerprintJS, BrowserScan, DataDome, and ShieldSquare. Fingerprints are compiled into the Chromium binary — detection sites see a real browser because it is a real browser with stealth baked in. Works in headless mode on Linux servers.
Cookie-cached Cloudflare bypass. CloakBrowser solves the challenge once, extracts session cookies, then shuts down. Subsequent requests use Scrapy's fast HTTP engine with cached cookies. Browser reopens every ~10 minutes to refresh. 20-100x faster than tools that keep the browser open for every request (~0.1-0.5s per page vs 5-10s). On a 1,000-page Cloudflare crawl: ~8 minutes vs 2+ hours.
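The caching idea itself fits in a few lines. The sketch below assumes a hypothetical solve_challenge() helper that drives the stealth browser; the TTL and names are illustrative, not ScrapAI's internals.

```python
# Sketch of cookie-cached challenge solving. solve_challenge() is a placeholder
# for driving CloakBrowser; the ~10-minute TTL mirrors the refresh interval above.
import time
from typing import Dict, Tuple

COOKIE_TTL = 600  # seconds
_cookie_cache: Dict[str, Tuple[float, dict]] = {}


def solve_challenge(domain: str) -> dict:
    """Launch the stealth browser, pass the challenge, return clearance cookies."""
    raise NotImplementedError("placeholder for the browser-driven solve step")


def get_cookies(domain: str) -> dict:
    """Fast path: reuse cached cookies; slow path: re-solve when they expire."""
    now = time.time()
    cached = _cookie_cache.get(domain)
    if cached and now - cached[0] < COOKIE_TTL:
        return cached[1]                      # ~0.1-0.5 s requests from here on
    cookies = solve_challenge(domain)         # ~5-10 s with a real browser
    _cookie_cache[domain] = (now, cookies)
    return cookies
```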
Smart proxy escalation. Starts with direct connections. If a site blocks you (403/429), retries through a datacenter proxy and remembers that domain for next time. Residential proxies require explicit opt-in.
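In outline, the escalation decision is a small piece of per-domain state. This is a hedged sketch with a placeholder proxy address, not ScrapAI's actual middleware, which wires the same idea into Scrapy's retry machinery.

```python
# Sketch of per-domain proxy escalation; the proxy URL is a placeholder.
from typing import Optional, Set

BLOCK_CODES = {403, 429}
DATACENTER_PROXY = "http://dc-proxy.example:8080"  # placeholder address
_escalated_domains: Set[str] = set()


def proxy_for(domain: str) -> Optional[str]:
    """Direct connection by default; datacenter proxy once a domain has blocked us."""
    return DATACENTER_PROXY if domain in _escalated_domains else None


def record_response(domain: str, status: int) -> bool:
    """Remember a block and report whether the request should be retried via proxy."""
    if status in BLOCK_CODES and domain not in _escalated_domains:
        _escalated_domains.add(domain)
        return True
    return False
```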
Checkpoint pause/resume. Press Ctrl+C to pause a long crawl, run the same command to resume. Built on Scrapy's native JOBDIR. No progress lost.
Incremental crawling. DeltaFetch skips already-scraped URLs, reducing bandwidth by 80-90% on routine re-crawls.
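Both of the features above lean on standard Scrapy machinery (the JOBDIR setting and the scrapy-deltafetch plugin). A plausible settings snippet looks like the following; the path and middleware priority are illustrative, not ScrapAI's exact configuration.

```python
# Plausible Scrapy settings for checkpointing and incremental crawls.
JOBDIR = "crawls/bbc_co_uk"      # Scrapy persists its request queue here, so a
                                 # Ctrl+C'd crawl resumes when rerun with the same dir

SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,   # from the scrapy-deltafetch plugin
}
DELTAFETCH_ENABLED = True        # skip requests whose items were already scraped
```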
Targeted extraction. Articles get clean structured fields (title, content, author, date) via newspaper and trafilatura. Non-article content (products, jobs, listings) gets custom callbacks with field-level selectors and data processors. The output is structured data, not a page dump.
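As a rough illustration of how an EXTRACTOR_ORDER like ["newspaper", "trafilatura"] can be honoured, the sketch below tries each extractor in turn and keeps the first non-empty result; the field names and fallback rule are simplified assumptions.

```python
# Sketch of an extractor chain honouring EXTRACTOR_ORDER; error handling and the
# full field set are trimmed compared to a real pipeline.
from typing import Optional

import trafilatura
from newspaper import Article


def extract_with_newspaper(url: str, html: str) -> Optional[dict]:
    article = Article(url)
    article.download(input_html=html)   # reuse the HTML Scrapy already fetched
    article.parse()
    if not article.text:
        return None
    return {
        "title": article.title,
        "content": article.text,
        "authors": article.authors,
        "date": article.publish_date,
    }


def extract_with_trafilatura(url: str, html: str) -> Optional[dict]:
    text = trafilatura.extract(html)
    return {"content": text} if text else None


EXTRACTORS = {
    "newspaper": extract_with_newspaper,
    "trafilatura": extract_with_trafilatura,
}


def extract(url: str, html: str, order=("newspaper", "trafilatura")) -> Optional[dict]:
    """Try each configured extractor in order, keeping the first non-empty result."""
    for name in order:
        result = EXTRACTORS[name](url, html)
        if result:
            return result
    return None
```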
Database-first management. Spiders are rows in a database, not Python files on disk. Need to change DOWNLOAD_DELAY across your whole fleet? One SQL query instead of editing 100 files. Export a spider config as JSON, import it into another project. No code drift, no style inconsistencies.
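For instance, with a SQLite backend and a hypothetical spiders(name, config) table holding configs like the one shown earlier, a fleet-wide delay change could be a single JSON1 update; the table and column names here are assumptions.

```python
# Hypothetical one-query fleet update; assumes SQLite with the JSON1 extension
# and a spiders(name, config) table whose config column holds the JSON configs.
import sqlite3

with sqlite3.connect("spiders.db") as conn:
    conn.execute(
        "UPDATE spiders SET config = json_set(config, '$.settings.DOWNLOAD_DELAY', 2)"
    )
```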
Queue and batch processing. Bulk-add hundreds of URLs into a database-backed queue with priorities, status tracking, and retry on failure. The agent processes them in parallel batches of 5, each through the full build-test-deploy workflow.
AI-assisted health checks. ./scrapai health --project news tests all spiders with 5 sample items, detects extraction vs crawling failures, and generates a markdown report for the agent to fix. Run monthly via cron to catch breakage early. When a site redesigns, the agent re-analyzes, updates selectors, and verifies the fix in 5-10 minutes vs 45 minutes manual.
Requirements: Python 3.9+, Git
Supported platforms: Linux, macOS, Windows (WSL or Docker for Cloudflare bypass)
```bash
git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
./scrapai setup
./scrapai verify
```
./scrapai setup creates the virtual environment, installs