Self-hosted URL- and file-to-Markdown service for humans and AI agents - web pages, documents, images, audio, YouTube. PWA + REST + MCP + Claude Code skill, Reddit-aware, refreshable share links.
```bash
# Add to your Claude Code skills
git clone https://github.com/AeternaLabsHQ/pullmd
```

Last scanned: 6/11/2026
{
"issues": [
{
"type": "npm-audit",
"message": "express-rate-limit: Vulnerability found",
"severity": "medium"
},
{
"type": "npm-audit",
"message": "fast-uri: fast-uri vulnerable to path traversal via percent-encoded dot segments",
"severity": "high"
},
{
"type": "npm-audit",
"message": "hono: Hono has CSS Declaration Injection via Style Object Values in JSX SSR",
"severity": "medium"
},
{
"type": "npm-audit",
"message": "ip-address: ip-address has XSS in Address6 HTML-emitting methods",
"severity": "medium"
},
{
"type": "npm-audit",
"message": "qs: qs has a remotely triggerable DoS: qs.stringify crashes with TypeError on null/undefined entries in comma-format arrays when encodeValuesOnly is set",
"severity": "medium"
},
{
"file": "README.md",
"line": 181,
"type": "secret-exfiltration",
"message": "Instruction appears to send credentials/secrets to an external endpoint",
"severity": "medium"
},
{
"file": "README.md",
"line": 200,
"type": "secret-exfiltration",
"message": "Instruction appears to send credentials/secrets to an external endpoint",
"severity": "medium"
},
{
"file": "README.md",
"line": 269,
"type": "secret-exfiltration",
"message": "Instruction appears to send credentials/secrets to an external endpoint",
"severity": "medium"
},
{
"file": "README.md",
"type": "secret-exfiltration",
"message": "…and 1 more similar match in this file",
"severity": "low"
},
{
"file": "README.md",
"line": 360,
"type": "dangerous-command",
"message": "Dangerous command (writes to Claude config): \"> rm -rf ~/.claude/\"",
"severity": "medium"
}
],
"status": "WARNING",
"scannedAt": "2026-06-11T08:49:39.927Z",
"npmAuditRan": true,
"pipAuditRan": true,
"promptInjectionRan": true
}
Requires a passing catalog security scan. Resolve the flagged issues and resubmit to enable featuring.
Self-hosted URL-to-Markdown service for humans and AI agents.
PullMD takes any web URL and returns clean, readable Markdown — no navigation, no ads, no boilerplate. It auto-detects Reddit threads (with full comment trees), uses Cloudflare's native Markdown when available, runs Mozilla Readability + Trafilatura on static HTML, and as a last resort renders JavaScript-heavy pages via headless Chromium (Playwright sidecar) before extracting.
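The fallback order described above can be pictured as an ordered chain that stops at the first extractor returning content. The following is a minimal Python sketch of that idea, not PullMD's actual implementation: the function names and the stub extractors are hypothetical stand-ins for the Reddit, Cloudflare, Readability/Trafilatura, and Playwright paths.

```python
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes a URL and returns Markdown, or None if it cannot handle it.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    url: str, chain: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each extractor in order; return (name, markdown) from the first hit."""
    for name, extractor in chain:
        markdown = extractor(url)
        if markdown:  # None or an empty string means "fall through"
            return name, markdown
    raise LookupError(f"no extractor produced content for {url}")


# Hypothetical stubs, ordered as in the description above.
def reddit(url: str) -> Optional[str]:
    return "# Thread\n\n(comments)" if "reddit.com" in url else None


def cloudflare_markdown(url: str) -> Optional[str]:
    return None  # pretend the site does not offer native Markdown


def static_html(url: str) -> Optional[str]:
    return None if "spa" in url else "# Article\n\nBody text."


def headless_render(url: str) -> Optional[str]:
    return "# App\n\nRendered content."


CHAIN = [
    ("reddit", reddit),
    ("cloudflare", cloudflare_markdown),
    ("static", static_html),
    ("playwright", headless_render),
]

if __name__ == "__main__":
    for target in ("https://example.com/post", "https://example.com/spa"):
        print(target, "->", extract_with_fallback(target, CHAIN)[0])
```

Keeping the chain as plain data makes the order easy to read and lets the expensive headless render stay last.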
As of v3, PullMD goes beyond web pages: it also converts documents (PDF, Office, EPUB), images, audio, and YouTube videos to Markdown, and emits a leaner, token-efficient body by default. See What's new in v3 below.
It ships as:

- a PWA
- a REST API: `GET /api?url=…`
- an MCP server: `POST /mcp` (Streamable-HTTP transport, stateless)
- a Claude Code skill

Every conversion gets an 8-hex share id that works as a stable live endpoint: `GET /s/:id` returns the cached Markdown and re-fetches from the source if the cached copy is older than one hour. Use the share URL as a fixed address that always returns fresh content, which is useful for subreddit feeds and similar.
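As a rough illustration of the two endpoints and the one-hour refresh rule, here is a small Python sketch using only the standard library. The base URL assumes the quick-start default of `http://localhost:3000`, the helper names are hypothetical, and `is_stale` only mirrors the documented rule rather than PullMD's internal cache logic. Accepting upper-case hex digits in the share id is also an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "http://localhost:3000"  # assumption: the quick-start default
MAX_AGE = timedelta(hours=1)        # refresh threshold stated above


def api_url(target: str, base: str = BASE_URL) -> str:
    """Build a GET /api?url=... request URL with the target percent-encoded."""
    return f"{base}/api?{urlencode({'url': target})}"


def share_url(share_id: str, base: str = BASE_URL) -> str:
    """Build a GET /s/:id URL, checking the id is eight hex characters."""
    if not re.fullmatch(r"[0-9a-fA-F]{8}", share_id):
        raise ValueError(f"not an 8-hex share id: {share_id!r}")
    return f"{base}/s/{share_id}"


def is_stale(fetched_at: datetime, now: datetime) -> bool:
    """True when a cached copy is older than one hour and should be re-fetched."""
    return now - fetched_at > MAX_AGE


if __name__ == "__main__":
    print(api_url("https://example.com/post?id=1"))
    print(share_url("a1b2c3d4"))  # hypothetical share id
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(minutes=90), now))
```

Encoding the target with `urlencode` matters because the source URL usually carries its own `?` and `&` characters, which would otherwise be read as parameters of the PullMD request.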
PullMD v3 grows from a web-page reader into a general anything-to-Markdown service for agents, with a leaner default output. Everything beyond plain web extraction is opt-in and degrades gracefully - left unconfigured, v3 handles web pages exactly like v2, just with a cleaner body by default.

- Clean body by default: the Markdown body is just `# Title` + content. The source URL, fetch date, and all metadata moved into the YAML frontmatter, so nothing is duplicated and you spend fewer tokens. Reddit posts follow the same rule: subreddit, author, upvotes, and publish date live in the frontmatter (`subreddit`, `author`, `upvotes`, `published`), not the body. This is the one breaking change: set `PULLMD_SOURCE_HEADER=true` to restore the old inline header, and use `PULLMD_FRONTMATTER_FIELDS` to trim which fields are emitted. See MIGRATION.md.
- Document conversion (`POST /api/file`, drag-and-drop in the PWA).
- Opt-in OCR (`?pdf=ocr`) for table-grade PDF conversion, with automatic fallback to the free path.

Self-hosters upgrading from v2.x: the clean-body change is the only breaking one; `MIGRATION.md` has the one-line opt-out. Everything else is additive.
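Because metadata now lives only in the frontmatter, a consumer that needs it has to split the frontmatter from the body. The sketch below shows one way to do that in Python. It assumes the conventional `---` delimiters and flat `key: value` lines, which this README does not spell out. The field names come from the Reddit example above, while the sample values and the function name are invented for illustration.

```python
from typing import Dict, Tuple


def split_frontmatter(document: str) -> Tuple[Dict[str, str], str]:
    """Split '---'-delimited frontmatter from the Markdown body.

    Handles flat `key: value` lines only; use a real YAML parser for
    nested or quoted values.
    """
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no frontmatter present
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, document  # unterminated frontmatter: treat it all as body
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


# Hypothetical sample; the values are invented for illustration.
SAMPLE = """---
subreddit: selfhosted
author: example_user
upvotes: 42
published: 2026-01-01
---

# Example thread title

Post content goes here.
"""

if __name__ == "__main__":
    meta, body = split_frontmatter(SAMPLE)
    print(meta["subreddit"], meta["upvotes"])
    print(body.splitlines()[0])
```

An agent pipeline that only needs the text can pass `body` on and drop `meta` entirely, which is the token saving the clean-body default is aiming for.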
Pre-built multi-arch images (linux/amd64, linux/arm64) live on Docker
Hub. Drop the compose file somewhere and run:
```bash
mkdir pullmd && cd pullmd
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.yml
docker compose up -d
# → http://localhost:3000
```
That's it. No .env needed: every variable has a sensible default
and PullMD listens on port 3000. Add a .env next to the compose
file to override anything (see Configuration).
docker-compose.yml (zero-config, abridged)

```yaml
services:
  pullmd:
    image: aeternalabshq/pullmd:latest
    container_name: pullmd
    restart: unless-stopped
    ports:
      - "${PORT:-3000}:3000"
    environment:
      - PUBLIC_URL=${PUBLIC_URL:-http://localhost:${PORT:-3000}}
      - TRAFILATURA_URL=http://trafilatura:8001/extract
      - PLAYWRIGHT_URL=http://playwright:8002/render
      - MARKITDOWN_URL=http://markitdown:8003/convert
      - CACHE_DB=/data/cache.db
    volumes:
      - ./data:/data
    networks:
      - pullmd-internal
    depends_on:
      - trafilatura
      - playwright
      - markitdown

  trafilatura:
    image: aeternalabshq/pullmd-trafilatura:latest
    container_name: pullmd-trafilatura
    restart: unless-stopped
    networks:
      - pullmd-internal

  playwright:
    image: aeternalabshq/pullmd-playwright:latest
    container_name: pullmd-playwright
    restart: unless-stopped
    networks:
      - pullmd-internal

  markitdown:
    image: aeternalabshq/pullmd-markitdown:latest
    container_name: pullmd-markitdown
    restart: unless-stopped
    mem_limit: ${MARKITDOWN_MEM_LIMIT:-1g}
    networks:
      - pullmd-internal

networks:
  pullmd-internal:
    driver: bridge
```
Abridged for readability: the `docker-compose.yml` in the repo additionally passes every optional `.env` variable through to the containers (Reddit credentials, auth, media/OCR keys, YouTube options, output shaping). Use the `curl -O` command above rather than copying this block, or `.env` overrides beyond the basics won't reach the containers.
Note: the Playwright sidecar adds ~3.7 GB to your image cache (Chromium + Firefox + WebKit binaries from the official Playwright base image). It's optional: leave `PLAYWRIGHT_URL` unset and the `playwright` service block off, and PullMD silently degrades to static extraction with a fallback note in the metadata.
Note: the MarkItDown sidecar is optional. Leave `MARKITDOWN_URL` unset and remove the `markitdown` service block to disable document conversion. Web-page URLs always work without it.
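Since both sidecars are switched on or off purely by whether their URL variable is set, a quick pre-flight check can report which optional paths a given environment would enable. This is a hedged Python sketch, not part of PullMD: the function name and feature labels are hypothetical, and treating an empty string the same as an unset variable is an assumption.

```python
import os
from typing import Dict, Mapping

# Sidecar variables named in the compose file above, with illustrative labels.
SIDECARS = {
    "PLAYWRIGHT_URL": "JavaScript rendering fallback",
    "MARKITDOWN_URL": "document conversion",
}


def sidecar_status(env: Mapping[str, str]) -> Dict[str, bool]:
    """Map each optional feature to True when its sidecar URL is set and non-empty."""
    return {
        feature: bool(env.get(var, "").strip())  # assumption: empty means unset
        for var, feature in SIDECARS.items()
    }


if __name__ == "__main__":
    for feature, enabled in sidecar_status(os.environ).items():
        print(f"{feature}: {'enabled' if enabled else 'disabled'}")
```

Running it with the same `.env` values you give the container shows at a glance which conversions will fall back to the static path.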
Mirror on GHCR: `ghcr.io/aeternalabshq/{pullmd,pullmd-trafilatura,pullmd-playwright,pullmd-markitdown}`. Replace the `image:` lines if you prefer GitHub's registry.
For deployments behind Traefik with TLS, use docker-compose.traefik.yml
instead. Same images, but with Traefik labels and the proxy external
network. Set HOST_DOMAIN in .env:
```bash
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.traefik.yml
echo "HOST_DOMAIN=pullmd.example.com" > .env
docker compose -f docker-compose.traefik.yml up -d
```
To run from source instead:

```bash
git clone https://github.com/AeternaLabsHQ/pullmd.git
cd pullmd
npm install
npm start   # http://localhost:3000
npm test    # node --test
```
All variables go in .env (copy from .env.example):
v3.0.0 output format change: the Markdown body is clean by default, just `# Title` followed by content. The source URL, fetch date, and all extraction metadata remain in the YAML frontmatter unchanged; the body no longer duplicates them. Set `PULLMD_SOURCE_HEADER=true` to restore the old inline header. Use `PULLMD_FRONTMATTER_FIELDS` to pick which frontmatter fields are emitted (handy for trimming tokens in agent pipelines).
| Variable | Required | Purpose |
|---|---|---|
| `HOST_DOMAIN` | Traefik variant only | Public hostname without scheme. Used by Traefik routing and as fallback for `PUBLIC_URL`. Unused by the default compose. |
| `PUBLIC_URL` | no | Full public origin embedded in `/help` and the skill zip. Defaults to `https://${HOST_DOMAIN}`. |
| `TRAFILATURA_URL` | no | URL of the Trafilatura sidecar's `/extract` endpoint. Unset → skip Trafilatura, Readability only. |
| `PLAYWRIGHT_URL` | no | URL of the Playwright sidecar's `/render` endpoint. Unset → skip Playwright fallback for JS pages. |
| `MARKITDOWN_URL` | no | URL of the MarkItDown sidecar's `/convert` endpoint. Unset → document-conversion path disabled; `POST /api/file` returns 502. |
| `PULLMD_VISION_API_KEY` / `…_BASE_URL` / `…_MODEL` | no | Image captioning via an OpenAI-compatible vision endpoint. Enabled when the key is set. `_MODEL` defaults to `gpt-4o-mini`. |
| `PULLMD_STT_API_KEY` / `…_BASE_URL` / `…_MODEL` | no | Audio transcription via an OpenAI-compatible `/audio/transcriptions` endpoint. Enabled when the key is set. `_MODEL` defaults to `whisper-1`. |
| `PULLMD_LLM_API_KEY` / `…_BASE_URL` | no | Shared fallback credentials for vision + STT when the per-modality vars are unset. |
| `PULLMD_PDF_OCR_API_KEY` / `…_BASE_URL` / `…_MODEL` | no | Opt-in high-quality PDF→Markdown via an OCR provider that preserves tables (reference: Mistral OCR `mistral-ocr-latest`). Triggered per request wit