by Ar9av
An automated AI research-paper writer based on Google's PaperOrchestra paper, implemented as a skill pack (pipeline skills plus a benchmark and autoraters) that runs through any coding agent (Claude Code, Cursor, Antigravity, Cline, Aider). No API keys, no LLM SDKs.
# Add to your Claude Code skills
git clone https://github.com/Ar9av/PaperOrchestra

A pluggable skill pack for any coding agent (Claude Code, Cursor, Antigravity, Cline, Aider, OpenCode, etc.) that runs the PaperOrchestra multi-agent pipeline for turning unstructured research materials into a submission-ready LaTeX paper.
Song, Y., Song, Y., Pfister, T., Yoon, J. PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. arXiv:2604.05018, 2026. https://arxiv.org/pdf/2604.05018
The paper defines a five-agent pipeline that substantially outperforms single-agent and tree-search baselines on the PaperWritingBench benchmark (a 50–68% absolute win margin on literature-review quality; 14–38% on overall quality). The paper ships the exact prompts for every agent in Appendix F.
This repo turns those prompts, schemas, halt rules, and verification pipelines into a set of host-agent-executable skills. There are no API keys, no SDK dependencies, no embedded LLM calls. The skills are instruction documents plus deterministic helpers; your coding agent does all LLM reasoning and web search using its own tools.
Each skill is:
- SKILL.md: a dense instruction document the host agent reads and follows.
- references/: reference material, including verbatim paper prompts (Appendix F), JSON schemas, rubrics, halt rules, and example outputs.
- scripts/: local helpers for JSON schema validation, Levenshtein fuzzy matching, BibTeX formatting, dedup, LaTeX sanity checks, and coverage gates. No network, no LLM, no API keys.
Everything else (LLM reasoning, web search, Semantic Scholar lookups, LaTeX compilation) is delegated to the host agent by instruction. See skills/paper-orchestra/references/host-integration.md for per-host invocation (Claude Code, Cursor, Antigravity, Cline, Aider).
| Skill | Paper step | # LLM calls | Role |
|---|---|---|---|
| paper-orchestra | orchestrator | — | Top-level driver. Coordinates the other six. |
| outline-agent | Step 1 | 1 | Idea + log + template + guidelines → structured outline JSON (plotting plan, lit review plan, section plan). |
| plotting-agent | Step 2 | ~20–30 | Execute plotting plan; render plots & conceptual diagrams; optional VLM-critique refinement loop; caption everything. |
| literature-review-agent | Step 3 | ~20–30 | Web-search candidates; Semantic Scholar verify (Levenshtein > 70, cutoff, dedup); draft Intro + Related Work with ≥90% citation integration. |
| section-writing-agent | Step 4 | 1 | One single multimodal call: draft remaining sections, build tables from experimental log, splice figures. |
| content-refinement-agent | Step 5 | ~5–7 | Simulated peer review; accept/revert per strict halt rules; safety constraints prevent gaming the evaluator. |
| paper-writing-bench | §3 | — | Reverse-engineer raw materials (Sparse/Dense idea, experimental log) from an existing paper to build benchmark cases. |
| paper-autoraters | App. F.3 | — | Run the paper's own autoraters: Citation F1 (P0/P1), LitReview quality (6-axis), SxS paper quality, SxS litreview quality. |
Steps 2 and 3 run in parallel (see skills/paper-orchestra/references/pipeline.md).
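The literature-review row's ≥90% citation-integration requirement can be approximated with a simple check: how many BibTeX keys actually appear in `\cite` commands in the draft. A hypothetical sketch (the real coverage gate in `scripts/` may parse LaTeX differently):

```python
import re

def citation_coverage(draft_tex: str, bib_keys: list[str]) -> float:
    """Percentage of bibliography keys actually cited in the draft."""
    groups = re.findall(r"\\cite[tp]?\{([^}]+)\}", draft_tex)
    # \cite{a,b} style commands can hold several comma-separated keys
    cited = {k.strip() for grp in groups for k in grp.split(",")}
    used = [k for k in bib_keys if k in cited]
    return 100.0 * len(used) / max(len(bib_keys), 1)

def passes_gate(draft_tex: str, bib_keys: list[str],
                threshold: float = 90.0) -> bool:
    return citation_coverage(draft_tex, bib_keys) >= threshold
```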
A pre-pipeline skill that bridges the gap between scattered AI coding-agent
history and the structured (idea.md, experimental_log.md) inputs that
PaperOrchestra expects. If you have been running experiments through Claude
Code, Cursor, Antigravity, or OpenClaw — but never wrote up a clean experiment
log — this skill does that extraction for you.
It is optional. If workspace/inputs/idea.md and
workspace/inputs/experimental_log.md already exist, the skill skips itself
and the pipeline proceeds directly. It only runs when the inputs are missing or
when you explicitly point an agent at a directory.
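That skip rule is deterministic and easy to express. A minimal sketch, with the input paths taken from this README:

```python
from pathlib import Path

def should_aggregate(workspace: str = "workspace") -> bool:
    """True when the aggregator needs to run, i.e. inputs are missing."""
    inputs = Path(workspace) / "inputs"
    required = [inputs / "idea.md", inputs / "experimental_log.md"]
    # Skip entirely if both PaperOrchestra inputs already exist.
    return not all(p.is_file() for p in required)
```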
The simplest way to use it: just tell your agent the folder. If you have a directory (a project root, an agent cache, any folder with research notes), the aggregator figures out what's inside and structures it for PaperOrchestra. The first thing it does is aggregate — scanning, extracting, and synthesising — so even if the data is scattered across multiple files and formats, it produces clean, reviewable inputs before anything gets written.
Run it before paper-orchestra (or let paper-orchestra call it automatically
when inputs are missing).
[.claude/] [.cursor/] [.antigravity/] [.openclaw/]
│ │ │ │
└────────────┴──────────────┴───────────────┘
│
Phase 1: Discovery (deterministic)
│
Phase 2: Extraction (LLM — per batch)
│
Phase 3: Synthesis (LLM — one call)
│
Phase 4: Formatting (deterministic)
│
┌──────────┴──────────┐
workspace/inputs/ workspace/ara/
idea.md aggregation_report.md
experimental_log.md discovered_logs.json
raw_experiments.json
synthesis.json
The four phases are:
| Phase | Tool | What happens |
|---|---|---|
| 1 Discovery | discover_logs.py | Walks --search-roots to catalog every relevant log file across all agent caches. Prints a summary for user review before anything is read. |
| 2 Extraction | LLM (per ~50 KB batch) | Applies references/extraction-prompt.md to each batch; produces raw_experiments.json. PII is stripped; unverified numbers are flagged [UNVERIFIED]. |
| 3 Synthesis | LLM (one call) | Merges possibly-redundant experiment records into a single research narrative (synthesis.json). Detects multiple disconnected projects and pauses to ask the user. |
| 4 Formatting | format_po_inputs.py | Converts synthesis.json into idea.md (Sparse Idea format, §3.1) and experimental_log.md (App. D.3), ready for paper-orchestra. |
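Phase 1's depth-limited cache scan might look like the following sketch (the cache directory names come from the diagram above; `discover_logs.py` itself may use different patterns and options):

```python
import os

# Assumed cache directory names, per this README's diagram.
AGENT_DIRS = {".claude", ".cursor", ".antigravity", ".openclaw"}

def find_agent_caches(root: str, max_depth: int = 4) -> list[str]:
    """Walk `root`, collecting agent cache dirs, pruning past max_depth."""
    hits = []
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, _ in os.walk(root):
        if dirpath.count(os.sep) - base_depth >= max_depth:
            dirnames[:] = []  # stop descending past --depth
            continue
        for d in dirnames:
            if d in AGENT_DIRS:
                hits.append(os.path.join(dirpath, d))
    return hits
```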
Install — no extra dependencies beyond the base requirements.txt.
Symlink the skill into your host's skill directory alongside the others:
ln -sf ~/paper-orchestra/skills/agent-research-aggregator \
~/.claude/skills/agent-research-aggregator
For Cursor / Antigravity / Cline / Aider, follow the same per-host
instructions in skills/paper-orchestra/references/host-integration.md.
Invoke by telling your coding agent:
"Aggregate my agent logs for paper writing" — or — "Prepare PaperOrchestra inputs from my cache" — or — "Turn my agent logs into a paper"
The trigger phrases are listed in the description field of
skills/agent-research-aggregator/SKILL.md.
| Flag | Default | Description |
|---|---|---|
| --search-roots | cwd, ~ | Directories to scan for agent caches |
| --agents | all | Subset: claude,cursor,antigravity,openclaw |
| --workspace | ./workspace | PaperOrchestra workspace root |
| --depth | 4 | Max scan depth (prevents runaway traversal) |
| --since | — | Only logs modified after this date (ISO 8601) |
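The flag surface above maps naturally onto `argparse`. A sketch reflecting the table (argument types and exact defaults are assumptions, not the script's source):

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="discover_logs.py")
    p.add_argument("--search-roots", nargs="+",
                   default=[".", str(Path.home())],
                   help="Directories to scan for agent caches")
    p.add_argument("--agents", default="all",
                   help="Subset: claude,cursor,antigravity,openclaw")
    p.add_argument("--workspace", default="./workspace",
                   help="PaperOrchestra workspace root")
    p.add_argument("--depth", type=int, default=4,
                   help="Max scan depth (prevents runaway traversal)")
    p.add_argument("--since", default=None,
                   help="Only logs modified after this date (ISO 8601)")
    return p
```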
From Claude Code memory + CLAUDE.md only:
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots . \
--agents claude \
--out workspace/ara/discovered_logs.json
# → finds .claude/projects/<hash>/memory/*.md and CLAUDE.md
From a Cursor project (chat history + rules):
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots ~/my-project \
--agents cursor \
--out workspace/ara/discovered_logs.json
# → finds .cursor/chat/chatHistory.json and .cursorrules
From Antigravity worker logs, restricted to the last 60 days:
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots ~/my-project \
--agents antigravity \
--since 2026-02-09 \
--out workspace/ara/discovered_logs.json
# → finds .antigravity/workers/<id>/log.jsonl and output.md
From OpenClaw sessions + run metrics:
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots ~/my-project \
--agents openclaw \
--out workspace/ara/discovered_logs.json
# → finds .openclaw/sessions/*/conversation.md and runs/*/metrics.json
Full run across all caches:
# Phase 1 — discovery
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots . ~ --out workspace/ara/discovered_logs.json
# Phase 2 — LLM extraction (your agent handles this; validate afterward)
python skills/agent-research-aggregator/scripts/extract_experiments.py \
--discovered workspace/ara/discovered_logs.json \
--out workspace/ara/raw_experiments.json --validate-only
# Phase 3 — LLM synthesis (your agent handles this)
# Phase 4 — format + audit report
python skills/agent-research-aggregator/scripts/format_po_inputs.py \
--synthesis workspace/ara/synthesis.json \
--out workspace/inputs/ \
--report workspace/ara/aggregation_report.md
After Phase 4, the workspace is ready for paper-orchestra. You still need
to supply workspace/inputs/template.tex (your conference LaTeX template) and
workspace/inputs/conference_guidelines.md (page limit, deadline, formatting
rules).
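A quick pre-flight check before invoking paper-orchestra can verify the four input files this README names (a sketch only; paths as described above):

```python
from pathlib import Path

# The two aggregator outputs plus the two user-supplied files.
REQUIRED = ["idea.md", "experimental_log.md",
            "template.tex", "conference_guidelines.md"]

def missing_inputs(workspace: str = "workspace") -> list[str]:
    """Return the names of required input files not yet present."""
    inputs = Path(workspace) / "inputs"
    return [name for name in REQUIRED if not (inputs / name).is_file()]
```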