by Ar9av
An automated AI research-paper writer based on Google's PaperOrchestra paper, implemented as a skill pack (pipeline skills plus a benchmark and autoraters) that runs through any coding agent (Claude Code, Cursor, Antigravity, Cline, Aider). No API keys, no LLM SDKs.
# Add to your Claude Code skills
git clone https://github.com/Ar9av/PaperOrchestra

A pluggable skill pack for any coding agent (Claude Code, Cursor, Antigravity, Cline, Aider, OpenCode, etc.) that runs the PaperOrchestra multi-agent pipeline for turning unstructured research materials into a submission-ready LaTeX paper.
Song, Y., Song, Y., Pfister, T., Yoon, J. PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. arXiv:2604.05018, 2026. https://arxiv.org/pdf/2604.05018
The paper defines a five-agent pipeline that substantially outperforms single-agent and tree-search baselines on the PaperWritingBench benchmark (a 50–68% absolute win margin on literature-review quality; 14–38% on overall quality). The paper ships the exact prompts for every agent in Appendix F.
This repo turns those prompts, schemas, halt rules, and verification pipelines into a set of host-agent-executable skills. There are no API keys, no SDK dependencies, no embedded LLM calls. The skills are instruction documents plus deterministic helpers; your coding agent does all LLM reasoning and web search using its own tools.
Each skill is:
- SKILL.md: a dense instruction document the host agent reads and follows.
- references/: reference material, including verbatim paper prompts (Appendix F), JSON schemas, rubrics, halt rules, and example outputs.
- scripts/: local helpers for JSON schema validation, Levenshtein fuzzy matching, BibTeX formatting, dedup, LaTeX sanity checks, and coverage gates. No network, no LLM, no API keys.
Everything else (LLM reasoning, web search, Semantic Scholar lookups, LaTeX compilation) is delegated to the host agent by instruction. See skills/paper-orchestra/references/host-integration.md for per-host invocation (Claude Code, Cursor, Antigravity, Cline, Aider).
| Skill | Paper step | # LLM calls | Role |
|---|---|---|---|
| paper-orchestra | orchestrator | — | Top-level driver. Coordinates the other six. |
| outline-agent | Step 1 | 1 | Idea + log + template + guidelines → structured outline JSON (plotting plan, lit review plan, section plan). |
| plotting-agent | Step 2 | ~20–30 | Execute plotting plan; render plots & conceptual diagrams; optional VLM-critique refinement loop; caption everything. |
| literature-review-agent | Step 3 | ~20–30 | Web-search candidates; Semantic Scholar verify (Levenshtein > 70, cutoff, dedup); draft Intro + Related Work with ≥90% citation integration. |
| section-writing-agent | Step 4 | 1 | One single multimodal call: draft remaining sections, build tables from experimental log, splice figures. |
| content-refinement-agent | Step 5 | ~5–7 | Simulated peer review; accept/revert per strict halt rules; safety constraints prevent gaming the evaluator. |
| paper-writing-bench | §3 | — | Reverse-engineer raw materials (Sparse/Dense idea, experimental log) from an existing paper to build benchmark cases. |
| paper-autoraters | App. F.3 | — | Run the paper's own autoraters: Citation F1 (P0/P1), LitReview quality (6-axis), SxS paper quality, SxS litreview quality. |
Steps 2 and 3 run in parallel (see skills/paper-orchestra/references/pipeline.md).
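The literature-review row's ≥90% citation-integration requirement can be approximated with a simple check: how many BibTeX keys actually appear in `\cite` commands in the draft. A hypothetical sketch (the real coverage gate in `scripts/` may parse LaTeX differently):

```python
import re

def citation_coverage(draft_tex: str, bib_keys: list[str]) -> float:
    """Percentage of bibliography keys actually cited in the draft."""
    groups = re.findall(r"\\cite[tp]?\{([^}]+)\}", draft_tex)
    # \cite{a,b} style commands can hold several comma-separated keys
    cited = {k.strip() for grp in groups for k in grp.split(",")}
    used = [k for k in bib_keys if k in cited]
    return 100.0 * len(used) / max(len(bib_keys), 1)

def passes_gate(draft_tex: str, bib_keys: list[str],
                threshold: float = 90.0) -> bool:
    return citation_coverage(draft_tex, bib_keys) >= threshold
```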
A pre-pipeline skill that bridges the gap between scattered AI coding-agent
history and the structured (idea.md, experimental_log.md) inputs that
PaperOrchestra expects. If you have been running experiments through Claude
Code, Cursor, Antigravity, or OpenClaw — but never wrote up a clean experiment
log — this skill does that extraction for you.
It is optional. If workspace/inputs/idea.md and
workspace/inputs/experimental_log.md already exist, the skill skips itself
and the pipeline proceeds directly. It only runs when the inputs are missing or
when you explicitly point an agent at a directory.
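That skip rule is deterministic and easy to express. A minimal sketch, with the input paths taken from this README:

```python
from pathlib import Path

def should_aggregate(workspace: str = "workspace") -> bool:
    """True when the aggregator needs to run, i.e. inputs are missing."""
    inputs = Path(workspace) / "inputs"
    required = [inputs / "idea.md", inputs / "experimental_log.md"]
    # Skip entirely if both PaperOrchestra inputs already exist.
    return not all(p.is_file() for p in required)
```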
The simplest way to use it: just tell your agent the folder. If you have a directory (a project root, an agent cache, any folder with research notes), the aggregator figures out what's inside and structures it for PaperOrchestra. The first thing it does is aggregate — scanning, extracting, and synthesising — so even if the data is scattered across multiple files and formats, it produces clean, reviewable inputs before anything gets written.
Run it before paper-orchestra (or let paper-orchestra call it automatically
when inputs are missing).
[.claude/] [.cursor/] [.antigravity/] [.openclaw/]
│ │ │ │
└────────────┴──────────────┴───────────────┘
│
Phase 1: Discovery (deterministic)
│
Phase 2: Extraction (LLM — per batch)
│
Phase 3: Synthesis (LLM — one call)
│
Phase 4: Formatting (deterministic)
│
┌──────────┴──────────┐
workspace/inputs/ workspace/ara/
idea.md aggregation_report.md
experimental_log.md discovered_logs.json
raw_experiments.json
synthesis.json
The four phases are:
| Phase | Tool | What happens |
|---|---|---|
| 1 Discovery | discover_logs.py | Walks --search-roots to catalog every relevant log file across all agent caches. Prints a summary for user review before anything is read. |
| 2 Extraction | LLM (per ~50 KB batch) | Applies references/extraction-prompt.md to each batch; produces raw_experiments.json. PII is stripped; unverified numbers are flagged [UNVERIFIED]. |
| 3 Synthesis | LLM (one call) | Merges possibly-redundant experiment records into a single research narrative (synthesis.json). Detects multiple disconnected projects and pauses to ask the user. |
| 4 Formatting | format_po_inputs.py | Converts synthesis.json into idea.md (Sparse Idea format, §3.1) and experimental_log.md (App. D.3), ready for paper-orchestra. |
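Phase 1's depth-limited cache scan might look like the following sketch (the cache directory names come from the diagram above; `discover_logs.py` itself may use different patterns and options):

```python
import os

# Assumed cache directory names, per this README's diagram.
AGENT_DIRS = {".claude", ".cursor", ".antigravity", ".openclaw"}

def find_agent_caches(root: str, max_depth: int = 4) -> list[str]:
    """Walk `root`, collecting agent cache dirs, pruning past max_depth."""
    hits = []
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, _ in os.walk(root):
        if dirpath.count(os.sep) - base_depth >= max_depth:
            dirnames[:] = []  # stop descending past --depth
            continue
        for d in dirnames:
            if d in AGENT_DIRS:
                hits.append(os.path.join(dirpath, d))
    return hits
```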
Install — no extra dependencies beyond the base requirements.txt.
Symlink the skill into your host's skill directory alongside the others:
ln -sf ~/paper-orchestra/skills/agent-research-aggregator \
~/.claude/skills/agent-research-aggregator
For Cursor / Antigravity / Cline / Aider, follow the same per-host
instructions in skills/paper-orchestra/references/host-integration.md.
Invoke by telling your coding agent:
"Aggregate my agent logs for paper writing" — or — "Prepare PaperOrchestra inputs from my cache" — or — "Turn my agent logs into a paper"
The trigger phrases are listed in the description field of
skills/agent-research-aggregator/SKILL.md.
| Flag | Default | Description |
|---|---|---|
| --search-roots | cwd, ~ | Directories to scan for agent caches |
| --agents | all | Subset: claude,cursor,antigravity,openclaw |
| --workspace | ./workspace | PaperOrchestra workspace root |
| --depth | 4 | Max scan depth (prevents runaway traversal) |
| --since | — | Only logs modified after this date (ISO 8601) |
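The flag surface above maps naturally onto `argparse`. A sketch reflecting the table (argument types and exact defaults are assumptions, not the script's source):

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="discover_logs.py")
    p.add_argument("--search-roots", nargs="+",
                   default=[".", str(Path.home())],
                   help="Directories to scan for agent caches")
    p.add_argument("--agents", default="all",
                   help="Subset: claude,cursor,antigravity,openclaw")
    p.add_argument("--workspace", default="./workspace",
                   help="PaperOrchestra workspace root")
    p.add_argument("--depth", type=int, default=4,
                   help="Max scan depth (prevents runaway traversal)")
    p.add_argument("--since", default=None,
                   help="Only logs modified after this date (ISO 8601)")
    return p
```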
From Claude Code memory + CLAUDE.md only:
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots . \
--agents claude \
--out workspace/ara/discovered_logs.json
# → finds .claude/projects/<hash>/memory/*.md and CLAUDE.md
From a Cursor project (chat history + rules):
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots ~/my-project \
--agents cursor \
--out workspace/ara/discovered_logs.json
# → finds .cursor/chat/chatHistory.json and .cursorrules
From Antigravity worker logs, restricted to the last 60 days:
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots ~/my-project \
--agents antigravity \
--since 2026-02-09 \
--out workspace/ara/discovered_logs.json
# → finds .antigravity/workers/<id>/log.jsonl and output.md
From OpenClaw sessions + run metrics:
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots ~/my-project \
--agents openclaw \
--out workspace/ara/discovered_logs.json
# → finds .openclaw/sessions/*/conversation.md and runs/*/metrics.json
Full run across all caches:
# Phase 1 — discovery
python skills/agent-research-aggregator/scripts/discover_logs.py \
--search-roots . ~ --out workspace/ara/discovered_logs.json
# Phase 2 — LLM extraction (your agent handles this; validate afterward)
python skills/agent-research-aggregator/scripts/extract_experiments.py \
--discovered workspace/ara/discovered_logs.json \
--out workspace/ara/raw_experiments.json --validate-only
# Phase 3 — LLM synthesis (your agent handles this)
# Phase 4 — format + audit report
python skills/agent-research-aggregator/scripts/format_po_inputs.py \
--synthesis workspace/ara/synthesis.json \
--out workspace/inputs/ \
--report workspace/ara/aggregation_report.md
After Phase 4, the workspace is ready for paper-orchestra. You still need
to supply workspace/inputs/template.tex (your conference LaTeX template) and
workspace/inputs/conference_guidelines.md (page limit, deadline, formatting
rules).
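A quick pre-flight check before invoking paper-orchestra can verify the four input files this README names (a sketch only; paths as described above):

```python
from pathlib import Path

# The two aggregator outputs plus the two user-supplied files.
REQUIRED = ["idea.md", "experimental_log.md",
            "template.tex", "conference_guidelines.md"]

def missing_inputs(workspace: str = "workspace") -> list[str]:
    """Return the names of required input files not yet present."""
    inputs = Path(workspace) / "inputs"
    return [name for name in REQUIRED if not (inputs / name).is_file()]
```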