Agent skill to turn any arXiv paper into a working implementation. arXiv URL in → citation-anchored implementation out.

# Add to your Claude Code skills

```shell
git clone https://github.com/PrathamLearnsToCode/paper2code
```
```
┌─────────────────────────────┐        ┌──────────────────────────────────────┐
│                             │        │ {paper_slug}/                        │
│  /paper2code                │        │ ├── README.md                        │
│  https://arxiv.org/abs/     │  ───▶  │ ├── REPRODUCTION_NOTES.md            │
│  1706.03762                 │        │ ├── requirements.txt                 │
│                             │        │ ├── src/                             │
│                             │        │ │   ├── model.py      # §3.2 cited   │
│                             │        │ │   ├── loss.py       # §3.4 cited   │
│                             │        │ │   ├── train.py      # §4.1 cited   │
│                             │        │ │   ├── data.py                      │
│                             │        │ │   ├── evaluate.py                  │
│                             │        │ │   └── utils.py                     │
│                             │        │ ├── configs/                         │
│                             │        │ │   └── base.yaml     # all params   │
│                             │        │ └── notebooks/                       │
│                             │        │     └── walkthrough.ipynb            │
└─────────────────────────────┘        └──────────────────────────────────────┘
```
[placeholder: animated GIF showing the full pipeline — paper fetch → parsing → ambiguity audit → code generation → walkthrough notebook]
The problem: ML papers are vague. Critical hyperparameters are buried in appendices or omitted entirely. Prose contradicts equations. "Standard settings" refers to nothing specific. When you implement a paper, you spend more time playing detective than writing code.
What LLMs get wrong: Naive code generation fills in every gap silently and confidently. You get something that runs but doesn't match the paper. Worse, you can't tell which parts are from the paper and which were invented by the model.
What paper2code does differently:

- Every non-trivial decision is cited to a paper section (e.g., §3.2, Eq. 4)
- Every hyperparameter is classified as SPECIFIED, PARTIALLY_SPECIFIED, or UNSPECIFIED
- `[UNSPECIFIED]` comments appear at the exact line where the choice is made, with common alternatives listed

The result: code you can trust, because you can verify every decision against the paper.
```shell
npx skills add PrathamLearnsToCode/paper2code/skills/paper2code
```
Follow the prompts during installation. Once installed, open your agent and run the skill:
```shell
claude   # or your preferred agent

/paper2code https://arxiv.org/abs/1706.03762
/paper2code https://arxiv.org/abs/2006.11239 --framework jax
/paper2code 2106.09685 --mode full
/paper2code https://arxiv.org/abs/2010.11929 --mode educational
```
For example, `/paper2code 1706.03762` produces:

```
attention_is_all_you_need/
├── README.md                 # Paper summary, contribution statement, quick-start
├── REPRODUCTION_NOTES.md     # Ambiguity audit, unspecified choices, known deviations
├── requirements.txt          # Pinned dependencies
├── src/
│   ├── model.py              # Architecture — every layer cited to paper section
│   ├── loss.py               # Loss functions with equation references
│   ├── data.py               # Dataset class skeleton with preprocessing TODOs
│   ├── train.py              # Training loop (if in scope)
│   ├── evaluate.py           # Metric computation code
│   └── utils.py              # Shared utilities (masking, positional encoding, etc.)
├── configs/
│   └── base.yaml             # All hyperparams — each one cited or flagged [UNSPECIFIED]
└── notebooks/
    └── walkthrough.ipynb     # Pedagogical notebook linking paper sections → code → sanity checks
```
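To illustrate the citation-and-flag convention, here is a sketch of what an entry in `configs/base.yaml` might look like for this paper. The values and section references are illustrative, written from memory of the Transformer base model, not actual skill output:

```yaml
# Illustrative excerpt of configs/base.yaml for 1706.03762 (not generated output)
model:
  d_model: 512          # §3.1, Table 3 — base model
  n_heads: 8            # §3.2.2, Table 3
  n_layers: 6           # §3.1 — N = 6 identical layers
  dropout: 0.1          # §5.4, Table 3 — residual dropout P_drop
training:
  warmup_steps: 4000    # §5.3 — learning-rate schedule
  layernorm_eps: 1.0e-6 # [UNSPECIFIED] — not stated in paper; common default
```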
| File | Purpose |
|------|---------|
| `model.py` | Architecture only. Each class maps to a paper section; variable names match the paper's notation. |
| `REPRODUCTION_NOTES.md` | The ambiguity audit: every choice, whether the paper specified it, and what alternatives exist. |
| `base.yaml` | Single source of truth for all hyperparameters. |
| `walkthrough.ipynb` | Runnable on CPU with toy dimensions. Quotes paper passages, shows the corresponding code, and runs shape checks. |
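The kind of sanity check the walkthrough notebook runs can be sketched as follows: a minimal NumPy version of Eq. 2's scaled dot-product attention at toy dimensions, with shape assertions. Function and variable names here are illustrative, not the skill's actual output:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """§3.2, Eq. 2 — softmax(QK^T / sqrt(d_k)) V, NumPy sketch for shape checks."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (batch, seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V                                  # (batch, seq_len, d_k)

# Toy dimensions — runs on CPU in milliseconds
batch, seq_len, d_k = 2, 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((batch, seq_len, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (batch, seq_len, d_k)
```

Checks like this catch broadcasting and transposition bugs before any real training run.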
Every gap is flagged `[UNSPECIFIED]`; the skill never silently fills one in. `data.py` provides a Dataset class skeleton with clear instructions on where to get the data and how to preprocess it. Every non-trivial code decision is anchored to the paper:
```python
import torch.nn as nn

# §3.2 — "We apply layer normalization before each sub-layer" (Pre-LN variant)
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # [UNSPECIFIED] Paper does not state epsilon for LayerNorm — using 1e-6 (common default)
        # Alternatives: 1e-5 (PyTorch default), 1e-8 (some implementations)
        self.norm1 = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, x):
        # [ASSUMPTION] Using pre-norm based on "we found pre-norm more stable" in §4.1
        # The paper uses post-norm in Figure 1 but pre-norm in experiments — ambiguous
        normed = self.norm1(x)
        # §3.2, Eq. 2 — attention_weights = softmax(QK^T / sqrt(d_k))
        attn_out, _ = self.attention(normed, normed, normed)  # (batch, seq_len, d_model)
        return x + attn_out  # §3.2 — residual connection
```
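To see why the `[UNSPECIFIED]` epsilon is flagged yet low-risk, one can compare LayerNorm outputs under the candidate values. This is a NumPy sketch for illustration, not part of the generated code:

```python
import numpy as np

def layer_norm(x, eps):
    # Normalize over the last dimension, as nn.LayerNorm(d_model) does
    # (affine scale/shift omitted for this comparison)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((4, 512))
a = layer_norm(x, eps=1e-6)   # our choice
b = layer_norm(x, eps=1e-5)   # PyTorch default
max_diff = np.abs(a - b).max()
assert max_diff < 1e-4  # on unit-variance activations the alternatives nearly agree
```

The point of the flag is not that the choice matters much here, but that the reader can verify it does not.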
| Tag | Meaning |
|-----|---------|
| `§X.Y` | Directly specified in paper section X.Y |
| `§X.Y, Eq. N` | Implements equation N from section X.Y |
| `[UNSPECIFIED]` | Paper does not state this — our choice, with alternatives listed |
| `[PARTIALLY_SPECIFIED]` | Paper mentions this but is ambiguous — quote included |
| `[ASSUMPTION]` | Reasonable inference from paper context — reasoning explained |
| `[FROM_OFFICIAL_CODE]` | Taken from the authors' official implementation |
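Because every tag has a fixed textual form, a generated tree can be audited mechanically. A hypothetical helper (not shipped with the skill) that counts provenance tags in source text:

```python
import re
from collections import Counter

TAG_RE = re.compile(
    r"\[(UNSPECIFIED|PARTIALLY_SPECIFIED|ASSUMPTION|FROM_OFFICIAL_CODE)\]"
)

def audit_tags(source: str) -> Counter:
    """Count provenance tags so unverified choices are easy to review."""
    return Counter(TAG_RE.findall(source))

sample = """
self.norm = nn.LayerNorm(d_model, eps=1e-6)  # [UNSPECIFIED] eps not stated
x = x + attn_out                             # [ASSUMPTION] pre-norm per paper
w_std = 0.02                                 # [UNSPECIFIED] init scale
"""
counts = audit_tags(sample)
assert counts["UNSPECIFIED"] == 2 and counts["ASSUMPTION"] == 1
```

Running such a scan over `src/` gives a quick summary of how much of the implementation rests on choices the paper did not pin down.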
Worked examples are the most trust-building part of this project. To add one:

1. Run `/paper2code https://arxiv.org/abs/XXXX.XXXXX` on your chosen paper.
2. Commit the output under `skills/paper2code/worked/{paper_slug}/`, including a `review.md` that honestly evaluates what the skill got right and wrong.
If you find a pattern where the skill hallucinates or makes a silent assumption, add it to the appropriate file in `guardrails/`.

If papers in your subfield consistently reference components that the skill doesn't know about (e.g., graph neural network primitives, RL components), add a knowledge file in `knowledge/`.
This repo includes fully worked examples to demonstrate output quality:
| Paper | Type | Command |
|-------|------|---------|
| Attention Is All You Need (1706.03762) | Architecture | `/paper2code https://arxiv.org/abs/1706.03762` |
| DDPM (2006.11239) | Training method | `/paper2code https://arxiv.org/abs/2006.11239` |
Each includes the complete generated output plus an honest `review.md` evaluating what the skill got right and wrong.