by datachain-ai
The Context Layer for unstructured data: typed, versioned datasets over S3, GCS, Azure
# Add to your Claude Code skills
git clone https://github.com/datachain-ai/datachain
A Python library that turns files in S3, GCS, and Azure into versioned, typed datasets, queryable at warehouse speed.
Optional, for agent workflows:
Bytes never leave your storage. Every run deposits a typed dataset the next pipeline (or agent) reads instead of recomputing.
pip install datachain
To add the agent skill (Knowledge Base + code generation):
datachain skill install --target claude # also: cursor, codex, copilot, pi
Works with S3, GCS, Azure, and local filesystems.
Task: find dogs in S3 similar to a reference image, filtered by breed, mask availability, and image dimensions.
Grab a reference image and run Claude Code (or another agent):
datachain cp --anon s3://dc-readme/fiona.jpg .
claude
Prompt:
Find dogs in s3://dc-readme/oxford-pets-micro/ similar to ./fiona.jpg:
- Pull breed metadata and mask files from annotations/
- Exclude images without mask
- Exclude Cocker Spaniels
- Only include images wider than 400px
Result:
┌──────┬───────────────────────────────────┬────────────────────────────┬──────────┐
│ Rank │ Image                             │ Breed                      │ Distance │
├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤
│ 1    │ shiba_inu_52.jpg                  │ shiba_inu                  │ 0.244    │
├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤
│ 2    │ shiba_inu_53.jpg                  │ shiba_inu                  │ 0.323    │
├──────┼───────────────────────────────────┼────────────────────────────┼──────────┤
│ 3    │ great_pyrenees_17.jpg             │ great_pyrenees             │ 0.325    │
└──────┴───────────────────────────────────┴────────────────────────────┴──────────┘
Fiona's closest matches are shiba inus (both top spots), which makes sense given her
tan coloring and pointed ears.
The agent decomposed the task into steps - embeddings, breed metadata, mask join, quality filter - and saved each as a named, versioned dataset. Next time you ask a related question, it starts from what's already built.
The datasets are registered in a Knowledge Base optimized for both agents and humans:
dc-knowledge
├── buckets
│ └── s3
│ └── dc_readme.md
├── datasets
│ ├── oxford_micro_dog_breeds.md
│ ├── oxford_micro_dog_embeddings.md
│ └── similar_to_fiona.md
└── index.md
Browse it as markdown files, navigate with wikilinks, or open it in Obsidian.

Code harnesses (Claude Code, Cursor, Codex, GitHub Copilot, Pi) give agents repo context, dedicated tools, and memory across sessions. DataChain adds the same for data: typed datasets the agent reads, chain operations the agent calls (read_storage, map, save), and a Dataset DB where its results persist.
A dataset is the unit of work - a named, versioned result of a pipeline step like pets_embeddings@1.0.0. Every .save() registers one.
For the data-flow architecture (Compute Engine, Dataset DB, Knowledge Base) and how the components connect, see Architecture.
A dataset is a versioned data reasoning step - what was computed, from what input, producing what schema. DataChain indexes your storage into one: no data copied, just typed metadata and file pointers. Re-runs only process new or changed files.
Create a dataset manually in create_dataset.py:
import io

from PIL import Image
from pydantic import BaseModel

import datachain as dc


class ImageInfo(BaseModel):
    width: int
    height: int


def get_info(file: dc.File) -> ImageInfo:
    # Read the file bytes from storage and extract the image dimensions
    img = Image.open(io.BytesIO(file.read()))
    return ImageInfo(width=img.width, height=img.height)


ds = (
    dc.read_storage(
        "s3://dc-readme/oxford-pets-micro/images/**/*.jpg",
        anon=True,
        update=True,
        delta=True,  # re-runs skip unchanged files
    )
    .settings(prefetch=64)
    .map(info=get_info)
    .save("pets_images")
)
ds.show(5)
pets_images@1.0.0 is now the shared reference to this data - schema, version, lineage, and metadata.
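To make the "re-runs skip unchanged files" idea concrete, here is a minimal, library-agnostic sketch of how a re-run could decide which files still need processing. It assumes the previous run recorded an etag per path; the `delta` helper and the sample paths and etags are hypothetical, and this is not DataChain's actual implementation:

```python
def delta(previous, current):
    """Return paths that are new, or whose etag changed since the previous run."""
    return sorted(
        path
        for path, etag in current.items()
        if previous.get(path) != etag  # missing path or different etag
    )


# Hypothetical listings: {path: etag} from the last run and from storage now
previous = {"images/a.jpg": "e1", "images/b.jpg": "e2"}
current = {"images/a.jpg": "e1", "images/b.jpg": "e2-changed", "images/c.jpg": "e3"}

to_process = delta(previous, current)
print(to_process)  # ['images/b.jpg', 'images/c.jpg']
```

Only the changed file and the new file are selected; the unchanged one keeps its previously computed row.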
Every .save() registers the dataset in the Dataset DB, DataChain's persistent store for schemas, versions, lineage, and processing state, kept locally in a SQLite database at .datachain/db. Pipelines reference datasets by name, not by path. When the code or input data changes, the next run bumps the dataset version.
This is what makes a dataset a management unit: owned, versioned, and queryable by everyone on the team.
DataChain uses Pydantic to define the shape of every column. The return type of your UDF becomes the dataset schema - each field a queryable column in the Dataset DB.
show() in the previous script renders nested fields as dotted columns:
                                          file    file   info    info
                                          path    size  width  height
0  oxford-pets-micro/images/Abyssinian_141.jpg  111270    461     500
1  oxford-pets-micro/images/Abyssinian_157.jpg  139948    500     375
2  oxford-pets-micro/images/Abyssinian_175.jpg   31265    600     234
3  oxford-pets-micro/images/Abyssinian_220.jpg   10687    300     225
4    oxford-pets-micro/images/Abyssinian_3.jpg   61533    600     869
[Limited by 5 rows]
.print_schema() renders its schema:
file: File@v1
    source: str
    path: str
    size: int
    version: str
    etag: str
    is_latest: bool
    last_modified: datetime
    location: Union[dict, list[dict], NoneType]
info: ImageInfo
    width: int
    height: int
Models can be arbitrarily nested - a BBox inside an Annotation, a List[Citation] inside an LLM Response - every leaf field stays queryable the same way. The schema lives in the Dataset DB and is enforced at dataset creation time.
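As an illustration of how nested models map onto dotted column names, here is a minimal sketch that uses stdlib dataclasses in place of Pydantic. The `BBox` and `Annotation` classes and the `flatten` helper are hypothetical stand-ins, not part of DataChain:

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class BBox:
    x: int
    y: int
    w: int
    h: int


@dataclass
class Annotation:
    label: str
    bbox: BBox


def flatten(obj, prefix=""):
    """Return {dotted_path: leaf_value} for a nested dataclass instance."""
    out = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        path = f"{prefix}{f.name}"
        if is_dataclass(value):
            out.update(flatten(value, prefix=f"{path}."))  # recurse into nested model
        else:
            out[path] = value  # leaf field becomes one column
    return out


row = flatten(Annotation(label="dog", bbox=BBox(x=10, y=20, w=100, h=80)), prefix="ann.")
print(row)
# {'ann.label': 'dog', 'ann.bbox.x': 10, 'ann.bbox.y': 20, 'ann.bbox.w': 100, 'ann.bbox.h': 80}
```

Each leaf ends up addressable by a path such as ann.bbox.w, the same shape as info.width in the schema above.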
The Dataset DB handles datasets of any size - hundreds of millions of files, hundreds of metadata columns - without loading anything into memory. Pandas is limited by RAM; DataChain is not. Export to pandas when you need it, on a filtered subset:
import datachain as dc
df = dc.read_dataset("pets_images").filter(dc.C("info.width") > 500).to_pandas()
print(df)
Filters, aggregations, and joins run as vectorized operations directly against the Dataset DB - metadata never leaves your machine, no files downloaded.
import datachain as dc

cnt = (
    dc.read_dataset("pets_images")
    .filter(
        (dc.C("info.width") > 400)
        & ~dc.C("file.path").ilike("%cocker_spaniel%")  # case-insensitive, negated
    )
    .count()
)
print(f"Large images excluding Cocker Spaniels: {cnt}")
The query returns in milliseconds, even at 100M-file scale:
Large images excluding Cocker Spaniels: 6
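Since the Dataset DB is kept in SQLite locally, the same kind of filter can be pictured as a plain SQL query. The sketch below uses the stdlib sqlite3 module with a made-up, flattened table; the table layout and the sample rows are assumptions for illustration only and do not reflect DataChain's real internal schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table; not DataChain's actual storage layout
conn.execute("CREATE TABLE pets_images (path TEXT, width INTEGER, height INTEGER)")
conn.executemany(
    "INSERT INTO pets_images VALUES (?, ?, ?)",
    [
        ("images/cat_1.jpg", 461, 500),                   # wide enough, kept
        ("images/cat_2.jpg", 300, 225),                   # too narrow
        ("images/english_cocker_spaniel_7.jpg", 500, 375),  # excluded by breed
        ("images/dog_3.jpg", 600, 450),                   # wide enough, kept
    ],
)

# SQLite's LIKE is case-insensitive for ASCII, similar in spirit to ilike()
(cnt,) = conn.execute(
    "SELECT COUNT(*) FROM pets_images "
    "WHERE width > 400 AND path NOT LIKE '%cocker_spaniel%'"
).fetchone()
print(cnt)  # 2
conn.close()
```

The filter touches only metadata columns, so no image bytes are read to answer it.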
When computation is expensive, bugs and new data are both inevitable. DataChain tracks processing state in the Dataset DB - so crashes and new data are handled automatically, without changing how you write pipelines.
Save to embed.py:
import io

import open_clip
import torch
from PIL import Image

import datachain as dc

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", "laion2b_s34b_b79k")
model.eval()

counter = 0


def encode(file: dc.File, model, preprocess) -> list[float]:
    global counter
    counter += 1
    if counter > 236:                  # ← bug: remove these two lines
        raise Exception("some bug")    # ←
    img = Image.open(io.BytesIO(file.read())).convert("RGB")
    with torch.no_grad():
        return model.encode_image(preprocess(img).unsqueeze(0))[0].tolist()


(
    dc.read_dataset("pets_images")
    .settings(batch_size=100)
    .setup(model=lambda: model, preprocess=lambda: preprocess)
    .map(emb=encode)
    .save("pets_embeddings")
)
It fails due to a bug in the code:
Exception: some bug
Remove the two marked lines and re-run. DataChain resumes from image 201, the start of the last uncommitted batch (the first two batches of 100 were already committed):
$ python embed.py
UDF 'encode': Continuing from checkpoint
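The resume point follows from committing results one whole batch at a time. A minimal stdlib sketch of that idea is below; the `run` and `make_buggy` helpers and the in-memory `committed` list are hypothetical stand-ins for persisted state, not DataChain's checkpoint mechanism:

```python
def run(items, fn, committed, batch_size=100):
    """Process `items` in batches; only whole batches are appended to `committed`.

    Returns the index the run started from, which is wherever the
    previously committed results end.
    """
    start = len(committed)
    for i in range(start, len(items), batch_size):
        batch = [fn(x) for x in items[i:i + batch_size]]  # may raise mid-batch
        committed.extend(batch)  # reached only if the whole batch succeeded
    return start


def make_buggy(limit):
    """Build a function that fails after `limit` successful calls."""
    state = {"calls": 0}

    def fn(x):
        state["calls"] += 1
        if state["calls"] > limit:
            raise RuntimeError("some bug")
        return x * 2

    return fn


items = list(range(300))
committed = []  # stands in for state persisted between runs

try:
    run(items, make_buggy(236), committed)  # crashes on the 237th item
except RuntimeError:
    pass
after_crash = len(committed)  # two full batches survived the crash

resumed_from = run(items, lambda x: x * 2, committed)  # the "fixed" code
print(after_crash, resumed_from, len(committed))  # 200 200 300
```

The 36 items processed in the third batch before the crash are redone, because that batch was never committed; index 200 corresponds to image 201.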
The vectors live in the Dataset DB alongside all the metadata, stored as list[float] fields in the Pydantic schema. Querying them is instant: no files are re-read, and vector search can be combined with non-vector filters such as info.width:
Prepare data:
datachain cp --anon s3://dc-readme/fiona.jpg .
similar.py: