Zero-config entity resolution that scales from a CSV to 100M+ rows on a Ray cluster (verified: 100M deduped in 213s, 0.30 GB driver). Fuzzy + exact + probabilistic dedupe, identity graph, PPRL, LLM boost. Python + full TypeScript port; SQL-native in PostgreSQL & DuckDB; MCP/REST servers, dbt + Airflow recipes.
# Add to your Claude Code skills
git clone https://github.com/benseverndev-oss/goldenmatch
A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.
GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenPipe orchestrates. With InferMap for schema mapping and a Rust extension layer for Postgres / DuckDB.
⚡ GoldenMatch scales from a CSV on your laptop to 100M+ rows on a Ray cluster. Verified: 100,000,000 records deduped in 213 s with a 0.30 GB driver footprint.
# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv
# TypeScript / Edge runtimes
npm install goldenmatch
v1.26.0: 100M records, distributed, on a 4-worker Ray cluster, verified. The distributed Phase-5 pipeline (`GOLDENMATCH_DISTRIBUTED_PIPELINE=2`) now runs a full 100,000,000-row dedupe end to end in ~213 s with the driver process peaking at 0.30 GB RSS. The unlock was removing every driver-side collect from the pipeline (scoring -> per-partition local connected-components -> distributed join -> distributed golden build + write), so nothing funnels back to a single node.

v1.25.0: Arrow-native groundwork + leaner large-N runs. Columnar pair-stream / two-frame-cluster entry points and optional Rust/Arrow-C kernels (`build_clusters`, `dedup_pairs`, `record_fingerprints`, MST oversized-split) land behind the `goldenmatch._native` extension. They are purely additive: the pure-Python + Polars pipeline remains the default and the byte-for-byte reference. Also included: single-node memory wins (golden -2.6 GB, bucket -3.8 GB peak at 10M; standardize ~25-30 s off the prep wall time) and fixes for a silently dropped GoldenCheck quality scan and a prep-cache `id()`-recycle flake. PRs #588-#650.

v1.16.0: 5M records in 9.94 min, 6.4 GB peak RSS, on one 16-core node. The new `backend="bucket"` path is now the recommended 5M-on-one-node config: roughly a 5x wall-time reduction and a 2x peak-RSS reduction vs the v1.15 chunked baseline (~50 min, 11.9 GB). It also runs reliably on Linux runners, where the chunked path hung at a 63 GB plateau on the same fixture. PRs #310-#326.
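The "scoring -> connected-components" stage named in the v1.26.0 note can be illustrated on a single machine. The sketch below is not GoldenMatch code: it is a generic union-find over matched pairs, and the function name `connected_components`, the sample pairs, and the `THRESHOLD` value are all assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group record ids into clusters from matched (a, b) pairs via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow as we walk to the root.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one for stable output.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(list)
    for record_id in parent:
        clusters[find(record_id)].append(record_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scored pairs: (id_a, id_b, similarity).
scored_pairs = [(1, 2, 0.95), (2, 3, 0.91), (4, 5, 0.88), (6, 7, 0.40)]
THRESHOLD = 0.85  # illustrative cut-off, not a GoldenMatch default

matched = [(a, b) for a, b, score in scored_pairs if score >= THRESHOLD]
clusters = connected_components(matched)
print(clusters)
```

Records 1, 2, and 3 end up in one cluster even though 1 and 3 were never compared directly, which is why clustering is a separate step after pairwise scoring. A distributed version would run something like this per partition and then reconcile clusters that span partitions.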
Each tool stands alone, but they compose into a single pipeline:
flowchart LR
raw([raw rows])
golden([golden records])
subgraph orchestration ["GoldenPipe orchestrates"]
direction LR
infermap[InferMap]
goldencheck[GoldenCheck]
goldenflow[GoldenFlow]
goldenmatch[GoldenMatch]
infermap --> goldencheck --> goldenflow --> goldenmatch
end
raw --> infermap
goldenmatch --> golden
| Step | Role |
|---|---|
| InferMap | schema mapping: auto-aligns columns across heterogeneous sources |
| GoldenCheck | profile + validate: encoding, format, anomaly detection |
| GoldenFlow | standardize + transform: phone, date, address, categorical normalization |
| GoldenMatch | dedupe + cluster + survivorship: fuzzy / exact / probabilistic / LLM |
| GoldenPipe | orchestrator: declarative YAML pipeline wiring the four steps |
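
To make the roles in the table concrete, here is a toy standard-library sketch of standardize, match, and survivorship on three rows. It does not use any Golden Suite API: `standardize`, `is_match`, `dedupe`, the field names, and the 0.85 threshold are hypothetical, and the greedy first-match loop stands in for real blocking and clustering.

```python
from difflib import SequenceMatcher


def standardize(record):
    """Lowercase values and collapse whitespace (a stand-in for the standardize step)."""
    return {key: " ".join(str(value).lower().split()) for key, value in record.items()}


def is_match(a, b, threshold=0.85):
    """Exact match on a non-empty email, otherwise fuzzy match on name."""
    if a["email"] and a["email"] == b["email"]:
        return True
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold


def dedupe(records):
    """Greedy dedupe: merge each record into the first kept record it matches."""
    golden = []
    for rec in map(standardize, records):
        for kept in golden:
            if is_match(rec, kept):
                # Simple survivorship rule: fill fields the kept record lacks.
                for field, value in rec.items():
                    if not kept[field]:
                        kept[field] = value
                break
        else:
            golden.append(rec)
    return golden


rows = [
    {"name": "Jane  Doe", "email": "JANE@EXAMPLE.COM"},
    {"name": "jane doe", "email": ""},
    {"name": "John Smith", "email": "john@example.com"},
]
golden = dedupe(rows)
for record in golden:
    print(record)
```

The two "Jane Doe" rows collapse into one record only because standardization runs first; on the raw strings the casing and spacing differences would lower the similarity score.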
`auto_configure` + `controller_telemetry` for v1.7-v1.12 introspection.

ControllerPanel, TUI Ctrl+A, CLI `goldenmatch autoconfig`, REST `/autoconfig` + `/controller/telemetry`, Postgres `goldenmatch_autoconfig` + `gm_telemetry`, DuckDB UDFs, MCP/A2A telemetry tools. One JSON shape across every interface.

`evaluate`, Fellegi-Sunter probabilistic scoring, and GoldenFlow transforms.

| Package | Lang | What it does | Install |
|---|---|---|---|
| GoldenMatch | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | `pip install goldenmatch` · `npm i goldenmatch` |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | `pip install goldencheck` · `npm i goldencheck` |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | `pip install goldenflow` · `npm i goldenflow` |
| GoldenPipe | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | `pip install goldenpipe` · `npm i goldenpipe` |
| InferMap | Python · TS | Schema mapping engine: auto-aligns columns across heterogeneous sources. | `pip install infermap` · `npm i infermap` |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| **[dbt-goldensuite](packages/python/goldenmatch/dbt-goldensuite/README.m |