Overview
Most MERGE INTO statements touch a small slice of the table. You’re upserting a day of late-arriving events, applying a CDC batch, reconciling a few corrections. The changed set is often 1–5% of the rows.
The work should scale with that slice. On OSS Apache Spark + Apache Iceberg, it scales with the whole table instead. This post shows why that happens, what it costs at 1 TB, and the low-shuffle path Quanton uses to fix it — same SQL, same table, same cluster.
The problem
To join the target table against the incoming source rows, the Spark + Iceberg MERGE plan shuffles the whole target table — even when only ~5% of rows actually change. On a 1 TB table that means thousands of tasks moving the entire dataset across the network before a single row is written.
You can see it directly in the Spark History Server. For a 1 TB MERGE INTO updating ~5% of rows, look at the Shuffle Read and Shuffle Write columns on the SQL stages.
The shuffle is the dominant cost, and it does not shrink with the change set: it’s a function of how big your table is, not how much of it you’re changing. Double the table and the MERGE gets slower even if the change set stays the same size. That’s the part that catches teams off guard: the cost is decoupled from the work.
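To make that decoupling concrete, here is a toy sketch in plain Python. It is not Spark's planner or anything from Quanton; it assumes a simple hash-partitioned join on the key, and `shuffle_for_join` is a made-up helper for illustration. It counts how many rows get repartitioned versus how many actually match the source.

```python
from collections import defaultdict


def shuffle_for_join(target_ids, source_ids, num_partitions=8):
    """Toy model of a shuffle join: hash-partition both sides by join key.

    Returns (rows_moved, rows_matched). Assumes source_ids are unique.
    """
    partitions = defaultdict(lambda: {"target": [], "source": set()})
    for key in target_ids:
        partitions[key % num_partitions]["target"].append(key)
    for key in source_ids:
        partitions[key % num_partitions]["source"].add(key)

    # Every row on both sides is sent to some partition.
    rows_moved = sum(
        len(p["target"]) + len(p["source"]) for p in partitions.values()
    )
    # Only target rows whose key appears in the source are actually changed.
    rows_matched = sum(
        1 for p in partitions.values() for key in p["target"] if key in p["source"]
    )
    return rows_moved, rows_matched


# Hold the change set at 50 keys while the target table doubles twice.
source = range(50)
for table_rows in (1_000, 2_000, 4_000):
    moved, matched = shuffle_for_join(range(table_rows), source)
    print(f"table={table_rows} moved={moved} matched={matched}")
# table=1000 moved=1050 matched=50
# table=2000 moved=2050 matched=50
# table=4000 moved=4050 matched=50
```

In this model the matched count never moves off 50, while the rows moved grow in step with the table.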
How Quanton fixes it
Quanton replaces this with a vectorized, low-shuffle columnar MERGE path for both Apache Iceberg and Apache Hudi. The work scales with what changed, not with the size of the table.
No plan rewrite, no new syntax, no data migration. You run the exact same MERGE INTO you run today:
MERGE INTO default.customers t
USING source_updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  t.amount = s.amount,
  t.status = s.status
WHEN NOT MATCHED THEN INSERT (id, name, region, amount, status)
  VALUES (s.id, s.name, s.region, s.amount, s.status)
Comparison at 1 TB scale
1 TB MERGE INTO, ~5% of rows updated, identical resources across both runs:
Roughly 6× faster — same SQL, same result, and about 1 TB less shuffle across the network.
See it on a real table
The merge-into-demo in the quanton-operator repo is fully self-contained — it seeds a customers table with 10 rows, runs a MERGE that updates 3 and inserts 3, and validates the result. It runs on a local minikube (4 CPUs / 8 GB RAM) with the Spark and Quanton operators installed.
Iceberg:
kubectl apply -f quanton-iceberg-merge-into-demo.yaml
kubectl logs -f quanton-iceberg-merge-into-demo-driver -n default
Hudi:
kubectl apply -f quanton-hudi-merge-into-demo.yaml
kubectl logs -f quanton-hudi-merge-into-demo-driver -n default
Both end with the same check (Iceberg output shown):
[iceberg-merge] After MERGE: 13 rows, 3 'vip'
[iceberg-merge] PASS — 10 -> 13 rows, 3 updated to 'vip'
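If you want to sanity-check those numbers without a cluster, the MERGE semantics can be mimicked in a few lines of plain Python. This is a minimal sketch, not the demo's code: the seed values, the 'new' status on inserted rows, and the `merge_into` helper are all hypothetical, chosen only to match the 10-row, 3-update, 3-insert shape described above.

```python
def merge_into(target, source):
    """Mimic the post's MERGE on dict-of-rows tables keyed by id."""
    result = {key: dict(row) for key, row in target.items()}
    for key, src in source.items():
        if key in result:
            # WHEN MATCHED: only amount and status are overwritten.
            result[key]["amount"] = src["amount"]
            result[key]["status"] = src["status"]
        else:
            # WHEN NOT MATCHED: insert the full source row.
            result[key] = dict(src)
    return result


# Hypothetical seed data: 10 customers with ids 1..10.
target = {
    i: {"name": f"customer-{i}", "region": "us", "amount": 100, "status": "active"}
    for i in range(1, 11)
}

# Hypothetical source batch: 3 existing ids promoted to 'vip', 3 new ids.
source = {}
for i in (1, 2, 3):
    source[i] = {"name": "ignored", "region": "ignored", "amount": 500, "status": "vip"}
for i in (11, 12, 13):
    source[i] = {"name": f"customer-{i}", "region": "eu", "amount": 50, "status": "new"}

merged = merge_into(target, source)
vip = sum(1 for row in merged.values() if row["status"] == "vip")
print(f"{len(target)} -> {len(merged)} rows, {vip} 'vip'")
# 10 -> 13 rows, 3 'vip'
```

Note that matched rows keep their original name and region, because the UPDATE clause only sets amount and status.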
Run it with an agent
The repo ships a Claude Code skill, run-merge-into, that runs the whole demo for you. Clone the repo, launch Claude Code, and invoke the skill:
git clone https://github.com/onehouseinc/quanton-operator
cd quanton-operator
claude
> /run-merge-into
It walks you through picking Hudi or Iceberg, validates the cluster, applies the manifest, polls progress, and verifies the result (10 → 13 rows, 3 vip) — no YAML spelunking required.
Takeaway
MERGE INTO cost should track the size of your change, not the size of your table. On OSS Spark + Iceberg it tracks the table. Quanton’s low-shuffle columnar path puts the cost back where it belongs.
Try it on your own tables now!