What does spark.sql.shuffle.partitions do?

It sets how many partitions Spark creates on the reduce side of a shuffle — for joins, aggregations, and MERGE — and defaults to 200. Too few gives you huge tasks that spill to disk; too many gives you thousands of tiny tasks whose scheduling overhead dominates.

What is the right value for spark.sql.shuffle.partitions?

Aim for shuffle partitions of roughly 100–200 MB each: dividing total shuffle bytes by that target size gives a good starting count. On Spark 3.x, enable AQE (spark.sql.adaptive.enabled together with spark.sql.adaptive.coalescePartitions.enabled) and it will coalesce small shuffle partitions to a sensible size at runtime, which makes the static number far less critical.
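As a rough sketch of how those settings could be passed to a job, the helper below assembles spark-submit `--conf` flags. The helper name `aqe_conf_flags`, the starting count of 2000, and the `128m` advisory size are illustrative assumptions, not values from this post; the four configuration keys are standard Spark SQL settings.

```python
from typing import Dict, List


def aqe_conf_flags(shuffle_partitions: int = 2000, advisory_size: str = "128m") -> List[str]:
    """Build spark-submit --conf flags for AQE partition coalescing.

    Hypothetical helper: the defaults are placeholders to adapt, not recommendations.
    """
    conf: Dict[str, str] = {
        # Turn on Adaptive Query Execution and its partition coalescing.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Size AQE aims for when it merges small shuffle partitions.
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": advisory_size,
        # Static starting count; AQE can coalesce down from here at runtime.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(aqe_conf_flags()))
```

Starting with a generous static count and letting AQE coalesce downward is one common approach, since AQE merges undersized partitions but the static number still bounds how finely the shuffle can be split.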

Does tuning spark.sql.shuffle.partitions make MERGE INTO faster?

No — the two are unrelated. The setting changes how shuffled data is sliced into tasks, not how much data is shuffled; the full-table shuffle is inherent to how the OSS Spark + Iceberg copy-on-write MERGE plan is designed. A 1 TB MERGE moves ~1 TB regardless of the partition count — cutting the bytes shuffled requires a different MERGE execution path.

Why is MERGE INTO so slow on Spark and Iceberg?

Because the copy-on-write MERGE plan shuffles the entire target table to join it against the source, even when only a few percent of rows change. The shuffle cost scales with table size, not change size — a 1 TB table moves ~1 TB across the network to update 5% of its rows.

How does Quanton make MERGE INTO faster?

Quanton runs MERGE through a vectorized, low-shuffle columnar path for Iceberg and Hudi, so shuffle scales with the change set instead of the table. On a 1 TB MERGE updating ~5% of rows, that's 21 minutes down to 3.6 minutes — about 6× — with the same SQL.

MERGE INTO Updates a Slice of Your Table — So Why Shuffle All of It?

Overview

Most MERGE INTO statements touch a small slice of the table. You’re upserting a day of late-arriving events, applying a CDC batch, reconciling a few corrections. The changed set is often 1–5% of the rows.

The work should scale with that slice. On OSS Apache Spark + Apache Iceberg, it scales with the whole table instead. This post shows why that happens, what it costs at 1 TB, and the low-shuffle path Quanton uses to fix it — same SQL, same table, same cluster.

The problem

To join the target table against the incoming source rows, the Spark + Iceberg MERGE plan shuffles the whole target table — even when only ~5% of rows actually change. On a 1 TB table that means thousands of tasks moving the entire dataset across the network before a single row is written.

You can see it directly in the Spark History Server. This is a 1 TB MERGE INTO updating ~5% of rows — look at the shuffle columns on the SQL stages:

[Figure: Spark History Server completed stages for a 1 TB MERGE. Stage 10 writes 1,459.7 GiB of shuffle and stage 14 reads 1,466.8 GiB (both highlighted in red): roughly the entire 1 TB table moved across the network to update ~5% of its rows.]

The shuffle is the dominant cost, and it’s fixed with respect to the change set: it’s a function of how big your table is, not how much of it you’re changing. Double the table and the MERGE gets slower even if the change set stays the same size. That’s the part that catches teams off guard: the cost is decoupled from the work.
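The scaling argument above can be made concrete with a toy model. This is only a sketch under simplifying assumptions: it treats the full-table plan as shuffling exactly the target table's bytes, and it models a change-proportional plan as shuffling exactly the changed bytes. Both function names are hypothetical, and neither function describes any engine's actual implementation.

```python
TB = 10 ** 12
GB = 10 ** 9


def shuffle_bytes_full_table(table_bytes: int, change_bytes: int) -> int:
    """Toy model of a plan that shuffles the whole target table.

    change_bytes is deliberately ignored: cost depends on table size only.
    """
    return table_bytes


def shuffle_bytes_change_proportional(table_bytes: int, change_bytes: int) -> int:
    """Toy model of a plan whose shuffle tracks the change set (assumption)."""
    return change_bytes


if __name__ == "__main__":
    change = 50 * GB  # same change set in both scenarios
    for table in (1 * TB, 2 * TB):
        full = shuffle_bytes_full_table(table, change)
        prop = shuffle_bytes_change_proportional(table, change)
        print(f"table={table / TB:.0f} TB  full-table shuffle={full / GB:,.0f} GB  "
              f"change-proportional shuffle={prop / GB:,.0f} GB")
```

In this model, doubling the table doubles the first number while the second stays put, which is the decoupling described above.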

Can spark.sql.shuffle.partitions fix this?

No — these are two unrelated problems, and no amount of shuffle tuning fixes the one above. The full-table shuffle is a design flaw in how the OSS Spark + Iceberg copy-on-write MERGE writer builds its plan: the entire target table is shuffled to join against the source, by construction. spark.sql.shuffle.partitions controls how that shuffled data is sliced into tasks, not how much of it moves — so the 1,459 GiB crosses the network at any partition count. The fix has to come from a different plan, not a different knob.
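To illustrate why the partition count cannot reduce the bytes moved, here is a minimal sketch that slices a fixed shuffle volume into different numbers of tasks. It assumes keys are evenly distributed (no skew), and `slice_shuffle` is a hypothetical helper for illustration, not a Spark API.

```python
GIB = 1024 ** 3


def slice_shuffle(total_shuffle_bytes: float, num_partitions: int) -> dict:
    """Split a fixed shuffle volume evenly across reduce-side partitions."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return {
        "partitions": num_partitions,
        "bytes_per_task": total_shuffle_bytes / num_partitions,
        # The partition count changes the slicing, not the volume.
        "total_bytes_moved": total_shuffle_bytes,
    }


if __name__ == "__main__":
    total = 1459.7 * GIB  # shuffle write reported for stage 10 above
    for n in (200, 2_000, 12_000):
        plan = slice_shuffle(total, n)
        print(f"{n:>6} partitions: {plan['bytes_per_task'] / GIB:8.3f} GiB per task, "
              f"{plan['total_bytes_moved'] / GIB:,.1f} GiB moved in total")
```

Only the per-task column changes between rows; the total is the same at every partition count.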

That said, the setting is worth tuning for what it does control — task granularity on every shuffle in your workload (default 200 partitions):

Too few partitions → giant tasks that blow past executor memory and spill to disk.
Too many partitions → thousands of tasks that each run for only milliseconds, so scheduling overhead dominates.

A reasonable target is 100–200 MB per shuffle partition (total shuffle bytes ÷ target size ≈ partition count). On Spark 3.x, AQE (spark.sql.adaptive.enabled + spark.sql.adaptive.coalescePartitions.enabled) coalesces undersized shuffle partitions at runtime and makes the static setting far less critical.
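The sizing rule above is simple arithmetic, sketched below. The function name `suggested_shuffle_partitions` and the 128 MiB default are assumptions chosen for illustration (any target in the 100–200 MB range follows the same pattern), and the result is a starting point to validate against real stage metrics rather than a definitive setting.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggested_shuffle_partitions(total_shuffle_bytes: float,
                                 target_partition_bytes: float = 128 * MIB) -> int:
    """Starting partition count: total shuffle bytes divided by target size, rounded up."""
    if total_shuffle_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("byte sizes must be positive")
    return max(1, math.ceil(total_shuffle_bytes / target_partition_bytes))


if __name__ == "__main__":
    # Shuffle write size taken from the stage 10 figure quoted earlier.
    total = 1459.7 * GIB
    for target_mib in (100, 128, 200):
        count = suggested_shuffle_partitions(total, target_mib * MIB)
        print(f"target {target_mib} MiB -> about {count:,} partitions")
```

For a shuffle of that size, the default of 200 partitions would leave each task handling several GiB, which is the "too few partitions" case described above.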

Best case, that buys you evenly-sized tasks. When the problem is bytes shuffled rather than task granularity, you need a plan that shuffles less.

How Quanton fixes it

Quanton replaces this with a vectorized, low-shuffle columnar MERGE path for both Apache Iceberg and Apache Hudi. The work scales with what changed, not with the size of the table.

No plan rewrite, no new syntax, no data migration. You run the exact same MERGE INTO you run today:

MERGE INTO default.customers t
USING source_updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  t.amount = s.amount,
  t.status = s.status
WHEN NOT MATCHED THEN INSERT (id, name, region, amount, status)
  VALUES (s.id, s.name, s.region, s.amount, s.status)

Comparison at 1 TB scale

1 TB MERGE INTO, ~5% of rows updated, identical resources across both runs:

OSS Spark + Iceberg: 21 min, ~1 TB shuffled (the entire table moves across the network)

Quanton: 3.6 min, shuffle scales with the change set, not the table

Roughly 6× faster — same SQL, same result, and about 1 TB less shuffle across the network.
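For readers who want to check the ratio, the arithmetic behind "roughly 6×" is sketched below using only the two run times quoted above. The helper names are hypothetical.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_minutes / new_minutes


def time_saved_percent(baseline_minutes: float, new_minutes: float) -> float:
    """Share of the baseline wall-clock time that the new run eliminates."""
    return (1 - new_minutes / baseline_minutes) * 100


if __name__ == "__main__":
    baseline, quanton = 21.0, 3.6  # minutes, from the comparison above
    print(f"speedup: {speedup(baseline, quanton):.2f}x")                   # 5.83x
    print(f"time saved: {time_saved_percent(baseline, quanton):.1f}%")    # 82.9%
```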

See it on a real table

The merge-into-demo in the quanton-operator repo is fully self-contained — it seeds a customers table with 10 rows, runs a MERGE that updates 3 and inserts 3, and validates the result. It runs on a local minikube (4 CPUs / 8 GB RAM) with the Spark and Quanton operators installed.

Iceberg:

kubectl apply -f quanton-iceberg-merge-into-demo.yaml
kubectl logs -f quanton-iceberg-merge-into-demo-driver -n default

Hudi:

kubectl apply -f quanton-hudi-merge-into-demo.yaml
kubectl logs -f quanton-hudi-merge-into-demo-driver -n default

Both end with the same check (Iceberg output shown):

[iceberg-merge] After MERGE: 13 rows, 3 'vip'
[iceberg-merge] PASS — 10 -> 13 rows, 3 updated to 'vip'

Run it with an agent

The repo ships a Claude Code skill — run-merge-into — that runs the whole demo for you. Clone the repo, launch Claude, and invoke the skill:

git clone https://github.com/onehouseinc/quanton-operator
cd quanton-operator
claude

> /run-merge-into

It walks you through picking Hudi or Iceberg, validates the cluster, applies the manifest, polls progress, and verifies the result (10 → 13 rows, 3 vip) — no YAML spelunking required.

Takeaway

MERGE INTO cost should track the size of your change, not the size of your table. On OSS Spark + Iceberg it tracks the table. Quanton’s low-shuffle columnar path puts the cost back where it belongs.

Try it on your own tables now!