Shuffle Hash Join vs. Sort-Merge Join: How Modern Engines Execute Equi-Joins

Overview

Joins are among the most expensive operations in distributed analytical query processing. For inner equi-joins (R.id = S.id), two execution strategies dominate modern query engines: Shuffle Hash Join (SHJ) and Sort-Merge Join (SMJ).

Both produce identical results. What differs is where they spend their cycles: SHJ trades memory for near-constant-time lookups; SMJ trades an upfront sort for a cheap sequential merge and ordered output. The right choice depends on data layout, memory headroom, and what the rest of the plan needs.

Side-by-side data-flow diagram: in Shuffle Hash Join, tables R and S are shuffled by hash(key) % 3 to three executors, each of which builds an in-memory hash table from its R partition and streams its S partition through it, producing unordered output. In Sort-Merge Join, the same shuffle is followed by sorting both partitions on each executor (O(n log n)), then a lock-step cursor merge, producing output sorted by the join key. — Both strategies shuffle by `hash(key) % N` so matching keys co-locate. After that they diverge: SHJ builds and probes a hash table per executor; SMJ sorts both partitions, then merges cursors in lock-step.

Quanton selects SHJ as the default strategy for equi-joins. The operator is fully vectorized: hash computation, key comparison, and table probing all run over batches of rows using SIMD instructions rather than one row at a time. This post walks through both algorithms, the implementation details that actually determine performance, and the tradeoffs an optimizer weighs between them.

Shuffle Hash Join (SHJ)

A Shuffle Hash Join evaluates a join by redistributing data across workers, then performing local hash joins within each partition. Three phases:

Phase 01

Shuffle

hash(join_key) % N

Both relations are partitioned by a hash of the join key and redistributed across the cluster.

Rows with the same join key are guaranteed to land on the same worker, so each worker can complete its slice of the join locally with no further coordination.

Phase 02

Build

Each worker scans its partition of one side — typically the smaller one — and builds an in-memory hash table keyed by the join column.

In a pipelined engine, build and probe run as separate operator pipelines: the build pipeline consumes its entire input and constructs the table, while the probe pipeline blocks until the table is published.

Phase 03

Probe

The corresponding partition of the other relation is scanned, and each join key is hashed and used to probe the table for matches.

Matching rows are stitched into output batches and streamed downstream — no blocking, no sort.

How a fast SHJ is actually built

The three-phase description above is the easy part. The performance gap between a textbook hash join and a state-of-the-art one comes from the hash table itself. Quanton’s implementation uses several techniques that have become standard in high-performance vectorized engines:

Columnar hashing, not row-at-a-time. Hashes are computed over an entire batch of keys per column before any probing begins. A multi-column key hashes column 1 for all rows, then combines in column 2 for all rows, and so on — keeping each loop tight, branch-free, and trivially vectorizable. The computed hash is also carried alongside the row, so it’s never recomputed during table resizes or repartitioning.

Adaptive table layouts. The operator inspects the actual key values as the build side streams in and picks the cheapest table representation that fits:

Array layout — if the keys are integers in a reasonably dense range (or low-cardinality values that can be dictionary-encoded into one), the “hash table” degenerates into a direct array index. No hash function, no collisions, one load per probe.
Normalized keys — if a multi-column key’s values can be packed into a single 64-bit word (each column contributes a fixed number of bits based on its observed range), the join compares one machine word instead of N columns. Hashing and equality both collapse to single-word operations.
Generic layout — arbitrary keys fall back to a full hash table with explicit key comparison.

The decision is made at runtime from observed data, not from static type information — and the operator can fall back from a cheaper mode to a more general one mid-build if a value arrives that breaks the assumption.

SIMD tag probing. In generic mode, the table groups slots into blocks of 16, each with a one-byte tag derived from the key’s hash. A probe loads all 16 tags with a single vector instruction and compares them against the probe key’s tag in one cycle. Only slots whose tags match proceed to a full key comparison — which for the overwhelming majority of probes means zero or one full comparisons per lookup, and most non-matches are rejected without ever touching the key data.

Two-level tables for parallel builds. When multiple threads build the table concurrently, each thread fills its own sub-table sharded by high hash bits; the shards are then stitched together without locks on the hot path. The same sharding makes spilling surgical — individual sub-tables can be spilled and recursively re-joined rather than restarting the whole build.

Dynamic filter pushdown. As a bonus that pure SMJ can’t replicate: once the build side is complete, the operator knows the exact set (or min/max range) of join keys that can possibly match. That predicate is pushed down into the probe side’s scan, so non-matching rows are filtered out at the file reader before they’re ever decoded, shuffled, or probed. On selective joins this routinely eliminates the majority of probe-side I/O.

Why vectorization matters

A row-at-a-time join interpreter does this per row:

read key → compute hash → probe table → compare key → emit → repeat

Each step is a data-dependent branch, and the CPU stalls on every cache miss in sequence. A vectorized SHJ restructures the same work into per-batch loops:

hash 1,000 keys, then
tag-check 1,000 probes (16 slots per SIMD compare), then
run full comparisons only for tag hits, then
gather matched payloads into output batches

Each loop is branch-predictable, keeps the relevant data hot in cache, and exposes independent memory accesses the CPU can overlap instead of serializing. For joins over millions or billions of rows, this is the difference between a join that is memory-bound and one that is genuinely compute-efficient.

SHJ tradeoffs

Advantages

Excellent performance on large, unsorted datasets — no sort, near-O(1) probes
Build cost is linear in the build side; probe cost is linear in the probe side
Enables dynamic filter pushdown into probe-side scans
Typically the fastest strategy when data has no useful ordering

Limitations

Requires memory proportional to the build side; oversized builds force spilling and recursive partitioning
Severe key skew can overload individual partitions
Output ordering is not preserved

Sort-Merge Join (SMJ)

A Sort-Merge Join takes the opposite bet: instead of building a hash table, both relations are ordered by the join key and merged sequentially.

Phase 01

Shuffle & Sort

sort(partition) — O(n log n)

Like SHJ, both relations are redistributed across workers by join key — but then each partition is sorted on the join column before any matching begins.

Partitions that exceed memory fall back to external merge sort: spill sorted runs to disk, merge them back in a k-way pass.

Phase 02

Merge

▸ ▸ lock-step cursors

Workers scan the two sorted partitions with a cursor on each side: keys equal → emit joined rows; otherwise advance the cursor with the smaller key.

Purely sequential access on both inputs, trivially streamable, and very low memory — a few buffered batches per side.

The sort phase is where the asymptotics bite. SHJ does O(n) work per side; SMJ does O(n log n) — and the constant factor on comparison-based sorting of wide rows is substantial. An external sort also pays for its input twice (write runs, read runs) before the join proper even begins. Engines mitigate this with the same normalized-key trick used in hash tables (pack the sort key into a single word so comparisons are one instruction) and with cache-conscious merge trees, but the work doesn’t disappear.

The merge phase has its own wrinkle: when duplicate keys exist on both sides, the merge must produce the full cross product of the matching groups — the inner cursor rewinds to the start of the duplicate run for each outer row. Duplicate-heavy keys (the classic skewed fact-to-fact join) turn the linear merge into repeated backtracking over the same region, and that region had better still be in cache.

SMJ tradeoffs

Advantages

Low, predictable memory footprint — no build-side table to hold
Output is sorted by the join key
If the inputs are already sorted or clustered on the key, the sort phase is skipped entirely and the join reduces to a cheap streaming merge
Can eliminate a downstream sort when the query has an ORDER BY on the join key — the optimizer gets two operators for the price of one

Limitations

Sorting dominates cost on unsorted inputs: O(n log n) plus potential spill I/O
Higher CPU overhead per row than hash probing
Duplicate-heavy keys introduce merge backtracking

Forcing SMJ in Quanton

Quanton defaults to SHJ: even when Spark’s planner selects a sort-merge join, Quanton rewrites it into a shuffle hash join at execution time. If your workload genuinely favors SMJ — pre-sorted inputs, an ORDER BY on the join key, or tight memory — you can disable that rewrite:

spark.quanton.sql.columnar.forceShuffledHashJoin=false

With the rewrite off, join selection falls back to Spark’s planner, which prefers SMJ by default (spark.sql.join.preferSortMergeJoin is true out of the box). Note that setting preferSortMergeJoin alone is not enough — Quanton’s SHJ rewrite takes precedence unless it is explicitly disabled.

Architectural comparison

Property	Shuffle Hash Join (SHJ)	Sort-Merge Join (SMJ)
Data redistribution	Required	Required
Sorting required	No	Yes (unless already sorted)
Join-phase complexity	O(n) build + O(m) probe	O(n log n) sort + O(n + m) merge
Memory dependency	Higher (build-side table)	Lower (streaming merge)
CPU overhead	Lower	Higher
Output order	Unordered	Sorted by join key
`ORDER BY` synergy	None	Can eliminate downstream sorting
Scan-level filter pushdown	Yes (dynamic filters from build side)	No
Best data layout	Unsorted data	Pre-sorted or clustered data
Worst case	Key skew, oversized build side, spilling	Duplicate-heavy backtracking, double-pass external sort

Key takeaways

Hash-based joins generally outperform sort-merge joins on large, unsorted analytical datasets: they replace an O(n log n) sort with linear-time table construction and near-constant-time lookups, and a well-engineered implementation — adaptive table layouts, normalized keys, SIMD tag probing, dynamic filter pushdown — keeps both the constant factors and the memory traffic low.

The exception is layout. When data is already sorted or clustered on the join key, or when a downstream operator wants ordered output, SMJ skips its expensive phase entirely and becomes the cheaper plan.

A modern optimizer weighs exactly these factors at runtime: data distribution, memory availability, existing ordering, and downstream requirements. For most analytical workloads, though, SHJ wins — which is why Quanton defaults to a SIMD-vectorized Shuffle Hash Join for equi-joins.