← Back to blog
// Query Execution · Joins

Shuffle Hash Join vs. Sort-Merge Join: How Modern Engines Execute Equi-Joins

SHJ and SMJ produce identical results but make opposite bets on CPU, memory, and data layout. A deep dive into how vectorized engines implement both — adaptive hash-table layouts, SIMD tag probing, normalized keys — and why Quanton defaults to a vectorized SHJ.

June 4, 2026 / Written by Rajesh Mahindra

Overview

Joins are among the most expensive operations in distributed analytical query processing. For inner equi-joins (R.id = S.id), two execution strategies dominate modern query engines: Shuffle Hash Join (SHJ) and Sort-Merge Join (SMJ).

Both produce identical results. What differs is where they spend their cycles: SHJ trades memory for near-constant-time lookups; SMJ trades an upfront sort for a cheap sequential merge and ordered output. The right choice depends on data layout, memory headroom, and what the rest of the plan needs.

Side-by-side data-flow diagram: in Shuffle Hash Join, tables R and S are shuffled by hash(key) % 3 to three executors, each of which builds an in-memory hash table from its R partition and streams its S partition through it, producing unordered output. In Sort-Merge Join, the same shuffle is followed by sorting both partitions on each executor (O(n log n)), then a lock-step cursor merge, producing output sorted by the join key.
Both strategies shuffle by hash(key) % N so matching keys co-locate. After that they diverge: SHJ builds and probes a hash table per executor; SMJ sorts both partitions, then merges cursors in lock-step.

Quanton selects SHJ as the default strategy for equi-joins. The operator is fully vectorized: hash computation, key comparison, and table probing all run over batches of rows using SIMD instructions rather than one row at a time. This post walks through both algorithms, the implementation details that actually determine performance, and the tradeoffs an optimizer weighs between them.

Shuffle Hash Join (SHJ)

A Shuffle Hash Join evaluates a join by redistributing data across workers, then performing local hash joins within each partition. Three phases:

Phase 01

Shuffle

hash(join_key) % N

Both relations are partitioned by a hash of the join key and redistributed across the cluster.

Rows with the same join key are guaranteed to land on the same worker, so each worker can complete its slice of the join locally with no further coordination.

Phase 02

Build

Each worker scans its partition of one side — typically the smaller one — and builds an in-memory hash table keyed by the join column.

In a pipelined engine, build and probe run as separate operator pipelines: the build pipeline consumes its entire input and constructs the table, while the probe pipeline blocks until the table is published.

Phase 03

Probe

The corresponding partition of the other relation is scanned, and each join key is hashed and used to probe the table for matches.

Matching rows are stitched into output batches and streamed downstream — no blocking, no sort.

How a fast SHJ is actually built

The three-phase description above is the easy part. The performance gap between a textbook hash join and a state-of-the-art one comes from the hash table itself. Quanton’s implementation uses several techniques that have become standard in high-performance vectorized engines:

Columnar hashing, not row-at-a-time. Hashes are computed over an entire batch of keys per column before any probing begins. A multi-column key hashes column 1 for all rows, then combines in column 2 for all rows, and so on — keeping each loop tight, branch-free, and trivially vectorizable. The computed hash is also carried alongside the row, so it’s never recomputed during table resizes or repartitioning.

Adaptive table layouts. The operator inspects the actual key values as the build side streams in and picks the cheapest table representation that fits:

  • Array layout — if the keys are integers in a reasonably dense range (or low-cardinality values that can be dictionary-encoded into one), the “hash table” degenerates into a direct array index. No hash function, no collisions, one load per probe.
  • Normalized keys — if a multi-column key’s values can be packed into a single 64-bit word (each column contributes a fixed number of bits based on its observed range), the join compares one machine word instead of N columns. Hashing and equality both collapse to single-word operations.
  • Generic layout — arbitrary keys fall back to a full hash table with explicit key comparison.

The decision is made at runtime from observed data, not from static type information — and the operator can fall back from a cheaper mode to a more general one mid-build if a value arrives that breaks the assumption.

SIMD tag probing. In generic mode, the table groups slots into blocks of 16, each with a one-byte tag derived from the key’s hash. A probe loads all 16 tags with a single vector instruction and compares them against the probe key’s tag in one cycle. Only slots whose tags match proceed to a full key comparison — which for the overwhelming majority of probes means zero or one full comparisons per lookup, and most non-matches are rejected without ever touching the key data.

Two-level tables for parallel builds. When multiple threads build the table concurrently, each thread fills its own sub-table sharded by high hash bits; the shards are then stitched together without locks on the hot path. The same sharding makes spilling surgical — individual sub-tables can be spilled and recursively re-joined rather than restarting the whole build.

Dynamic filter pushdown. As a bonus that pure SMJ can’t replicate: once the build side is complete, the operator knows the exact set (or min/max range) of join keys that can possibly match. That predicate is pushed down into the probe side’s scan, so non-matching rows are filtered out at the file reader before they’re ever decoded, shuffled, or probed. On selective joins this routinely eliminates the majority of probe-side I/O.

Why vectorization matters

A row-at-a-time join interpreter does this per row:

read key → compute hash → probe table → compare key → emit → repeat

Each step is a data-dependent branch, and the CPU stalls on every cache miss in sequence. A vectorized SHJ restructures the same work into per-batch loops:

  • hash 1,000 keys, then
  • tag-check 1,000 probes (16 slots per SIMD compare), then
  • run full comparisons only for tag hits, then
  • gather matched payloads into output batches

Each loop is branch-predictable, keeps the relevant data hot in cache, and exposes independent memory accesses the CPU can overlap instead of serializing. For joins over millions or billions of rows, this is the difference between a join that is memory-bound and one that is genuinely compute-efficient.

SHJ tradeoffs

Advantages

  • Excellent performance on large, unsorted datasets — no sort, near-O(1) probes
  • Build cost is linear in the build side; probe cost is linear in the probe side
  • Enables dynamic filter pushdown into probe-side scans
  • Typically the fastest strategy when data has no useful ordering

Limitations

  • Requires memory proportional to the build side; oversized builds force spilling and recursive partitioning
  • Severe key skew can overload individual partitions
  • Output ordering is not preserved

Sort-Merge Join (SMJ)

A Sort-Merge Join takes the opposite bet: instead of building a hash table, both relations are ordered by the join key and merged sequentially.

Phase 01

Shuffle & Sort

sort(partition) — O(n log n)

Like SHJ, both relations are redistributed across workers by join key — but then each partition is sorted on the join column before any matching begins.

Partitions that exceed memory fall back to external merge sort: spill sorted runs to disk, merge them back in a k-way pass.

Phase 02

Merge

▸ ▸ lock-step cursors

Workers scan the two sorted partitions with a cursor on each side: keys equal → emit joined rows; otherwise advance the cursor with the smaller key.

Purely sequential access on both inputs, trivially streamable, and very low memory — a few buffered batches per side.

The sort phase is where the asymptotics bite. SHJ does O(n) work per side; SMJ does O(n log n) — and the constant factor on comparison-based sorting of wide rows is substantial. An external sort also pays for its input twice (write runs, read runs) before the join proper even begins. Engines mitigate this with the same normalized-key trick used in hash tables (pack the sort key into a single word so comparisons are one instruction) and with cache-conscious merge trees, but the work doesn’t disappear.

The merge phase has its own wrinkle: when duplicate keys exist on both sides, the merge must produce the full cross product of the matching groups — the inner cursor rewinds to the start of the duplicate run for each outer row. Duplicate-heavy keys (the classic skewed fact-to-fact join) turn the linear merge into repeated backtracking over the same region, and that region had better still be in cache.

SMJ tradeoffs

Advantages

  • Low, predictable memory footprint — no build-side table to hold
  • Output is sorted by the join key
  • If the inputs are already sorted or clustered on the key, the sort phase is skipped entirely and the join reduces to a cheap streaming merge
  • Can eliminate a downstream sort when the query has an ORDER BY on the join key — the optimizer gets two operators for the price of one

Limitations

  • Sorting dominates cost on unsorted inputs: O(n log n) plus potential spill I/O
  • Higher CPU overhead per row than hash probing
  • Duplicate-heavy keys introduce merge backtracking

Forcing SMJ in Quanton

Quanton defaults to SHJ: even when Spark’s planner selects a sort-merge join, Quanton rewrites it into a shuffle hash join at execution time. If your workload genuinely favors SMJ — pre-sorted inputs, an ORDER BY on the join key, or tight memory — you can disable that rewrite:

spark.quanton.sql.columnar.forceShuffledHashJoin=false

With the rewrite off, join selection falls back to Spark’s planner, which prefers SMJ by default (spark.sql.join.preferSortMergeJoin is true out of the box). Note that setting preferSortMergeJoin alone is not enough — Quanton’s SHJ rewrite takes precedence unless it is explicitly disabled.

Architectural comparison

PropertyShuffle Hash Join (SHJ)Sort-Merge Join (SMJ)
Data redistributionRequiredRequired
Sorting requiredNoYes (unless already sorted)
Join-phase complexityO(n) build + O(m) probeO(n log n) sort + O(n + m) merge
Memory dependencyHigher (build-side table)Lower (streaming merge)
CPU overheadLowerHigher
Output orderUnorderedSorted by join key
ORDER BY synergyNoneCan eliminate downstream sorting
Scan-level filter pushdownYes (dynamic filters from build side)No
Best data layoutUnsorted dataPre-sorted or clustered data
Worst caseKey skew, oversized build side, spillingDuplicate-heavy backtracking, double-pass external sort

Key takeaways

Hash-based joins generally outperform sort-merge joins on large, unsorted analytical datasets: they replace an O(n log n) sort with linear-time table construction and near-constant-time lookups, and a well-engineered implementation — adaptive table layouts, normalized keys, SIMD tag probing, dynamic filter pushdown — keeps both the constant factors and the memory traffic low.

The exception is layout. When data is already sorted or clustered on the join key, or when a downstream operator wants ordered output, SMJ skips its expensive phase entirely and becomes the cheaper plan.

A modern optimizer weighs exactly these factors at runtime: data distribution, memory availability, existing ordering, and downstream requirements. For most analytical workloads, though, SHJ wins — which is why Quanton defaults to a SIMD-vectorized Shuffle Hash Join for equi-joins.

RM
Data Infrastructure at Onehouse · Building Quanton
← Back to blog