← Back to blog
// Query Execution · Runtime Filters

Bloom Filters Before the Join: How Spark Prunes Probe Rows — and How Quanton Makes It Native

Spark applies Bloom filters to large fact tables before joining them with dimension tables, reducing the amount of data that must be shuffled and joined. Quanton preserves Spark's Bloom filter injection strategy while accelerating filter evaluation through vectorized execution.

June 4, 2026 / Written by Rajesh Mahindra

Overview

Consider a common analytical query:

SELECT ss.ss_item_sk, SUM(ss.ss_net_paid)
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2024
GROUP BY ss.ss_item_sk

The filter is applied to date_dim, reducing it to a few thousand rows. The dominant cost, however, lies in store_sales, where billions of rows must still be scanned, shuffled, and probed during the join, even though only records associated with 2024 dates contribute to the final result. The predicate that would eliminate most fact-table rows is implied by the join condition, but because it is not explicitly defined on the fact table, a conventional execution plan cannot apply it during the scan.

Runtime filtering addresses this limitation by computing a compact summary of the join keys on the build side and using it to eliminate probe-side rows before they reach the most expensive stages of execution. At fact-table scale, the data structure that makes this approach practical is the Bloom filter.

This post examines three layers of the optimization stack: the fundamentals of Bloom filters, how Apache Spark injects Bloom filters as runtime filters ahead of a join, and how Quanton executes the same plan in native code, where the filter implementation itself is redesigned around modern CPU cache behavior.

Bloom filter fundamentals

A Bloom filter answers a single question — might this key belong to the set? — using a bit array and a small number of hash functions, without storing the keys themselves.

  • Insert: hash the key k times and set the corresponding bit positions to 1.
  • Lookup: hash the key using the same k hash functions. If any queried bit is 0, the key was definitely not inserted. If all queried bits are 1, the key may be present.
A bit array of m bits. Two inserted keys each set three bit positions via three hash functions. A lookup for an absent key finds one of its three positions still 0, proving absence. A lookup for another key that was never inserted finds all three of its positions set to 1 by other keys — a false positive reported as 'maybe present'.
One-sided error: a 0 at any probed position is proof of absence, while all 1s is only probability — bits set by other keys can collide into a false positive. There are no deletes and no stored keys, just bits.

Bloom filters guarantee zero false negatives: any key reported as absent is truly absent. False positives are possible, meaning some keys that were never inserted may still be reported as present. This property makes Bloom filters well suited for runtime filtering. Rows that are definitively absent can be discarded immediately, while rows that pass the filter continue to the join operator, where membership is verified exactly.

Two parameters determine the accuracy–space trade-off: the number of bits per key (m/n) and the number of hash functions (k). The false-positive rate is approximately (1 − e^(−kn/m))^k, and is minimized when k = (m/n) · ln 2. In practice, allocating roughly 16 bits (2 bytes) per key yields false-positive rates in the low single-digit percentages. A build side containing tens of thousands of keys can therefore be summarized in a filter ranging from a few kilobytes to around a megabyte, making it inexpensive to distribute across a cluster — the mechanism Spark leverages for runtime filtering.

How Spark injects a Bloom filter before a join

Introduced in Spark 3.3 (SPARK-32268) and enabled by default since Spark 3.4 — including all 4.x releases — the optimizer rule InjectRuntimeFilter examines every equi-join for a specific asymmetry: one side small and selectively filtered, the other a large scan with no selective predicate of its own. When these conditions hold, it rewrites the plan:

Phase 01

Build the filter

bloom_filter_agg(xxhash64(key))

Spark wraps the filtered build side in an aggregate that hashes each join key with XxHash64 and inserts it into a Bloom filter. Each task builds a partial filter; partials are merged with a bitwise OR of their bit arrays.

Phase 02

Distribute the filter

scalar subquery → bitset

The merged filter is the result of a scalar subquery: one value, a serialized bitset. Megabytes at most — cheap to distribute to every probe-side task, unlike the build rows themselves.

Phase 03

Filter the probe side

might_contain(bf, xxhash64(key))

A new predicate appears on the fact-table side, upstream of the join — and critically, upstream of the shuffle. Rows whose keys the filter rejects are dropped before they are ever partitioned, written, or sorted.

Plan diagram: on the build side, a scan of date_dim filtered to d_year = 2024 feeds bloom_filter_agg(xxhash64(d_date_sk)), producing a ~1 MB Bloom filter bitset. A dashed arrow shows the optimizer injecting it into the probe side, where a scan of 10 billion store_sales rows passes through FILTER might_contain(bf, xxhash64(ss_sold_date_sk)). Only ~6% of rows survive to the shuffle and join; dropped rows skip shuffle write/read, sort or hash probe, and downstream operators.
The dimension's selectivity travels sideways into the fact-table scan as a Bloom filter. Rows rejected by might_contain never reach the shuffle — typically the dominant cost of the stage.

Why filtering before the shuffle matters

In a distributed join, probe-side cost accumulates across multiple operators: scanning and decoding data from storage, repartitioning rows, materializing shuffle files, transferring data across the network, and finally performing hash or sort-based join processing. Each row eliminated by the Bloom filter avoids this entire execution path. For selective joins, the filter often removes the majority of probe-side rows before repartitioning, reducing shuffle volume proportionally. Since data exchange is frequently the dominant cost of a distributed join, even modest reductions in probe-side cardinality can translate directly into lower execution time.

Bloom-based runtime filtering complements Spark’s dynamic partition pruning (DPP). DPP propagates exact join-key values from the build side to eliminate entire partitions before they are scanned, but it is applicable only when the join key is also a partitioning column. Bloom filters provide a more general mechanism, operating at row granularity and supporting arbitrary join keys at the cost of a controlled false-positive rate. The optimizations are complementary, and Spark avoids generating redundant Bloom filters when equivalent pruning is already provided by DPP.

Injection thresholds

The rewrite is guarded by thresholds — a runtime filter on a small scan adds overhead without benefit, and a filter built from a large build side is too large to be effective:

ConfigDefaultMeaning
spark.sql.optimizer.runtime.bloomFilter.enabledtrue (since 3.4)Master switch for injection
...bloomFilter.creationSideThreshold10 MBBuild side must estimate below this
...bloomFilter.applicationSideScanSizeThreshold10 GBProbe-side scan must estimate above this
...bloomFilter.expectedNumItems / maxNumItems1 M / 4 MSizing the filter for the build-side key count
...bloomFilter.numBits / maxNumBits8 M / 64 MBit-array budget (1 MB / 8 MB)

One additional condition matters in practice: the build side needs a selective predicate of its own (d_year = 2024). A Bloom filter of an unfiltered dimension contains every key the fact table could ever reference — it would pass everything and filter nothing.

Note that spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled is a separate, related config: instead of a Bloom filter, it injects an exact semi-join reduction on the probe side. It remains false by default and is easy to conflate with the Bloom filter switch above.

How Quanton executes runtime filters natively

Spark’s runtime filter implementation uses a conventional Bloom filter: a flat bit array combined with k hash-derived probe locations per key. At typical filter sizes, the bit array often exceeds L2 cache capacity, while the probe locations are intentionally distributed across the array. As a result, a lookup may require multiple memory accesses with limited spatial locality, causing execution to become increasingly constrained by memory latency rather than computation.

Quanton executes Spark plans on a native, SIMD-vectorized columnar engine. When Spark injects a runtime Bloom filter, Quanton replaces both bloom_filter_agg and might_contain with native implementations, allowing filter construction, aggregation, serialization, and probing to execute entirely outside the JVM. Because the aggregation and probe operators share a common binary representation, they are offloaded together as a unit; execution never combines a JVM-generated filter with a native probe path, or vice versa.

Runtime filter evaluation occupies a particularly performance-sensitive position in the execution plan. The might_contain predicate is evaluated once for every row on the probe side, making it one of the highest-frequency operations in the query. At probe-side cardinalities measured in billions of rows, the memory-access characteristics of the filter become a primary determinant of execution cost.

A blocked Bloom filter design

Quanton’s native implementation reorganizes the filter layout so that all bits associated with a key reside within a single 64-bit block.

  • A single 64-bit hash value is computed for the key.
  • The upper bits select a block within the filter array.
  • Four 6-bit segments of the remaining hash value select bit positions within that block, producing a four-bit mask.

Insertion becomes word |= mask, while lookup becomes (word & mask) == mask. The resulting probe path consists of a single word load, a bitwise AND, and a comparison, all without branches.

Conventional Bloom filter kmemory accesses / probe probe locations distributed across a multi-MB array — multiple accesses to unrelated memory locations
Quanton blocked filter 1block access / probe all 4 bits in one 64-bit block — one load, one AND, one compare, no branches

Compared with a conventional Bloom filter, which may require multiple accesses to unrelated memory locations, the blocked layout limits each lookup to a single block access and substantially improves cache locality.

The filter size is maintained as a power of two, allowing block selection to be implemented with a bit mask rather than integer division. Although division is inexpensive in absolute terms, it remains sufficiently costly to appear in performance profiles when executed billions of times. Other systems address the same issue through alternative techniques; for example, ClickHouse uses branch-free reciprocal-based arithmetic rather than hardware division for filter indexing.

This layout introduces a deliberate trade-off. Constraining all probe bits to a single block increases the false-positive rate relative to a conventional Bloom filter at the same memory budget. At approximately 16 bits per key, a blocked design typically produces false-positive rates near 2%, compared with substantially lower rates for a classic implementation. For runtime join filtering, however, the reduction in memory accesses generally outweighs the increase in false positives. A false positive merely allows an additional row to proceed to the join operator, where it is eliminated exactly, whereas excess memory latency is incurred for every probe-side row.

Vectorized and zero-copy evaluation

The blocked layout improves memory behavior; the execution engine addresses instruction throughput.

Quanton evaluates might_contain over columnar batches rather than individual rows:

  • Keys are hashed in tight vectorized loops over column vectors instead of per-row dispatch paths.
  • Filter evaluation operates over batches and produces selection vectors that identify qualifying rows.
  • Multiple independent probes can be processed concurrently, improving utilization of the processor’s memory subsystem.
  • The probe operator evaluates the serialized filter directly, eliminating deserialization and per-task filter copies.
  • Rows that satisfy the filter flow directly into downstream operators, while rejected rows are discarded before materialization.

Filter construction follows the same native execution model. Individual tasks build local filters, partial results are merged through bitwise OR operations, and the final filter is serialized once for distribution. Filter sizing continues to honor Spark’s existing configuration parameters, including expectedNumItems and numBits, allowing existing tuning practices to carry over unchanged. The native path is enabled by default and can be toggled:

spark.quanton.sql.native.bloomFilter=true

Runtime filtering and sideways information passing

Runtime Bloom filters represent one example of sideways information passing, a broader optimization technique in which information derived by one operator is propagated to another operator to reduce work. Quanton applies this principle in multiple execution paths.

As discussed in our Shuffle Hash Join (SHJ) deep dive, once a hash join has completed construction of its build-side hash table, the operator possesses exact knowledge of the build-side key set. This information can be propagated into the probe-side scan as either an exact value set or a min/max range filter. Runtime Bloom filters and operator-level exact filters serve complementary roles: Bloom filters propagate compact approximations across shuffle boundaries, while exact filters exploit executor-local knowledge where precise filtering can be applied with minimal overhead.

Key takeaways

A Bloom filter is a compact probabilistic data structure, yet when positioned correctly within a query plan it can substantially reduce execution cost by applying join selectivity before the join itself. Spark’s runtime filter framework constructs a filter from the filtered build side, distributes a compact representation across the cluster, and removes non-matching probe-side rows before they participate in shuffle and join processing.

Quanton preserves Spark’s execution strategy while replacing the underlying implementation. Runtime filters are represented using a blocked layout optimized for cache locality, evaluated through branch-free vectorized loops, and probed directly from their serialized representation within the native scan pipeline. Although the blocked design increases the false-positive rate modestly, it significantly reduces memory-access overhead. At probe-side cardinalities measured in billions of rows, the reduction in per-row execution cost dominates the additional false-positive work.

RM
Data Infrastructure at Onehouse · Building Quanton
← Back to blog