GROUP BY ROLLUP Without the Row Explosion

What ROLLUP actually computes

GROUP BY ROLLUP is the SQL standard’s way of asking for a hierarchy of subtotals in a single query. Given a set of grouping columns, it produces aggregates at every prefix of that list — from the finest grain all the way up to the grand total.

SELECT i_category, i_class, SUM(ss_sales_price) AS revenue
FROM store_sales JOIN item ON ss_item_sk = i_item_sk
GROUP BY ROLLUP(i_category, i_class);

For two grouping keys, ROLLUP(a, b) returns three levels of aggregation:

(a, b) — revenue per class within each category
(a) — revenue per category (subtotal across all its classes)
() — total revenue (grand total across everything)

In general, ROLLUP over N keys produces N+1 levels: drop the rightmost key, aggregate again; drop the next, aggregate again; until nothing is left to group by. Each output row carries a grouping_id — a bitmask that tells you which level it belongs to, so a subtotal row (i_class is NULL) is distinguishable from a genuine NULL class.

It’s an enormously useful operator for reporting and OLAP — and 11 of the 99 TPC-DS queries use it (q5, q14a, q18, q22, q27, q36, q67, q70, q77, q80, q86). The interesting question is how an engine executes it.

How Spark executes ROLLUP: the Expand operator

Spark doesn’t have a rollup operator. It lowers ROLLUP into operators it already has — a plan that looks like this:

HashAggregate (final)
  └─ Exchange (shuffle on grouping keys + grouping_id)
       └─ HashAggregate (partial)
            └─ Expand            ◀── the problem
                 └─ <join / scan output>

The mechanism — and the problem — is the Expand operator. To produce N+1 levels with a single downstream aggregation, Expand takes each input row and emits N+1 copies of it — one per rollup level. Each copy has the rolled-up keys for that level NULL’d out and is tagged with the appropriate grouping_id. The partial HashAggregate then groups by {all keys} ∪ {grouping_id}, which collapses each level back down.

It’s correct, and it reuses machinery the engine already has. But look at what happens to the row count before aggregation: every row entering Expand leaves as N+1 rows. The deeper the rollup, the worse the amplification.

Rows in 1× the join/scan output feeding the aggregation

Rows out of Expand N+1× one copy per rollup level — all of it hits the partial hash table

On TPC-DS Query 67 — the deepest rollup in the benchmark, with 8 grouping keys and 9 levels — Expand fans every input row into 9 copies. At 10 TB, that means the partial aggregation has to hash and probe roughly 9× as many rows, with each of the 9 levels re-shuffled across the cluster. That amplification is pure overhead — CPU spent on redundant hashing, bandwidth spent on redundant shuffle — and it scales with the depth of the rollup.

Vectorizing it isn’t enough

Apache Gluten pushes Spark’s operators down onto a native, vectorized C++ backend. For most operators that’s a large win — SIMD hashing, columnar batches, no JVM object churn. And it does help here: Gluten runs Expand and HashAggregate as vectorized ExpandExecTransformer / HashAggregateExecTransformer nodes, so each row is processed far more efficiently than in JVM Spark.

But vectorization speeds up the work; it doesn’t remove it. Gluten still inherits Spark’s plan shape, which means it still runs Expand, which means it still materializes the same N+1× row explosion — just faster. You’re paying for a 9× amplification with a faster CPU. The algorithm is the bottleneck, not the implementation.

How Quanton executes ROLLUP: a rollup operator

Quanton takes the input-side explosion out of the plan. Instead of fanning every input row out into N+1 copies and letting a downstream hash aggregate collapse them, Quanton introduces a rollup operator that derives all N+1 levels by re-aggregating already-collapsed state. The coarse levels still shuffle — the grand-total level collapses to a single group either way — but the N+1× amplification on the input side is gone.

The plan loses the Expand node completely:

HashAggregate (final)
  └─ Exchange
       └─ Rollup                         ◀── rollup operator, N+1 levels
            └─ HashAggregate (partial)    ◀── groups by K keys, not K ∪ {grouping_id}
                 └─ <join / scan output>

Quanton recognizes the partial HashAggregate ← Expand shape in the Spark plan and rewrites it to the rollup operator automatically — no query changes required. It’s on by default, and falls back cleanly to the Expand path wherever the re-merge doesn’t apply. Aggregating the aggregates works for decomposable aggregates — SUM, COUNT, MIN, MAX, AVG — where a coarser level can be merged from a finer one. Holistic aggregates like COUNT(DISTINCT …), MEDIAN, or PERCENTILE can’t be merged that way and take the Expand path.

TPC-DS 10TB Q67

Here’s the same TPC-DS Query 67 — 8 rollup keys — run end to end at 10 TB scale, on each engine. Same SQL, same result, same hardware.

Horizontal bar chart of TPC-DS 10TB Query 67 end-to-end runtime: OSS Spark 4.4 minutes, Gluten 2.7 minutes, Quanton 55 seconds. Lower is faster.

Across the full TPC-DS suite

Query 67 is the dramatic case because it’s the deepest rollup, but the rollup operator touches only the 11 rollup queries — and only their aggregation phase. The suite-wide numbers below come from Quanton broadly, not this one operator: Quanton is an engine that’s aware of the data it reads and writes and leverages metadata and auxiliary data structures to the fullest — on top of vectorized execution and query-plan optimizations. Across joins, scans, shuffle, aggregation, and other Spark SQL operators, the picture over the entire 99-query TPC-DS suite is consistent:

Quanton vs. OSS Spark 5× faster across the full TPC-DS suite

Quanton vs. Gluten 3.5× faster across the full TPC-DS suite

Key takeaways

GROUP BY ROLLUP computes N+1 levels of subtotals — every prefix of the grouping keys, from finest grain up to the grand total.
Spark executes it with an Expand operator that copies every input row N+1 times. The amplification grows with the depth of the rollup, burning CPU on redundant hashing and bandwidth on redundant shuffle.
Gluten vectorizes the same plan — faster per row, but it still runs Expand, so it still pays the full amplification.
Quanton uses a rollup operator that re-aggregates already-collapsed state for each coarser level instead of exploding the input. The work scales with the aggregate, not with N+1 copies of the input — for decomposable aggregates; holistic ones (COUNT(DISTINCT), MEDIAN, …) fall back to Expand.
On TPC-DS Q67 at 10 TB the difference is 4.4 min → 2.7 min → 55 s across OSS Spark, Gluten, and Quanton. Across the full suite, Quanton runs 5× faster than OSS Spark and 3.5× faster than Gluten — same SQL, same results.