Overview
Native execution engines have become the new way to make Apache Spark run faster. Projects like Apache Gluten (with Velox) and Apache DataFusion Comet push the heavy lifting of a query — scans, joins, aggregations — out of the JVM and into vectorized C++ or Rust. The performance case is real, and both are healthy open-source projects that have moved the whole ecosystem forward.
But running them at scale surfaces a reliability wall that this post is about: executors that die with exit code 137 (OOMKilled), the signal a container receives when the orchestrator decides the process has used more memory than it was allowed.
On Kubernetes it shows up like this:
The API gave the following container statuses:
container name: spark-kubernetes-executor
container state: terminated
exit code: 137
termination reason: OOMKilled
What makes these kills frustrating is that they often happen even when the engine is accounting for and freeing its memory correctly. The query logic is fine. The spill code runs. And the container still dies. This post looks at why that happens — the structural reasons native accelerators are prone to off-heap OOM kills — and how Quanton is built to make progress reliably under the same memory pressure without giving up performance on the hardware you already have.
Two worlds of memory
A vanilla Spark executor lives almost entirely inside the JVM heap. The garbage collector tracks every object, and when memory gets tight, Spark spills to disk and the JVM reclaims space. It can be slow, but it is bounded and observable — the heap has a ceiling, and the runtime knows where every byte is.
A native accelerator splits the executor into two worlds:
The JVM heap
Spark's driver-side planning, task scheduling, and the thin layer that hands work to the native engine still live here, under the garbage collector — the part of the executor Spark has always understood.
Native off-heap memory
The vectors, hash tables, sort buffers, and shuffle buffers that do the real work now live outside the JVM, allocated through C/C++ malloc. No garbage collector watches this region — and crucially, the container's memory limit is enforced against it all the same.
The container limit does not care which world a byte lives in. It sees one number: resident set size (RSS) — the total physical memory the process holds. The moment RSS crosses the limit, the Kubernetes container is OOMKilled (or, on YARN, the container is killed for exceeding its physical memory limit) — regardless of how carefully the engine thinks it is managing its share. Native accelerators move most of a query’s working set into the second world, which is precisely the world the container limit is hardest to reason about.
The problem: where the kills come from
Off-heap OOM kills in native accelerators come from two compounding layers. Neither is a bug in the usual sense — they are consequences of the architecture — but together they make memory pressure far more likely to end in a kill than in a graceful slowdown.
Layer one: the engine can’t always see or reclaim its own memory
A native engine reports its memory usage back to Spark so Spark can decide when to spill. That reporting only works if every allocation is tracked and every operator can give memory back when asked. In practice there are gaps:
- Memory that escapes accounting. Some structures get allocated outside the tracked off-heap pool, so the engine’s own bookkeeping understates true usage — and Spark never triggers the spill it otherwise would.
- Spill requests that don’t actually free memory. When Spark asks an operator to spill, the released bytes are supposed to come back. If the spill path reports zero, or releases the wrong amount, Spark believes no memory was reclaimed and has no lever left to relieve the pressure.
- Operators that can’t spill at all. Some operators have no spill path, so when their working set exceeds the budget the only outcomes are “allocate” or “fail” — there is no graceful fallback to disk.
When any of these gaps is hit, the engine sails past the point where Spark would normally have intervened — and walks straight into the container limit.
Layer two: the allocator keeps memory the engine already freed
This is the subtler and, for many teams, the more surprising cause. Suppose the engine does free everything correctly. RSS can still climb and stay high — because freeing memory inside the engine does not mean the memory is returned to the operating system.
That decision belongs to the C library’s memory allocator sitting underneath the engine, and the default general-purpose allocator on most Linux systems behaves in a way that is unkind to this workload:
- Per-thread arenas multiply on many-core hosts. To reduce lock contention, the default allocator creates a separate memory arena per thread, scaling with core count. A native Spark executor on a modern many-core node can end up with dozens of arenas.
- Freed memory is retained, not returned. Each arena holds onto freed blocks to satisfy future allocations quickly. Across many arenas, this means large amounts of freed memory stay resident in the process — counted against the container limit — long after the engine is done with it.
- Fragmentation inflates the high-water mark. Native query execution allocates buffers of wildly varying sizes in bursty patterns (a big hash table here, thousands of small vectors there). General-purpose allocators fragment under this pattern: free space exists but is scattered across arenas in pieces too small or too misplaced to reuse, so the allocator keeps requesting new pages from the OS instead. RSS tracks the worst moment the process ever saw, not what it currently needs.
The net effect: RSS reflects the allocator’s retained high-water mark, not the engine’s live memory. A query that genuinely needs 4 GB at any instant can hold 9 GB resident because freed memory is parked across arenas and never handed back. The engine’s accounting says it is well under budget; the container sees a process over the limit and OOMKills it.
This is a well-understood property of general-purpose allocators under this kind of workload, not a defect specific to any one engine. The two layers compound: weak reclamation pushes the engine closer to the edge, and allocator retention raises the floor it is operating above. Either alone is survivable; together they turn ordinary memory pressure into a kill.
What this looks like on TPC-DS at 10TB
This is not theoretical. Running the full 99-query TPC-DS benchmark at 10 TB (Parquet) on a fixed cluster, both OSS Apache Gluten and Apache DataFusion Comet are OOMKilled mid-query — on the same hardware and memory budget where Quanton completes all 99.
How Quanton runs reliably under pressure
Quanton runs the same kind of native, vectorized engine — so it inherits the same physics. The difference is that it is engineered, top to bottom, to keep its memory footprint close to live usage and to handle memory pressure more reliably instead of fatally. Three things work together:
A memory allocator built for lakehouse workloads
Quanton ships with a modern memory allocator purpose-built for exactly what a native engine does — many threads, bursty mixed-size allocations, and long-running processes. It is the default for every executor, with nothing to hand-tune.
Native memory bounded under the executor limit
Quanton's engine memory is tracked against a single budget that lives under the container limit, so the engine knows its true footprint rather than an understated one. No broadcast or build-side structure quietly escapes accounting and surprises the orchestrator.
Operators spill to disk under pressure
When pressure builds, operators spill to disk rather than failing an allocation. Spill releases real memory and reports it accurately, so the runtime always has a lever to pull. The worst case under memory pressure is a slower query, not a dead executor.
None of this trades away speed. A memory allocator designed for highly concurrent workloads is also faster at serving allocations across many threads — so running reliably under memory pressure does not mean running slower. Quanton stays within its memory budget and holds the same throughput on the hardware you already have.
Key takeaways
Native accelerators get OOMKilled for reasons that have little to do with the query and everything to do with how off-heap memory is allocated and tracked — which is what turns memory pressure into a kill rather than a slowdown.
Quanton treats this as a first-class design constraint: a memory allocator built for lakehouse workloads, accounting that knows the engine’s true footprint, and spill to disk under pressure. The result is an engine that keeps making progress reliably under memory pressure, on the hardware you already have — without trading away the performance that made native acceleration worth adopting.