Agentic AI
Quanton embeds an AI agent directly into the Apache Spark Web UI. While your job is running, it automatically analyzes stages, tasks, executor metrics, and SQL plans — then surfaces recommendations, alerts, and answers in plain English.
No changes to your job code. No external tooling. Just open the Spark UI.

Setup
Starting with Quanton Operator 2.0, the agent ships with the operator. You can enable it in two ways.
Option 1: Operator-wide (recommended)
When installing the operator, set onehouseConfig.enableAIAgent=true. Every QuantonSparkApplication submitted afterwards picks up the agent automatically — no per-job config.
helm upgrade --install quanton-operator oci://registry-1.docker.io/onehouseai/quanton-operator \
  --namespace quanton-operator \
  --create-namespace \
  --set "quantonOperator.jobNamespaces={default}" \
  --set onehouseConfig.enableAIAgent=true \
  -f onehouse-values.yaml
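To confirm the flag took effect after the install, one option is to read the release's values back from Helm. The sketch below is a minimal check, not part of the operator: the function name agent_enabled is hypothetical, and it assumes the release name and namespace from the command above and that the flag appears as an enableAIAgent key nested under onehouseConfig.

```shell
# Hypothetical helper: succeeds (exit 0) if the release's user-supplied
# values contain "enableAIAgent: true".
# Arguments: $1 = release name, $2 = namespace (both default to quanton-operator).
agent_enabled() {
  helm get values "${1:-quanton-operator}" \
    --namespace "${2:-quanton-operator}" \
    --output yaml |
    grep -Eq '^[[:space:]]+enableAIAgent:[[:space:]]*true[[:space:]]*$'
}
```

For example, `agent_enabled && echo "agent on"` prints the message only when the flag is set. If you set the flag only inside onehouse-values.yaml rather than with --set, the same check still applies, since both end up in the user-supplied values.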
Option 2: Per-job
If you want the agent on for some jobs but not others, leave the operator default and add two sparkConf keys to the QuantonSparkApplication:
spec:
  sparkApplicationSpec:
    sparkConf:
      spark.plugins: "ai.quanton.spark.agent.SparkAgentPlugin"
      spark.quanton.agent.enabled: "true"
Try it on a real workload
The operator repo ships a self-contained TPC-DS example with the agent enabled — a meaningful workload for the agent to reason about (1000+ stages, 99 SQL queries, real shuffle and join activity).
git clone https://github.com/onehouseinc/quanton-operator
cd quanton-operator/examples/tpcds-agent
./run.sh # default — SF=10 (~10-15 min datagen, ~8-30 min queries)
# or smaller for a quick demo:
SCALE_FACTOR=1 ./run.sh # ~3-5 min datagen, ~2-3 min queries
See examples/tpcds-agent/README.md for prereqs and details.
Open the agent UI
Once the driver pod is Running:
kubectl port-forward <driver-pod-name> 4040:4040 -n default
Open http://localhost:4040. The AI Agent button appears in the top-right corner.
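If you would rather not look up the driver pod name by hand, the two steps above can be scripted. This is a sketch under one assumption: that the driver pod carries the spark-role=driver label Spark applies to driver pods on Kubernetes. The function names find_driver_pod and open_agent_ui are made up for this example.

```shell
# Print the name of the first Running Spark driver pod in a namespace.
# Assumes the driver pod is labeled spark-role=driver.
find_driver_pod() {
  kubectl get pods \
    --namespace "${1:-default}" \
    --selector spark-role=driver \
    --field-selector status.phase=Running \
    --output jsonpath='{.items[0].metadata.name}'
}

# Forward local port 4040 to the driver's Spark UI.
# Blocks until interrupted (Ctrl-C) or until the driver pod exits.
open_agent_ui() {
  ns="${1:-default}"
  pod="$(find_driver_pod "$ns")"
  if [ -z "$pod" ]; then
    echo "no Running driver pod found in namespace $ns" >&2
    return 1
  fi
  kubectl port-forward "$pod" 4040:4040 --namespace "$ns"
}
```

Run `open_agent_ui default`, then open http://localhost:4040 in a browser. If several jobs run in the same namespace, this picks the first driver returned, so narrow the selector in that case.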
Add your LLM key
The agent is provider-agnostic. Click the AI Agent button → Settings → paste your API key. The provider (Anthropic, OpenAI, Google) is auto-detected from the key prefix.

| Provider | Models |
|---|---|
| Anthropic | claude-opus-4-6, claude-sonnet-4-6 |
| Google Gemini | gemini-2.0-flash, gemini-2.5-pro |
| OpenAI | gpt-4o, gpt-4.1 |
Your API key is stored in browser localStorage and never sent to Onehouse servers.
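To illustrate what prefix-based detection can look like, here is a small sketch. It is not the agent's actual logic, which this page does not describe. It assumes the commonly seen key formats: Anthropic keys starting with sk-ant-, Google API keys starting with AIza, and OpenAI keys starting with sk-. The function name detect_provider is hypothetical.

```shell
# Hypothetical illustration of provider detection by API key prefix.
# The prefixes below are assumptions based on common key formats,
# not the agent's documented rules.
detect_provider() {
  case "$1" in
    sk-ant-*) echo "anthropic" ;;  # must be checked before the broader sk-* pattern
    AIza*)    echo "google" ;;
    sk-*)     echo "openai" ;;
    *)        echo "unknown"; return 1 ;;
  esac
}
```

Order matters here: Anthropic keys also match the OpenAI-style sk- prefix, so the more specific pattern has to come first.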
What it does
Chat
Ask questions about the running job in natural language. The agent has full context — stages, tasks, SQL plans, executor logs — and answers with specific observations about your job, not generic Spark advice.
"How many stages have completed?" "Why is stage 4 taking so long?" "Which executor is causing the skew?"

Diagnostics
Real-time health alerts for the conditions that slow jobs down most, with the underlying analysis shown alongside each finding:
- Data skew
- Shuffle spill to disk
- GC pressure
- OOM risk
- Straggler tasks

Monitor
Live executor and SQL metrics in a single view alongside your job:
- GC Pressure — fraction of executor time spent in garbage collection
- Skew Ratio — how unbalanced task durations are within a stage
- Spill (GB) — total memory + disk spill across active stages
- Task Duration (s) — per-task runtime distribution, with stragglers highlighted
- Shuffle info — bytes read/written, shuffle partitions, fetch wait time
- Executor info — CPU, heap, active/failed task counts per executor
- SQL / Cache info — query plans, accelerated stages, cached datasets
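To make the first two metrics concrete, the sketch below computes them from raw numbers. The formulas are assumptions for illustration, since this page does not give the agent's exact definitions: GC pressure is taken as GC time divided by total task time, and skew ratio as the longest task duration divided by the median. The inputs could come from Spark's monitoring REST API (for example the totalGCTime and totalDuration fields of the executors endpoint, or per-task duration values from a stage's task list), but the function names here are hypothetical.

```shell
# Assumed definition: fraction of executor task time spent in GC.
# Arguments: $1 = GC time in ms, $2 = total task time in ms.
gc_pressure() {
  awk -v gc="$1" -v total="$2" 'BEGIN {
    if (total + 0 > 0) printf "%.3f\n", gc / total
    else print "n/a"
  }'
}

# Assumed definition: max task duration divided by the median duration.
# Reads one duration per line on stdin.
skew_ratio() {
  sort -n | awk '
    { v[NR] = $1 + 0 }
    END {
      if (NR == 0) { print "n/a"; exit }
      mid = int((NR + 1) / 2)
      median = (NR % 2) ? v[mid] : (v[mid] + v[mid + 1]) / 2
      if (median == 0) { print "n/a"; exit }
      printf "%.2f\n", v[NR] / median
    }'
}
```

For example, `gc_pressure 1500 30000` prints 0.050 (5% of task time in GC), and `printf '10\n10\n10\n40\n' | skew_ratio` prints 4.00, meaning the slowest task ran four times longer than the typical one.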

Notes
- The agent UI is reachable only while the driver pod is running. When the SparkContext shuts down at job end, the Spark UI (and the agent) goes with it. For longer interactive sessions, run a bigger workload (SCALE_FACTOR=10 or higher) so queries stay in flight.
- For very large runs (thousands of stages or hundreds of SQL queries), you can bump the agent's in-memory context limits via spark.quanton.agent.context.max.completed.stages (default is lower). The example above keeps the defaults since SF=10 fits comfortably.