Agentic AI

Quanton embeds an AI agent directly into the Apache Spark Web UI. While your job is running, it automatically analyzes stages, tasks, executor metrics, and SQL plans — then surfaces recommendations, alerts, and answers in plain English.

No changes to your job code. No external tooling. Just open the Spark UI.

Spark UI with the AI Agent button

Setup

Starting with Quanton Operator 2.0, the agent ships with the operator. There are two ways to enable it.

Option 1: Operator-wide

When installing the operator, set onehouseConfig.enableAIAgent=true. Every QuantonSparkApplication submitted afterwards picks up the agent automatically, with no per-job config.

helm upgrade --install quanton-operator oci://registry-1.docker.io/onehouseai/quanton-operator \
  --namespace quanton-operator \
  --create-namespace \
  --set "quantonOperator.jobNamespaces={default}" \
  --set onehouseConfig.enableAIAgent=true \
  -f onehouse-values.yaml

Option 2: Per-job

If you want the agent on for some jobs but not others, leave the operator default and add two sparkConf keys to the QuantonSparkApplication:

spec:
  sparkApplicationSpec:
    sparkConf:
      spark.plugins: "ai.quanton.spark.agent.SparkAgentPlugin"
      spark.quanton.agent.enabled: "true"

Try it on a real workload

The operator repo ships a self-contained TPC-DS example with the agent enabled — a meaningful workload for the agent to reason about (1000+ stages, 99 SQL queries, real shuffle and join activity).

git clone https://github.com/onehouseinc/quanton-operator
cd quanton-operator/examples/tpcds-agent
./run.sh # default — SF=10 (~10-15 min datagen, ~8-30 min queries)
# or smaller for a quick demo:
SCALE_FACTOR=1 ./run.sh # ~3-5 min datagen, ~2-3 min queries

See examples/tpcds-agent/README.md for prereqs and details.

Open the agent UI

Once the driver pod is Running:

kubectl port-forward <driver-pod-name> 4040:4040 -n default

Open http://localhost:4040. The AI Agent button appears in the top-right corner.

Add your LLM key

The agent is provider-agnostic. Click the AI Agent button → Settings → paste your API key. The provider (Anthropic, OpenAI, Google) is auto-detected from the key prefix.

AI Agent Settings — paste your API key

Provider         Models
Anthropic        claude-opus-4-6, claude-sonnet-4-6
Google Gemini    gemini-2.0-flash, gemini-2.5-pro
OpenAI           gpt-4o, gpt-4.1

Your API key is stored in browser localStorage and never sent to Onehouse servers.
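
As a rough illustration of prefix-based provider detection, here is a minimal Python sketch. The agent's actual detection logic is not documented here. The prefixes below (sk-ant- for Anthropic, AIza for Google, sk- for OpenAI) are assumptions based on the key formats those providers commonly issue, and the name detect_provider is hypothetical.

```python
from typing import Optional

# Assumed prefixes, based on the key formats these providers commonly issue.
# Order matters: Anthropic keys also start with "sk-", so the more specific
# "sk-ant-" prefix is checked before the generic OpenAI "sk-" prefix.
KEY_PREFIXES = (
    ("sk-ant-", "Anthropic"),
    ("AIza", "Google Gemini"),
    ("sk-", "OpenAI"),
)


def detect_provider(api_key: str) -> Optional[str]:
    """Return the provider name guessed from the key prefix, or None."""
    key = api_key.strip()
    for prefix, provider in KEY_PREFIXES:
        if key.startswith(prefix):
            return provider
    return None
```

A key that matches none of the prefixes yields None, which a UI could use to ask the user to pick a provider manually.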

What it does

Chat

Ask questions about the running job in natural language. The agent has full context — stages, tasks, SQL plans, executor logs — and answers with specific observations about your job, not generic Spark advice.

"How many stages have completed?" "Why is stage 4 taking so long?" "Which executor is causing the skew?"

AI Agent Chat tab

Diagnostics

Real-time health alerts for the conditions that slow jobs down most, with the underlying analysis shown alongside each finding:

  • Data skew
  • Shuffle spill to disk
  • GC pressure
  • OOM risk
  • Straggler tasks
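
To make one of these conditions concrete, the sketch below flags straggler tasks as those whose duration exceeds a multiple of the stage's median task duration, similar in spirit to the median-multiplier rule behind Spark's speculative execution. How the agent itself detects stragglers is not specified here; find_stragglers and the default multiplier of 3.0 are hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(durations_s: Dict[int, float], multiplier: float = 3.0) -> List[int]:
    """Return IDs of tasks slower than multiplier x the median duration.

    durations_s maps task ID -> task duration in seconds for one stage.
    The multiplier is an illustrative threshold, not an agent default.
    """
    if not durations_s:
        return []
    threshold = multiplier * median(durations_s.values())
    return sorted(task_id for task_id, d in durations_s.items() if d > threshold)
```

For a stage with task durations of 10, 12, 11, and 95 seconds, the median is 11.5 seconds, so only the 95-second task crosses the 34.5-second threshold.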

AI Agent Diagnostics view with stage analysis

Monitor

Live executor and SQL metrics in a single view alongside your job:

  • GC Pressure — fraction of executor time spent in garbage collection
  • Skew Ratio — how unbalanced task durations are within a stage
  • Spill (GB) — total memory + disk spill across active stages
  • Task Duration (s) — per-task runtime distribution, with stragglers highlighted
  • Shuffle info — bytes read/written, shuffle partitions, fetch wait time
  • Executor info — CPU, heap, active/failed task counts per executor
  • SQL / Cache info — query plans, accelerated stages, cached datasets
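
The first two metrics can be approximated by hand. In this minimal sketch, GC pressure follows the definition above (GC time over total executor time). The skew ratio formula is an assumption, since the exact definition is not given here: this version divides the longest task duration by the median. The totalGCTime and totalDuration names in the comments are the per-executor fields exposed by Spark's monitoring REST API; the function names are hypothetical.

```python
from statistics import median
from typing import Sequence


def gc_pressure(total_gc_time_ms: float, total_duration_ms: float) -> float:
    """Fraction of executor task time spent in garbage collection.

    Inputs correspond to totalGCTime and totalDuration in Spark's
    /api/v1/applications/<app-id>/executors response.
    """
    if total_duration_ms <= 0:
        return 0.0
    return total_gc_time_ms / total_duration_ms


def skew_ratio(task_durations_s: Sequence[float]) -> float:
    """Assumed definition: longest task duration divided by the median."""
    if not task_durations_s:
        return 1.0
    longest = max(task_durations_s)
    mid = median(task_durations_s)
    if mid <= 0:
        # Degenerate stage: all-zero durations count as balanced.
        return 1.0 if longest <= 0 else float("inf")
    return longest / mid
```

Under this definition a perfectly balanced stage scores 1.0, and a stage where one task runs four times longer than the typical task scores 4.0.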

AI Agent Monitor tab

Notes

  • The agent UI is reachable only while the driver pod is running. When the SparkContext shuts down at job end, the Spark UI (and the agent) goes with it. For longer interactive sessions, run a bigger workload (SCALE_FACTOR=10 or higher) so queries stay in flight.
  • For very large runs (thousands of stages or hundreds of SQL queries), you can raise the agent's in-memory context limits, for example spark.quanton.agent.context.max.completed.stages (the default is set lower than such runs require). The example above keeps the defaults because SF=10 fits comfortably.
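
Because the UI disappears with the driver, a script may want to confirm it is still reachable before pointing someone at it. A minimal sketch, assuming the port-forward above is active on localhost:4040: /api/v1/applications is Spark's standard monitoring REST endpoint, and spark_ui_reachable is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def spark_ui_reachable(base_url: str = "http://localhost:4040", timeout: float = 2.0) -> bool:
    """Return True if the Spark UI's REST API answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/v1/applications"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat the UI as gone.
        return False
```

A False result usually means either the job has finished or the port-forward has dropped, so checking the driver pod's status is the natural next step.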