Monitoring Jobs

Quanton exposes job observability through the Apache Spark Web UI (event logs) and driver logs. Operator-level metrics are available via Prometheus.

Spark UI (event logs)

The Spark UI shows stages, tasks, SQL plans, executor metrics, and shuffle I/O for a running or completed job.

Access the Spark UI locally

While the driver pod is running, port-forward it:

# Find the driver pod name
kubectl get pods -n <namespace> | grep driver

# Port-forward to local port 4040
kubectl port-forward <driver-pod-name> 4040:4040 -n <namespace>

Open http://localhost:4040.

Note: The Spark UI is only available while the driver pod is alive. Once the job completes, the driver pod is terminated and the UI is no longer reachable. Enable the Spark History Server if you need access after the job finishes.
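For post-completion access, a job can write Spark event logs to shared storage that a History Server reads. A minimal sketch of the standard Spark properties involved (the s3a path is a placeholder, and how these properties are set in a Quanton job spec may differ):

```properties
# Write event logs as the job runs (placeholder bucket path)
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://my-bucket/spark-events

# Point the History Server at the same location
spark.history.fs.logDirectory=s3a://my-bucket/spark-events
```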

Event log retention

Event logs are retained for 7 days after a job completes or is terminated.

Driver logs

Stream driver logs directly with kubectl:

kubectl logs -f <driver-pod-name> -n <namespace>

To filter for errors, exceptions, and warnings:

kubectl logs <driver-pod-name> -n <namespace> | grep -i "error\|exception\|warn"

Custom log output (Python)

Use the driver JVM's Log4j logger (via the Py4J gateway) to emit logs from Python jobs, so they appear alongside Spark's own driver logs:

from pyspark.sql import SparkSession


class LoggerProvider:
    def get_logger(self, spark: SparkSession, custom_prefix: str = ""):
        # Access the JVM-side Log4j logger through the Py4J gateway
        log4j_logger = spark._jvm.org.apache.log4j
        return log4j_logger.LogManager.getLogger(custom_prefix + self.__class__.__name__)


class MyJob(LoggerProvider):
    def __init__(self):
        self.spark = SparkSession.builder.appName("MyJob").getOrCreate()
        self.logger = self.get_logger(self.spark)

    def run(self):
        self.logger.info("Starting job")
        # ... your job logic
        self.logger.info("Job complete")

Custom log output (Java/Scala)

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class MyJob {
    private static final Logger logger = LogManager.getLogger(MyJob.class);

    public static void main(String[] args) {
        logger.info("Starting job");
        // ... job logic
    }
}

Operator metrics (Prometheus)

The Quanton Operator exposes Prometheus metrics on port 8080 at /metrics. Scrape this endpoint from your monitoring stack.
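As a sketch, a static Prometheus scrape config for the operator could look like the following. The job name and target address are assumptions; in a Kubernetes deployment you would more likely discover the operator via a ServiceMonitor or pod annotations:

```yaml
scrape_configs:
  - job_name: quanton-operator          # assumed job name
    metrics_path: /metrics
    static_configs:
      # Placeholder service address; substitute your operator's Service and namespace
      - targets: ["quanton-operator.quanton-system.svc:8080"]
```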

Key metrics to watch

| Use case | Metric |
| --- | --- |
| Reconciliation errors | quanton_reconciliations_total{status!="success"} |
| Slow reconciliation | quanton_reconciliation_duration_seconds |
| Submission failures | quanton_spark_application_submissions_total{status="failure"} |
| Jobs by phase | quanton_spark_applications_by_phase |
| Control plane health | quanton_gateway_calls_total, quanton_gateway_call_duration_seconds |
| Work queue backup | workqueue_depth, workqueue_unfinished_work_seconds |

Job phase tracking

# Count of jobs currently running
quanton_spark_applications_by_phase{phase="RUNNING"}

# Submission failure rate (5m window)
rate(quanton_spark_application_submissions_total{status="failure"}[5m])
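Queries like these can also back Prometheus alerting rules. A hedged example; the alert name, threshold, and labels are illustrative choices, not prescribed by Quanton:

```yaml
groups:
  - name: quanton-jobs
    rules:
      - alert: QuantonSubmissionFailures        # illustrative alert name
        # Fire when any submission failures occurred over the last 5 minutes
        expr: rate(quanton_spark_application_submissions_total{status="failure"}[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Quanton Spark application submissions are failing
```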

OpenTelemetry

The operator ships a kube-state-metrics sidecar that collects pod-level metrics (CPU, memory requests/limits, lifecycle timestamps) from Spark driver and executor pods. These are forwarded to the Onehouse control plane via OpenTelemetry.

AI-powered diagnostics

The Quanton Agent AI sidebar is injected into every Spark Web UI page, providing:

  • Real-time chat about the running job (stages, SQL plans, executor metrics)
  • Auto-generated recommendations based on observed behavior
  • Health alerts for spill, GC pressure, skew, OOM, and straggler tasks
  • Live executor metrics — CPU, heap, shuffle IO, disk IO

See Agent AI for setup instructions.