Monitoring Jobs

Quanton exposes job observability through the Apache Spark Web UI (event logs) and driver logs. Operator-level metrics are available via Prometheus.

Spark UI (event logs)

The Spark UI shows stages, tasks, SQL plans, executor metrics, and shuffle I/O for a running or completed job.

Access the Spark UI locally

While the driver pod is running, port-forward it:

# Find the driver pod name
kubectl get pods -n <namespace> | grep driver

# Port-forward to local port 4040
kubectl port-forward <driver-pod-name> 4040:4040 -n <namespace>

Open http://localhost:4040.

Note: The Spark UI is only available while the driver pod is alive. Once the job completes, the driver pod is terminated and the UI is no longer reachable. Enable the Spark History Server if you need access after the job finishes.
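For post-completion access, a job can write Spark event logs to shared storage that a History Server reads. A minimal sketch of the standard Spark properties involved (the s3a path is a placeholder, and how these properties are set in a Quanton job spec may differ):

```properties
# Write event logs as the job runs (placeholder bucket path)
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://my-bucket/spark-events

# Point the History Server at the same location
spark.history.fs.logDirectory=s3a://my-bucket/spark-events
```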

Event log retention

Event logs are retained for 7 days after a job completes or is terminated.

Driver logs

Stream driver logs directly with kubectl:

kubectl logs -f <driver-pod-name> -n <namespace>

To filter for errors, exceptions, and warnings:

kubectl logs <driver-pod-name> -n <namespace> | grep -i "error\|exception\|warn"

Custom log output (Python)

Use the driver JVM's Log4j logger (via the Py4J gateway) to emit logs from Python jobs, so they appear alongside Spark's own driver logs:

from pyspark.sql import SparkSession


class LoggerProvider:
    def get_logger(self, spark: SparkSession, custom_prefix: str = ""):
        # Access the JVM-side Log4j logger through the Py4J gateway
        log4j_logger = spark._jvm.org.apache.log4j
        return log4j_logger.LogManager.getLogger(custom_prefix + self.__class__.__name__)


class MyJob(LoggerProvider):
    def __init__(self):
        self.spark = SparkSession.builder.appName("MyJob").getOrCreate()
        self.logger = self.get_logger(self.spark)

    def run(self):
        self.logger.info("Starting job")
        # ... your job logic
        self.logger.info("Job complete")

Custom log output (Java/Scala)

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class MyJob {
    private static final Logger logger = LogManager.getLogger(MyJob.class);

    public static void main(String[] args) {
        logger.info("Starting job");
        // ... job logic
    }
}

Operator metrics (Prometheus)

The Quanton Operator exposes Prometheus metrics on port 8080 at /metrics. Scrape this endpoint from your monitoring stack.
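As a sketch, a static Prometheus scrape config for the operator could look like the following. The job name and target address are assumptions; in a Kubernetes deployment you would more likely discover the operator via a ServiceMonitor or pod annotations:

```yaml
scrape_configs:
  - job_name: quanton-operator          # assumed job name
    metrics_path: /metrics
    static_configs:
      # Placeholder service address; substitute your operator's Service and namespace
      - targets: ["quanton-operator.quanton-system.svc:8080"]
```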

Key metrics to watch

| Use case | Metric |
| --- | --- |
| Reconciliation errors | quanton_reconciliations_total{status!="success"} |
| Slow reconciliation | quanton_reconciliation_duration_seconds |
| Submission failures | quanton_spark_application_submissions_total{status="failure"} |
| Jobs by phase | quanton_spark_applications_by_phase |
| Control plane health | quanton_gateway_calls_total, quanton_gateway_call_duration_seconds |
| Work queue backup | workqueue_depth, workqueue_unfinished_work_seconds |

Job phase tracking

# Count of jobs currently running
quanton_spark_applications_by_phase{phase="RUNNING"}

# Submission failure rate (5m window)
rate(quanton_spark_application_submissions_total{status="failure"}[5m])
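Queries like these can also back Prometheus alerting rules. A hedged example; the alert name, threshold, and labels are illustrative choices, not prescribed by Quanton:

```yaml
groups:
  - name: quanton-jobs
    rules:
      - alert: QuantonSubmissionFailures        # illustrative alert name
        # Fire when any submission failures occurred over the last 5 minutes
        expr: rate(quanton_spark_application_submissions_total{status="failure"}[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Quanton Spark application submissions are failing
```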

OpenTelemetry

The operator ships a kube-state-metrics sidecar that collects pod-level metrics (CPU, memory requests/limits, lifecycle timestamps) from Spark driver and executor pods. These are forwarded to the Onehouse control plane via OpenTelemetry.

AI-powered diagnostics

The Quanton Agent AI sidebar is injected into every Spark Web UI page, providing:

  • Real-time chat about the running job (stages, SQL plans, executor metrics)
  • Auto-generated recommendations based on observed behavior
  • Health alerts for spill, GC pressure, skew, OOM, and straggler tasks
  • Live executor metrics — CPU, heap, shuffle IO, disk IO

See Agent AI for setup instructions.