# Monitoring Jobs
Quanton exposes job observability through the Apache Spark Web UI (event logs) and driver logs. Operator-level metrics are available via Prometheus.
## Spark UI (event logs)
The Spark UI shows stages, tasks, SQL plans, executor metrics, and shuffle I/O for a running or completed job.
### Access the Spark UI locally
While the driver pod is running, port-forward it:
```shell
# Find the driver pod name
kubectl get pods -n <namespace> | grep driver

# Port-forward to local port 4040
kubectl port-forward <driver-pod-name> 4040:4040 -n <namespace>
```
Open http://localhost:4040.
The Spark UI is only available while the driver pod is alive; once the job completes, the driver pod is terminated and the UI becomes unreachable. Enable the Spark History Server if you need access after completion.
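For the history server to replay a job, the job must also write event logs. A minimal sketch of the relevant Spark properties, assuming your deployment lets you pass arbitrary Spark conf — the log directory below is a placeholder, not a recommended value:

```properties
# Hypothetical values; point spark.eventLog.dir at durable storage your cluster can reach
spark.eventLog.enabled           true
spark.eventLog.dir               s3a://my-bucket/spark-events
spark.history.fs.logDirectory    s3a://my-bucket/spark-events
```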
### Event log retention
Event logs are retained for 7 days after a job completes or is terminated.
## Driver logs
Stream driver logs directly with kubectl:
```shell
kubectl logs -f <driver-pod-name> -n <namespace>
```
To filter for errors and warnings:

```shell
kubectl logs <driver-pod-name> -n <namespace> | grep -i "error\|exception\|warn"
```
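For post-processing saved log files, the same filter can be expressed in Python. A minimal sketch — the sample lines are illustrative, not real driver output:

```python
import re

# Case-insensitive pattern matching the grep filter above
PROBLEM = re.compile(r"error|exception|warn", re.IGNORECASE)

def problem_lines(lines):
    """Keep only lines that mention an error, exception, or warning."""
    return [line for line in lines if PROBLEM.search(line)]

sample = [
    "24/01/01 INFO DAGScheduler: Job 0 finished",
    "24/01/01 WARN TaskSetManager: Lost task 3.0",
    "24/01/01 ERROR Executor: Exception in task 1.0",
]
print(problem_lines(sample))
```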
### Custom log output (Python)
Use Log4j to emit structured logs from Python jobs:
```python
from typing import Optional

from pyspark.sql import SparkSession


class LoggerProvider:
    def get_logger(self, spark: SparkSession, custom_prefix: Optional[str] = ""):
        """Return a JVM-side Log4j logger named after the calling class."""
        log4j_logger = spark._jvm.org.apache.log4j
        return log4j_logger.LogManager.getLogger(custom_prefix + self.__class__.__name__)


class MyJob(LoggerProvider):
    def __init__(self):
        self.spark = SparkSession.builder.appName("MyJob").getOrCreate()
        self.logger = self.get_logger(self.spark)

    def run(self):
        self.logger.info("Starting job")
        # ... your job logic
        self.logger.info("Job complete")
```
### Custom log output (Java/Scala)
```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class MyJob {
    private static final Logger logger = LogManager.getLogger(MyJob.class);

    public static void main(String[] args) {
        logger.info("Starting job");
        // ... job logic
    }
}
```
## Operator metrics (Prometheus)
The Quanton Operator exposes Prometheus metrics on port `8080` at `/metrics`. Scrape this endpoint from your monitoring stack.
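To spot-check the endpoint without a full Prometheus stack, you can fetch `/metrics` (for example via `kubectl port-forward`) and parse the exposition text directly. A minimal sketch; the sample lines below are illustrative, not real operator output:

```python
import re

# Matches one exposition-format sample, e.g.: metric_name{label="value"} 3
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][\w:]*)(?P<labels>\{[^}]*\})?\s+(?P<value>\S+)$')

def parse_metrics(text):
    """Parse Prometheus exposition lines into (name, labels, value) tuples."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        m = LINE.match(line)
        if m:
            out.append((m["name"], m["labels"] or "", float(m["value"])))
    return out

sample = """\
# HELP quanton_spark_applications_by_phase Jobs by phase
quanton_spark_applications_by_phase{phase="RUNNING"} 3
quanton_spark_applications_by_phase{phase="FAILED"} 1
"""
print(parse_metrics(sample))
```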
### Key metrics to watch
| Use case | Metric |
|---|---|
| Reconciliation errors | `quanton_reconciliations_total{status!="success"}` |
| Slow reconciliation | `quanton_reconciliation_duration_seconds` |
| Submission failures | `quanton_spark_application_submissions_total{status="failure"}` |
| Jobs by phase | `quanton_spark_applications_by_phase` |
| Control plane health | `quanton_gateway_calls_total`, `quanton_gateway_call_duration_seconds` |
| Work queue backup | `workqueue_depth`, `workqueue_unfinished_work_seconds` |
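These metrics translate naturally into alert rules. A hypothetical Prometheus rule for sustained submission failures — the threshold, duration, and labels are placeholders, not recommended values:

```yaml
groups:
  - name: quanton-operator
    rules:
      - alert: QuantonSubmissionFailures
        expr: rate(quanton_spark_application_submissions_total{status="failure"}[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Quanton Spark application submissions are failing"
```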
### Job phase tracking
```promql
# Count of jobs currently running
quanton_spark_applications_by_phase{phase="RUNNING"}

# Submission failure rate (5m window)
rate(quanton_spark_application_submissions_total{status="failure"}[5m])
```
## OpenTelemetry
The operator ships a kube-state-metrics sidecar that collects pod-level metrics (CPU, memory requests/limits, lifecycle timestamps) from Spark driver and executor pods. These are forwarded to the Onehouse control plane via OpenTelemetry.
## AI-powered diagnostics
The Quanton Agent AI sidebar is injected into every Spark Web UI page and provides:
- Real-time chat about the running job (stages, SQL plans, executor metrics)
- Auto-generated recommendations based on observed behavior
- Health alerts for spill, GC pressure, skew, OOM, and straggler tasks
- Live executor metrics: CPU, heap, shuffle I/O, disk I/O
See Agent AI for setup instructions.