Debugging Jobs

Common issues and how to resolve them when running Quanton Spark jobs on Kubernetes.

Job won't start

Operator pod is not running

kubectl get pods -n quanton-operator
kubectl describe pod <operator-pod> -n quanton-operator

Common causes:

  • onehouse-values.yaml is missing or malformed
  • Control plane endpoint unreachable — check network configuration
  • Image pull failure — verify imagePullSecrets.accessToken in your values file

Driver pod stuck in Pending

kubectl describe pod <driver-pod> -n <namespace>

Look for events at the bottom. Common causes:

  • Insufficient resources: The pod can't be scheduled. Reduce driver.memory or driver.cores, or add more nodes.
  • Missing service account: Ensure spark-operator-spark exists in the job namespace:
    kubectl get serviceaccount spark-operator-spark -n <namespace>
    The Spark Operator creates this service account automatically when jobNamespaces is configured, so if it is missing, verify that the Spark Operator is watching the correct namespace.
  • Image pull failure: The Quanton image isn't accessible. Verify network egress to dist.onehouse.ai and that the pull secret was created in the job namespace.
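
The events that kubectl describe prints at the bottom can also be queried on their own, which is easier to read when the pod description is long. A minimal sketch, assuming a shell with kubectl configured for the cluster; the function name pod_events is illustrative and not part of any Quanton tooling:

```shell
# List events for a single pod, oldest first, so the most recent
# scheduling or image-pull failure appears at the end of the output.
# Usage: pod_events <pod-name> <namespace>
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

Call it as pod_events <driver-pod> <namespace>. Events expire after a while, so an empty result on an old pod does not mean nothing went wrong.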

Job stuck in SUBMITTED phase

kubectl get quantonsparkapplication <name> -n <namespace> -o yaml

Check the status field for error messages. A job that stays in SUBMITTED often means the operator hasn't reconciled the resource yet. Check operator logs:

kubectl logs -n quanton-operator deployment/quanton-operator
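
To read the status without scrolling through the full manifest, the status block can be extracted with a JSONPath expression. This is a sketch; app_status is a hypothetical helper name, and the fields inside status depend on the operator version, so none are assumed here:

```shell
# Print only the status block of a QuantonSparkApplication.
# Usage: app_status <name> <namespace>
app_status() {
  kubectl get quantonsparkapplication "$1" -n "$2" -o jsonpath='{.status}'
}
```

An empty result suggests the operator has not written any status yet, which is consistent with the resource not having been reconciled.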

Job fails immediately

Check driver logs first

kubectl logs <driver-pod-name> -n <namespace>

Pipe through grep to filter noise:

kubectl logs <driver-pod-name> -n <namespace> | grep -i "error\|exception\|caused by"
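
A plain grep drops the stack trace lines that follow each match. One possible variation, assuming a grep that supports the -A context option (GNU and BSD grep both do), keeps a few lines after each hit; driver_errors is an illustrative name:

```shell
# Keep each matching line plus the 5 lines after it, so the stack
# trace under an exception is not lost. Reads log text on stdin.
driver_errors() {
  grep -i -E -A 5 'error|exception|caused by'
}
```

Usage: kubectl logs <driver-pod-name> -n <namespace> | driver_errors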

OOM (Out of Memory) errors

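Symptoms: java.lang.OutOfMemoryError in the logs, or an executor pod killed with exit code 137 (SIGKILL, typically from the kernel OOM killer).

Fix: Increase executor memory, or reduce partition size by raising the shuffle partition count: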

executor:
  memory: "16384m"
sparkConf:
  "spark.executor.memoryOverhead": "2g"
  "spark.sql.shuffle.partitions": "400"

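When raising these values, keep in mind that each executor pod requests the executor memory plus the memory overhead, and that total has to fit on a node. A small worked calculation for the values above, assuming no additional memory settings (such as PySpark memory) are in play; to_mib and executor_pod_mib are illustrative helpers:

```shell
# Convert a Spark-style size with an m or g suffix to MiB.
to_mib() {
  case "$1" in
    *g) echo $(( ${1%g} * 1024 )) ;;
    *m) echo "${1%m}" ;;
    *)  echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# Executor pod memory request in MiB: executor memory plus overhead.
executor_pod_mib() {
  echo $(( $(to_mib "$1") + $(to_mib "$2") ))
}

executor_pod_mib 16384m 2g   # prints 18432
```

Here 18432 MiB is 18 GiB per executor. If nodes have less allocatable memory than that, executors stay in Pending rather than running out of memory.
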
Data skew

Symptoms: Most tasks finish quickly but a few run for much longer.

Fix: Enable adaptive query execution with skew-join handling:

sparkConf:
  "spark.sql.adaptive.enabled": "true"
  "spark.sql.adaptive.skewJoin.enabled": "true"

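To put a rough number on the symptom, one heuristic is to compare the slowest task in a stage with the median task. The sketch below assumes task durations in seconds have been copied from the Spark UI into a file, one per line; skew_ratio is an illustrative helper, not a Spark or Quanton tool:

```shell
# Read one task duration per line on stdin and print the ratio of the
# slowest task to the median task, rounded down to an integer.
skew_ratio() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      m = v[int((NR + 1) / 2)]          # median (lower middle for even counts)
      if (NR && m > 0) print int(v[NR] / m)
    }'
}
```

Usage: skew_ratio < durations.txt. A ratio close to 1 means tasks are evenly sized; a much larger ratio points to a few oversized partitions.
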
S3 / GCS access errors

Symptoms: The driver or executor log contains an error such as:

com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied

Check that the driver service account has IAM permissions to read/write the bucket. For EKS with IRSA:

kubectl describe serviceaccount spark-operator-spark -n <namespace>
# Look for: eks.amazonaws.com/role-arn annotation
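
The annotation can also be read directly, which is convenient in scripts. A sketch using a kubectl JSONPath expression, with the dots in the annotation key escaped; irsa_role is a hypothetical helper name:

```shell
# Print the IAM role ARN bound to a service account through IRSA.
# Prints nothing if the annotation is absent.
# Usage: irsa_role <service-account> <namespace>
irsa_role() {
  kubectl get serviceaccount "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

Usage: irsa_role spark-operator-spark <namespace>. If the output is empty, the service account has no role attached and pods using it have no role-based access to the bucket.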

Executor connection timeout

Executors can't reach the driver. This usually means the driver pod IP isn't routable from executor pods (common with multi-namespace setups or restricted network policies).

Ensure your network policy allows pod-to-pod traffic within the job namespace:

# Allow all pods in namespace to communicate
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-pods
  namespace: data-jobs
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}

Migration issues

Transformed YAML fails to apply

If a job converted with the migration tool fails validation:

  1. Verify apiVersion is onehouse.ai/v1beta2
  2. Verify kind is QuantonSparkApplication
  3. Verify the original spec is nested under spec.sparkApplicationSpec
  4. Check that spec.sparkApplicationSpec.type is one of: Java, Scala, Python, R
  5. Check that spec.sparkApplicationSpec.mode is cluster or client
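
The checklist above can be approximated with a quick text check before applying the file. This is only a sketch: it uses plain pattern matching rather than a YAML parser, and it assumes a single-document manifest with two-space indentation and unquoted values. check_manifest is an illustrative name, not part of the migration tool:

```shell
# Print one line per failed check; return non-zero if any check fails.
# Usage: check_manifest <file>
check_manifest() {
  local file=$1 rc=0
  grep -Eq '^apiVersion:[[:space:]]*onehouse\.ai/v1beta2[[:space:]]*$' "$file" \
    || { echo "apiVersion is not onehouse.ai/v1beta2"; rc=1; }
  grep -Eq '^kind:[[:space:]]*QuantonSparkApplication[[:space:]]*$' "$file" \
    || { echo "kind is not QuantonSparkApplication"; rc=1; }
  grep -Eq '^  sparkApplicationSpec:' "$file" \
    || { echo "spec.sparkApplicationSpec not found"; rc=1; }
  grep -Eq '^    type:[[:space:]]*(Java|Scala|Python|R)[[:space:]]*$' "$file" \
    || { echo "type is not Java, Scala, Python, or R"; rc=1; }
  grep -Eq '^    mode:[[:space:]]*(cluster|client)[[:space:]]*$' "$file" \
    || { echo "mode is not cluster or client"; rc=1; }
  return "$rc"
}
```

Usage: check_manifest job.yaml. No output means all five checks passed. Validation by the API server remains the authoritative check.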

Existing SparkApplication still running under the OSS operator

If you applied a QuantonSparkApplication but the job is still using OSS Spark, check that:

  • The Quanton Operator is running and watching the correct namespace
  • The resource has kind: QuantonSparkApplication (not SparkApplication)
  • The namespace is listed in quantonOperator.jobNamespaces

Operator logs

The operator logs all reconciliation events. Filter for errors:

kubectl logs -n quanton-operator deployment/quanton-operator | grep -i "error\|fail\|warn"

For a specific job:

kubectl logs -n quanton-operator deployment/quanton-operator | grep <job-name>
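
The two filters above can be combined to show only problem lines for one job. A minimal sketch that reads log text on stdin; job_errors is an illustrative name:

```shell
# Keep lines that mention the given job name and also look like
# errors, failures, or warnings. The job name is matched literally.
# Usage: <log source> | job_errors <job-name>
job_errors() {
  grep -F -- "$1" | grep -i -E 'error|fail|warn'
}
```

Usage: kubectl logs -n quanton-operator deployment/quanton-operator --since=1h | job_errors <job-name>. The --since flag limits output to recent entries, which keeps the search fast on a long-running operator.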

Get help

Join the Onehouse Community Slack to connect with engineers building Quanton. Share your driver logs and operator logs when asking for help.