Debugging Jobs

Common issues and how to resolve them when running Quanton Spark jobs on Kubernetes.

Job won't start

Operator pod is not running

kubectl get pods -n quanton-operator
kubectl describe pod <operator-pod> -n quanton-operator

Common causes:

onehouse-values.yaml is missing or malformed
Control plane endpoint is unreachable: check the network configuration
Image pull failure: verify imagePullSecrets.accessToken in your values file

Driver pod stuck in `Pending`

kubectl describe pod <driver-pod> -n <namespace>

Look at the Events section at the bottom of the output. Common causes:

Insufficient resources: The pod can't be scheduled. Reduce driver.memory or driver.cores, or add more nodes.
Missing service account: Ensure spark-operator-spark exists in the job namespace:
```
kubectl get serviceaccount spark-operator-spark -n <namespace>
```
The Spark Operator creates this service account automatically when jobNamespaces is configured, so if it is missing, verify that the Spark Operator is watching the correct namespace.
Image pull failure: The Quanton image isn't accessible. Verify network egress to dist.onehouse.ai and that the pull secret was created in the job namespace.
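
The event list for a pod can be long, so a small filter helps surface the lines that usually explain a `Pending` state. The sketch below is an illustration only: `pending_hints` is a hypothetical helper name, and the matched strings are standard Kubernetes event reasons and messages rather than anything specific to Quanton.

```shell
# Keep only the event lines that usually explain a Pending pod.
# Reads `kubectl get events` output on stdin.
pending_hints() {
  grep -E 'FailedScheduling|ErrImagePull|ImagePullBackOff|Failed to pull image' || true
}

# Usage against a live cluster (same placeholders as the commands above):
# kubectl get events -n <namespace> \
#   --field-selector involvedObject.name=<driver-pod> \
#   --sort-by=.lastTimestamp | pending_hints
```

A `FailedScheduling` line points at the insufficient-resources case, while the image-related lines point at the pull secret or network egress.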

Job stuck in `SUBMITTED` phase

kubectl get quantonsparkapplication <name> -n <namespace> -o yaml

Check the status field for error messages. A job that stays in SUBMITTED often means the operator hasn't reconciled the resource yet. Check the operator logs:

kubectl logs -n quanton-operator deployment/quanton-operator
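
The full YAML output can run to several screens. As a rough sketch, the hypothetical helper below prints only the top-level `status:` block from YAML read on stdin. It assumes a single YAML document with unindented top-level keys, which is how `kubectl get -o yaml` prints a single resource.

```shell
# Print the top-level `status:` block from a resource's YAML (read from stdin).
status_block() {
  awk '
    /^status:/      { show = 1; print; next }  # start of the status block
    /^[^[:space:]]/ { show = 0 }               # any other top-level key ends it
    show                                       # print indented lines in between
  '
}

# Usage (same placeholders as the command above):
# kubectl get quantonsparkapplication <name> -n <namespace> -o yaml | status_block
```

An empty result means the resource has no status yet, which is consistent with the operator not having reconciled it.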

Job fails immediately

Check driver logs first

kubectl logs <driver-pod-name> -n <namespace>

Pipe through grep to filter noise:

kubectl logs <driver-pod-name> -n <namespace> | grep -i "error\|exception\|caused by"
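
A plain match drops the stack frames that follow each exception. One possible variation, shown here as a sketch with a hypothetical helper name, keeps a few lines of trailing context using grep's `-A` option and the extended-regex form of the same pattern.

```shell
# Print error-like lines plus the lines that follow them, so stack frames stay attached.
# Reads log text on stdin; $1 is the number of trailing context lines (default 5).
error_context() {
  grep -iE -A "${1:-5}" 'error|exception|caused by' || true
}

# Usage (same placeholders as the commands above):
# kubectl logs <driver-pod-name> -n <namespace> | error_context 10
```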

OOM (Out of Memory) errors

Symptoms: java.lang.OutOfMemoryError or executor pod killed with exit code 137.

Fix: Increase executor memory, or reduce partition size by raising the shuffle partition count:

executor:
  memory: "16384m"
sparkConf:
  "spark.executor.memoryOverhead": "2g"
  "spark.sql.shuffle.partitions": "400"

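Two pieces of arithmetic help when reading these symptoms and sizing the fix. Exit codes above 128 encode 128 plus a signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. In open-source Spark on Kubernetes, each executor pod requests the executor memory plus the memory overhead; that Quanton sizes pods the same way is an assumption here. The helper names are hypothetical and the inputs are the values from the example configuration.

```shell
# Signal number behind a container exit code (codes above 128 mean 128 + signal).
exit_signal() {
  if [ "$1" -gt 128 ]; then echo $(( $1 - 128 )); else echo 0; fi
}

# Executor pod memory in MiB: executor memory (MiB) plus overhead (GiB).
pod_memory_mib() {
  echo $(( $1 + $2 * 1024 ))
}

exit_signal 137          # prints 9 (SIGKILL)
pod_memory_mib 16384 2   # prints 18432

# To confirm an OOM kill on a pod that still exists (placeholders, not real names):
# kubectl get pod <executor-pod> -n <namespace> \
#   -o jsonpath='{.status.containerStatuses[*].state.terminated.reason}'
```

Exit code 137 alone only shows that the container received SIGKILL; a terminated reason of `OOMKilled` is the stronger signal. Nodes need room for the summed figure per executor, not just the `memory` value.
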
Data skew

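Symptoms: Most executors finish quickly but a few tasks run for much longer.

Fix: Enable adaptive query execution with skew-join handling so that oversized partitions are split: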

sparkConf:
  "spark.sql.adaptive.enabled": "true"
  "spark.sql.adaptive.skewJoin.enabled": "true"

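As documented for Apache Spark, skew-join handling treats a partition as skewed when it is larger than both spark.sql.adaptive.skewJoin.skewedPartitionFactor times the median partition size and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes. The sketch below works through that rule with the open-source defaults (factor 5, threshold 256 MiB). Whether Quanton keeps those defaults is an assumption, the partition sizes are invented for illustration, and `is_skewed` is a hypothetical helper.

```shell
# Succeeds when a partition would count as skewed under the rule described above.
# Args: partition size (MiB), median partition size (MiB), factor, threshold (MiB).
is_skewed() {
  [ "$1" -gt $(( $2 * $3 )) ] && [ "$1" -gt "$4" ]
}

is_skewed 2048 100 5 256 && echo "skewed"       # prints "skewed"
is_skewed 300 100 5 256 || echo "not skewed"    # prints "not skewed"
```

If slow tasks process partitions that fall below both limits, enabling the two settings alone may not split them, and the factor or threshold may need tuning.
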
S3 / GCS access errors

com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied

Check that the driver service account has IAM permissions to read/write the bucket. For EKS with IRSA:

kubectl describe serviceaccount spark-operator-spark -n <namespace>
# Look for: eks.amazonaws.com/role-arn annotation
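
To pull out just the role ARN instead of scanning the describe output by eye, a filter along these lines may help. It is a sketch: `irsa_role_arn` is a hypothetical helper, and it assumes the annotation appears on one line in `key: value` form.

```shell
# Extract the IRSA role ARN from `kubectl describe serviceaccount` output on stdin.
# Prints nothing when the annotation is absent.
irsa_role_arn() {
  sed -n 's|.*eks\.amazonaws\.com/role-arn: *||p'
}

# Usage (same placeholders as the command above):
# kubectl describe serviceaccount spark-operator-spark -n <namespace> | irsa_role_arn
```

An empty result means the service account has no role annotation, so pods using it will not receive IRSA credentials.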

Executor connection timeout

Executors can't reach the driver. This usually means the driver pod IP isn't routable from the executor pods, which is common in multi-namespace setups or under restrictive network policies.

Ensure your network policy allows pod-to-pod traffic within the job namespace:

# Allow all pods in namespace to communicate
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-pods
  namespace: data-jobs
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}

Migration issues

Transformed YAML fails to apply

If a job converted with the migration tool fails validation:

Verify apiVersion is onehouse.ai/v1beta2
Verify kind is QuantonSparkApplication
Verify the original spec is nested under spec.sparkApplicationSpec
Check that spec.sparkApplicationSpec.type is one of: Java, Scala, Python, R
Check that spec.sparkApplicationSpec.mode is cluster or client
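
The first three checks can be scripted before applying the file. The function below is a minimal sketch with a hypothetical name; it assumes a single-document manifest with `apiVersion` and `kind` at the top level, and it only looks for the strings, it does not parse YAML.

```shell
# Quick string-level check of a converted manifest. $1 is the path to the file.
check_manifest() {
  grep -Eq '^apiVersion:[[:space:]]*onehouse\.ai/v1beta2[[:space:]]*$' "$1" \
    || { echo "unexpected apiVersion"; return 1; }
  grep -Eq '^kind:[[:space:]]*QuantonSparkApplication[[:space:]]*$' "$1" \
    || { echo "unexpected kind"; return 1; }
  grep -Eq '^[[:space:]]+sparkApplicationSpec:' "$1" \
    || { echo "sparkApplicationSpec not found"; return 1; }
  echo "ok"
}

# Usage (<file> is a placeholder for the converted manifest):
# check_manifest <file> && kubectl apply --dry-run=server -f <file>
```

The `--dry-run=server` flag asks the API server to validate the resource without persisting it, which surfaces schema errors for fields such as type and mode.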

Job still runs on OSS Spark with both operators installed

If you applied a QuantonSparkApplication but the job is still using OSS Spark, check that:

The Quanton Operator is running and watching the correct namespace
The resource has kind: QuantonSparkApplication (not SparkApplication)
The namespace is listed in quantonOperator.jobNamespaces

Operator logs

The operator logs all reconciliation events. Filter for errors:

kubectl logs -n quanton-operator deployment/quanton-operator | grep -i "error\|fail\|warn"

For a specific job:

kubectl logs -n quanton-operator deployment/quanton-operator | grep <job-name>
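
The two filters can be combined to show only the problem lines for one job. This is a sketch with a hypothetical helper name; it matches the job name as a fixed string, so names containing regex characters are handled literally.

```shell
# Keep operator log lines that mention the job ($1) and look like errors or warnings.
# Reads log text on stdin.
job_errors() {
  grep -F -- "$1" | grep -iE 'error|fail|warn' || true
}

# Usage (same placeholder as the command above):
# kubectl logs -n quanton-operator deployment/quanton-operator | job_errors <job-name>
```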

Get help

Join the Onehouse Community Slack to connect with engineers building Quanton. Share your driver logs and operator logs when asking for help.
