# Debugging Jobs
Common issues and how to resolve them when running Quanton Spark jobs on Kubernetes.
## Job won't start

### Operator pod is not running

Check that the operator pod is up and inspect it for errors:

```bash
kubectl get pods -n quanton-operator
kubectl describe pod <operator-pod> -n quanton-operator
```

Common causes:

- `onehouse-values.yaml` is missing or malformed
- Control plane endpoint unreachable: check network configuration
- Image pull failure: verify `imagePullSecrets.accessToken` in your values file
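If `describe` doesn't make the cause obvious, recent namespace events and the secrets actually present usually narrow it down. A quick check, assuming the operator runs in the `quanton-operator` namespace as above:

```bash
# List recent events in the operator namespace, newest last
kubectl get events -n quanton-operator --sort-by=.lastTimestamp | tail -20

# Confirm the image pull secret referenced in your values file actually exists
kubectl get secrets -n quanton-operator
```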
### Driver pod stuck in Pending

```bash
kubectl describe pod <driver-pod> -n <namespace>
```

Look for events at the bottom. Common causes:

- Insufficient resources: The pod can't be scheduled. Reduce `driver.memory` or `driver.cores`, or add more nodes (see the capacity check after this list).
- Missing service account: Ensure `spark-operator-spark` exists in the job namespace:

  ```bash
  kubectl get serviceaccount spark-operator-spark -n <namespace>
  ```

  If missing, note that the Spark Operator creates it automatically when `jobNamespaces` is configured, so verify the Spark Operator is watching the correct namespace.
- Image pull failure: The Quanton image isn't accessible. Verify network egress to `dist.onehouse.ai` and that the pull secret was created in the job namespace.
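To see whether it's a capacity problem, compare what the driver requests against what the nodes have free. A minimal check, assuming metrics-server is installed for `kubectl top`:

```bash
# Per-node CPU/memory usage (requires metrics-server)
kubectl top nodes

# Allocatable capacity vs. current requests per node
kubectl describe nodes | grep -A 5 "Allocated resources"
```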
### Job stuck in SUBMITTED phase

```bash
kubectl get quantonsparkapplication <name> -n <namespace> -o yaml
```

Check the `status` field for error messages. This often means the operator hasn't reconciled the resource yet. Check the operator logs:

```bash
kubectl logs -n quanton-operator deployment/quanton-operator
```
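To pull just the state without scrolling through the full YAML, a jsonpath query works. The exact status layout is an assumption here, so adjust the path to match what `-o yaml` shows for your version:

```bash
# Hypothetical status path; check the `-o yaml` output for the real field names
kubectl get quantonsparkapplication <name> -n <namespace> \
  -o jsonpath='{.status.applicationState.state}'
```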
## Job fails immediately

### Check driver logs first

```bash
kubectl logs <driver-pod-name> -n <namespace>
```

Pipe through grep to filter noise:

```bash
kubectl logs <driver-pod-name> -n <namespace> | grep -i "error\|exception\|caused by"
```
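If the driver container has already crashed and restarted, the current logs may be empty; `--previous` retrieves the logs from the last terminated container:

```bash
# Logs from the previous (crashed) container instance
kubectl logs <driver-pod-name> -n <namespace> --previous
```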
## OOM (Out of Memory) errors

Symptoms: `java.lang.OutOfMemoryError` in the logs, or an executor pod killed with exit code 137.

Fix: Increase executor memory or reduce partition size:

```yaml
executor:
  memory: "16384m"
sparkConf:
  "spark.executor.memoryOverhead": "2g"
  "spark.sql.shuffle.partitions": "400"
```
## Data skew

Symptoms: Most executors finish quickly, but a few tasks run for much longer.

Fix: Enable adaptive query execution so Spark can split skewed partitions at runtime:

```yaml
sparkConf:
  "spark.sql.adaptive.enabled": "true"
  "spark.sql.adaptive.skewJoin.enabled": "true"
```
## S3 / GCS access errors

```
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied
```

Check that the driver service account has IAM permissions to read/write the bucket. For EKS with IRSA:

```bash
kubectl describe serviceaccount spark-operator-spark -n <namespace>
# Look for: eks.amazonaws.com/role-arn annotation
```
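If the annotation is missing, bind an IAM role to the service account. The ARN below is a placeholder: substitute a role whose trust policy allows your cluster's OIDC provider and whose policies grant access to the bucket:

```bash
# Placeholder ARN: substitute your account ID and IRSA-enabled role name
kubectl annotate serviceaccount spark-operator-spark -n <namespace> \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<role-name> \
  --overwrite
```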
## Executor connection timeout

Executors can't reach the driver. This usually means the driver pod IP isn't routable from executor pods (common with multi-namespace setups or restricted network policies).

Ensure your network policy allows pod-to-pod traffic within the job namespace:

```yaml
# Allow all pods in the namespace to communicate
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-pods
  namespace: data-jobs
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
```
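Apply the policy and confirm it selects the Spark pods; `describe` shows the ingress rules the controller is enforcing. This assumes the manifest above was saved as `allow-spark-pods.yaml`:

```bash
kubectl apply -f allow-spark-pods.yaml
kubectl describe networkpolicy allow-spark-pods -n data-jobs
```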
## Migration issues

### Transformed YAML fails to apply

If a job converted with the migration tool fails validation, check the following (a minimal skeleton follows the list):

- Verify `apiVersion` is `onehouse.ai/v1beta2`
- Verify `kind` is `QuantonSparkApplication`
- Verify the original spec is nested under `spec.sparkApplicationSpec`
- Check that `spec.sparkApplicationSpec.type` is one of: `Java`, `Scala`, `Python`, `R`
- Check that `spec.sparkApplicationSpec.mode` is `cluster` or `client`
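Putting those checks together, a minimal skeleton looks like this. The metadata names are placeholders, and everything inside `sparkApplicationSpec` beyond `type` and `mode` comes from your original SparkApplication:

```yaml
apiVersion: onehouse.ai/v1beta2
kind: QuantonSparkApplication
metadata:
  name: example-job        # placeholder name
  namespace: data-jobs     # placeholder namespace
spec:
  sparkApplicationSpec:
    type: Python           # one of: Java, Scala, Python, R
    mode: cluster          # cluster or client
    # ...remaining fields from the original SparkApplication spec...
```

Validating with `kubectl apply --dry-run=server -f <file>` surfaces schema errors without creating the resource.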
### Existing SparkApplication still runs on the OSS operator

If you applied a QuantonSparkApplication but the job is still being handled by the OSS Spark Operator, check that:

- The Quanton Operator is running and watching the correct namespace
- The resource has `kind: QuantonSparkApplication` (not `SparkApplication`)
- The namespace is listed in `quantonOperator.jobNamespaces`
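A quick way to see which operator owns what, assuming both CRDs are installed, is to list both kinds across all namespaces:

```bash
# Resources the Quanton Operator should reconcile
kubectl get quantonsparkapplication -A

# Resources still managed by the OSS Spark Operator
kubectl get sparkapplication -A
```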
## Operator logs

The operator logs all reconciliation events. Filter for errors:

```bash
kubectl logs -n quanton-operator deployment/quanton-operator | grep -i "error\|fail\|warn"
```

For a specific job:

```bash
kubectl logs -n quanton-operator deployment/quanton-operator | grep <job-name>
```
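When reproducing an issue, tailing the logs live is often easier than re-running grep; `--since` limits output to recent events and `-f` streams new lines as they arrive:

```bash
kubectl logs -n quanton-operator deployment/quanton-operator --since=15m -f
```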
## Get help
Join the Onehouse Community Slack to connect with engineers building Quanton. Share your driver logs and operator logs when asking for help.