Debugging Jobs
Common issues and how to resolve them when running Quanton Spark jobs on Kubernetes.
Job won't start
Operator pod is not running
kubectl get pods -n quanton-operator
kubectl describe pod <operator-pod> -n quanton-operator
Common causes:
- onehouse-values.yaml is missing or malformed
- Control plane endpoint unreachable: check network configuration
- Image pull failure: verify imagePullSecrets.accessToken in your values file
Driver pod stuck in Pending
kubectl describe pod <driver-pod> -n <namespace>
Look for events at the bottom. Common causes:
- Insufficient resources: The pod can't be scheduled. Reduce driver.memory or driver.cores, or add more nodes.
- Missing service account: Ensure spark-operator-spark exists in the job namespace:
  kubectl get serviceaccount spark-operator-spark -n <namespace>
  If missing, the Spark Operator creates it automatically when jobNamespaces is configured; verify the Spark Operator is watching the correct namespace.
- Image pull failure: The Quanton image isn't accessible. Verify network egress to dist.onehouse.ai and that the pull secret was created in the job namespace.
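The events that kubectl describe prints can also be queried on their own, which helps when the describe output is long. A minimal sketch, assuming a hypothetical helper name pod_warnings; involvedObject.name and type are standard field selectors for Kubernetes events:

```shell
# Hypothetical helper: list Warning events for one pod, oldest first.
# Usage: pod_warnings <pod-name> <namespace>
pod_warnings() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}
```

A FailedScheduling reason usually points at insufficient resources, while messages mentioning ErrImagePull or ImagePullBackOff point at the image pull path.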
Job stuck in SUBMITTED phase
kubectl get quantonsparkapplication <name> -n <namespace> -o yaml
Check the status field for error messages. A job that stays in SUBMITTED often means the operator hasn't reconciled the resource yet, so check the operator logs next:
kubectl logs -n quanton-operator deployment/quanton-operator
Operator can't fetch a per-job auth token
Symptoms: the operator logs report a failed GetQuantonJobAuthToken RPC referencing a missing secret:
GetQuantonJobAuthToken RPC failed:
Secrets Manager can't find the specified secret.
(AWS SecretsManager, Status 400)
This happens when you uninstall the Quanton Operator Helm chart for one org and re-install with values for a different org in the same Kubernetes cluster.
The Quanton Operator Kubernetes secrets are not managed by Helm, so a Helm uninstall leaves them in the cluster unless you delete them yourself. The new controller sees that the Kubernetes secret already exists and therefore never invokes the gateway (GWC) to create and fetch the secrets for the new org. As a result, it keeps pointing at a secret that doesn't exist for that org.
Fix: clean up the leftover Quanton Operator Kubernetes secrets after the Helm uninstall, before re-installing for the new org:
# Inspect the leftover operator secrets
kubectl get secrets -n quanton-operator
# Delete the stale secrets so the new controller re-fetches them via GWC
kubectl delete secret <operator-secret> -n quanton-operator
After the secrets are removed, re-install (or restart the operator) so the new controller invokes GWC to create and fetch the correct secrets for the new org.
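To confirm the fix took effect, one option is to count how often the failure still appears in fresh operator logs. A minimal sketch, assuming a hypothetical helper name count_auth_token_failures and that the log line keeps the wording shown in the symptoms:

```shell
# Hypothetical helper: count lines on stdin that report the failed RPC.
# Prints 0 when the message no longer appears.
count_auth_token_failures() {
  grep -c 'GetQuantonJobAuthToken RPC failed' || true
}

# Usage (only logs from the last 10 minutes):
#   kubectl logs -n quanton-operator deployment/quanton-operator --since=10m \
#     | count_auth_token_failures
```

A count of 0 after the re-install suggests the new controller is no longer hitting the stale secret.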
Job fails immediately
Check driver logs first
kubectl logs <driver-pod-name> -n <namespace>
Pipe through grep to filter noise:
kubectl logs <driver-pod-name> -n <namespace> | grep -iE "error|exception|caused by"
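When the filtered output is still long, tallying exception types shows which failure dominates. A minimal sketch, assuming a hypothetical helper name tally_exceptions and JVM-style class names that end in Exception or Error:

```shell
# Hypothetical helper: count JVM exception/error class names seen on stdin,
# most frequent first. Output format: "<count> <class name>".
tally_exceptions() {
  grep -oE '[A-Za-z0-9_.$]+(Exception|Error)' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}

# Usage:
#   kubectl logs <driver-pod-name> -n <namespace> | tally_exceptions
```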
OOM (Out of Memory) errors
Symptoms: java.lang.OutOfMemoryError or executor pod killed with exit code 137.
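Exit code 137 follows the usual convention of 128 plus the signal number, so it means the container was terminated by signal 9 (SIGKILL), which is the signal the kernel OOM killer sends. A minimal sketch of that arithmetic, using a hypothetical helper name explain_exit_code:

```shell
# Hypothetical helper: explain a container exit code.
# Codes above 128 mean "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "exited normally"
  else
    echo "exited with application error $code"
  fi
}
```

The recorded reason for a finished executor can be read from the pod status, for example with kubectl get pod <executor-pod> -n <namespace> -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}', which reports OOMKilled when the container exceeded its memory limit.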
Fix: Increase executor memory, or reduce partition size by raising the shuffle partition count:
executor:
  memory: "16384m"
sparkConf:
  "spark.executor.memoryOverhead": "2g"
  "spark.sql.shuffle.partitions": "400"
Data skew
Symptoms: Most executors finish quickly but a few tasks run for much longer.
Fix: Enable adaptive query execution and its skew-join handling:
sparkConf:
  "spark.sql.adaptive.enabled": "true"
  "spark.sql.adaptive.skewJoin.enabled": "true"
S3 / GCS access errors
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied
Check that the driver service account has IAM permissions to read/write the bucket. For EKS with IRSA:
kubectl describe serviceaccount spark-operator-spark -n <namespace>
# Look for: eks.amazonaws.com/role-arn annotation
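The annotation can also be read directly instead of scanning the describe output. A minimal sketch, assuming a hypothetical helper name check_irsa; the jsonpath expression escapes the dots in the annotation key with backslashes:

```shell
# Hypothetical helper: print the IRSA role bound to spark-operator-spark,
# or fail with a message when the annotation is absent.
# Usage: check_irsa <namespace>
check_irsa() {
  arn=$(kubectl get serviceaccount spark-operator-spark -n "$1" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
  if [ -z "$arn" ]; then
    echo "no eks.amazonaws.com/role-arn annotation on spark-operator-spark in $1"
    return 1
  fi
  echo "role: $arn"
}
```

If a role is printed but access is still denied, the next place to look is that role's IAM policy for the bucket.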
Executor connection timeout
Executors can't reach the driver. This usually means the driver pod IP isn't routable from executor pods (common with multi-namespace setups or restricted network policies).
Ensure your network policy allows pod-to-pod traffic within the job namespace:
# Allow all pods in namespace to communicate
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-pods
  namespace: data-jobs
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
Migration issues
Transformed YAML fails to apply
If a job converted with the migration tool fails validation:
- Verify apiVersion is onehouse.ai/v1beta2
- Verify kind is QuantonSparkApplication
- Verify the original spec is nested under spec.sparkApplicationSpec
- Check that spec.sparkApplicationSpec.type is one of: Java, Scala, Python, R
- Check that spec.sparkApplicationSpec.mode is cluster or client
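The checklist above can be run as a quick pre-flight before kubectl apply. A rough sketch, assuming a hypothetical helper name check_manifest; it greps line by line for the expected fields and does not parse YAML, so it cannot confirm that type and mode are actually nested under spec.sparkApplicationSpec:

```shell
# Hypothetical helper: line-based sanity check of a converted manifest.
# Prints one message per failed check and returns non-zero if any failed.
check_manifest() {
  file=$1
  status=0
  fail() { echo "$1"; status=1; }

  grep -Eq '^apiVersion:[[:space:]]*onehouse\.ai/v1beta2[[:space:]]*$' "$file" \
    || fail "apiVersion is not onehouse.ai/v1beta2"
  grep -Eq '^kind:[[:space:]]*QuantonSparkApplication[[:space:]]*$' "$file" \
    || fail "kind is not QuantonSparkApplication"
  grep -Eq '^[[:space:]]+sparkApplicationSpec:' "$file" \
    || fail "spec.sparkApplicationSpec is missing"
  grep -Eq '^[[:space:]]+type:[[:space:]]*"?(Java|Scala|Python|R)"?[[:space:]]*$' "$file" \
    || fail "type is not one of Java, Scala, Python, R"
  grep -Eq '^[[:space:]]+mode:[[:space:]]*"?(cluster|client)"?[[:space:]]*$' "$file" \
    || fail "mode is not cluster or client"

  return "$status"
}

# Usage:
#   check_manifest job.yaml && kubectl apply -f job.yaml
```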
Job still runs on OSS Spark with both operators installed
If you applied a QuantonSparkApplication but the job is still using OSS Spark, check that:
- The Quanton Operator is running and watching the correct namespace
- The resource has kind: QuantonSparkApplication (not SparkApplication)
- The namespace is listed in quantonOperator.jobNamespaces
Operator logs
The operator logs all reconciliation events. Filter for errors:
kubectl logs -n quanton-operator deployment/quanton-operator | grep -iE "error|fail|warn"
For a specific job:
kubectl logs -n quanton-operator deployment/quanton-operator | grep <job-name>
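The two filters can be combined to show only the problem lines for one job. A minimal sketch, assuming a hypothetical helper name job_problems:

```shell
# Hypothetical helper: keep only error/failure/warning lines that mention one job.
# -F treats the job name literally, so dots in the name are not regex wildcards.
job_problems() {
  grep -F -- "$1" | grep -iE 'error|fail|warn'
}

# Usage:
#   kubectl logs -n quanton-operator deployment/quanton-operator | job_problems <job-name>
```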
Get help
Join the Onehouse Community Slack to connect with engineers building Quanton. Share your driver logs and operator logs when asking for help.