Troubleshooting Guide¶
This guide provides comprehensive troubleshooting information for the Neo4j Kubernetes Operator, covering both Neo4jEnterpriseCluster and Neo4jEnterpriseStandalone deployments.
Quick Reference¶
Diagnostic Commands¶
# Check deployment status
kubectl get neo4jenterprisecluster
kubectl get neo4jenterprisestandalone
kubectl get neo4jdatabase
# View detailed information
kubectl describe neo4jenterprisecluster <cluster-name>
kubectl describe neo4jenterprisestandalone <standalone-name>
kubectl describe neo4jdatabase <database-name>
# Check pod status
# Clusters
kubectl get pods -l neo4j.com/cluster=<cluster-name>
kubectl logs -l neo4j.com/cluster=<cluster-name>
# Standalone
kubectl get pods -l app=<standalone-name>
kubectl logs -l app=<standalone-name>
# Check events
kubectl get events --sort-by=.metadata.creationTimestamp
# Check operator logs
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager
Common Port Forwarding Commands¶
# For clusters
kubectl port-forward svc/<cluster-name>-client 7474:7474 7687:7687
# For standalone deployments
kubectl port-forward svc/<standalone-name>-service 7474:7474 7687:7687
Common Issues and Solutions¶
1. Split-Brain Scenarios¶
Problem: Cluster nodes form multiple independent clusters¶
This is most common with TLS-enabled clusters where nodes fail to join during initial formation.
Quick Check:
# Check each node's view of the cluster
for i in 0 1 2; do
  kubectl exec <cluster>-server-$i -- cypher-shell -u neo4j -p <password> "SHOW SERVERS" | wc -l
done
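To interpret the loop's output, compare the line counts from each node. A small helper (hypothetical, not part of the operator) makes the comparison explicit: if any node reports a different count, the nodes disagree about cluster membership, which suggests a split brain.

```shell
# Hypothetical helper: pass it one SHOW SERVERS line count per node.
# Prints SPLIT if any node disagrees with the first, OK otherwise.
check_split() {
  first=""
  for count in "$@"; do
    if [ -z "$first" ]; then
      first="$count"
    elif [ "$count" != "$first" ]; then
      echo "SPLIT"
      return 0
    fi
  done
  echo "OK"
}

check_split 7 7 7   # all nodes agree -> prints OK
check_split 7 4 4   # counts differ   -> prints SPLIT
```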
Solution: See the comprehensive Split-Brain Recovery Guide or use the Quick Reference.
Quick Fix:
# Restart minority cluster nodes (orphaned pods)
kubectl delete pod <cluster>-server-1 <cluster>-server-2
2. Test Environment Issues¶
Problem: Integration tests failing with namespace termination issues¶
Test namespaces get stuck in "Terminating" state due to resources with finalizers.
Solution: Ensure proper cleanup in test code:
// Always remove finalizers before deletion
if len(resource.GetFinalizers()) > 0 {
    resource.SetFinalizers([]string{})
    _ = k8sClient.Update(ctx, resource)
}
_ = k8sClient.Delete(ctx, resource)
Problem: Backup sidecar test timeout¶
Test waits for wrong readiness field on standalone deployments.
Solution: Check the correct status field:
// For standalone deployments
return standalone.Status.Ready // NOT Status.Conditions
// Correct pod label selector
client.MatchingLabels{"app": standalone.Name}
Problem: Operator not deployed in test cluster¶
Integration tests fail because operator is not running.
Solution: Deploy operator before running tests:
kubectl config use-context kind-neo4j-operator-test
make operator-setup # Deploy operator to cluster
make test-integration
Problem: CI Failures Due to Resource Constraints (Added 2025-08-22)¶
GitHub Actions CI often fails with "Unschedulable - 0/1 nodes are available: 1 Insufficient memory" when running integration tests.
Root Cause: CI environments have limited memory (~7GB total), but tests request 1Gi+ per Neo4j pod.
Solution: Use CI workflow emulation (make test-ci-local):
What CI Emulation Provides:
- Identical Environment: Sets the CI=true and GITHUB_ACTIONS=true environment variables
- Memory Constraints: Uses 512Mi memory limits (same as CI)
- Debug Logging: Comprehensive logs saved to logs/ci-local-*.log
- Complete Workflow: Unit tests → Integration tests → Cleanup
- Troubleshooting: Auto-provided diagnostic commands on failure
Generated Debug Files:
- logs/ci-local-unit.log - Unit test output with environment info
- logs/ci-local-integration.log - Integration test output with cluster setup
- logs/ci-local-cleanup.log - Environment cleanup output
Manual Resource Debugging:
# Check memory allocation in CI logs
cat logs/ci-local-integration.log | grep -E "(memory|Memory|512Mi)"
# Check pod resource requests
kubectl describe pod <pod-name> | grep -A10 "Requests"
# Monitor real-time memory usage
kubectl top pod <pod-name> --containers
# Check for OOMKilled pods
kubectl get events | grep OOMKilled
Key Resource Requirements:
- CI Environment: 512Mi memory limits per pod
- Local Development: 1.5Gi memory limits per pod (Neo4j Enterprise minimum)
- Automatic Detection: Tests use getCIAppropriateResourceRequirements() function
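The CI-vs-local sizing above can be sketched as follows. The real logic lives in the getCIAppropriateResourceRequirements() function in the test code; this standalone shell version is illustrative only.

```shell
# Illustrative sketch of CI-aware memory sizing (not the actual test helper).
# Pass "true" when emulating CI constraints.
memory_limit() {
  if [ "${1:-false}" = "true" ]; then
    echo "512Mi"    # CI constraint
  else
    echo "1536Mi"   # ~1.5Gi, the Neo4j Enterprise minimum for local development
  fi
}

memory_limit true    # prints 512Mi
memory_limit false   # prints 1536Mi
```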
Prevention:
# Always test with CI constraints before pushing
make test-ci-local
# If CI emulation passes, CI should pass too
echo "✅ Ready for CI deployment"
3. Deployment Validation Errors¶
Problem: Single-Node Cluster Not Allowed¶
Error: Neo4jEnterpriseCluster requires minimum 2 servers for clustering. For single-node deployments, use Neo4jEnterpriseStandalone instead
Solution: Use the correct CRD for your deployment type:
For development/testing (single-node):
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jEnterpriseStandalone
metadata:
  name: dev-neo4j
spec:
  image:
    repo: neo4j
    tag: "5.26-enterprise"
  storage:
    className: standard
    size: "10Gi"
For production (minimum cluster):
For production (minimum cluster):
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jEnterpriseCluster
metadata:
  name: prod-cluster
spec:
  topology:
    servers: 2  # Minimum required for clustering
  image:
    repo: neo4j
    tag: "5.26-enterprise"
  storage:
    className: standard
    size: "10Gi"
Problem: Invalid Neo4j Version¶
Solution: Update to a supported version:
Supported versions:
- Semver: 5.26.0, 5.26.1 (5.26.x is the last semver LTS; no 5.27+ exists)
- Calver: 2025.01.0, 2025.06.1, 2026.01.0+
4. Pod Startup Issues¶
Problem: Pods Stuck in Pending State¶
# Check pod events
kubectl describe pod <pod-name>
# Common causes:
# - Insufficient resources
# - Storage issues
# - Image pull issues
Solutions:
- Check Resource Availability
- Verify Storage Class
- Check Image Pull
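If no node can satisfy the pod's resource requests, lowering them may unblock scheduling. A minimal sketch, assuming a spec.resources field on the CRD (verify the exact field name against the API reference); the 1.5Gi limit matches the Neo4j Enterprise minimum noted in the CI section:

```yaml
spec:
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "1.5Gi"   # Neo4j Enterprise minimum for local development
      cpu: "1"
```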
Problem: Pods Crashing (CrashLoopBackOff)¶
Common causes and solutions:
- Memory Issues
- Configuration Issues
- License Issues
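Memory-related crashes can often be addressed by sizing the heap and page cache explicitly so they fit inside the container limit. The server.memory.* keys are standard Neo4j 5 settings; placing them under spec.config follows the pattern used elsewhere in this guide, but verify against the API reference:

```yaml
spec:
  config:
    # Keep heap + page cache comfortably below the container memory limit
    server.memory.heap.initial_size: "512m"
    server.memory.heap.max_size: "512m"
    server.memory.pagecache.size: "256m"
```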
5. Connectivity Issues¶
Problem: Cannot Connect to Neo4j¶
# Test connectivity
kubectl port-forward svc/<service-name> 7474:7474 7687:7687
curl http://localhost:7474
# Check service
kubectl get svc -l app.kubernetes.io/name=neo4j
kubectl describe svc <service-name>
Solutions:
- Check Service Configuration
- Verify Network Policies
- Check TLS Configuration
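If a NetworkPolicy is blocking client traffic, an explicit allow rule for the HTTP and Bolt ports can be sketched like this (the pod selector labels are assumptions; adjust them to match your deployment's labels):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-neo4j-client
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: neo4j
  ingress:
    - ports:
        - port: 7474   # HTTP
        - port: 7687   # Bolt
```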
6. Cluster-Specific Issues¶
Problem: Cluster Formation Fails¶
# Check cluster status
kubectl get neo4jenterprisecluster <cluster-name> -o yaml
# Check individual pod logs
kubectl logs <cluster-name>-server-0
kubectl logs <cluster-name>-server-1
Solutions:
- 🔧 Verify LIST Discovery Configuration
The operator uses LIST discovery with static pod FQDNs (port 6000). Check the startup script in the cluster ConfigMap:
kubectl get configmap <cluster-name>-config -o yaml | grep -A 3 "resolver_type"
# Neo4j 5.26.x should show:
# dbms.cluster.discovery.resolver_type=LIST
# dbms.cluster.discovery.version=V2_ONLY
# dbms.cluster.discovery.v2.endpoints=<cluster>-server-0.<cluster>-headless.<ns>.svc.cluster.local:6000,...
# Neo4j 2025.x+ should show:
# dbms.cluster.discovery.resolver_type=LIST
# dbms.cluster.endpoints=<cluster>-server-0.<cluster>-headless.<ns>.svc.cluster.local:6000,...
If K8S discovery or the wrong ports appear, upgrade to the latest operator version; this was fixed in favour of LIST discovery.
- Verify Cluster Topology
- Check Inter-Pod Communication
- Verify Discovery Labels:
# Check that only the discovery service has clustering label
kubectl get svc -l neo4j.com/cluster=<cluster-name> -o yaml | grep -A 3 -B 3 "neo4j.com/clustering"
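When checking the discovery configuration, it helps to know exactly what the endpoint list should look like. This helper reconstructs the expected LIST value from the FQDN pattern shown above (the operator generates the real value; this is only for comparison):

```shell
# Build the expected LIST discovery endpoint string:
# <cluster>-server-N.<cluster>-headless.<ns>.svc.cluster.local:6000 per server.
endpoints() {
  cluster="$1"; ns="$2"; servers="$3"
  out=""
  i=0
  while [ "$i" -lt "$servers" ]; do
    ep="${cluster}-server-${i}.${cluster}-headless.${ns}.svc.cluster.local:6000"
    if [ -n "$out" ]; then out="${out},${ep}"; else out="$ep"; fi
    i=$((i + 1))
  done
  echo "$out"
}

# Compare this against the value in the cluster ConfigMap
endpoints prod-cluster default 3
```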
Problem: Scaling Issues¶
Solutions:
- Verify Minimum Topology
- Check Resource Limits
7. Standalone-Specific Issues¶
Problem: Standalone Pod Won't Start¶
# Check standalone status
kubectl get neo4jenterprisestandalone <standalone-name> -o yaml
# Check pod events
kubectl describe pod <standalone-name>-0
Solutions:
- Check Standalone Configuration
- Verify Storage Configuration
Problem: Migration from Cluster to Standalone¶
# Create backup first
kubectl apply -f backup.yaml
# Deploy standalone
kubectl apply -f standalone.yaml
# Restore data
kubectl apply -f restore.yaml
8. Performance Issues¶
Problem: Slow Query Performance¶
# Check resource usage
kubectl top pods
kubectl top nodes
# Check Neo4j metrics
kubectl port-forward svc/<service-name> 7474:7474
# Access http://localhost:7474/metrics
Solutions:
- Adjust Memory Settings
- Enable Query Logging
- Check Storage Performance
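A sketch of the memory and query-logging tuning, using standard Neo4j 5 setting names (the sizes are illustrative and should be matched to your pod limits; the spec.config placement follows the pattern used elsewhere in this guide):

```yaml
spec:
  config:
    # Memory tuning: heap for query execution, page cache for the store files
    server.memory.heap.max_size: "4g"
    server.memory.pagecache.size: "2g"
    # Log queries slower than the threshold for analysis
    db.logs.query.enabled: "INFO"
    db.logs.query.threshold: "1s"
```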
9. Storage Issues¶
Problem: PVC Issues¶
# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>
# Check storage class
kubectl get storageclass
Solutions:
- Verify Storage Class
- Check Node Storage
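When PVCs cannot bind or resize, check the StorageClass itself. A sketch of a class suited to stateful Neo4j workloads (the provisioner is a placeholder for your environment's provisioner):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: neo4j-ssd
provisioner: kubernetes.io/no-provisioner   # placeholder: use your cloud's provisioner
allowVolumeExpansion: true                  # required to grow PVCs later
volumeBindingMode: WaitForFirstConsumer     # bind where the pod is scheduled
```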
Problem: Data Corruption¶
Solutions:
- Run Consistency Check
- Restore from Backup
10. Backup and Restore Issues¶
Problem: Backup failing with permission denied¶
Backup jobs fail with "permission denied" or "cannot exec into pod" errors.
Solution: The operator now automatically creates RBAC resources. If you're upgrading:
# Ensure operator has latest permissions
make install # After cloning the repository
# Check operator has pods/exec and pods/log permissions
kubectl describe clusterrole neo4j-operator-manager-role | grep -E "pods/exec|pods/log"
Note: Starting with the latest version, the operator automatically creates:
- Service accounts for backup jobs
- Roles with pods/exec and pods/log permissions
- Role bindings for secure backup execution
Problem: Backup path not found¶
Neo4j 5.26+ requires backup destination path to exist.
Solution: The operator's backup pod (clusters) or backup sidecar (standalone) automatically creates paths. Check the backup container is running:
# Cluster backup pod
kubectl get pod <cluster>-backup-0 -o yaml | grep backup
kubectl logs <cluster>-backup-0 -c backup
# Standalone backup sidecar
kubectl logs <neo4j-pod> -c backup-sidecar
11. Security Issues¶
Problem: Authentication Failures¶
# Check auth secret
kubectl get secret <auth-secret> -o yaml
# Check Neo4j auth logs
kubectl logs <pod-name> | grep -i auth
Solutions:
- Verify Admin Secret
- Check Password Policy: spec.auth.passwordPolicy is schema-only and currently ignored; set the Neo4j keys directly in spec.config instead.
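For example (the setting name below is a standard Neo4j option, but verify it against your Neo4j version's configuration reference before relying on it):

```yaml
spec:
  config:
    dbms.security.auth_minimum_password_length: "12"
```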
Problem: TLS Certificate Issues¶
# Check certificate status
kubectl get certificates
kubectl describe certificate <cert-name>
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
Solutions:
- Verify Issuer
- Check Certificate Details
- TLS Cluster Formation Issues:
TLS-enabled clusters are prone to split-brain during initial formation. If you see partial cluster formation:
# Check for split clusters
kubectl exec <cluster>-server-0 -- cypher-shell -u neo4j -p <password> "SHOW SERVERS"
kubectl exec <cluster>-server-1 -- cypher-shell -u neo4j -p <password> "SHOW SERVERS"
Prevention:
spec:
config:
# Increase discovery timeouts for TLS clusters
dbms.cluster.discovery.v2.initial_timeout: "10s"
dbms.cluster.discovery.v2.retry_timeout: "20s"
# Note: Do NOT override dbms.cluster.raft.membership.join_timeout
# The operator sets it to 10m which is optimal
See Split-Brain Recovery Guide for detailed recovery procedures.
12. Database Creation Issues¶
Problem: Neo4jDatabase Creation Fails¶
# Check database status
kubectl get neo4jdatabase <database-name> -o yaml
kubectl describe neo4jdatabase <database-name>
# Check events specific to the database
kubectl get events --field-selector involvedObject.name=<database-name>
Common causes and solutions:
- Cluster Not Ready
- Topology Exceeds Cluster Capacity
- Invalid Configuration Conflicts
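For the topology case: a database's topology must fit within the cluster's server count. The field names below (topology.primaries/secondaries) are assumptions based on typical operator CRDs; check the Neo4jDatabase API reference for the actual schema:

```yaml
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jDatabase
metadata:
  name: sales
spec:
  topology:
    primaries: 2    # primaries + secondaries must not exceed the cluster's servers
    secondaries: 0
```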
Problem: Seed URI Database Creation Fails¶
# Check validation errors
kubectl describe neo4jdatabase <database-name>
# Check operator logs for seed URI specific errors
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i seed
Common seed URI issues:
- Authentication Failures
- URI Access Issues
- Invalid URI Format
- Point-in-Time Recovery Issues
- Performance Issues with Seed URI:
# Warning: Using dump file format. For better performance with large databases,
# consider using Neo4j backup format (.backup) instead.
# Solution: Use .backup format for large datasets
seedURI: "s3://my-backups/database.backup"  # Instead of .dump
# Optimize seed configuration for better performance
seedConfig:
  config:
    compression: "lz4"      # Faster than gzip
    bufferSize: "256MB"     # Larger buffer for big files
    validation: "lenient"   # Skip intensive validation
Problem: Database Stuck in Creating State¶
# Check database status conditions
kubectl get neo4jdatabase <database-name> -o jsonpath='{.status.conditions[*].message}'
# Monitor database creation progress
kubectl get events -w --field-selector involvedObject.name=<database-name>
Solutions:
- Check Cluster Connectivity
- Large Backup Restoration
- Network Connectivity Issues
Problem: Database Ready But No Data¶
# Connect to database and check
kubectl exec -it <cluster-pod> -- cypher-shell -u neo4j -p <password> -d <database-name> "MATCH (n) RETURN count(n)"
Solutions:
- Initial Data Not Applied
- Seed URI Data Not Restored
Advanced Troubleshooting¶
Debug Mode¶
Enable debug logging in the operator:
kubectl patch deployment neo4j-operator-controller-manager \
-n neo4j-operator \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'
Resource Monitoring¶
Monitor resource usage:
# Watch resource usage
watch kubectl top pods
watch kubectl top nodes
# Check resource limits
kubectl describe limitrange
kubectl describe resourcequota
Network Debugging¶
Test network connectivity:
# DNS resolution
kubectl exec -it <pod-name> -- nslookup <service-name>
# Port connectivity
kubectl exec -it <pod-name> -- telnet <service-name> 7687
# Network policies
kubectl get networkpolicies --all-namespaces
Collecting Diagnostic Information¶
Use this script to collect comprehensive diagnostic information:
#!/bin/bash
# neo4j-debug.sh - Collect diagnostic information
echo "=== Neo4j Kubernetes Operator Diagnostic Report ==="
echo "Generated: $(date)"
echo
echo "=== Cluster Resources ==="
kubectl get neo4jenterprisecluster
echo
echo "=== Standalone Resources ==="
kubectl get neo4jenterprisestandalone
echo
echo "=== Cluster Pods ==="
kubectl get pods -l neo4j.com/cluster=<cluster-name>
echo
echo "=== Standalone Pods ==="
kubectl get pods -l app=<standalone-name>
echo
echo "=== Services ==="
kubectl get svc -l app.kubernetes.io/name=neo4j
echo
echo "=== PVCs ==="
kubectl get pvc
echo
echo "=== ConfigMaps ==="
kubectl get configmap
echo
echo "=== Secrets ==="
kubectl get secret
echo
echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20
echo
echo "=== Operator Logs (last 100 lines) ==="
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager --tail=100
echo
echo "=== Storage Classes ==="
kubectl get storageclass
echo
echo "=== Node Resources ==="
kubectl describe nodes | grep -A 5 "Allocated resources:"
Getting Help¶
Support Resources¶
- Documentation: User Guide
- API Reference: Neo4jEnterpriseCluster, Neo4jEnterpriseStandalone
- Migration Guide: Migration Guide
- Community: Neo4j Community Forum
- Issues: GitHub Issues
When to Contact Support¶
Contact support when:
- Data corruption is suspected
- Cluster formation consistently fails
- Performance is significantly degraded
- Security incidents occur
- Migration issues cannot be resolved
Always provide the diagnostic report and specific error messages when contacting support.