Split-Brain Recovery Quick Reference
Fast reference guide for detecting and recovering from Neo4j cluster split-brain scenarios.
Quick Detection
# Check cluster consistency across all servers
for i in 0 1 2; do
  echo "=== Server $i ==="
  kubectl exec cluster-server-$i -- cypher-shell -u neo4j -p password \
    "SHOW SERVERS YIELD name, state ORDER BY name"
done
✅ Healthy: All servers show the same server list
❌ Split-Brain: Different servers show different server lists
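The comparison above can be automated by reducing each pod's server list to a digest and checking that all digests match. The following is a minimal sketch, not part of the operator: `views_consistent` is a hypothetical helper name, and the commented wiring assumes the pod names and credentials used in the loop above.

```shell
# Hypothetical helper (not part of the operator): reads one "pod digest" pair
# per line on stdin and reports whether every pod produced the same digest.
views_consistent() {
  local distinct
  # Count the distinct values in column 2.
  distinct=$(awk '!seen[$2]++ { n++ } END { print n + 0 }')
  case "$distinct" in
    0) echo "NO-DATA"; return 2 ;;
    1) echo "HEALTHY" ;;
    *) echo "SPLIT-BRAIN"; return 1 ;;
  esac
}

# Example wiring (assumes the pod names and credentials shown above):
# for i in 0 1 2; do
#   digest=$(kubectl exec "cluster-server-$i" -- cypher-shell -u neo4j -p password \
#     --format plain "SHOW SERVERS YIELD name ORDER BY name" | cksum | awk '{print $1}')
#   echo "cluster-server-$i $digest"
# done | views_consistent
```

With this wiring, a pod that cannot be reached produces the digest of empty output, so it is reported as a differing view rather than silently ignored.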
Automatic Recovery
The Neo4j Kubernetes Operator automatically detects and repairs split-brain scenarios:
Monitor Auto-Recovery
# Watch operator logs for split-brain detection
kubectl logs -f -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i "split.*brain"
# Check for split-brain events
kubectl get events --field-selector reason=SplitBrainDetected
kubectl get events --field-selector reason=SplitBrainRepaired
Expected Auto-Recovery Logs
Starting split-brain detection for cluster production-cluster, expectedServers: 3
Split-brain analysis results: isSplitBrain: true, orphanedPods: 1
Split-brain automatically repaired by restarting orphaned pods: [cluster-server-2]
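To see at a glance which pods the operator restarted, the pod names can be pulled out of the "repaired" log line. This is a sketch that assumes the log wording shown above; `orphaned_pods_from_log` is a hypothetical helper, and since the separator used for multiple pods is not documented here, it accepts either spaces or commas.

```shell
# Hypothetical helper: reads operator log lines on stdin and prints the pod
# names listed after "restarting orphaned pods:", one per line.
orphaned_pods_from_log() {
  # Keep only the text between the square brackets, then split it on
  # commas and/or spaces (the separator is an assumption).
  sed -n 's/.*restarting orphaned pods: \[\([^]]*\)].*/\1/p' | tr -s ', ' '\n'
}

# Example wiring (operator namespace and deployment as used above):
# kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager \
#   | orphaned_pods_from_log | sort -u
```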
Manual Recovery (If Auto-Recovery Fails)
1. Identify Main Cluster
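# Count servers visible to each pod (tail -n +2 skips the header row cypher-shell prints)
kubectl exec cluster-server-0 -- cypher-shell -u neo4j -p password --format plain "SHOW SERVERS" | tail -n +2 | wc -l
kubectl exec cluster-server-1 -- cypher-shell -u neo4j -p password --format plain "SHOW SERVERS" | tail -n +2 | wc -l
kubectl exec cluster-server-2 -- cypher-shell -u neo4j -p password --format plain "SHOW SERVERS" | tail -n +2 | wc -l
# Pods reporting the highest count normally belong to the main cluster; pods reporting fewer servers are the orphaned ones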
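Once the per-pod counts are collected, picking out the orphaned pods is a matter of comparing each count with the largest one. The sketch below illustrates that comparison; `find_orphans` is a hypothetical helper, and it assumes the main cluster is the partition whose members see the most servers.

```shell
# Hypothetical helper: reads "pod count" pairs on stdin and prints the pods
# whose visible-server count is below the largest count seen.
find_orphans() {
  awk '
    { pod[NR] = $1; cnt[NR] = $2 + 0; if (cnt[NR] > max) max = cnt[NR] }
    END { for (i = 1; i <= NR; i++) if (cnt[i] < max) print pod[i] }
  '
}

# Example: two pods see three servers, one pod sees only itself.
# printf 'cluster-server-0 3\ncluster-server-1 3\ncluster-server-2 1\n' | find_orphans
```

Because the comparison is relative, it gives the same answer whether or not the counts include the cypher-shell header row, as long as every count was produced the same way.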
2. Restart Orphaned Servers
# Restart the server(s) with inconsistent views
kubectl delete pod cluster-server-X
# Wait for rejoin
kubectl wait --for=condition=Ready pod/cluster-server-X --timeout=300s
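When more than one server is orphaned, the delete-and-wait pair above can be wrapped in a loop that handles one pod at a time. This is a sketch under assumptions: `restart_orphans` is a hypothetical helper, and it assumes the pods are recreated under the same name (as a StatefulSet does). It polls until the replacement pod exists, because `kubectl wait` fails immediately if the pod object has not been recreated yet.

```shell
# Hypothetical helper: restarts each named pod in turn and waits for it to
# become Ready before touching the next one.
restart_orphans() {
  local pod
  for pod in "$@"; do
    kubectl delete pod "$pod" || return 1
    # Wait for the replacement pod object to appear (assumes same-name recreation).
    until kubectl get pod "$pod" >/dev/null 2>&1; do
      sleep 2
    done
    kubectl wait --for=condition=Ready "pod/$pod" --timeout=300s || return 1
  done
}

# Example: restart_orphans cluster-server-2
```

Restarting sequentially keeps the rest of the cluster untouched while each orphaned server rejoins, and the function stops at the first pod that fails to become Ready.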
3. Verify Recovery
# All servers should now show consistent cluster membership
kubectl exec cluster-server-0 -- cypher-shell -u neo4j -p password \
  "SHOW SERVERS YIELD name, state ORDER BY name"
Emergency Procedures
Force Full Cluster Restart
⚠️ Use only if individual pod restart fails
# Delete all server pods (data preserved in PVCs)
kubectl delete pods -l app.kubernetes.io/name=neo4j,neo4j.com/cluster=CLUSTER_NAME
# Monitor reformation
kubectl get pods -l app.kubernetes.io/name=neo4j -w
Trigger Operator Reconciliation
# Force operator to re-examine cluster with a no-op annotation change
kubectl annotate neo4jenterprisecluster CLUSTER_NAME \
  "troubleshooting.neo4j.com/reconcile=$(date +%s)" --overwrite
Common Symptoms
| Symptom | Indicates Split-Brain |
|---|---|
| Different server counts per pod | ✅ |
| "Insufficient servers" database errors | ✅ |
| Some databases unreachable | ✅ |
| Inconsistent SHOW DATABASES output | ✅ |
| Application connection failures | ⚠️ Possible |
Prevention Quick Tips
Resource Allocation
Multi-Zone Deployment
spec:
  topology:
    servers: 3
    placement:
      topologySpread:
        enabled: true
        topologyKey: topology.kubernetes.io/zone
        maxSkew: 1
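With `maxSkew: 1`, the pod counts of any two zones should differ by at most one. The sketch below shows one way to check the actual spread from a list of pod-to-zone pairs; `zone_skew` is a hypothetical helper, and the optional argument (the number of zones you expect to be used) is an assumption added so that an empty zone counts as zero.

```shell
# Hypothetical helper: reads "pod zone" pairs on stdin and prints the skew,
# i.e. the difference between the most- and least-populated zones.
# Optional argument: the number of zones expected to host pods.
zone_skew() {
  awk -v zones="${1:-0}" '
    { n[$2]++ }
    END {
      count = 0
      for (z in n) {
        if (count == 0 || n[z] > max) max = n[z]
        if (count == 0 || n[z] < min) min = n[z]
        count++
      }
      # A zone that is expected but hosts no pods counts as zero.
      if (count < zones) min = 0
      print max - min
    }
  '
}

# Example wiring (label selector as used in the restart command above):
# kubectl get pods -l neo4j.com/cluster=CLUSTER_NAME \
#   -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers |
# while read -r pod node; do
#   zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
#   echo "$pod $zone"
# done | zone_skew 3
```

A result greater than the configured `maxSkew` suggests the servers are not spread as intended, for example because the pods were scheduled before the constraint was enabled.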
Network Resilience
spec:
  config:
    # RAFT tuning (LIST discovery; no K8S API polling refresh needed)
    dbms.cluster.raft.election_timeout: "7s" # Neo4j 5.26+
Monitoring Commands
#!/bin/bash
# Health check script
CLUSTER="production-cluster"
EXPECTED=3
for i in $(seq 0 $((EXPECTED-1))); do
  # tail -n +2 drops the header row so COUNT is the number of servers, not lines
  COUNT=$(kubectl exec "${CLUSTER}-server-$i" -- cypher-shell -u neo4j -p password \
    --format plain "SHOW SERVERS" 2>/dev/null | tail -n +2 | wc -l)
  echo "Server $i sees $COUNT servers"
  [ "$COUNT" -ne "$EXPECTED" ] && echo "⚠️ Split-brain detected!"
done
Quick Troubleshooting
| Issue | Command | Solution |
|---|---|---|
| Can't connect to Neo4j | kubectl exec cluster-server-0 -- cypher-shell -u neo4j -p password "RETURN 1" | Check credentials/network |
| Pod not ready | kubectl describe pod cluster-server-0 | Check resources/storage |
| Operator not responding | kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | Check operator health |
| RBAC issues | kubectl auth can-i create pods/exec --as=system:serviceaccount:neo4j-operator-system:operator | Fix permissions |
Emergency Contacts
When automatic recovery fails:
1. Check operator logs first
2. Try manual pod restart
3. Full cluster restart if necessary
4. Restore from backup as last resort
⚠️ Remember: The operator handles 99% of split-brain scenarios automatically. Manual intervention should be rare.
For detailed procedures, see: Split-Brain Recovery Guide