Split-Brain Recovery Guide¶
This guide provides comprehensive troubleshooting and recovery procedures for Neo4j cluster split-brain scenarios when using the Neo4j Kubernetes Operator with server-based architecture.
Overview¶
Split-brain occurs when Neo4j cluster servers lose communication and form separate, independent clusters instead of one unified cluster. This can lead to data inconsistencies and cluster instability if not properly detected and resolved.
The Neo4j Kubernetes Operator includes automatic split-brain detection and repair to prevent and resolve these issues proactively.
Understanding Split-Brain Scenarios¶
What is Split-Brain?¶
Split-brain happens when:
1. Network partitions separate cluster servers
2. Servers cannot communicate with each other
3. Multiple independent "clusters" form within the same deployment
4. Each partition believes it is the authoritative cluster
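Neo4j clustering uses Raft consensus, so only the partition holding a majority of servers should be able to elect a leader and accept writes; a minority partition that keeps serving writes is the split-brain hazard. A quick sketch of the majority (quorum) size, plain shell arithmetic with no Neo4j specifics assumed:

```shell
# Majority (quorum) size for an N-server cluster: floor(N/2) + 1.
# A partition smaller than this should not be able to elect a leader.
for N in 3 4 5; do
  echo "servers=$N majority=$(( N / 2 + 1 ))"
done
```

This is why odd server counts (3, 5) are the usual recommendation: going from 4 to 5 servers raises fault tolerance, while going from 3 to 4 does not.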
Common Causes¶
- Network partitions between Kubernetes nodes
- Resource constraints causing pod communication failures
- DNS resolution issues preventing server discovery
- Storage problems affecting cluster state persistence
- Configuration errors in discovery or networking
Automatic Split-Brain Detection¶
The operator includes comprehensive split-brain detection that runs automatically during cluster health checks.
Detection Process¶
- Multi-Pod Analysis: Connects to each server pod individually
- Cluster View Comparison: Compares each server's view of cluster membership
- Inconsistency Detection: Identifies servers with conflicting cluster views
- Automatic Repair: Restarts orphaned pods to rejoin the main cluster
Detection Logs¶
Monitor operator logs for split-brain detection:
# Check for split-brain detection logs
kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i "split.*brain"
# Expected detection logs:
# Starting split-brain detection for cluster production-cluster, expectedServers: 3
# Split-brain analysis results: isSplitBrain: true, orphanedPods: 1, repairAction: RestartPods
# Split-brain automatically repaired by restarting orphaned pods: [production-cluster-server-2]
Kubernetes Events¶
The operator generates events for split-brain scenarios:
# Check for split-brain events
kubectl get events --field-selector reason=SplitBrainDetected
kubectl get events --field-selector reason=SplitBrainRepaired
# Example events:
# Warning SplitBrainDetected Neo4jEnterpriseCluster/production-cluster Split-brain detected: 1 orphaned servers
# Normal SplitBrainRepaired Neo4jEnterpriseCluster/production-cluster Split-brain repaired: restarted orphaned pods
Manual Split-Brain Detection¶
Verify Cluster Health¶
1. Check Server Status: Connect to a server pod and run `SHOW SERVERS` to list cluster members and their states.
2. Compare Cluster Views: Look for inconsistencies in server lists between different pods. In a healthy cluster, all servers should see the same cluster membership.
3. Check Database Allocation: Run `SHOW DATABASES` and confirm that database allocations and statuses are consistent across servers.
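A minimal way to compare cluster views is to capture each pod's server list and test them for equality. The sketch below uses hard-coded sample output so it is self-contained; in practice each variable would be filled from `kubectl exec <pod> -- cypher-shell ... "SHOW SERVERS YIELD name ORDER BY name"` (pod names are illustrative):

```shell
# Two servers' views of cluster membership (sample data; normally captured
# per pod via kubectl exec + cypher-shell "SHOW SERVERS YIELD name ORDER BY name").
view_server0='server-0
server-1
server-2'
view_server2='server-0
server-2'

# Identical ordered lists mean a consistent view; any difference is a red flag.
if [ "$view_server0" = "$view_server2" ]; then
  echo "views consistent"
else
  echo "SPLIT-BRAIN SUSPECT: cluster views differ"
fi
```

Ordering the query output (`ORDER BY name`) matters: it makes the string comparison meaningful regardless of which server answers first.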
Identify Split-Brain Symptoms¶
Indicators of Split-Brain:
- Different server counts reported by different pods
- Inconsistent database allocations across servers
- Some servers showing as "offline" from others' perspectives
- Database creation failures with "insufficient servers" errors
- Application connection failures to some databases
Repair Strategies¶
Automatic Repair (Recommended)¶
The operator automatically repairs split-brain scenarios by:
- Detection: Identifying orphaned servers with inconsistent cluster views
- Analysis: Determining the main cluster and orphaned servers
- Restart: Gracefully restarting orphaned pods to rejoin the main cluster
- Verification: Confirming successful cluster reformation
No manual intervention is required; the operator handles this automatically.
Manual Repair Procedures¶
If automatic repair fails or you need to intervene manually:
1. Identify the Main Cluster¶
# Check which partition has the majority of servers
kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
"SHOW SERVERS YIELD name, state ORDER BY name"
# Count active servers in each partition
kubectl exec production-cluster-server-1 -- cypher-shell -u neo4j -p password \
"SHOW SERVERS YIELD name, state ORDER BY name"
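To turn the raw query output into a per-partition count, drop the header row and count the remaining lines. The sample below uses captured `--format plain` output so it runs standalone; in a live cluster the text would come from the `kubectl exec` commands above:

```shell
# Sample `cypher-shell --format plain` output: one header row, then one row per server.
out='name, state
"production-cluster-server-0", "Enabled"
"production-cluster-server-1", "Enabled"'

# Drop the header with tail, count data rows; arithmetic expansion strips wc's padding.
count=$(( $(echo "$out" | tail -n +2 | wc -l) ))
echo "servers visible from this pod: $count"
```

The partition whose pods report the larger, mutually consistent count is the main cluster; pods reporting a smaller or conflicting list are the orphans to restart.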
2. Restart Orphaned Servers¶
# Restart the server(s) that show inconsistent cluster views
kubectl delete pod production-cluster-server-2
# Wait for pod to restart and rejoin
kubectl wait --for=condition=Ready pod/production-cluster-server-2 --timeout=300s
3. Verify Cluster Recovery¶
# Confirm all servers show consistent cluster membership
kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
"SHOW SERVERS YIELD name, state, health ORDER BY name"
# Check database status
kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
"SHOW DATABASES"
4. Force Cluster Reformation (Last Resort)¶
If standard restart doesn't work, use cluster-wide restart:
# Delete all server pods simultaneously (data preserved in PVCs)
kubectl delete pods -l app.kubernetes.io/name=neo4j,neo4j.com/cluster=production-cluster
# Monitor cluster reformation
kubectl get pods -l app.kubernetes.io/name=neo4j -w
⚠️ Warning: Cluster-wide restart should only be used as a last resort and may cause temporary service interruption.
Prevention Strategies¶
Network Resilience¶
- Node Affinity Configuration: Use pod anti-affinity so that no two server pods are scheduled on the same Kubernetes node, removing single-node failures as a partition cause.
- Multi-Zone Deployment: Spread servers across availability zones so that the loss of one zone cannot isolate a majority of the cluster.
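A hedged sketch of what such scheduling constraints could look like on the cluster spec. The `podSpec` field name and label keys here are assumptions following common operator conventions; check your CRD schema for the actual field names before using this:

```yaml
spec:
  podSpec:                     # hypothetical field name; verify against your CRD
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            neo4j.com/cluster: production-cluster
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                neo4j.com/cluster: production-cluster
```

The anti-affinity rule keeps servers on separate nodes; the spread constraint balances them across zones without blocking scheduling when zones are unbalanced.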
Resource Allocation¶
spec:
  resources:
    requests:
      memory: "4Gi"  # Adequate memory to prevent OOM
      cpu: "2"
    limits:
      memory: "8Gi"
      cpu: "4"
Network Configuration¶
spec:
  config:
    # Optimize discovery resolution timeout (operator uses LIST discovery with static pod FQDNs)
    dbms.cluster.discovery.resolution_timeout: "30s"
    # Cluster communication resilience (Neo4j 5.26+)
    dbms.cluster.raft.election_timeout: "7s"
    dbms.cluster.raft.leader_failure_detection_window: "30s"
Monitoring and Alerting¶
Prometheus Metrics¶
Monitor these key metrics for early split-brain detection:
# Cluster health metrics
neo4j_cluster_servers_total
neo4j_cluster_servers_online
neo4j_database_allocation_inconsistency
# Alert rules
groups:
  - name: neo4j.split-brain
    rules:
      - alert: Neo4jSplitBrainDetected
        expr: neo4j_cluster_servers_online < neo4j_cluster_servers_total
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Neo4j cluster has fewer online servers than expected (possible split-brain)"
          description: "Cluster {{ $labels.cluster }} reports only {{ $value }} online servers, fewer than the expected total"
Log Monitoring¶
Set up log monitoring for split-brain events:
# Alert on split-brain detection logs
kubectl logs -f -n neo4j-operator-system deployment/neo4j-operator-controller-manager | \
grep -E "(split.*brain|Split.*Brain)" --line-buffered | \
while read line; do
echo "ALERT: $line"
# Send to monitoring system
done
Health Check Automation¶
#!/bin/bash
# Automated cluster health check script

CLUSTER_NAME="production-cluster"
NAMESPACE="default"

check_cluster_health() {
  local expected_servers=3
  local consistent_views=0

  for i in $(seq 0 $((expected_servers - 1))); do
    # --format plain emits a header row, so drop it before counting
    local server_count=$(kubectl exec ${CLUSTER_NAME}-server-$i -n $NAMESPACE -- \
      cypher-shell -u neo4j -p password --format plain \
      "SHOW SERVERS YIELD name" 2>/dev/null | tail -n +2 | wc -l)
    if [ "$server_count" -eq "$expected_servers" ]; then
      ((consistent_views++))
    fi
  done

  if [ "$consistent_views" -eq "$expected_servers" ]; then
    echo "✅ Cluster health: OK"
    return 0
  else
    echo "❌ Split-brain detected: $consistent_views/$expected_servers servers have consistent views"
    return 1
  fi
}

# Run health check
if ! check_cluster_health; then
  echo "🔄 Triggering operator reconciliation..."
  kubectl annotate neo4jenterprisecluster $CLUSTER_NAME -n $NAMESPACE \
    "troubleshooting.neo4j.com/reconcile=$(date +%s)" --overwrite
fi
Troubleshooting Common Issues¶
Split-Brain Detection Not Working¶
1. Check Operator Logs: Inspect the operator controller logs for detection activity and errors.
2. Verify RBAC Permissions: Confirm the operator's service account can list pods and exec into them in the cluster namespace.
3. Check Neo4j Connectivity: Verify the operator can reach each server pod's Bolt endpoint with the cluster credentials.
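These checks map to commands like the following. The namespace and deployment names match the examples earlier in this guide, but the operator service-account name is an assumption; the sample log text exists only to make the grep self-contained:

```shell
# 1. Check operator logs for detection activity (live form:
#    kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager)
sample_logs='Reconciling Neo4jEnterpriseCluster production-cluster
Starting split-brain detection for cluster production-cluster, expectedServers: 3'
matches=$(echo "$sample_logs" | grep -ci "split.*brain")
echo "split-brain log lines: $matches"

# 2. Verify RBAC: can the operator service account exec into pods? e.g.
#    kubectl auth can-i create pods/exec -n default \
#      --as=system:serviceaccount:neo4j-operator-system:neo4j-operator-controller-manager

# 3. Check connectivity: run a trivial query through each server pod, e.g.
#    kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password "RETURN 1"
```

Zero matching log lines over a full reconcile cycle suggests detection is not running at all, which points at RBAC or connectivity rather than the detection logic itself.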
False Split-Brain Detection¶
If the operator incorrectly identifies split-brain:
1. Check Resource Constraints: Look for CPU throttling or memory pressure that slows server responses and mimics partition symptoms.
2. Verify Network Connectivity: Test pod-to-pod connectivity on the discovery and cluster communication ports.
3. Review Configuration: Confirm discovery settings and timeouts match the deployed topology.
Recovery Failures¶
If automatic recovery fails:
1. Check Pod Status: Inspect restarted pods for crash loops, image pull errors, or scheduling failures.
2. Review Events: Examine Kubernetes events on the cluster resource and its pods for repair errors.
3. Inspect Storage: Verify PVCs are bound and volumes are mounted correctly on the affected servers.
Emergency Recovery Procedures¶
Complete Cluster Reset¶
⚠️ Use only as a last resort - may cause data loss
# 1. Scale down the cluster
kubectl patch neo4jenterprisecluster production-cluster --type='json' \
-p='[{"op": "replace", "path": "/spec/topology/servers", "value": 0}]'
# 2. Wait for pods to terminate
kubectl wait --for=delete pod -l neo4j.com/cluster=production-cluster --timeout=300s
# 3. Clean up cluster state (if necessary)
# Note: This may cause data loss - only do if cluster is completely corrupted
# kubectl delete pvc -l neo4j.com/cluster=production-cluster,neo4j.com/role=server
# 4. Scale back up
kubectl patch neo4jenterprisecluster production-cluster --type='json' \
-p='[{"op": "replace", "path": "/spec/topology/servers", "value": 3}]'
# 5. Monitor recovery
kubectl get pods -l neo4j.com/cluster=production-cluster -w
Data Recovery from Backups¶
If split-brain causes data corruption:
# 1. Create restoration cluster
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jEnterpriseCluster
metadata:
  name: recovery-cluster
spec:
  topology:
    servers: 3
  # ... same configuration as original cluster
EOF
# 2. Restore from backup
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jRestore
metadata:
  name: split-brain-recovery
spec:
  clusterRef: recovery-cluster
  backupRef: latest-backup
  options:
    force: true
EOF
# 3. Verify data integrity
kubectl exec recovery-cluster-server-0 -- cypher-shell -u neo4j -p password \
"MATCH (n) RETURN count(n) as node_count"
Best Practices Summary¶
- Prevention:
  - Use adequate resource allocation
  - Deploy across multiple zones
  - Configure proper network policies
  - Monitor cluster health continuously
- Detection:
  - Rely on automatic split-brain detection
  - Set up monitoring and alerting
  - Perform regular health checks
- Recovery:
  - Trust automatic repair mechanisms
  - Manual intervention only when necessary
  - Always verify cluster health after recovery
- Monitoring:
  - Monitor operator logs for split-brain events
  - Set up Kubernetes event alerting
  - Track cluster consistency metrics
For additional troubleshooting help, see:
- General Troubleshooting Guide
- TLS Configuration Issues
- Performance Troubleshooting
- Backup/Restore Issues