# Backup & Restore Troubleshooting Guide

This guide covers common issues with Neo4j backup and restore operations when using the Neo4j Kubernetes Operator.
## Overview

The Neo4j Kubernetes Operator provides comprehensive backup and restore capabilities including:

- Automated backups with scheduling and retention policies
- Point-in-Time Recovery (PITR) for Neo4j 2025.x
- Multi-cloud storage support (S3, GCS, Azure Blob)
- Centralized backup pod for clusters (standalone uses a backup sidecar)
- Automatic RBAC management for backup operations
## Common Backup Issues

### Backup Job Failures

#### Symptom: Backup job fails to start

Diagnosis:
```bash
# Check backup resource status
kubectl get neo4jbackup
kubectl describe neo4jbackup production-backup

# Check operator logs for backup controller errors
kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i backup

# Verify RBAC permissions (default backup job service account)
kubectl auth can-i create pods/exec --as=system:serviceaccount:<namespace>:neo4j-backup-sa
```
Common Causes & Solutions:

- Missing RBAC Permissions:

  ```bash
  # The operator automatically creates RBAC - check if it exists
  kubectl get serviceaccount neo4j-backup-sa
  kubectl get role neo4j-backup-role
  kubectl get rolebinding neo4j-backup-rolebinding

  # If missing, trigger operator reconciliation with a no-op annotation change
  kubectl annotate neo4jenterprisecluster production-cluster troubleshooting.neo4j.com/reconcile="$(date +%s)" --overwrite
  ```

- Storage Configuration Issues

- Cluster Reference Problems
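For the storage and cluster-reference causes, it helps to compare your manifest against a minimal known-good shape. The sketch below is an assumption about the `Neo4jBackup` field layout (only the `schedule` field and the `cloud`/`identity` settings appear elsewhere in this guide); verify each field name against your installed CRD:

```yaml
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jBackup
metadata:
  name: production-backup
spec:
  # Must reference an existing Neo4jEnterpriseCluster in the same namespace
  clusterRef: production-cluster
  # Provider and bucket must point at reachable, supported storage
  # (S3, GCS, or Azure Blob)
  cloud:
    provider: aws
    bucket: "s3://your-backup-bucket/backups"
```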
#### Symptom: Backup job starts but fails during execution

Diagnosis:
```bash
# Check backup job logs
kubectl logs job/production-backup-$(date +%Y%m%d)-001

# Check centralized backup pod logs (clusters)
kubectl logs production-cluster-backup-0 -c backup

# Check Neo4j server logs for backup-related errors
kubectl logs production-cluster-server-0 -c neo4j | grep -i backup
```
Common Solutions:

- Insufficient Disk Space
- Database Lock Issues
- Memory Issues in Backup Process: Backup pod resources are fixed by the operator. Prefer off-peak scheduling, a smaller backup scope, or larger cluster nodes.
## Cloud Storage Issues

### S3 Backup Failures

Authentication Issues:
```bash
# Check AWS credentials using the backup service account (default: neo4j-backup-sa)
# (kubectl run's --serviceaccount flag was removed in kubectl 1.24; use --overrides)
kubectl run backup-auth-check --rm -it --image=amazon/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"<backup-serviceaccount>"}}' \
  -- aws sts get-caller-identity

# Test S3 access
kubectl run backup-auth-check --rm -it --image=amazon/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"<backup-serviceaccount>"}}' \
  -- aws s3 ls s3://your-backup-bucket/
```
Solutions:

1. IAM Role Issues:

   ```yaml
   # Use IAM roles for service accounts (IRSA)
   spec:
     backups:
       cloud:
         provider: aws
         identity:
           provider: aws
           autoCreate:
             enabled: true
             annotations:
               eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/Neo4jBackupRole"
   ```
2. Bucket Policy Problems:

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::123456789:role/Neo4jBackupRole"
         },
         "Action": [
           "s3:GetObject",
           "s3:PutObject",
           "s3:DeleteObject",
           "s3:ListBucket"
         ],
         "Resource": [
           "arn:aws:s3:::your-backup-bucket",
           "arn:aws:s3:::your-backup-bucket/*"
         ]
       }
     ]
   }
   ```
### Google Cloud Storage Issues

Service Account Problems:
```bash
# Check GCP credentials using the backup service account (default: neo4j-backup-sa)
# (kubectl run's --serviceaccount flag was removed in kubectl 1.24; use --overrides)
kubectl run backup-auth-check --rm -it --image=google/cloud-sdk:slim \
  --overrides='{"spec":{"serviceAccountName":"<backup-serviceaccount>"}}' \
  -- gcloud auth list

# Test GCS access
kubectl run backup-auth-check --rm -it --image=google/cloud-sdk:slim \
  --overrides='{"spec":{"serviceAccountName":"<backup-serviceaccount>"}}' \
  -- gsutil ls gs://your-backup-bucket/
```
Solutions:

```yaml
# Use Workload Identity
spec:
  backups:
    cloud:
      provider: gcp
      identity:
        provider: gcp
        autoCreate:
          enabled: true
          annotations:
            iam.gke.io/gcp-service-account: "neo4j-backup@project.iam.gserviceaccount.com"
```
### Azure Blob Storage Issues

Authentication Problems:
```bash
# Check Azure credentials using the backup service account (default: neo4j-backup-sa)
# (kubectl run's --serviceaccount flag was removed in kubectl 1.24; use --overrides)
kubectl run backup-auth-check --rm -it --image=mcr.microsoft.com/azure-cli \
  --overrides='{"spec":{"serviceAccountName":"<backup-serviceaccount>"}}' \
  -- az account show

# Test storage access
kubectl run backup-auth-check --rm -it --image=mcr.microsoft.com/azure-cli \
  --overrides='{"spec":{"serviceAccountName":"<backup-serviceaccount>"}}' \
  -- az storage blob list --account-name storageaccount --container-name backups
```
## Scheduled Backup Issues

### Symptom: Scheduled backups not running

Diagnosis:
```bash
# Check CronJob status
kubectl get cronjob
kubectl describe cronjob production-backup-schedule

# Check backup schedule configuration
kubectl get neo4jbackup production-backup -o yaml | grep -A 10 schedule
```
Common Solutions:

- Invalid Cron Expression
- Timezone Issues
- Backup Window Conflicts
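For the cron and timezone causes, a hedged sketch of a valid schedule block follows. The `schedule` field appears elsewhere in this guide; the `timezone` field is an assumption and may not exist in your CRD version — verify before relying on it:

```yaml
spec:
  schedule: "0 2 * * *"   # standard 5-field cron: daily at 02:00
  timezone: "UTC"         # hypothetical field - check your CRD schema
```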
## Common Restore Issues

### Restore Job Failures

#### Symptom: Restore job fails to start

Diagnosis:
```bash
# Check restore resource status
kubectl get neo4jrestore
kubectl describe neo4jrestore production-restore

# Check operator logs
kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i restore
```
Common Solutions:

- Invalid Backup Reference
- Target Cluster Issues
- Storage Access Problems
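A quick way to rule out the first two causes is to compare your manifest against a known-good shape. This sketch reuses the `Neo4jRestore` field layout shown in the emergency recovery procedures in this guide:

```yaml
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jRestore
metadata:
  name: production-restore
spec:
  clusterRef: production-cluster   # must name an existing, Ready cluster
  source:
    type: backup
    backupRef: production-backup   # must match a Neo4jBackup whose status is Succeeded
  databaseName: neo4j
```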
#### Symptom: Restore job fails during execution

Diagnosis:
```bash
# Check restore job logs
kubectl logs job/production-restore-$(date +%Y%m%d)

# Check target cluster logs during restore
kubectl logs target-cluster-server-0 | grep -i restore
```
Common Solutions:

- Insufficient Storage Space
- Database Already Exists
- Version Incompatibility
### Point-in-Time Recovery (PITR) Issues

#### Symptom: PITR restore fails with timestamp errors

Diagnosis:
```bash
# Check backup logs for transaction timestamps
kubectl logs job/production-backup-latest | grep -i "restore-until"

# Verify PITR capability
kubectl exec production-cluster-server-0 -- neo4j-admin database info system
```
Solutions:

- Invalid Timestamp Format
- Timestamp Outside Backup Range
- Neo4j Version Compatibility
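For the timestamp-format case: PITR targets in this guide use UTC ISO-8601 (`YYYY-MM-DDTHH:MM:SSZ`). A small sketch, assuming GNU `date`, that converts an epoch value (e.g. one read from transaction logs) into that form:

```shell
#!/bin/sh
# Convert an epoch timestamp to the UTC ISO-8601 form used by pointInTime
epoch=1736937000
point_in_time=$(date -u -d "@${epoch}" +"%Y-%m-%dT%H:%M:%SZ")
echo "$point_in_time"   # 2025-01-15T10:30:00Z
```

On BSD/macOS `date`, use `date -u -r "$epoch" ...` instead of `-d "@..."`.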
## Backup Pod Issues (Cluster)

### Backup Pod Problems

#### Symptom: Backup pod fails to start

Diagnosis:
```bash
# Check backup pod status
kubectl get pods -l neo4j.com/cluster=production-cluster -o wide
kubectl describe pod production-cluster-backup-0

# Check backup pod logs
kubectl logs production-cluster-backup-0 -c backup
```
Common Solutions:

- Resource Constraints: Backup pod resources are fixed by the operator. Prefer off-peak scheduling or larger cluster nodes if backups are OOM-killed.
- Storage Mount Issues
- Permission Problems
### Backup Request Processing Issues

#### Symptom: Backup requests not processed by backup pod

Diagnosis:
```bash
# Check backup request queue
kubectl exec production-cluster-backup-0 -c backup -- ls -la /backup-requests/

# Test manual backup request
kubectl exec production-cluster-backup-0 -c backup -- sh -c \
  'echo "{\"type\":\"FULL\"}" > /backup-requests/test.request'
```
Solutions:

- Request Format Issues
- Request Volume Problems
## Performance Issues

### Slow Backup Performance

Diagnosis:
```bash
# Monitor backup progress
kubectl logs job/production-backup-latest -f

# Check resource utilization during backup
kubectl top pod production-cluster-server-0
```
Optimization Strategies:

- Reduce primary load: Use database-specific backups and schedule during low-traffic windows.
- Avoid overlapping backups: Stagger `Neo4jBackup` schedules so only one job runs per cluster at a time.
- Storage Performance Tuning
- Network Optimization
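To illustrate the staggering advice, two `Neo4jBackup` resources for the same cluster can be given non-overlapping cron windows. The overall field layout here is an assumption (only the `schedule` field appears elsewhere in this guide); adjust to your CRD schema:

```yaml
# Nightly backup at 02:00 and a second backup at 04:00,
# so only one job runs against production-cluster at a time
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jBackup
metadata:
  name: production-backup-nightly
spec:
  clusterRef: production-cluster
  schedule: "0 2 * * *"
---
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jBackup
metadata:
  name: production-backup-secondary
spec:
  clusterRef: production-cluster
  schedule: "0 4 * * *"
```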
### Slow Restore Performance

Optimization:

- Target Cluster Resources
- Storage Configuration
## Monitoring and Alerting

### Backup Health Monitoring

Prometheus Metrics:
```text
# Monitor backup success rate
neo4j_backup_success_total
neo4j_backup_failure_total
neo4j_backup_duration_seconds
```

```yaml
# Alert rules
groups:
  - name: neo4j-backup
    rules:
      - alert: BackupFailure
        expr: increase(neo4j_backup_failure_total[24h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Neo4j backup failed"
          description: "Backup for cluster {{ $labels.cluster }} failed"
```
Log Monitoring:

```bash
# Monitor backup logs
kubectl logs -f job/production-backup-latest | grep -E "(ERROR|WARN|SUCCESS)"

# Set up log alerts
kubectl logs -f -n neo4j-operator-system deployment/neo4j-operator-controller-manager | \
  grep -i "backup.*failed" --line-buffered | \
  while read line; do
    echo "BACKUP ALERT: $line"
    # Send to alerting system
  done
```
### Backup Validation

Automated Validation Script:
```bash
#!/bin/bash
# Validate backup completeness
BACKUP_NAME="production-backup"
NAMESPACE="default"

validate_backup() {
    local backup_status
    backup_status=$(kubectl get neo4jbackup "$BACKUP_NAME" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
    if [ "$backup_status" != "Succeeded" ]; then
        echo "❌ Backup failed or incomplete: $backup_status"
        return 1
    fi

    # Check backup size (default to 0 if the field is empty)
    local backup_size
    backup_size=$(kubectl get neo4jbackup "$BACKUP_NAME" -n "$NAMESPACE" -o jsonpath='{.status.backupSize}')
    if [ "${backup_size:-0}" -lt 1000000 ]; then  # Less than 1MB
        echo "⚠️ Backup size suspiciously small: $backup_size bytes"
    fi

    echo "✅ Backup validation passed"
    return 0
}

# Run validation
validate_backup
```
## Emergency Recovery Procedures

### Complete Database Recovery

Scenario: The primary database is corrupted and a complete restore is needed.
```bash
# 1. Create new cluster for restoration
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jEnterpriseCluster
metadata:
  name: recovery-cluster
spec:
  topology:
    servers: 3
  # Use same configuration as original cluster
  storage:
    className: "fast-ssd"
    size: "1Ti"
EOF

# 2. Wait for cluster to be ready
kubectl wait --for=condition=Ready neo4jenterprisecluster/recovery-cluster --timeout=600s

# 3. Restore from latest backup
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jRestore
metadata:
  name: emergency-restore
spec:
  clusterRef: recovery-cluster
  source:
    type: backup
    backupRef: production-backup-latest
  databaseName: neo4j
  force: true
EOF

# 4. Monitor restore progress
kubectl logs -f job/emergency-restore

# 5. Verify data integrity
kubectl exec recovery-cluster-server-0 -- cypher-shell -u neo4j -p password \
  "MATCH (n) RETURN count(n) as total_nodes"
```
### Point-in-Time Emergency Recovery

```bash
# Restore to specific point before corruption
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1beta1
kind: Neo4jRestore
metadata:
  name: pitr-emergency-restore
spec:
  clusterRef: recovery-cluster
  source:
    type: pitr
    backupRef: production-backup-latest
    pointInTime: "2025-01-15T10:30:00Z"  # Before corruption occurred
  databaseName: neo4j
  force: true
EOF
```
## Best Practices Summary

### Backup Best Practices
- Regular Testing: Test backup and restore procedures regularly
- Multiple Storage Locations: Store backups in multiple locations/regions
- Retention Policies: Implement appropriate retention policies
- Monitoring: Set up comprehensive backup monitoring and alerting
- Documentation: Document recovery procedures and test them
- Security: Encrypt backups and use secure storage access
### Restore Best Practices
- Validation: Always validate restored data integrity
- Staging Environment: Test restores in staging before production
- Downtime Planning: Plan for service interruption during restore
- Data Consistency: Ensure cluster consistency after restore
- Application Testing: Test applications after database restore
### Performance Best Practices
- Resource Allocation: Adequate resources for backup/restore operations
- Storage Performance: Use high-performance storage for operations
- Network Optimization: Optimize network for data transfer
- Scheduling: Schedule backups during low-activity periods
- Parallel Operations: Use parallelism where possible
For additional help, see:

- Backup & Restore Guide
- Performance Tuning
- Security Best Practices
- Split-Brain Recovery