Monitoring¶

This guide explains how to expose Neo4j metrics for Prometheus and where the operator wires things for you. For a complete end-to-end setup with Grafana dashboards and alerting, see the Prometheus and Grafana Setup Guide.

Enable metrics via `spec.monitoring`¶

The operator uses Neo4j's built-in Prometheus endpoint (see https://neo4j.com/docs/operations-manual/current/monitoring/metrics/expose/). When spec.monitoring.enabled is true, the operator:

Enables the Neo4j Prometheus endpoint (server.metrics.prometheus.enabled=true).
Binds it to 0.0.0.0:2004 (safe in Kubernetes — pod network isolation prevents external access).
Exposes port 2004 on the Neo4j container.
Adds prometheus.io/* annotations for scrape-based setups.
Disables CSV metrics export (server.metrics.csv.enabled=false) to avoid unnecessary disk usage.

Cluster example¶

spec:
  monitoring:
    enabled: true

For clusters, the operator also creates: - A Service named <cluster>-metrics (port 2004) - A ServiceMonitor named <cluster>-monitoring (if Prometheus Operator CRDs are available)

Standalone example¶

spec:
  monitoring:
    enabled: true

For standalone deployments, the metrics port is added to the <standalone>-service Service and a ServiceMonitor named <standalone>-monitoring is created automatically.

Full configuration example¶

spec:
  monitoring:
    enabled: true
    slowQueryThreshold: "5s"       # Log queries slower than this (maps to db.logs.query.threshold)
    queryLogLevel: "INFO"          # OFF, INFO, or VERBOSE (maps to db.logs.query.enabled)
    obfuscateLiterals: true        # Mask literal values in query logs (recommended in production)
    explainPlan: false             # Include execution plan in logs (performance impact — avoid in production)
    metricsFilter: "*"             # Enable all Neo4j metrics (default: subset only)
    metricsPrefix: "neo4j"         # Custom prefix for metric names

Prometheus scraping¶

Prometheus Operator¶

The operator auto-creates ServiceMonitor resources for both cluster and standalone deployments when monitoring.enabled: true. These target the metrics Service on port metrics (2004) with a 30-second scrape interval. No manual ServiceMonitor creation is needed.

Standard Prometheus¶

Add a scrape config that targets the metrics Service:

scrape_configs:
  - job_name: neo4j
    metrics_path: /metrics
    static_configs:
      - targets:
          - <cluster>-metrics.<namespace>.svc.cluster.local:2004

Note: for standalone deployments, scrape <standalone>-service.<namespace>.svc.cluster.local:2004 (the metrics port is added to the same Service that serves Bolt/HTTP).

Customizing metrics settings¶

Metrics filter¶

By default, Neo4j only exposes a subset of its metrics. To enable all metrics or select specific categories:

spec:
  monitoring:
    enabled: true
    metricsFilter: "*"  # Enable all metrics

Or select specific categories with glob patterns:

spec:
  monitoring:
    enabled: true
    metricsFilter: "*bolt*,*transaction*,*page_cache*,*cluster.raft*"

Query log security¶

In production environments, enable literal obfuscation to prevent sensitive data (passwords, PII) from appearing in query logs:

spec:
  monitoring:
    enabled: true
    obfuscateLiterals: true
    queryLogLevel: "INFO"           # Log only slow queries, not all queries
    slowQueryThreshold: "2s"

Version-specific metrics¶

Neo4j metric names are identical across 5.26.x and 2025.x+ CalVer releases, but some metrics are only available in newer versions:

Metric Category	5.26.x	2025.x+
Core metrics (Bolt, transactions, page cache, JVM)	Yes	Yes
CPU usage (`vm.cpu_load.*`)	No	2025.01+
Raft snapshot metrics (`cluster.raft.snapshot_*`)	No	2025.01+
Virtual threads (`vm.threads.virtual`)	No	2025.05+
Raft election/queue metrics	No	2025.02+
Store copy download metrics	No	2025.02+
Deadlock rollback counter	No	2026.01+
Page cache async IO metrics	No	2026.02+
Discovery v1 metrics (`cluster.discovery.cluster.*`)	Deprecated	Removed

Prometheus metric naming¶

Neo4j converts metric names for Prometheus: dots become underscores, counter metrics get a _total suffix. The # HELP comment preserves the original name.

Examples: - neo4j.dbms.bolt.connections_running → neo4j_dbms_bolt_connections_running - neo4j.database.<db>.transaction.committed → neo4j_database_<db>_transaction_committed_total - neo4j.page_cache.hit_ratio → neo4j_page_cache_hit_ratio

Aura Fleet Management (cloud monitoring)¶

For a hosted monitoring experience, you can register your deployment with Neo4j Aura Fleet Management. This lets you view topology, status, and metrics for all self-managed Neo4j instances alongside your Aura-managed instances in the Aura console.

The operator handles plugin installation and token registration automatically. See the Aura Fleet Management Guide for setup instructions.

Complete Metrics Reference¶

The operator registers the following Prometheus metrics. All metrics use the prefix neo4j_operator_ (composed from the neo4j_operator subsystem in internal/metrics/metrics.go).

Cluster metrics¶

Metric	Type	Labels	Description
`neo4j_operator_cluster_healthy`	Gauge	`cluster_name`, `namespace`	`1` when cluster is healthy, `0` otherwise
`neo4j_operator_cluster_replicas_total`	Gauge	`cluster_name`, `namespace`, `role` (`primary`/`secondary`)	Current replica counts by role
`neo4j_operator_cluster_phase`	Gauge	`cluster_name`, `namespace`, `phase`	`1` for the current phase, `0` for all others (phases: `Pending`, `Forming`, `Ready`, `Failed`, `Degraded`, `Upgrading`)
`neo4j_operator_split_brain_detected_total`	Counter	`cluster_name`, `namespace`	Total split-brain detection events
`neo4j_operator_server_health`	Gauge	`cluster_name`, `namespace`, `server_name`, `server_address`	`1` = Enabled+Available; `0` = degraded

Reconcile metrics¶

Metric	Type	Labels	Description
`neo4j_operator_reconcile_total`	Counter	`cluster_name`, `namespace`, `operation`, `result` (`success`/`failure`)	Total reconciliation attempts
`neo4j_operator_reconcile_duration_seconds`	Histogram	`cluster_name`, `namespace`, `operation`	Reconciliation loop duration

Upgrade metrics¶

Metric	Type	Labels	Description
`neo4j_operator_upgrade_total`	Counter	`cluster_name`, `namespace`, `result` (`success`/`failure`)	Total upgrade attempts
`neo4j_operator_upgrade_duration_seconds`	Histogram	`cluster_name`, `namespace`, `phase`	Duration per upgrade phase

Backup metrics¶

Metric	Type	Labels	Description
`neo4j_operator_backup_total`	Counter	`cluster_name`, `namespace`, `result` (`success`/`failure`)	Total backup attempts
`neo4j_operator_backup_duration_seconds`	Histogram	`cluster_name`, `namespace`	Backup job duration
`neo4j_operator_backup_size_bytes`	Gauge	`cluster_name`, `namespace`	Size of the last successful backup in bytes

Cypher execution metrics¶

Metric	Type	Labels	Description
`neo4j_operator_cypher_executions_total`	Counter	`cluster_name`, `namespace`, `operation`, `result` (`success`/`failure`)	Total Cypher statement executions by the operator
`neo4j_operator_cypher_execution_duration_seconds`	Histogram	`cluster_name`, `namespace`, `operation`	Duration of operator-issued Cypher statements

Security operation metrics¶

Metric	Type	Labels	Description
`neo4j_operator_security_operations_total`	Counter	`cluster_name`, `namespace`, `operation`, `result` (`success`/`failure`)	Total security operations (user, role, grant)

Resource conflict metrics¶

Metric	Type	Labels	Description
`neo4j_operator_resource_version_conflicts_total`	Counter	`resource_type`, `namespace`	Total Kubernetes resource version conflicts encountered
`neo4j_operator_conflict_retry_attempts`	Histogram	`resource_type`, `namespace`	Retry attempts needed to resolve each conflict
`neo4j_operator_conflict_retry_duration_seconds`	Histogram	`resource_type`, `namespace`	Time spent retrying due to resource version conflicts

Disaster recovery metrics¶

Metric	Type	Labels	Description
`neo4j_operator_disaster_recovery_status`	Gauge	`cluster_name`, `namespace`, `primary_region`, `secondary_region`	`1` = DR ready, `0` = not ready
`neo4j_operator_failover_total`	Counter	`cluster_name`, `namespace`, `result` (`success`/`failure`)	Total failovers performed
`neo4j_operator_replication_lag_seconds`	Gauge	`cluster_name`, `namespace`, `primary_region`, `secondary_region`	Replication lag in seconds

Scaling metrics¶

Metric	Type	Labels	Description
`neo4j_operator_manual_scaler_enabled`	Gauge	`cluster_name`, `namespace`	`1` = manual scaling enabled, `0` = disabled
`neo4j_operator_scale_events_total`	Counter	`cluster_name`, `namespace`, `node_type`, `direction` (`up`/`down`)	Total manual scale events
`neo4j_operator_primary_count`	Gauge	`cluster_name`, `namespace`	Current number of primary nodes
`neo4j_operator_secondary_count`	Gauge	`cluster_name`, `namespace`	Current number of secondary nodes
`neo4j_operator_scaling_validation_total`	Counter	`cluster_name`, `namespace`, `validation_type`, `result` (`success`/`failure`)	Total scaling validation attempts

Live Cluster Diagnostics¶

When spec.monitoring.enabled: true and the cluster is in Ready phase, the operator automatically collects live diagnostics by running SHOW SERVERS and SHOW DATABASES against the cluster. Results are written to status.diagnostics and two new Kubernetes conditions without requiring kubectl exec into pods.

Prerequisites¶

spec:
  monitoring:
    enabled: true

The operator creates a Neo4j client connection for diagnostics only after the cluster reaches Ready phase. No extra configuration is needed — diagnostics run automatically on every reconcile cycle.

Viewing Diagnostic Status¶

# View the full diagnostics sub-object
kubectl get neo4jenterprisecluster <name> -o json | jq '.status.diagnostics'

# Quick server state overview
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{range .status.diagnostics.servers[*]}{.name}{"\t"}{.state}{"\t"}{.health}{"\n"}{end}'

# Quick database status overview
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{range .status.diagnostics.databases[*]}{.name}{"\t"}{.status}{"\n"}{end}'

# When diagnostics were last collected
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{.status.diagnostics.lastCollected}'

# Any collection error
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{.status.diagnostics.collectionError}'

Diagnostics Status Fields¶

Field	Type	Description
`status.diagnostics.servers[]`	Array	One entry per server from `SHOW SERVERS`
`status.diagnostics.servers[].name`	string	Server display name
`status.diagnostics.servers[].address`	string	Bolt address
`status.diagnostics.servers[].state`	string	Lifecycle state: `Enabled`, `Cordoned`, `Deallocating`
`status.diagnostics.servers[].health`	string	Health status: `Available` or `Unavailable`
`status.diagnostics.servers[].hostingDatabases`	int	Number of databases hosted
`status.diagnostics.databases[]`	Array	One entry per database from `SHOW DATABASES`
`status.diagnostics.databases[].name`	string	Database name
`status.diagnostics.databases[].status`	string	Current status: `online`, `offline`, `quarantined`
`status.diagnostics.databases[].requestedStatus`	string	Desired status
`status.diagnostics.databases[].role`	string	Role on last-contacted server: `primary`, `secondary`
`status.diagnostics.lastCollected`	RFC3339	Timestamp of last successful collection
`status.diagnostics.collectionError`	string	Error from last collection attempt; empty on success

Diagnostic Conditions¶

The operator maintains two standard Kubernetes conditions on the cluster resource:

Condition	`True` when	`False` when	`Unknown` when
`ServersHealthy`	All servers are `state=Enabled` and `health=Available`	Any server is Cordoned, Deallocating, or Unavailable	Diagnostics cannot be collected
`DatabasesHealthy`	All user databases have `status=online`	Any database has `requestedStatus=online` but `status≠online`	Diagnostics cannot be collected

Note: The system database is excluded from the DatabasesHealthy check because it has special internal lifecycle behavior.

Check conditions directly:

# Check ServersHealthy condition
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{.status.conditions[?(@.type=="ServersHealthy")]}'

# Check DatabasesHealthy condition
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{.status.conditions[?(@.type=="DatabasesHealthy")]}'

# Watch all conditions
kubectl get neo4jenterprisecluster <name> -o jsonpath='{.status.conditions}' | jq .

Prometheus Server Health Metric¶

The operator exposes a per-server health gauge when diagnostics are enabled:

Metric	Labels	Value
`neo4j_operator_server_health`	`cluster_name`, `namespace`, `server_name`, `server_address`	`1` = Enabled+Available; `0` = degraded

Example PrometheusRule alert:

groups:
  - name: neo4j-operator
    rules:
      - alert: Neo4jServerDegraded
        expr: neo4j_operator_server_health == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Neo4j server {{ $labels.server_name }} in cluster {{ $labels.cluster_name }} is degraded"
          description: "Server {{ $labels.server_name }} at {{ $labels.server_address }} has been unhealthy for 5 minutes"

Enable the ServiceMonitor to scrape this metric (see Prometheus scraping above).

Troubleshooting Diagnostics Collection¶

status.diagnostics.collectionError is set:

This means the operator could not reach the cluster via Bolt. Common causes:

Cause	Check
Cluster not yet `Ready`	Diagnostics only run when `status.phase=Ready`; check cluster phase
Auth secret missing or wrong	Check `spec.auth.adminSecret`; verify the secret exists
Network policy blocking Bolt	Verify the operator pod can reach port 7687 of the cluster service
Cluster overloaded	The Bolt client uses a 10s timeout; check Neo4j pod resource usage

ServersHealthy=False:

# Get details on which servers are degraded
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{.status.conditions[?(@.type=="ServersHealthy")].message}'

# Check server pods directly
kubectl get pods -l neo4j.com/cluster=<name>

DatabasesHealthy=False:

# Get details on which databases are offline
kubectl get neo4jenterprisecluster <name> \
  -o jsonpath='{.status.conditions[?(@.type=="DatabasesHealthy")].message}'

# Exec into a pod to investigate
kubectl exec <cluster-name>-server-0 -c neo4j -- \
  cypher-shell -u neo4j -p <password> "SHOW DATABASES"

Disabling Diagnostics¶

Set spec.monitoring.enabled: false (or omit the monitoring section entirely). The status.diagnostics field will remain at its last-known value but will not be updated.