
High Availability and Disaster Recovery

jitsudod is a stateless binary — all persistent state lives in PostgreSQL. This has an important implication: you can run multiple jitsudod instances behind a load balancer, sharing the same database, without any additional coordination infrastructure.

         Internal Load Balancer (private, not internet-facing)
                            │
                       ┌────┴────┐
                       ▼         ▼
                 jitsudod-1  jitsudod-2   (multiple instances, same image)
                       │         │
                       └────┬────┘
                            ▼
                       PostgreSQL
              (single source of truth for all state)

Expiry sweeper coordination — A PostgreSQL session-level advisory lock (pg_try_advisory_lock) ensures that only one jitsudod instance runs the expiry sweeper at a time. Because provider.Revoke() is called before the database state transition, without this lock multiple instances could issue duplicate revoke calls for the same grant. The winning instance acquires the lock, runs the sweep, then releases it; other instances skip that tick.
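The single-winner semantics can be sketched with an in-process try-lock standing in for the advisory lock. This is an illustrative simulation, not jitsudod's actual code: `sweepTick` and `runSweep` are hypothetical names, and in production the try-lock is `pg_try_advisory_lock` against PostgreSQL, not a local mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// sweepMu stands in for the PostgreSQL session-level advisory lock;
// in the real system the equivalent step is SELECT pg_try_advisory_lock($1),
// which returns true only for the first session to ask.
var sweepMu sync.Mutex

// sweepTick is what every instance runs on each timer tick. Only the
// instance that wins the try-lock performs the sweep; the rest skip
// that tick. Reports whether this instance swept.
func sweepTick(instance string, runSweep func()) bool {
	if !sweepMu.TryLock() { // pg_try_advisory_lock returned false
		return false
	}
	defer sweepMu.Unlock() // pg_advisory_unlock
	runSweep()
	return true
}

func main() {
	started := make(chan struct{})
	done := make(chan struct{})
	// Instance 1 wins the lock and is mid-sweep...
	go sweepTick("jitsudod-1", func() { close(started); <-done })
	<-started
	// ...so instance 2's tick finds the lock taken and skips.
	fmt.Println("instance 2 swept:", sweepTick("jitsudod-2", func() {})) // false
	close(done)
}
```

Using a *try*-lock rather than a blocking lock is what makes losing instances skip the tick instead of queueing up and re-sweeping the same grants back to back.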

Policy sync — Each instance independently polls the database every 30 seconds and reloads its in-memory OPA query cache. This means policy changes applied via ApplyPolicy or DeletePolicy propagate to all replicas within one sync interval without any fan-out coordination.

Control plane unavailable (all jitsudod instances down)


Behavior: fail-closed for new requests.

If all jitsudod instances are unavailable:

  • Engineers cannot submit new elevation requests
  • Pending requests cannot be approved or denied
  • The jitsudo CLI will return connection errors

This is intentional. An unreachable access control system should not silently grant access.

Existing active grants are unaffected. Credentials already issued by the cloud provider (STS session tokens, Azure RBAC assignments, GCP IAM bindings, Kubernetes RBAC bindings) remain valid until their natural TTL expiry. The credentials are held by the cloud provider, not by jitsudod. A downed control plane does not immediately revoke active sessions.

Exception: the expiry sweeper stops. The background process that calls Revoke on expired grants will not run while jitsudod is down. Grants that expire during the outage will linger until the sweeper resumes. For providers with native TTL enforcement (GCP IAM conditions, Kubernetes TTL annotations), expiry is enforced by the provider regardless. For Azure RBAC, the sweeper is the enforcement mechanism — grants will overstay their TTL during a prolonged outage.
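As an illustration of provider-native enforcement, a GCP IAM binding can carry a time-bound condition that GCP itself evaluates on every request, so access lapses at the timestamp even if Revoke is never called. The member, role, and timestamp below are placeholder values:

```yaml
# Illustrative GCP IAM policy binding with an expiry condition.
bindings:
  - members:
      - user:alice@example.com
    role: roles/storage.objectViewer
    condition:
      title: jit-expiry
      expression: request.time < timestamp("2025-01-01T00:00:00Z")
```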

If PostgreSQL is unavailable, jitsudod cannot process any requests (all operations require database access). jitsudod instances will log errors and return 503 responses. Recovery is automatic once the database is restored.

Behind a load balancer, the load balancer routes around failed instances. Active requests in flight may return errors, but clients can retry. The CLI retries transient errors automatically.
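The client-side retry behavior can be sketched as bounded exponential backoff. This is a sketch of the pattern, not the CLI's actual implementation; `retryTransient`, `errTransient`, and the attempt/backoff parameters are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient marks failures worth retrying: connection refused,
// a 503 from a draining instance, and similar.
var errTransient = errors.New("transient")

// retryTransient retries fn with exponential backoff, giving up after
// maxAttempts. Non-transient errors abort immediately.
func retryTransient(maxAttempts int, base time.Duration, fn func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil || !errors.Is(err, errTransient) {
			return err
		}
		time.Sleep(base << attempt) // base, 2*base, 4*base, ...
	}
	return err
}

func main() {
	calls := 0
	err := retryTransient(5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("dial jitsudod: %w", errTransient)
		}
		return nil // third attempt reaches a healthy instance
	})
	fmt.Println(calls, err) // 3 <nil>
}
```

Aborting immediately on non-transient errors (e.g. a policy denial) matters: retrying those only adds latency without changing the outcome.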

Emergency Access When Control Plane Is Down


Break-glass (jitsudo request --break-glass) requires a running jitsudod. If the control plane is truly unavailable:

  1. Use the cloud provider’s IAM console to grant the minimum required permissions directly
  2. Document the access: timestamp, user, resource, justification, incident ticket
  3. After jitsudod is restored, revoke the manual IAM change immediately
  4. File a post-incident review noting the out-of-band access

Every out-of-band access event is a gap in the audit trail. Minimize them by monitoring jitsudod availability and keeping runbooks for rapid recovery.

Use the bundled values-ha.yaml overlay to enable HA mode in a single command:

helm upgrade --install jitsudo ./helm/jitsudo \
  --namespace jitsudo \
  --create-namespace \
  -f helm/jitsudo/values-ha.yaml \
  --set config.auth.oidcIssuer=https://your-idp.example.com \
  --set config.auth.clientId=jitsudo-server

The HA overlay enables:

  • 2 replicas by default (minimum for HA; HPA scales up from there)
  • HPA — scales on CPU (70%) and memory (80%), up to 10 replicas
  • PodDisruptionBudget — ensures at least 1 pod remains available during node drains
  • Pod anti-affinity — prefers scheduling pods on different nodes
  • PostgreSQL read replica — one streaming replica for the bundled subchart (see note below)
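The overlay's effect corresponds roughly to values like the following. The key names here are illustrative of a typical Helm chart schema; consult the shipped values-ha.yaml for the chart's actual keys:

```yaml
replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

podDisruptionBudget:
  enabled: true
  minAvailable: 1

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: jitsudo

postgresql:
  readReplicas:
    replicaCount: 1
```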

For full production deployments, also supply an external managed database and disable the bundled subchart:

# In your environment-specific values file
postgresql:
  enabled: false
config:
  database:
    existingSecret: "jitsudo-db"  # Secret with DATABASE_URL key

The bundled PostgreSQL subchart (values-ha.yaml adds one read replica via streaming replication) is suitable for testing HA configuration. It does not provide automatic failover — if the primary crashes, manual intervention is required.

For production, use a managed service with built-in automatic failover:

Cloud     Managed PostgreSQL
AWS       RDS Multi-AZ (automatic failover ~30–60s)
Azure     Azure Database for PostgreSQL - Flexible Server (HA mode)
GCP       Cloud SQL for PostgreSQL (HA with failover replica)
On-prem   Patroni + etcd, or Crunchy Data PGO

All managed services above provide automatic failover, point-in-time recovery (PITR), and automated backups.

PostgreSQL has a hard limit on concurrent connections. Use PgBouncer (or pgpool-II) between jitsudod and PostgreSQL for connection efficiency, especially during rolling restarts:

# In jitsudod config — point at PgBouncer, not PostgreSQL directly
database:
  url: "postgres://jitsudo_app:${DB_PASSWORD}@pgbouncer:5432/jitsudo?sslmode=require"

jitsudod exposes a health endpoint:

GET /healthz → 200 OK if the server is healthy
GET /readyz → 200 OK if the server is ready to serve traffic

Configure your load balancer to use /readyz for routing decisions. Unlike /healthz, the readiness check verifies database connectivity.

Take daily automated backups of the PostgreSQL database. Managed services (RDS, Cloud SQL, Azure Database) provide this by default.

For self-managed PostgreSQL:

# Daily pg_dump to S3 (example)
pg_dump -U jitsudo_app jitsudo \
  | gzip \
  | aws s3 cp - s3://your-backup-bucket/jitsudo/$(date +%Y-%m-%d).sql.gz

To restore from a backup:

# 1. Stop jitsudod instances to prevent writes during restore
kubectl scale deployment jitsudod --replicas=0
# 2. Restore from backup
gunzip -c backup.sql.gz | psql -U postgres jitsudo
# 3. Verify audit log hash chain integrity
jitsudo audit verify
# 4. Restart jitsudod
kubectl scale deployment jitsudod --replicas=2

After any restore, verify the audit log hash chain:

jitsudo audit verify

If the chain breaks, entries were modified or inserted out-of-band between the backup point and the restore point. Investigate before allowing the restored instance to serve traffic. See Audit Log for the chain format and verification script.
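The integrity property being checked can be illustrated with a generic hash chain, where each entry's hash covers its payload plus the previous entry's hash. The entry layout below is illustrative, not jitsudo's actual format (see Audit Log for that); the point is that any out-of-band edit breaks every link from that entry onward:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type entry struct {
	Payload  string // the audit event, serialized
	PrevHash string // hash of the previous entry ("" for the first)
	Hash     string // sha256(PrevHash || Payload)
}

func hashEntry(prev, payload string) string {
	sum := sha256.Sum256([]byte(prev + payload))
	return hex.EncodeToString(sum[:])
}

// appendEntry is the append-only writer: each new entry chains to the
// hash of the last one.
func appendEntry(chain []entry, payload string) []entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, entry{payload, prev, hashEntry(prev, payload)})
}

// verify recomputes every hash; an edit, insertion, or deletion in the
// middle of the chain breaks a link and fails the whole check.
func verify(chain []entry) bool {
	prev := ""
	for _, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Payload) {
			return false
		}
		prev = e.Hash
	}
	return true
}

func main() {
	var chain []entry
	for _, ev := range []string{"grant:alice", "revoke:alice", "grant:bob"} {
		chain = appendEntry(chain, ev)
	}
	fmt.Println("intact:", verify(chain)) // intact: true

	chain[1].Payload = "revoke:mallory" // out-of-band modification
	fmt.Println("tampered:", verify(chain)) // tampered: false
}
```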