Endpoints — Cluster

The cluster API is used to configure a two-node HA pair (main + secondary with PostgreSQL streaming replication), to monitor the cluster, and to perform one-click failover when the main server is unreachable. This is the same API the Settings → Cluster UI calls.

https://your-panel-host/api/cluster

The cluster API has two auth modes:

| Route group | Auth |
|---|---|
| /api/cluster/join, /heartbeat, /promote, /notify, /uploads | Cluster secret (X-Cluster-Secret: <secret> header). Used by nodes to talk to each other. |
| /api/cluster/* (the rest) | JWT with admin role. Used by operators in the UI. |

The cluster secret is generated when the main server is set up and copied to the secondary at join time. It is not the same as the license key.
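A minimal sketch of how a client might choose the auth header for a given route, based on the two route groups above. The helper name, its arguments, and the treatment of sub-paths under an internal route as internal are assumptions for illustration, not part of the panel's code.

```python
# Internal node-to-node routes, as listed in the auth table above.
INTERNAL_ROUTES = ("/join", "/heartbeat", "/promote", "/notify", "/uploads")
PREFIX = "/api/cluster"


def cluster_auth_headers(path: str, *, admin_jwt: str = "", cluster_secret: str = "") -> dict:
    """Return the auth header to send for a path under /api/cluster.

    Hypothetical helper: internal routes get X-Cluster-Secret,
    everything else gets the admin JWT as a Bearer token.
    """
    if not path.startswith(PREFIX + "/"):
        raise ValueError(f"not a cluster route: {path}")
    route = path[len(PREFIX):]
    # Assumption: sub-paths such as /uploads/<file> are also internal.
    internal = any(route == r or route.startswith(r + "/") for r in INTERNAL_ROUTES)
    if internal:
        return {"X-Cluster-Secret": cluster_secret}
    return {"Authorization": f"Bearer {admin_jwt}"}
```

For example, `cluster_auth_headers("/api/cluster/status", admin_jwt="...")` yields an `Authorization` header, while `/api/cluster/heartbeat` yields `X-Cluster-Secret`.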

Configure the local server as the main of a new cluster. On the first call, the handler generates a cluster ID and secret, sets wal_level=replica and max_wal_senders=10, and creates the cluster_config, cluster_nodes, and cluster_events tables.

Permission: admin.

Request

POST /api/cluster/setup-main
Authorization: Bearer <admin-jwt>
Content-Type: application/json

Body: none required. Optional fields:

| Field | Type | Description |
|---|---|---|
| display_name | string | Human label; defaults to the hostname |

Response — 200 OK

{
"success": true,
"data": {
"cluster_id": "cluster_a8f3kj92",
"cluster_secret": "csec_lpqz2v9j...",
"role": "main",
"node_id": 1
}
}

Save the cluster_secret — it is shown once in the UI (“Copy” button) and never again. The secondary will need it to join.

Errors

| Status | message | Cause |
|---|---|---|
| 409 | already configured (role=main) | Idempotent; returns the existing cluster_id |
| 500 | failed to set wal_level — postgres restart required | Some Postgres tuning needs a container restart, not a runtime SET |
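As an illustration, the setup-main request described above could be built with Python's standard library as follows. The function name is hypothetical, and sending an empty JSON object when no optional field is set is an assumption (the source only says no body is required).

```python
import json
import urllib.request
from typing import Optional


def build_setup_main_request(
    base_url: str, admin_jwt: str, display_name: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/cluster/setup-main request."""
    # Assumption: an empty JSON object is acceptable when no fields are set.
    body = {} if display_name is None else {"display_name": display_name}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/cluster/setup-main",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_jwt}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. A 409 response would surface as `urllib.error.HTTPError`, and per the table above the existing cluster_id is returned in that case.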

Configure the local server as a secondary, replicating from the given main.

Permission: admin.

Body

| Field | Type | Required | Description |
|---|---|---|---|
| main_ip | string | yes | IP or hostname of the main server |
| cluster_secret | string | yes | Secret from setup-main |
| display_name | string | no | Label for this node |
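A small sketch of assembling this body on the client side, enforcing the two required fields before anything is sent. The function name and the client-side validation are illustrative assumptions; the server performs its own checks.

```python
import json
from typing import Optional


def setup_secondary_body(
    main_ip: str, cluster_secret: str, display_name: Optional[str] = None
) -> str:
    """Serialize the body for the secondary-setup call (hypothetical helper)."""
    if not main_ip or not cluster_secret:
        raise ValueError("main_ip and cluster_secret are required")
    body = {"main_ip": main_ip, "cluster_secret": cluster_secret}
    # display_name is optional, so it is left out unless provided.
    if display_name is not None:
        body["display_name"] = display_name
    return json.dumps(body)
```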

The handler:

  1. Hits POST <main>/api/cluster/test-connection to verify API + DB + Redis are reachable.
  2. Calls POST <main>/api/cluster/join with the local server’s IP + hostname.
  3. Receives the DB connection string + a dedicated replication slot id.
  4. Generates a standby.signal file and a pg_basebackup script, then restarts the local Postgres in replica mode.
  5. Stops the local RADIUS (it will run only on the main during normal operation).
  6. Writes a cluster_config row with role=secondary.

Response — 200 OK

{
"success": true,
"data": {
"role": "secondary",
"main_ip": "203.0.113.10",
"replication_slot": "replica_node_2",
"node_id": 2
}
}

Errors

| Status | message | Cause |
|---|---|---|
| 400 | cluster_secret invalid | Wrong secret |
| 503 | main server unreachable | Test-connection failed; see below |

Cluster overview — all registered nodes, their last heartbeat, CPU / memory / disk %, current replication lag.

Permission: admin.

Response — 200 OK

{
"success": true,
"data": {
"cluster_id": "cluster_a8f3kj92",
"local_role": "secondary",
"nodes": [
{
"id": 1, "ip": "203.0.113.10", "role": "main", "status": "online",
"last_seen": "2026-05-12T11:45:01Z",
"cpu_pct": 12.4, "mem_pct": 38.2, "disk_pct": 41.0
},
{
"id": 2, "ip": "203.0.113.11", "role": "secondary", "status": "online",
"last_seen": "2026-05-12T11:45:03Z",
"cpu_pct": 4.1, "mem_pct": 18.7, "disk_pct": 41.0,
"replication_lag_sec": 0.8
}
],
"recent_events": [
{ "type": "node_joined", "node_id": 2, "at": "2026-05-11T09:00:00Z" }
]
}
}
Example request:
curl https://panel.example.com/api/cluster/status \
-H "Authorization: Bearer ..."

replication_lag_sec comes from SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) run on the replica. Lag above 30 s is a warning (yellow); lag above 120 s is critical (red).
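The two thresholds can be expressed as a small classifier. This is a sketch: the function name and the "ok" and "unknown" labels are assumptions. The None case covers pg_last_xact_replay_timestamp() returning NULL, which PostgreSQL does when no transaction has been replayed yet.

```python
from typing import Optional


def classify_replication_lag(lag_sec: Optional[float]) -> str:
    """Map replication_lag_sec to a severity label (hypothetical helper)."""
    if lag_sec is None:
        # NULL from pg_last_xact_replay_timestamp(): nothing replayed yet.
        return "unknown"
    if lag_sec > 120:
        return "red"
    if lag_sec > 30:
        return "yellow"
    return "ok"
```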

A focused health check that the UI polls every 30 s on the secondary. Returns whether the main is online and how long it has been unreachable.

Permission: admin.

Response — 200 OK

{
"success": true,
"data": {
"main_ip": "203.0.113.10",
"main_online": false,
"offline_seconds": 312,
"can_promote": true
}
}

can_promote = true once offline_seconds exceeds 120 (the failover threshold). The UI hides the “Promote to Main” button until then.
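The rule reads naturally as a one-line predicate. The sketch below is illustrative only: the constant and function names are assumptions, and requiring main_online to be false is inferred from the description rather than stated explicitly.

```python
FAILOVER_THRESHOLD_SEC = 120  # the failover threshold named above


def can_promote(main_online: bool, offline_seconds: int) -> bool:
    """True once the main has been unreachable for longer than the threshold."""
    return (not main_online) and offline_seconds > FAILOVER_THRESHOLD_SEC
```

With the sample response above (main_online false, offline_seconds 312) this returns True; at exactly 120 seconds it still returns False, since the threshold must be exceeded.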

The big red button. Promote the local secondary to main.

Permission: admin.

The handler runs in this order:

  1. Re-check the main is unreachable (sanity guard).
  2. Warn if replication_lag_sec > 30. Operator must include force: true to proceed past 30 s lag.
  3. Call SELECT pg_promote() — PostgreSQL becomes primary (accepts writes).
  4. Stop Redis replication (REPLICAOF NO ONE).
  5. Update cluster_config.role = 'main'.
  6. Mark the old main status='failed' in cluster_nodes.
  7. Notify remaining nodes of the new main via POST /cluster/notify.
  8. Restart the RADIUS container so it picks up its now-primary DB.

Body

| Field | Type | Required | Description |
|---|---|---|---|
| force | bool | no | Override lag warning and proceed with stale replica |

Response — 200 OK

{
"success": true,
"data": {
"promoted_at": "2026-05-12T11:55:00Z",
"new_role": "main",
"replication_lag_at_promote_sec": 0.8
}
}

After this, update the MikroTik RADIUS pointer (/radius set [find] address=<new-main-ip>) and the DNS record / Cloudflare LB origin. The old main's database must be re-cloned from the new main before it can rejoin as a secondary.

Errors

| Status | message | Cause |
|---|---|---|
| 409 | main is online — refusing to promote | The original main is responsive |
| 412 | replication lag too high (X seconds), use force=true to override | Lag exceeds 30 s and force not set |
| 500 | pg_promote() failed | DB-side error; see API logs |
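The first two guard steps of the handler, together with the 409 and 412 errors above, amount to a short decision function. This is a sketch under stated assumptions: the function name is hypothetical, and it models only the precondition checks, not pg_promote() or the later steps.

```python
LAG_LIMIT_SEC = 30  # lag above this requires force=true


def promote_precheck(main_reachable: bool, lag_sec: float, force: bool = False) -> int:
    """Return the HTTP status the precondition checks would produce."""
    if main_reachable:
        return 409  # main is online, refuse to promote
    if lag_sec > LAG_LIMIT_SEC and not force:
        return 412  # lag too high and no override
    return 200  # safe to continue with the promotion steps
```

Checking reachability before lag mirrors the handler order: a responsive main blocks promotion even when force is set.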

Run this on a fresh install to seed it from an existing production server. It is used for disaster recovery when the main is gone.

Permission: admin (on the new server).

The handler:

  1. SSHes to the source server using the provided root password.
  2. Runs pg_dump on the source.
  3. Downloads the dump.
  4. Restores into the local Postgres.
  5. Rsyncs /opt/proxpanel/frontend/dist/uploads/ (logos, favicons).
  6. Writes the new server’s cluster_config with role='main'.

Body

| Field | Type | Required | Description |
|---|---|---|---|
| source_ip | string | yes | IP of the source server |
| source_password | string | yes | Root password; used only for the SSH session, not stored |
| source_port | int | no | Default 22 |

Response — 200 OK

{
"success": true,
"data": {
"dump_size_bytes": 312456789,
"tables_restored": 142,
"subscribers_count": 8421,
"duration_seconds": 184
}
}
Example request:
curl -X POST https://new-server.example.com/api/cluster/recover-from-server \
-H "Authorization: Bearer ..." \
-H "Content-Type: application/json" \
-d '{"source_ip":"203.0.113.10","source_password":"the-old-root-pw"}'

Errors

| Status | message | Cause |
|---|---|---|
| 503 | cannot connect to source server | SSH failed |
| 500 | pg_dump failed: ... | Source Postgres rejected the dump |
| 500 | restore failed: ... | Local Postgres rejected the import |

Dry run for recover-from-server: confirms SSH and Postgres reachability without changing anything. The body matches recover-from-server (source_ip, source_password, optional source_port). Returns { ssh_ok, postgres_ok, estimated_dump_size_bytes }.

Permission: admin.

Used by the secondary during setup to verify the main is reachable on API, DB (Postgres), and Redis ports. Body: { "main_ip": "203.0.113.10", "cluster_secret": "csec_..." }. Returns { api_ok, postgres_ok, redis_ok }; on any failure the corresponding error is in data.errors.
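A caller might reduce the three flags to a list of failed checks before deciding whether to continue. The sketch below assumes the flags sit under the usual "data" envelope, which the mention of data.errors suggests; the function name is hypothetical, and the shape of data.errors is not specified in the source, so it is not parsed here.

```python
CHECKS = ("api_ok", "postgres_ok", "redis_ok")


def failed_checks(response: dict) -> list:
    """Return the names of the checks that did not pass (hypothetical helper)."""
    data = response.get("data") or {}
    # A missing flag is treated as a failure, which is the cautious reading.
    return [name for name in CHECKS if not data.get(name, False)]
```

An empty list means the main is reachable on all three ports.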

Permission: admin.

Planned switchover, as opposed to the emergency promote. It fences writes on the current main, waits for the replica to catch up, then promotes it. Use it during planned maintenance windows.

Permission: admin.

Body: { "target_node_id": 2, "drain_seconds": 30 }. drain_seconds is how long to wait for connections to drain and defaults to 30. Returns { old_main_id, new_main_id, drained_connections, completed_at }.
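For illustration, a client could build the switchover body with the documented default applied. The function name and the client-side range checks are assumptions; the source does not state what values the server rejects.

```python
import json


def switchover_body(target_node_id: int, drain_seconds: int = 30) -> str:
    """Serialize the planned-switchover body (hypothetical helper)."""
    # Assumed sanity checks: node ids start at 1, drain time is non-negative.
    if target_node_id < 1:
        raise ValueError("target_node_id must be a positive node id")
    if drain_seconds < 0:
        raise ValueError("drain_seconds must not be negative")
    return json.dumps(
        {"target_node_id": target_node_id, "drain_seconds": drain_seconds}
    )
```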

DELETE /api/cluster/nodes/:id · POST /api/cluster/leave

Remove a node from the cluster.

  • DELETE /nodes/:id (called from the main): drops the replication slot, marks the row removed.
  • POST /leave (called from a secondary): tells the main to drop us, then wipes the local cluster_config.

Permission: admin.

{ "success": false, "message": "main is online — refusing to promote" }
| Status | Meaning |
|---|---|
| 400 | Validation error; the message describes the problem |
| 401 | Missing / invalid JWT (or wrong X-Cluster-Secret on internal routes) |
| 403 | Not an admin |
| 409 | State conflict: already main, main still online, etc. |
| 412 | Pre-condition failed: replication lag too high, source unreachable |
| 503 | Network failure to peer node |
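A client-side sketch of turning an error response into a readable line, using the status meanings from the table above and the message field from the error envelope. The function name and the fallback label for unlisted statuses are assumptions.

```python
import json

# Short labels taken from the status table above.
STATUS_MEANINGS = {
    400: "Validation",
    401: "Missing or invalid credentials",
    403: "Not an admin",
    409: "State conflict",
    412: "Pre-condition failed",
    503: "Network failure to peer node",
}


def describe_error(status: int, body: str) -> str:
    """Combine the status meaning with the server's message, if any."""
    meaning = STATUS_MEANINGS.get(status, "Unexpected error")
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        parsed = None  # e.g. an HTML error page from a proxy
    message = parsed.get("message", "") if isinstance(parsed, dict) else ""
    return f"{status} {meaning}: {message}" if message else f"{status} {meaning}"
```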

Internal cluster routes (/heartbeat, /join, /promote, /notify) bypass the global 300 req/min limit and are gated only by the cluster-secret check. The heartbeat fires every 30 s per node.

Admin routes follow the standard 300 req/min/IP global limit.

  • HA Cluster — UI walk-through for the same operations
  • Authentication — admin JWT required for the operator-facing routes
  • Backups — full backup is a prerequisite before promote