Endpoints — Cluster

The cluster API is used to configure a two-node HA pair (main + secondary with PostgreSQL streaming replication), to monitor the cluster, and to perform one-click failover when the main server is unreachable. This is the same API the Settings → Cluster UI calls.

Base URL

https://your-panel-host/api/cluster

Authentication

The cluster API has two auth modes:

Route group	Auth
`/api/cluster/join`, `/heartbeat`, `/promote`, `/notify`, `/uploads` (all under `/api/cluster`)	Cluster secret: `X-Cluster-Secret: <secret>` header. Used by nodes to talk to each other.
`/api/cluster/*` (the rest)	JWT with admin role. Used by operators in the UI.

The cluster secret is generated when the main server is set up and copied to the secondary at join time. It is not the same as the license key.
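
A minimal sketch of how a client might pick the right header for a route, based only on the table above. The helper name `auth_headers` is hypothetical, and exact matching of the route suffix is an assumption (the source does not say whether `/uploads` has sub-paths).

```python
# Node-to-node routes from the table above, relative to /api/cluster.
INTERNAL_ROUTES = {"/join", "/heartbeat", "/promote", "/notify", "/uploads"}
PREFIX = "/api/cluster"


def auth_headers(path, admin_jwt=None, cluster_secret=None):
    """Return the auth header for a cluster route (hypothetical helper)."""
    if path != PREFIX and not path.startswith(PREFIX + "/"):
        raise ValueError(f"not a cluster route: {path}")
    suffix = path[len(PREFIX):]
    if suffix in INTERNAL_ROUTES:
        # Internal routes authenticate with the shared cluster secret.
        if not cluster_secret:
            raise ValueError("cluster secret required for node-to-node routes")
        return {"X-Cluster-Secret": cluster_secret}
    # Everything else is an operator route and needs an admin JWT.
    if not admin_jwt:
        raise ValueError("admin JWT required for operator routes")
    return {"Authorization": f"Bearer {admin_jwt}"}
```

Note that `/api/cluster/promote-to-main` is an operator route and gets the JWT header, while the internal `/api/cluster/promote` gets the cluster secret.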

POST /api/cluster/setup-main

Configure the local server as the main of a new cluster. Generates a cluster ID and secret, sets wal_level=replica and max_wal_senders=10, and, on first call, creates the cluster_config, cluster_nodes, and cluster_events tables.

Permission: admin.

Request

POST /api/cluster/setup-main
Authorization: Bearer <admin-jwt>
Content-Type: application/json

Body: none required. Optional fields:

Field	Type	Description
`display_name`	string	Human label — defaults to the hostname

Response — 200 OK

{
  "success": true,
  "data": {
    "cluster_id": "cluster_a8f3kj92",
    "cluster_secret": "csec_lpqz2v9j...",
    "role": "main",
    "node_id": 1
  }
}

Save the cluster_secret — it is shown once in the UI (“Copy” button) and never again. The secondary will need it to join.

Errors

Status	`message`	Cause
409	`already configured (role=main)`	Idempotent — returns the existing cluster_id
500	`failed to set wal_level — postgres restart required`	`wal_level` can only be set at server start, not with a runtime SET, so the Postgres container must be restarted
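
A minimal sketch of calling this endpoint from Python's standard library and pulling the one-time secret out of the response. The function names are hypothetical, and sending an empty JSON object when no `display_name` is given is an assumption (the source only says no body is required).

```python
import json
import urllib.request


def build_setup_main_request(panel_url, admin_jwt, display_name=None):
    """Build (but do not send) the setup-main request."""
    body = {}
    if display_name is not None:
        body["display_name"] = display_name
    return urllib.request.Request(
        url=f"{panel_url.rstrip('/')}/api/cluster/setup-main",
        data=json.dumps(body).encode("utf-8"),  # assumption: "{}" is accepted
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_jwt}",
            "Content-Type": "application/json",
        },
    )


def extract_cluster_secret(response_body):
    """Return cluster_secret from a 200 response; store it right away."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise RuntimeError(payload.get("message", "setup-main failed"))
    return payload["data"]["cluster_secret"]
```

The request would be sent with `urllib.request.urlopen(req)`; the body of a successful response goes to `extract_cluster_secret`.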

POST /api/cluster/setup-secondary

Configure the local server as a secondary, replicating from the given main.

Permission: admin.

Body

Field	Type	Required	Description
`main_ip`	string	yes	IP or hostname of the main server
`cluster_secret`	string	yes	Secret from `setup-main`
`display_name`	string	no	Label for this node

The handler:

1. Hits POST <main>/api/cluster/test-connection to verify API + DB + Redis are reachable.
2. Calls POST <main>/api/cluster/join with the local server’s IP + hostname.
3. Receives the DB connection string + a dedicated replication slot id.
4. Generates a standby.signal file and a pg_basebackup script, then restarts the local Postgres in replica mode.
5. Stops the local RADIUS (it will run only on the main during normal operation).
6. Writes a cluster_config row with role=secondary.

Response — 200 OK

{
  "success": true,
  "data": {
    "role": "secondary",
    "main_ip": "203.0.113.10",
    "replication_slot": "replica_node_2",
    "node_id": 2
  }
}

Errors

Status	`message`	Cause
400	`cluster_secret invalid`	Wrong secret
503	`main server unreachable`	Test-connection failed; see POST /api/cluster/test-connection
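
A small sketch of assembling the request body from the field table above, with the two required fields checked before anything is sent. `setup_secondary_body` is a hypothetical helper, not part of the API.

```python
def setup_secondary_body(main_ip, cluster_secret, display_name=None):
    """Build the setup-secondary JSON body, rejecting missing required fields."""
    required = (("main_ip", main_ip), ("cluster_secret", cluster_secret))
    missing = [name for name, value in required if not value]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    body = {"main_ip": main_ip, "cluster_secret": cluster_secret}
    if display_name:
        # Optional label for this node.
        body["display_name"] = display_name
    return body
```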

GET /api/cluster/status

Cluster overview — all registered nodes, their last heartbeat, CPU / memory / disk %, current replication lag.

Permission: admin.

Response — 200 OK

{
  "success": true,
  "data": {
    "cluster_id": "cluster_a8f3kj92",
    "local_role": "secondary",
    "nodes": [
      {
        "id": 1, "ip": "203.0.113.10", "role": "main", "status": "online",
        "last_seen": "2026-05-12T11:45:01Z",
        "cpu_pct": 12.4, "mem_pct": 38.2, "disk_pct": 41.0
      },
      {
        "id": 2, "ip": "203.0.113.11", "role": "secondary", "status": "online",
        "last_seen": "2026-05-12T11:45:03Z",
        "cpu_pct": 4.1, "mem_pct": 18.7, "disk_pct": 41.0,
        "replication_lag_sec": 0.8
      }
    ],
    "recent_events": [
      { "type": "node_joined", "node_id": 2, "at": "2026-05-11T09:00:00Z" }
    ]
  }
}

curl https://panel.example.com/api/cluster/status \
  -H "Authorization: Bearer ..."

replication_lag_sec comes from SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) on the replica. Lag above 30 s is a warning (yellow); lag above 120 s is critical (red).
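
The two lag thresholds can be applied to a status response like this. It is a sketch: the function names and the "green" label for healthy lag are assumptions, while the 30 s and 120 s cut-offs come from the text above.

```python
def classify_lag(lag_sec):
    """Map replication lag in seconds to a severity label."""
    if lag_sec > 120:
        return "red"
    if lag_sec > 30:
        return "yellow"
    return "green"  # assumed label for lag at or below 30 s


def lag_report(status_data):
    """Return {node id: severity} for nodes that report replication_lag_sec."""
    return {
        node["id"]: classify_lag(node["replication_lag_sec"])
        for node in status_data["nodes"]
        if "replication_lag_sec" in node  # the main has no lag field
    }
```

Applied to the `data` object of the sample response above, only node 2 carries a lag value, so the report has a single entry.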

GET /api/cluster/check-main-status

A focused health check that the UI polls every 30 s on the secondary. Returns whether the main is online and how long it has been unreachable.

Permission: admin.

Response — 200 OK

{
  "success": true,
  "data": {
    "main_ip": "203.0.113.10",
    "main_online": false,
    "offline_seconds": 312,
    "can_promote": true
  }
}

can_promote = true once offline_seconds exceeds 120 (the failover threshold). The UI hides the “Promote to Main” button until then.
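
The rule for `can_promote` can be sketched as a single predicate. That the main must also be reported offline is an assumption drawn from the 409 error on promote-to-main; the source states only the 120 s threshold.

```python
FAILOVER_THRESHOLD_SEC = 120  # failover threshold quoted above


def can_promote(main_online, offline_seconds):
    """True once the main has been unreachable for longer than the threshold."""
    return (not main_online) and offline_seconds > FAILOVER_THRESHOLD_SEC
```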

POST /api/cluster/promote-to-main

The big red button. Promote the local secondary to main.

Permission: admin.

The handler runs in this order:

1. Re-check that the main is unreachable (sanity guard).
2. Check replication_lag_sec. If it exceeds 30 s, the request is rejected with 412 unless the body includes force: true.
3. Call SELECT pg_promote(), after which PostgreSQL becomes primary and accepts writes.
4. Stop Redis replication (REPLICAOF NO ONE).
5. Update cluster_config.role = 'main'.
6. Mark the old main status='failed' in cluster_nodes.
7. Notify remaining nodes of the new main via POST /api/cluster/notify.
8. Restart the RADIUS container so it picks up its now-primary DB.

Body

Field	Type	Required	Description
`force`	bool	no	Override lag warning and proceed with stale replica

Response — 200 OK

{
  "success": true,
  "data": {
    "promoted_at": "2026-05-12T11:55:00Z",
    "new_role": "main",
    "replication_lag_at_promote_sec": 0.8
  }
}

After this, update the MikroTik RADIUS pointer (/radius set [find] address=<new-main-ip>) and the DNS record or Cloudflare LB origin. The old main's database must be re-cloned from the new main before it can rejoin as a secondary.

Errors

Status	`message`	Cause
409	`main is online — refusing to promote`	The original main is responsive
412	`replication lag too high (X seconds), use force=true to override`	Lag exceeds 30 s and `force` not set
500	`pg_promote() failed`	DB-side error — see API logs
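
The two guards that run before pg_promote() map onto the 409 and 412 rows above. The following is a sketch of that decision only, not the server's actual handler; the function name and the short reason strings are hypothetical.

```python
LAG_LIMIT_SEC = 30  # lag limit quoted in the handler steps


def promote_preflight(main_online, lag_sec, force=False):
    """Return (http_status, reason) for the checks that precede promotion."""
    if main_online:
        # Guard 1: never promote while the original main still responds.
        return 409, "main_online"
    if lag_sec > LAG_LIMIT_SEC and not force:
        # Guard 2: a stale replica needs an explicit force flag.
        return 412, "lag_too_high"
    return 200, "ok"
```

A client could run the same check against `check-main-status` and `status` before sending the request, so that the operator sees the lag warning before deciding to set `force`.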

POST /api/cluster/recover-from-server

Run this on a fresh install to seed it from an existing production server. It is intended for disaster recovery when the main is gone.

Permission: admin (on the new server).

The handler:

1. SSHes to the source server using the provided root password.
2. Runs pg_dump on the source.
3. Downloads the dump.
4. Restores into the local Postgres.
5. Rsyncs /opt/proxpanel/frontend/dist/uploads/ (logos, favicons).
6. Writes the new server’s cluster_config with role='main'.

Body

Field	Type	Required	Description
`source_ip`	string	yes	IP of the source server
`source_password`	string	yes	Root password — used only for SSH session, not stored
`source_port`	int	no	Default 22

Response — 200 OK

{
  "success": true,
  "data": {
    "dump_size_bytes": 312456789,
    "tables_restored": 142,
    "subscribers_count": 8421,
    "duration_seconds": 184
  }
}

curl -X POST https://new-server.example.com/api/cluster/recover-from-server \
  -H "Authorization: Bearer ..." \
  -H "Content-Type: application/json" \
  -d '{"source_ip":"203.0.113.10","source_password":"the-old-root-pw"}'

Errors

Status	`message`	Cause
503	`cannot connect to source server`	SSH failed
500	`pg_dump failed: ...`	Source Postgres rejected the dump
500	`restore failed: ...`	Local Postgres rejected the import
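
Because the body carries a root password, client code may want to keep it out of logs. The sketch below builds the body from the field table above and produces a redacted copy for logging; both helper names are hypothetical, and always sending `source_port` is a choice made here, not a requirement from the source.

```python
def recover_body(source_ip, source_password, source_port=22):
    """Build the recover-from-server JSON body (source_port defaults to 22)."""
    if not source_ip or not source_password:
        raise ValueError("source_ip and source_password are required")
    if not 1 <= source_port <= 65535:
        raise ValueError(f"invalid source_port: {source_port}")
    return {
        "source_ip": source_ip,
        "source_password": source_password,
        "source_port": source_port,
    }


def redacted(body):
    """Copy of the body that is safe to log: the password is masked."""
    return {k: ("***" if k == "source_password" else v) for k, v in body.items()}
```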

POST /api/cluster/test-source-connection

Dry run for recover-from-server: confirms SSH and Postgres reachability without copying or changing any data. The body matches recover-from-server (source_ip, source_password, optional source_port). Returns { ssh_ok, postgres_ok, estimated_dump_size_bytes }.

Permission: admin.

POST /api/cluster/test-connection

Used by the secondary during setup to verify the main is reachable on API, DB (Postgres), and Redis ports. Body: { "main_ip": "203.0.113.10", "cluster_secret": "csec_..." }. Returns { api_ok, postgres_ok, redis_ok }; on any failure the corresponding error is in data.errors.

Permission: admin.
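
A short sketch of reading the three flags from a test-connection result. It assumes the flags sit directly in the response's `data` object and treats a missing flag as a failure; the shape of `data.errors` is not specified in the source, so it is left untouched here. `failed_checks` is a hypothetical name.

```python
CHECKS = ("api_ok", "postgres_ok", "redis_ok")


def failed_checks(data):
    """Return the names of the checks that did not pass, in a fixed order."""
    return [name for name in CHECKS if not data.get(name, False)]
```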

POST /api/cluster/failover (manual)

Planned switchover, as opposed to the emergency promote-to-main. Fences writes on the current main, waits for the replica to catch up, then promotes it. Use it during planned maintenance windows.

Permission: admin.

Body: { "target_node_id": 2, "drain_seconds": 30 } (drain_seconds defaults to 30 — how long to wait for connections to drain). Returns { old_main_id, new_main_id, drained_connections, completed_at }.

DELETE /api/cluster/nodes/:id · POST /api/cluster/leave

Remove a node from the cluster.

DELETE /nodes/:id (called from the main): drops the replication slot, marks the row removed.
POST /leave (called from a secondary): tells the main to drop us, then wipes the local cluster_config.

Permission: admin.

Errors

{ "success": false, "message": "main is online — refusing to promote" }

Status	Meaning
400	Validation — `message` describes
401	Missing / invalid JWT (or wrong `X-Cluster-Secret` on internal routes)
403	Not an admin
409	State conflict — already main, main still online, etc.
412	Precondition failed, e.g. replication lag too high
503	Network failure to peer node
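
A minimal sketch of unwrapping the response envelope on the client side, using the status meanings from the table above. `ClusterApiError` and `unwrap` are hypothetical names; treating any status of 400 or above, or `success: false`, as an error is an assumption.

```python
import json

# Short labels for the statuses in the table above.
MEANINGS = {
    400: "validation",
    401: "unauthenticated",
    403: "not an admin",
    409: "state conflict",
    412: "precondition failed",
    503: "peer unreachable",
}


class ClusterApiError(Exception):
    """Raised when the API returns an error envelope."""

    def __init__(self, status, message):
        label = MEANINGS.get(status, "unexpected")
        super().__init__(f"{status} ({label}): {message}")
        self.status = status
        self.message = message


def unwrap(status, body_text):
    """Return `data` from a successful envelope, else raise ClusterApiError."""
    payload = json.loads(body_text)
    if status >= 400 or not payload.get("success", False):
        raise ClusterApiError(status, payload.get("message", ""))
    return payload.get("data")
```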

Rate limits

Internal cluster routes (/heartbeat, /join, /promote, /notify) bypass the global 300 req/min limit and are gated only by the cluster-secret check. The heartbeat fires every 30 s per node.
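
As a rough check on these numbers, the request rates involved can be worked out directly. The helper below is hypothetical and only does the arithmetic.

```python
GLOBAL_LIMIT_PER_MIN = 300  # global limit quoted above


def requests_per_minute(interval_sec, clients=1):
    """Requests per minute when `clients` callers each fire every `interval_sec`."""
    return clients * 60 / interval_sec


# Heartbeats from a two-node pair, one every 30 s per node.
heartbeat_rate = requests_per_minute(30, clients=2)  # 4.0 requests per minute
```

Even without the bypass, heartbeat traffic from a two-node pair would sit far below the 300 req/min limit.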

Admin routes follow the standard 300 req/min/IP global limit.

Related pages

HA Cluster — UI walk-through for the same operations
Authentication — admin JWT required for the operator-facing routes
Backups — full backup is a prerequisite before promote
