HA Cluster
The HA Cluster feature lets you run a second ProxPanel server as a real-time replica of the first. Postgres streams WAL to the secondary, the secondary’s API is live but its database is read-only, and a one-click button promotes it to main when the primary fails.
This is the right setup when you have more than ~5 K subscribers, when downtime costs you money, or when you simply want a warm spare you can promote without a backup-restore cycle. It is not a replacement for backups — replication faithfully replicates DROP TABLE too. Run both.
| Role | DB state | API writes accepted | RADIUS accepts auth |
|---|---|---|---|
| standalone | Read-write | Yes | Yes |
| main | Read-write (Postgres primary) | Yes | Yes |
| secondary | Read-only (pg_is_in_recovery() = true) | UI loads, write attempts return 503 | Optional — see RADIUS below |
A fresh install starts in standalone. You convert to main by clicking Configure as Main Server; the second server joins as secondary.
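The "API writes accepted" column of the role table can be read as a guard in front of every mutating request. The following is a minimal Python sketch, not ProxPanel's actual handler: `Role` and `guard_write` are hypothetical names introduced for illustration, and only the 503 status and the error text come from this page.

```python
from enum import Enum


class Role(str, Enum):
    STANDALONE = "standalone"
    MAIN = "main"
    SECONDARY = "secondary"


def guard_write(role: Role) -> tuple[int, dict]:
    """Return an (HTTP status, JSON body) pair for a write request.

    Hypothetical helper: it only models the behaviour described in the
    role table, not the real API middleware.
    """
    if role is Role.SECONDARY:
        # A secondary keeps serving reads but rejects every write.
        return 503, {"error": "secondary server is read-only"}
    # standalone and main both sit on a read-write database.
    return 200, {}
```

For example, `guard_write(Role.SECONDARY)` yields the 503 pair, while `guard_write(Role.MAIN)` lets the request through.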
Architecture
    ┌─────────────────────────────┐         ┌─────────────────────────────┐
    │        MAIN SERVER          │         │      SECONDARY SERVER       │
    │                             │         │                             │
    │  ┌───────────────────────┐  │   WAL   │  ┌───────────────────────┐  │
    │  │ proxpanel-api (RW)    │  │ stream  │  │ proxpanel-api (RO)    │  │
    │  │ proxpanel-radius      │  │ ──────▶ │  │ proxpanel-radius      │  │
    │  │ proxpanel-frontend    │  │  :5432  │  │ proxpanel-frontend    │  │
    │  │                       │  │         │  │                       │  │
    │  │ proxpanel-db          │  │         │  │ proxpanel-db          │  │
    │  │  • wal_level=replica  │  │         │  │  • standby.signal     │  │
    │  │  • max_wal_senders=10 │  │         │  │  • primary_conninfo   │  │
    │  │  • replication slot   │  │         │  │  • read replica       │  │
    │  └───────────────────────┘  │         │  └───────────────────────┘  │
    │                             │         │                             │
    │ Sends heartbeat every 30s   │  HTTP   │ Checks main /health 30s     │
    │ Polls secondary health 30s  │ ──────▶ │ Auto-failover at 2 min      │
    └─────────────────────────────┘         └─────────────────────────────┘
                                                           │
                                                           ▼ (on promotion)
                                                pg_promote() (becomes RW)

Database schema
Three tables under cluster_*:
| Table | Purpose |
|---|---|
| `cluster_config` | This server’s role, IDs, and replication settings. One row. |
| `cluster_nodes` | All known nodes in the cluster (main + every secondary), with last-heartbeat, CPU/mem/disk, replication lag. |
| `cluster_events` | Audit log of node_joined, node_left, failover_started, failover_completed, promote_requested, etc. |
These tables are replicated like everything else. The cluster service detects “this is a replica” via SELECT pg_is_in_recovery() and overrides the role to secondary even if the replicated cluster_config row says main — so you don’t accidentally end up with two servers thinking they’re main after a failover.
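The override rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in the cluster service: `effective_role` is a hypothetical name, and `in_recovery` stands for the boolean returned by `SELECT pg_is_in_recovery()` on the local database.

```python
def effective_role(configured_role: str, in_recovery: bool) -> str:
    """Resolve the role a node should act as.

    configured_role: the role stored in the (replicated) cluster_config row.
    in_recovery:     result of SELECT pg_is_in_recovery() on this node.
    """
    if in_recovery:
        # A replica always acts as secondary, even when the replicated
        # cluster_config row says "main".
        return "secondary"
    return configured_role
```

On a replica, `effective_role("main", True)` returns `"secondary"`; after promotion the database leaves recovery and the stored role applies again.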
Setting up a cluster
You need two servers with ProxPanel installed and activated, on the same network or with a low-latency private link (sub-50 ms round-trip recommended). Both must run the same ProxPanel version.
Step 1: Configure the main server
1. On the intended main server, go to Settings → Cluster → Configure as Main Server.
2. Enter a friendly server name (“DC1 Primary”). The IP is auto-filled to the local routable IP.
3. Click Configure. Behind the scenes, ProxPanel:
   - Generates a random `cluster_id` (UUID) and `cluster_secret` (32-byte random).
   - Writes the `cluster_config` row with role=main.
   - Inserts this node into `cluster_nodes`.
   - Sets Postgres for replication:

         ALTER SYSTEM SET wal_level = replica;
         ALTER SYSTEM SET max_wal_senders = 10;
         ALTER SYSTEM SET max_replication_slots = 10;
         ALTER SYSTEM SET wal_keep_size = '1GB';
         ALTER SYSTEM SET hot_standby = on;
         SELECT pg_reload_conf();

     Of these, only `wal_keep_size` takes effect on reload. The other four are start-time parameters, although the values shown are already the defaults on PostgreSQL 10 and later.
   - Creates a `replicator` user with `REPLICATION` privilege.
4. Copy the cluster secret shown in the UI. You’ll paste it on the secondary. This is the only time the UI shows it in full.
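To make the shape of the generated credentials concrete, here is a minimal Python sketch. It assumes the secret is rendered as hex (32 random bytes become 64 hex characters); `new_cluster_credentials` is a hypothetical helper, not ProxPanel's implementation.

```python
import secrets
import uuid


def new_cluster_credentials() -> dict:
    """Generate a cluster_id / cluster_secret pair (illustrative only)."""
    return {
        # Random (version 4) UUID identifying the cluster.
        "cluster_id": str(uuid.uuid4()),
        # 32 random bytes from the OS CSPRNG, hex-encoded: 64 characters.
        "cluster_secret": secrets.token_hex(32),
    }
```

Using the `secrets` module rather than `random` matters here, because the secret authenticates the join request.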
You also need to allow replication connections from the secondary in pg_hba.conf:

    host replication replicator <secondary_ip>/32 md5

The cluster setup logs this line. Apply it inside the proxpanel-db container and reload Postgres. See PostgreSQL Replication for the full pg_hba walkthrough.
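If you script the pg_hba.conf change, validating the address before writing the line avoids a typo locking replication out. A small Python sketch, assuming a hypothetical helper name `pg_hba_replication_line`; the `/128` prefix for IPv6 is an extrapolation that this page does not cover.

```python
import ipaddress


def pg_hba_replication_line(secondary_ip: str, user: str = "replicator") -> str:
    """Build the pg_hba.conf entry for one secondary.

    Raises ValueError if secondary_ip is not a valid IP address.
    """
    addr = ipaddress.ip_address(secondary_ip)
    # Single-host mask: /32 for IPv4 (as in the docs), /128 for IPv6 (assumed).
    prefix = 32 if addr.version == 4 else 128
    return f"host replication {user} {addr}/{prefix} md5"
```

For instance, `pg_hba_replication_line("10.0.0.2")` produces the same line the cluster setup logs for that secondary.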
Step 2: Join the secondary
1. On the intended secondary server, go to Settings → Cluster → Join as Secondary.
2. Fill in:
   - Main Server IP: the main’s reachable IP.
   - Cluster Secret: pasted from Step 1.
3. Click Test Connection. It should show API/DB/Redis reachable from this server.
4. Click Join Cluster.
   - The secondary POSTs to `https://MAIN/api/cluster/join` with the cluster secret.
   - Main verifies the secret (constant-time compare to avoid timing attacks), inserts the new node into `cluster_nodes`, creates a replication slot named `replica_<node_id>`, and returns the connection info.
   - The secondary’s UI shows a `setup_replica.sh` script. Review and run it on the secondary’s host:

         bash /tmp/setup_replica.sh

     The script stops `proxpanel-db`, backs up the current data directory, runs `pg_basebackup` from the main, writes `standby.signal`, configures `primary_conninfo` and `primary_slot_name`, and restarts.
5. Once Postgres comes back up in recovery mode, the cluster service auto-detects this and starts heartbeating to the main.
After this, the secondary’s UI is read-only for writes. Any write attempt (create subscriber, edit service) returns HTTP 503 with {error: "secondary server is read-only"}. Reads, dashboards, and reports all work normally.
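The secret check performed by the main during the join in Step 2 can be sketched as follows. This is an assumption-laden illustration: `handle_join` is a hypothetical function, `node_id` is assumed to be an integer, and the real handler also inserts the node and creates the slot in Postgres.

```python
import hmac
from typing import Optional


def handle_join(presented: str, expected: str, node_id: int) -> Optional[dict]:
    """Verify a join request and name the replication slot.

    Returns None when the secret does not match.
    """
    # hmac.compare_digest runs in time independent of where the two
    # values differ, which is what "constant-time compare" refers to.
    if not hmac.compare_digest(presented.encode(), expected.encode()):
        return None
    # Slot naming follows the replica_<node_id> pattern from the docs.
    return {"slot_name": f"replica_{node_id}"}
```

A plain `==` comparison would also work functionally, but it can return early on the first differing byte and leak timing information.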
Cluster service — what runs in the background
The cluster service (services/cluster_service.go) starts on API startup if cluster_config.is_active = true:
| Role | Loop |
|---|---|
| `main` | Every 30 s → check each known node’s last heartbeat. Mark offline after 2 minutes of silence. Run the node-health ticker. |
| `secondary` | Every 30 s → POST to `https://MAIN/api/cluster/heartbeat` with CPU%, memory%, disk%, subscriber count, current version, replication lag (`SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))`). |
The dashboard’s Cluster tab refreshes every 5 s and shows each node with its last seen timestamp, status badge, and replication lag.
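The main's "offline after 2 minutes of silence" rule reduces to a timestamp comparison. A minimal Python sketch, assuming a hypothetical `node_status` helper and assuming that exactly two minutes still counts as online (the page does not specify the boundary):

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the table above.
OFFLINE_AFTER = timedelta(minutes=2)


def node_status(last_heartbeat: datetime, now: datetime) -> str:
    """Classify a node from the age of its last heartbeat."""
    silence = now - last_heartbeat
    return "offline" if silence > OFFLINE_AFTER else "online"
```

With a 30-second heartbeat interval, a node has to miss roughly four consecutive heartbeats before it is marked offline.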
What replicates
Everything in the public schema. That’s:
- All subscriber, service, NAS, reseller, transaction, invoice tables.
- RADIUS check/reply tables (`radcheck`, `radreply`, `radacct`).
- Bandwidth rules, FUP counters, audit log, settings.
- Uploads stored in the DB (e.g. branding settings), but NOT files on disk.
What doesn’t replicate:
- `/opt/proxpanel/frontend/dist/uploads/` (logos, login backgrounds). Must be rsynced manually or scripted.
- `.env` (license keys, hardware-bound).
- TLS certificates (each server has its own).
- nginx.conf (each server has its own SSL config).
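Since the uploads directory has to be copied by hand or by script, one option is a small wrapper that builds the rsync invocation. The Python sketch below is only an illustration: the SSH user `root` and the helper name `uploads_rsync_command` are assumptions, and it presumes SSH access from the main to the secondary.

```python
def uploads_rsync_command(secondary_host: str, user: str = "root") -> list[str]:
    """Build an rsync argv that mirrors the uploads directory to a secondary.

    -a preserves permissions and timestamps, -z compresses in transit.
    The trailing slash on the source copies the directory's contents.
    """
    path = "/opt/proxpanel/frontend/dist/uploads/"
    return ["rsync", "-az", path, f"{user}@{secondary_host}:{path}"]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a cron job or a post-upload hook on the main.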
Updating a clustered deployment
1. Update the secondary first (Settings → License → Check for Updates → Install). It restarts; replication resumes.
2. Verify the secondary is healthy and replication lag is low.
3. Update the main. Expect brief downtime (~30 seconds) during the container restart.
4. Both nodes now run the same version. Confirm in the Cluster tab.
The cluster service refuses to perform automatic failover during a version mismatch (mainVersion != currentVersion) — this prevents promoting a secondary that was running an older binary while the main was running a newer one. Always keep versions in sync.
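The version gate combines with the two-minute auto-failover timer shown in the architecture diagram. A minimal Python sketch of that decision, assuming a hypothetical `should_auto_failover` function and plain string comparison of versions; the real service may apply further checks.

```python
# "Auto-failover at 2 min" from the architecture diagram.
AUTO_FAILOVER_AFTER_SECONDS = 120


def should_auto_failover(
    main_version: str,
    current_version: str,
    seconds_since_main_healthy: float,
) -> bool:
    """Decide whether a secondary may promote itself automatically."""
    if main_version != current_version:
        # Version mismatch: never promote automatically.
        return False
    return seconds_since_main_healthy >= AUTO_FAILOVER_AFTER_SECONDS
```

During a rolling update the two versions differ for a short window, so an outage in that window requires a manual decision.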
Tear-down
To leave a cluster cleanly:
- Secondary leaves: Settings → Cluster → Leave Cluster. The secondary stops heartbeating and demotes itself in `cluster_config` (role → standalone). You then promote its Postgres out of recovery (or wipe it and reinstall).
- Main removes a secondary: Cluster tab → click `×` next to the node → confirm. This drops the replication slot and marks the node `removed` in `cluster_nodes`.
- Dismantle the cluster: On main, Disband Cluster. This sets `cluster_config.is_active = false` and returns the main to standalone, but leaves Postgres replication running until you stop it, so the secondary keeps replicating into a now-orphaned read-only state until you stop it as well.
Hardware sizing
| Subscribers | Recommended main spec | Secondary spec |
|---|---|---|
| Up to 5 K | 4 vCPU / 8 GB RAM / 100 GB SSD | Same |
| 5 K – 15 K | 8 vCPU / 16 GB / 200 GB NVMe | Same |
| 15 K – 30 K | 16 vCPU / 32 GB / 500 GB NVMe | Same |
| 30 K – 60 K | 32 vCPU / 32 GB / 1 TB NVMe | Same (a slower secondary is acceptable but lag will grow under heavy write bursts) |
The secondary does not save you money on hardware — it has to keep up with WAL replay. Spec it to match the main.
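For capacity-planning scripts, the sizing table can be expressed as a lookup. The Python sketch below copies the table values verbatim; the helper name `recommended_spec` and the choice to treat each upper bound as inclusive are assumptions, and counts above 60 K return `None` because the table does not cover them.

```python
import bisect
from typing import Optional

# Upper bound of each tier (subscribers) and the matching main spec.
_TIER_LIMITS = [5_000, 15_000, 30_000, 60_000]
_TIER_SPECS = [
    "4 vCPU / 8 GB RAM / 100 GB SSD",
    "8 vCPU / 16 GB / 200 GB NVMe",
    "16 vCPU / 32 GB / 500 GB NVMe",
    "32 vCPU / 32 GB / 1 TB NVMe",
]


def recommended_spec(subscribers: int) -> Optional[str]:
    """Return the recommended spec for main and secondary alike."""
    # bisect_left maps a count equal to a limit onto that limit's tier.
    tier = bisect.bisect_left(_TIER_LIMITS, subscribers)
    return _TIER_SPECS[tier] if tier < len(_TIER_SPECS) else None
```

Because the secondary should match the main, the same return value applies to both nodes.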
What about Redis?
Redis holds session caches, dashboard caches, and the JWT blacklist. It is replicated separately from Postgres using Redis’s REPLICAOF mechanism:

    # On the secondary's redis container
    docker exec proxpanel-redis redis-cli REPLICAOF <main_ip> 6379

The cluster setup automates this. After a failover, the new main runs REPLICAOF NO ONE to become a standalone master. Effective Redis data loss in a failover is typically zero because the contents are caches that re-populate within seconds. Subscriber data is not stored in Redis; it all lives in Postgres.
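The two Redis states (replicating from the main, or standalone after promotion) map onto two forms of the same command. A Python sketch of a builder for both, assuming a hypothetical helper name `redis_replicaof_command`; the container name and port are the ones used on this page.

```python
from typing import Optional


def redis_replicaof_command(
    main_ip: Optional[str] = None,
    port: int = 6379,
    container: str = "proxpanel-redis",
) -> list[str]:
    """Build the docker exec argv for REPLICAOF.

    With main_ip set, the node follows that main.
    With main_ip=None, it becomes a standalone master (REPLICAOF NO ONE).
    """
    target = ["NO", "ONE"] if main_ip is None else [main_ip, str(port)]
    return ["docker", "exec", container, "redis-cli", "REPLICAOF", *target]
```

Calling it with no arguments gives the command a promoted node would run after failover.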
Monitoring a healthy cluster
| Indicator | Where to find it | Healthy value |
|---|---|---|
| Last main heartbeat | Cluster tab → main row | < 60 s |
| Last secondary heartbeat | Cluster tab → secondary row | < 60 s |
| Replication lag (seconds) | Cluster tab → secondary row | < 5 s typically |
| Replication slot active | SELECT active FROM pg_replication_slots on main | t |
| `pg_is_in_recovery()` on secondary | psql | t |
| Both nodes on same version | Cluster tab | matching badges |
If any of these is off, do not initiate a failover until you understand why. A failover into a stale or diverged secondary makes things worse.
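The table can double as a pre-failover checklist. A Python sketch that evaluates the six indicators, under assumptions: `ClusterHealth` and `unhealthy_indicators` are hypothetical names, the values would have to be collected from the Cluster tab or psql by other means, and "< 5 s typically" is treated as a hard limit purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterHealth:
    main_heartbeat_age_s: float
    secondary_heartbeat_age_s: float
    replication_lag_s: float
    slot_active: bool            # SELECT active FROM pg_replication_slots
    secondary_in_recovery: bool  # pg_is_in_recovery() on the secondary
    versions_match: bool


def unhealthy_indicators(h: ClusterHealth) -> list[str]:
    """Return the names of indicators outside their healthy range."""
    checks = [
        ("main heartbeat", h.main_heartbeat_age_s < 60),
        ("secondary heartbeat", h.secondary_heartbeat_age_s < 60),
        ("replication lag", h.replication_lag_s < 5),
        ("replication slot active", h.slot_active),
        ("secondary in recovery", h.secondary_in_recovery),
        ("matching versions", h.versions_match),
    ]
    return [name for name, ok in checks if not ok]
```

An empty list means every indicator is in range; anything else names what to investigate before touching the failover button.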
Common pitfalls
- Secondary stuck in “syncing”. Look at the Postgres logs on the secondary: `docker logs proxpanel-db`. Usually `pg_hba.conf` doesn’t allow the secondary’s IP for `replication`.
- Replication lag growing unbounded. Network issue, or a long-running write transaction on main is preventing WAL recycling. Run `SELECT * FROM pg_stat_replication` on main.
- Both nodes think they’re main after a network partition. Split-brain. The fencing isn’t automatic; you must manually demote one before re-attaching. See the split-brain section of Failover.
- “cluster secret mismatch” on join. You pasted the secret with whitespace or from a UI that truncated it. Re-copy. The secret is exactly 64 hex characters.
- Cluster tab is blank. API on the secondary can’t reach the main’s `/api/cluster/heartbeat`. Confirm port 80 (or your nginx port) is reachable between the two. Heartbeat has gone through port 80 since v1.0.233 (it used to be 8080, which some firewalls blocked).
- Subscriber created on main doesn’t appear on secondary’s UI. Check replication lag first. If lag is 0 but the row is still missing, the secondary is querying a different database. Confirm `cluster_config.is_active = true` on the secondary and that its API is reading the `proxpanel-db` you expect.
- Promoted secondary, but RADIUS still gets timeout from MikroTik. RADIUS NAS targets are not part of the cluster; they live in MikroTik’s own config. Run `/radius set [find] address=NEW_IP` on every NAS after failover.
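The "cluster secret mismatch" pitfall is easy to catch before submitting the join form. A small Python sketch, assuming a hypothetical `normalize_cluster_secret` helper and assuming that either letter case is acceptable in the hex string:

```python
import re
from typing import Optional

# Exactly 64 hexadecimal characters, as stated above.
_SECRET_RE = re.compile(r"[0-9a-fA-F]{64}")


def normalize_cluster_secret(pasted: str) -> Optional[str]:
    """Strip surrounding whitespace and validate the pasted secret.

    Returns the cleaned secret, or None if it is truncated or malformed.
    """
    secret = pasted.strip()
    return secret if _SECRET_RE.fullmatch(secret) else None
```

A `None` result means the value should be re-copied from the main rather than retried as is.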
Permissions
Cluster setup, join, leave, and failover trigger require admin (not a specific permission; the routes are admin-only). Resellers cannot view the cluster status.
Related pages
- PostgreSQL Replication: the WAL streaming layer in detail.
- Failover: manual + auto promotion of the secondary.
- Cross-Server Restore: alternative to a cluster when you only need DR, not warm-standby.
- Backups & Recovery: still required; replication doesn’t protect against logical corruption.