HA Cluster

The HA Cluster feature lets you run a second ProxRad server as a real-time replica of the first. Postgres streams WAL to the secondary; the secondary’s API stays live but its database is read-only, and a one-click button promotes it to main when the main fails.

This is the right setup when you have more than ~5 K subscribers, when downtime costs you money, or when you simply want a warm spare you can promote without a backup-restore cycle. It is not a replacement for backups — replication faithfully replicates DROP TABLE too. Run both.

Roles

Role	DB state	API writes accepted	RADIUS accepts auth
standalone	Read-write	Yes	Yes
main	Read-write (Postgres primary)	Yes	Yes
secondary	Read-only (`pg_is_in_recovery() = true`)	UI loads, write attempts return 503	Optional — see RADIUS below

A fresh install starts in standalone. You convert to main by clicking Configure as Main Server; the second server joins as secondary.

Architecture

       ┌─────────────────────────────┐         ┌─────────────────────────────┐
       │  MAIN SERVER                │         │  SECONDARY SERVER           │
       │                             │         │                             │
       │  ┌───────────────────────┐  │  WAL    │  ┌───────────────────────┐  │
       │  │ proxpanel-api (RW)    │  │ stream  │  │ proxpanel-api (RO)    │  │
       │  │ proxpanel-radius      │  │ ──────▶ │  │ proxpanel-radius      │  │
       │  │ proxpanel-frontend    │  │  :5432  │  │ proxpanel-frontend    │  │
       │  │                       │  │         │  │                       │  │
       │  │ proxpanel-db          │  │         │  │ proxpanel-db          │  │
       │  │  • wal_level=replica  │  │         │  │  • standby.signal     │  │
       │  │  • max_wal_senders=10 │  │         │  │  • primary_conninfo   │  │
       │  │  • replication slot   │  │         │  │  • read replica       │  │
       │  └───────────────────────┘  │         │  └───────────────────────┘  │
       │                             │         │                             │
       │  Sends heartbeat every 30s  │  HTTP   │  Checks main /health 30s    │
       │  Polls secondary health 30s │ ──────▶ │  Auto-failover at 2 min     │
       └─────────────────────────────┘         └─────────────────────────────┘
                                                            │
                                                            ▼ (on promotion)
                                                  pg_promote() — becomes RW

Database schema

Three tables share the cluster_ prefix:

Table	Purpose
`cluster_config`	This server’s role, IDs, and replication settings. One row.
`cluster_nodes`	All known nodes in the cluster (main + every secondary), with last-heartbeat, CPU/mem/disk, replication lag.
`cluster_events`	Audit log of `node_joined`, `node_left`, `failover_started`, `failover_completed`, `promote_requested`, etc.

These tables are replicated like everything else. The cluster service detects “this is a replica” via SELECT pg_is_in_recovery() and overrides the role to secondary even if the replicated cluster_config row says main — so you don’t accidentally end up with two servers thinking they’re main after a failover.
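
The role override described above can be sketched as a small function. This is a minimal illustration, not the code in services/cluster_service.go: the name `effectiveRole` and the role constants are assumptions, and `inRecovery` stands for the result of SELECT pg_is_in_recovery() on the local database.

```go
package main

import "fmt"

// Role names as used on this page.
const (
	RoleStandalone = "standalone"
	RoleMain       = "main"
	RoleSecondary  = "secondary"
)

// effectiveRole returns the role a node should act as.
// configRole is the value read from cluster_config, which may have been
// replicated from the main. inRecovery is the local pg_is_in_recovery() result.
func effectiveRole(configRole string, inRecovery bool) string {
	if inRecovery {
		// A database in recovery is a replica, whatever the replicated row says.
		return RoleSecondary
	}
	return configRole
}

func main() {
	fmt.Println(effectiveRole(RoleMain, true))        // replica holding the main's row
	fmt.Println(effectiveRole(RoleMain, false))       // the real main
	fmt.Println(effectiveRole(RoleStandalone, false)) // fresh install
}
```

Checking the database state first means a freshly promoted secondary picks up its configured role automatically once pg_is_in_recovery() turns false.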

Setting up a cluster

You need two servers with ProxRad installed and activated, on the same network or with a low-latency private link (sub-50 ms round-trip recommended). Both must run the same ProxRad version.

Step 1: Configure the main server

On the intended main server, open Settings → Cluster → Configure as Main Server.
Enter a friendly server name (for example, “DC1 Primary”). The IP field is auto-filled with the local routable IP.
Click Configure. Behind the scenes:
- Generates a random cluster_id (UUID) and cluster_secret (32-byte random).
- Writes the cluster_config row with role=main.
- Inserts this node into cluster_nodes.
- Sets Postgres for replication:
```
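-- wal_level, max_wal_senders, max_replication_slots and hot_standby are read
-- only at server start: a changed value takes effect after a Postgres restart.
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET wal_keep_size = '1GB';
ALTER SYSTEM SET hot_standby = on;
-- The reload applies wal_keep_size immediately.
SELECT pg_reload_conf();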
```
- Creates a replicator user with REPLICATION privilege.
Copy the cluster secret shown in the UI. You’ll paste it on the secondary. It’s the only time the UI shows it in full.
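
For reference, a 32-byte random secret rendered as 64 hex characters (the length quoted under Common pitfalls) could be generated roughly as follows. The function name `newClusterSecret` is hypothetical, and hex encoding is inferred from that 64-character length rather than confirmed by the ProxRad source.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newClusterSecret returns 32 cryptographically random bytes
// encoded as a 64-character lowercase hex string.
func newClusterSecret() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	secret, err := newClusterSecret()
	if err != nil {
		panic(err)
	}
	fmt.Println("length:", len(secret))
}
```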

You also need to allow replication connections from the secondary in pg_hba.conf:

host replication replicator <secondary_ip>/32 md5

The cluster setup logs this line — apply it inside the proxpanel-db container and reload Postgres. See PostgreSQL Replication for the full pg_hba walkthrough.
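
As an illustration of how that line is derived from the secondary’s address, here is a minimal sketch. The helper name `hbaReplicationLine` is hypothetical, and the /128 mask for IPv6 is an assumption (the page only shows the IPv4 /32 form).

```go
package main

import (
	"fmt"
	"net"
)

// hbaReplicationLine builds the pg_hba.conf entry that lets the replicator
// user open replication connections from exactly one secondary address.
func hbaReplicationLine(secondaryIP string) (string, error) {
	ip := net.ParseIP(secondaryIP)
	if ip == nil {
		return "", fmt.Errorf("invalid IP address: %q", secondaryIP)
	}
	mask := 32 // single IPv4 host, as shown above
	if ip.To4() == nil {
		mask = 128 // single IPv6 host (assumption, not shown in the source)
	}
	return fmt.Sprintf("host replication replicator %s/%d md5", ip, mask), nil
}

func main() {
	line, err := hbaReplicationLine("10.0.0.12") // example address only
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Validating the address before writing it avoids reloading Postgres with a malformed pg_hba.conf entry.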

Step 2: Join the secondary

On the intended secondary server, open Settings → Cluster → Join as Secondary.
Fill in:
- Main Server IP — the main’s reachable IP.
- Cluster Secret — pasted from Step 1.
Click Test Connection. It should report the main’s API, database, and Redis as reachable from this server.
Click Join Cluster.
- The secondary POSTs to https://MAIN/api/cluster/join with the cluster secret.
- Main verifies the secret (constant-time compare to avoid timing attacks), inserts the new node into cluster_nodes, creates a replication slot named replica_<node_id>, returns the connection info.
- Secondary’s UI shows a setup_replica.sh script. Review and run it on the secondary’s host:
```
bash /tmp/setup_replica.sh
```
  The script stops proxpanel-db, backs up the current data directory, runs pg_basebackup from the main, writes standby.signal, configures primary_conninfo and primary_slot_name, and restarts.
Once Postgres comes back up in recovery mode, the cluster service auto-detects and starts heartbeating to the main.
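
The constant-time secret check mentioned in the join step can be sketched with Go’s standard crypto/subtle package. The function name `secretsMatch` is hypothetical; this shows the technique, not ProxRad’s actual handler.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// secretsMatch reports whether the presented secret equals the stored one.
// subtle.ConstantTimeCompare takes time that does not depend on where the
// inputs differ. It does return immediately when the lengths differ, so
// only the length of the secret can leak through timing.
func secretsMatch(stored, presented string) bool {
	return subtle.ConstantTimeCompare([]byte(stored), []byte(presented)) == 1
}

func main() {
	stored := "example-secret" // placeholder, not a real cluster secret
	fmt.Println(secretsMatch(stored, "example-secret"))
	fmt.Println(secretsMatch(stored, "example-secreT"))
}
```

A plain == comparison can stop at the first differing byte, which is the timing signal the constant-time version removes.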

After this, the secondary’s UI is read-only. Any write attempt (create subscriber, edit service) returns HTTP 503 with {error: "secondary server is read-only"}. Reads, dashboards, and reports all work normally.
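
One way to picture the read-only behaviour is an HTTP middleware that rejects non-read methods on a secondary. This is a sketch under assumptions: `writeAllowed`, `readOnlyGuard`, and the `currentRole` callback are hypothetical names, and treating GET, HEAD, and OPTIONS as the read methods is a guess at how the real API separates reads from writes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// writeAllowed reports whether a request with the given HTTP method may
// proceed on a node acting in the given role.
func writeAllowed(role, method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions:
		return true // reads always pass
	}
	return role != "secondary"
}

// readOnlyGuard wraps next and answers 503 to write requests on a secondary.
// currentRole is a hypothetical callback returning the node's effective role.
func readOnlyGuard(currentRole func() string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !writeAllowed(currentRole(), r.Method) {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{"error": "secondary server is read-only"})
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(writeAllowed("secondary", http.MethodPost))
	fmt.Println(writeAllowed("secondary", http.MethodGet))
	fmt.Println(writeAllowed("main", http.MethodPost))
}
```

Rejecting the request in the API layer gives the UI a clear 503 instead of a raw Postgres error about a read-only transaction.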

Cluster service — what runs in the background

The cluster service (services/cluster_service.go) starts on API startup if cluster_config.is_active = true:

Role	Loop
`main`	Every 30 s → check each known node’s last heartbeat. Mark `offline` after 2 minutes of silence. Run the node-health ticker.
`secondary`	Every 30 s → POST to `https://MAIN/api/cluster/heartbeat` with CPU%, memory%, disk%, subscriber count, current version, replication lag (`SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))`).
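
The main’s side of this loop, marking a node offline after 2 minutes of silence, might look like the sketch below. The `Node` type, `statusForSilence`, `sweep`, and the "online" label are all assumptions for illustration; the source only names the `offline` status and the 2-minute threshold.

```go
package main

import (
	"fmt"
	"time"
)

// offlineAfter is the silence threshold quoted in the table above.
const offlineAfter = 2 * time.Minute

// Node is a hypothetical in-memory view of one cluster_nodes row.
type Node struct {
	Name     string
	LastSeen time.Time // time the last heartbeat was received
	Status   string
}

// statusForSilence maps the time since the last heartbeat to a status label.
// "online" is an assumed label; the source only names "offline".
func statusForSilence(silence time.Duration) string {
	if silence > offlineAfter {
		return "offline"
	}
	return "online"
}

// sweep is what one 30-second tick on the main might do.
func sweep(nodes []Node, now time.Time) {
	for i := range nodes {
		nodes[i].Status = statusForSilence(now.Sub(nodes[i].LastSeen))
	}
}

func main() {
	now := time.Now()
	nodes := []Node{
		{Name: "secondary-a", LastSeen: now.Add(-45 * time.Second)},
		{Name: "secondary-b", LastSeen: now.Add(-3 * time.Minute)},
	}
	sweep(nodes, now)
	for _, n := range nodes {
		fmt.Println(n.Name, n.Status)
	}
}
```

With a 30-second heartbeat, a 2-minute threshold tolerates three missed heartbeats before a node is flagged.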

The dashboard’s Cluster tab refreshes every 5 s and shows each node with its last seen timestamp, status badge, and replication lag.

What replicates

Everything in the public schema. That’s:

All subscriber, service, NAS, reseller, transaction, invoice tables.
RADIUS check/reply tables (radcheck, radreply, radacct).
Bandwidth rules, FUP counters, audit log, settings.
Uploads stored in the DB (e.g. branding settings) — but NOT files on disk.

What doesn’t replicate:

/opt/proxpanel/frontend/dist/uploads/ (logos, login backgrounds). Must be rsynced manually or scripted.
.env (license keys, hardware-bound).
TLS certificates (each server has its own).
nginx.conf (each server has its own SSL config).

Updating a clustered deployment

Update the secondary first (Settings → License → Check for Updates → Install). It restarts; replication resumes.
Verify the secondary is healthy and replication lag is low.
Update the main. Brief downtime (~30 seconds) during container restart.
Both nodes now run the same version. Confirm in the Cluster tab.

The cluster service refuses to perform automatic failover during a version mismatch (mainVersion != currentVersion) — this prevents promoting a secondary that was running an older binary while the main was running a newer one. Always keep versions in sync.
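
The version guard can be expressed as a small decision function. This is a sketch only: `canAutoFailover` and its parameters are hypothetical names, the 2-minute threshold comes from the architecture diagram, and treating exactly 2 minutes of silence as enough is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// failoverAfter mirrors "Auto-failover at 2 min" in the architecture diagram.
const failoverAfter = 2 * time.Minute

// canAutoFailover decides whether a secondary may promote itself.
// mainVersion is the last version the main was known to run, currentVersion
// is this node's version, and mainSilence is how long the main has been
// unreachable. The second return value explains a refusal.
func canAutoFailover(mainVersion, currentVersion string, mainSilence time.Duration) (bool, string) {
	if mainSilence < failoverAfter {
		return false, "main has not been silent long enough"
	}
	if mainVersion != currentVersion {
		return false, "version mismatch between main and secondary"
	}
	return true, ""
}

func main() {
	ok, why := canAutoFailover("1.0.233", "1.0.232", 3*time.Minute)
	fmt.Println(ok, why)
	ok, why = canAutoFailover("1.0.233", "1.0.233", 3*time.Minute)
	fmt.Println(ok, why)
}
```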

Tear-down

To leave a cluster cleanly:

Secondary leaves: Settings → Cluster → Leave Cluster. The secondary stops heartbeating and demotes itself in cluster_config (role → standalone). Its Postgres is still in recovery at that point, so either promote it out of recovery or wipe the server and reinstall.
Main removes a secondary: Cluster tab → click × next to the node → confirm. This drops the replication slot and marks the node removed in cluster_nodes.
Dismantle the cluster: On main, click Disband Cluster. This sets cluster_config.is_active = false and returns the main to standalone. Postgres replication is left running, so the secondary keeps replicating in an orphaned read-only state until you stop it as well.

Hardware sizing

Subscribers	Recommended main spec	Secondary spec
Up to 5 K	4 vCPU / 8 GB RAM / 100 GB SSD	Same
5 K – 15 K	8 vCPU / 16 GB / 200 GB NVMe	Same
15 K – 30 K	16 vCPU / 32 GB / 500 GB NVMe	Same
30 K – 60 K	32 vCPU / 32 GB / 1 TB NVMe	Same (a slower secondary is acceptable but lag will grow under heavy write bursts)

The secondary does not save you money on hardware — it has to keep up with WAL replay. Spec it to match the main.

What about Redis?

Redis holds session caches, dashboard caches, and JWT blacklist. It is replicated separately from Postgres using Redis’s REPLICAOF mechanism:

# On secondary's redis container
docker exec proxpanel-redis redis-cli REPLICAOF <main_ip> 6379

The cluster setup automates this. After a failover, the new main runs REPLICAOF NO ONE to become a standalone master. The effective Redis data loss in a failover is typically zero, because the contents are caches that repopulate within seconds. Subscriber data is never at risk in Redis; it all lives in Postgres.

Monitoring a healthy cluster

Indicator	Where to find it	Healthy value
Last main heartbeat	Cluster tab → main row	< 60 s
Last secondary heartbeat	Cluster tab → secondary row	< 60 s
Replication lag (seconds)	Cluster tab → secondary row	< 5 s typically
Replication slot active	`SELECT active FROM pg_replication_slots` on main	`t`
`pg_is_in_recovery()` on secondary	psql	`t`
Both nodes on same version	Cluster tab	matching badges

If any of these is off, do not initiate a failover until you understand why. A failover into a stale or diverged secondary makes things worse.
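
The checklist above lends itself to a pre-failover gate. The sketch below collects the table’s indicators into a struct and lists every one that is out of range; the `Snapshot` type, its field names, and `problems` are hypothetical, and how the values are gathered (Cluster tab, psql) is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds taken from the table above.
const (
	maxHeartbeatAge = 60 * time.Second
	maxLag          = 5 * time.Second
)

// Snapshot is a hypothetical bundle of the indicators in the table.
type Snapshot struct {
	MainHeartbeatAge      time.Duration
	SecondaryHeartbeatAge time.Duration
	ReplicationLag        time.Duration
	SlotActive            bool // SELECT active FROM pg_replication_slots on main
	SecondaryInRecovery   bool // pg_is_in_recovery() on the secondary
	MainVersion           string
	SecondaryVersion      string
}

// problems returns one message per unhealthy indicator, or nil if all pass.
func problems(s Snapshot) []string {
	var out []string
	if s.MainHeartbeatAge >= maxHeartbeatAge {
		out = append(out, "main heartbeat is stale")
	}
	if s.SecondaryHeartbeatAge >= maxHeartbeatAge {
		out = append(out, "secondary heartbeat is stale")
	}
	if s.ReplicationLag >= maxLag {
		out = append(out, "replication lag is high")
	}
	if !s.SlotActive {
		out = append(out, "replication slot is inactive")
	}
	if !s.SecondaryInRecovery {
		out = append(out, "secondary is not in recovery")
	}
	if s.MainVersion != s.SecondaryVersion {
		out = append(out, "versions differ")
	}
	return out
}

func main() {
	s := Snapshot{
		MainHeartbeatAge:      12 * time.Second,
		SecondaryHeartbeatAge: 20 * time.Second,
		ReplicationLag:        40 * time.Second,
		SlotActive:            true,
		SecondaryInRecovery:   true,
		MainVersion:           "1.0.233",
		SecondaryVersion:      "1.0.233",
	}
	for _, p := range problems(s) {
		fmt.Println("investigate before failover:", p)
	}
}
```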

Common pitfalls

Secondary stuck in “syncing”. Look at the Postgres logs on the secondary: docker logs proxpanel-db. Usually the main’s pg_hba.conf doesn’t allow the secondary’s IP for replication.
Replication lag growing unbounded. Network issue, or a long-running write transaction on main is preventing WAL recycling. Run SELECT * FROM pg_stat_replication on main to see the standby’s state and how far behind it is.
Both nodes think they’re main after a network partition. Split-brain. The fencing isn’t automatic — you must manually demote one before re-attaching. See the split-brain section of Failover.
“cluster secret mismatch” on join. You pasted the secret with whitespace or from a UI that truncated it. Re-copy. The secret is exactly 64 hex characters.
Cluster tab is blank. The API on the secondary can’t reach the main’s /api/cluster/heartbeat. Confirm port 80 (or your nginx port) is reachable between the two. Since v1.0.233 the heartbeat goes through port 80; it previously used 8080, which some firewalls blocked.
Subscriber created on main doesn’t appear on secondary’s UI. Check replication lag first. If lag is 0 but the row is still missing, the secondary is querying a different database — confirm cluster_config.is_active = true on the secondary and that its API is reading the same proxpanel-db you think.
Promoted the secondary, but RADIUS requests from MikroTik still time out. The RADIUS server address each NAS points at is not part of the cluster; it lives in MikroTik’s own config. Run /radius set [find] address=NEW_IP on every NAS after failover.
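
The “cluster secret mismatch” pitfall is easy to catch before submitting the join form. A minimal sketch, assuming the secret is 64 hex characters as stated above; the function name `normalizeSecret` is hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// normalizeSecret trims surrounding whitespace from a pasted secret and
// checks that what remains is exactly 64 hex characters.
func normalizeSecret(pasted string) (string, error) {
	s := strings.TrimSpace(pasted)
	if len(s) != 64 {
		return "", fmt.Errorf("secret has %d characters, want 64", len(s))
	}
	if _, err := hex.DecodeString(s); err != nil {
		return "", errors.New("secret contains non-hex characters")
	}
	return s, nil
}

func main() {
	// Example value only: 64 hex characters with stray whitespace around them.
	pasted := "  0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef\n"
	secret, err := normalizeSecret(pasted)
	fmt.Println(len(secret), err)
}
```

A length other than 64 points at truncation, while a failed hex decode points at a wrong or partial copy.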

Permissions

Cluster setup, join, leave, and failover all require an admin account. There is no separate permission for them; the routes are admin-only. Resellers cannot view the cluster status.

Related pages

PostgreSQL Replication: the WAL streaming layer in detail.
Failover: manual and automatic promotion of the secondary.
Cross-Server Restore: an alternative to a cluster when you only need DR, not a warm standby.
Backups & Recovery: still required; replication doesn’t protect against logical corruption.