
HA Cluster

The HA Cluster feature lets you run a second ProxPanel server as a real-time replica of the first. Postgres streams WAL to the secondary, the secondary’s API stays live while its database is read-only, and a one-click button promotes it to main when the primary fails.

This is the right setup when you have more than ~5 K subscribers, when downtime costs you money, or when you simply want a warm spare you can promote without a backup-restore cycle. It is not a replacement for backups — replication faithfully replicates DROP TABLE too. Run both.

| Role | DB state | API writes accepted | RADIUS accepts auth |
| --- | --- | --- | --- |
| standalone | Read-write | Yes | Yes |
| main | Read-write (Postgres primary) | Yes | Yes |
| secondary | Read-only (pg_is_in_recovery() = true) | UI loads, write attempts return 503 | Optional; see RADIUS below |

A fresh install starts in standalone. You convert to main by clicking Configure as Main Server; the second server joins as secondary.

┌─────────────────────────────┐           ┌─────────────────────────────┐
│         MAIN SERVER         │           │      SECONDARY SERVER       │
│                             │           │                             │
│  ┌───────────────────────┐  │   WAL     │  ┌───────────────────────┐  │
│  │ proxpanel-api (RW)    │  │  stream   │  │ proxpanel-api (RO)    │  │
│  │ proxpanel-radius      │  │ ──────▶   │  │ proxpanel-radius      │  │
│  │ proxpanel-frontend    │  │  :5432    │  │ proxpanel-frontend    │  │
│  │                       │  │           │  │                       │  │
│  │ proxpanel-db          │  │           │  │ proxpanel-db          │  │
│  │ • wal_level=replica   │  │           │  │ • standby.signal      │  │
│  │ • max_wal_senders=10  │  │           │  │ • primary_conninfo    │  │
│  │ • replication slot    │  │           │  │ • read replica        │  │
│  └───────────────────────┘  │           │  └───────────────────────┘  │
│                             │           │                             │
│ Sends heartbeat every 30s   │   HTTP    │ Checks main /health 30s     │
│ Polls secondary health 30s  │ ──────▶   │ Auto-failover at 2 min      │
└─────────────────────────────┘           └─────────────────────────────┘
                                                         ▼ (on promotion)
                                             pg_promote(): becomes RW

Three tables under cluster_*:

| Table | Purpose |
| --- | --- |
| cluster_config | This server’s role, IDs, and replication settings. One row. |
| cluster_nodes | All known nodes in the cluster (main + every secondary), with last-heartbeat, CPU/mem/disk, replication lag. |
| cluster_events | Audit log of node_joined, node_left, failover_started, failover_completed, promote_requested, etc. |

These tables are replicated like everything else. The cluster service detects “this is a replica” via SELECT pg_is_in_recovery() and overrides the role to secondary even if the replicated cluster_config row says main — so you don’t accidentally end up with two servers thinking they’re main after a failover.
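The role override can be pictured as a small pure function. This is a minimal sketch in Python for illustration only; the function name `effective_role` and its arguments are hypothetical and are not taken from ProxPanel's code.

```python
def effective_role(config_role: str, in_recovery: bool) -> str:
    """Return the role a node should act as.

    config_role -- role read from the (replicated) cluster_config row
    in_recovery -- result of SELECT pg_is_in_recovery() on the local database
    """
    # A database in recovery is a streaming replica, so the node cannot be
    # main, whatever the replicated cluster_config row claims.
    if in_recovery:
        return "secondary"
    return config_role
```

The point of the design is that the database state, which cannot be replicated by accident, wins over the config row, which can.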

You need two servers with ProxPanel installed and activated, on the same network or with a low-latency private link (sub-50 ms round-trip recommended). Both must run the same ProxPanel version.

  1. On the intended main server, Settings → Cluster → Configure as Main Server.

  2. Enter a friendly server name (“DC1 Primary”). The IP is auto-filled to the local routable IP.

  3. Click Configure. Behind the scenes:

    • Generates a random cluster_id (UUID) and cluster_secret (32-byte random).
    • Writes the cluster_config row with role=main.
    • Inserts this node into cluster_nodes.
    • Sets Postgres for replication:
      ALTER SYSTEM SET wal_level = replica;
      ALTER SYSTEM SET max_wal_senders = 10;
      ALTER SYSTEM SET max_replication_slots = 10;
      ALTER SYSTEM SET wal_keep_size = '1GB';
      ALTER SYSTEM SET hot_standby = on;
      SELECT pg_reload_conf();
    • Creates a replicator user with REPLICATION privilege.
  4. Copy the cluster secret shown in the UI. You’ll paste it on the secondary. It’s the only time the UI shows it in full.
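For a sense of what the generated credentials look like, here is a minimal Python sketch that produces a UUID and a 32-byte random secret encoded as hex. The function name is hypothetical, and the hex encoding is inferred from the 64-character secret length mentioned in the troubleshooting notes; ProxPanel's own generator may differ.

```python
import secrets
import uuid


def new_cluster_credentials() -> tuple[str, str]:
    """Return (cluster_id, cluster_secret) for a new cluster."""
    cluster_id = str(uuid.uuid4())
    # 32 random bytes from the OS CSPRNG, rendered as 64 hex characters.
    cluster_secret = secrets.token_hex(32)
    return cluster_id, cluster_secret
```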

You also need to allow replication connections from the secondary in pg_hba.conf:

host replication replicator <secondary_ip>/32 md5

The cluster setup logs this line — apply it inside the proxpanel-db container and reload Postgres. See PostgreSQL Replication for the full pg_hba walkthrough.

  1. On the intended secondary server, Settings → Cluster → Join as Secondary.

  2. Fill in:

    • Main Server IP: the main’s reachable IP.
    • Cluster Secret: the secret you copied from the main server’s UI.
  3. Click Test Connection. It should report the main’s API, DB, and Redis as reachable from this server.

  4. Click Join Cluster.

    • The secondary POSTs to https://MAIN/api/cluster/join with the cluster secret.

    • The main verifies the secret (using a constant-time comparison to avoid timing attacks), inserts the new node into cluster_nodes, creates a replication slot named replica_<node_id>, and returns the connection info.

    • Secondary’s UI shows a setup_replica.sh script. Review and run it on the secondary’s host:

      bash /tmp/setup_replica.sh

      The script stops proxpanel-db, backs up the current data directory, runs pg_basebackup from the main, writes standby.signal, configures primary_conninfo and primary_slot_name, and restarts.

  5. Once Postgres comes back up in recovery mode, the cluster service auto-detects and starts heartbeating to the main.
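The secret check on the main during the join can be sketched as follows. This is an illustrative Python version using the standard library's constant-time comparison; the function names are hypothetical, and only the `replica_<node_id>` slot naming comes from the steps above.

```python
import hmac


def secret_matches(presented: str, stored: str) -> bool:
    """Compare two secrets without leaking how many characters matched."""
    # hmac.compare_digest takes time independent of where the inputs differ,
    # unlike ==, which can return at the first mismatching character.
    return hmac.compare_digest(presented.encode(), stored.encode())


def slot_name(node_id) -> str:
    """Replication slot name for a joining node (replica_<node_id>)."""
    return f"replica_{node_id}"
```

Note that this sketch does not trim the presented secret, so a value pasted with stray whitespace fails the comparison.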

After this, the secondary’s UI is read-only for writes. Any write attempt (create subscriber, edit service) returns HTTP 503 with {error: "secondary server is read-only"}. Reads, dashboards, and reports all work normally.
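A read-only guard of this kind might look like the sketch below. It is a simplified Python illustration, not ProxPanel's middleware: the function name, and the assumption that every POST, PUT, PATCH, or DELETE counts as a write, are hypothetical. A real implementation would likely classify routes more precisely.

```python
# Assumption: these HTTP methods are treated as writes.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def guard_write(method: str, role: str):
    """Return (status, body) to reject the request, or None to let it through."""
    if role == "secondary" and method.upper() in WRITE_METHODS:
        return 503, {"error": "secondary server is read-only"}
    return None
```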

Cluster service — what runs in the background


The cluster service (services/cluster_service.go) starts on API startup if cluster_config.is_active = true:

| Role | Loop |
| --- | --- |
| main | Every 30 s → check each known node’s last heartbeat. Mark offline after 2 minutes of silence. Run the node-health ticker. |
| secondary | Every 30 s → POST to https://MAIN/api/cluster/heartbeat with CPU%, memory%, disk%, subscriber count, current version, replication lag (SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))). |

The dashboard’s Cluster tab refreshes every 5 s and shows each node with its last seen timestamp, status badge, and replication lag.
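The main's offline check reduces to comparing each node's last heartbeat against a two-minute window. The following Python sketch illustrates that rule; the names `OFFLINE_AFTER` and `node_status` are hypothetical, and whether the real service treats exactly two minutes as online or offline is not stated in this page.

```python
from datetime import datetime, timedelta

# "Mark offline after 2 minutes of silence."
OFFLINE_AFTER = timedelta(minutes=2)


def node_status(last_heartbeat: datetime, now: datetime) -> str:
    """Classify a node from the age of its most recent heartbeat."""
    if now - last_heartbeat >= OFFLINE_AFTER:
        return "offline"
    return "online"
```

With a 30-second heartbeat interval, a node has to miss roughly four consecutive heartbeats before it is marked offline.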

Everything in the public schema replicates. That includes:

  • All subscriber, service, NAS, reseller, transaction, invoice tables.
  • RADIUS check/reply tables (radcheck, radreply, radacct).
  • Bandwidth rules, FUP counters, audit log, settings.
  • Uploads stored in the DB (e.g. branding settings) — but NOT files on disk.

What doesn’t replicate:

  • /opt/proxpanel/frontend/dist/uploads/ (logos, login backgrounds). Must be rsynced manually or scripted.
  • .env (license keys, hardware-bound).
  • TLS certificates (each server has its own).
  • nginx.conf (each server has its own SSL config).
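Syncing the uploads directory can be scripted. The sketch below builds and runs an rsync command from Python; it assumes rsync and SSH access from the main to the secondary, an SSH user of `root`, and the same install path on both servers. None of these assumptions come from ProxPanel itself, and the function names are hypothetical.

```python
import subprocess

UPLOADS_DIR = "/opt/proxpanel/frontend/dist/uploads/"


def rsync_command(secondary_host: str, ssh_user: str = "root") -> list[str]:
    """Build the rsync invocation that copies uploads to the secondary."""
    # -a preserves permissions and timestamps, -z compresses in transit.
    # The trailing slash on the source copies the directory's contents.
    destination = f"{ssh_user}@{secondary_host}:{UPLOADS_DIR}"
    return ["rsync", "-az", UPLOADS_DIR, destination]


def sync_uploads(secondary_host: str) -> None:
    """Run the sync; raises CalledProcessError if rsync fails."""
    subprocess.run(rsync_command(secondary_host), check=True)
```

Calling `sync_uploads("10.0.0.2")` from a cron job or systemd timer on the main would keep logos and login backgrounds current on the secondary.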

To upgrade a cluster, update the nodes in this order:

  1. Update the secondary first (Settings → License → Check for Updates → Install). It restarts; replication resumes.

  2. Verify the secondary is healthy and replication lag is low.

  3. Update the main. Brief downtime (~30 seconds) during container restart.

  4. Both nodes now run the same version. Confirm in the Cluster tab.

The cluster service refuses to perform automatic failover during a version mismatch (mainVersion != currentVersion) — this prevents promoting a secondary that was running an older binary while the main was running a newer one. Always keep versions in sync.
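Combined with the two-minute auto-failover threshold shown in the architecture diagram, the decision can be sketched as below. This is an illustrative Python version; the function and parameter names are hypothetical, and the real service may apply further checks.

```python
# "Auto-failover at 2 min" of the main being unreachable.
FAILOVER_AFTER_SECONDS = 120


def should_auto_failover(seconds_main_unreachable: float,
                         main_version: str,
                         current_version: str) -> bool:
    """Decide whether the secondary may promote itself automatically."""
    # Never promote across a version mismatch.
    if main_version != current_version:
        return False
    return seconds_main_unreachable >= FAILOVER_AFTER_SECONDS
```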

To leave a cluster cleanly:

  • Secondary leaves: Settings → Cluster → Leave Cluster. The secondary stops heartbeating, demotes itself in cluster_config (role → standalone), and you’d then promote its Postgres out of recovery (or just trash it and reinstall).
  • Main removes a secondary: Cluster tab → click × next to the node → confirm. Drops the replication slot, marks the node removed in cluster_nodes.
  • Dismantle the cluster: On main, Disband Cluster. Sets cluster_config.is_active = false, returns to standalone, leaves Postgres replication running until you stop it (so the secondary keeps replicating into a now-orphaned read-only state until you also stop it).

Recommended hardware by subscriber count:

| Subscribers | Recommended main spec | Secondary spec |
| --- | --- | --- |
| Up to 5 K | 4 vCPU / 8 GB RAM / 100 GB SSD | Same |
| 5 K – 15 K | 8 vCPU / 16 GB / 200 GB NVMe | Same |
| 15 K – 30 K | 16 vCPU / 32 GB / 500 GB NVMe | Same |
| 30 K – 60 K | 32 vCPU / 32 GB / 1 TB NVMe | Same (a slower secondary is acceptable but lag will grow under heavy write bursts) |

The secondary does not save you money on hardware — it has to keep up with WAL replay. Spec it to match the main.

Redis holds session caches, dashboard caches, and the JWT blacklist. It is replicated separately from Postgres, using Redis’s REPLICAOF mechanism:

# On secondary's redis container
docker exec proxpanel-redis redis-cli REPLICAOF <main_ip> 6379

The cluster setup automates this. After a failover, the new main runs REPLICAOF NO ONE to become a standalone master. Effective Redis data loss in a failover is typically zero, because the contents are caches that repopulate within seconds; subscriber data lives in Postgres, not Redis.

| Indicator | Where to find it | Healthy value |
| --- | --- | --- |
| Last main heartbeat | Cluster tab → main row | < 60 s |
| Last secondary heartbeat | Cluster tab → secondary row | < 60 s |
| Replication lag (seconds) | Cluster tab → secondary row | < 5 s typically |
| Replication slot active | SELECT active FROM pg_replication_slots on main | t |
| pg_is_in_recovery() on secondary | psql | t |
| Both nodes on same version | Cluster tab | matching badges |

If any of these is off, do not initiate a failover until you understand why. A failover into a stale or diverged secondary makes things worse.
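The checklist lends itself to a pre-failover gate. The Python sketch below encodes the thresholds from the table; the `ClusterHealth` type, its field names, and the function name are hypothetical, and the values would have to be gathered from the Cluster tab or the queries listed above.

```python
from dataclasses import dataclass


@dataclass
class ClusterHealth:
    main_heartbeat_age_s: float
    secondary_heartbeat_age_s: float
    replication_lag_s: float
    slot_active: bool            # SELECT active FROM pg_replication_slots (on main)
    secondary_in_recovery: bool  # SELECT pg_is_in_recovery() (on secondary)
    main_version: str
    secondary_version: str


def unhealthy_indicators(h: ClusterHealth) -> list[str]:
    """Return the indicators that are outside their healthy range."""
    problems = []
    if h.main_heartbeat_age_s >= 60:
        problems.append("main heartbeat stale")
    if h.secondary_heartbeat_age_s >= 60:
        problems.append("secondary heartbeat stale")
    if h.replication_lag_s >= 5:
        problems.append("replication lag high")
    if not h.slot_active:
        problems.append("replication slot inactive")
    if not h.secondary_in_recovery:
        problems.append("secondary not in recovery")
    if h.main_version != h.secondary_version:
        problems.append("version mismatch")
    return problems
```

An empty list means every indicator in the table is within its healthy range.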

  • Secondary stuck in “syncing”. Look at the Postgres logs on the secondary: docker logs proxpanel-db. Usually pg_hba.conf doesn’t allow the secondary’s IP for replication.
  • Replication lag growing unbounded. Usually a network issue, or a long-running write transaction on main that is preventing WAL recycling. Run SELECT * FROM pg_stat_replication on main to see where the replica stands.
  • Both nodes think they’re main after a network partition. Split-brain. The fencing isn’t automatic — you must manually demote one before re-attaching. See the split-brain section of Failover.
  • “cluster secret mismatch” on join. You pasted the secret with whitespace or from a UI that truncated it. Re-copy. The secret is exactly 64 hex characters.
  • Cluster tab is blank. The API on the secondary can’t reach the main’s /api/cluster/heartbeat. Confirm port 80 (or your nginx port) is reachable between the two. Since v1.0.233 the heartbeat goes through port 80; it used to use 8080, which some firewalls blocked.
  • Subscriber created on main doesn’t appear on secondary’s UI. Check replication lag first. If lag is 0 but the row is still missing, the secondary is querying a different database: confirm cluster_config.is_active = true on the secondary and that its API is reading the proxpanel-db you expect.
  • Promoted secondary, but MikroTik still gets RADIUS timeouts. RADIUS NAS targets are not part of the cluster; they live in MikroTik’s own config. Run /radius set [find] address=NEW_IP on every NAS after failover.
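Several of the problems above come down to one node being unable to open a TCP connection to the other. A quick check from either server can be sketched in Python as follows; the function name is hypothetical, and the ports to probe (80 for the heartbeat, 5432 for replication) are the ones named on this page.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, running `port_reachable("<main_ip>", 5432)` on the secondary tells you whether a firewall is blocking the WAL stream; it does not tell you whether pg_hba.conf permits the connection.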

Cluster setup, join, leave, and failover all require an admin account; the routes are admin-only rather than gated by a specific permission. Resellers cannot view the cluster status.