Failover (Manual + Auto)
Failover is the moment your secondary becomes the new main. The data path stops flowing through the old primary, Postgres exits recovery on the replica, the panel starts accepting writes, and you update DNS or your RADIUS configs to point at the new IP.
ProxPanel handles automatic failover for unplanned outages and exposes a one-click button for planned switchovers (maintenance, scheduled migrations). This page is the runbook — both for the panel doing it itself and for you doing it by hand when the panel can’t.
| Mode | Triggered by | Data loss? | Speed |
|---|---|---|---|
| Auto | ClusterFailoverService after 2 minutes of no main heartbeat | Possible (lag-dependent) | ~30 s |
| Manual (planned) | Admin clicks Promote to Main in the UI | Zero (writes fenced first) | ~10 s |
| Manual (emergency) | Admin clicks Promote to Main with main offline | Possible (lag-dependent) | ~10 s |
| CLI fallback | pg_promote() by hand | Same as emergency | ~5 s |
DNS / VIP updates are not automated. You must repoint clients yourself. See the DNS section below.
ClusterFailoverService — the automatic path
`services/cluster_failover.go` runs only on secondary servers. It loops every 30 seconds:

    ticker    := 30s
    threshold := 2 min

    loop:
        GET https://MAIN/health
        if ok:
            lastMainHeartbeat = now()
        else if time.Since(lastMainHeartbeat) >= threshold:
            performFailover()

The 2-minute threshold means 4 consecutive failed health checks (at 30 s each) before promotion. This avoids over-eager flapping during a brief network blip.
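The threshold arithmetic can be checked with a small shell sketch. It is not the Go service's code: `should_failover` and `simulate_outage` are hypothetical helper names, and the simulation assumes the main stops answering right after a successful check at t=0.

```shell
# should_failover NOW LAST_OK THRESHOLD
# Succeeds (exit 0) once at least THRESHOLD seconds have passed since the
# last successful health check. All arguments are whole seconds.
should_failover() {
    [ $(( $1 - $2 )) -ge "$3" ]
}

# simulate_outage INTERVAL THRESHOLD
# Prints one line per failed health check until promotion would trigger.
# Assumes the last good check was at t=0 and INTERVAL is greater than 0.
simulate_outage() (
    interval=$1 threshold=$2 t=0
    while :; do
        t=$((t + interval))
        if should_failover "$t" 0 "$threshold"; then
            echo "t=${t}s: promote"
            break
        fi
        echo "t=${t}s: check failed, waiting"
    done
)
```

Running `simulate_outage 30 120` prints three "waiting" lines followed by `t=120s: promote`, which matches the four failed checks described above.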
What performFailover() does
1. Log event. Insert `failover_started` into `cluster_events`.
2. Check replication lag. `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))`. If the result is > 30 seconds, log a warning: data loss is possible.
3. Promote Postgres. Execute `SELECT pg_promote();` against the local `proxpanel-db`. Returns true when the standby exits recovery and starts accepting writes.
4. Stop Redis replication. `docker exec proxpanel-redis redis-cli REPLICAOF NO ONE`. The replica becomes a standalone master.
5. Update cluster config. `UPDATE cluster_config SET server_role = 'main', main_server_ip = local_ip`. After this point the local API accepts writes.
6. Update node statuses. Mark the old main as `offline` in `cluster_nodes`, mark self as `main`.
7. Notify cluster. POST to `/api/cluster/notify` on every known secondary (there usually isn’t one in a 2-node cluster, but a multi-secondary setup needs this so they know to switch their `main_server_ip`).
8. Restart RADIUS. `docker restart proxpanel-radius` so it picks up the new role.
9. Log event. Insert `failover_completed` into `cluster_events`.
The whole thing takes ~30 seconds. The new main is fully operational; the old main, if it comes back, will see itself outvoted and must be manually re-attached as a secondary.
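If you promote by hand, you can apply the same 30-second lag rule to the output of the lag query before deciding. The sketch below is a hypothetical helper (`classify_lag` is not part of ProxPanel). It treats an empty result as unknown, on the assumption that the query returned NULL because nothing has been replayed yet.

```shell
# classify_lag LAG_SECONDS [LIMIT_SECONDS]
# Prints "ok", "warn", or "unknown" for a lag value such as the output of
# the EXTRACT(EPOCH ...) query. Fractional seconds are accepted.
classify_lag() (
    lag=$1 limit=${2:-30}
    if [ -z "$lag" ]; then
        # Empty input: the query returned NULL, so lag cannot be judged.
        echo "unknown"
    else
        # awk handles the floating-point comparison that sh cannot.
        awk -v lag="$lag" -v limit="$limit" \
            'BEGIN { if (lag + 0 > limit + 0) print "warn"; else print "ok" }'
    fi
)
```

For example, `classify_lag 12.5` prints `ok` and `classify_lag 45` prints `warn`. A `warn` or `unknown` result means recent writes may be lost on promotion.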
Manual planned failover (zero-loss switchover)
This is the path for “I want to take down the main for hardware maintenance.” It’s the only failover path with no data-loss risk because writes are fenced first.

1. On the current main, go to Settings → Cluster, click Switchover to Secondary, and select the target secondary.
2. The main:
   - Fences writes by setting `default_transaction_read_only = on` for new connections.
   - Waits for in-flight transactions to finish.
   - Confirms replication lag is 0 bytes.
   - Calls `pg_demote()` (writes `standby.signal`, restarts Postgres into replica mode). PostgreSQL itself has no built-in function of this name, so this is presumably a panel-side routine.
3. The secondary:
   - Receives the switchover signal.
   - Calls `pg_promote()`.
   - Becomes the new main.
4. You update DNS to point at the new main (manual step).
5. The old main starts streaming from the new main as a normal secondary.
Time: about 30 seconds, during which no writes are accepted. No data loss.
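The two waiting steps on the main (in-flight transactions finished, lag at 0 bytes) amount to polling a condition with a bounded number of attempts. The following is a generic sketch of that pattern, not the panel's code. `wait_until` and `lag_is_zero` are hypothetical names, and how you obtain the lag in bytes is left to you (for instance a query using `pg_wal_lsn_diff()` on the main).

```shell
# wait_until TRIES DELAY_SECONDS CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY_SECONDS between attempts.
# Succeeds as soon as CMD succeeds; fails if every attempt fails.
wait_until() (
    tries=$1 delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && exit 0
        i=$((i + 1))
        # Only sleep when another attempt is still coming.
        [ "$i" -lt "$tries" ] && sleep "$delay"
    done
    exit 1
)

# lag_is_zero BYTES
# Succeeds only when the reported replication lag is exactly 0 bytes.
lag_is_zero() {
    [ "$1" = "0" ]
}
```

A bounded retry matters here: if the lag never reaches zero, it is better to abort the switchover and leave the main in place than to promote with data missing.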
Manual emergency failover
When the main is dead, frozen, or unreachable and you don’t want to wait 2 minutes for auto:

1. On the secondary, open Settings → Cluster.
2. The UI shows:

       ⚠ MAIN SERVER OFFLINE
       Main server (<sample-host>) has been offline for X minutes
       [ Promote to Main Server ]

3. Click Promote to Main Server.
4. The secondary runs the same `performFailover()` flow as the automatic path, immediately.
5. Update DNS / MikroTik RADIUS target.
What you must do that the panel doesn’t
| Task | Why it’s manual |
|---|---|
| Update DNS (e.g. `panel.example.com` → new IP) | The panel doesn’t own your DNS provider. Use Cloudflare or your registrar API. |
| Update MikroTik `/radius set address=NEW_IP` | Some operators have many MikroTik routers and a scripted update is faster than the panel iterating. |
| Update VIP / load balancer | If you front the panel with HAProxy or a cloud LB, switch the backend to the new IP. |
| Re-attach the old main as a new secondary | The panel can’t safely rebase the old main automatically because it might have unreplicated writes that need preserving. |
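For the MikroTik row, a scripted update can start as a dry run that only prints the commands it would execute. The sketch below rests on assumptions not stated on this page: SSH is enabled on the routers, the login user is `admin`, and every RADIUS entry on each router should point at the new main (hence `[find]`). `radius_update_commands` is a hypothetical helper name.

```shell
# radius_update_commands NEW_IP ROUTER [ROUTER...]
# Prints (does not run) one ssh command per router. Review the output,
# then pipe it to sh once it looks right.
radius_update_commands() (
    new_ip=$1
    shift
    for router in "$@"; do
        # Assumption: user "admin" and a single RADIUS target per router.
        printf 'ssh admin@%s "/radius set [find] address=%s"\n' \
            "$router" "$new_ip"
    done
)
```

For example, `radius_update_commands 203.0.113.10 192.0.2.1 192.0.2.2` prints two `ssh` lines. Printing first makes it easy to spot a wrong IP before it reaches every router.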
Re-attaching the old main as a secondary
Once the old main is back online, it cannot resume as main: its WAL diverged from the new main at the moment of promotion. You must rebase it:

1. On the old main, stop the API and Postgres:

       cd /opt/proxpanel
       docker compose stop api radius proxpanel-db

2. Back up the current data directory (it may contain writes that didn’t replicate before the crash; recover them later with `pg_dump` if needed):

       docker run --rm -v proxpanel_postgres_data:/data -v /tmp:/backup alpine \
         tar -czf /backup/postgres_old_main_$(date +%s).tar.gz -C /data .

3. Clear the data directory:

       docker run --rm -v proxpanel_postgres_data:/data alpine sh -c "rm -rf /data/*"

4. Run `pg_basebackup` from the new main. The `-S replica_old_main` option uses an existing replication slot of that name on the new main; `pg_basebackup` creates the slot only if `-C` is also passed.

       docker run --rm \
         -v proxpanel_postgres_data:/var/lib/postgresql/data \
         -e PGPASSWORD='<DB_PASSWORD>' postgres:16 \
         pg_basebackup -h NEW_MAIN_IP -p 5432 -U replicator \
         -D /var/lib/postgresql/data -Fp -Xs -P -R -S replica_old_main

5. Mark it as a standby:

       docker run --rm -v proxpanel_postgres_data:/data alpine touch /data/standby.signal
       docker run --rm -v proxpanel_postgres_data:/data alpine chown -R 999:999 /data

6. Start everything:

       docker compose start proxpanel-db
       docker compose start api radius

7. In the new main’s UI, the recovered server should appear in the Cluster tab as a secondary within 60 seconds.
Alternatively, the new main’s UI has a “Demote and Re-attach” button (Cluster → click the offline old-main → Re-attach) that generates the equivalent script and runs it over SSH.
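When you run the manual procedure, clearing the data directory is the irreversible step, so it is worth confirming the backup tarball first. The helper below is a hypothetical guard, not something the panel ships. It only checks that the file is non-empty and that `gzip` can read it end to end; it does not validate the Postgres data inside.

```shell
# backup_is_usable PATH
# Succeeds only if PATH exists, is non-empty, and passes gzip's
# integrity test. Does not inspect the archive contents.
backup_is_usable() {
    [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}
```

A possible use is `backup_is_usable /tmp/<your-backup>.tar.gz && echo "safe to clear"`, run between the backup step and the clear step. The file name is a placeholder for whatever the backup command produced.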
Split-brain recovery
If two servers both believe they are main (e.g. auto-failover fired during a network partition, then the partition healed), you have split-brain. Both databases have accepted writes that the other doesn’t know about.

1. Identify which is authoritative. Usually the one that more clients are still talking to. Check `radacct` row counts since the partition: `SELECT count(*) FROM radacct WHERE acctstarttime > 'PARTITION_START'`. The one with more sessions is probably the live one.
2. Take the other one offline. Shut down its API and Postgres containers.
3. On the authoritative server, mark the other as offline in `cluster_nodes`.
4. Recover writes from the loser if needed. `pg_dump` the loser, identify rows missing from the authoritative DB, replay them manually.
5. Rebase the loser as a secondary using the procedure above.
Prevention: always update DNS / MikroTik immediately after failover. Don’t leave the old main reachable on the same IP.
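The row-count comparison from step 1 can be scripted once you have run the `radacct` count on both servers. This is only the heuristic described above written as a hypothetical helper (`pick_authoritative`). A tie, or a close result, still calls for a human decision.

```shell
# pick_authoritative NAME_A COUNT_A NAME_B COUNT_B
# Prints the name of the server with more radacct sessions since the
# partition began, or "tie" when the counts are equal.
pick_authoritative() {
    if [ "$2" -gt "$4" ]; then
        echo "$1"
    elif [ "$4" -gt "$2" ]; then
        echo "$3"
    else
        echo "tie"
    fi
}
```

For example, `pick_authoritative node-a 412 node-b 9` prints `node-a`. The node names and counts here are made up for illustration.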
API endpoints (for scripting)

    # Manual switchover (planned)
    curl -X POST -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"target_node_id":2}' \
      https://MAIN/api/cluster/failover

    # Promote this secondary to main (emergency)
    curl -X POST -H "Authorization: Bearer $TOKEN" \
      https://SECONDARY/api/cluster/promote-to-main

    # Check main status (used by the UI banner)
    curl -s https://SECONDARY/api/cluster/check-main-status | jq

Tuning the auto-failover threshold
Auto-failover is on by default with a 2-minute threshold. You can disable it or change the threshold via `cluster_config`:

    UPDATE cluster_config SET auto_failover_enabled = false;
    -- or
    UPDATE cluster_config SET auto_failover_threshold_seconds = 300;  -- 5 min

For production, 2 minutes is the sweet spot. Lower means false positives during brief network blips; higher means longer customer-facing downtime.
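Because the health check runs every 30 seconds, a threshold that is not a multiple of 30 is effectively rounded up to the next check. The sketch below works out that effective delay. `detection_delay` is a hypothetical helper, and it assumes the check interval stays at 30 seconds and that the first failed check lands one full interval after the last good one.

```shell
# detection_delay THRESHOLD_SECONDS [INTERVAL_SECONDS]
# Prints how many consecutive failed checks occur before promotion and
# how long after the last good check promotion starts.
detection_delay() (
    threshold=$1 interval=${2:-30}
    # Integer ceiling of threshold / interval.
    checks=$(( (threshold + interval - 1) / interval ))
    echo "checks=$checks delay=$(( checks * interval ))s"
)
```

With the default, `detection_delay 120` prints `checks=4 delay=120s`. A threshold of 100 seconds gives the same result, since the check at 90 seconds is still below the threshold.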
Common pitfalls
- Auto-failover fires, then DNS still points to the old main. Customers hit the old server (now a replica), which returns 503 on writes. Always update DNS within seconds of promotion. Consider Cloudflare or a low-TTL DNS provider.
- `pg_promote()` fails or returns false. If Postgres is not in recovery (it is already a primary, or the replica’s `standby.signal` was deleted), the call errors out with “recovery is not in progress”. A false return means promotion did not finish within the wait timeout (60 seconds by default). Check `SELECT pg_is_in_recovery();`.
- Failover completes but RADIUS still rejects auth. The RADIUS container restart didn’t happen. `docker logs proxpanel-radius` should show “License client initialized” within 30 seconds of failover. If not, run `docker restart proxpanel-radius` manually.
- “Failover in progress” hangs. Look at `cluster_events` for the last `failover_started` and check API logs on the secondary for the actual error. The cause is usually a `pg_hba.conf` or `pg_basebackup` permission issue that affects the follow-up rebase.
- Two nodes both showing role=main in `cluster_config`. This is split-brain. Stop the wrong one first; do not just `UPDATE`.
Permissions
Manual failover and switchover are admin-only. The buttons don’t appear in the reseller UI.
Related pages
- HA Cluster: cluster topology and setup.
- PostgreSQL Replication: `pg_promote()`, replication slots, lag.
- Cross-Server Restore: when you don’t have a hot standby and need to recover from a backup.