Failover (Manual + Auto)

Failover is the moment your secondary becomes the new main. The data path stops flowing through the old primary, Postgres exits recovery on the replica, the panel starts accepting writes, and you update DNS or your RADIUS configs to point at the new IP.

ProxRad handles automatic failover for unplanned outages and exposes a one-click button for planned switchovers (maintenance, scheduled migrations). This page is the runbook — both for the panel doing it itself and for you doing it by hand when the panel can’t.

Modes

Mode	Triggered by	Data loss?	Speed
Auto	`ClusterFailoverService` after 2 minutes of no main heartbeat	Possible (lag-dependent)	~30 s
Manual (planned)	Admin clicks Promote to Main in the UI	Zero (writes fenced first)	~10 s
Manual (emergency)	Admin clicks Promote to Main with main offline	Possible (lag-dependent)	~10 s
CLI fallback	`pg_promote()` by hand	Same as emergency	~5 s

DNS / VIP updates are not automated. You must repoint clients yourself. See “What you must do that the panel doesn’t” below.

ClusterFailoverService — the automatic path

services/cluster_failover.go runs only on secondary servers. It loops every 30 seconds:

ticker := 30s
threshold := 2 min

loop:
  GET https://MAIN/health
  if ok:
    lastMainHeartbeat = now()
  else if time.Since(lastMainHeartbeat) >= threshold:
    performFailover()

With a 30-second interval, the 2-minute threshold is reached on the fourth consecutive failed health check, and that is when promotion starts. This keeps a brief network blip from triggering a failover.
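
The decision rule in the loop above can be sketched as a small Go type, which makes the "fourth failed check" arithmetic easy to verify. This is a minimal sketch under stated assumptions, not the code in services/cluster_failover.go: the names `monitor`, `observe`, and `firstFailoverTick` are hypothetical, and a fixed clock replaces the real ticker and the HTTP health check.

```go
package main

import (
	"fmt"
	"time"
)

const (
	checkInterval = 30 * time.Second // loop period from the text
	threshold     = 2 * time.Minute  // default auto-failover threshold
)

// monitor tracks the last successful health check of the main.
// Hypothetical type, used only for this illustration.
type monitor struct {
	lastHeartbeat time.Time
}

// observe records one health-check result taken at time now and reports
// whether failover should start.
func (m *monitor) observe(ok bool, now time.Time) bool {
	if ok {
		m.lastHeartbeat = now
		return false
	}
	return now.Sub(m.lastHeartbeat) >= threshold
}

// firstFailoverTick replays a sequence of check results, one per interval,
// and returns the index of the check that triggers failover, or -1 if none does.
func firstFailoverTick(results []bool) int {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC) // arbitrary fixed clock
	m := &monitor{lastHeartbeat: start}
	for i, ok := range results {
		now := start.Add(time.Duration(i) * checkInterval)
		if m.observe(ok, now) {
			return i
		}
	}
	return -1
}

func main() {
	// Main answers once, then stays down: the fourth failure triggers failover.
	fmt.Println(firstFailoverTick([]bool{true, false, false, false, false}))
	// A single good check in between resets the clock, so nothing fires.
	fmt.Println(firstFailoverTick([]bool{true, false, false, false, true, false, false, false}))
}
```

The first call prints 4 (index 0 is the successful check, so index 4 is the fourth failure); the second prints -1 because the successful check at index 4 resets the heartbeat.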

What `performFailover()` does

1. Log event. Insert `failover_started` into `cluster_events`.
2. Check replication lag. Run `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));`. If the result exceeds 30 seconds, log a warning that data loss is possible.
3. Promote Postgres. Execute `SELECT pg_promote();` against the local `proxpanel-db`. It returns true once the standby has exited recovery and accepts writes.
4. Stop Redis replication. Run `docker exec proxpanel-redis redis-cli REPLICAOF NO ONE`, which turns the replica into a standalone master.
5. Update cluster config. `UPDATE cluster_config SET server_role = 'main', main_server_ip = local_ip`. After this point the local API accepts writes.
6. Update node statuses. Mark the old main as offline in `cluster_nodes` and mark self as main.
7. Notify cluster. POST to `/api/cluster/notify` on every known secondary. A 2-node cluster usually has none, but a multi-secondary setup needs this so the other secondaries switch their `main_server_ip`.
8. Restart RADIUS. Run `docker restart proxpanel-radius` so it picks up the new role.
9. Log event. Insert `failover_completed` into `cluster_events`.
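
The sequence above is an ordered list of actions, which can be sketched in Go as a step runner. This is an illustration only: the type `step` and the functions `stub`, `runSteps`, and `lagIsRisky` are hypothetical names, the real Postgres, Redis, and Docker calls are replaced by stubs, and stopping at the first error is an assumption of this sketch rather than documented panel behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// lagWarnSeconds mirrors the 30-second warning level of the lag check.
const lagWarnSeconds = 30.0

// step is one named action in the failover sequence.
type step struct {
	name string
	run  func() error
}

// stub returns a step that does no real work and returns err.
// It stands in for the real Postgres, Redis, and Docker calls.
func stub(name string, err error) step {
	return step{name: name, run: func() error { return err }}
}

// errPromote simulates a failed promotion for the demo.
var errPromote = errors.New("promotion failed")

// lagIsRisky reports whether replay lag is above the warning level.
func lagIsRisky(lagSeconds float64) bool {
	return lagSeconds > lagWarnSeconds
}

// runSteps executes steps in order and stops at the first error (an
// assumption of this sketch). It returns the names of completed steps.
func runSteps(steps []step) ([]string, error) {
	done := make([]string, 0, len(steps))
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	if lagIsRisky(45) {
		fmt.Println("warning: replication lag above 30 s, data loss is possible")
	}
	steps := []step{
		stub("log failover_started", nil),
		stub("promote Postgres", errPromote),
		stub("stop Redis replication", nil),
	}
	done, err := runSteps(steps)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

Returning the list of completed steps alongside the error shows how far a failed run got, which is the same information you would look for in `cluster_events` and the API logs.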

The whole sequence takes about 30 seconds, after which the new main is fully operational. If the old main comes back, it will see itself outvoted and must be manually re-attached as a secondary (see “Re-attaching the old main as a secondary” below).

Manual planned failover (zero-loss switchover)

This is the path for “I want to take down the main for hardware maintenance.” It’s the only failover path with no data-loss risk because writes are fenced first.

1. On the current main, go to Settings → Cluster, click Switchover to Secondary, and select the target secondary.
2. The main:
   - Fences writes by setting `default_transaction_read_only = on` for new connections.
   - Waits for in-flight transactions to finish.
   - Confirms replication lag is 0 bytes.
   - Demotes itself with `pg_demote()` (not a PostgreSQL built-in; this step creates `standby.signal` and restarts Postgres in replica mode).
3. The secondary:
   - Receives the switchover signal.
   - Calls `pg_promote()`.
   - Becomes the new main.
4. You update DNS to point at the new main (manual step).
5. The old main starts streaming from the new main as a normal secondary.

Time: about 30 seconds, during which no writes are accepted. No data loss.
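
The "lag is 0 bytes" check comes down to comparing two WAL positions. The Go sketch below shows one way to compute that difference from LSN strings, which PostgreSQL prints as two hexadecimal numbers (the high and low 32 bits of the position). How the panel actually measures lag is not documented here, so treat this as an assumption-laden illustration: `parseLSN`, `lagBytes`, and `safeToSwitch` are hypothetical names, and the LSN values are made up. On a live system the inputs would come from `pg_current_wal_lsn()` on the main and `pg_last_wal_replay_lsn()` on the secondary.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts an LSN such as "0/3000060" into a byte position.
func parseLSN(s string) (uint64, error) {
	hi, lo, found := strings.Cut(s, "/")
	if !found {
		return 0, fmt.Errorf("invalid LSN %q", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// lagBytes returns how far the replay position trails the main's current
// WAL position, in bytes.
func lagBytes(mainLSN, replayLSN string) (uint64, error) {
	m, err := parseLSN(mainLSN)
	if err != nil {
		return 0, err
	}
	r, err := parseLSN(replayLSN)
	if err != nil {
		return 0, err
	}
	if r >= m {
		return 0, nil // replica has replayed everything the main wrote
	}
	return m - r, nil
}

// safeToSwitch applies the three preconditions from the list above:
// writes fenced, no in-flight transactions, zero lag.
func safeToSwitch(writesFenced bool, inFlightTx int, lag uint64) bool {
	return writesFenced && inFlightTx == 0 && lag == 0
}

func main() {
	lag, err := lagBytes("0/3000148", "0/3000060")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("lag bytes:", lag, "safe:", safeToSwitch(true, 0, lag))
}
```

With these sample positions the program prints `lag bytes: 232 safe: false`; once both positions match, the lag is 0 and the switchover precondition holds.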

Manual emergency failover

When the main is dead, frozen, or unreachable and you don’t want to wait 2 minutes for auto-failover:

1. On the secondary, go to Settings → Cluster. The UI shows:

⚠ MAIN SERVER OFFLINE
Main server (<sample-host>) has been offline for X minutes
[ Promote to Main Server ]

2. Click Promote to Main Server.
3. The secondary immediately runs the same `performFailover()` flow as the automatic path.
4. Update DNS and the MikroTik RADIUS target.

What you must do that the panel doesn’t

Task	Why it’s manual
Update DNS (e.g. `panel.example.com` → new IP)	The panel doesn’t own your DNS provider. Use Cloudflare or your registrar API.
Update MikroTik `/radius set address=NEW_IP`	Some operators have many MikroTik routers and a scripted update is faster than the panel iterating.
Update VIP / load balancer	If you front the panel with HAProxy or a cloud LB, switch the backend to the new IP.
Re-attach the old main as a new secondary	The panel can’t safely rebase the old main automatically — it might have unreplicated writes that need preserving.

Re-attaching the old main as a secondary

Once the old main is back online, it cannot resume as main — its WAL diverged from the new main at the moment of promotion. You must rebase it:

On the old main, stop the API, RADIUS, and Postgres:

cd /opt/proxpanel
docker compose stop api radius proxpanel-db

Back up the current data directory. It may contain writes that didn’t replicate before the crash; to recover them later, restore this archive into a scratch Postgres instance and run pg_dump against it:

docker run --rm -v proxpanel_postgres_data:/data -v /tmp:/backup alpine \
    tar -czf /backup/postgres_old_main_$(date +%s).tar.gz -C /data .

Clear the data directory:

docker run --rm -v proxpanel_postgres_data:/data alpine sh -c "rm -rf /data/*"

Run pg_basebackup from the new main:

# -S requires the replication slot to exist on the new main already;
# add -C (--create-slot) if it does not.
docker run --rm \
    -v proxpanel_postgres_data:/var/lib/postgresql/data \
    -e PGPASSWORD='<DB_PASSWORD>' postgres:16 \
    pg_basebackup -h NEW_MAIN_IP -p 5432 -U replicator \
      -D /var/lib/postgresql/data -Fp -Xs -P -R -S replica_old_main

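Mark it as a standby and fix file ownership. pg_basebackup -R already creates standby.signal, so the touch is only a safeguard; the chown assigns the files to UID/GID 999, which is the postgres user in the official Debian-based postgres image: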

docker run --rm -v proxpanel_postgres_data:/data alpine touch /data/standby.signal
docker run --rm -v proxpanel_postgres_data:/data alpine chown -R 999:999 /data

Start everything:

docker compose start proxpanel-db
docker compose start api radius

In the new main’s UI, the recovered server should appear in the Cluster tab as a secondary within 60 seconds.

Alternatively, the new main’s UI has a “Demote and Re-attach” button (Cluster → click the offline old main → Re-attach) that generates the equivalent script and runs it over SSH.

Split-brain recovery

If two servers both believe they are main (e.g. auto-failover fired during a network partition, then the partition healed), you have split-brain. Both databases have accepted writes that the other doesn’t know about.

1. Identify which is authoritative. This is usually the one that more clients are still talking to. Compare `radacct` row counts since the partition: `SELECT count(*) FROM radacct WHERE acctstarttime > 'PARTITION_START';`. The server with more sessions is probably the live one.
2. Take the other one offline. Shut down its API and Postgres containers.
3. On the authoritative server, mark the other as offline in `cluster_nodes`.
4. Recover writes from the loser if needed. `pg_dump` the loser (bring up only its Postgres container, with the API still stopped), identify rows missing from the authoritative DB, and replay them manually.
5. Rebase the loser as a secondary using the procedure above.
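
The session-count comparison from the first step can be sketched as a small Go helper. It is only a heuristic, and the names `nodeSessions` and `pickAuthoritative` as well as the node labels are hypothetical; the counts are assumed to come from running the radacct query on each server. A tie returns no winner so that a person makes the call.

```go
package main

import "fmt"

// nodeSessions holds the result of the radacct count for one server.
type nodeSessions struct {
	Node     string
	Sessions int // radacct rows with acctstarttime after the partition began
}

// pickAuthoritative returns the node with the most sessions since the
// partition. It returns ok=false when the input is empty or the top
// counts tie, so an operator decides instead.
func pickAuthoritative(nodes []nodeSessions) (winner string, ok bool) {
	best, tie := -1, false
	for _, n := range nodes {
		switch {
		case n.Sessions > best:
			best, winner, tie = n.Sessions, n.Node, false
		case n.Sessions == best:
			tie = true
		}
	}
	if best < 0 || tie {
		return "", false
	}
	return winner, true
}

func main() {
	winner, ok := pickAuthoritative([]nodeSessions{
		{Node: "node-a", Sessions: 412},
		{Node: "node-b", Sessions: 37},
	})
	fmt.Println(winner, ok)
}
```

With the sample counts this prints `node-a true`. Session count is only one signal; if the result is close, also check which server your routers and DNS currently point at before taking anything offline.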

Prevention: always update DNS / MikroTik immediately after failover. Don’t leave the old main reachable on the same IP.

API endpoints (for scripting)

# Manual switchover (planned)
curl -X POST -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"target_node_id":2}' \
     https://MAIN/api/cluster/failover

# Promote this secondary to main (emergency)
curl -X POST -H "Authorization: Bearer $TOKEN" \
     https://SECONDARY/api/cluster/promote-to-main

# Check main status (used by the UI banner)
curl -s https://SECONDARY/api/cluster/check-main-status | jq
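
If you script failover from Go rather than curl, the same two POST calls can be built with the standard library. This is a sketch that mirrors the curl commands above and nothing more: the function names `switchoverRequest` and `promoteRequest` and the example host names are hypothetical, and the response format is not documented here, so the sketch stops at building the request.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// switchoverRequest builds the planned-switchover call to the current main.
func switchoverRequest(mainURL, token string, targetNodeID int) (*http.Request, error) {
	body, err := json.Marshal(map[string]int{"target_node_id": targetNodeID})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, mainURL+"/api/cluster/failover", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// promoteRequest builds the emergency promotion call to a secondary.
func promoteRequest(secondaryURL, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, secondaryURL+"/api/cluster/promote-to-main", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, err := switchoverRequest("https://main.example.com", "TOKEN", 2)
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it: resp, err := http.DefaultClient.Do(req)
}
```

Separating request construction from sending keeps the sketch free of network access and lets you log or review the exact call before a promotion is triggered.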

Tuning the auto-failover threshold

Auto-failover is on by default with a 2-minute threshold. You can disable it or change the threshold via cluster_config:

UPDATE cluster_config SET auto_failover_enabled = false;
-- or
UPDATE cluster_config SET auto_failover_threshold_seconds = 300;  -- 5 min

For production, 2 minutes is the sweet spot. Lower means false positives during brief network blips; higher means longer customer-facing downtime.
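
To see what a given threshold means in practice, the Go sketch below works out how many failed checks it takes and how long an outage lasts before failover starts. It assumes the 30-second loop described earlier and ignores HTTP timeouts and scheduling jitter; `checksToTrigger` and `detectionWindow` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

const checkInterval = 30 * time.Second // loop period of ClusterFailoverService

// checksToTrigger returns how many consecutive failed checks are needed
// before the threshold is reached: ceil(threshold / interval), at least 1.
func checksToTrigger(threshold time.Duration) int {
	n := int((threshold + checkInterval - 1) / checkInterval)
	if n < 1 {
		n = 1
	}
	return n
}

// detectionWindow returns the shortest and longest time between the main
// going down and failover starting. The spread comes from where in the
// 30-second cycle the outage begins.
func detectionWindow(threshold time.Duration) (shortest, longest time.Duration) {
	longest = time.Duration(checksToTrigger(threshold)) * checkInterval
	return longest - checkInterval, longest
}

func main() {
	for _, seconds := range []int{60, 120, 300} {
		threshold := time.Duration(seconds) * time.Second
		shortest, longest := detectionWindow(threshold)
		fmt.Printf("threshold=%v checks=%d detection=%v..%v\n",
			threshold, checksToTrigger(threshold), shortest, longest)
	}
}
```

Under these assumptions the default 120-second threshold needs 4 failed checks and detects an outage after 90 to 120 seconds; a 300-second threshold needs 10 checks and takes 270 to 300 seconds. The failover itself then adds roughly 30 seconds on top.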

Common pitfalls

Auto-failover fires, but DNS still points to the old main. Customers hit the old main (now a replica), which returns 503 on writes. Update DNS within seconds of promotion, and consider Cloudflare or another low-TTL DNS provider.
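`pg_promote()` fails or returns false. On a server that is not in recovery (already a primary, or the replica’s `standby.signal` was deleted), the call raises “recovery is not in progress” rather than returning false. A false return means promotion did not complete within the wait timeout (60 seconds by default). Check `SELECT pg_is_in_recovery();` first.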
Failover completes but RADIUS still rejects auth. RADIUS container restart didn’t happen. docker logs proxpanel-radius should show “License client initialized” within 30 seconds of failover. If not, docker restart proxpanel-radius manually.
“Failover in progress” hangs. Look at `cluster_events` for the last `failover_started` and check the API logs on the secondary for the actual error. The usual cause is a `pg_hba.conf` or `pg_basebackup` permission problem affecting the follow-up rebase.
Two nodes both show role=main in `cluster_config`. This is split-brain. Stop the wrong one first; do not just UPDATE the role. Then follow “Split-brain recovery” above.

Permissions

Manual failover and switchover are admin-only. The buttons don’t appear in the reseller UI.

Related pages

HA Cluster: cluster topology and setup.
PostgreSQL Replication: pg_promote(), replication slots, lag.
Cross-Server Restore: when you don’t have a hot standby and need to recover from a backup.