Stale Session Cleanup

A “stale” session is a row in the radacct table where the PPPoE session ended but the matching Accounting-Stop packet never arrived. The row sits there with acctstoptime IS NULL forever, the dashboard claims the subscriber is online, and the IP they were using stays marked in_use in the pool tracker.

This happens regularly in real networks:

The MikroTik reboots while a session is up — RouterOS doesn’t replay queued accounting packets on boot.
A power flap at the BNG drops all sessions, and the stop packets are never sent.
A network blip drops UDP 1813 packets temporarily — Acct-Stop is unreliable by design (it’s UDP, no retransmission guarantee).
The CPE device hard-powers off without a PPPoE LCP-TermReq.

Left alone, stale sessions accumulate forever. On one production install in early 2026 the count reached 955 ghost sessions for 295 real subscribers before the sweeper was deployed.

The fix is a background service — StaleSessionCleanupService in internal/services/stale_session_cleanup.go — that runs every 5 minutes and closes any session that hasn’t had an interim update in the last 30 minutes.

How the sweeper runs

The service starts automatically when the API container boots. The default configuration is:

| Setting | Default | Where it’s set / notes |
| --- | --- | --- |
| Stale threshold | 30 minutes (no interim update) | `NewStaleSessionCleanupService(30)` in `cmd/api/main.go` |
| Check interval | 5 minutes | Hard-coded in the service struct |
| First run delay | 2 minutes after boot | Lets the system stabilize before the first sweep |

The 30-minute threshold is chosen because the default interim-update interval is 30 s. A router that has stayed silent for 60× the expected interval has missed about 60 consecutive updates, so the session is almost certainly dead.

What gets closed

Each sweep runs two SQL statements:

1. Close stale radacct rows

UPDATE radacct
   SET acctstoptime       = NOW(),
       acctterminatecause = 'Stale-Session-Cleanup'
 WHERE acctstoptime IS NULL
   AND (acctupdatetime IS NULL OR acctupdatetime < $1)
   AND (acctstarttime < $1);

Here $1 is the cutoff timestamp, NOW() minus 30 minutes. A session is closed if:

It has no stop time yet, and
Either it has no interim update, or its last interim update is older than 30 minutes, and
It started more than 30 minutes ago (so very fresh sessions are never false positives).

The acctterminatecause is set to Stale-Session-Cleanup so reports can distinguish ghost-closures from normal disconnects.

2. Sync `subscribers.is_online`

UPDATE subscribers
   SET is_online = false
 WHERE is_online = true
   AND deleted_at IS NULL
   AND (last_seen IS NULL OR last_seen < $1)
   AND username NOT IN (
       SELECT DISTINCT username FROM radacct WHERE acctstoptime IS NULL
   );

A subscriber is marked offline if:

They’re currently flagged online, and
Their last_seen is older than 30 minutes (or null), and
They have no open radacct row.

The last_seen check is critical: it prevents the sweeper from fighting QuotaSyncService. QuotaSync updates last_seen every 30 seconds for active users (both via the MikroTik API and via RADIUS interim-update packets). After a container restart, radacct may be temporarily empty while the MikroTik still has the sessions up. Without the last_seen check, the sweeper would mark every online user offline, QuotaSync would mark them online again on its next tick, and the flag would flap.

When to expect the sweeper to fire

| Scenario | What the sweeper does |
| --- | --- |
| MikroTik reboot | All open radacct rows for that NAS are closed within 35 minutes (no interim updates can arrive from a dead router). Subscribers are marked offline. As they reconnect, fresh radacct rows are created. |
| Power outage | Same as above. |
| API container restart | `last_seen` is preserved (stored in DB), so subscribers stay online if their MikroTik is still up. The sweeper only closes radacct rows that genuinely have no interim updates. |
| Network blip | If interim updates resume within 30 minutes, nothing happens. If they don’t, the sweeper closes the rows. |
| Single subscriber’s CPE hard-powers off | Their session is closed after 30 minutes. |

The 5-minute check interval means the maximum delay between a session becoming stale and being closed is about 5 minutes — fast enough for the dashboard to feel accurate, slow enough to not pound the database every second.

MikroTik reboot recovery walkthrough

This is the scenario the sweeper was originally built for. Here’s the exact timeline:

t=0: MikroTik reboots. 200 PPPoE sessions are killed instantly; no Acct-Stop packets are sent.
t+30s: Sessions start coming back online as customers reconnect. New radacct rows are created for them.
t+5min: The dashboard shows ~400 online sessions: the 200 new (real) sessions plus the 200 rows orphaned by the reboot (stale).
t+30min: The 200 orphaned rows have not had an interim update for 30 minutes, so they now match the stale criteria.
t+30–35min: The next sweeper cycle fires. The 200 stale rows are closed with acctterminatecause = 'Stale-Session-Cleanup'. The 200 subscribers who were “double-counted” lose their second (stale) entry. Dashboard returns to ~200 online.

Operators who can’t wait 35 minutes for the dashboard to converge can trigger a manual sweep via the Sessions → Force cleanup button (or docker restart proxpanel-api, which triggers a sweep 2 minutes after boot).

Tuning the threshold

The 30-minute default is a balance. If you operate at scale or with aggressive accounting intervals, consider:

Lower threshold (15 min): faster cleanup. Risk: a short run of dropped interim-update packets can false-close a still-active session, especially if your interim-update interval is much longer than the 30 s default. The subscriber’s bytes for the next 30 s aren’t counted; their queue is recreated when they next reconnect.
Higher threshold (60 min): safer. Risk: dashboard inaccuracies linger longer; ghost IPs stay marked in-use for an hour after a reboot.

To change it, set the staleMinutes argument in cmd/api/main.go when constructing the service. There is no UI knob; this is an operations-only setting.

Common diagnostic queries

-- How many open sessions right now?
SELECT COUNT(*) FROM radacct WHERE acctstoptime IS NULL;

-- How many of those look stale?
SELECT COUNT(*) FROM radacct
 WHERE acctstoptime IS NULL
   AND acctstarttime < NOW() - INTERVAL '30 minutes'
   AND (acctupdatetime IS NULL OR acctupdatetime < NOW() - INTERVAL '30 minutes');

-- Who has had the sweeper fire most often in the last week?
SELECT username, COUNT(*) AS cleanups
  FROM radacct
 WHERE acctterminatecause = 'Stale-Session-Cleanup'
   AND acctstoptime > NOW() - INTERVAL '7 days'
 GROUP BY username
 ORDER BY cleanups DESC
 LIMIT 20;

The third query is the most useful — it identifies CPE devices that hard-power off regularly (the customer’s house is power-flapping, or the CPE has a faulty PSU).

What the sweeper does NOT do

To stay safe at production scale, the sweeper is deliberately conservative:

It does not close sessions younger than 30 minutes, even if they have no interim updates. Brand-new sessions might just not have ticked yet.
It does not send CoA Disconnect to the NAS. It only updates the database. If the session is genuinely still up on the router (e.g. the RADIUS container was disconnected from the network for a while), the next interim update will reopen the radacct row. There’s no risk of disconnecting a live user on a false positive.
It does not garbage-collect old closed sessions. That’s the job of radacct partitioning / archival, a separate service.
It does not touch is_online for subscribers with recent last_seen. Active users are safe even if radacct is briefly out of sync.

Related pages

RADIUS Server Setup — interim-update interval, the input the sweeper depends on.
IP Pool Management — what happens to in-use IPs after stale sessions close.
MikroTik Integration — configuring interim-update on RouterOS.
Sessions — the UI surface that reflects the sweeper’s work.