Stale Session Cleanup

A “stale” session is a row in the radacct table where the PPPoE session ended but the matching Accounting-Stop packet never arrived. The row sits there with acctstoptime IS NULL forever, the dashboard claims the subscriber is online, and the IP they were using stays marked in_use in the pool tracker.

This happens regularly in real networks:

  • The MikroTik reboots while a session is up — RouterOS doesn’t replay queued accounting packets on boot.
  • A power flap at the BNG drops all sessions, and the stop packets are never sent.
  • A network blip drops UDP port 1813 traffic for a while. RADIUS accounting runs over UDP, so delivery of an Acct-Stop is not guaranteed.
  • The CPE device hard-powers off without sending a PPP LCP Terminate-Request.

Left alone, stale sessions accumulate forever. On one production install in early 2026 the count reached 955 ghost sessions for 295 real subscribers before the sweeper was deployed.

The fix is a background service — StaleSessionCleanupService in internal/services/stale_session_cleanup.go — that runs every 5 minutes and closes any session that hasn’t had an interim update in the last 30 minutes.

The service starts automatically when the API container boots. The default configuration is:

| Setting | Default | Where it’s set |
|---|---|---|
| Stale threshold | 30 minutes (no interim update) | NewStaleSessionCleanupService(30) in cmd/api/main.go |
| Check interval | 5 minutes | Hard-coded in the service struct |
| First run delay | 2 minutes after boot | Lets the system stabilize before the first sweep |

The 30-minute threshold was chosen because the default interim-update interval is 30 s. If the router hasn’t sent an interim packet in 60× the expected interval, the session is almost certainly dead.

Each sweep runs two SQL statements:

```sql
UPDATE radacct
SET acctstoptime = NOW(),
    acctterminatecause = 'Stale-Session-Cleanup'
WHERE acctstoptime IS NULL
  AND (acctupdatetime IS NULL OR acctupdatetime < $1)
  AND (acctstarttime < $1);
```

Here $1 is the cutoff timestamp, NOW() minus 30 minutes, passed as a query parameter. A session is closed if:

  • It has no stop time yet, and
  • Either it has no interim update, or its last interim is older than 30 minutes, and
  • It started more than 30 minutes ago (no false-positives for very fresh sessions).

The acctterminatecause is set to Stale-Session-Cleanup so reports can distinguish ghost-closures from normal disconnects.

The second statement clears the online flag on subscribers:

```sql
UPDATE subscribers
SET is_online = false
WHERE is_online = true
  AND deleted_at IS NULL
  AND (last_seen IS NULL OR last_seen < $1)
  AND username NOT IN (
    SELECT DISTINCT username FROM radacct WHERE acctstoptime IS NULL
  );
```

A subscriber is marked offline if:

  • They’re currently flagged online, and
  • Their last_seen is older than 30 minutes (or null), and
  • They have no open radacct row.

The last_seen check is critical: it prevents the sweeper from fighting QuotaSyncService. QuotaSync updates last_seen every 30 seconds for active users (both via the MikroTik API and via RADIUS interim-update packets). After a container restart, radacct may be temporarily empty while the MikroTik still has the sessions up. Without the last_seen check, the sweeper would mark all online users offline, then QuotaSync would mark them online again on the next tick, and the flag would bounce.

| Scenario | What the sweeper does |
|---|---|
| MikroTik reboot | All open radacct rows for that NAS are closed within 35 minutes (no interim updates can arrive from a dead router). Subscribers are marked offline. As they reconnect, fresh radacct rows are created. |
| Power outage | Same as above. |
| API container restart | last_seen is preserved (stored in DB), so subscribers stay online if their MikroTik is still up. The sweeper only closes radacct rows that genuinely have no interim updates. |
| Network blip | If interim updates resume within 30 minutes, nothing happens. If they don’t, the sweeper closes the rows. |
| Single subscriber’s CPE hard-powers off | Their session is closed after 30 minutes. |

The 5-minute check interval means the maximum delay between a session becoming stale and being closed is about 5 minutes — fast enough for the dashboard to feel accurate, slow enough to not pound the database every second.

The MikroTik reboot is the scenario the sweeper was originally built for. Here’s the timeline:

  1. t=0: MikroTik reboots. 200 PPPoE sessions are killed instantly; no Acct-Stop packets are sent.
  2. t+30s: Sessions start coming back online as customers reconnect. New radacct rows are created for them.
  3. t+5min: The dashboard shows ~400 online subscribers — the new sessions (200 real) plus the orphaned-from-boot rows (200 stale).
  4. t+30min: The 200 stale rows have not had an interim update for 30 minutes. They’re flagged stale.
  5. t+30–35min: The next sweeper cycle fires. The 200 stale rows are closed with acctterminatecause = 'Stale-Session-Cleanup'. The 200 subscribers who were double-counted are left with only their live session, and the dashboard returns to ~200 online.

Operators who can’t wait 35 minutes for the dashboard to converge can trigger a manual sweep via the Sessions → Force cleanup button (or docker restart proxpanel-api, which triggers a sweep 2 minutes after boot).

The 30-minute default is a balance. If you operate at scale or with aggressive accounting intervals, consider:

  • Lower threshold (15 min): faster cleanup. Risk: less margin for lost interim updates. If your interim-update interval is close to the threshold, a single dropped interim-update packet causes a false-close on a still-active session. The subscriber’s bytes for the next 30 s aren’t counted; their queue is recreated when they next reconnect.
  • Higher threshold (60 min): safer. Risk: dashboard inaccuracies linger longer; ghost IPs stay marked in-use for an hour after a reboot.

To change it, set the staleMinutes argument in cmd/api/main.go when constructing the service. There is no UI knob — this is an operations-only tuning.

The following queries show what the sweeper is dealing with:

```sql
-- How many open sessions right now?
SELECT COUNT(*) FROM radacct WHERE acctstoptime IS NULL;

-- How many of those look stale?
SELECT COUNT(*) FROM radacct
WHERE acctstoptime IS NULL
  AND acctstarttime < NOW() - INTERVAL '30 minutes'
  AND (acctupdatetime IS NULL OR acctupdatetime < NOW() - INTERVAL '30 minutes');

-- Who has had the sweeper fire most often in the last week?
SELECT username, COUNT(*) AS cleanups
FROM radacct
WHERE acctterminatecause = 'Stale-Session-Cleanup'
  AND acctstoptime > NOW() - INTERVAL '7 days'
GROUP BY username
ORDER BY cleanups DESC
LIMIT 20;
```

The third query is the most useful: it identifies subscribers whose CPE hard-powers off regularly (for example, the customer’s house is power-flapping, or the CPE has a faulty PSU).

To stay safe at production scale, the sweeper is deliberately conservative:

  • It does not close sessions younger than 30 minutes, even if they have no interim updates. Brand-new sessions might just not have ticked yet.
  • It does not send a CoA Disconnect-Request to the NAS. It only updates the database. If the session is genuinely still up on the router (e.g. the RADIUS container was disconnected from the network for a while), the next interim update will reopen the radacct row. There’s no risk of disconnecting a live user.
  • It does not garbage-collect old closed sessions. That’s the job of radacct partitioning / archival, a separate service.
  • It does not touch is_online for subscribers with recent last_seen. Active users are safe even if radacct is briefly out of sync.