Schema-Per-Tenant Isolation

Every tenant on a ProxRad SaaS cluster lives in its own Postgres schema. From the API’s perspective every tenant has its own set of tables — tenant_acme.subscribers, tenant_bigisp.subscribers, tenant_demo.subscribers. They share the same physical Postgres instance and the same connection pool, but the rows never mix. This page documents how that isolation is implemented, where the security boundary actually sits, and the design decisions behind it.

The isolation model

Layer	Mechanism
Physical	One Postgres instance, one Redis instance, one API container. Cheap to operate.
Logical	One Postgres schema per tenant, named `tenant_<slug>` (dashes become underscores).
Per-request	`SET search_path TO tenant_<slug>, public` issued at the start of every HTTP request.
Per-connection	A GORM connection pool keyed by schema name — connections retain their `search_path` for the lifetime of the connection.
Auth	JWTs carry `tenant_id`; the auth middleware rejects requests whose JWT tenant doesn’t match the hostname’s tenant.
Application	Every handler calls `middleware.GetTenantDBFromCtx(c)` instead of `database.DB` directly.

The combined effect: a query against subscribers running for tenant Acme physically reads from tenant_acme.subscribers because Postgres’s name resolution uses the connection’s search_path. Tenant Acme’s session token cannot be replayed against tenant BigISP because the JWT tenant claim won’t match the resolved tenant from the hostname.

How a request lands in the right schema

TLS terminates at nginx. nginx sets the X-Forwarded-Host header and proxies to the API container.
TenantMiddleware extracts the tenant. In backend/internal/middleware/tenant.go, the hostname is parsed:
- If it matches *.saas.proxrad.com, the subdomain is the slug.
- Otherwise, the full hostname is looked up against tenants.custom_domain.
- The result (tenant ID + schema name) is cached for 60 seconds in tenantCache to avoid hitting Postgres on every request.
The tenant-scoped DB handle is bound. database.GetTenantDB(schemaName) returns a *gorm.DB whose session has search_path pre-set to the tenant’s schema. It is stored in c.Locals("tenant_db").
AuthMiddleware verifies the JWT. The JWT’s tenant_id claim is compared against c.Locals("tenant_id"). Mismatch → 403.
The handler runs. Every database call is middleware.GetTenantDBFromCtx(c).Where(...) etc. The query is generated by GORM as SELECT ... FROM subscribers ... — unqualified — and Postgres resolves it against the connection’s search_path.
The connection returns to the pool. Its search_path is preserved. Next request that uses this connection issues SET search_path again (idempotent — usually a no-op for warm connections to the same tenant).

The whole tenant-resolution path adds about 0.3 ms to a warm request and a cache-miss path of around 2 ms — negligible at SaaS-host scale.

Why schema-per-tenant, not database-per-tenant

The two common approaches to multi-tenancy in Postgres are database-per-tenant (each tenant gets its own CREATE DATABASE) and schema-per-tenant (each tenant gets a schema inside one shared database). ProxRad uses the latter. The reasoning:

Concern	Database-per-tenant	Schema-per-tenant
Connection pool	One pool per database — 50 tenants × 25 conns = 1,250 connections	One pool shared, ~250 connections for the whole cluster
Cross-tenant queries (e.g. super-admin reports)	Multiple connections, manual UNIONs	A single query against `pg_namespace` enumerates schemas
Backup / restore granularity	`pg_dump -d tenantX` — clean	`pg_dump --schema=tenant_x` — equally clean
Schema migrations	Run migrations 50× — slow and error-prone	Loop over schemas inside a transaction, atomic
WAL / replication	Each database has its own WAL — many small streams	One WAL stream, simpler replication topology
Postgres overhead per tenant	About 8 MB of shared buffers + a connection’s worth of memory	Negligible — schemas are cheap

At the scale of 50–500 small tenants per host, schema-per-tenant is markedly cheaper to run. Database-per-tenant becomes appealing only when individual tenants are large enough that you’d give them dedicated hardware anyway — in which case you’d just deploy a standalone ProxRad install for them.

Security boundaries

The isolation is enforced at four points. Every one of them has to be intact for a cross-tenant leak to be impossible.

1. Hostname → tenant resolution

The tenant is resolved from the request hostname, not from any user-controlled input inside the request. An attacker cannot pass ?tenant=acme to read Acme’s data. The hostname comes from TLS SNI, which the attacker controls — but if they SNI for acme.saas.proxrad.com they hit Acme’s tenant context, and their own JWT (issued for tenant evilcorp) is rejected at step 4 above.

2. JWT tenant claim

Every JWT issued by a tenant’s /auth/login carries the tenant ID. AuthMiddleware compares the JWT’s tenant_id to the hostname-derived tenant ID. A token issued by tenant A cannot authenticate against tenant B even if the attacker controls A’s admin password.

3. search_path is set per request

The search_path is set explicitly at the start of every connection use, before any query runs. There is no code path where a query runs without an explicit search_path set first. If the resolution fails (no tenant found), the request is rejected with a 404 before any database call.

4. No raw `database.DB` writes in tenant handlers

The handler-level rule: never call database.DB.Create, database.DB.Where, etc. directly from a tenant-scoped handler. Always go through middleware.GetTenantDBFromCtx(c). The exceptions are:

Super-admin handlers under /api/saas/*, which deliberately operate on the public schema (the tenant registry).
Background services that don’t have a request context (backup scheduler, daily quota reset) — they loop over tenants explicitly and create per-tenant DB handles inside the loop.

What is NOT in `tenant_<slug>`

Some data must live outside the per-tenant schemas:

Table	Lives in	Why
`tenants`	`public`	The registry. The tenant resolver itself queries this; it can’t be tenant-scoped.
`saas_billing`	`public`	Billing across tenants is a super-admin concern.
`saas_audit_logs`	`public`	Super-admin actions (suspend tenant, delete tenant) are logged centrally so a deleted tenant doesn’t take its own audit trail with it.
Redis keys	Redis	Namespaced by `tenant_<id>:...` prefix. The Postgres `search_path` mechanism doesn’t apply.

GORM middleware fix — the long story

The original SaaS implementation tried to set search_path via a GORM Session() callback on every query. It failed in two ways:

Preload chains skipped the callback — GORM issued the preload query on a fresh session that didn’t inherit the search_path override, so a db.Preload("Service").Find(&subs) would fetch subscribers from the right schema but services from public. The services were always empty for tenants and the bug went undetected for two weeks.
Connection reuse confused the cache — a pooled connection that had been used by tenant Acme would, if returned to the pool with search_path still set, sometimes get picked up by a tenant BigISP request that didn’t re-issue SET search_path. The “rare” data leak was real but only happened when two near-simultaneous requests landed on the same connection.

The fix was to abandon per-query callbacks and instead maintain a separate *gorm.DB per tenant, keyed by schema name. database.GetTenantDB(schemaName) returns the cached handle for that tenant; its connection pool is configured so that every new connection starts with SET search_path TO tenant_<slug>, public baked in via a BeforeQuery callback that always runs.

The frontend has no SaaS-specific code at all. The API is the same for tenant admins and standalone admins; React only sees the data the API returns.

Verifying isolation manually

If you suspect a leak, the verification path is:

Identify two tenants with disjoint subscriber sets — e.g. tenant Acme has subscriber alice@acme, tenant BigISP has subscriber bob@bigisp.
Log in to Acme and confirm only alice@acme is visible. Capture the JWT.
Replay the JWT against BigISP’s hostname: curl -H "Authorization: Bearer <acme-jwt>" https://bigisp.saas.proxrad.com/api/subscribers. Expected: HTTP 403, tenant mismatch. Anything else is a bug.
From Postgres directly: SET search_path TO tenant_acme, public; SELECT username FROM subscribers; should return only Acme’s subscribers; SET search_path TO tenant_bigisp, public; SELECT username FROM subscribers; should return only BigISP’s.
Inspect the API container under load: docker exec proxpanel-api curl -s localhost:8080/health should always succeed regardless of the active tenant — the health endpoint is tenant-agnostic and runs on public.

If step 3 returns BigISP’s data instead of a 403, stop the API container immediately and contact info@proxrad.com with the JWT, the URL, and a docker logs snapshot. That is the single failure mode that justifies an emergency response.

Costs and limits

Schemas per database: Postgres handles thousands of schemas without trouble. ProxRad SaaS caps at 500 per host as a soft limit; beyond that, split the cluster.
Tables per schema: about 60 tables per tenant — at 500 tenants that’s 30,000 tables in one database. Postgres handles this fine; pg_stat_all_tables queries become slow but no production query path touches them.
Backup time: pg_dump --schema=tenant_<slug> of a 1,000-subscriber tenant takes ~2 seconds. Full-cluster pg_dumpall of 50 tenants takes ~2 minutes.
Migration time: running a DDL change (ALTER TABLE subscribers ADD COLUMN ...) across 50 schemas takes ~30 seconds in a transaction. Plan migrations as cluster-wide maintenance windows.

Migrations across the cluster

When a database migration ships (new column, new table, new index), it must be applied to every tenant schema, not just to public. The migration framework iterates:

for _, tenant := range listActiveTenants() {
    tx := db.Begin()
    tx.Exec(fmt.Sprintf("SET search_path TO %s", tenant.SchemaName))
    runMigration(tx)
    tx.Commit()
}

Each tenant gets its own transaction. If a migration fails for tenant Acme, BigISP’s migration still commits — the failure is isolated and surfaced to the operator. The retry path lets the operator fix Acme’s problem (usually a data-conflict) and re-run for just that one tenant.

Schema migrations on a 50-tenant cluster take 20–60 seconds total. The cluster stays available during the migration; queries against a schema that hasn’t been migrated yet still use the old DDL (Postgres handles concurrent DDL + DML correctly via MVCC). Brief query blocking is possible during the migration of a heavily-used table, which is why migrations are scheduled outside business hours.

Cross-tenant queries — the super-admin path

Super-admin reports need to aggregate across tenants — total subscriber count across the cluster, total revenue, etc. The pattern is explicit enumeration:

totalSubs := 0
for _, tenant := range listActiveTenants() {
    tenantDB := database.GetTenantDB(tenant.SchemaName)
    var count int64
    tenantDB.Model(&Subscriber{}).Count(&count)
    totalSubs += int(count)
}

This is N queries for N tenants. For 50 tenants and a simple COUNT(*), the whole loop takes about 200 ms. For larger queries (revenue aggregation, complex joins), the loop dominates request time — these endpoints are cached aggressively on the super-admin side (5-minute Redis cache).

We deliberately do not use Postgres pg_namespace enumeration to write cross-schema queries dynamically. The explicit loop is more verbose, but every tenant access is logged through the same path and the loop’s existence is grep-able for audit purposes.

Permissions

Schema isolation is enforced by the platform, not by per-user permissions. Inside a tenant the normal permission system applies (see Users & Permissions). Super-admins have a separate authentication system documented in Super Admin Console.

SaaS Overview — when and why to choose SaaS mode.
Tenant Onboarding — what CREATE SCHEMA tenant_<slug> actually creates.
Wildcard Subdomain Routing — how hostnames map to tenants.
Super Admin Console — managing the public.tenants registry.