Status Page Policies

Phase 0 deliverable. These policies must be agreed and this document must merge before any backend code for the status feature lands. Migrations, seeds, BullMQ defaults, and tests all derive their constants from the decisions recorded here.

Incident lifecycle

investigating → identified → monitoring → resolved

resolved is the only terminal state; incidents cannot be re-opened. Create a new incident instead.
Only forward transitions to the next state are allowed: no skipping states and no going backwards.
Drafts (is_published = false) are invisible to the public and to subscribers. Publishing is an explicit operator action.
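
The transition rules above can be checked with a small ordered list. This is a minimal sketch, not the project's implementation; the names INCIDENT_STATES and canTransition are hypothetical.

```typescript
// Incident states in their only allowed order.
const INCIDENT_STATES = ['investigating', 'identified', 'monitoring', 'resolved'] as const;
type IncidentState = (typeof INCIDENT_STATES)[number];

// Hypothetical helper: true only when `to` is the state immediately after `from`.
// `resolved` has no successor, so it is terminal by construction.
function canTransition(from: IncidentState, to: IncidentState): boolean {
  const fromIndex = INCIDENT_STATES.indexOf(from);
  const toIndex = INCIDENT_STATES.indexOf(to);
  return toIndex === fromIndex + 1;
}
```

For example, canTransition('investigating', 'identified') is allowed, while skipping to 'monitoring' or moving anywhere from 'resolved' is rejected.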

Update cadence SLA

State	Max time between updates
`investigating`	30 minutes
`identified`	60 minutes
`monitoring`	60 minutes
`resolved`	Final update required within 15 minutes of resolution
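
The cadence table can be expressed as constants plus an overdue check. This is a sketch under assumptions: the names MAX_MINUTES_BETWEEN_UPDATES and isUpdateOverdue are hypothetical, and how the result is surfaced to operators is not specified here.

```typescript
type ActiveState = 'investigating' | 'identified' | 'monitoring';

// Maximum minutes allowed between updates, per the table above.
const MAX_MINUTES_BETWEEN_UPDATES: Record<ActiveState, number> = {
  investigating: 30,
  identified: 60,
  monitoring: 60,
};

// A final update is required within this many minutes of resolution.
const FINAL_UPDATE_WINDOW_MINUTES = 15;

// Hypothetical helper: true when the time since the last update exceeds the SLA.
function isUpdateOverdue(state: ActiveState, lastUpdateAt: Date, now: Date): boolean {
  const elapsedMinutes = (now.getTime() - lastUpdateAt.getTime()) / (60 * 1000);
  return elapsedMinutes > MAX_MINUTES_BETWEEN_UPDATES[state];
}
```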

Component naming and public redaction

Goal: public incident messages must be readable by non-technical merchants and must never leak internal implementation details.

Component naming rules

Components are named for merchant impact, not for internal implementation:

✅ Public name	❌ Internal name
Payments	stripe-webhook-worker
Point of Sale	order-service
Inventory Sync	inventory-adjustment-queue
Email Delivery	sendgrid-transactional

Incident body redaction

Before an incident body is stored or rendered, the following patterns are stripped and replaced with [redacted]:

IPv4 / IPv6 addresses
Stack traces (lines matching `at <identifier> (` or similar)
Email addresses (other than the support address)
Internal hostnames ending in .internal, .local, .svc.cluster.local
Vendor account / project IDs (e.g. proj-xxxxxxxx, acct_xxxxxxxx)
Employee names in ops-authored notes

The redaction dictionary is maintained in apps/backend/src/status/application/services/markdown-sanitizer.service.ts. Adding a new pattern requires a PR with a matching test case.
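
A regex-based pass over some of these patterns might look like the sketch below. It covers only emails, internal hostnames, IPv4 addresses, and vendor IDs; IPv6, stack traces, and employee names are left out. The support address, the ID length of 8 or more characters, and the function name redactIncidentBody are assumptions, not the contents of the actual redaction dictionary.

```typescript
const SUPPORT_ADDRESS = 'support@example.com'; // assumption: placeholder support address
const REDACTED = '[redacted]';

const EMAIL = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g;
const INTERNAL_HOST = /\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:svc\.cluster\.local|internal|local)\b/gi;
const IPV4 = /\b(?:\d{1,3}\.){3}\d{1,3}\b/g;
const VENDOR_ID = /\b(?:proj-|acct_)[A-Za-z0-9]{8,}\b/g; // assumption: IDs are 8+ alphanumerics

// Hypothetical helper: replaces each matched pattern with [redacted].
// Emails run first so that addresses on internal domains are redacted whole.
function redactIncidentBody(body: string): string {
  return body
    .replace(EMAIL, (match) => (match.toLowerCase() === SUPPORT_ADDRESS ? match : REDACTED))
    .replace(INTERNAL_HOST, REDACTED)
    .replace(IPV4, REDACTED)
    .replace(VENDOR_ID, REDACTED);
}
```

Each pattern added this way would come with its own test case, in line with the PR requirement above.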

HTML sanitization (dompurify)

Incident bodies are sanitized on save AND render (defense in depth). Allowed tags:

p, br, strong, em, code, pre, ul, ol, li, a

Allowed attributes: href (on a elements only). rel="noopener noreferrer" is forced on all links. All other HTML is stripped.
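
The allow-list maps naturally onto DOMPurify's ALLOWED_TAGS and ALLOWED_ATTR options, with the rel and href rules applied in an afterSanitizeAttributes hook. The sketch below is a possible shape only: the names INCIDENT_SANITIZE_CONFIG, ElementLike, and enforceLinkPolicy are hypothetical, and the hook is written against a minimal structural type so it can be exercised without a DOM.

```typescript
// Options in the shape DOMPurify.sanitize() accepts as its second argument.
const INCIDENT_SANITIZE_CONFIG = {
  ALLOWED_TAGS: ['p', 'br', 'strong', 'em', 'code', 'pre', 'ul', 'ol', 'li', 'a'],
  ALLOWED_ATTR: ['href'],
};

// Minimal structural stand-in for a DOM element (assumption for this sketch).
interface ElementLike {
  tagName?: string;
  setAttribute(name: string, value: string): void;
  removeAttribute(name: string): void;
}

// Hypothetical hook body, intended to be registered with
// DOMPurify.addHook('afterSanitizeAttributes', enforceLinkPolicy).
function enforceLinkPolicy(node: ElementLike): void {
  if (typeof node.tagName !== 'string') return; // ignore nodes without a tag name
  if (node.tagName.toUpperCase() === 'A') {
    node.setAttribute('rel', 'noopener noreferrer'); // forced on every link
  } else {
    node.removeAttribute('href'); // href is allowed on <a> only
  }
}
```

Running the same configuration on save and on render keeps the two passes consistent.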

Authorization matrix

Surface	Who can access	Enforcement
`GET /api/v1/status/*`	Anyone, unauthenticated	`@IsPublic()`
`POST /api/v1/status/subscribe`	Anyone, unauthenticated	`@IsPublic()` + throttle
`GET /api/v1/status/unsubscribe/:token`	Anyone with valid token	`@IsPublic()`
`GET /api/v1/admin/status/audit`	`MANAGE_STATUS` role	`AuthGuard` + CASL
All other `/admin/status/*` write endpoints	`MANAGE_STATUS` role	`AuthGuard` + CASL

MANAGE_STATUS is initially mapped to OWNER and ADMIN roles. Deny by default — no new role automatically inherits this ability.
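
Deny by default amounts to an explicit allow-list. The sketch below shows the idea in plain TypeScript rather than CASL; the names MANAGE_STATUS_ROLES and canManageStatus are hypothetical.

```typescript
// Roles that are explicitly granted MANAGE_STATUS. Anything absent is denied.
const MANAGE_STATUS_ROLES: ReadonlySet<string> = new Set(['OWNER', 'ADMIN']);

// Hypothetical check: a role added later gets no access until it is listed here.
function canManageStatus(role: string): boolean {
  return MANAGE_STATUS_ROLES.has(role);
}
```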

Data retention

Data	Retention
`status_health_check` (raw probe results)	25 hours, pruned by daily job
`status_uptime_bucket` (hourly aggregates)	90 days
`status_audit_log`	Indefinite (append-only; never deleted by application)
`status_subscription` unconfirmed rows	48 hours after `confirm_token_expires_at`, cleaned by daily job
GCS snapshot `current.json`	Object overwritten on each transition; GCS lifecycle rule deletes objects older than 7 days
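
For the rows pruned by application jobs, the retention windows reduce to a cutoff timestamp. A minimal sketch, assuming hypothetical names RETENTION_MS and pruneCutoff; for unconfirmed subscriptions the cutoff is compared against confirm_token_expires_at rather than a creation time.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Retention windows from the table above, in milliseconds.
const RETENTION_MS = {
  status_health_check: 25 * HOUR_MS,
  status_uptime_bucket: 90 * 24 * HOUR_MS,
  status_subscription_unconfirmed: 48 * HOUR_MS, // measured from confirm_token_expires_at
};

// Hypothetical helper: rows with a timestamp older than the returned date are eligible for pruning.
function pruneCutoff(table: keyof typeof RETENTION_MS, now: Date): Date {
  return new Date(now.getTime() - RETENTION_MS[table]);
}
```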

Anti-flapping (hysteresis)

Without hysteresis, a single 5-second network blip would flip a service red, emit Socket.IO events, write audit rows, and email subscribers.

Event	Threshold
Component degrades	3 consecutive failed probes for the worst probe on that component
Component recovers	2 consecutive passing probes for the worst probe on that component

Counters (consecutive_failures, consecutive_passes) live on status_component_probe and are updated atomically with each probe write. A mixed sequence (fail, pass, fail) resets both counters — only consecutive runs count.
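
The thresholds can be sketched as a pure function over the two counters. This assumes that each probe result increments its own counter and zeroes the opposite one, and it uses hypothetical status names ('operational', 'degraded') and a hypothetical function name applyProbeResult; the atomic database write is out of scope here.

```typescript
type ComponentStatus = 'operational' | 'degraded'; // assumption: simplified status set

interface ProbeCounters {
  consecutiveFailures: number;
  consecutivePasses: number;
}

const DEGRADE_AFTER_FAILURES = 3;
const RECOVER_AFTER_PASSES = 2;

// Hypothetical helper: returns the updated counters and the resulting status.
function applyProbeResult(
  counters: ProbeCounters,
  status: ComponentStatus,
  passed: boolean,
): { counters: ProbeCounters; status: ComponentStatus } {
  // Assumption: a result increments its own run and resets the opposite run.
  const next: ProbeCounters = passed
    ? { consecutiveFailures: 0, consecutivePasses: counters.consecutivePasses + 1 }
    : { consecutiveFailures: counters.consecutiveFailures + 1, consecutivePasses: 0 };

  let nextStatus = status;
  if (status === 'operational' && next.consecutiveFailures >= DEGRADE_AFTER_FAILURES) {
    nextStatus = 'degraded';
  } else if (status === 'degraded' && next.consecutivePasses >= RECOVER_AFTER_PASSES) {
    nextStatus = 'operational';
  }
  return { counters: next, status: nextStatus };
}
```

Under these assumptions, the sequence fail, pass, fail, fail leaves the component operational, and only a third consecutive failure degrades it.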

Manual override policy

Operators may pin a component's public status (e.g., maintenance during a planned upgrade) using the override API. Rules:

override_expires_at is required — no indefinite overrides.
Maximum allowed override duration: 24 hours.
Default UI suggestion: 1 hour.
override_reason is required — minimum 10 characters.
When override_expires_at passes, the override is cleared by the hourly expiry job. The component's displayed status reverts to auto-derived (probe rollup).
Every override set and every expiry are written to status_audit_log.
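
The override rules translate into a short validation step. The sketch below is illustrative: validateOverride and OverrideRequest are hypothetical names, and both the "expiry must be in the future" check and the trimming of the reason before counting characters are assumptions not stated in the rules.

```typescript
const MAX_OVERRIDE_MS = 24 * 60 * 60 * 1000; // maximum override duration: 24 hours
const MIN_REASON_LENGTH = 10;

interface OverrideRequest {
  override_expires_at?: Date;
  override_reason?: string;
}

// Hypothetical validator: returns a list of rule violations (empty when valid).
function validateOverride(req: OverrideRequest, now: Date): string[] {
  const errors: string[] = [];
  if (!req.override_expires_at) {
    errors.push('override_expires_at is required');
  } else {
    const durationMs = req.override_expires_at.getTime() - now.getTime();
    if (durationMs <= 0) {
      errors.push('override_expires_at must be in the future'); // assumption
    } else if (durationMs > MAX_OVERRIDE_MS) {
      errors.push('override duration exceeds 24 hours');
    }
  }
  // Assumption: surrounding whitespace does not count toward the minimum length.
  if (!req.override_reason || req.override_reason.trim().length < MIN_REASON_LENGTH) {
    errors.push('override_reason must be at least 10 characters');
  }
  return errors;
}
```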

Rate limits (public endpoints)

Endpoint	Limit
All public `GET` status endpoints	60 requests / minute / IP
`POST /api/v1/status/subscribe`	10 requests / hour / IP

Limits are enforced by @nestjs/throttler. Verify that ThrottlerModule is registered in app.module.ts.
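
The limits can be kept as shared constants and converted to throttler options at registration time. The unit that @nestjs/throttler expects for ttl has differed between major versions (seconds in older releases, milliseconds in newer ones), so this sketch makes the conversion explicit; PUBLIC_RATE_LIMITS and toThrottlerOptions are hypothetical names, and the installed version's documentation should be checked before wiring them in.

```typescript
interface RateLimit {
  limit: number;
  windowSeconds: number;
}

// Limits from the table above.
const PUBLIC_RATE_LIMITS = {
  publicGet: { limit: 60, windowSeconds: 60 },    // 60 requests / minute / IP
  subscribe: { limit: 10, windowSeconds: 3600 },  // 10 requests / hour / IP
};

// Hypothetical helper: produces a { ttl, limit } pair in the unit the throttler expects.
function toThrottlerOptions(
  rule: RateLimit,
  ttlUnit: 'seconds' | 'milliseconds',
): { ttl: number; limit: number } {
  const ttl = ttlUnit === 'seconds' ? rule.windowSeconds : rule.windowSeconds * 1000;
  return { ttl, limit: rule.limit };
}
```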

Subscription security

Double opt-in: subscription is only active after email confirmation.
Confirmation token: 32-byte cryptographically random (crypto.randomBytes(32).toString('base64url')), single-use, 24h expiry.
Unsubscribe token: 32-byte random, never expires, regenerated on resubscribe.
Honeypot: the subscribe form includes a hidden _hp field. If it is filled, respond 200 silently and insert no row.
Email deduplication: email column is UNIQUE case-insensitive. A second subscribe for the same address re-sends the confirmation if unconfirmed, or is silently accepted if already confirmed.
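
The token and deduplication rules can be sketched with Node's built-in crypto module. The helper names below are hypothetical, and lowercasing the address is only an application-side approximation of the case-insensitive UNIQUE constraint, which the database still enforces.

```typescript
import { randomBytes } from 'node:crypto';

const CONFIRM_TOKEN_TTL_MS = 24 * 60 * 60 * 1000; // confirmation tokens expire after 24h

// 32 random bytes, base64url-encoded (43 characters, no padding).
function generateToken(): string {
  return randomBytes(32).toString('base64url');
}

// Hypothetical helper: expiry timestamp for a confirmation token issued at `issuedAt`.
function confirmTokenExpiry(issuedAt: Date): Date {
  return new Date(issuedAt.getTime() + CONFIRM_TOKEN_TTL_MS);
}

// Assumption: addresses are compared after trimming and lowercasing.
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// True when the honeypot field was filled; the caller then responds 200 and inserts no row.
function isHoneypotTriggered(form: { _hp?: string }): boolean {
  return typeof form._hp === 'string' && form._hp.length > 0;
}
```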

v2 deferrals

The following were explicitly deferred to v2 and must not be implemented in v1:

Feature	Reason
Spanish (`es-GT`) localization of templates + public page	Requires native-speaker review of incident copy; English first
Auto-detect drafts (probes creating draft incidents automatically)	Requires Slack/email-to-ops notification path not yet built
Admin authoring UI in PWA	In v1, ops uses the REST API directly, which is workable for a small team
Independent hosting (Cloudflare/Vercel) for landing page	GCS snapshot covers backend-outage case for v1

The schema includes status_incident.auto_detected and status_subscription.language from day one so v2 adds these without a migration.
