Status Page Policies
Phase 0 deliverable. These policies must be agreed and this document must merge before any backend code for the status feature lands. Migrations, seeds, BullMQ defaults, and tests all derive their constants from the decisions recorded here.
Incident lifecycle
investigating → identified → monitoring → resolved
- `resolved` is the only terminal state; incidents cannot be re-opened. Create a new incident instead.
- Forward transitions only: no skipping states, no going backwards.
- Drafts (`is_published = false`) are invisible to the public and to subscribers. Publishing is an explicit operator action.
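
The transition rules above can be sketched as a lookup table. This is a minimal illustration, not the project's implementation; `IncidentState`, `NEXT_STATE`, and `canTransition` are hypothetical names.

```typescript
// Hypothetical names for illustration; only the four state names come from the policy.
type IncidentState = 'investigating' | 'identified' | 'monitoring' | 'resolved';

// Each state maps to the single state allowed to follow it.
// `resolved` maps to null because it is terminal.
const NEXT_STATE: Record<IncidentState, IncidentState | null> = {
  investigating: 'identified',
  identified: 'monitoring',
  monitoring: 'resolved',
  resolved: null,
};

// True only for the one forward step; skips, reversals, and re-opens all fail.
function canTransition(from: IncidentState, to: IncidentState): boolean {
  return NEXT_STATE[from] === to;
}
```

With this table, `canTransition('investigating', 'monitoring')` is rejected as a skip and `canTransition('resolved', 'investigating')` is rejected as a re-open.
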
Update cadence SLA
| State | Max time between updates |
|---|---|
| investigating | 30 minutes |
| identified | 60 minutes |
| monitoring | 60 minutes |
| resolved | Final update required within 15 min of resolution |
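
A check for these deadlines might look like the sketch below. The constants mirror the table; the function names and the choice to treat "exactly at the limit" as still on time are assumptions.

```typescript
// Hypothetical helper names; the minute values are taken from the SLA table.
type ActiveState = 'investigating' | 'identified' | 'monitoring';

const MAX_MINUTES_BETWEEN_UPDATES: Record<ActiveState, number> = {
  investigating: 30,
  identified: 60,
  monitoring: 60,
};
const FINAL_UPDATE_WINDOW_MINUTES = 15;

function minutesBetween(earlier: Date, later: Date): number {
  return (later.getTime() - earlier.getTime()) / 60000;
}

// True when an open incident has gone longer than its state allows without an update.
// Assumption: an update posted exactly at the limit is still on time.
function isUpdateOverdue(state: ActiveState, lastUpdateAt: Date, now: Date): boolean {
  return minutesBetween(lastUpdateAt, now) > MAX_MINUTES_BETWEEN_UPDATES[state];
}

// True when the final update arrived more than 15 minutes after resolution.
function isFinalUpdateLate(resolvedAt: Date, finalUpdateAt: Date): boolean {
  return minutesBetween(resolvedAt, finalUpdateAt) > FINAL_UPDATE_WINDOW_MINUTES;
}
```
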
Component naming and public redaction
Goal: public incident messages must be readable by non-technical merchants and must never leak internal implementation details.
Component naming rules
Components are named for merchant impact, not for internal implementation:
| ✅ Public name | ❌ Internal name |
|---|---|
| Payments | stripe-webhook-worker |
| Point of Sale | order-service |
| Inventory Sync | inventory-adjustment-queue |
| Email Delivery | sendgrid-transactional |
Incident body redaction
Before an incident body is stored or rendered, the following patterns are stripped and replaced with [redacted]:
- IPv4 / IPv6 addresses
- Stack traces (lines matching `at <identifier> (` or similar)
- Email addresses (other than the support address)
- Internal hostnames ending in `.internal`, `.local`, `.svc.cluster.local`
- Vendor account / project IDs (e.g. `proj-xxxxxxxx`, `acct_xxxxxxxx`)
- Employee names in ops-authored notes
The redaction dictionary is maintained in apps/backend/src/status/application/services/markdown-sanitizer.service.ts. Adding a new pattern requires a PR with a matching test case.
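
As a rough sketch of how part of such a dictionary could work, the example below covers four of the listed patterns: email addresses, internal hostnames, IPv4 addresses, and vendor IDs. It is not the contents of markdown-sanitizer.service.ts. The regular expressions are simplified assumptions, `support@example.com` is a placeholder for the real support address, and IPv6, stack traces, and employee names are left out because they need more careful matching.

```typescript
// Placeholder value; the real support address is not specified in this document.
const SUPPORT_EMAIL = 'support@example.com';
const REDACTED = '[redacted]';

// Simplified patterns, assumed for illustration only.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const INTERNAL_HOST = /\b[a-z0-9][a-z0-9.-]*\.(?:svc\.cluster\.local|internal|local)\b/gi;
const IPV4 = /\b(?:\d{1,3}\.){3}\d{1,3}\b/g;
const VENDOR_ID = /\b(?:proj-|acct_)[A-Za-z0-9]{8,}\b/g;

function redact(body: string): string {
  // Emails go first so that an address on an internal domain is replaced as a whole.
  let out = body.replace(EMAIL, (match) =>
    match.toLowerCase() === SUPPORT_EMAIL ? match : REDACTED,
  );
  for (const pattern of [INTERNAL_HOST, IPV4, VENDOR_ID]) {
    out = out.replace(pattern, REDACTED);
  }
  return out;
}
```

Each pattern added this way would be paired with a test case, in line with the PR requirement above.
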
HTML sanitization (dompurify)
Incident bodies are sanitized on save AND render (defense in depth). Allowed tags:
p, br, strong, em, code, pre, ul, ol, li, a
Allowed attributes: href (on a only), rel="noopener noreferrer" forced on all links. All other HTML is stripped.
Authorization matrix
| Surface | Who can access | Enforcement |
|---|---|---|
| GET /api/v1/status/* | Anyone, unauthenticated | @IsPublic() |
| POST /api/v1/status/subscribe | Anyone, unauthenticated | @IsPublic() + throttle |
| GET /api/v1/status/unsubscribe/:token | Anyone with valid token | @IsPublic() |
| GET /api/v1/admin/status/audit | MANAGE_STATUS role | AuthGuard + CASL |
| All other /admin/status/* write endpoints | MANAGE_STATUS role | AuthGuard + CASL |
MANAGE_STATUS is initially mapped to OWNER and ADMIN roles. Deny by default — no new role automatically inherits this ability.
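
The deny-by-default rule can be illustrated with an explicit allow-list. This sketch deliberately does not use CASL; it only shows the intended behavior, and `MANAGE_STATUS_ROLES`, `canManageStatus`, and the `CASHIER` role in the usage note are hypothetical.

```typescript
// Only roles listed here hold MANAGE_STATUS. Anything absent is denied,
// so a newly introduced role gets no access until it is added explicitly.
const MANAGE_STATUS_ROLES: ReadonlySet<string> = new Set(['OWNER', 'ADMIN']);

function canManageStatus(userRoles: readonly string[]): boolean {
  return userRoles.some((role) => MANAGE_STATUS_ROLES.has(role));
}
```

Under this mapping, `canManageStatus(['CASHIER'])` is false, and so is `canManageStatus([])`.
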
Data retention
| Data | Retention |
|---|---|
| status_health_check (raw probe results) | 25 hours, pruned by daily job |
| status_uptime_bucket (hourly aggregates) | 90 days |
| status_audit_log | Indefinite (append-only; never deleted by application) |
| status_subscription unconfirmed rows | 48 hours after confirm_token_expires_at, cleaned by daily job |
| GCS snapshot current.json | Object overwritten on each transition; GCS lifecycle rule deletes objects older than 7 days |
Anti-flapping (hysteresis)
Without hysteresis, a single 5-second network blip would flip a service red, emit Socket.IO events, write audit rows, and email subscribers.
| Event | Threshold |
|---|---|
| Component degrades | 3 consecutive failed probes for the worst probe on that component |
| Component recovers | 2 consecutive passing probes for the worst probe on that component |
Counters (consecutive_failures, consecutive_passes) live on status_component_probe and are updated atomically with each probe write. A mixed sequence (fail, pass, fail) resets both counters — only consecutive runs count.
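
The thresholds can be sketched as a pure function over one probe's counters. Two assumptions are made here: each result zeroes the opposite counter, and the sketch tracks a single probe, whereas the policy applies the rule to the worst probe on a component. All names are illustrative.

```typescript
type ComponentHealth = 'operational' | 'degraded';

interface ProbeCounters {
  consecutiveFailures: number;
  consecutivePasses: number;
}

const DEGRADE_AFTER_FAILURES = 3;
const RECOVER_AFTER_PASSES = 2;

// Returns the next counters and health after one probe result.
// Assumption: a pass zeroes the failure counter and a failure zeroes the pass counter.
function applyProbeResult(
  health: ComponentHealth,
  counters: ProbeCounters,
  passed: boolean,
): { health: ComponentHealth; counters: ProbeCounters } {
  const next: ProbeCounters = passed
    ? { consecutiveFailures: 0, consecutivePasses: counters.consecutivePasses + 1 }
    : { consecutiveFailures: counters.consecutiveFailures + 1, consecutivePasses: 0 };

  if (health === 'operational' && next.consecutiveFailures >= DEGRADE_AFTER_FAILURES) {
    return { health: 'degraded', counters: next };
  }
  if (health === 'degraded' && next.consecutivePasses >= RECOVER_AFTER_PASSES) {
    return { health: 'operational', counters: next };
  }
  return { health, counters: next };
}
```

Starting from operational, the sequence fail, pass, fail leaves the component operational with a failure count of 1, so a short blip does not trigger events, audit rows, or subscriber email.
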
Manual override policy
Operators may pin a component's public status (e.g., maintenance during a planned upgrade) using the override API. Rules:
- `override_expires_at` is required: no indefinite overrides.
- Maximum allowed override duration: 24 hours.
- Default UI suggestion: 1 hour.
- `override_reason` is required, minimum 10 characters.
- When `override_expires_at` passes, the override is cleared by the hourly expiry job. The component's displayed status reverts to auto-derived (probe rollup).
- Every override set and every expiry are written to `status_audit_log`.
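
Request validation for these rules might be sketched as follows. The field and function names are hypothetical, and two details are assumptions rather than stated policy: the expiry must lie in the future, and the reason is trimmed before its length is checked.

```typescript
// Hypothetical request shape for the override API.
interface OverrideRequest {
  overrideExpiresAt: Date | null;
  overrideReason: string;
}

const MAX_OVERRIDE_MS = 24 * 60 * 60 * 1000; // 24 hours
const MIN_REASON_LENGTH = 10;

// Returns a list of human-readable violations; an empty list means the request is acceptable.
function validateOverride(req: OverrideRequest, now: Date): string[] {
  const errors: string[] = [];
  if (req.overrideExpiresAt === null) {
    errors.push('override_expires_at is required');
  } else {
    const durationMs = req.overrideExpiresAt.getTime() - now.getTime();
    // Assumption: an expiry in the past is rejected rather than applied and immediately cleared.
    if (durationMs <= 0) errors.push('override_expires_at must be in the future');
    if (durationMs > MAX_OVERRIDE_MS) errors.push('override may not exceed 24 hours');
  }
  // Assumption: surrounding whitespace does not count toward the minimum length.
  if (req.overrideReason.trim().length < MIN_REASON_LENGTH) {
    errors.push('override_reason must be at least 10 characters');
  }
  return errors;
}
```
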
Rate limits (public endpoints)
| Endpoint | Limit |
|---|---|
| All public GET status endpoints | 60 requests / minute / IP |
| POST /api/v1/status/subscribe | 10 requests / hour / IP |
Limits enforced by @nestjs/throttler. Verify ThrottlerModule is registered in app.module.ts.
Subscription security
- Double opt-in: subscription is only active after email confirmation.
- Confirmation token: 32-byte cryptographically random (`crypto.randomBytes(32).toString('base64url')`), single-use, 24h expiry.
- Unsubscribe token: 32-byte random, never expires, regenerated on resubscribe.
- Honeypot: `_hp` field in subscribe form. If filled, respond 200 silently. No row inserted.
- Email deduplication: `email` column is `UNIQUE` case-insensitive. A second subscribe for the same address re-sends the confirmation if unconfirmed, or is silently accepted if already confirmed.
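
The token, honeypot, and deduplication rules can be sketched with Node's built-in `crypto` module. The helper names are hypothetical, and lower-casing the address in application code is only one possible way to get case-insensitive uniqueness; the policy itself places that constraint on the column.

```typescript
import { randomBytes } from 'node:crypto';

const CONFIRM_TOKEN_TTL_MS = 24 * 60 * 60 * 1000; // 24h expiry from the policy

// 32 random bytes, URL-safe and unpadded, as specified for confirmation tokens.
// Assumption: the unsubscribe token uses the same encoding.
function newToken(): string {
  return randomBytes(32).toString('base64url');
}

function confirmTokenExpiresAt(issuedAt: Date): Date {
  return new Date(issuedAt.getTime() + CONFIRM_TOKEN_TTL_MS);
}

// A filled honeypot means the caller should answer 200 and insert nothing.
function isHoneypotTripped(form: { _hp?: string }): boolean {
  return (form._hp ?? '').length > 0;
}

// Assumption: normalizing before lookup complements the case-insensitive UNIQUE constraint.
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}
```

The `base64url` encoding needs a reasonably recent Node.js release; 32 bytes encode to 43 characters without padding.
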
v2 deferrals
The following were explicitly deferred to v2 and must not be implemented in v1:
| Feature | Reason |
|---|---|
| Spanish (es-GT) localization of templates + public page | Requires native-speaker review of incident copy; English first |
| Auto-detect drafts (probes creating draft incidents automatically) | Requires Slack/email-to-ops notification path not yet built |
| Admin authoring UI in PWA | v1 ops uses REST; usable for small team |
| Independent hosting (Cloudflare/Vercel) for landing page | GCS snapshot covers backend-outage case for v1 |
The schema includes status_incident.auto_detected and status_subscription.language from day one so v2 adds these without a migration.