Status Page Policies

Phase 0 deliverable. These policies must be agreed and this document must merge before any backend code for the status feature lands. Migrations, seeds, BullMQ defaults, and tests all derive their constants from the decisions recorded here.


Incident lifecycle

investigating → identified → monitoring → resolved
  • resolved is the only terminal state; incidents cannot be re-opened. Create a new incident instead.
  • Only forward transitions to the next state are allowed: no skipping states and no going backwards.
  • Drafts (is_published = false) are invisible to the public and to subscribers. Publishing is an explicit operator action.
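The transition rules above can be sketched as a small guard. This is a minimal illustration, not the service's actual code: the names INCIDENT_STATES and canTransition are hypothetical, and the sketch treats a same-state update as "not a transition".

```typescript
// Incident states in their only permitted order (taken from the lifecycle above).
const INCIDENT_STATES = ["investigating", "identified", "monitoring", "resolved"] as const;
type IncidentState = (typeof INCIDENT_STATES)[number];

// Hypothetical helper: true only when `to` is the state immediately after `from`.
// Because "resolved" is last, nothing can follow it, so it is terminal by construction.
function canTransition(from: IncidentState, to: IncidentState): boolean {
  const fromIndex = INCIDENT_STATES.indexOf(from);
  const toIndex = INCIDENT_STATES.indexOf(to);
  return toIndex === fromIndex + 1;
}
```

For example, canTransition("investigating", "identified") is true, while skipping to "monitoring" or leaving "resolved" is rejected.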

Update cadence SLA

| State | Max time between updates |
| --- | --- |
| investigating | 30 minutes |
| identified | 60 minutes |
| monitoring | 60 minutes |
| resolved | Final update required within 15 minutes of resolution |
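Since migrations, seeds, and tests derive their constants from this document, the cadence values might be captured roughly as follows. The constant and function names are assumptions made for illustration only.

```typescript
const MINUTE_MS = 60_000;

// Maximum gap between public updates for each active state (values from the SLA above).
const MAX_UPDATE_GAP_MS = {
  investigating: 30 * MINUTE_MS,
  identified: 60 * MINUTE_MS,
  monitoring: 60 * MINUTE_MS,
} as const;

// Deadline for posting the final update once an incident is resolved.
const FINAL_UPDATE_DEADLINE_MS = 15 * MINUTE_MS;

type ActiveState = keyof typeof MAX_UPDATE_GAP_MS;

// Hypothetical check: has the incident gone longer than its SLA without an update?
function isUpdateOverdue(state: ActiveState, lastUpdateAt: Date, now: Date): boolean {
  return now.getTime() - lastUpdateAt.getTime() > MAX_UPDATE_GAP_MS[state];
}
```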

Component naming and public redaction

Goal: public incident messages must be readable by non-technical merchants and must never leak internal implementation details.

Component naming rules

Components are named for merchant impact, not for internal implementation:

| ✅ Public name | ❌ Internal name |
| --- | --- |
| Payments | stripe-webhook-worker |
| Point of Sale | order-service |
| Inventory Sync | inventory-adjustment-queue |
| Email Delivery | sendgrid-transactional |

Incident body redaction

Before an incident body is stored or rendered, the following patterns are stripped and replaced with [redacted]:

  • IPv4 / IPv6 addresses
  • Stack traces (lines matching "at <identifier> (" or similar)
  • Email addresses (other than the support address)
  • Internal hostnames ending in .internal, .local, .svc.cluster.local
  • Vendor account / project IDs (e.g. proj-xxxxxxxx, acct_xxxxxxxx)
  • Employee names in ops-authored notes

The redaction dictionary is maintained in apps/backend/src/status/application/services/markdown-sanitizer.service.ts. Adding a new pattern requires a PR with a matching test case.
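A regex-based pass over some of the patterns above might look like the sketch below. It is illustrative only and is not the contents of markdown-sanitizer.service.ts. The regular expressions, the support@example.com address, and the 8-character minimum for vendor IDs are assumptions. IPv6 addresses and employee names are left out because they need a fuller matcher and a name dictionary respectively.

```typescript
const REDACTED = "[redacted]";
// Hypothetical support address that is exempt from email redaction.
const SUPPORT_EMAIL = "support@example.com";

// Assumed patterns; the real dictionary may differ.
const STACK_TRACE_LINE = /^[ \t]*at .+\(.*$/gm;
const EMAIL = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g;
const INTERNAL_HOST = /\b[\w.-]+\.(?:internal|local)\b/g; // also covers .svc.cluster.local
const IPV4 = /\b(?:\d{1,3}\.){3}\d{1,3}\b/g;
const VENDOR_ID = /\b(?:proj-|acct_)[A-Za-z0-9]{8,}\b/g;

function redact(body: string): string {
  return body
    .replace(STACK_TRACE_LINE, REDACTED)
    // Keep the support address, redact every other email.
    .replace(EMAIL, (match) => (match.toLowerCase() === SUPPORT_EMAIL ? match : REDACTED))
    .replace(INTERNAL_HOST, REDACTED)
    .replace(IPV4, REDACTED)
    .replace(VENDOR_ID, REDACTED);
}
```

Emails are handled before hostnames so that an address on an internal domain is replaced once as a whole instead of leaving the local part behind.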

HTML sanitization (dompurify)

Incident bodies are sanitized on save AND render (defense in depth). Allowed tags:

p, br, strong, em, code, pre, ul, ol, li, a

Allowed attributes: href (on a only), rel="noopener noreferrer" forced on all links. All other HTML is stripped.
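The allow-list above maps naturally onto DOMPurify's ALLOWED_TAGS and ALLOWED_ATTR options. The sketch below only defines the configuration and a link-policy function written against a minimal element shape, so it can be read without a DOM. SANITIZE_CONFIG, ElementLike, and enforceLinkPolicy are hypothetical names.

```typescript
// Values from the policy above; the option keys are DOMPurify configuration names.
const SANITIZE_CONFIG = {
  ALLOWED_TAGS: ["p", "br", "strong", "em", "code", "pre", "ul", "ol", "li", "a"],
  ALLOWED_ATTR: ["href"],
};

// Minimal structural type so the policy can be exercised without a real DOM node.
interface ElementLike {
  tagName: string;
  setAttribute(name: string, value: string): void;
  removeAttribute(name: string): void;
}

// Hypothetical hook body: force rel on links and drop href from any other element.
function enforceLinkPolicy(node: ElementLike): void {
  if (node.tagName.toUpperCase() === "A") {
    node.setAttribute("rel", "noopener noreferrer");
  } else {
    node.removeAttribute("href");
  }
}
```

With DOMPurify, one plausible wiring is to register the function via DOMPurify.addHook("afterSanitizeAttributes", ...) and pass SANITIZE_CONFIG to DOMPurify.sanitize on both the save and render paths.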


Authorization matrix

| Surface | Who can access | Enforcement |
| --- | --- | --- |
| GET /api/v1/status/* | Anyone, unauthenticated | @IsPublic() |
| POST /api/v1/status/subscribe | Anyone, unauthenticated | @IsPublic() + throttle |
| GET /api/v1/status/unsubscribe/:token | Anyone with valid token | @IsPublic() |
| GET /api/v1/admin/status/audit | MANAGE_STATUS role | AuthGuard + CASL |
| All other /admin/status/* write endpoints | MANAGE_STATUS role | AuthGuard + CASL |

MANAGE_STATUS is initially mapped to OWNER and ADMIN roles. Deny by default — no new role automatically inherits this ability.
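The deny-by-default mapping can be illustrated with an explicit allow-list. This is a simplified stand-in for the CASL ability definition, which is not shown in this document. The names below are hypothetical.

```typescript
// Roles that are explicitly granted MANAGE_STATUS; every other role is denied.
const MANAGE_STATUS_ROLES: ReadonlySet<string> = new Set(["OWNER", "ADMIN"]);

// Hypothetical check: unknown or newly added roles fall through to "denied".
function canManageStatus(role: string): boolean {
  return MANAGE_STATUS_ROLES.has(role);
}
```

Because the check is an allow-list, a role introduced later gains the ability only if someone adds it to the set.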


Data retention

| Data | Retention |
| --- | --- |
| status_health_check (raw probe results) | 25 hours, pruned by daily job |
| status_uptime_bucket (hourly aggregates) | 90 days |
| status_audit_log | Indefinite (append-only; never deleted by application) |
| status_subscription unconfirmed rows | 48 hours after confirm_token_expires_at, cleaned by daily job |
| GCS snapshot current.json | Object overwritten on each transition; GCS lifecycle rule deletes objects older than 7 days |
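As a rough sketch of how a pruning job might derive its cutoffs from the retention periods above (the helper names are assumptions, and the GCS lifecycle rule is configured on the bucket rather than in application code):

```typescript
const HOUR_MS = 3_600_000;
const DAY_MS = 24 * HOUR_MS;

// Retention periods for the tables pruned by the application.
const RETENTION_MS = {
  status_health_check: 25 * HOUR_MS,
  status_uptime_bucket: 90 * DAY_MS,
} as const;

// Grace period after confirm_token_expires_at before an unconfirmed row is removed.
const UNCONFIRMED_GRACE_MS = 48 * HOUR_MS;

// Rows older than the returned timestamp are eligible for deletion.
function pruneCutoff(table: keyof typeof RETENTION_MS, now: Date): Date {
  return new Date(now.getTime() - RETENTION_MS[table]);
}

// Hypothetical check for the daily cleanup of unconfirmed subscriptions.
function isUnconfirmedPurgeable(confirmTokenExpiresAt: Date, now: Date): boolean {
  return now.getTime() - confirmTokenExpiresAt.getTime() >= UNCONFIRMED_GRACE_MS;
}
```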

Anti-flapping (hysteresis)

Without hysteresis, a single 5-second network blip would flip a service red, emit Socket.IO events, write audit rows, and email subscribers.

| Event | Threshold |
| --- | --- |
| Component degrades | 3 consecutive failed probes for the worst probe on that component |
| Component recovers | 2 consecutive passing probes for the worst probe on that component |

Counters (consecutive_failures, consecutive_passes) live on status_component_probe and are updated atomically with each probe write. A mixed sequence (fail, pass, fail) resets both counters — only consecutive runs count.
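The counter logic for a single probe can be sketched as a pure function. This is one reading of the rules above, under the assumption that each result zeroes the opposite counter. The rollup across probes ("worst probe on that component") and the atomic database write are not shown, and the type and function names are hypothetical.

```typescript
const DEGRADE_AFTER_FAILURES = 3;
const RECOVER_AFTER_PASSES = 2;

interface ProbeCounters {
  consecutiveFailures: number;
  consecutivePasses: number;
  degraded: boolean;
}

// Returns the next counter state after one probe result.
function applyProbeResult(prev: ProbeCounters, passed: boolean): ProbeCounters {
  // A pass ends any failure streak and a failure ends any pass streak.
  const consecutiveFailures = passed ? 0 : prev.consecutiveFailures + 1;
  const consecutivePasses = passed ? prev.consecutivePasses + 1 : 0;

  let degraded = prev.degraded;
  if (!degraded && consecutiveFailures >= DEGRADE_AFTER_FAILURES) {
    degraded = true;
  } else if (degraded && consecutivePasses >= RECOVER_AFTER_PASSES) {
    degraded = false;
  }
  return { consecutiveFailures, consecutivePasses, degraded };
}
```

With these thresholds, the sequence fail, fail, pass, fail leaves the component healthy, because the pass interrupts the failure streak before it reaches three.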


Manual override policy

Operators may pin a component's public status (e.g., maintenance during a planned upgrade) using the override API. Rules:

  • override_expires_at is required — no indefinite overrides.
  • Maximum allowed override duration: 24 hours.
  • Default UI suggestion: 1 hour.
  • override_reason is required — minimum 10 characters.
  • When override_expires_at passes, the override is cleared by the hourly expiry job. The component's displayed status reverts to auto-derived (probe rollup).
  • Every override set and every expiry are written to status_audit_log.
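The override rules above could be validated along these lines. The request shape and function name are hypothetical. Requiring the expiry to be in the future and trimming the reason before measuring it are assumptions that the policy does not state.

```typescript
const MAX_OVERRIDE_MS = 24 * 60 * 60 * 1000; // 24 hours
const MIN_REASON_LENGTH = 10;

interface OverrideRequest {
  overrideExpiresAt?: Date;
  overrideReason?: string;
}

// Returns a list of rule violations; an empty list means the override is acceptable.
function validateOverride(req: OverrideRequest, now: Date): string[] {
  const errors: string[] = [];
  if (!req.overrideExpiresAt) {
    errors.push("override_expires_at is required");
  } else {
    const durationMs = req.overrideExpiresAt.getTime() - now.getTime();
    // Assumption: an expiry at or before "now" is rejected.
    if (durationMs <= 0) errors.push("override_expires_at must be in the future");
    if (durationMs > MAX_OVERRIDE_MS) errors.push("override may not exceed 24 hours");
  }
  // Assumption: surrounding whitespace does not count toward the minimum length.
  if (!req.overrideReason || req.overrideReason.trim().length < MIN_REASON_LENGTH) {
    errors.push("override_reason must be at least 10 characters");
  }
  return errors;
}
```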

Rate limits (public endpoints)

| Endpoint | Limit |
| --- | --- |
| All public GET status endpoints | 60 requests / minute / IP |
| POST /api/v1/status/subscribe | 10 requests / hour / IP |

Limits are enforced by @nestjs/throttler. Verify that ThrottlerModule is registered in app.module.ts.
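To make the per-IP limits concrete, here is a small in-memory fixed-window counter. It is not how @nestjs/throttler is configured or implemented. It only illustrates what "N requests per window per IP" means, and all names are hypothetical.

```typescript
interface RateLimit {
  limit: number;
  windowMs: number;
}

// Limits from the policy above.
const PUBLIC_GET_LIMIT: RateLimit = { limit: 60, windowMs: 60_000 };
const SUBSCRIBE_LIMIT: RateLimit = { limit: 10, windowMs: 3_600_000 };

class FixedWindowLimiter {
  private readonly rule: RateLimit;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(rule: RateLimit) {
    this.rule = rule;
  }

  // Returns true if the request from `ip` at time `nowMs` is within the limit.
  allow(ip: string, nowMs: number): boolean {
    const current = this.windows.get(ip);
    if (!current || nowMs - current.start >= this.rule.windowMs) {
      // First request, or the previous window has elapsed: start a new window.
      this.windows.set(ip, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.rule.limit) return false;
    current.count += 1;
    return true;
  }
}
```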


Subscription security

  • Double opt-in: subscription is only active after email confirmation.
  • Confirmation token: 32-byte cryptographically random (crypto.randomBytes(32).toString('base64url')), single-use, 24h expiry.
  • Unsubscribe token: 32-byte random, never expires, regenerated on resubscribe.
  • Honeypot: _hp field in subscribe form. If filled, respond 200 silently. No row inserted.
  • Email deduplication: email column is UNIQUE case-insensitive. A second subscribe for the same address re-sends the confirmation if unconfirmed, or is silently accepted if already confirmed.
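The token format and the subscribe decision rules above can be sketched as follows. Only crypto.randomBytes(32).toString('base64url') comes from the policy. The outcome names, the ExistingSubscription shape, and normalizing the email by trimming and lower-casing are assumptions made for illustration.

```typescript
import { randomBytes } from "node:crypto";

const CONFIRM_TOKEN_TTL_MS = 24 * 60 * 60 * 1000; // 24h expiry for confirmation tokens

// 32 random bytes encoded as URL-safe base64 (43 characters, no padding).
function newToken(): string {
  return randomBytes(32).toString("base64url");
}

// Assumption: case-insensitive uniqueness is approximated by lower-casing before lookup.
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

type SubscribeOutcome =
  | "ignored_honeypot"      // _hp was filled: no row inserted
  | "created_pending"       // new address: insert row, send confirmation
  | "resent_confirmation"   // known but unconfirmed: re-send confirmation
  | "already_confirmed";    // known and confirmed: silently accepted

interface ExistingSubscription {
  confirmed: boolean;
}

function decideSubscribe(
  honeypot: string | undefined,
  existing: ExistingSubscription | undefined,
): SubscribeOutcome {
  if (honeypot) return "ignored_honeypot";
  if (!existing) return "created_pending";
  return existing.confirmed ? "already_confirmed" : "resent_confirmation";
}
```

Checking the honeypot first keeps bot submissions from triggering any lookup or email, which matches the "no row inserted" rule.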

v2 deferrals

The following were explicitly deferred to v2 and must not be implemented in v1:

| Feature | Reason |
| --- | --- |
| Spanish (es-GT) localization of templates + public page | Requires native-speaker review of incident copy; English first |
| Auto-detect drafts (probes creating draft incidents automatically) | Requires Slack/email-to-ops notification path not yet built |
| Admin authoring UI in PWA | v1 ops uses REST; usable for small team |
| Independent hosting (Cloudflare/Vercel) for landing page | GCS snapshot covers backend-outage case for v1 |

The schema includes status_incident.auto_detected and status_subscription.language from day one so v2 adds these without a migration.