FlowPOS Public Status Page

Public URL: https://flowandgrow.tech/status

Architecture

The status page uses a hybrid probe + manual incident model:

Automated probes drive the colored component grid and 90-day uptime timeline.
Ops-authored incidents provide the human narrative (investigating → identified → monitoring → resolved).

Probes (BullMQ, per-interval)
  └─► hysteresis evaluator (3-fail / 2-pass)
       └─► status_component.current_status (override wins if not expired)
            ├─► GCS snapshot (on state transition only) ──► CDN ──► /status page
            ├─► Socket.IO /status namespace ──► PWA StatusBanner
            └─► status_audit_log

A GCS snapshot (flowpos-status-snapshots/current.json) is written only on state transitions, not on every probe run. The landing page reads this snapshot first and falls back to the live API. Because the snapshot is served from GCS through the CDN, the status page stays available even when the backend is down.
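
The "3-fail / 2-pass" hysteresis step in the diagram can be read as follows: a component flips to a failing status only after three consecutive failed probes, and recovers only after two consecutive passes. The sketch below illustrates that reading in TypeScript. The names `HysteresisState` and `applyProbeResult`, and the two-value status model ("operational" / "down"), are assumptions made for illustration; the real evaluator probably handles more statuses.

```typescript
// Hypothetical two-status model; the real component statuses may differ.
type Status = "operational" | "down";

interface HysteresisState {
  status: Status;
  consecutiveFails: number;
  consecutivePasses: number;
}

const FAIL_THRESHOLD = 3; // "3-fail" from the diagram
const PASS_THRESHOLD = 2; // "2-pass" from the diagram

// Folds one probe result into the state and reports whether the
// component status changed as a result.
function applyProbeResult(
  state: HysteresisState,
  passed: boolean
): { next: HysteresisState; transitioned: boolean } {
  // A pass resets the failure streak and vice versa.
  const consecutiveFails = passed ? 0 : state.consecutiveFails + 1;
  const consecutivePasses = passed ? state.consecutivePasses + 1 : 0;

  let status: Status = state.status;
  if (state.status === "operational" && consecutiveFails >= FAIL_THRESHOLD) {
    status = "down";
  } else if (state.status === "down" && consecutivePasses >= PASS_THRESHOLD) {
    status = "operational";
  }

  return {
    next: { status, consecutiveFails, consecutivePasses },
    transitioned: status !== state.status,
  };
}
```

In this sketch the `transitioned` flag is the natural place to hang the side effects that happen only on a state change, such as writing the snapshot.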

Adding a new component

1. Add a row to status_component (via migration or admin API).
2. Add one or more rows to status_component_probe with the appropriate probe_type.
3. If needed, add a new probe strategy (see probe-types.md).

Probe types and default intervals:

Probe type	Default interval	Use case
`http_get`	60s	HTTP health check on a URL
`db_query`	60s	PostgreSQL `SELECT 1`
`redis_ping`	60s	Redis `PING`
`vendor_rss`	300s	Ingest third-party status RSS
`passive`	n/a	Heartbeat from external client
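
The defaults in the table can be expressed as a small lookup. The following is a sketch only: the `ProbeRow` shape and its optional `intervalSeconds` override are hypothetical and not taken from the actual status_component_probe schema.

```typescript
type ProbeType = "http_get" | "db_query" | "redis_ping" | "vendor_rss" | "passive";

// Default intervals in seconds, copied from the table above.
// null means the probe is not scheduled (passive heartbeats arrive from outside).
const DEFAULT_INTERVAL_SECONDS: Record<ProbeType, number | null> = {
  http_get: 60,
  db_query: 60,
  redis_ping: 60,
  vendor_rss: 300,
  passive: null,
};

// Hypothetical row shape; a per-row interval override is an assumption,
// not a documented column.
interface ProbeRow {
  probeType: ProbeType;
  intervalSeconds?: number;
}

// Returns the scheduling interval in milliseconds, or null when the
// probe type is never scheduled.
function resolveIntervalMs(row: ProbeRow): number | null {
  const fallback = DEFAULT_INTERVAL_SECONDS[row.probeType];
  if (fallback === null) return null;
  return (row.intervalSeconds ?? fallback) * 1000;
}
```

A scheduler could use the returned millisecond value as the repeat interval for the probe job and skip rows that resolve to null.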

Authoring an incident (v1 — REST)

All admin routes require the MANAGE_STATUS CASL ability, which is mapped to the OWNER and ADMIN roles.

1. Create a draft

POST /api/v1/admin/status/incidents
Authorization: Bearer <token>
Content-Type: application/json

{
  "title": "Elevated error rate on Stripe Payments",
  "severity": "major",
  "componentIds": ["<stripe-component-id>"]
}
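
For scripted use, the draft request above might be assembled like this. It is a sketch under assumptions: `HttpRequest`, `IncidentDraft`, and `buildCreateIncidentRequest` are hypothetical names, and the base URL is whatever host serves the admin API.

```typescript
// Hypothetical request shape used for illustration only.
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Fields taken from the example payload above. severity stays a plain
// string because the full set of allowed values is not listed here.
interface IncidentDraft {
  title: string;
  severity: string;
  componentIds: string[];
}

// Builds the POST request for creating a draft incident without sending it.
function buildCreateIncidentRequest(
  baseUrl: string,
  token: string,
  draft: IncidentDraft
): HttpRequest {
  // Drop trailing slashes so the path is not doubled up.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/api/v1/admin/status/incidents`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(draft),
  };
}
```

The result can be handed to any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.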

2. Publish (fans out emails + Socket.IO push)

POST /api/v1/admin/status/incidents/<id>/publish
Authorization: Bearer <token>

3. Add updates as the incident progresses

POST /api/v1/admin/status/incidents/<id>/updates
Authorization: Bearer <token>
Content-Type: application/json

{
  "state": "identified",
  "bodyMd": "We have identified the root cause as a misconfigured webhook endpoint. A fix is being deployed."
}

4. Resolve

POST /api/v1/admin/status/incidents/<id>/resolve
Authorization: Bearer <token>

Valid state transitions: investigating → identified → monitoring → resolved.
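
A client-side guard for this ordering could look like the sketch below. It assumes transitions are forward-only and that intermediate states may be skipped (the walkthrough above resolves directly from "identified"); the server may enforce stricter rules. `canTransition` is a hypothetical helper.

```typescript
// States in lifecycle order, as listed above.
const INCIDENT_STATES = ["investigating", "identified", "monitoring", "resolved"] as const;
type IncidentState = (typeof INCIDENT_STATES)[number];

// Assumption: a transition is valid when it moves strictly forward in the
// lifecycle; skipping intermediate states is allowed, going back is not.
function canTransition(from: IncidentState, to: IncidentState): boolean {
  return INCIDENT_STATES.indexOf(to) > INCIDENT_STATES.indexOf(from);
}
```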

Manual override

Pin a component to a specific status (e.g., during planned maintenance):

PATCH /api/v1/admin/status/components/<id>/override
Authorization: Bearer <token>
Content-Type: application/json

{
  "status": "maintenance",
  "reason": "Planned database upgrade",
  "expiresAt": "2026-04-26T03:00:00Z"
}

expiresAt is required and must be no more than 24 hours in the future. An hourly expiry job clears expired overrides automatically.
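
The 24-hour rule and the "override wins if not expired" behavior from the architecture diagram can be sketched together. The helper names and the `Override` shape are hypothetical, and rejecting timestamps in the past is an assumption; the source only states the upper bound.

```typescript
const MAX_OVERRIDE_MS = 24 * 60 * 60 * 1000;

// Returns an error message, or null when expiresAt is acceptable.
function validateExpiresAt(expiresAt: string, now: Date): string | null {
  const expiry = Date.parse(expiresAt);
  if (Number.isNaN(expiry)) return "expiresAt must be a valid timestamp";
  const delta = expiry - now.getTime();
  // Assumption: an expiry that has already passed is rejected.
  if (delta <= 0) return "expiresAt must be in the future";
  if (delta > MAX_OVERRIDE_MS) return "expiresAt must be at most 24h in the future";
  return null;
}

// Hypothetical shape of a stored override.
interface Override {
  status: string;
  expiresAt: string;
}

// The override wins while it has not expired; otherwise the
// probe-derived status applies.
function effectiveStatus(probeStatus: string, override: Override | null, now: Date): string {
  if (override !== null && Date.parse(override.expiresAt) > now.getTime()) {
    return override.status;
  }
  return probeStatus;
}
```

Checking expiry at read time, as `effectiveStatus` does here, would mean an expired override stops applying immediately even if the hourly job has not yet removed it.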

Clear early:

DELETE /api/v1/admin/status/components/<id>/override
Authorization: Bearer <token>

Feeds

Format	URL
RSS 2.0	`/api/v1/status/feed.rss`
JSON	`/api/v1/status/summary.json`

Audit log

Every status transition, override set/clear, and incident lifecycle event writes an append-only row to status_audit_log. UPDATE and DELETE privileges on the table are revoked (REVOKE UPDATE, DELETE), so the application cannot mutate history.

Query:

GET /api/v1/admin/status/audit?entityType=component&entityId=<id>
Authorization: Bearer <token>

Data retention

Data	Retention
Raw `status_health_check` rows	25 hours (daily prune job)
`status_uptime_bucket` hourly aggregates	90 days
`status_audit_log`	Permanent (append-only)
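
The retention windows translate into cutoff timestamps that a prune job could use. This is a sketch; `retentionCutoffs` is a hypothetical helper, and how the actual jobs select rows is not described here.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const RAW_CHECK_RETENTION_MS = 25 * HOUR_MS;   // raw status_health_check rows
const BUCKET_RETENTION_MS = 90 * 24 * HOUR_MS; // status_uptime_bucket aggregates

// Rows older than the returned timestamps fall outside the retention
// windows in the table above. status_audit_log has no cutoff: it is permanent.
function retentionCutoffs(now: Date): { rawChecksBefore: Date; bucketsBefore: Date } {
  return {
    rawChecksBefore: new Date(now.getTime() - RAW_CHECK_RETENTION_MS),
    bucketsBefore: new Date(now.getTime() - BUCKET_RETENTION_MS),
  };
}
```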

v2 deferrals

Admin authoring UI — v1 ops uses REST/curl.
Spanish localization — schema includes language column; templates add .es.html in v2.
Auto-detect drafts — probes never auto-create incidents in v1; auto_detected column reserved.
Independent hosting — landing page stays on Cloud Run for v1; GCS snapshot covers the outage case.
