FlowPOS Public Status Page
Public URL: https://flowandgrow.tech/status
Architecture
The status page uses a hybrid probe + manual incident model:
- Automated probes drive the colored component grid and 90-day uptime timeline.
- Ops-authored incidents provide the human narrative (investigating → identified → monitoring → resolved).
    Probes (BullMQ, per-interval)
      └─► hysteresis evaluator (3-fail / 2-pass)
            └─► status_component.current_status (override wins if not expired)
                  ├─► GCS snapshot (on state transition only) ──► CDN ──► /status page
                  ├─► Socket.IO /status namespace ──► PWA StatusBanner
                  └─► status_audit_log
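
The hysteresis and override steps in the diagram can be sketched as two small pure functions. This is a minimal illustration, not the FlowPOS implementation: only the 3-fail / 2-pass thresholds and the "override wins if not expired" rule come from the diagram, while the type names, the `up`/`down` states, and the function names are hypothetical.

```typescript
// Hypothetical types; the real status values are not listed in this document.
type ProbeState = "up" | "down";

interface HysteresisState {
  state: ProbeState; // last confirmed state
  streak: number;    // consecutive probe results that disagree with `state`
}

const FAILS_TO_GO_DOWN = 3; // from the diagram: 3-fail
const PASSES_TO_GO_UP = 2;  // from the diagram: 2-pass

// Flip the confirmed state only after enough consecutive contrary results.
function evaluate(prev: HysteresisState, probePassed: boolean): HysteresisState {
  const agrees = (prev.state === "up") === probePassed;
  if (agrees) {
    // Result matches the confirmed state: any contrary streak is reset.
    return { state: prev.state, streak: 0 };
  }
  const streak = prev.streak + 1;
  const threshold = prev.state === "up" ? FAILS_TO_GO_DOWN : PASSES_TO_GO_UP;
  if (streak >= threshold) {
    return { state: prev.state === "up" ? "down" : "up", streak: 0 };
  }
  return { state: prev.state, streak };
}

// Hypothetical override shape: a pinned status with an expiry.
interface Override {
  status: string;
  expiresAt: Date;
}

// An unexpired override takes precedence over the probed status.
function effectiveStatus(probed: string, override: Override | null, now: Date): string {
  if (override !== null && override.expiresAt.getTime() > now.getTime()) {
    return override.status;
  }
  return probed;
}
```

Because a single contrary result never flips the state, one flaky probe run does not trigger a snapshot write or a Socket.IO push.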
A GCS snapshot (flowpos-status-snapshots/current.json) is written only on state transitions, not on every probe run. The landing page reads this snapshot first and falls back to the live API, so the status page stays visible even when the backend is down.
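
The snapshot-first read with a live fallback might look like the sketch below. The `Fetcher` type, the `loadStatus` name, and both URLs in the usage note are assumptions for illustration; the fetch function is injected so the sketch does not depend on any particular HTTP client.

```typescript
// Hypothetical: any function that fetches a URL and returns parsed JSON.
type Fetcher = (url: string) => Promise<unknown>;

interface LoadedStatus {
  source: "snapshot" | "live";
  data: unknown;
}

// Try the CDN-served snapshot first; fall back to the live API only if that fails.
async function loadStatus(
  fetchJson: Fetcher,
  snapshotUrl: string,
  liveApiUrl: string,
): Promise<LoadedStatus> {
  try {
    return { source: "snapshot", data: await fetchJson(snapshotUrl) };
  } catch {
    // Snapshot missing or unreadable: ask the backend directly.
    return { source: "live", data: await fetchJson(liveApiUrl) };
  }
}
```

A caller would pass something like `loadStatus(fetchJson, "https://cdn.example.invalid/current.json", "/api/v1/status/summary.json")`, where the CDN host is a placeholder and using the summary feed as the live endpoint is an assumption.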
Adding a new component
- Add a row to status_component (via migration or admin API).
- Add one or more rows to status_component_probe with the appropriate probe_type.
- If needed, add a new probe strategy (see probe-types.md).
Probe types and default intervals:
| Probe type | Default interval | Use case |
|---|---|---|
| http_get | 60s | HTTP health check on a URL |
| db_query | 60s | PostgreSQL SELECT 1 |
| redis_ping | 60s | Redis PING |
| vendor_rss | 300s | Ingest third-party status RSS |
| passive | n/a | Heartbeat from external client |
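
The defaults in the table could be encoded as a typed lookup, as in this sketch. The probe type names and intervals come from the table; the constant name, the `resolveIntervalMs` helper, and the idea of a per-probe override value are assumptions.

```typescript
type ProbeType = "http_get" | "db_query" | "redis_ping" | "vendor_rss" | "passive";

// Default intervals in seconds, copied from the table above.
// `null` marks the passive probe, which has no schedule (n/a).
const DEFAULT_INTERVAL_SECONDS: Record<ProbeType, number | null> = {
  http_get: 60,
  db_query: 60,
  redis_ping: 60,
  vendor_rss: 300,
  passive: null,
};

// Hypothetical helper: a per-probe override (if any) wins over the default.
// Returns milliseconds, or null when the probe is not scheduled.
function resolveIntervalMs(probeType: ProbeType, overrideSeconds?: number): number | null {
  const seconds = overrideSeconds ?? DEFAULT_INTERVAL_SECONDS[probeType];
  return seconds === null ? null : seconds * 1000;
}
```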
Authoring an incident (v1 — REST)
All admin routes require the MANAGE_STATUS CASL ability (mapped to OWNER + ADMIN roles).
1. Create a draft
POST /api/v1/admin/status/incidents
Authorization: Bearer <token>
Content-Type: application/json
{
"title": "Elevated error rate on Stripe Payments",
"severity": "major",
"componentIds": ["<stripe-component-id>"]
}
2. Publish (fans out emails + Socket.IO push)
POST /api/v1/admin/status/incidents/<id>/publish
Authorization: Bearer <token>
3. Add updates as the incident progresses
POST /api/v1/admin/status/incidents/<id>/updates
Authorization: Bearer <token>
Content-Type: application/json
{
"state": "identified",
"bodyMd": "We have identified the root cause as a misconfigured webhook endpoint. A fix is being deployed."
}
4. Resolve
POST /api/v1/admin/status/incidents/<id>/resolve
Authorization: Bearer <token>
Valid state transitions: investigating → identified → monitoring → resolved.
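
A client-side check of the transition chain could be sketched as follows. It assumes the chain is strictly sequential (each update moves exactly one step forward), which the line above suggests but does not confirm; the server may accept other moves. The names `INCIDENT_STATES` and `canTransition` are hypothetical.

```typescript
// States in the order given above.
const INCIDENT_STATES = ["investigating", "identified", "monitoring", "resolved"] as const;
type IncidentState = (typeof INCIDENT_STATES)[number];

// Assumption: only a move to the immediately following state is valid.
function canTransition(from: IncidentState, to: IncidentState): boolean {
  return INCIDENT_STATES.indexOf(to) === INCIDENT_STATES.indexOf(from) + 1;
}
```

Running this check before posting an update with a new `state` value would catch an out-of-order update locally instead of waiting for the API to reject it.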
Manual override
Pin a component to a specific status (e.g., during planned maintenance):
PATCH /api/v1/admin/status/components/<id>/override
Authorization: Bearer <token>
Content-Type: application/json
{
"status": "maintenance",
"reason": "Planned database upgrade",
"expiresAt": "2026-04-26T03:00:00Z"
}
expiresAt is required and must be at most 24 hours in the future. An hourly expiry job clears expired overrides automatically.
Clear early:
DELETE /api/v1/admin/status/components/<id>/override
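
The expiresAt rule can be checked before sending the PATCH. This is a sketch under stated assumptions: the field names match the request body shown above, but `validateOverride`, its error strings, and the rejection of past timestamps are illustrative and may not match the server's behavior exactly.

```typescript
const MAX_OVERRIDE_MS = 24 * 60 * 60 * 1000; // 24 hours

interface OverrideRequest {
  status: string;
  reason: string;
  expiresAt: string; // ISO 8601 timestamp
}

// Returns null when the request looks valid, otherwise a human-readable error.
function validateOverride(req: OverrideRequest, now: Date): string | null {
  const expires = Date.parse(req.expiresAt);
  if (Number.isNaN(expires)) {
    return "expiresAt must be an ISO 8601 timestamp";
  }
  const delta = expires - now.getTime();
  if (delta <= 0) {
    return "expiresAt must be in the future";
  }
  if (delta > MAX_OVERRIDE_MS) {
    return "expiresAt must be at most 24 hours in the future";
  }
  return null;
}
```

For longer maintenance windows, this limit means the override has to be set again before it expires.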
Feeds
| Format | URL |
|---|---|
| RSS 2.0 | /api/v1/status/feed.rss |
| JSON | /api/v1/status/summary.json |
Audit log
Every status transition, override set or clear, and incident lifecycle event appends a row to status_audit_log. UPDATE and DELETE are revoked on the table (REVOKE UPDATE, DELETE), so the application cannot mutate history.
Query:
GET /api/v1/admin/status/audit?entityType=component&entityId=<id>
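
When building this query from code, the parameter values should be URL-encoded. A small sketch, where the helper name `auditQueryPath` is hypothetical and only the path and the two parameter names come from the request above:

```typescript
// Build the audit query path with both parameters safely encoded.
function auditQueryPath(entityType: string, entityId: string): string {
  return (
    "/api/v1/admin/status/audit" +
    `?entityType=${encodeURIComponent(entityType)}` +
    `&entityId=${encodeURIComponent(entityId)}`
  );
}
```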
Data retention
| Data | Retention |
|---|---|
| Raw status_health_check rows | 25 hours (daily prune job) |
| status_uptime_bucket hourly aggregates | 90 days |
| status_audit_log | Permanent (append-only) |
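
The retention windows translate into prune cutoffs as in the sketch below. The table names and durations come from the table above; `RETENTION_MS` and `pruneCutoff` are hypothetical names, and the actual prune job may compute its cutoff differently.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Retention windows from the table above; status_audit_log is never pruned.
const RETENTION_MS = {
  status_health_check: 25 * HOUR_MS,
  status_uptime_bucket: 90 * 24 * HOUR_MS,
} as const;

// Rows older than the returned timestamp fall outside the retention window.
function pruneCutoff(table: keyof typeof RETENTION_MS, now: Date): Date {
  return new Date(now.getTime() - RETENTION_MS[table]);
}
```

For example, a prune run at 2026-04-26T03:00:00Z would treat raw health-check rows older than 2026-04-25T02:00:00Z as expired.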
v2 deferrals
- Admin authoring UI: v1 ops uses REST/curl.
- Spanish localization: schema includes language column; templates add .es.html in v2.
- Auto-detect drafts: probes never auto-create incidents in v1; auto_detected column reserved.
- Independent hosting: landing page stays on Cloud Run for v1; GCS snapshot covers the outage case.