Database Connection Pool Management

Last Updated: 2025-01-10
Status: Active
Owner: Engineering Team

Overview

This document describes the connection pool management strategy for FlowPOS services connecting to Cloud SQL PostgreSQL instances. Proper connection pool configuration is critical to prevent connection exhaustion errors that can cause service failures.

Architecture

Services and Their Connection Requirements

Service                 Max Instances   Connections per Instance    Total Max Connections
Metabase                2               10-15 (HikariCP default)    ~20-30
Backend (Staging)       10              3 (DB_POOL_MAX)             30
Backend (Production)    10              3 (DB_POOL_MAX)             30
Reserved/Admin          -               -                           ~10
Total Estimated         -               -                           ~90-100

Cloud SQL Instance Configuration

  • Staging: metabase-db-staging (hosts both metabase and flowpos_staging databases)
  • Production: metabase-db-production (hosts both metabase and flowpos_production databases)

Important: Both Metabase metadata and application databases share the same Cloud SQL instance, so total connections are pooled across both databases.

Connection Pool Configuration

Metabase

Metabase uses HikariCP for connection pooling. The primary control mechanism is instance scaling, not per-instance pool size.

Configuration:

  • Max Instances: 2 (configured in deploy-metabase.yml)
  • Min Instances: 1
  • Pool Size: Managed by HikariCP (default ~10-15 connections per instance)
  • Control: Via --max-instances flag in Cloud Run deployment

Notes:

  • Metabase's HikariCP pool size is not easily configurable via environment variables when using MB_DB_* connection variables
  • The primary control mechanism is limiting the number of Metabase instances
  • Each instance typically maintains 10-15 connections in its pool

To adjust Metabase connections:

  1. Reduce --max-instances in deploy-metabase.yml (currently 2)
  2. Consider using MB_DB_CONNECTION_URI with pool parameters if your Metabase version supports it

Backend Services

Backend services use pg (node-postgres) connection pooling with configurable limits.

Configuration:

  • Max Instances: 10 (configured in deployment workflows)
  • Connections per Instance: 3 (via DB_POOL_MAX environment variable)
  • Default: 3 (if DB_POOL_MAX secret not set)
  • Control: Via DB_POOL_MAX GitHub secret or environment variable

Environment Variables:

DB_POOL_MAX=3                    # Max connections per instance (default: 3)
DB_POOL_MIN=0                    # Min idle connections (default: 0)
DB_POOL_IDLE_TIMEOUT=300000      # Idle timeout in ms (default: 5 minutes)
DB_POOL_CONNECTION_TIMEOUT=10000 # Connection timeout in ms (default: 10 seconds)
DB_POOL_MONITORING=true          # Enable pool monitoring (default: true)
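As a sketch, assuming the backend reads these variables at startup, they might map onto node-postgres Pool options like this. The option names (max, min, idleTimeoutMillis, connectionTimeoutMillis) are real pg Pool options; the helper name and wiring are illustrative, not the actual backend code:

```typescript
// Illustrative helper: translate DB_POOL_* environment variables into
// node-postgres Pool options, falling back to the documented defaults.
type PoolOpts = {
  max: number;
  min: number;
  idleTimeoutMillis: number;
  connectionTimeoutMillis: number;
};

function poolConfigFromEnv(env: Record<string, string | undefined>): PoolOpts {
  const num = (v: string | undefined, dflt: number) =>
    v !== undefined && v !== "" ? Number(v) : dflt;
  return {
    max: num(env.DB_POOL_MAX, 3),                                     // default: 3
    min: num(env.DB_POOL_MIN, 0),                                     // default: 0
    idleTimeoutMillis: num(env.DB_POOL_IDLE_TIMEOUT, 300000),         // 5 minutes
    connectionTimeoutMillis: num(env.DB_POOL_CONNECTION_TIMEOUT, 10000), // 10 seconds
  };
}

// Usage (assumes the pg package is installed):
//   new Pool({ connectionString: process.env.DATABASE_URL, ...poolConfigFromEnv(process.env) })
```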

To adjust backend connections:

  1. Set DB_POOL_MAX secret in GitHub Actions (staging/production environments)
  2. Reduce --max-instances in deployment workflows if needed
  3. Adjust timeout values via environment variables

Example: To reduce backend connections:

# In GitHub Secrets → staging environment
DB_POOL_MAX=2 # Reduces from 3 to 2 per instance (10 instances × 2 = 20 total)

Connection Exhaustion Prevention

Warning Signs

  1. Log Errors:

    FATAL: remaining connection slots are reserved for non-replication superuser connections
  2. Symptoms:

    • Database connection failures
    • Metabase queries timing out
    • Backend API errors when accessing database
    • Statement cancellation errors
  3. Monitoring:

    -- Check active connections
    SELECT count(*) as active_connections, datname
    FROM pg_stat_activity
    GROUP BY datname;

    -- Check connections by user
    SELECT count(*) as connections, usename
    FROM pg_stat_activity
    GROUP BY usename;

Immediate Fix

Use the emergency fix script:

cd scripts
./fix-metabase-connection-pool.sh

Or manually reduce instances:

# Reduce Metabase instances
gcloud run services update flowpos-metabase-staging \
--region=us-central1 \
--project=barto-dev \
--max-instances=2

# Reduce backend instances (if needed)
gcloud run services update flowpos-backend \
--region=us-central1 \
--project=barto-dev \
--max-instances=5

Long-term Solutions

  1. Increase Cloud SQL Max Connections:

    gcloud sql instances patch metabase-db-staging \
    --database-flags=max_connections=200 \
    --project=barto-dev

    Note: Cloud SQL max connections depends on instance machine type. Check current limit first:

    gcloud sql instances describe metabase-db-staging \
    --project=barto-dev \
--format="yaml(settings.databaseFlags)"
  2. Optimize Connection Pool Settings:

    • Review actual connection usage patterns
    • Adjust DB_POOL_MAX based on real usage (not peak theoretical)
    • Consider reducing max instances if connection usage is consistently high
  3. Implement Connection Monitoring:

    • Set up alerts for connection usage > 80%
    • Monitor connection pool metrics
    • Track connection creation/destruction patterns

Monitoring

Cloud SQL Connection Usage

The deployment workflows include connection monitoring steps that:

  • Check Cloud SQL max_connections setting
  • Calculate estimated connection allocation
  • Warn if usage exceeds 60% or 80% thresholds
  • Provide recommendations for optimization
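The budget arithmetic behind those steps can be sketched as follows. The numbers come from this document's table; the function names are hypothetical, not the actual workflow code:

```typescript
// Hypothetical sketch of the workflow's connection-budget check.
interface ServiceBudget {
  name: string;
  maxInstances: number;
  connsPerInstance: number;
}

// Worst-case connections across services, plus a reserve for admin/maintenance.
function estimateUsage(services: ServiceBudget[], reserved = 10): number {
  return services.reduce((sum, s) => sum + s.maxInstances * s.connsPerInstance, reserved);
}

// Compare estimated usage to max_connections against the 60% / 80% thresholds.
function usageLevel(estimated: number, maxConnections: number): "ok" | "warn" | "critical" {
  const ratio = estimated / maxConnections;
  if (ratio > 0.8) return "critical";
  if (ratio > 0.6) return "warn";
  return "ok";
}
```

With the staging figures above (Metabase 2 × 15, backend 10 × 3, ~10 reserved ≈ 70 connections), a 25-connection db-f1-micro is far past the critical threshold, while max_connections=200 would sit comfortably in the "ok" range.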

Manual Monitoring

Check current connection usage:

-- Total active connections
SELECT count(*) as total_connections FROM pg_stat_activity;

-- Connections by database
SELECT
datname,
count(*) as connections,
count(*) FILTER (WHERE state = 'active') as active,
count(*) FILTER (WHERE state = 'idle') as idle
FROM pg_stat_activity
GROUP BY datname
ORDER BY connections DESC;

-- Connections by application
SELECT
application_name,
count(*) as connections
FROM pg_stat_activity
GROUP BY application_name
ORDER BY connections DESC;

Check Cloud SQL instance settings:

gcloud sql instances describe metabase-db-staging \
--project=barto-dev \
--format="yaml(settings.databaseFlags)"

Alerts

Set up monitoring alerts for:

  • Connection usage > 80% of max_connections
  • Connection exhaustion errors in logs
  • Connection pool errors from applications

Best Practices

  1. Conservative Pool Sizing:

    • Start with lower connection limits and increase only if needed
    • Account for multiple instances (instances × connections_per_instance)
    • Reserve connections for admin/maintenance (typically 10-15%)
  2. Connection Timeouts:

    • Set reasonable connection timeouts
    • Use idle timeouts to release unused connections
    • Implement retry logic with exponential backoff
  3. Instance Scaling:

    • Balance between instance scaling and connection limits
    • Consider reducing instances before increasing connection pools
    • Monitor both instance count and connection usage
  4. Regular Review:

    • Review connection usage patterns monthly
    • Adjust limits based on actual usage, not theoretical maximums
    • Document changes and rationale
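For the retry-with-exponential-backoff recommendation above, a minimal sketch (the helper name, retry count, and base delay are illustrative, not the backend's actual implementation):

```typescript
// Retry an async operation (e.g. acquiring a pool connection) with
// exponential backoff: delays of base, 2×base, 4×base, ... between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```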

Troubleshooting

Connection Exhaustion Error

Error:

FATAL: remaining connection slots are reserved for non-replication superuser connections

Steps:

  1. Check current connection usage (see Monitoring section)
  2. Identify which service is using most connections
  3. Temporarily reduce instances or connection pool size
  4. Increase Cloud SQL max_connections if appropriate
  5. Review connection pool configuration

Metabase Connection Issues

Symptoms:

  • Metabase queries fail
  • "Failed to connect to database" errors

Steps:

  1. Verify Metabase instances are running:

    gcloud run services describe flowpos-metabase-staging \
    --region=us-central1 --project=barto-dev
  2. Check Metabase logs for connection errors:

    gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=flowpos-metabase-staging" \
    --limit=50 --project=barto-dev
  3. Verify Cloud SQL instance is accessible

  4. Check if connection pool is exhausted (see Monitoring)

Backend Connection Issues

Symptoms:

  • Database query timeouts
  • Connection pool errors in logs

Steps:

  1. Check DB_POOL_MAX setting in deployment
  2. Review backend connection pool metrics
  3. Verify database connection string is correct
  4. Check if total connections exceed Cloud SQL limits

Critical Finding: Cloud SQL Instance Sizing

Important: The staging Cloud SQL instance (metabase-db-staging) is currently db-f1-micro, which supports only ~25 connections, while estimated connection usage for staging services is ~64 connections (roughly 256% of capacity). This is the root cause of the connection exhaustion errors.

Solutions:

  1. Upgrade instance tier (recommended):

    ./scripts/upgrade-cloudsql-instance.sh

    Or manually:

    gcloud sql instances patch metabase-db-staging \
    --tier=db-g1-small \
    --project=barto-dev
  2. Increase max_connections (if instance supports it):

    gcloud sql instances patch metabase-db-staging \
    --database-flags=max_connections=100 \
    --project=barto-dev
  3. Reduce connection usage (temporary mitigation):

    • Reduce backend DB_POOL_MAX to 2 (saves 10 connections)
    • Reduce backend max-instances to 5 (saves 15 connections)
    • Keep Metabase at 2 instances (already optimized)
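The savings quoted above follow directly from instances × connections per instance; a quick check with this document's numbers:

```typescript
// Arithmetic behind the quoted savings (all figures from this document).
const instances = 10; // current backend max-instances
const poolMax = 3;    // current DB_POOL_MAX

// Option: DB_POOL_MAX 3 → 2 across 10 instances
const poolSaving = instances * (poolMax - 2);     // 10 connections saved

// Option: max-instances 10 → 5 at 3 connections each
const instanceSaving = (instances - 5) * poolMax; // 15 connections saved
```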

Cloud SQL Connection Limits by Tier:

  • db-f1-micro: ~25 connections (current staging)
  • db-g1-small: ~100 connections (recommended minimum)
  • db-n1-standard-1: ~100-200 connections
  • db-n1-standard-2+: 200+ connections

Change Log

  • 2025-01-10: Initial documentation created
    • Added connection pool configuration details
    • Documented Metabase and Backend pool settings
    • Added monitoring and troubleshooting sections
    • Created emergency fix procedures
    • CRITICAL: Identified root cause - Cloud SQL instance is too small (db-f1-micro)
    • Added instance upgrade procedures and connection limits by tier