Production Connection Pool Fix - Step by Step Guide

Overview

This guide walks through fixing connection pool exhaustion in production. The root cause found in staging (a Cloud SQL instance tier too small for the connection load) most likely affects production as well.

Quick Reference

Production Scripts:

  • ./scripts/fix-production-connection-pool.sh - Reduce Metabase instances
  • ENVIRONMENT=production PROJECT_ID=barto-prod ./scripts/upgrade-cloudsql-instance.sh - Upgrade Cloud SQL instance

Step-by-Step Fix

Step 1: Check Current Production Status

First, verify the issue exists in production:

# Set production environment
export ENVIRONMENT=production
export GCP_PROJECT_ID=barto-prod # Update if different
export GCP_REGION=us-central1 # Update if different

# Run the fix script (it will show current status without making changes)
./scripts/fix-production-connection-pool.sh

The script will:

  • Show current Metabase instance configuration
  • Check Cloud SQL instance tier and connection limits
  • Calculate estimated connection usage (a sketch of this arithmetic follows the list)
  • Ask for confirmation before making any changes
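
As a rough illustration of that estimate (the numbers below are assumptions for illustration, not output from the script):

# Estimated peak connections = Metabase instances x connections each + backend pool
MAX_INSTANCES=10        # Cloud Run max instances for Metabase
CONNS_PER_INSTANCE=6    # assumed connections opened per Metabase instance
BACKEND_POOL=4          # assumed total backend pool (DB_POOL_MAX x backend instances)
echo "Estimated peak connections: $((MAX_INSTANCES * CONNS_PER_INSTANCE + BACKEND_POOL))"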

Step 2: Reduce Metabase Instances (Quick Fix)

If the script shows high connection usage, it will offer to reduce Metabase instances:

# This will reduce Metabase from 10 instances to 2
./scripts/fix-production-connection-pool.sh

Impact:

  • ✅ Reduces connection pressure immediately
  • ⚠️ Slight reduction in concurrent query capacity
  • ✅ Safe to run (no downtime)
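
If you prefer to make the change by hand rather than through the script, the equivalent Cloud Run update (the mirror image of the rollback command later in this guide) would be:

gcloud run services update flowpos-metabase-production \
--region=us-central1 \
--project=barto-prod \
--max-instances=2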

Step 3: Upgrade Cloud SQL Instance (Permanent Fix)

This is the real solution but requires a brief downtime (~1-5 minutes):

# Set production environment
export ENVIRONMENT=production
export GCP_PROJECT_ID=barto-prod
export GCP_REGION=us-central1

# Run the upgrade script
./scripts/upgrade-cloudsql-instance.sh

Or manually:

gcloud sql instances patch metabase-db-production \
--tier=db-g1-small \
--project=barto-prod

Impact:

  • ✅ Increases max connections from ~25 to ~100
  • ⚠️ Brief downtime (1-5 minutes)
  • ✅ Permanent solution
  • ✅ Better performance

When to run: Schedule during low-traffic hours if possible.
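
To confirm the upgrade has completed before closing the maintenance window, you can wait on the pending Cloud SQL operation; a sketch using standard gcloud commands:

# Grab the most recent in-flight operation and block until it finishes
OP=$(gcloud sql operations list \
--instance=metabase-db-production \
--project=barto-prod \
--filter="status!=DONE" \
--format="value(name)" \
--limit=1)
[ -n "$OP" ] && gcloud sql operations wait "$OP" --project=barto-prod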

Step 4: Verify Fix

After making changes, verify:

# Check Metabase instances
gcloud run services describe flowpos-metabase-production \
--region=us-central1 \
--project=barto-prod \
--format="value(spec.template.metadata.annotations['autoscaling.knative.dev/maxScale'])"

# Check Cloud SQL instance tier
gcloud sql instances describe metabase-db-production \
--project=barto-prod \
--format="value(settings.tier)"

# Monitor logs for connection errors
gcloud logging read "resource.type=cloud_sql_database AND textPayload:\"remaining connection slots\"" \
--limit=10 \
--project=barto-prod \
--format=json
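
If you have direct database access, you can also confirm the new limit and current usage from Postgres itself; a sketch using gcloud's built-in psql wrapper (it prompts for the database user's password):

# Open a psql session to the instance
gcloud sql connect metabase-db-production --project=barto-prod --user=postgres

# Then, inside psql:
# SELECT count(*) AS in_use, current_setting('max_connections') AS max_conn
# FROM pg_stat_activity;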

Recommended Approaches

Option A: Immediate Fix (No Downtime)

  1. Run ./scripts/fix-production-connection-pool.sh to reduce instances
  2. Monitor for improvement
  3. Schedule instance upgrade during maintenance window

Option B: Permanent Fix (Brief Downtime)

  1. Schedule maintenance window
  2. Upgrade Cloud SQL instance first
  3. Optionally reduce Metabase instances if still needed

Option C: Both (Safest)

  1. Reduce instances first (immediate relief, no downtime)
  2. Upgrade instance during next maintenance window (permanent fix)

Production Considerations

Before Making Changes

  1. Check current traffic patterns:

    # Check active instances
    gcloud run services describe flowpos-metabase-production \
    --region=us-central1 \
    --project=barto-prod \
    --format="value(status.traffic)"
  2. Review recent errors:

    gcloud logging read "resource.type=cloud_sql_database AND severity>=ERROR" \
    --limit=50 \
    --project=barto-prod \
    --freshness=1h
  3. Notify team of planned changes

During Changes

  • Monitor Cloud Run service status
  • Watch for increased error rates (see the sketch below)
  • Have rollback plan ready
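
One way to watch for error spikes while a change rolls out (the Support section's logging filter, narrowed to errors in the last few minutes):

gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=flowpos-metabase-production AND severity>=ERROR" \
--limit=20 \
--project=barto-prod \
--freshness=10m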

After Changes

  • Verify connection errors have stopped
  • Monitor connection usage patterns
  • Document changes made

Rollback Plan

If issues occur after reducing instances:

# Increase instances back
gcloud run services update flowpos-metabase-production \
--region=us-central1 \
--project=barto-prod \
--max-instances=10 # Restore to original

If issues occur after instance upgrade:

  • Instance downgrade requires similar downtime (a command sketch follows this list)
  • Consider this carefully before proceeding
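
If you do need to roll the tier back, it is the upgrade command in reverse, with the same brief downtime:

# Downgrade to the previous tier (assumed here to have been db-f1-micro)
gcloud sql instances patch metabase-db-production \
--tier=db-f1-micro \
--project=barto-prod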

Connection Limits Reference

Tier               Max Connections   Cost Impact
db-f1-micro        ~25               Baseline
db-g1-small        ~100              ~2-3x cost
db-n1-standard-1   ~100-200          ~5-6x cost

For production with your workload (~64 connections needed):

  • Minimum: db-g1-small (100 connections)
  • Recommended: db-g1-small or db-n1-standard-1 for headroom

Troubleshooting

Script Fails with Permission Error

# Ensure you're authenticated
gcloud auth login
gcloud config set project barto-prod

# Check required roles
# You need: Cloud SQL Admin, Cloud Run Admin
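
To check which roles your account holds on the project (YOUR_EMAIL is a placeholder for your account):

gcloud projects get-iam-policy barto-prod \
--flatten="bindings[].members" \
--filter="bindings.members:user:YOUR_EMAIL" \
--format="value(bindings.role)"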

Instance Upgrade Takes Too Long

  • Normal upgrade time: 1-5 minutes
  • Large instances: 5-15 minutes
  • Monitor with: gcloud sql operations list --instance=metabase-db-production --project=barto-prod

Connection Errors Continue After Fix

  1. Verify instance tier was upgraded:

    gcloud sql instances describe metabase-db-production --project=barto-prod --format="value(settings.tier)"
  2. Check actual connection usage (requires database access):

    SELECT count(*) FROM pg_stat_activity;
  3. Verify the backend's DB_POOL_MAX is reasonable (it should be 3 or less); see the sketch below:
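
    # Assumes the backend also runs on Cloud Run; YOUR_BACKEND_SERVICE is a
    # placeholder, not a real service name.
    gcloud run services describe YOUR_BACKEND_SERVICE \
    --region=us-central1 \
    --project=barto-prod \
    --format=yaml | grep -A1 DB_POOL_MAX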

Support

For issues or questions:

  • Check logs: gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=flowpos-metabase-production" --limit=50 --project=barto-prod
  • Review documentation: docs/database/connection-pool-management.md
  • Check monitoring dashboards in GCP Console