# Production Connection Pool Fix - Step by Step Guide

## Overview
This guide walks through fixing connection pool exhaustion in production. The same root cause found in staging (an undersized Cloud SQL instance) likely affects production as well.
## Quick Reference

**Production Scripts:**

- `./scripts/fix-production-connection-pool.sh` - Reduce Metabase instances
- `ENVIRONMENT=production PROJECT_ID=barto-prod ./scripts/upgrade-cloudsql-instance.sh` - Upgrade Cloud SQL instance
## Step-by-Step Fix

### Step 1: Check Current Production Status
First, verify the issue exists in production:
```bash
# Set production environment
export ENVIRONMENT=production
export GCP_PROJECT_ID=barto-prod  # Update if different
export GCP_REGION=us-central1     # Update if different

# Run the fix script (it will show current status without making changes)
./scripts/fix-production-connection-pool.sh
```
The script will:
- Show current Metabase instance configuration
- Check Cloud SQL instance tier and connection limits
- Calculate estimated connection usage
- Ask for confirmation before making any changes
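To make the estimate concrete, here is a minimal sketch of the kind of calculation the script performs (the per-instance pool size is a hypothetical value; the script's actual logic may differ):

```bash
# Rough estimate: max Cloud Run instances x assumed pool size per instance
MAX_INSTANCES=$(gcloud run services describe flowpos-metabase-production \
  --region="$GCP_REGION" \
  --project="$GCP_PROJECT_ID" \
  --format="value(spec.template.metadata.annotations['autoscaling.knative.dev/maxScale'])")
POOL_PER_INSTANCE=10  # hypothetical per-instance pool size; check your Metabase config
echo "Estimated peak connections: $(( MAX_INSTANCES * POOL_PER_INSTANCE )) (db-f1-micro allows ~25)"
```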
### Step 2: Reduce Metabase Instances (Quick Fix)
If the script shows high connection usage, it will offer to reduce Metabase instances:
```bash
# This will reduce Metabase from 10 instances to 2
./scripts/fix-production-connection-pool.sh
```
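If you prefer to apply the change by hand, the equivalent command (mirroring the rollback command later in this guide, which restores `--max-instances=10`) would be:

```bash
# Manually cap Metabase at 2 instances (reverse of the rollback command below)
gcloud run services update flowpos-metabase-production \
  --region=us-central1 \
  --project=barto-prod \
  --max-instances=2
```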
**Impact:**
- ✅ Reduces connection pressure immediately
- ⚠️ Slight reduction in concurrent query capacity
- ✅ Safe to run (no downtime)
### Step 3: Upgrade Cloud SQL Instance (Permanent Fix)

This is the permanent solution, but it requires brief downtime (~1-5 minutes):
```bash
# Set production environment
export ENVIRONMENT=production
export GCP_PROJECT_ID=barto-prod
export GCP_REGION=us-central1

# Run the upgrade script
./scripts/upgrade-cloudsql-instance.sh
```
Or manually:
```bash
gcloud sql instances patch metabase-db-production \
  --tier=db-g1-small \
  --project=barto-prod
```
**Impact:**
- ✅ Increases max connections from ~25 to ~100
- ⚠️ Brief downtime (1-5 minutes)
- ✅ Permanent solution
- ✅ Better performance
**When to run:** Schedule during low-traffic hours if possible.
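As an extra safety net before the upgrade, you can take an on-demand backup of the instance (optional; uses the standard Cloud SQL backup command):

```bash
# Create an on-demand backup before changing the tier
gcloud sql backups create \
  --instance=metabase-db-production \
  --project=barto-prod
```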
### Step 4: Verify Fix
After making changes, verify:
```bash
# Check Metabase instances
gcloud run services describe flowpos-metabase-production \
  --region=us-central1 \
  --project=barto-prod \
  --format="value(spec.template.metadata.annotations['autoscaling.knative.dev/maxScale'])"

# Check Cloud SQL instance tier
gcloud sql instances describe metabase-db-production \
  --project=barto-prod \
  --format="value(settings.tier)"

# Monitor logs for connection errors
gcloud logging read "resource.type=cloud_sql_database AND textPayload:\"remaining connection slots\"" \
  --limit=10 \
  --project=barto-prod \
  --format=json
```
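You can also verify usage against the new limit from inside the database itself (assumes a `postgres` user you can authenticate as):

```bash
# Open a psql session through gcloud (temporarily allowlists your IP)
gcloud sql connect metabase-db-production --user=postgres --project=barto-prod

# Then, inside psql:
#   SHOW max_connections;                  -- should report ~100 after the upgrade
#   SELECT count(*) FROM pg_stat_activity; -- current connection count
```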
## Recommended Approach

### Option A: Immediate Fix (No Downtime)
- Run `./scripts/fix-production-connection-pool.sh` to reduce instances
- Monitor for improvement
- Schedule instance upgrade during maintenance window
### Option B: Permanent Fix (Brief Downtime)
- Schedule maintenance window
- Upgrade Cloud SQL instance first
- Optionally reduce Metabase instances if still needed
### Option C: Both (Safest)
- Reduce instances first (immediate relief, no downtime)
- Upgrade instance during next maintenance window (permanent fix)
## Production Considerations

### Before Making Changes
- **Check current traffic patterns:**

  ```bash
  # Check active instances
  gcloud run services describe flowpos-metabase-production \
    --region=us-central1 \
    --project=barto-prod \
    --format="value(status.traffic)"
  ```

- **Review recent errors:**

  ```bash
  gcloud logging read "resource.type=cloud_sql_database AND severity>=ERROR" \
    --limit=50 \
    --project=barto-prod \
    --freshness=1h
  ```

- **Notify team of planned changes**
### During Changes
- Monitor Cloud Run service status
- Watch for increased error rates
- Have rollback plan ready
### After Changes
- Verify connection errors have stopped
- Monitor connection usage patterns
- Document changes made
## Rollback Plan
If issues occur after reducing instances:
```bash
# Increase instances back
gcloud run services update flowpos-metabase-production \
  --region=us-central1 \
  --project=barto-prod \
  --max-instances=10  # Restore to original
```
If issues occur after instance upgrade:
- Instance downgrade requires similar downtime
- Consider this carefully before proceeding
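If you do decide to downgrade, the command mirrors the upgrade and will again cause ~1-5 minutes of downtime:

```bash
# Revert to the original tier (brief downtime, same as the upgrade)
gcloud sql instances patch metabase-db-production \
  --tier=db-f1-micro \
  --project=barto-prod
```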
## Connection Limits Reference

| Tier | Max Connections | Cost Impact |
|---|---|---|
| `db-f1-micro` | ~25 | Baseline |
| `db-g1-small` | ~100 | ~2-3x cost |
| `db-n1-standard-1` | ~100-200 | ~5-6x cost |

For production with your workload (~64 connections needed):

- Minimum: `db-g1-small` (100 connections)
- Recommended: `db-g1-small` or `db-n1-standard-1` for headroom
## Troubleshooting

### Script Fails with Permission Error
```bash
# Ensure you're authenticated
gcloud auth login
gcloud config set project barto-prod

# Check required roles
# You need: Cloud SQL Admin, Cloud Run Admin
```
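To list the roles currently granted to your account (replace the email placeholder), one option is:

```bash
# List your roles on the project (replace YOUR_EMAIL with your account email)
gcloud projects get-iam-policy barto-prod \
  --flatten="bindings[].members" \
  --filter="bindings.members:YOUR_EMAIL" \
  --format="value(bindings.role)"
```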
### Instance Upgrade Takes Too Long
- Normal upgrade time: 1-5 minutes
- Large instances: 5-15 minutes
- Monitor with:

  ```bash
  gcloud sql operations list --instance=metabase-db-production --project=barto-prod
  ```
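Once you have the operation ID from that list, you can block until it finishes:

```bash
# Wait for a specific operation to complete (OPERATION_ID comes from the list command above)
gcloud sql operations wait OPERATION_ID --project=barto-prod
```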
### Connection Errors Continue After Fix

- Verify instance tier was upgraded:

  ```bash
  gcloud sql instances describe metabase-db-production --project=barto-prod --format="value(settings.tier)"
  ```

- Check actual connection usage (requires database access):

  ```sql
  SELECT count(*) FROM pg_stat_activity;
  ```

- Verify the backend `DB_POOL_MAX` is reasonable (should be 3 or less)
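If the backend runs on Cloud Run, you can inspect its configured value without redeploying (the service name below is a placeholder; substitute your backend's):

```bash
# Dump env vars from the backend Cloud Run service (YOUR_BACKEND_SERVICE is a placeholder)
gcloud run services describe YOUR_BACKEND_SERVICE \
  --region=us-central1 \
  --project=barto-prod \
  --format="yaml(spec.template.spec.containers[0].env)"
```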
## Support

For issues or questions:

- Check logs: `gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=flowpos-metabase-production" --limit=50 --project=barto-prod`
- Review documentation: `docs/database/connection-pool-management.md`
- Check monitoring dashboards in GCP Console