URGENT: Implement auto-restart monitor to handle H12 timeouts and crashes

Posted Jul 27, 2025

closed

Problem

The site experiences frequent H12 timeout errors and crashes after deployments. The web dynos need to be manually restarted to restore service. This is causing significant downtime and user frustration.

Root Causes Identified

  1. Sidekiq Middleware Bug: Module.new was being used instead of Class.new in sidekiq_memory_killer.rb, causing NoMethodError exceptions
  2. Memory Issues: Web dynos approaching memory limits (512MB for Performance-M)
  3. H12 Timeouts: Requests taking longer than Heroku's 30-second limit
  4. Post-deployment instability: Site crashes within minutes of deployment

Solution Implemented

Created an automatic restart system that runs within the app itself:

1. Auto-Restart Monitor (config/initializers/auto_restart_monitor.rb)

  • Monitors for H12 errors (restarts after 3 errors in 5 minutes)
  • Tracks request timeouts
  • Monitors memory usage (restarts if > 450MB)
  • Implements cooldown period (10 minutes) to prevent restart loops
  • Gracefully shuts down Puma when restart needed
  • Coordinates restarts across multiple web dynos
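
A minimal sketch of what such an initializer could look like is below. The method names, check interval, and stagger delay are illustrative assumptions; only the 3-errors / 5-minute / 450MB / 10-minute thresholds come from this issue.

```ruby
# config/initializers/auto_restart_monitor.rb (illustrative sketch, not the exact production code)
module AutoRestartMonitor
  H12_THRESHOLD   = 3        # restart after 3 H12/timeout errors...
  ERROR_WINDOW    = 5 * 60   # ...within a 5-minute window
  MEMORY_LIMIT_MB = 450      # restart if resident memory exceeds 450MB
  COOLDOWN        = 10 * 60  # at least 10 minutes between restarts
  CHECK_INTERVAL  = 30       # seconds between background checks

  @errors       = []
  @last_restart = Time.at(0)
  @mutex        = Mutex.new

  class << self
    # Called by request middleware whenever an H12 / timeout is observed.
    def record_error(time = Time.now)
      @mutex.synchronize { @errors << time }
    end

    # Number of errors recorded within the last ERROR_WINDOW seconds.
    def recent_error_count
      cutoff = Time.now - ERROR_WINDOW
      @mutex.synchronize do
        @errors.reject! { |t| t < cutoff }
        @errors.size
      end
    end

    def memory_mb
      # Resident set size of this process in MB (Linux dynos).
      `ps -o rss= -p #{Process.pid}`.to_i / 1024
    end

    def start
      Thread.new do
        loop do
          sleep CHECK_INTERVAL
          restart! if should_restart?
        end
      end
    end

    private

    def should_restart?
      return false if Time.now - @last_restart < COOLDOWN # cooldown guard
      recent_error_count >= H12_THRESHOLD || memory_mb > MEMORY_LIMIT_MB
    end

    def restart!
      @last_restart = Time.now
      # Stagger restarts so dynos don't all cycle at once: web.1 waits 5s, web.2 waits 10s, etc.
      sleep(ENV["DYNO"].to_s[/\d+/].to_i * 5)
      Rails.logger.warn("[AutoRestartMonitor] thresholds exceeded, shutting down Puma")
      # SIGTERM lets Puma finish in-flight requests; once the process exits,
      # Heroku brings the dyno back up automatically.
      Process.kill("TERM", Process.pid)
    end
  end
end

AutoRestartMonitor.start if ENV["DYNO"].to_s.start_with?("web.")
```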

2. Monitoring Endpoint (app/controllers/monitor_controller.rb)

  • Provides /monitor/status endpoint for health checks
  • Shows current memory usage, error counts, and system status
  • Protected by token authentication
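
Sketch of the endpoint, assuming a route like get "/monitor/status", to: "monitor#status" and the helper methods from the monitor sketch above; the token header name, env var, and JSON fields are assumptions.

```ruby
# app/controllers/monitor_controller.rb (illustrative sketch)
class MonitorController < ActionController::Base
  before_action :authenticate_token!

  # GET /monitor/status
  def status
    render json: {
      status:        "ok",
      dyno:          ENV["DYNO"],
      memory_mb:     AutoRestartMonitor.memory_mb,
      recent_errors: AutoRestartMonitor.recent_error_count,
      checked_at:    Time.now.utc.iso8601
    }
  end

  private

  # Simple shared-token check so the endpoint is not publicly readable.
  def authenticate_token!
    provided = request.headers["X-Monitor-Token"].presence || params[:token]
    expected = ENV.fetch("MONITOR_TOKEN", "")
    head :unauthorized unless ActiveSupport::SecurityUtils.secure_compare(provided.to_s, expected)
  end
end
```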

3. Fixed Sidekiq Bug

  • Changed Module.new to Class.new in sidekiq_memory_killer.rb
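
The fix itself is a one-word change. Sidekiq instantiates each entry in its server middleware chain with .new, and an anonymous Module cannot be instantiated, hence the NoMethodError. Sketched below with a minimal middleware body; only the Module.new to Class.new change comes from this issue, the rest is illustrative.

```ruby
# config/initializers/sidekiq_memory_killer.rb (sketch)
require "sidekiq"

# Before (broken): a Module does not respond to .new, so the middleware
# chain raised NoMethodError when it tried to instantiate it.
#
#   SidekiqMemoryKiller = Module.new do
#     def call(_worker, _job, _queue)
#       yield
#     end
#   end

# After (fixed): an anonymous Class can be instantiated by the chain.
SidekiqMemoryKiller = Class.new do
  def call(_worker, _job, _queue)
    yield # run the job; the real middleware would also check memory here
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqMemoryKiller
  end
end
```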

How It Works

  1. The monitor runs in a background thread on each web dyno
  2. It tracks H12 errors and timeouts via middleware
  3. When thresholds are exceeded, it gracefully terminates the process
  4. Heroku automatically restarts the terminated dyno
  5. Multiple dynos coordinate to stagger restarts
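
As a concrete illustration of step 2, a Rack middleware along these lines could feed the monitor; the class name, constant, and insertion point are assumptions.

```ruby
# Illustrative Rack middleware: times each request and reports anything that
# hits Heroku's 30-second router limit to the monitor sketched earlier.
class TimeoutTracker
  ROUTER_LIMIT = 30 # seconds; past this, the router has already returned an H12

  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @app.call(env)
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # Count the request toward the restart threshold whether it finished late
    # or raised (e.g. Rack::Timeout) -- either way the user saw a failure.
    AutoRestartMonitor.record_error if elapsed >= ROUTER_LIMIT
  end
end

# config/application.rb
#   config.middleware.insert_before 0, TimeoutTracker
```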

Benefits

  • Automatic recovery from crashes
  • No manual intervention required
  • Minimal downtime (Heroku restarts dynos in ~10 seconds)
  • Prevents extended outages
  • Provides visibility into system health

Next Steps

  • Deploy this temporary fix to production
  • Continue investigating root cause of performance issues
  • Consider upgrading to larger dynos if memory is the constraint
  • Optimize slow database queries and ActiveStorage operations
