Incidents | Vlge Inc.

Incidents reported on status page for Vlge Inc.: https://status.vlge.io/

**Service Interruption**

https://status.vlge.io/incident/927004

Wed, 17 Jun 2026 19:10:00 -0000

**Summary (Less technical)**

We made a necessary upgrade to our hosting services, and a quirk in one setting meant that while the code was up and running fine, the domains (vlge.io) could not reach it. The site appeared "down", but it was simply not reachable at its normal location (which has the same effect).

**More Details (Technical):**

**App Runner → ECS**

We migrated production from AWS App Runner to ECS Express Mode on Fargate. AWS is winding down App Runner for new customers, and ECS gives us more control over scaling, deploy behavior, and observability. The app itself is unchanged - same Docker image, same database, same Cloudflare front door.

Current path for production traffic:

Browser → Cloudflare (DNS/TLS) → Application Load Balancer (ALB) → ECS Fargate tasks

**What broke:**

We had two separate outage windows this morning, both caused by the same underlying ALB routing issue - not an application bug or data problem.

**Root cause:**

ECS Express Mode manages load balancer routing in a slightly awkward way. On each deploy, AWS automatically updates the ALB listener rule for the Express hostname (*.ecs.us-east-1.on.aws). But our real user traffic arrives through Cloudflare with normal hostnames - vlge.io, app.vlge.io, and community subdomains - and that traffic hits the ALB's default listener rule, which Express Mode does not automatically re-point.

After certain deploys and scaling changes, the default rule was still pointing at an old, empty target group (no healthy ECS tasks behind it). The ALB correctly returned 503 Service Unavailable. Cloudflare passed that through, so the site looked "down" even though the new tasks were healthy and the Express-internal URL still returned 200.

**In short:**

The containers were fine; the load balancer was sending production hostnames to the wrong backend.

**Here are the two windows:**

- ~10:24–10:55 AM CT (~30 min): A scaling change (moving from 1 → 2 minimum tasks for zero-downtime deploys) left the load balancer's default rule pointing at a stale target group.
- ~11:30 AM–12:00 PM (~30 min): A deploy completed successfully, but the automated post-deploy step that repoints the default listener failed on IAM permissions. We fixed it manually once we identified the issue.

No data was lost or corrupted at any point - this was purely a routing/configuration problem at the load balancer layer. All communities on *.vlge.io would have seen the same behavior; it wasn't subdomain-specific.

**What we've done since:**

1. Identified the default-listener gap and documented that Cloudflare hostnames don't go through the Express-managed rule.
2. Added an automated fix script that runs after every deploy and points the default listener at the target group with the most healthy ECS tasks.
3. Fixed IAM permissions on our deploy service account so that step can run unattended going forward.
4. Raised the minimum task count to 2 so future deploys don't create a brief "no healthy target" window during task swaps.
5. Added ECS-specific CloudWatch alarms and an internal infrastructure-health dashboard so we catch 503/latency spikes faster.

We also put a branded Cloudflare custom 503 page in place so users see a proper "we'll be right back" message instead of a raw error if something like this happens again during the migration soak period.
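
The selection logic behind the post-deploy fix in step 2 (point the default listener at the target group with the most healthy ECS tasks) could look roughly like the sketch below. This is an illustration under stated assumptions, not our actual script: the function names, the ARN strings, and the `apply` callback are all hypothetical. In a real run the health states would likely come from the ELBv2 `describe_target_health` call, and `apply` would wrap `modify_listener`; both are left out here so the logic can be read and tested on its own.

```python
from typing import Callable, Dict, List, Optional


def pick_target_group(health_by_group: Dict[str, List[str]]) -> Optional[str]:
    """Return the ARN of the group with the most healthy targets.

    health_by_group maps a target group ARN to the health states of its
    targets (e.g. "healthy", "unhealthy", "draining"). Returns None when
    no group has a healthy target. On a tie, the first group seen wins.
    """
    best_arn: Optional[str] = None
    best_count = 0
    for arn, states in health_by_group.items():
        healthy = sum(1 for state in states if state == "healthy")
        if healthy > best_count:
            best_arn, best_count = arn, healthy
    return best_arn


def repoint_default_listener(
    health_by_group: Dict[str, List[str]],
    current_arn: str,
    apply: Callable[[str], None],
) -> str:
    """Repoint the default rule only when a better group exists.

    `apply` is a hypothetical callback that performs the actual listener
    update; it is called with the chosen target group ARN.
    """
    target = pick_target_group(health_by_group)
    if target is None:
        # Nothing healthy anywhere: leave routing alone rather than guess.
        return "no-healthy-targets"
    if target == current_arn:
        return "unchanged"
    apply(target)
    return "repointed"
```

Refusing to change anything when no group has healthy targets is a deliberate choice in this sketch: it avoids swapping one empty backend for another and leaves the decision to a human or an alarm.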