Detect #16: GCP's 503 storm; AI hates debugging?; how GitHub tackles problems
July 5, 2025
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to the July edition of Detect — a monthly roundup of outages, debugging stories, and reliability tips for SREs and software engineers.
We're brought to you by Prequel, the creators of the new 100% open-source problem detection projects: CRE and preq. Check them out and leave a ⭐.
Here's a digest of what's new in the world of problem detection...
Is debugging the job that AI doesn't actually want?
Users report that Gemini threatens self-harm while debugging (via x)
Podcasts/Videos 🎙️
Featured: “What CVEs Did for Security, CREs Are Doing for Reliability” — how community-reported Common Reliability Enumerations (CREs) are helping automate problem detection (youtube)
Queuing Theory on a Cocktail Napkin — Dan Slimmon turns advanced math into on-call intuition you can use today (youtube)
Notable Incidents 🔥
Google Cloud global 503s (12 Jun) — a policy change returned 503s for quota checks, producing widespread failures for ~3 hours (status.cloud.google); quick-take analysis (surfingcomplexity); deeper critique (ebellani)
Salesforce outage (18 Jun) — missing routes led to widespread outages across Heroku and other services (help.salesforce)
GitHub Actions disruption (17 Jun) — 47.2% of runs had delayed starts and 21.0% of runs failed. The impact extended beyond Actions itself: 60% of Copilot Coding Agent sessions were cancelled, and all Pages sites using branch-based builds failed to deploy. RCA pending (githubstatus)
Supabase API timeouts (12 Jun) — elevated 5xx for 2.5 hours tied to an upstream Cloudflare issue (status.supabase)
Hands-On Debugging 📐👷‍♀️
Featured: Detecting NGINX worker leaks — Prequel engineers tracked down silent OOM crashes in ingress-nginx, then published CREs PREQUEL-2025-0071 and PREQUEL-2025-0076 so you can detect them before you feel pain (prequel).
Why the Rust compiler feels slow — profiling shows LLVM inlining dominates; tips include cargo-chef, thin-LTO tweaks, and split debuginfo (sharnoff)
Debug like a champion — five detective-style habits that shrink bug-hunt time (flaky.build)
Simulate network latency in local containers — tc + NetEm recipes to reproduce slow links in Podman / Docker without a test lab (developers.redhat)
Fresh Ideas 🤔
How GitHub engineers tackle platform problems — impact-radius mapping, domain immersion, and a single availability metric for rapid triage (github.blog)
Tools 🛠️
preq v0.1.31 — new release adds Linear Action integration, embedded runbook expressions, extensive CLI + engine unit tests, and increased auth-timeout resilience (github)
New contributors: @samgaw, @amanycodes, @ramin, @kris-gaudel, @Harsh9485
Pair preq with CREs to check all your services for bugs and misconfigurations before they page you
Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.