Detect #14: Zoom, Spotify & Anthropic Outages; Debugging War Stories; Fresh Open Source Projects
May 12, 2025
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to the May edition of Detect — a monthly roundup of outages, debugging stories, and architecture insights for SREs and software engineers.
We're brought to you by Prequel, the creators of the new 100% open-source problem detection projects: CRE and preq. Check them out and leave a ⭐.
Here's a digest of what's new in the world of problem detection.
HeapInUse metric reported by the Go runtime. Full debugging adventure below. (via WarpStream)
Recent Events 🗓️
Community‑Driven Problem Detection Launch - (Full Video) — Kelsey Hightower and Prequel Co-founder & CTO Tony Meehan unveiled two brand‑new open‑source projects, CRE and preq. The live demo shows how community rules can help you catch bugs, misconfigurations, and anti‑patterns before they lead to incidents. (linkedin)
KubeCon + CloudNativeCon Europe 2025 - Didn't make it to London? All sessions are now on the CNCF channel. (youtube)
SREcon Americas 2025: San Francisco - Full slides and videos are up! (usenix)
Notable Incidents 🔥
Zoom DNS outage – TL;DR: it was DNS. A registrar‑level change caused lookups for zoom.us to return NXDOMAIN, blocking new meetings for ~2 hours. (theverge) (zoom)
Spotify streaming failure – Login and streaming requests returned errors for users in the US & UK; service recovered after about an hour. (theverge)
Anthropic (Claude) partial outages – Multiple events led to elevated model error rates and latency. (status.anthropic)
WP Engine RDS failover – Users experienced elevated errors for ~90 min. (wpenginestatus)
Stack Overflow outage – The platform became unavailable to all users for about 45 mins. (stackstatus)
Troubleshoot & Debug 📖
Memory leak sleuthing when pprof failed — WarpStream engineers resorted to heap profiling and custom Go traces to pinpoint a memory leak. (warpstream)
Debugging Kubernetes Service unavailability at Reddit — A Reddit engineer details what they did to overcome cryptic 503 Service Unavailable errors and flaky control plane health. (reddit)
"The hardest bug I ever debugged" – Debugging a videoconferencing web client. (philippweissensteiner)
Fresh Ideas 🤔
Model error – Lorin Hochstein explores how every piece of software embeds an imperfect model of reality—and why those hidden assumptions turn into incidents. (surfingcomplexity)
Impressions of SREcon Americas 2025 – Niall Murphy’s conference recap notes an industry pivot: AI has moved from hype to hands‑on experimentation—alongside growing interest in sociotechnical resilience. (blog.relyabilit.ie)
Good models protect us from bad models – “Actionable-but-wrong” fixes can be worse than inaction; sound mental models are the “vaccine” against neat‑but‑dangerous solutions. (surfingcomplexity)
Architecting for Reliability 📐👷‍♀️
Anomaly detection for time‑series at Booking.com — a practical walkthrough of statistical methods that aim to cut false positives in half. (medium)
CATS vs. FIFO eviction in MySQL — benchmark results under mixed workloads. (dzone)
The “fourth lost pillar” of observability: configuration data? — CloudQuery argues for elevating config changes in the o11y stack. (cloudquery)
OTel Sucks (But Also Rocks!) — a candid review of OTel’s new features and the hurdles that remain. (opentelemetry)
Tools 🛠️
CRE + preq (100% open‑source) – Community‑driven problem library and lightweight detector to surface issues before you get paged. (docs)
Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.