An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
(Recent) Hands-On with Preq - Community-Driven Problem Detection | Rawkode Live - Prequel co-founder Tony Meehan joined the unscripted livestream for an entertaining deep dive. (youtube) (rawkode)
Open Observability Summit + OTEL Day (June 26, Denver) – Maintainers and end-users share real-world stories of instrumenting at scale and discuss what’s next for open-source projects (events.linuxfoundation)
PlatformCon 2025 (June 23 - 27, NY, London & Virtual) – 250+ talks on developer platforms, golden paths, and infrastructure (platformcon)
Hands-On Analysis 📐👷‍♀️
Cost-aware reliability at Warner Bros. Discovery – Tom Leaman details how they cut spend without compromising uptime across 400+ microservices (thefrugalarchitect)
Notable Incidents 🔥
Anthropic console & Claude.ai unavailable (May 27) – Login endpoints were down 22:11 - 00:24 UTC (≈ 2 h 13 m); API traffic was unaffected (status.anthropic)
X instability after data-center outage (May 22-23) – Timelines, DMs, and search remained glitchy for 24 h; X Eng attributed the issue to a data-center outage (techcrunch)
Denver ATC radio failure (May 15) – Both transmitters at the Denver Air Route Traffic Control Center went offline for ~2 minutes, briefly severing pilot-controller comms; operations continued on emergency frequencies (reuters)
Notes on Troubleshooting – Covers symptomatic and topographic approaches and highlights the need for community knowledge. (towardsdev)
The same incident never happens twice—patterns recur – Lorin Hochstein shows how family resemblances across outages accelerate diagnosis (surfingcomplexity)
The MTTI Manifesto – Track Mean Time to Isolate to expose knowledge gaps and slash customer pain (oldschoolburke)
When incident heroics are too heroic – Celebrating all-night saviors hides systemic reliability debt. What should you do about it? (thefridaydeploy)
Tools 🛠️
preq v0.1.25 – Last month's release of the open-source problem detector adds Slack notifications and the latest community rules (reddit)
FireDBG – Open-source time-travel debugger for Rust async: record once; step forward and backward (firedbg)
Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.