Detect #18: August failures, September safeguards; incidents at OpenAI, Cloudflare, PagerDuty; Bitnami image deprecation
Community
September 11, 2025
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
Welcome to the September edition of Detect, a monthly roundup of outages, monitoring tips, debugging stories, and more from the reliability community.
We're brought to you by Prequel, the creators of the new 100% open-source problem detection projects: CRE and preq. Check them out and leave a ⭐.
Here's a digest of what you may have missed ...
Timeline of PagerDuty's 8/28 Incident (full details below)
Featured CRE ⚠️
Latest on the Bitnami Catalog Deprecation: Bitnami’s free Docker Hub catalog started going away on Aug 28, with several "brownouts" planned in the coming weeks. Read the latest analysis and tips for managing it. (prequel)
Upcoming Events 📅
SEV0 (SF), September 23: Featuring Prequel and speakers from Nvidia, Netflix, Plaid, Zendesk, Anthropic, and more. If you’re attending, stop by and say hi. (sev0)
Notable Incidents & Causes 🔥
OpenAI reported elevated error rates for ChatGPT conversations with GPT-5 on Aug 8, 2025; the incident was identified and resolved the same day. Root cause not disclosed on the page at time of writing. (status.openai)
Cloudflare users experienced increased latency and packet loss for traffic between Cloudflare and AWS us-east-1 on Aug 21, 2025 from 16:27–20:18 UTC (partial impact began 15:56 UTC). Cloudflare attributes the event to a sudden surge from a single customer saturating interconnects—not an attack—with progressive capacity adds during mitigation. (blog.cloudflare)
PagerDuty (Aug 28, 2025): From 03:53–10:10 UTC, processing of new incoming events in US regions was disrupted (some 502 responses), with degraded outbound notifications, webhooks, chat integrations, and REST API. Root cause: a newly rolled-out feature instantiated a new Kafka producer per API request, driving ~4.2M extra producers/hour, which exhausted broker JVM heap and cascaded across the cluster. A rollback and restarts stabilized the service. (pagerduty.com)
OpenRouter services were inaccessible for ~50 minutes beginning 5:40 a.m. Eastern (US) due to downtime of an upstream database dependency in US-East; status notes resolution and affected components. (status.openrouter)
OpenStreetMap post-mortem (Dec 15–18, 2024; published Feb 15, 2025) details a 68-hour network incident centered on HE.net routing equipment failure in Amsterdam, with partial read-only restoration after ~8 hours and full editing restored after ~56 hours (all times UTC). Root cause: upstream routing hardware failure and single-ISP exposure. (operations.osmfoundation)
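For readers who want to see the shape of the PagerDuty root cause above, here is a minimal Python sketch of the difference between creating a producer per request and reusing one long-lived producer. It is not PagerDuty's code: `StubProducer`, the two handlers, and the `"events"` topic are hypothetical stand-ins that only count instantiations, and no real Kafka client is involved. The only figure taken from the write-up is the ~4.2M extra producers per hour.

```python
class StubProducer:
    """Hypothetical stand-in for a Kafka producer; it only counts instances."""

    created = 0

    def __init__(self) -> None:
        StubProducer.created += 1

    def send(self, topic: str, payload: bytes) -> None:
        # A real client would buffer the record and talk to the brokers here.
        pass


def handle_request_per_call(payload: bytes) -> None:
    # Anti-pattern: every API request builds (and abandons) its own producer.
    StubProducer().send("events", payload)


_shared_producer = None


def handle_request_shared(payload: bytes) -> None:
    # One long-lived producer, created lazily and reused by every request.
    global _shared_producer
    if _shared_producer is None:
        _shared_producer = StubProducer()
    _shared_producer.send("events", payload)


def producers_created(handler, requests: int) -> int:
    """Run `handler` for `requests` fake requests and count new producers."""
    before = StubProducer.created
    for _ in range(requests):
        handler(b"{}")
    return StubProducer.created - before


# Figure reported in the incident write-up; the per-second rate is derived.
EXTRA_PRODUCERS_PER_HOUR = 4_200_000
extra_per_second = EXTRA_PRODUCERS_PER_HOUR / 3600  # roughly 1,167 per second

if __name__ == "__main__":
    print(producers_created(handle_request_per_call, 1000))  # 1000
    print(producers_created(handle_request_shared, 1000))    # 1
    print(round(extra_per_second))                            # 1167
```

In the sketch the per-request handler creates one producer per call, while the shared handler creates exactly one for the whole run. Spread over an hour, the reported figure works out to roughly 1,167 new producers per second that the brokers had to track.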
Debugging 🔎
Interactively poke failed GitHub Actions: throw a reverse shell from the runner with netcat and an ngrok TCP tunnel; the author later recommends mxschmitt/action-tmate@v3 for safer, browser/SSH debugging. (jacobtomlinson.dev)
A year-long TCP mystery: exporting Linux kernel TCP_INFO (via a Go proxy and go-tcpinfo) made it possible to distinguish sender/receiver/network limits; the vendor ultimately fixed the bug. (fdi.sk)
DNS timeouts in a large EKS cluster: traced to the 1024 PPS limit on link-local services and search-domain amplification (ndots:5)—particularly for sudo hostname lookups—validated via CoreDNS logs and tcpdump. (cep.dev)
SSSD troubleshooting basics: enable debug_level per process, inspect /var/log/sssd/*, verify nsswitch.conf and resolver paths, and use sss_debuglevel to toggle logging without restarts. (docs.pagure.org)
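The search-domain amplification mentioned in the EKS DNS item above can be sketched in a few lines of Python. This is a simplified model of glibc-style `resolv.conf` search handling, not the code from the linked post: the search list, the `default` namespace, and the assumption that each candidate name costs two packets (one A and one AAAA query) are illustrative. The 1024 PPS figure comes from the item itself.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Return the names a resolver would try, in order, if every earlier one fails."""
    if name.endswith("."):
        return [name]  # already fully qualified, so the search list is skipped
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [absolute] + searched
    # Too few dots: walk the whole search list before trying the name as-is.
    return searched + [absolute]


def lookups_per_second(pps_limit: int, candidates: int, packets_per_name: int = 2) -> int:
    """How many full lookups fit under a packets-per-second cap."""
    return pps_limit // (candidates * packets_per_name)


# Assumed search list for a pod in the "default" namespace on an EC2 node.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

if __name__ == "__main__":
    tried = candidate_queries("example.com", SEARCH, ndots=5)
    print(len(tried))                       # 5
    print(tried[0])                         # example.com.default.svc.cluster.local.
    print(lookups_per_second(1024, len(tried)))  # 102
```

Under these assumptions, a single lookup of an external name such as `example.com` with `ndots:5` turns into five candidate names and ten packets, so a 1024 PPS cap allows only about 102 such lookups per second. Appending a trailing dot, or lowering `ndots`, cuts this to a single candidate.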
Tools 🛠️
preq (github): Updated with the latest community CREs, including:
n8n silent data loss (CRE-2025-0179) by @Sahelisaha04 (github)
AutoGPT recursive self-analysis loop (CRE-2025-0200) by @MAVRICK-1 (github)
Stable Diffusion WebUI CUDA OOM (CRE-2025-0162) from @piyzard (github)
timep: v1.5 is a trap-based Bash profiler that emits hierarchical per-command timings and optional flamegraphs; the latest update fixes subtree merge issues (adds CRC32/FNV-1a loadables) and expands prebuilt architectures (x86_64, aarch64, armv7, ppc64le, riscv, s390) via CI. (github)
rotel: a lightweight OpenTelemetry collector; the Sep 2, 2025 release v0.0.1-alpha28 (by @mheffner) fixes a low-severity CVE and bumps tracing-subscriber. (github)
Whether you're on call this week, looking to improve system reliability, or simply keeping up with the latest tips & tricks, we’re happy to be a part of your day.