July 2024: 🔥 Surviving on-call with an ex-Meta SRE; 🚨 Incidents at Google, Cloudflare, GitHub, and OpenAI; 🛠️ Get a handle on flaky alerts... and more
September 3, 2024
An adventure through the vibrant world of problem detection, where every post is a mix of expert insights, community wisdom, and tips, designed to turbocharge your expertise.
This month’s newsletter is brought to you by the team at Prequel (prequel.dev), the company bringing detection engineering to reliability. Join their early access program to see how they help teams overcome alert overload and manual troubleshooting.
And now, here is a digest of what happened last month in the world of problem detection and troubleshooting.
Upcoming Events 🗓️
🎉 Join us for our BIG July webinar featuring Amin Astaneh, ex-Meta SRE Manager. Hear about Amin’s experience at Meta, where he was responsible for operating 15% of the company’s backend services. Discover the benefits and pitfalls of top-down monitoring approaches and explore the role of problem detection. You'll find valuable insights whether you're going from 0 to 1000 servers or from 1000 to ♾️. Register here. 👈🏼
News 📰
Catch up on interesting news:
Retired Engineer Discovers 55-Year-Old Bug in Lunar Lander Computer Game Code - An incredible story of persistence and discovery in an old classic game. Read more (arstechnica.com)
Intel Raptor Lake Crash Fix - An important update for those working with Intel's Raptor Lake, addressing a critical crash issue. Details here (theverge.com)
Blogs 📝
Sharpen your troubleshooting skills with these insightful blogs:
Fine-Tuning MySQL Performance - Tips and techniques for resolving MySQL issues and improving performance. Read more (dzone.com)
Elixir Code Anti-Patterns - Improve your Elixir code by avoiding common pitfalls and following best practices. Explore here (hexdocs.pm)
Kubernetes Tip: What Happens to Pods on Unreachable Nodes? - Essential tips for managing Kubernetes pods on unreachable nodes. Read here (medium.com)
TempleOS Reverse Engineering Part I - A deep dive into the reverse engineering of TempleOS. Check it out (starkeblog.com)
Chasing a Bug in a SAT Solver - An intriguing journey of debugging a SAT solver issue. Learn more (ochagavia.nl)
Notable Incidents 🔥
Stay informed about the latest incidents and their root cause analyses (RCAs), where available:
LastPass Incident - Analysis of a recent security incident at LastPass. RCA (status.lastpass.com)
Cloudflare Incident - 3x increase in p99 latency. Detailed breakdown of Cloudflare's incident and how it was resolved. RCA (blog.cloudflare.com)
GitHub Incident - Examination of a recent outage at GitHub and the root cause. RCA (githubstatus.com)
EntryWan Incident Postmortem - In-depth analysis of the recent incident at EntryWan. RCA (entrywan.com)
Google Search Bug with Indexing - Insights into a Google Search bug affecting indexing. Read more (searchengineland.com)
OpenAI ChatGPT Outage - Details on the recent ChatGPT outage and what was learned. OpenAI and News (status.openai.com, techradar.com)
Bitbucket Incident - Information on the recent service disruption at Bitbucket affecting pipelines. Details (bitbucket.status.atlassian.com)
Docker Incident - Recent incident at Docker affecting their registry. Details (dockerstatus.com)
Architecture 📐👷‍♀️
Explore how teams are architecting for reliability:
Improving Push Processing on GitHub - How GitHub enhanced their push processing system. Read more (github.blog)
Stripe's Zero Downtime Data Migrations - Learn about Stripe's approach to maintaining uptime during data migrations. Learn more (stripe.com)
Flaky Alerts Are Saying Something - Understanding what flaky alerts indicate about your system. Explore here (utcc.utoronto.ca)
How eBPF is Shaping the Future of Linux and Platform Engineering - Discover the impact of eBPF on Linux and platform engineering. Read here (infoworld.com)
Tools 🛠️
Boost your toolkit with these powerful tools:
Postgres-BPFTrace - A useful tool for tracing PostgreSQL using BPF. GitHub (github.com)
Enhanced Debugging with Console.log - Beginner to advanced tips for making the most out of console.log in your debugging process. Hacker Noon (hackernoon.com)
As always, we’re open to your feedback and suggestions. Whether you're troubleshooting an issue, looking to optimize performance, or simply keeping up with the latest tricks, we’re happy to be a part of your day.
Follow our brand-new account on X (fka Twitter): @detect_sh
Find us on the web? Join our mailing list so you'll be the first to know.