High-profile, high-impact outages

High-profile or high-impact outages and incidents are just a way of life on the modern internet and in the public cloud. The key is to recover quickly from an incident and to learn from it through post-mortem analysis or, in some places, root cause analysis (RCA). In my personal opinion, RCA is usually a useless exercise, both because of the political nature of scapegoating in large orgs and because each incident usually has some unique breakpoints.

Facebook

https://blog.cloudflare.com/october-2021-facebook-outage/ (written by Cloudflare, good)

https://www.theguardian.com/technology/2021/oct/05/facebook-outage-what-went-wrong-and-why-did-it-take-so-long-to-fix

Roblox

Below is written by Roblox, and it’s good.

AWS

https://www.thousandeyes.com/blog/aws-outage-analysis-december-15-2021

https://aws.amazon.com/message/12721/

Azure (Microsoft)

Slack

Salesforce

Zoom