This note is inspired by the AWS incident that brought down a lot of stuff in 2025.
People are obsessed with the "nines". A service level of 99.99%, or "four nines", seems to be the default expectation. It does not matter: you can arbitrarily say "We have four, or five, or eleven nines." Better yet, you can quit the Cult of the Nines today and focus on better things.
Why the nines are meaningless:
Four nines of availability means that a thing can be unavailable for only 8.6 seconds a day, or, if you are on a 2-week development cycle, you have 2 minutes of leeway should something new bring your thing down. Add another nine and, by definition, you gain only an infinitesimal amount of extra uptime, but everything has to work an order of magnitude harder to guarantee it. Instead of just uploading the new code, you now need a deployment strategy that grows more and more complicated precisely because of the non-happy scenario: Blue-Green, Canary, Rolling, and similar deployments mean doing all kinds of acrobatics, sometimes just to get a new JPEG header onto your website, on the off chance of a freak accident. These complicated maneuvers usually mean an increase in cost, which everyone now has to pay, even the users, because the cost is built into the margin.
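To make the arithmetic concrete, here is a minimal sketch of the downtime budget for a given number of nines. The function name downtime_budget_seconds is made up for illustration, and it assumes availability is measured as a plain fraction of wall-clock time over 24-hour days, which is not necessarily how any particular provider defines it.

```python
SECONDS_PER_DAY = 86_400  # assumes plain 24-hour days


def downtime_budget_seconds(nines: int, window_days: float) -> float:
    """Allowed downtime, in seconds, for `nines` nines over `window_days` days.

    Hypothetical helper: treats availability as a fraction of wall-clock time.
    """
    unavailability = 10 ** -nines  # 4 nines -> 0.0001, i.e. 99.99% available
    return unavailability * window_days * SECONDS_PER_DAY


# Compare the budget per day and per 2-week development cycle.
for nines in (3, 4, 5):
    per_day = downtime_budget_seconds(nines, 1)
    per_cycle = downtime_budget_seconds(nines, 14)
    print(f"{nines} nines: {per_day:.2f} s/day, {per_cycle:.2f} s per 2 weeks")
```

For four nines this works out to 8.64 seconds a day and roughly 121 seconds per 2-week cycle. Each extra nine divides that budget by ten, while the uptime you gain in return shrinks toward nothing.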
But what if we exhaust the budget of nines we committed to? AMZN's stock price rose 1.6 points the day after the incident. If you have ever used AWS and been affected by a service disruption, you know that the loss is entirely yours, the customer's: you have just paid a premium for all these nines that are everywhere in AWS documentation, on product pages, and in the mouths of people in their Commercial org, but now you have to prove to them that you were in fact affected at all. They then get to judge for themselves whether they are compelled to make up for your loss of service, and they do so in service credits, so that you can spend more on AWS. See the problem?