You have to love The Register for calling out AWS plainly.
AWS has, as is their tradition when outages strike, provided increasing levels of detail as new information comes to light. Reading through it, one really gets the sense that it took them 75 minutes to go from “things are breaking” to “we’ve narrowed it down to a single service endpoint, but are still researching,” which is something of a bitter pill to swallow. To be clear: I’ve seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
[…]
I want to be very clear on one last point. This isn’t about the technology being old. It’s about the people maintaining it being new. If I had to guess what happens next, the market will forgive AWS this time, but the pattern will continue.
My own thoughts on this issue were published in Wired.
“When the system couldn’t correctly resolve which server to connect to, cascading failures took down services across the internet,” says Davi Ottenheimer, a longtime security operations and compliance manager and a vice president at the data infrastructure company Inrupt. “Today’s AWS outage is a classic availability problem, and we need to start seeing it more as a data integrity failure.”
[…]
“Failures increasingly trace to integrity,” Ottenheimer says. “Corrupted data, failed validation or, in this case, broken name resolution that poisoned every downstream dependency. Until we better understand and protect integrity, our total focus on uptime is an illusion.”
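Ottenheimer’s point about broken name resolution poisoning every downstream dependency is easy to illustrate. Here’s a minimal Python sketch of the distinction he’s drawing: treating a resolver answer as something that can be wrong and failing loudly at that step, rather than passing a bad (or missing) answer along and letting every caller cascade. The hostname is just an illustrative placeholder, not anything lifted from AWS’s write-up.

```python
import socket


def resolve_or_fail(hostname: str, port: int = 443) -> str:
    """Resolve a hostname, and fail fast with context if resolution breaks.

    Illustrative sketch only: the endpoint and error handling are placeholders,
    not a reconstruction of what AWS's internal systems do.
    """
    try:
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Surface the resolution failure at the boundary where it happened,
        # instead of letting each downstream caller discover it independently.
        raise RuntimeError(f"DNS resolution failed for {hostname}: {exc}") from exc

    if not results:
        raise RuntimeError(f"Resolver returned no addresses for {hostname}")

    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return results[0][4][0]


if __name__ == "__main__":
    # Placeholder endpoint; swap in whatever your service actually depends on.
    print(resolve_or_fail("dynamodb.us-east-1.amazonaws.com"))
```

The few lines of error handling aren’t the point; the point is that the resolution step is checked at all, which is the integrity framing Ottenheimer is arguing for.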
I was also interviewed by BBC News, where I made some points about the people who can “run a cloud”.