Is it time to abandon cloud?

John D. Halamka, a healthcare CIO and practising emergency physician, gives some good perspective on triage and disaster scenarios in “Is it time to abandon cloud computing?” — a detailed and honest look at infrastructure failures.

I know what it takes to provide 99.999 percent uptime.

[…]

With all of this amazing infrastructure comes complexity. With complexity comes unanticipated consequences, change control challenges and human causes of failure.

Let’s look at the downtime I’ve had this year.

For a minute there I thought he was talking about a human body. The outages he lists are:

  1. Changes made in production – DNS changes
  2. Changes made in production – Bad app code that filled storage
  3. Changes made in production – OS upgrade
  4. Changes made in production – Primary power taken offline, secondary failed
  5. “bugs in the network operating system”

I suspect some money might be in his budget forecast for change control technology, people and procedures. He offers a succinct conclusion to sum up these experiences:

These examples illustrate that even the most well engineered infrastructure can fail due to human mistakes, operating system bugs and unanticipated consequences of change.

We could argue whether operating system bugs are still human mistakes but I agree with his overall point — the success of security depends on people and processes. Unfortunately, he then turns and attacks a straw man argument:

The cloud is truly no different. Believing that Microsoft, Google, Amazon or anyone else can engineer perfection at low cost is fantasy. Technology is changing so fast and increasing demand requires so much change that every day is like replacing the wings on a 747 while it’s flying. On occasion bad things will happen. We need to have robust downtime procedures and business continuity planning to respond to failures when they occur.

Yes, engineered perfection is fantasy. I think we can all agree on that. Who is asking for perfection? I want to go back to the start of the post:

I know what it takes to provide 99.999 percent uptime.
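For a sense of scale, here is a minimal sketch of what that figure allows. It assumes a 365-day year and that availability is measured over the whole year; the function name is mine, not something from Halamka's post.

```python
# Assumption: a 365-day year; the function name is illustrative only.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {budget:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year. That is a demanding target, but it is still a budget for failure rather than a promise of none.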

I don’t see anyone asking for or promising 100 percent uptime, even here. More to the point, this is the classic IT operations perspective: Technology is fast changing. Demand is increasing. Nothing cloudy yet. Every day is…wait, wings on a 747 while flying? I love a good risk simile but why must IT operations be like changing the wings while flying?

I mean who would buy a plane ticket if the pilot said “Welcome on board. We will have a chance of landing this aircraft but we’re not sure yet if the wings will even last the trip”? The answer is probably no one (except perhaps wing developers wearing parachutes).

Ok, but what if a pilot said there was a high probability of landing, like 99.999%? The answer is someone flying in a developing airline industry, where no option exists that would improve the chances of landing. To put it in real terms, who would fly on a jet in Africa? Someone who has no choice but to fly in Africa. They take the risk because it is worth it to them:

African carriers are 2% of global traffic, but 23% of global western-built jet hull losses.
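As a rough sketch of what those two percentages imply, the loss rate per unit of traffic can be compared inside and outside the group. This assumes "traffic" is a fair measure of exposure and uses only the two figures quoted above; the function name is hypothetical.

```python
def relative_loss_rate(traffic_share: float, loss_share: float) -> float:
    """Loss rate per unit of traffic for a group, relative to everyone else.

    Assumption: share of traffic is a reasonable proxy for exposure.
    """
    inside = loss_share / traffic_share                # group's losses per unit of traffic
    outside = (1 - loss_share) / (1 - traffic_share)   # same ratio for all other carriers
    return inside / outside


# Figures taken from the quote: 2% of traffic, 23% of hull losses.
ratio = relative_loss_rate(traffic_share=0.02, loss_share=0.23)
print(f"Hull-loss rate relative to the rest of the world: {ratio:.1f}x")
```

On those numbers the hull-loss rate comes out at roughly 14 to 15 times that of carriers elsewhere.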

The Wall Street Journal provides insight based on a UN report on air safety for aid workers. The World Food Program decided the risk of flying had become too high and researched the reasons for failure.

Flying is so dangerous that it even impaired the United Nations World Food Program’s efforts to deliver aid to suffering populations around the continent. After a spate of crashes killed several WFP workers, the agency five years ago ordered a broad review of safety conditions on its flights.

[…]

At the root of Africa’s dismal air-safety record are low investment, crumbling infrastructure and lax national authorities. Across much of the continent, there is minimal air-traffic control or regulation, and pilots often fly without basic navigational aids like radar. National air authorities in many impoverished sub-Saharan states struggle to pay their bills, retain good staff and meet the minimal air-safety standards set by the U.N.’s International Civil Aviation Organization.

In some lawless regions, almost anyone with an airplane can fly with no oversight. In countries like Sudan and Congo, unpaved airstrips often double as soccer fields where children mark goal posts with piles of rocks — into which planes hurtle upon landing — according to WFP safety officials.

Note the heavy emphasis on authorities, regulation, staff, standards, oversight…technology is only briefly mentioned and it is in regard to navigational aids. The emphasis in prevention programs seems to be on improvements in procedures and training.

Back to the post on cloud, I see a similar theme emerge at the end of Halamka’s lament.

Problems on a centralized cloud architecture that is homogenous, well documented and highly staffed can be more rapidly resolved than problems in distributed, poorly staffed, one-off installations.

That seems to be a comparison between cloud and non-cloud.

First, of course anything can have advantages if it is homogeneous, documented and staffed well. Cloud does not automatically achieve those aims; he does not say what will prevent a cloud architecture from becoming heterogeneous, undocumented and without sufficient staff. This becomes especially pertinent as failures always push some to advocate for more heterogeneity — survival through diversity — so cloud may be under more pressure to be less homogeneous because of its problems.

Second, diversity may be reduced to help with other variables like staffing and documentation. Thus the advantage of a homogeneous environment over a heterogeneous environment (documentation and staff being equal) is a different question, which he leaves unanswered. The typical response to this from an operations perspective is that heterogeneity is expensive. Yet an expense can pay for itself quickly if it delivers higher availability. Heterogeneity should not be assumed always to be bad for security.

Third, Halamka points to his limited staff as a reason to use public cloud, and highlights his priorities.

The reason to use the public cloud is so that my limited staff can spend their time innovating – creating infrastructure and applications that the public cloud has not yet envisioned or refuses to support because of regulatory requirements (such as HIPAA).

I think that brings us back to the question of people and process failure in services rather than events unique to a cloud. In other words, his airline offers planes and routes, but they could have a partner fly you to Africa so they can focus on their own flights. If you want to manage the risk of being on a partner’s flight, make certain you look carefully at the role and track record of authorities, regulation, staff, standards, and oversight.

Abandon is a harsh word and hard to define in the world of technology (e.g. rapid change, high demand) yet Halamka’s post emphasises to me that it’s a really good time to increase the ability of cloud to meet standards and pass oversight.
