Cloud computing: Threat or Menace?

("Cloud computing," for those of you not up on industry jargon, refers to "a style of computing in which resources are provided 'as a service' over the Internet to users who need not have knowledge of, expertise in, or control over the technology infrastructure." The canonical example would be Google Docs: fully functional office apps delivered entirely via one's web browser.)

Lots of big companies are hot for cloud computing right now, in order to sell more servers, capture more customers, or outsource more support. But there's a problem. As the company I was working with started to detail their (public) cloud computing ideas, I was struck by the degree to which cloud computing represents a technical strategy that's the very opposite of resilient, dangerously so. I'll explain why in the extended entry.

But before I do so, I should say this: A resilient cloud is certainly possible, but would mean setting aside some of the cherished elements of the cloud vision. Distributed, individual systems would remain the primary tool of interaction with one's information. Data would live both locally and in the cloud, with updates happening in real time if possible, delayed if necessary, but always invisibly. All cloud content would be in open formats, so that alternative tools can be used as desired or needed. Ideally, a personal system would be able to replicate data to multiple distinct clouds, to avoid monoculture and single-point-of-failure problems. This version of the cloud is less a primary source for computing services, and more a fail-safe repository. If my personal system fails, all of my data remains available and accessible via the cloud; if the cloud fails, all of my data remains available and accessible via my personal system.
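To make that concrete, here's a minimal sketch in Python of the arrangement described above: the local copy is primary, every save is queued for each of several clouds, and anything an unreachable cloud misses is delivered on a later sync. Everything in it is hypothetical and for illustration only. `InMemoryCloud`, `LocalFirstStore`, and `CloudUnavailable` are names I've made up, and the "clouds" are just dictionaries standing in for real network services.

```python
from dataclasses import dataclass, field


class CloudUnavailable(Exception):
    """Raised by a (hypothetical) cloud backend that cannot be reached."""


@dataclass
class InMemoryCloud:
    """Stand-in for one cloud provider; a real one would talk over the network."""
    name: str
    online: bool = True
    documents: dict = field(default_factory=dict)

    def put(self, key, text):
        if not self.online:
            raise CloudUnavailable(self.name)
        self.documents[key] = text

    def snapshot(self):
        if not self.online:
            raise CloudUnavailable(self.name)
        return dict(self.documents)


@dataclass
class LocalFirstStore:
    """Keeps the primary copy locally and replicates it to several clouds."""
    clouds: list
    local: dict = field(default_factory=dict)
    pending: dict = field(default_factory=dict)  # cloud name -> {key: text}

    def save(self, key, text):
        self.local[key] = text  # the local write never waits on the network
        for cloud in self.clouds:
            self.pending.setdefault(cloud.name, {})[key] = text
        self.sync()

    def sync(self):
        """Push queued updates; whatever can't be delivered stays queued."""
        for cloud in self.clouds:
            queue = self.pending.get(cloud.name, {})
            for key in list(queue):
                try:
                    cloud.put(key, queue[key])
                except CloudUnavailable:
                    break  # try this cloud again on a later sync
                del queue[key]

    def restore_local(self):
        """After a local failure, rebuild from whichever clouds are reachable."""
        for cloud in self.clouds:
            try:
                remote = cloud.snapshot()
            except CloudUnavailable:
                continue
            for key, text in remote.items():
                self.local.setdefault(key, text)


# Example: one cloud is down when the document is saved.
cloud_a = InMemoryCloud("cloud-a")
cloud_b = InMemoryCloud("cloud-b", online=False)
store = LocalFirstStore([cloud_a, cloud_b])
store.save("notes.txt", "draft 1")  # lands locally and on cloud-a; queued for cloud-b
cloud_b.online = True
store.sync()                        # the delayed update now reaches cloud-b
```

The point of the design is that no single failure, local or remote, leaves the user without a working copy. A real implementation would also need conflict handling and open file formats, which this sketch leaves out.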

This version of cloud computing is certainly possible, but is not where the industry is heading. And that's a problem.

For big computer companies, the cloud computing model breathes new life into the centralized server markets that were once their bread and butter, markets that offer high profits on sales and service contracts. Cloud computing doesn't just use a server to store and transfer files; it uses the servers to do the hard computing work, too, in principle making your personal machine little more than a fancy dumb terminal. Companies that already have significant server and bandwidth capacity, such as Amazon and Google, love the idea because it offers them more ways to lock users into proprietary formats and utilities. For many of the corporate users looking at cloud services, that's a worthwhile trade-off to avoid having to deal with continuously expanding IT expenditures. Let the cloud companies worry about the software and hardware upgrades; all we need to handle are the dumb terminals.

Cost-effective, perhaps. But by no means resilient.

Recall that the core premise of a resilience strategy is that failure happens, and that the precise mode of failure can't necessarily be predicted. Resilience demands that we prepare for unexpected problems so as to minimize actual disruption -- minimize in terms of time, but particularly in terms of how widespread the disruption may be.

Resilience design principles include: Diversity (or avoidance of monocultures); Redundancy; Decentralization; Transparency; Collaboration; Graceful Failure; Minimal Footprint; Flexibility; Openness; Reversibility; and Foresight. As per Jim Moore's comments on this post, we should add "Spare Capacity" to the list.

How does cloud computing match up?

On the positive side, the standard (Google Apps) model for cloud computing does well with collaboration, reversibility, and (arguably) spare capacity. While the collaboration and reversibility aspects of these apps could likely be replicated with standard desktop software, they're intrinsic to the cloud approach and fundamental to its appeal.

Conversely, cloud computing clearly falls well short in terms of diversity, decentralization, graceful failure, and flexibility; one might also include redundancy, transparency, and openness on the negative list.

Here's where we get to the heart of the problem. Centralization is the core of the cloud computing model, meaning that anything that takes down the centralized service -- a network failure, a massive malware hit, a denial-of-service attack, and so forth -- affects everyone who uses that service. When the documents and the tools both live in the cloud, there's no way for someone to continue working in this failure state. If users don't have their own personal backups (and alternative apps), they're stuck.

Similarly, if a bug affects the cloud application, everyone who uses that application is hurt by it. As the cloud applications and services become more sophisticated (well beyond word processors and spreadsheets), pulling up an alternative system to manipulate the same data becomes far more difficult -- especially if the failed cloud application limits access to stored content.

Flexibility suffers when one is limited to just the applications available on the cloud. That's not much of a worry right now, when most cloud computing takes place via normal laptops and desktop computers, able to load and run any kind of application. It's a greater problem in the future envisioned by many cloud proponents, where people carry systems that provide little more than cloud access.

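There's also the issue of how well the cloud model fares when network access is spotty or degraded: an application that lives entirely on a remote server is only as usable as the connection to it.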

In short, the cloud computing model envisioned by many tech pundits (and tech companies) is a wonderful system when it works, and a nightmare when it fails. And the more people come to depend upon it, the bigger the nightmare. For an individual, a crashed laptop and a crashed cloud may be initially indistinguishable, but the former only afflicts one person and one point of access to information. If a cloud system locks up, potentially millions of people lose access.
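A bit of back-of-the-envelope arithmetic shows why the two kinds of failure differ even when they look the same to one person. The sketch below is Python, and it rests on assumptions of my own rather than any measured data: every machine, personal or central, has the same chance `p_fail` of being down on a given day, and personal machines fail independently of one another. The function name `outage_profile` is likewise just an illustration.

```python
def outage_profile(users: int, p_fail: float, centralized: bool) -> dict:
    """Compare average and worst-case disruption under the stated assumptions."""
    # On average, the same number of people are cut off in either model.
    expected_down = users * p_fail
    if centralized:
        # One shared service: when it fails, everyone is down at once.
        p_everyone_down = p_fail
    else:
        # Independent personal machines: all of them must fail the same day.
        p_everyone_down = p_fail ** users
    return {"expected_down": expected_down, "p_everyone_down": p_everyone_down}


# Illustrative numbers only: 1,000 users and a 1% daily failure chance.
cloud = outage_profile(1000, 0.01, centralized=True)
laptops = outage_profile(1000, 0.01, centralized=False)
```

Under these assumptions the average disruption is identical, but the chance that everyone is locked out at the same moment is `p_fail` for the centralized service and vanishingly small for the independent machines. That gap in how widespread a single failure can be is what a resilience strategy cares about.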

So what does all of this mean?

My take is that cloud computing, for all of its apparent (and supposed) benefits, stands to lose legitimacy and support (financial and otherwise) when the first big, millions-of-people-affecting failure hits. Companies that tie themselves too closely to this particular model, as either service providers or customers, could be in real trouble. Conversely, if the big failure hits before the cloud has swept up lots of users and visibility, the failure could be a signal to shift towards a more resilient model.

I would love to use the resilient cloud described above, and I suspect I'm not alone. But who's going to provide it?