Mar 20, 2009
I have been speaking more frequently about cloud computing in recent days. As SOA is becoming a daily reality, I needed to advance to new, still slightly nebulous topic. What could be a better fit than cloud computing?
Seriously, cloud computing is an exciting topic. Many of the underlying technologies and principles have been around, mostly in distributed systems and systems integration. But as so often, it took some time for things to reach a tipping point where cheap hardware and storage is just cheap enough and abundant bandwidth is just abundant enough to make this a reality for a wide range of developers. Many people (including my cynic self) often criticize our industry for being too much driven by fads. Strangely, I am finding myself increasingly convinced that it’s the only way the industry can really advance. We have so many different technologies on our hands that it’s almost impossible to focus, turning us into a collective scatterbrain. In the days of Gigahertz processor in every other refrigerator, technically almost anything is possible. But it really takes a number of people pulling the same direction for disruptive change to happen. The best way for these people to find each other is to collectively jump on the latest fad bandwagon. Objects were a big fad: "created by the people who could not grasp structured analysis" (Kent Beck), complete with the usual dose of hype, useless tools, and confusion. Still, we should be quite happy objects became mainstream. With this in mind, I decided to jump on the cloud computing wagon.
Google is no newbie (or Noogler, as we like to say) to building large-scale distributed systems, so we should have something to say in this topic space. Some of it may not even be top-secret. All conspiracy jokes aside (I work on Tokyo, not on the moon base), Google has been publishing a wide range of papers on our research pages. There you can learn about some of the technologies that form the foundation of Google computing infrastructure, such as the Google File System, Bigtable, MapReduce, or the Sawzall log analysis language. And of course we also have a public product in this space, namely Google App Engine. I have spoken about App Engine a few times, as I consider it a very interesting and very demo-able product. Alas, giving product demos always carries the danger of turning you into the proverbial corporate sock puppet. At last year’s JAOO conference Dan North cautioned me after my talk that the hand seemed to be inside the sock puppet already, even if it wasn’t quite moving yet. A painful image indeed, and a scenario I’d like to avoid. So I prefer to talk about App Engine in the proper context of cloud computing, explaining what’s different when programming the cloud and how Google builds systems that are almost infinitely scalable.
I always cringe when every self-respecting self-help methodology is based on a dorky acronym – from KISS (Keep It Simple, Stupid) to the ADDIE model of instructional design (Analysis - Design - Development - Implementation - Evaluation) or SMART goals (Specific, Measurable, Achievable, Realistic, Timed). Of course our industry is equally littered with (maybe less cute, but equally dorky) acronyms. And let’s face it; people tend to remember only a handful of things from a 45 minute talk, so making things catchy can be useful.
One of the most inescapable acronyms in software must be ACID – the Atomic, Consistent, Isolated, and Durable property of (database) transactions. These properties form the backbone of many a computer system, ensuring a precise, predictable outcome for database (and other) operations, having promoted ACID to a sort of axiom of modern computing.
As I illustrated quite a while ago, your local coffee shop and many other distributed, high throughput systems have chosen to trade off some of these properties in favor of scalability and throughput. Instead of coming up with a new acronym to describe the architecture of these systems, a very clever person has managed to rebrand the old one with new meaning. ACID now stands for:
I learned about this catchy “new acid” acronym from Pat Helland’s when we discussed his excellent paper on (the demise of) distributed transactions. Pat told me he picked it up from Shel Finkelstein from SAP, but we are not sure where exactly it originated. The software industry has been chasing reuse so much, now here we have a good example of it.
While the original ACID properties are all about correctness and precision, the new ones are about flexibility and robustness. Being associative means that a chain of operations can be performed in any order, the classical example being (A + B) + C = A + (B + C) for the associative operator ‘+’. Likewise, we remember commutative from junior high math as A + B = B + A, meaning an operator can be applied to a pair of operands in either direction. Finally, a function is idempotent if its repeated application does not change the result, i.e. f(x) = f(f(x)).
What does high-school math have to do with cloud computing and distributed systems? The new ACID properties give cloud-based systems the necessary flexibility to distribute and parallelize operations. As I lectured many times before, distributed computing is to a large extent about reducing assumptions. Too many assumptions make systems tightly coupled and rigid. Ordering assumptions in particular restrain parallel processing, which typically does not respect a global order. Put another way, ordering is the prototypical example of state-fulness, but we know that state-lessness is a key property of high-throughput systems.
Back to the new ACID, let’s assume I need to process a large set of records to compute the sum of a certain elements across all records. I can distribute this operation by dividing the input record set into multiple chunks, process each chunk on a separate machine, and then aggregate the intermediate results into the final sum. This approach works because the sum operator is associative and commutative: I can sum up intermediate results for each chunk as they come along. I can then sum up intermediate results into the final result. Luckily, many aggregation operations have these properties. Computing averages (if you remember how many elements are in each sub-result), computing percentiles, statistical samples, etc shares the same properties.
In the world of distributed systems, idempotency translates to the fact that an operation can be invoked repeatedly without changing the result. Some operations, such as reading a value or setting a field, are inherently idempotent, while others can be made idempotent by using concepts such as Correlation Identifiers. Idempotency is critical in distributed computing to deal with the inherent uncertainty between communicating systems. If System A communicates with System B, A is inherently uncertain about B’s actual state – it merely makes assumptions about B’s state based on the communication it receives. If anything goes wrong, such as no response from B, A usually left with two choices: retry the operation or give up. Giving up being the appealing of the two, A will be inclined to retry. If the original action was in fact processed by B, B now receives a second request for the same operation. Idempotency resolves this conundrum by ensuring the repeated request does not have any effect beyond the original request.
The last property, being distributed is kind of a truism in cloud computing. Unlike the army, a cloud of one is not that interesting. I guess the original author had to find something to make the ACID acronym work. After my talk on this topic at DevSummit in Tokyo, Tadayoshi Sato from OGIS challenged me to find a better “D”. I am thinking about it…
So where does Google App Engine fit in regards to distributed transactions? App Engine uses Bigtable storage under the covers. Bigtable does not support traditional (“old ACID”) transactions because transactional guarantees can put a heavy burden on a highly distributed data storage model. In case of Bigtable, shedding transactional guarantees is not such a big deal because Bigtable also does not support foreign keys or joins. After all it’s Big-table, not Big-database.
Google App Engine aims to present a programming model that’s familiar to engineers, but scales up to enormous scale: "easy to start, easy to scale". In order to get the latter, we have to make compromises on transactional guarantees. In order to get the former, we need to offer developers a convenient storage model that does not simply drop the burden of consisting data into the application developer’s lap. The answer lies in allowing developers to declare entity groups (). Google App Engine supports a richer storage model than out-of-box Bigtable by offering object-storage mapping (skipping the relational part). Python entities can be persisted into the storage layer based on a primary key. By default, only single entity updates are atomic. However, developers can place multiple entities in a single entity group, which provides atomic updates along the lines of old-style ACID. The trade-off is largely convenience and consistency vs. throughput. Entity groups incur a definite cost in a highly distributed environment, so we ask developers to choose wisely. Brett Slatkin ’s Google I/O talk on App Engine performance does a nice job of explaining the trade-offs.
If you'd like to hear more from me about Cloud Computing, I'll be speaking at QCon Tokyo and participating in a panel at JavaOne in June. As always, you can track my public appearances on my Talks Page.