Are You Driving a DeLorean or a Toyota?

My blog posts related to IT strategy, enterprise architecture, digital transformation, and cloud have moved to a new home: ArchitectElevator.com.

Indigo SDR

I am in Seattle at Microsoft's Indigo SDR (Strategic Design Review). Oops -- now Ari will send me to jail again for blogging about this. Oh wait, this time the bits are available as a CTP (Community Technology Preview), so I guess I am legit. Actually, this post is not really about Indigo but the result of a great discussion we had at the birds of a feather roundtables. I'll save the Indigo ramblings for another posting - I need to get that stuff setup on my machine first.

The Toyota

Indigo implements WS-ReliableMessaging, a protocol overlaid over SOAP that makes message transmission reliable even if the underlying transport is not (e.g., HTTP) or if the communication contains unreliable intermediaries. The way it accomplishes this is by incorporating a retry and acknowledgment protocol that resubmits failed messages. The receiver is idempotent so it can filter out potential duplicate resends. Combined with a persistent send buffer this makes the message transmission reliable even if the underlying network or protocol is not (if you use SOAP over a reliable transport like JMS you generally won't need WS-ReliableMessaging). The apparent trade-off is that the retransmission takes time. The less reliable the network is the longer it will take to successfully transmit a series of messages. So you make a conscious decision to trade off speed (latency) for more reliability, sort of like buying a Toyota (since they stopped making the Supra). This approach is definitely desirable for many business applications where a message can mean "Order over $1 million". Waiting a few seconds is definitely better than losing the message.

The DeLorean

Some other systems have very different requirements. A great example is the networked version of Halo 2. A fast paced first-person shooter game like Halo cannot afford to allow significant of latency between machines. If I am fighting an opponent I need to see where the other player is so I can aim accurately. Likewise if I am shot at I want to be able to see the fire so I can duck or run. A latency of even a few hundred milliseconds would make the game unplayable. Therefore, the developers of Halo 2 decided to minimize latency between machines. This meant, though, that they had to give up reliability. Sort of like having a DeLorean with Flux Capacitor but a spotty reliability record. They need to be prepared for the fact that a transmission might get lost. They also need to acknowledge the fact that there is inherent uncertainty about the state of the other machine. Because there is a delay between the actions on one machine and the transmission of the information, one user might see a slightly different world than another user. They use predictive algorithms to try to minimize surprises. For example, if the screen shows another player running from left to right, the local display will show the person continuing to run in the same direction if it has not received any updates. Continuing in the same direction is the best guess the local machine can make in the absence of more accurate information.

No Cheating

An RPC-based system model is based on the basic assumption that latency and unreliability do not exist. Unfortunately, for a distributed system these are bad assumptions. You can often trade off one aspect for the other but it is a definite trade-off. The more reliable you want your system to be the more latency you need to be able to accommodate. In such a system you need to be prepared for the fact that information does not travel instantaneously. This means you should be prepared to deal with long-running interactions, out-of-sequence messages, correlation, orchestration and so on. If you want to reduce latency you usually have to give up some reliability (or make your system not distributed). Once again there is no free lunch...