The key to distributed (and enterprise) computing is boundary management. Worse, it's not conceptual boundary management but real boundary management, which is a lot harder to figure out.
A boundary is a barrier between things or places, right? Well, consider a web browser and a web server; the TCP/IP layer is a boundary.
The application has a boundary between itself and its datastore.
You can play games with definitions here all day, and you wouldn't be wrong: an object/relational mapper (an ORM) is a way to get around the boundary between your object-oriented language and your tuple-oriented database, for example.
The goal of a boundary, here, is to be as invisible as possible. ORMs fail because they aren't invisible: they affect your data model fairly severely, they don't leverage the strengths of the underlying data store very well (although I suppose they'll stay out of your way while you do an end-run around them), and they affect your code pretty severely as well.
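To make that concrete, here's a minimal JPA-style sketch; the Order class, its table, and its columns are invented for illustration. Notice that everything interesting in it exists for the database's benefit, not the object model's: the boundary is anything but invisible.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;

// Hypothetical domain class. Every annotation below is relational plumbing
// leaking into the object model.
@Entity
@Table(name = "orders")
public class Order {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id; // surrogate key the relational side demands

    @Column(name = "customer_name", nullable = false, length = 80)
    private String customerName; // column metadata bleeding into the class

    protected Order() {
        // no-arg constructor required by the JPA spec, not by your design
    }

    public Order(String customerName) {
        this.customerName = customerName;
    }

    public Long getId() {
        return id;
    }

    public String getCustomerName() {
        return customerName;
    }
}
```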
JMS fails too, sort of: it provides an excellent boundary between the originator of a request and the consumer of that request, but the implementations include that pesky network all the time. You can eliminate the network, but then you're not really distributing.
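Here's roughly what that looks like with the plain JMS 1.1 API; the queue name and the class are made up. The API completely hides who consumes the message, which is the excellent part, but every send hops through the broker, network and all.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

// Hypothetical submitter. JMS gives us a clean boundary: we never learn who
// consumes "tasks" or where. But the broker hop (and the network) is baked in.
public class TaskSubmitter {
    private final ConnectionFactory factory; // supplied by your JMS provider

    public TaskSubmitter(ConnectionFactory factory) {
        this.factory = factory;
    }

    public void submit(String payload) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("tasks");
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage(payload));
        } finally {
            connection.close();
        }
    }
}
```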
SOAP… well, it's just six pails of fail. REST is better because it's so much lighter, but it never hides the network layer from anything.
A distributed application should be able to cross network boundaries without making that boundary management stick out.
You should be able to submit a task or data item or event to a system and have that system cross boundaries if necessary, or localize processing if it can. It shouldn’t consume network bandwidth just because you said “this can run somewhere else.” It should use the network only if the local node is inappropriate for handling a given request.
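A sketch of what I mean, with an invented RemoteGrid interface standing in for whatever your platform actually provides: "this can run somewhere else" should be a permission the scheduler exploits only when it has to, not an order to hit the network.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical stand-in for a distribution platform's remote-execution API.
interface RemoteGrid {
    <T> Future<T> submit(Callable<T> task);
}

// Pays the network cost only when the local node is inappropriate for the task.
public class LocalityAwareScheduler {
    private final ExecutorService localExecutor;
    private final RemoteGrid grid;

    public LocalityAwareScheduler(ExecutorService localExecutor, RemoteGrid grid) {
        this.localExecutor = localExecutor;
        this.grid = grid;
    }

    public <T> Future<T> submit(Callable<T> task, boolean localNodeSuitable) {
        // "Can run elsewhere" is permission, not a command: stay local if we can.
        return localNodeSuitable ? localExecutor.submit(task) : grid.submit(task);
    }
}
```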
Data should work this way, too, and here's where I think most distributed computing platforms fail hard: they provide fantastic capabilities to share data (Terracotta DSO, Coherence, and Hadoop come to mind) or the ability to distribute processing (JMS and GridGain come to mind), but each of these has limits.
When you can share data but not processing, you have localization and sharding concerns (some of which are supposed to be manageable by the platform; DSO is supposed to be able to do this, for example, although I couldn't quickly find any references that discussed how).
When you can share out a process but not the data, you end up creating a choke point around access to your data, or you have to carry all your data along with you. (Hello, JMS and friends!)
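A tiny sketch of that second failure mode, with an invented task class: because the remote worker has no shared view of the data, the task has to haul its whole working set over the wire just to do its job.

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical task. Everything it needs must travel with it, because the
// worker on the other end can't see our data any other way.
public class PriceRecalculationTask implements Serializable {
    private static final long serialVersionUID = 1L;

    private final List<Double> linePrices; // the "carry your data along" tax

    public PriceRecalculationTask(List<Double> linePrices) {
        this.linePrices = linePrices;
    }

    public double run() {
        // All work happens against the snapshot we dragged across the network.
        double total = 0.0;
        for (double price : linePrices) {
            total += price;
        }
        return total;
    }
}
```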
The key here is to keep your eye on the boundaries between your architectural components – and if those boundaries slow you down, well, that’s where your potential for optimization lies.
It's impossible to say generically whether the right move is sharding a database for local access to critical data, or partitioning tasks so that map/reduce is efficient in the mapping phase, or any of a number of other possibilities. The distribution platform you've chosen will massively affect what your options are here.
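For the sharding flavor of optimization, the core idea is nothing more than consistent routing by key so that critical data stays local to one node; here's a deliberately naive sketch (real platforms layer rebalancing, replication, and much more on top):

```java
// Hypothetical shard router: deterministically maps a key to one of N shards,
// so every node agrees on where a given piece of data lives.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    public int shardFor(Object key) {
        // floorMod keeps negative hash codes from yielding negative shard ids.
        return Math.floorMod(key.hashCode(), shardCount);
    }
}
```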
Don’t forget that key, though: if you manage the boundaries between components well, your application is going to win.
ObEmployerReference: I work for GigaSpaces, and IMO GigaSpaces XAP manages the boundaries impressively well, automating most of it out of your hands. If you’d like to see how that happens, feel free to point out a problem space and maybe we can show you how XAP can help with it.
Author’s Note: Repost, clearly having been written while I worked for GigaSpaces. Some concepts are out of date, because some of the ideas caught on; this is preserved for posterity.