Map/Reduce is a divide-and-conquer strategy for processing data. The data is normally divided along natural lines, and the approach can yield some really impressive performance gains.
It does this by limiting the data that any one unit of processing has to consider. Here’s a good way to think of it, with an anecdote originally provided by Owen Taylor:
Imagine if you had two types of popcorn at a party; one type is savory, and the other is sweet. You could have both types in a single bowl, and partygoers who wanted sweet popcorn could dig through the bowl looking for what they wanted; those who preferred savory popcorn could do the same.
Apart from sanitary concerns, this is slow.
What you could alternatively do, however, is provide two bowls: one with savory popcorn, the other with sweet popcorn. Then people would just line up for the bowl that had the popcorn they liked. Speedy, and potentially far more sanitary. Just line up in front of the neckbeard with the cold who keeps sneezing into his hand.
To explain it in computing terms, let’s determine a requirement with some artificial constraints.
Our project is to count marbles; we have four colors (red, green, blue, white) and our data storage doesn’t provide easy collation… or we have terabytes’ worth of marbles, which might be a more realistic requirement.
The lack of collation matters for our artificial requirements; ordinarily one would index the marbles by color in the database, issue a simple “select color, count(*) from marbles group by color”, and be done. With a giant dataset, though, that select could be very expensive; it might have to walk the entire dataset just to group by color.
Let’s be real: the best approach for this actual requirement (assuming a relational database) is to use a database trigger to update an actual count of the marbles as the marbles are added or removed.
So what’s another approach? Well, what you could do is use a sharded datastore.
Imagine a datastore in which a write directed data to one of many nodes, based on the data itself. If we had four nodes, we’d shard based on the marble color; node one would get all red marbles, node two would get all green marbles, node three would get all blue marbles, and node four would get all white marbles.
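Here’s a minimal sketch of what that routing might look like, assuming marbles are represented by plain color strings and nodes are numbered 0 through 3 (the names here are mine, not from any particular product):

```java
import java.util.Map;

// Hypothetical router: with four nodes and four colors, each color maps
// cleanly onto its own shard. A real sharding datastore does this for you.
public class ColorRouter {
    private static final Map<String, Integer> NODE_BY_COLOR =
            Map.of("red", 0, "green", 1, "blue", 2, "white", 3);

    /** Returns the index of the node that should store a marble of the given color. */
    public static int nodeFor(String color) {
        Integer node = NODE_BY_COLOR.get(color.toLowerCase());
        if (node == null) {
            throw new IllegalArgumentException("Unknown color: " + color);
        }
        return node;
    }
}
```

A write for a green marble would land on node 1, a white marble on node 3, and so on.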
If we have direct access to our data – embedded in the sharding mechanism, perhaps – we can then count the marbles very, very quickly. We don’t even have to consider what color a given marble is, only which node it lives on. Our structure would look like this (a rough code sketch follows the list):
- For each node, count the marbles the node contains, and return that count, along with the color of the first marble encountered.
- Collect each count and build a map based on the marble color and the count.
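As a sketch of those two steps, assuming each node’s data is just a list of color strings and ignoring the networking:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two steps above. Each "node" is modeled as a list of color strings.
public class MarbleCount {
    // What a node sends back: the color it saw first, and how many marbles it holds.
    record NodeResult(String color, long count) {}

    // Step one, run on each node against only that node's data.
    static NodeResult countNode(List<String> nodeMarbles) {
        return new NodeResult(nodeMarbles.get(0), nodeMarbles.size());
    }

    // Step two: collect each count and build a map keyed by color.
    static Map<String, Long> reduce(List<NodeResult> results) {
        Map<String, Long> totals = new HashMap<>();
        for (NodeResult r : results) {
            totals.merge(r.color(), r.count(), Long::sum);
        }
        return totals;
    }
}
```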
If the shards have their own CPUs, you can see how the runtime ends up being only as long as it takes to iterate over the largest dataset, plus a touch of network communication (which normally won’t factor in compared to the first operation).
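To make the parallelism concrete, here’s a hypothetical driver that reuses the MarbleCount sketch above; a parallel stream stands in for “each shard has its own CPU,” so the wall-clock time is roughly that of the slowest node:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Each node is counted independently and concurrently; the reduction then collates
// the per-node results. In a real deployment the counting runs on the shards themselves.
public class ParallelDriver {
    public static Map<String, Long> countAll(List<List<String>> nodes) {
        return MarbleCount.reduce(
                nodes.parallelStream()
                     .map(MarbleCount::countNode)      // one result per node
                     .collect(Collectors.toList()));   // gather, then reduce
    }
}
```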
The first step – where you send a request to each of the shards to collect data – is called the Map phase.
The second step – where you collate the data returned by the Map – is called the Reduction phase.
Thus: “Map/Reduce.” You map a request across many nodes, then reduce the results into a single cohesive return value.
Of course, I’ve simplified my requirements such that I only iterate over the marbles, gathering a simple count. The real world is rarely so convenient; you’d be more likely to have marbles of multiple colors in each node.
In this case, you’re still iterating over each marble, doing a collation – definitely slower than a simple count, but your runtime is still going to only be as long as it takes to iterate over the largest node’s data.
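A sketch of that messier map step, under the same list-of-strings assumption: each node now returns its own color-to-count map rather than a single number, but it’s still one pass over that node’s data.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Map phase for a mixed node: group this node's marbles by color and count each group.
public class MixedNodeCount {
    static Map<String, Long> countNode(List<String> nodeMarbles) {
        return nodeMarbles.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```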
This is really important, by the way. Your data distribution is critical. You’re not going to run in one fourth the time if you have four nodes; you’re going to run as long as it takes to process the node that takes the longest. If your nodes take 200ms, 220ms, 225ms, and then 1372ms… then your runtime is going to be 1372ms. Compare that to the 2017ms that it would take if you only had one node. If each one takes the same amount of time – let’s say 400ms – then your total runtime will be (roughly) 400ms.
The reduction would be a little more complicated if you’re looking at more than one color per shard, but not by much; you’d have a map of color and counts returned from the reduction already, so you’d simply be combining the results rather than building a map of the results in the first place.
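Continuing the sketch, that reduction just merges the per-node maps, summing counts where a color shows up on more than one node:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reduction for mixed nodes: the per-node results already have the right shape,
// so combining them is a matter of merging maps and summing the counts.
public class MixedNodeReduce {
    static Map<String, Long> reduce(List<Map<String, Long>> perNodeCounts) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> nodeCounts : perNodeCounts) {
            nodeCounts.forEach((color, count) -> totals.merge(color, count, Long::sum));
        }
        return totals;
    }
}
```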
Map/Reduce is a handy way to leverage multiple CPUs to gather data quickly – but it means your data has to be sharded in the first place, and sharded in a way you can actually exploit. Hadoop and Gluster can do it, with the filesystem itself serving as the shards; in-memory products like Infinispan (produced by Red Hat, by whom I am employed), Gigaspaces, Coherence, Terracotta DSO, and even lowly GridGain can manage Map/Reduce, and the in-memory nature of the shards yields some truly impressive speed gains.
Map/Reduce is proven and useful, and one of the biggest logical drivers for “the cloud.”