Repost: Brandon Sanderson might actually understand people

I’ve been reading Robert Jordan’s The Wheel of Time series, trying to finish it at last after having abandoned it fourteen years ago or something like that. I abandoned it because the books were becoming repetitive, and because Robert Jordan created characters who were plastic and immature even then; his fondness for corporal punishment was vaguely offensive.

I had written a critique of the series up through book seven, as I’d given up during book 8; there were a few good books in the series (four and six) but much of the “advancement” involved abuse and misinformation among allies; I’m not a feminist, but the women in the series are portrayed as such buffoons (and powerful buffoons, at that) that I couldn’t quite swallow some of the premises about culture. Unfortunately, my review has been lost in the mists of time.

But the series is finished now, surviving past Jordan’s fatal encounter with cardiac amyloidosis; the final three books were written by Brandon Sanderson, who had Jordan’s notes and the blessing of Jordan’s wife and editor. It’s basically fan fiction as canon for the series.

Book eleven (“Knife of Dreams”) actually showed some progress – I don’t know the timing involved, but it’s almost like Jordan realized that his cash cow wouldn’t be very useful to him once he had passed away. (I am unaware if he knew of his disease by the time he was writing book eleven.) With that, though, Sanderson seems to have decided that it was time for the series to actually wrap up as quickly as it could, while respecting the historical (and glacial) pace – he finished in three books, and I’m only halfway through the first of the three.

But it’s a marvelous change so far! People actually react in ways that you could imagine real people reacting.

For example, the women in the series are all bullies and buffoons; those women who can use magic (or “channel”) are among the worst of them, except for their social peers who are unable to channel. They are the worst. The men, bullied and chastised (and often beaten), simply take it, with the suggestion that they just don’t understand women, but that this is somehow valid behavior.

It’s not.

It’s especially not valid when you absolutely need the investment of the target of bullying.

The main protagonist (among what seems like hundreds of protagonists) is the Dragon Reborn, a reincarnated and tragic hero from the distant past, destined to combat the “Dark One,” dying in the process. He takes a number of wounds – some through ignorance, because even if people have useful information – a rare event – they still won’t share it with him. He takes a number of wounds that will not heal and cannot be healed; he is maimed and marked over and over again.

It’s not an easy role to fulfill. Everyone fears him; many see him as a target, because they’re idiots. Those he loves are targets, and he loves a lot of people.

So naturally, the bullies – remember, this is where I started this thought – spend a lot of time bullying him and those closest to him. “You may be the Dragon Reborn, boy, but I need a fresh cup of tea. Now travel five hundred leagues and get me one. Jerk.” They make promises they can’t keep, but so what? He’s a man, he’ll never be able to tell the difference just because you kept an artifact fatal to him just lying around.

… at least, that’s the way it is in the books Jordan authored.

In The Gathering Storm, the Dragon runs into an artifact that’s, um, fatal to him and his purpose. And commits himself to a path from which he may not be able to recover in order to survive. Because, well, survival. There’s no guarantee that he will win against the Dark One – but if he doesn’t survive to fight the last battle, there’s absolutely a guarantee that he’ll lose. (And for some reason, despite all the bullying, he cares.)

So after he commits himself to this irredeemable path, one of the people who keeps putting him down as a boy who needs to learn his lesson in order to die in battle … runs into him, head on with her failure to protect him from this artifact that she’s had in her possession. Sure, she put it in what she considers a safe place… in her room… in the same house in which one of the Dragon’s most powerful arch-nemeses is kept prisoner. Sure hope the arch-enemy doesn’t somehow break free, maybe through the aid of the arch-enemy’s allies and spies!

Because if the enemy did break free, well, the enemy is super-duper powerful, more powerful than anyone except for the Dragon! And the traps set for anyone trying to reach that artifact, well, the enemy might be a lot stronger!

In other words: “Whoopsie-doodle, boy! I suppose I screwed up a little, for the first time ever. Glad you survived, I guess. Now you’d better pay attention to me and what I have to tell you to do…”

And bless you, Brandon Sanderson! Because the Dragon, the Big Bad of the side of Good, has had enough.

The person who’s been whining at him, derisively calling him “boy,” using the idea that she is supposed to be somehow a trusted advisor, the one who kept that dangerous artifact around? She’s exiled. And all of a sudden, she realizes that if he actually needs her – and apparently he does – her exile would be fatal for the world. All of it. Fatal. Cataclysmic.

Whoopsie-doodle, indeed. And Sanderson doesn’t even write it as if I’m supposed to pity the poor old whiny woman. She’s been abusing the most powerful man in the world, constantly putting him down and punishing him despite his maturity and necessity. And all that abuse and insult comes home to roost, and he’s done with it. Surprise!

… and surprise for me, too, as a reader, because I can’t see Robert Jordan as having allowed the Dragon to have human responses. (Or, well, anyone, but especially the Dragon.)

Excellently done. The rest of the series might redeem the long, long, interminable stretches of dreck that the Wheel of Time had been. I’m now looking forward to seeing what happens.

Repost: Java 8 Streams: filtering and mapping

I’ve been making some progress with my Java 8 streams and lambdas explorations. I’m still not anywhere near an expert yet, and chances are that the literati would see my attempts as childish and ignorant, but that’s okay. How else do you learn?

What I’ve been working on is a tokenizer. I’m feeding data from a corpus, a body of text, into a grinder, to spit out valid tokens.

There’re currently two types of tokens, although this isn’t reflected in code (the rules for being a token are the same regardless of how the token was generated). Each token is generated based on one of two sets of rules.

The first ruleset, which applies to the other ruleset as well, is this:

Each token should map to one word in the corpus; “foo bar” should always map to two tokens.

Each token should have only alphabetic or numeric characters; “foo7” is fine as a token, but “shouldn't” is not; “shouldn't” should be converted to “shouldnt” instead. The content should be trimmed, as well, so whitespace on either side of the word should not be included.

These rules facilitate ease of use: I want to be able to say “give me the list of tokens that corresponds to ‘foo bar baz quux’”, as opposed to having to build up a list by looking up each word by itself. (This is part of the actual project storyboard: a critical part of the overall task is to take text from a web page and generate tokens from it. Parsing the page into individual chunks would be more work than it’s worth… right now.)

The second ruleset adds a few requirements.

First, the token has to be at least three characters long. This trims out a number of extraneous conjunctions, definite articles, pronouns, and other such connecting words that tend to add up to noise.

Secondly, the token should not exist in a list of stop words. The stop words include articles and conjunctions, for example, but also add a number of other typically common words.

Thirdly, the tokens should be stemmed, meaning that they should be reduced to their base form. “Amenity”, for example, has the root word of “amen”; “porterhouse” has a root word of “porterhous”; “aversion” has a root stem of “avers”.

This sounds like a perfect application for the streams. Get a list of the words from the text, and process each one. Here’s some incomplete code for grabbing the text and mapping it to a set of tokens in a histogram object:

public final Histogram buildHistogram(String inputs) {
    Histogram histogram = new Histogram();
    // this takes a simple input
    Arrays.stream(inputs.trim().split(" "))
            .filter(s -> s.length() > 2) // keep only words of three or more characters
            .map(String::toLowerCase)
            .map(this::getToken)
            .forEach(histogram::add);
    return histogram;
}

… Yuck.

For one thing, this doesn’t actually apply many of our criteria, and doesn’t address the two sets of criteria at all.

It works, though: it trims the input, then splits along spaces (which is generally correct); any short words get ignored, then the text is converted to lower case; the resulting text is mapped to a token, which is added to the histogram.

But all of the tokenizers should do things the same way, just with different rules. This sounds like a mission for a custom filter and mapper, to accept sets of filters and mappings. This way, each tokenizer would have a set of rules and mappings, and the code to actually do the tokenization would not have to change.

So how would this look in actual code? Well, here’s the buildHistogram() method again, with the custom filter and mapper, and the default ruleset:

public final Histogram buildHistogram(String inputs) {
    Histogram histogram = new Histogram();
    // this takes a simple input
    Arrays.stream(inputs.trim().split(" "))
            .map(this::evaluateMappings)
            .filter(this::evaluatePredicates)
            .map(s -> getToken(s))
            .forEach(histogram::add);
    return histogram;
}

private String evaluateMappings(final String s) {
    String text = s;
    for (Function<String, String> f : getMappings()) {
        text = f.apply(text);
        if (text == null || text.isEmpty()) { // check the mapped text, not the original input
            return "";
        }
    }
    return text;
}

private boolean evaluatePredicates(String s) {
    if (s == null || s.isEmpty()) {
        return false;
    }
    for (Predicate<String> p : getPredicates()) {
        if (!p.test(s)) {
            return false;
        }
    }
    return true;
}

protected final Predicate<String> minLengthFilter = text -> text.length() > 2;

@SuppressWarnings("unchecked")
protected Predicate<String>[] getPredicates() {
    return new Predicate[]{};
}

protected final Function<String, String> toAlphanumeric = s -> {
    StringBuilder sb = new StringBuilder();
    for (char ch : s.toCharArray()) {
        if (Character.isLetterOrDigit(ch)) {
            sb.append(Character.toLowerCase(ch));
        }
    }
    return sb.toString();
};
protected final Function<String, String> toTrimmedForm = this::normalized;

@SuppressWarnings("unchecked")
protected Function<String, String>[] getMappings() {
    return new Function[]{toTrimmedForm, toAlphanumeric,};
}

This means the changes for other tokenizers are limited to getMappings() and getPredicates(). An incomplete version of the stemming tokenizer looks like this (and, oddly enough, it doesn’t actually have the stemming code yet):

public class StemmingTokenizer extends Tokenizer {
    final static Set<String> stopWords;
    static { 
        /* Code to initialize the stop words goes here, but isn't copied */
        stopWords=new HashSet<>();
    }

    Predicate<String> stopWordsFilter=s -> !stopWords.contains(s);

    @SuppressWarnings("unchecked")
    @Override
    protected Predicate<String>[] getPredicates() {
        return new Predicate[]{minLengthFilter, stopWordsFilter,};
    }

    @SuppressWarnings("unchecked")
    @Override
    protected Function<String, String>[] getMappings() {
        return new Function[]{toTrimmedForm, toAlphanumeric};
    }
}

As it stands, this will correctly trim out the stop words, and all it needs is a stemming class and a lambda (like the toAlphanumeric function) to stem the text.

Now, here’s the thing: is this good code?

I don’t entirely know. I know that it works, because I’ve tested it; the stemmer isn’t here, but stemming’s not rocket science. (There are plenty of good stemmers in Java, using either the Snowball or Porter algorithms; use Snowball. It’s an enhanced and corrected Porter.)

But I can’t help but wonder if the way I’m doing this isn’t idiomatic. Maybe there’s some magic way to apply a set of filters and mappings that I just haven’t seen; I’ve tried to think some about how this would be written and specified such that it would be correct for the general case, and failed.
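One possible answer, offered as a sketch rather than as the idiomatic way: Function.andThen and Predicate.and both exist in java.util.function, so a list of mappings can be folded into a single function, and a list of predicates into a single predicate, with reduce(). The class name ComposedRules and the sample rules below are assumptions made for illustration; they are not the tokenizer's actual code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical class, for illustration only.
class ComposedRules {
    // Fold a list of mappings into one function, applied left to right.
    static Function<String, String> compose(List<Function<String, String>> mappings) {
        return mappings.stream().reduce(Function.identity(), Function::andThen);
    }

    // Fold a list of predicates into one that passes only if every rule passes.
    static Predicate<String> all(List<Predicate<String>> predicates) {
        return predicates.stream().reduce(s -> true, Predicate::and);
    }

    static List<String> tokenize(String input) {
        // Sample rules, standing in for getMappings() and getPredicates().
        List<Function<String, String>> mappings = Arrays.asList(
                String::trim,
                String::toLowerCase,
                s -> s.replaceAll("[^a-z0-9]", ""));
        List<Predicate<String>> predicates = Arrays.asList(
                s -> !s.isEmpty(),
                s -> s.length() > 2);

        return Arrays.stream(input.trim().split("\\s+"))
                .map(compose(mappings))
                .filter(all(predicates))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [you, shouldnt, foo7]
        System.out.println(tokenize("You shouldn't do it, foo7"));
    }
}
```

With this shape, a subclass would only need to supply different lists; whether that reads better than the explicit loops is a matter of taste.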

Repost: Playing with Java 8 Streams

I’ve been playing around with some more neural network algorithms lately, which has given me yet another chance to revisit a machine learning library. Since Java 8 is due later this month, I’ve decided it’s time to take the plunge and start using it.

Overall, I don’t think I’m leveraging it whatsoever. I know it has new features, of course – duh – but I am only using a few features of the API, mainly where I discover improvements accidentally.

That’s not very efficient, especially considering how neural networks use lots and lots (and lots) of loops – for which Java 8 offers the Streaming API as a potential improvement, as it turns out.

Thus, I have an ideal opportunity to get my feet wet – in a real way, “in anger,” you might say – with the new Java 8 lambda features, to really kick the tires.

This post is only the start of my explorations; I’m not even going to pretend that it’s groundbreaking. It’s just something I’m writing to save what I’ve done, so I don’t end up forgetting – and if someone reads it and sees something I should have done, well, then I’ll learn.

So: since networks tend to build slices of matrices out of slices of matrices, intersections make sense. Let’s start off with building two lists, and determining the intersection. For a first run, I’ll dump them to stdout.

@Test
public void testIntersectionToStdoutOld() {
    List<Integer> l1 = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    List<Integer> l2 = Arrays.asList(2, 4, 6, 8, 10, 12, 14, 16, 18, 20);
    System.out.println("Intersection: ");
    for (Integer i : l1) {
        if (l2.contains(i)) {
            System.out.println(i);
        }
    }
}

Well, isn’t that exciting… not really. Let’s spruce it up some. Here’s my first stab at the streaming version:

@Test
public void testIntersectionToStdoutStreaming() {
    List<Integer> l1 = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    List<Integer> l2 = Arrays.asList(2, 4, 6, 8, 10, 12, 14, 16, 18, 20);

    System.out.println("Intersection: ");
    l1.stream().filter(l2::contains).forEach(System.out::println);
}

Is that any more exciting? Hmm, I suppose. It’s shorter; the method references are actually pretty convenient.

I like this, but it’s just shorthand so far; on a source code level, it’s … shorter, but not necessarily more clear because the data types are so simple.

Let’s do better. Let’s create a, um, fictional team. What’s odd about this team is that the communication is representable as a directed graph: messages have to go along certain routes, in specific directions.

We’ll have two types for our data model, an Enum for our team members, and a Connection for the, well, connections between people.

enum PERSON {
    DAVE, GARRETT, DANIEL, RUTH, JASON, CARL, ROBYN, TOM, JOE, KARSTEN
}

class Connection {
    PERSON from;
    PERSON to;

    public Connection(PERSON from, PERSON to) {
        this.from = from;
        this.to = to;
    }
}

Now, let’s seed some collections (Lists) with data. We’ll even do it with the streaming API:

List<Connection> connections;
List<PERSON> community = Arrays.asList(RUTH, JASON, TOM, JOE);

@BeforeMethod
public void setUp() {
    PERSON[][] connectionsSource = new PERSON[][]{
            {DAVE, RUTH},
            {RUTH, DANIEL},
            {RUTH, JOE},
            {GARRETT, JASON},
            {GARRETT, CARL},
            {CARL, JASON},
            {DANIEL, ROBYN},
            {DANIEL, TOM},
            {JASON, ROBYN},
            {CARL, KARSTEN},
            {RUTH, KARSTEN},
            {RUTH, TOM},
            {CARL, DAVE},
            {KARSTEN, RUTH},
    };
    connections = new ArrayList<>();
    Arrays.stream(connectionsSource).
            forEach(data -> connections.add(new Connection(data[0], data[1])));
}

What this tells us is this: Dave can talk to Ruth; Ruth can talk to Daniel, Joe, Tom, and Karsten; Carl can talk to Karsten and Dave, and so forth and so on.

You’ll notice that we have a few names separated off as community. This is a smaller team within the larger group.

Now, let’s see who they can talk to, because scanning the data manually is making me cross.

What we’ll do is create a stream of Connections, and filter the results based on whether the connection’s source is in the community team. We’ll still use stdout for output, just because.

@Test
public void whoDoesTheCommunityTalkTo() {
    connections.stream()
            .filter(c -> community.contains(c.from))
            .forEach(c -> System.out.println(c.from + " talks to " + c.to));
}

Running this gets us some good results:

RUTH talks to DANIEL
RUTH talks to JOE
JASON talks to ROBYN
RUTH talks to KARSTEN
RUTH talks to TOM

However, it’s slightly inbred; we don’t want to see community team members who can talk to other community team members. We can do this by amending our filter (adding “&& !community.contains(c.to)”) or by adding another filter altogether, giving us this:

@Test
public void whoDoesTheCommunityTalkToOutside() {
    System.out.println("community talks outside to...");
    connections.stream()
            .filter(c -> !community.contains(c.to))
            .filter(c -> community.contains(c.from))
            .forEach(c -> System.out.println(c.from + " talks to " + c.to));
}

Hmm. It’s… interesting, I suppose, and I’ll happily admit that my problem definition could use some work, but I still don’t see a massive advantage. In fact, I’d say that it has a disadvantage because it’s harder to think through on first glance. Familiarity might repair that.

Let’s see one last streaming example. Let’s say that I want to know who can connect to Joe (me) in two hops, no more, and no less.

So what I want to do is find every route to myself, where the starting point is connected to someone who can talk to me.

First, here’s some old school Java code to do this:

// should be: karsten, dave
for (Connection start : connections) {
    for (Connection middle : connections) {
        if (middle.to == JOE && middle.from == start.to) {
            System.out.println(start.from + " can reach JOE through "
                + middle.from);
        }
    }
}

Exciting, but not. Now let’s see if streaming can make it better, as I’m thinking of it right now:

connections.stream()
    .filter(c -> c.to == JOE)
    .forEach(middle ->
        connections.stream()
            .filter(start -> start.to == middle.from)
            .forEach(m -> System.out.printf("%s can reach JOE through %s%n",
                m.from, middle.from)));

Now, is this any better?

I don’t know. I think it’s probably harder to screw up, once you get it right.
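For comparison, here is a sketch of the same two-hop lookup using flatMap() and collecting the result into a list instead of printing it. The names TwoHops, twoHopSources, and sample are assumptions for illustration, and sample() holds only a subset of the connections above. Note that this version keeps only the starting person and drops the middle hop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class, for illustration only.
class TwoHops {
    enum Person { DAVE, RUTH, JOE, KARSTEN, CARL, TOM }

    static class Connection {
        final Person from;
        final Person to;

        Connection(Person from, Person to) {
            this.from = from;
            this.to = to;
        }
    }

    // Everyone who can reach the target in exactly two hops.
    static List<Person> twoHopSources(List<Connection> connections, Person target) {
        return connections.stream()
                .filter(middle -> middle.to == target)
                // replace each "middle" connection with the connections leading into it
                .flatMap(middle -> connections.stream()
                        .filter(start -> start.to == middle.from))
                .map(start -> start.from)
                .distinct()
                .collect(Collectors.toList());
    }

    // A subset of the connections from the post.
    static List<Connection> sample() {
        return Arrays.asList(
                new Connection(Person.DAVE, Person.RUTH),
                new Connection(Person.RUTH, Person.JOE),
                new Connection(Person.CARL, Person.KARSTEN),
                new Connection(Person.RUTH, Person.KARSTEN),
                new Connection(Person.RUTH, Person.TOM),
                new Connection(Person.CARL, Person.DAVE),
                new Connection(Person.KARSTEN, Person.RUTH));
    }

    public static void main(String[] args) {
        // prints [DAVE, KARSTEN]
        System.out.println(twoHopSources(sample(), Person.JOE));
    }
}
```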

I can definitely see where, in a neural network, streaming might help the expressiveness of quite a bit of code. The syntax is nice, as well; if you’re just calling a method, the method syntax (“collection.stream().forEach(System.out::println);”) makes some things quite nice, I think, even though it’s not complete (or I don’t know how to do something with it, which is more likely).

Wait, what is it that I’d like to be able to do?

Well, look at the “who can the community talk to” filters. Here they are again:

connections.stream()
    .filter(c -> !community.contains(c.to))
    .filter(c -> community.contains(c.from))
    .forEach(c -> System.out.println(c.from + " talks to " + c.to));

I’d love to have some way to say “use this expression instead of the element”, such that the filter might look something like:

connections.stream()
    .filter(c.to -> community::contains)
    .filter(c.from -> community::contains)
    .forEach(c -> System.out.println(c.from + " talks to " + c.to));

Again, there might be a way to do this, without jumping through too many hoops (you could always map the value and use *that*, but I’m not sure how you’d get back to the containing object). I just don’t know it, and I keep thinking it’d be convenient if they’re going to allow me the short way to express the method call in the first place.
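One way to get close, sketched under assumptions: a small helper that takes an extractor and a predicate, tests the extracted value, and still hands the whole element on to the next stage. The helper by() is hypothetical (it is not part of the JDK), as is the class name FieldFilters; the data is a subset of the connections above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical class, for illustration only.
class FieldFilters {
    enum Person { DAVE, DANIEL, RUTH, JASON, ROBYN, TOM, JOE }

    static class Connection {
        final Person from;
        final Person to;

        Connection(Person from, Person to) {
            this.from = from;
            this.to = to;
        }
    }

    // Hypothetical helper: apply the test to an extracted value, but keep
    // filtering on the containing element.
    static <T, R> Predicate<T> by(Function<? super T, ? extends R> extractor,
                                  Predicate<? super R> test) {
        return element -> test.test(extractor.apply(element));
    }

    static List<String> outsideTalk(List<Connection> connections, List<Person> community) {
        Predicate<Person> inCommunity = community::contains;
        return connections.stream()
                .filter(by(c -> c.from, inCommunity))
                .filter(by(c -> c.to, inCommunity.negate()))
                .map(c -> c.from + " talks to " + c.to)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> community = Arrays.asList(
                Person.RUTH, Person.JASON, Person.TOM, Person.JOE);
        List<Connection> connections = Arrays.asList(
                new Connection(Person.DAVE, Person.RUTH),
                new Connection(Person.RUTH, Person.DANIEL),
                new Connection(Person.RUTH, Person.JOE),
                new Connection(Person.JASON, Person.ROBYN),
                new Connection(Person.RUTH, Person.TOM));
        // prints [RUTH talks to DANIEL, JASON talks to ROBYN]
        System.out.println(outsideTalk(connections, community));
    }
}
```

It isn't as terse as the wished-for syntax, but the element survives the filter, which was the sticking point with map().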

But is this desirable, or just “nice,” “neato,” “I’m glad that other languages that have this will stop making fun of Java since it doesn’t?”

So far, I’m leaning towards the latter. But I’m going to keep working on this, because I can see the shadows of something fascinating based on this – but I’m just not there yet.

Repost: Godwin’s Law, expanded

Recently, I had a discussion with someone in which I took a position that is not representative of my actual thought on a matter, against a rather … energetic opposing point. In it, Godwin’s Law was in full evidence, along with some corollaries; partly inspired by the corollaries, I’d like to propose an addendum to Godwin’s Law, with an impact on derivatives.

It was not a happy discussion; let’s just say that I would be relegated to a ghetto if my esteemed opponent had stuck to his guns – and, I suppose, if I claimed actual allegiance to the argument I was posing. I was playing Devil’s Advocate, because while I don’t fully agree with the argument I was making, I can empathize with those who would.

So: Godwin’s Law. Basically, the Law says that as an argument goes on, there’s a greater and greater (and greater) likelihood that one side of the argument will invoke Hitler or the Nazis somehow. One can imagine a discussion, with tracked probabilities:

| Source | Statement | Probability of Hitler’s name being invoked |
|--------|-----------|--------------------------------------------|
| Jim | I like peanut butter. | 0.000001 |
| John | I do too. I like strawberry jelly. | 0.0000012 |
| Jim | Strawberry jelly’s not bad, but I prefer raspberry. | 0.00000121 |
| John | Raspberries have the larger seeds, though, and they get stuck in my teeth. And I like the large strawberry chunks. | 0.00000122 |
| Jim | You only get the strawberry chunks if it’s real strawberry, though. | 0.00000125 |
| John | Well, maybe they could put strawberry chunks in with artificial flavoring. | 0.0000015 |
| Jim | yeah, but then you might as well use real strawberries in the first place. | 0.000005 |
| John | Maybe they’re running short of real strawberries. They could use genetic modification to help them grow faster or in more places. | 0.0004 |
| Jim | I worry that engineering plants to be more hardy makes them interact poorly with the rest of the ecosystem. | 0.08 |
| John | Ecosystems have greater concerns than genetic modification, though, especially if you don’t factor in direct DNA manipulation. | 0.6 |
| Jim | HITLER THOUGHT EUGENICS WERE GOOD! HOW DARE YOU DECLARE GLOBAL WARMING DATA INVALID! | 1 |

There’s a corollary that suggests that the person who falls to Godwin’s Law automatically loses the argument. It’s not always true, of course; sometimes Hitler is in scope of the discussion (for example, if Jim and John were actually talking about fascism, or World War II), but Godwin’s Law – according to Godwin – was inspired partially by a desire to get participants to actually think about what they were saying, instead of blithely throwing Hitler in, to win by a horrible association.

For example, from our example discussion above, genetic modification may or may not be a bad thing, but Jim tried to associate genetic modification with Hitler, which would poison the well of any argument using genetic modification.

Now, all of this is ordinary and normal for Godwin’s Law. So where’s my suggestion?

I thought you’d never ask. (I also wondered if you’d get this far.)

I think that there needs to be an expansion past the association with the Nazis.

The argument in which I participated, for example, eventually did invoke Hitler, but it led off with another attempt to poison the well: the Taliban.

Now, the Taliban is a difficult association to offer, because it’s still a viable political group; some people still support the Taliban’s political point of view. Therefore, an association might be an actively offensive reference. (“How dare you associate the Taliban with people who like Rice Krispies!,” as opposed to people who like Rice Krispies being offended by their association with the Taliban.)

However, I think that with limited and agreed scope, Godwin’s Law can include the Taliban, perhaps the North Korean government, perhaps the Lord’s Resistance Army; I really don’t think that my concept of where the expansion should take place is really the important point.

I think that the important concept is the expansion itself, to require that invocation of such groups inspire some thought on the part of the one making the association.

Your thoughts?

Repost: Graduation rates in college football

Wouldn’t it be awesome if the contracts for college football coaches paid out at the graduation rate experienced by their players?

It’d have to be tweaked, of course; it’d be unfair to punish an incoming coach for the poor graduation rates of his predecessor. But imagine: the first year, you might pay the coach 100% of his contract rate, then halve the failure rate for graduation for the next year, then apply the actual graduation rate to his contract for the duration.

And a coach’s tenure would reset only to one year, not to zero, if he went from college to college; you’d not want to encourage a coach to fail, then get a clean slate by moving to a new college.

Imagine: a coach with a $5,000,000 contract for five years goes to college Wantstowin.

Assuming an even pay rate of one million dollars per year (and holy cow!), the first year he’d be paid his “fair” one million dollars, no matter what his graduation rate for his fourth-year players.

His second year, well, let’s say his graduation rate is a fantastic 68%. That means 32% of his players who should have graduated did not; he gets his pay cut by 16% and has to get by on a mere $840,000.

The third year? Well, he has a rough year; his players’ graduation rate goes down to 62%. Ouch, Coach – that means he loses 38% of his possible pay for that year, so he gets a mere $620,000 – he might be eating Ramen noodles until he gets his graduation rates back up.

And so forth and so on. Actually, let’s keep going: his fifth year, his graduation rates are at – let’s say – 65%; not great, but he learned from his third year and did better. But OtherCollege University hires him away, after firing Coach Snee K. Weasel, but for the purposes of simplicity let’s say he signs the same five million dollar contract.

His first year at the new college factors in only half of the nongraduates, so instead of taking the full 35% cut that his 65% graduation rate would imply, he is cut by only 17.5%, and he takes home $825,000 his first year; his pay is affected somewhat by his last year at the other college, but he isn’t punished for the lame graduation rates of Coach Weasel.
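The arithmetic above can be sketched in a few lines. This is only an illustration of the scheme as described; the class name CoachPay, the method pay(), and the tenureYear parameter are assumptions, with a coach who changes colleges treated as starting in tenure year two.

```java
// Hypothetical class, for illustration only.
class CoachPay {
    // Pay in whole dollars for one contract year.
    // tenureYear: 1 = full pay, 2 = half of the graduation shortfall is
    // deducted, 3 and later = the full shortfall is deducted.
    static long pay(long baseSalary, int tenureYear, double graduationRate) {
        double shortfall = 1.0 - graduationRate;
        double cut;
        if (tenureYear <= 1) {
            cut = 0.0;
        } else if (tenureYear == 2) {
            cut = shortfall / 2.0;
        } else {
            cut = shortfall;
        }
        return Math.round(baseSalary * (1.0 - cut));
    }

    public static void main(String[] args) {
        System.out.println(pay(1_000_000, 1, 0.50)); // 1000000: year one ignores the rate
        System.out.println(pay(1_000_000, 2, 0.68)); // 840000
        System.out.println(pay(1_000_000, 3, 0.62)); // 620000
        System.out.println(pay(1_000_000, 2, 0.65)); // 825000: first year at the new college
    }
}
```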

Now, is this workable? I doubt it; honestly, any coach who could do math (i.e., probably 10% of them) would front load his contract so his poor graduation rates would take less from him. (In other words, a five million dollar contract would pay him three million up front, with $500,000 being his base pay for the other four years.)

Plus, early exits would have to factor in somehow, and – here’s the kicker – the coaches would have to buy into the idea, and care. They wouldn’t.

But wouldn’t it be nice?

Repost: A brilliant new film idea

Inspired by the apparent congruence between the book World War Z and the film World War Z, I was struck by an imaginary conversation that led me to a very marketable idea.

Here’s the conversation:

“Let’s call it ‘Tom Sawyer.’”

“But the main character’s named Philip MacGillicuddy, and it’s about an invasion of flying robotic robots with lasers that cause impact damage.”

“Hmm, good point. Let’s name the love interest ‘Becky Thatcher.’”

“Okay!”

The Trailer

“A taco stand. A man. A plan. Panama. A flying robot invasion that denies every law of physics we can think of. In theaters 2014: Tom Sawyer.”

The script almost writes itself:

PHILIP: I'm almost sort of like Tom Sawyer. The character. 
        In the book.

BECKY : Except you're not named Tom, and you're not like him, 
        you coward.

   BUCKET, a robot that looks like a soda can with a 
   rounded top, enters.

BUCKET: Hey! I's a bucket! And I talk jive! Slap yo mama and 
        feed me wingnuts!

PHILIP: Ha! Ha! Ha!

BECKY : Shut up.

PHILIP: Ha! Ha! Ha!

    BECKY cuts off PHILIP's right hand.

BECKY : Philip... I am your-

BUCKET: Slap yo mama and feed me wingnuts!

    ROBOT enters, laser guns blazing, ricocheting laser bolts 
    going everywhere.

ROBOT, robotically: I am from the future and I have come for 
       your water, sacks of meat.

BUCKET: Slap yo mama and feed me wingnuts!

PHILIP: Okay, take it all!

    BECKY dances for seven long minutes for INJUN JOE, 
    the name I just made up for the ROBOT. A number one 
    pop song, written by JUSTIN BIEBER, plays.

See? It’s perfect!

Repost: Rocket Java: What is Map/Reduce?

Map/Reduce is a strategy by which one uses a divide-and-conquer approach to handling data. The division is normally provided along natural lines, and it can provide some really impressive performance gains.

The way it does this is by limiting the selection of data for which processing applies. Here’s a good way to think of it, with an anecdote originally provided by Owen Taylor:

Imagine if you had two types of popcorn at a party; one type is savory, and the other is sweet. You could have both types in a single bowl, and partygoers who wanted sweet popcorn could dig through the bowl looking for what they wanted; those who preferred savory popcorn could do the same.

Apart from sanitary concerns, this is slow.

What you could alternatively do, however, is provide two bowls: one with savory popcorn, the other with sweet popcorn. Then people would just line up for the bowl that had the popcorn they liked. Speedy, and potentially far more sanitary. Just line up in front of the neckbeard with the cold who keeps sneezing into his hand.

To explain it in computing terms, let’s determine a requirement with some artificial constraints.

Our project is to count marbles; we have four colors (red, green, blue, white) and our data storage doesn’t provide easy collation… or we’ve terabytes’ worth of marbles, which might be a more logical actual requirement.

The lack of collation is important, for our artificial requirements; ordinarily one would create an index in a database for each marble, based on color, and issue a simple “select color, count(*) from marbles group by color” and be done. With a giant dataset the select could be very expensive; it might walk the entire dataset based on color.

Let’s be real: the best approach for this actual requirement (assuming a relational database) is to use a database trigger to update an actual count of the marbles as the marbles are added or removed.

So what’s another approach? Well, what you could do is provide a sharding datastore.

Imagine a datastore in which a write directed data to one of many nodes, based on the data itself. If we had four nodes, we’d shard based on the marble color; node one would get all red marbles, node two would get all green marbles, node three would get all blue marbles, and node four would get all white marbles.

If we have direct access to our data – embedded in the sharding mechanism, perhaps – we can then count the marbles very, very, very quickly. We don’t even have to consider what color a given marble is, only what the node is. Our structure would look like this:

  • For each node, count the marbles the node contains, and return that count, along with the color of the first marble encountered.
  • Collect each count and build a map based on the marble color and the count.
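Those two steps might be sketched like this. Threads stand in for the nodes here, which is purely an assumption for the example: a real grid would run the first step on each node's own CPU, and Python threads won't actually give you that parallelism. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_node(node):
    """Run on each node: count its marbles, report the first color seen."""
    if not node:
        return None  # an empty node has nothing to report
    return node[0], len(node)

def collect(results):
    """Build the color -> count map from the per-node answers."""
    return dict(r for r in results if r is not None)

def count_marbles(nodes):
    """Send the request to every node at once, then collect the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = list(pool.map(count_node, nodes))
    return collect(results)
```

Note that count_node never looks at more than one marble's color; it trusts the sharding, so the work per node is just a count.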

If the shards have their own CPUs, then you can see how the runtime would end up taking only as long as it took to iterate over the largest dataset plus a touch of network communication (which, normally, won’t factor in by comparison to the first operation).

The first step – where you send a request to each of the shards to collect data – is called the Map phase.

The second step – where you collate the data returned by the Map – is called the Reduction phase.

Thus: “Map/Reduce.” You map a request to many nodes, then reduce the results into a cohesive returned value.

Of course, I’ve simplified my requirements such that I only iterate over the marbles, gathering a simple count. The real world is rarely so convenient; you’d be more likely to have marbles of multiple colors in each node.

In this case, you’re still iterating over each marble, doing a collation – definitely slower than a simple count, but your runtime is still going to only be as long as it takes to iterate over the largest node’s data.

This is really important, by the way. Your data distribution is critical. You’re not going to run in one fourth the time if you have four nodes; you’re going to run as long as it takes to process the node that takes the longest. If your nodes take 200ms, 220ms, 225ms, and then 1372ms… then your runtime is going to be 1372ms. Compare that to the 2017ms that it would take if you only had one node. If each one takes the same amount of time – let’s say 400ms – then your total runtime will be (roughly) 400ms.
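The arithmetic is simple enough to write down; this little sketch ignores network overhead, and the function names are made up:

```python
def parallel_runtime_ms(node_times_ms):
    """All nodes work at once, so the slowest node sets the total."""
    return max(node_times_ms)

def single_node_runtime_ms(node_times_ms):
    """One node has to do all of the work, one piece after another."""
    return sum(node_times_ms)

skewed = [200, 220, 225, 1372]
balanced = [400, 400, 400, 400]
```

With the skewed numbers, parallel_runtime_ms gives 1372 against 2017 for a single node; with the balanced ones it gives 400 against 1600. Same number of nodes, very different payoff.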

The reduction would be a little more complicated if you’re looking at more than one color per shard, but not by much; you’d have a map of colors and counts returned from each node already, so you’d simply be combining the results rather than building a map of the results in the first place.
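A minimal sketch of the mixed-color case, using collections.Counter as the per-node map (the function names are hypothetical, and the nodes are again just lists):

```python
from collections import Counter

def tally_node(node):
    """Map step: each node collates its own marbles by color."""
    return Counter(node)

def merge_tallies(tallies):
    """Reduce step: combine the per-node maps into one."""
    total = Counter()
    for tally in tallies:
        total.update(tally)  # adds counts color by color
    return dict(total)
```

Each node does the expensive part (walking its marbles) on its own; the reduction only has to add up a few small maps, one per node.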

Map/Reduce is a handy way to leverage multiple CPUs to gather data quickly – but it means you have to have your data sharded in the first place, in such a way that you can leverage the sharding. Hadoop and Gluster can do it, providing a filesystem as shards; in-memory products like Infinispan (produced by Red Hat, by whom I am employed), Gigaspaces, Coherence, Terracotta DSO, and even lowly GridGain can manage Map/Reduce, and the in-memory nature of the shards yields some truly impressive speed gains.

Map/Reduce is proven and useful, and one of the biggest logical drivers for “the cloud.”

Repost: Geddy Lee’s weird bass is a Precision!

Today I read an article from Guitar Player, from 1980, where Geddy Lee solved a question I’ve had for years, about what kind of bass was featured in a number of pictures.

See the bass immediately to his right? It looks like a teardrop. I’ve seen color pictures; it’s sort of a light blue gradient; it’s actually a pretty bass, even if it looks funky.

It also looks even less comfortable to play than a Steinberger Spirit, but hey… what do I know? Maybe it’s an acquired taste, and maybe this teardrop-shaped thing led Geddy to the Steinberger around the Grace Under Pressure era…

Anyway, like I said, I’ve wondered what that bass was for a long, long, long time. And Guitar Player finally explained!

It’s a Fender Precision – Geddy’s old, old, old (pre-Rickenbacker) Precision bass, cut down and modified beyond recognition (apart from, I guess, the pickup configuration – but that’s not a sole identifier, since other basses have used the Precision pickup alignment as well). Apparently, it underwent some surgery (I don’t know why offhand, nor do I recall hearing or reading the reason), and even in this picture it was effectively unplayable.

That’s a pity; one hates to hear of any instrument being damaged. (Modified, sure. Damaged… no.)

But at any rate, the question I’ve had for a really long time has been answered at last.

Repost: Java sucks without semantic awareness

This is a short post from TheServerSide.com that I’d written way back in 2008. It was written in humor, which many who read it did not quite understand, but the point remains for those who wish to see it:


A conversation with someone highlighted yet another problem with Java, a fatal one:

Java’s lack of semantic awareness. Without this, coders are unable to use examples from the web, lessening whether Java is actually usable or not. Here’s a representative sample of the conversation:

Person1> What I want is x.contains(y).
Person2> That's the method name that the Collections API provides, 
         though, and it does exactly what you say you want.
Person1> But my collection isn't named 'x'!

This is a clear example of where Java, had it understood that when the first person said ‘x.contains(y)’, he meant to use his collection name, would have been able to compile and execute the code properly.

He would have been able to find an example on the web using contains() and cut and paste it, et voilà! Executing code. Java needs semantic awareness. Without it, it will die.


I’m reposting the original content here to preserve it for posterity.

The problem doesn’t go away, though: today, someone else did the same thing.

Person1> questions.get((int) spinner.getValue()) = new Question(); 
              tells me left hand must be a variable
Person2> You want put() then
Person2> or set, whichever
Person1> because it is an ArrayList?
Person2> because it's an OBJECT.
Person1> shut up.  You just said nonsense saying to use .put() 
         on an ArrayList

Um… yeah.

Repost: Seventeen seconds is all you get?

This statement really surprised me:

5. Refuse to interrupt. Recent research has indicated that the average individual listens for only seventeen seconds before interrupting and interjecting his own ideas…
(from “The Five Love Languages Men’s Edition”, Gary Chapman, 2009)

Wow. I had four primary thoughtlines generated from that statement.

My First Reaction

The first was “Say, this is what I think, Mr. Chapman…”

My Second Reaction

The second, far less sarcastic, was … wow. Seventeen seconds. That’s… really not very long at all. I tried counting out seventeen seconds, and my attention really started wandering after about eight; I can easily see doubling that (because I want to listen to my wife, right?) and running out of patience and wanting to interject something about me.

That’s almost cruel. Chapman’s point is dead on here; if our wives need us to listen to them (which was the “Love Language” being discussed, called “quality time”), then forcing them to listen to us is just awful, and destructive.

Of course, if our love language is also quality time, we need her to listen to us at some point – but we’re not going to receive that by forcing the issue. We’d only make her need for quality time more intense, and when our own needs aren’t fulfilled, we fulfill others’ needs more poorly.

To receive, you must give, in this case. Seventeen seconds isn’t enough. It’s going to take effort for me to track myself in this area, but I’m going to try.

My Third Reaction

What does it mean for other media? I write a lot; I help others write as well. One of the things I’m always repeating (over and over again) is that you have to consider your audience – and one of the things I really haven’t thought about is time.

Sure, everyone knows you have one chance to make a first impression; that first sentence has got to be a hook, or you’ve lost your readers.

But I haven’t done a good job of thinking in terms of the time it takes for an idea to be expressed, and not only in terms of a count of sentences. How long do I read something I’m actually interested in before I naturally start drifting, or applying my own thoughts to the subject?

That drift is the web’s form of “interruption,” after all. I’m not interrupting the page I’m reading, but for the content’s purpose, it’s been interrupted; my attention’s wandering, focusing on myself.

It’s an interesting problem to work around. I don’t know that I have a good strategy for it, outside of being really, really interesting.

My Fourth Reaction

My fourth reaction was to think about the research itself. I cited my source in my quote; however, Mr. Chapman did not.

I tried (somewhat casually) to find some actual source research, and found some content related to physicians (“The Zen of Listening,” Rebecca Z. Shafir, 2003, p152), but …

I kept looking. I eliminated songs that referred to those seventeen seconds, and TV episodes (“Grey’s Anatomy” apparently had an episode using it); I found Ms. Shafir’s book, and numerous other references to Mr. Chapman’s statement, and I also found a lot of uncited other references.

I can’t find actual research anywhere. I’m fairly certain it’s there; anecdotal tests on myself correspond with the statement (although this may be confirmation bias on my part). But I’m really concerned about the lack of an actual research paper.

Even “The Zen of Listening” doesn’t cite an actual reference. This seventeen seconds thing may be entirely made up.

I’m not saying it is made up – I’m definitely not trying to accuse Mr. Chapman of fabricating research. But I do wish he’d included some reference, just so we could actually validate the science behind the claim.

In Conclusion…

I am really enjoying the book, honestly; I don’t want anyone to think I don’t, or that it’s not worth checking out. I haven’t finished it yet, but I anticipate it being a worthwhile read.

And seventeen seconds… just wow. I’m going to try to do some personal research and figure out if I think that number is high or low… the challenge will be to do this research without offending my loving wife.

Maybe I’ll observe my kids; however, my thought there is that seventeen seconds before interruption will be absurdly high. (I’m betting they get five seconds.)