Repost: Java 8 Streams: filtering and mapping

I’ve been making some progress with my Java 8 streams and lambdas explorations. I’m still not anywhere near an expert yet, and chances are that the literati would see my attempts as childish and ignorant, but that’s okay. How else do you learn?

What I’ve been working on is a tokenizer. I’m feeding data from a corpus, a body of text, into a grinder, to spit out valid tokens.

There’re currently two types of tokens, although this isn’t reflected in code (the rules for being a token are the same regardless of how the token was generated). Each token is generated based on one of two sets of rules.

The first ruleset, which applies to the other ruleset as well, is this:

Each token should map to one word in the corpus; “foo bar” should always map to two tokens.

Each token should have only alphabetic or numeric characters; “foo7” is fine as a token, but “shouldn't” is not; “shouldn't” should be converted to “shouldnt” instead. The content should be trimmed, as well, so whitespace on either side of the word should not be included.

These rules facilitate ease of use: I want to be able to say “give me the list of tokens that corresponds to ‘foo bar baz quux’”, as opposed to having to build up a list by looking up each word by itself. (This is part of the actual project storyboard: a critical part of the overall task is to take text from a web page and generate tokens from it. Parsing the page into individual chunks would be more work than it’s worth… right now.)

The second ruleset adds a few requirements.

First, the token has to be at least three characters long. This trims out a number of extraneous conjunctions, definite articles, pronouns, and other such connecting words that tend to add up to noise.

Secondly, the corpus should not exist in a list of stop words. The stop words include articles and conjunctions, for example, but also add a number of other typically common words.

Thirdly, the corpus should be stemmed, meaning that they should be reduced to their base form. “Amenity”, for example, has the root word of “amen”; “porterhouse” has a root word of “porterhous”; “aversion” has a root stem of “avers”.

This sounds like a perfect application for the streams. Get a list of the words from the text, and process each one. Here’s some incomplete code for grabbing the text and mapping it to a set of tokens in a histogram object:

public final Histogram buildHistogram(String inputs) {
    Histogram histogram = new Histogram();
    // this takes a simple input
    Arrays.stream(inputs.trim().split(" "))
            .filter(s -> s.length() < 3)
            .map(String::toLowerCase)
            .map(this::getToken)
            .forEach(histogram::add);
    return histogram;
}

… Yuck.

For one thing, this doesn’t actually apply many of our criteria, and doesn’t address the two sets of criteria at all.

It works, though: it trims the input, then splits along spaces (which is generally correct); any short words get ignored, then the text is converted to lower case; the resulting text is mapped to a token, which is added to the histogram.

But all of the tokenizers should do things the same way, just with different rules. This sounds like a mission for a custom filter and mapper, to accept sets of filters and mappings. This way, each tokenizer would have a set of rules and mappings, and the code to actually do the tokenization would not have to change.

So how would this look in actual code? Well, here’s the buildHistogram() method again, with the custom filter and mapper, and the default ruleset:

public final Histogram buildHistogram(String inputs) {
    Histogram histogram = new Histogram();
    // this takes a simple input
    Arrays.stream(inputs.trim().split(" "))
            .map(this::evaluateMappings)
            .filter(this::evaluatePredicates)
            .map(s -> getToken(s))
            .forEach(histogram::add);
    return histogram;
}

private String evaluateMappings(final String s) {
    String text = s;
    for (Function<String, String> f : getMappings()) {
        text = f.apply(text);
        if (s == null || s.isEmpty()) {
            return "";
        }
    }
    return text;
}

private boolean evaluatePredicates(String s) {
    if (s == null || s.isEmpty()) {
        return false;
    }
    for (Predicate<String> p : getPredicates()) {
        if (!p.test(s)) {
            return false;
        }
    }
    return true;
}

protected final Predicate<String> minLengthFilter = text -> text.length() > 2;

@SuppressWarnings("unchecked")
protected Predicate<String>[] getPredicates() {
    return new Predicate[]{};
}

protected final Function<String, String> toAlphanumeric = s -> {
    StringBuilder sb = new StringBuilder();
    for (char ch : s.toCharArray()) {
        if (Character.isLetterOrDigit(ch)) {
            sb.append(Character.toLowerCase(ch));
        }
    }
    return sb.toString();
};
protected final Function<String, String> toTrimmedForm = this::normalized;

@SuppressWarnings("unchecked")
protected Function<String, String>[] getMappings() {
    return new Function[]{toTrimmedForm, toAlphanumeric,};
}

This means the changes for other tokenizers is limited to getMappings() and getPredicates. An incomplete version of the stemming tokenizer looks like this (and, oddly enough, it doesn’t actually have the stemming code yet):

public class StemmingTokenizer extends Tokenizer {
    final static Set<String> stopWords;
    static { 
        /* Code to initialize the stop words goes here, but isn't copied
        stopWords=new HashSet<>();
    }

    Predicate<String> stopWordsFilter=s -> !stopWords.contains(s);

    @SuppressWarnings("unchecked")
    @Override
    protected Predicate<String>[] getPredicates() {
        return new Predicate[]{minLengthFilter, stopWordsFilter,};
    }

    @SuppressWarnings("unchecked")
    @Override
    protected Function<String, String>[] getMappings() {
        return new Function[]{toTrimmedForm, toAlphanumeric};
    }
}

As it stands, this will correctly trim out the stop words, and all it needs is a stemming class and a lambda (like the toAlphanumeric function) to stem the text.

Now, here’s the thing: is this good code?

I don’t entirely know. I know that it works, because I’ve tested it; the stemmer isn’t here, but stemming’s not rocket science. (There are plenty of good stemmers in Java, using either the Snowball or Porter algorithms; use Snowball. It’s an enhanced and corrected Porter.)

But I can’t help but wonder if the way I’m doing this isn’t idiomatic. Maybe there’s some magic way to apply a set of filters and mappings that I just haven’t seen; I’ve tried to think some about how this would be written and specified such that it would be correct for the general case, and failed.