Saturday, February 25, 2017

Combiners, Counters and Properties in Hadoop

Before starting, I would like to thank everyone that spent her/his time reading my first blog post about Kotlin and Hadoop, and, in particular, I would like to thank the guys @KotlinWeekly for referencing it in the newsletter #29. If you haven't signed yet, go to http://www.kotlinweekly.net/  and register to their newsletter, you can find amazing articles about Kotlin every week.
I've also decided to introduce Kotlin 1.1RC in the project for starting learning what is new in this version. As always, you can find the code of this post in its repository.

In this post, I'm going to introduce the Hadoop concepts of a Combiner but also how to define custom Counters and Properties in a MapReduce program. The example this time is the n-gram counter problem with a keyword for filtering the results. Given as input one or more text files, for each line we create multiple sequences of n or less words starting from the first one. We then filter out all the n-grams that start with the input keyword and we count how many times each sequence appears in the inputs.

Let's start with the easy one, Properties and Counters. A property is an attribute of the program that we want to share in our working context, and is usually a read-only property that we don't have interest in changing during the execution. In our example, we will use them to save the value n of our n-grams and the keyword to use in the filtering.
Instead, a counter is some sort of statistics we want to keep track of during the running of our program. In this case, we want to keep track of how many n-grams we filter out because they start with our keyword and we will increment the counter by 1 every time we discard one of them.

This is our Driver code, and lines 30 and 31 is where we define the properties for our Configuration. Because a property is a simple pair, the method Configuration.set(key, value) allows us to set a property easily, while Configuration.get(key) gives the value back as a String, as shown in line 15 of the NGramMapper.

The Counter code is, instead, between lines 13 and 15. In order to create a counter, it is enough to declare a public enum class inside the Driver class. This counter starts from 0 and it is incremented in the Combiner and the Reducer classes. For improving the readability of incrementing a counter, I have implemented a couple of extension methods for the Counter class. The operator keyword in Kotlin allows to extend a class with the methods inc() and plusAssign (value) and to call them as counter++ and counter += value, making the use of a counter more intuitive.

A Combiner places itself between the Mapper and the Reducer execution. The idea is that, in order to reduce the amount of data that the Mapper sends in the network, it is possible to combine these data in some way while they are still in memory, making possible to reduce the amount of them sent to the Reducer. A Combiner, like the Reducer, extends the org.hadoop.mapreduce.Reducer class and its input key:value and output key:value are the same of the reducer. Considering our example, instead of sending multiple times the same n-gram, or sending a n-gram that is going to be filtered out, we can first combine those results in the Combiner phase, reducing the amount of data in the network by counting a priori the number of times a n-gram appears, or immediately filter out those n-grams that match our keyword.

Two things are important to know before using a Combiner: first, the function executed must be commutative and associative in order to be able to use the Combiner correctly. And second, the execution of the Combiner is not guaranteed for all the keys generated from the Mapper, so it is important to consider this when writing the Reducer's code. For those two reasons the input and output of a Combiner must be the same of the Reducer used.

[NB] I'm not really sure about this second thing, so if somebody can give me more info about this, it would be incredibly appreciated.
[NB.2] The Combiner and the Reducer code are exactly the same, but for the sake of making everything more readable I decided to write two separate classes.

At the end of the execution, the value of the counter can be printed on the screen in the Driver class, but it is easier to simply look at the output of our Hadoop execution, which will print the counter result as shown in the image below.


With this ends my second post about Hadoop using Kotlin. If you have any question, or if you see that something is wrong in my explanation, don't hesitate to write me.