Hadoop MapReduce is a software framework for processing enormous data sets. It is the main component for data processing in the Hadoop framework. It divides the input data into several parts and runs a program on every data component parallel at one. The word MapReduce refers to two separate and different tasks.
The first is the map operation, which takes a set of data and transforms it into a different collection of data, where individual elements are divided into tuples. The reduce operation consolidates those data tuples based on the key and subsequently modifies the value of the key.
Let us take an example of a text file called example_data.txt and understand how MapReduce works.
The content of the example_data.txt file is:
coding,jamming,ice,river,man,driving
Now, assume we have to find out the word count on the example_data.txt using MapReduce. So, we will be looking for the unique words and the number of times those unique words appeared.
- First, we break the input into three divisions, as seen in the figure. This will share the work among all the map nodes.
- Then, all the words are tokenized in each of the mappers, and a hardcoded value (1) to each of the tokens is given. The reason behind giving a hardcoded value equal to 1 is that every word by itself will, at least, occur once.
- Now, a list of key-value pairs will be created where the key is nothing but the individual words and value is one. So, for the first line (Coding Ice Jamming), we have three key-value pairs – Coding, 1; Ice, 1; Jamming, 1.
- The mapping process persists the same on all the nodes.
- Next, a partition process occurs where sorting and shuffling follow so that all the tuples with the same key are sent to the identical reducer.
- Subsequent to the sorting and shuffling phase, every reducer will have a unique key and a list of values matching that very key. For example, Coding, [1,1]; Ice, [1,1,1].., etc.
- Now, each Reducer adds the values which are present in that list of values. As shown in the example, the reducer gets a list of values [1,1] for the key Jamming. Then, it adds the number of ones in the same list and gives the final output as – Jamming, 2.
- Lastly, all the output key/value pairs are then assembled and written in the output file.