To run this example we need to do some preparation. We assume that the HDFS service is running; if we haven't created a user directory yet, we have to do it now (assuming the hadoop user we're using is mapred):
$ hadoop fs -mkdir -p /user/mapred

When we pass "fs" as the first argument to the hadoop command, we're telling Hadoop to work on the HDFS filesystem; in this case, we used the "-mkdir" switch to create a new directory on HDFS.
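If HDFS is administered by a different superuser, the home directory may have to be created and handed over on the mapred user's behalf; a minimal sketch, assuming the commands are run as the HDFS superuser (whose name depends on your installation) and that the group is mrusers as in the listings below:

```shell
# Run as the HDFS superuser (name varies by installation).
# Create the home directory and hand it over to the mapred user:
hadoop fs -mkdir -p /user/mapred
hadoop fs -chown mapred:mrusers /user/mapred

# Verify ownership and permissions:
hadoop fs -ls /user
```

These commands need a running cluster, so they are only a sketch of the usual setup step.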
Now that our user has a home directory, we can create a directory that we'll use to load the input file for the MapReduce programs:
$ hadoop fs -mkdir inputdir

We can check the result by issuing an "ls" command on HDFS:
$ hadoop fs -ls
Found 1 items
drwxr-xr-x   - mapred mrusers          0 2014-02-11 22:54 inputdir

Now we can decide which file to count the words of; in this example, I'll use the text of the novella Flatland by Edwin Abbott, which is freely available for download from Project Gutenberg:
$ wget http://www.gutenberg.org/cache/epub/201/pg201.txt

Now we can put this file onto HDFS, more precisely into the inputdir directory we created a moment ago:
$ hadoop fs -put pg201.txt inputdir

The "-put" switch tells Hadoop to take the file from the local filesystem and copy it onto the HDFS filesystem. We can check that the file is really there:
$ hadoop fs -ls inputdir
Found 1 items
-rw-r--r--   1 mapred mrusers     227368 2014-02-11 22:59 inputdir/pg201.txt
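To double-check the upload we can also print the beginning of the file straight from HDFS; the "-cat" switch streams a file to stdout, like its Unix namesake (again, this needs the running cluster, so treat it as an optional sanity check):

```shell
# Stream the file from HDFS and show only its first lines:
hadoop fs -cat inputdir/pg201.txt | head -n 5
```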
Now we're ready to execute the MapReduce program. The Hadoop tarball comes with a JAR containing the WordCount example; we can launch Hadoop with these parameters:
- jar: tells Hadoop we want to execute a MapReduce program contained in a JAR
- /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar: this is the absolute path and filename of the JAR
- wordcount: tells Hadoop which of the many examples contained in the JAR to run
- inputdir: the directory on HDFS in which Hadoop can find the input file(s)
- outputdir: the directory on HDFS in which Hadoop must write the result of the program
$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount inputdir outputdir

and the output is:
14/02/11 23:16:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:20 INFO input.FileInputFormat: Total input paths to process : 1
14/02/11 23:16:20 INFO mapreduce.JobSubmitter: number of splits:1
14/02/11 23:16:21 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/02/11 23:16:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392155226604_0001
14/02/11 23:16:22 INFO impl.YarnClientImpl: Submitted application application_1392155226604_0001 to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:23 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1392155226604_0001/
14/02/11 23:16:23 INFO mapreduce.Job: Running job: job_1392155226604_0001
14/02/11 23:16:38 INFO mapreduce.Job: Job job_1392155226604_0001 running in uber mode : false
14/02/11 23:16:38 INFO mapreduce.Job:  map 0% reduce 0%
14/02/11 23:16:47 INFO mapreduce.Job:  map 100% reduce 0%
14/02/11 23:16:57 INFO mapreduce.Job:  map 100% reduce 100%
14/02/11 23:16:58 INFO mapreduce.Job: Job job_1392155226604_0001 completed successfully
14/02/11 23:16:58 INFO mapreduce.Job: Counters: 43
	File System Counters
		FILE: Number of bytes read=121375
		FILE: Number of bytes written=401139
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=227485
		HDFS: Number of bytes written=88461
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=7693
		Total time spent by all reduces in occupied slots (ms)=7383
	Map-Reduce Framework
		Map input records=4239
		Map output records=37680
		Map output bytes=366902
		Map output materialized bytes=121375
		Input split bytes=117
		Combine input records=37680
		Combine output records=8341
		Reduce input groups=8341
		Reduce shuffle bytes=121375
		Reduce input records=8341
		Reduce output records=8341
		Spilled Records=16682
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=150
		CPU time spent (ms)=5490
		Physical memory (bytes) snapshot=399077376
		Virtual memory (bytes) snapshot=1674149888
		Total committed heap usage (bytes)=314048512
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=227368
	File Output Format Counters
		Bytes Written=88461

The last part of the output is a summary of the execution of the MapReduce program; just before it, we can spot the line "Job job_1392155226604_0001 completed successfully", which tells us the MapReduce program was executed successfully. As mentioned, Hadoop wrote the output into outputdir on HDFS; let's see what's inside that directory:
$ hadoop fs -ls outputdir
Found 2 items
-rw-r--r--   1 mapred mrusers          0 2014-02-11 23:16 outputdir/_SUCCESS
-rw-r--r--   1 mapred mrusers      88461 2014-02-11 23:16 outputdir/part-r-00000

The presence of the _SUCCESS file confirms the successful execution of the job; Hadoop wrote the result of the execution into the part-r-00000 file. We can bring the file down to our machine's filesystem using the "-get" switch:
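Instead of copying the whole file locally, we could also peek at the result directly on HDFS with "-cat" (only a sketch, since it needs the running cluster):

```shell
# Print the first lines of the result without downloading it:
hadoop fs -cat outputdir/part-r-00000 | head -n 20
```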
$ hadoop fs -get outputdir/part-r-00000 .

Now we can see the content of the file (this is a small subset of the whole file):
...
leading 2
leagues 1
leaning 1
leap 1
leaped 1
learn 7
learned 1
least 23
least. 1
leave 3
leaves 3
leaving 2
lecture 1
led 4
left 9
...

The wordcount program simply counts the occurrences of every single word and outputs one word per line together with its count; note that it splits on whitespace only, which is why "least." (with the trailing period) is counted separately from "least".
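To get a feeling for what wordcount computes, here's a rough local equivalent built from standard Unix tools; this is illustrative only (the real job runs distributed across the cluster and tokenizes in Java), but on a small input it produces the same kind of word/count pairs:

```shell
# A local sketch of wordcount: split on whitespace, group identical words,
# count each group, and print "word count" pairs.
printf 'to be or not to be\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2, $1}'
```

On the sample line above this prints "be 2", "not 1", "or 1" and "to 2", one pair per line, sorted alphabetically like the part-r-00000 excerpt.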
Well, we've successfully run our first MapReduce job on our Hadoop installation!
from: http://andreaiacono.blogspot.com/2014/02/running-hadoop-example.html