To run this example we need to do some preparation. We assume that the HDFS service is running; if we haven't created a user directory yet, we have to do it now (assuming the hadoop user we're using is mapred):
$ hadoop fs -mkdir -p /user/mapred

When we pass "fs" as the first argument to the hadoop command, we're telling Hadoop to work on the HDFS filesystem; in this case, we used the "-mkdir" switch to create a new directory on HDFS.
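If HDFS is administered by a different superuser, the home directory may have to be created and handed over on the mapred user's behalf; a minimal sketch, assuming the commands are run as the HDFS superuser (whose name depends on your installation) and that the group is mrusers as in the listings below:

```shell
# Run as the HDFS superuser (name varies by installation).
# Create the home directory and hand it over to the mapred user:
hadoop fs -mkdir -p /user/mapred
hadoop fs -chown mapred:mrusers /user/mapred

# Verify ownership and permissions:
hadoop fs -ls /user
```

These commands need a running cluster, so they are only a sketch of the usual setup step.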
Now that our user has a home directory, we can create a directory that we'll use to load the input file for the MapReduce programs:
$ hadoop fs -mkdir inputdir

We can check the result by issuing an "ls" command on HDFS:
$ hadoop fs -ls
Found 1 items
drwxr-xr-x   - mapred mrusers          0 2014-02-11 22:54 inputdir

Now we can decide which file to count the words of; in this example, I'll use the text of the novella Flatland by Edwin Abbott, which is freely available for download from Project Gutenberg:
$ wget http://www.gutenberg.org/cache/epub/201/pg201.txt

Now we can put this file onto HDFS, more precisely into the inputdir directory we created a moment ago:
$ hadoop fs -put pg201.txt inputdir

The "-put" switch tells Hadoop to take the file from the local filesystem and copy it onto the HDFS filesystem. We can check that the file is really there:
$ hadoop fs -ls inputdir
Found 1 items
-rw-r--r--   1 mapred mrusers     227368 2014-02-11 22:59 inputdir/pg201.txt
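To double-check the upload we can also print the beginning of the file straight from HDFS; the "-cat" switch streams a file to stdout, like its Unix namesake (again, this needs the running cluster, so treat it as an optional sanity check):

```shell
# Stream the file from HDFS and show only its first lines:
hadoop fs -cat inputdir/pg201.txt | head -n 5
```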
Now we're ready to execute the MapReduce program. The Hadoop tarball comes with a JAR containing the WordCount example; we can launch Hadoop with these parameters:
- jar: tells Hadoop we want to execute a MapReduce program contained in a JAR
- /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar: this is the absolute path and filename of the JAR
- wordcount: tells Hadoop which of the many examples contained in the JAR to run
- inputdir: the directory on HDFS in which Hadoop can find the input file(s)
- outputdir: the directory on HDFS in which Hadoop must write the result of the program
$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount inputdir outputdir

and the output is:
14/02/11 23:16:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:20 INFO input.FileInputFormat: Total input paths to process : 1
14/02/11 23:16:20 INFO mapreduce.JobSubmitter: number of splits:1
14/02/11 23:16:21 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/02/11 23:16:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392155226604_0001
14/02/11 23:16:22 INFO impl.YarnClientImpl: Submitted application application_1392155226604_0001 to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:23 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1392155226604_0001/
14/02/11 23:16:23 INFO mapreduce.Job: Running job: job_1392155226604_0001
14/02/11 23:16:38 INFO mapreduce.Job: Job job_1392155226604_0001 running in uber mode : false
14/02/11 23:16:38 INFO mapreduce.Job:  map 0% reduce 0%
14/02/11 23:16:47 INFO mapreduce.Job:  map 100% reduce 0%
14/02/11 23:16:57 INFO mapreduce.Job:  map 100% reduce 100%
14/02/11 23:16:58 INFO mapreduce.Job: Job job_1392155226604_0001 completed successfully
14/02/11 23:16:58 INFO mapreduce.Job: Counters: 43
	File System Counters
		FILE: Number of bytes read=121375
		FILE: Number of bytes written=401139
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=227485
		HDFS: Number of bytes written=88461
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=7693
		Total time spent by all reduces in occupied slots (ms)=7383
	Map-Reduce Framework
		Map input records=4239
		Map output records=37680
		Map output bytes=366902
		Map output materialized bytes=121375
		Input split bytes=117
		Combine input records=37680
		Combine output records=8341
		Reduce input groups=8341
		Reduce shuffle bytes=121375
		Reduce input records=8341
		Reduce output records=8341
		Spilled Records=16682
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=150
		CPU time spent (ms)=5490
		Physical memory (bytes) snapshot=399077376
		Virtual memory (bytes) snapshot=1674149888
		Total committed heap usage (bytes)=314048512
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=227368
	File Output Format Counters
		Bytes Written=88461

The last part of the output is a summary of the execution of the MapReduce program; just before it, we can spot the line "Job job_1392155226604_0001 completed successfully", which tells us the MapReduce program was executed successfully. As mentioned, Hadoop wrote the output into outputdir on HDFS; let's see what's inside that directory:
$ hadoop fs -ls outputdir
Found 2 items
-rw-r--r--   1 mapred mrusers          0 2014-02-11 23:16 outputdir/_SUCCESS
-rw-r--r--   1 mapred mrusers      88461 2014-02-11 23:16 outputdir/part-r-00000

The presence of the _SUCCESS file confirms the successful execution of the job; Hadoop wrote the result of the execution into the part-r-00000 file. We can bring the file down to our machine's filesystem using the "-get" switch:
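Instead of copying the whole file locally, we could also peek at the result directly on HDFS with "-cat" (only a sketch, since it needs the running cluster):

```shell
# Print the first lines of the result without downloading it:
hadoop fs -cat outputdir/part-r-00000 | head -n 20
```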
$ hadoop fs -get outputdir/part-r-00000 .

Now we can see the content of the file (this is a small subset of the whole file):
...
leading 2
leagues 1
leaning 1
leap 1
leaped 1
learn 7
learned 1
least 23
least. 1
leave 3
leaves 3
leaving 2
lecture 1
led 4
left 9
...

The wordcount program simply counts the occurrences of every single word and outputs one word per line together with its count; note that it splits on whitespace only, which is why "least." (with the trailing period) is counted separately from "least".
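To get a feeling for what wordcount computes, here's a rough local equivalent built from standard Unix tools; this is illustrative only (the real job runs distributed across the cluster and tokenizes in Java), but on a small input it produces the same kind of word/count pairs:

```shell
# A local sketch of wordcount: split on whitespace, group identical words,
# count each group, and print "word count" pairs.
printf 'to be or not to be\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2, $1}'
```

On the sample line above this prints "be 2", "not 1", "or 1" and "to 2", one pair per line, sorted alphabetically like the part-r-00000 excerpt.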
Well, we've successfully run our first MapReduce job on our Hadoop installation!
from: http://andreaiacono.blogspot.com/2014/02/running-hadoop-example.html