Hadoop's Mapper Class

Source:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
      implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
The main methods are setup (called once per task), map (called once for each key/value pair), cleanup (called once per task), and run (which expert users can override for more complete control over the execution of the Mapper).
There is also an inner class, Context, which implements MapContext.
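The run() method above is a textbook template method: it fixes the call order (setup once, then map for each record, then cleanup in a finally block), while subclasses override the individual steps. A minimal self-contained sketch of that control flow, using plain-Java stand-ins rather than the real Hadoop classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in mirroring Mapper's template method: run() drives setup/map/cleanup.
abstract class MiniMapper<K, V> {
    protected void setup() { /* default: nothing */ }
    protected abstract void map(K key, V value);
    protected void cleanup() { /* default: nothing */ }

    // Mirrors Mapper.run(): setup once, map per record, cleanup in finally.
    public void run(Iterator<K> keys, Iterator<V> values) {
        setup();
        try {
            while (keys.hasNext()) {
                map(keys.next(), values.next());
            }
        } finally {
            cleanup();
        }
    }
}

public class MiniMapperDemo {
    // Records the call order so the template-method flow is visible.
    static List<String> trace(List<String> keys, List<String> vals) {
        final List<String> calls = new ArrayList<>();
        MiniMapper<String, String> m = new MiniMapper<String, String>() {
            @Override protected void setup() { calls.add("setup"); }
            @Override protected void map(String k, String v) { calls.add("map:" + k); }
            @Override protected void cleanup() { calls.add("cleanup"); }
        };
        m.run(keys.iterator(), vals.iterator());
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(trace(List.of("a", "b"), List.of("1", "2")));
    }
}
```

Because cleanup() sits in a finally block, it runs even if map() throws, which is why per-task resources (counters, caches, open files) are conventionally released there.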

MapContext itself declares only one method:
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit();
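A common use of getInputSplit() is to find out which file the current map task is reading, by downcasting the split to FileSplit in setup(). The sketch below uses stand-in types to stay self-contained; the real code would use org.apache.hadoop.mapreduce.lib.input.FileSplit and call context.getInputSplit() on the Mapper's Context:

```java
// Stand-ins for the Hadoop types, so the sketch compiles on its own.
abstract class Split { }

class FileSplitStandIn extends Split {
    private final String path;
    FileSplitStandIn(String path) { this.path = path; }
    String getPath() { return path; }  // the real FileSplit returns a Hadoop Path
}

public class SplitAwareMapper {
    private String currentFile;

    // Mirrors Mapper.setup(Context): when the input comes from FileInputFormat,
    // the split handed to the task is a FileSplit, so the downcast is safe there.
    void setup(Split inputSplit) {
        if (inputSplit instanceof FileSplitStandIn) {
            currentFile = ((FileSplitStandIn) inputSplit).getPath();
        }
    }

    String currentFile() { return currentFile; }

    public static void main(String[] args) {
        SplitAwareMapper m = new SplitAwareMapper();
        m.setup(new FileSplitStandIn("/data/input/part-00000"));
        System.out.println(m.currentFile());
    }
}
```

This pattern is handy when one job reads several input files and the map logic needs to tag each record with its source file.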


Next, the InputSplit class. It is abstract; first, the class javadoc:
/**
 * <code>InputSplit</code> represents the data to be processed by an
 * individual {@link Mapper}.
 *
 * <p>Typically, it presents a byte-oriented view on the input and is the
 * responsibility of {@link RecordReader} of the job to process this and present
 * a record-oriented view.
 */
Source:

public abstract class InputSplit {
  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   *
   * @return a new array of the nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract
    String[] getLocations() throws IOException, InterruptedException;

  /**
   * Gets info about which nodes the input split is stored on and how it is
   * stored at each location.
   *
   * @return list of <code>SplitLocationInfo</code>s describing how the split
   * data is stored at each location. A null value indicates that all the
   * locations have the data stored on disk.
   * @throws IOException
   */
  @Evolving
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    return null;
  }
}
The main methods are:
    public abstract long getLength (the size of the split, so that splits can be sorted by size),
    public abstract String[] getLocations (the names of the nodes where the split's data would be local),
    public SplitLocationInfo[] getLocationInfo (which nodes store the split and how it is stored at each location).
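The getLength() javadoc explains why size matters: splits can be sorted by size so that the largest (longest-running) map tasks are scheduled first. A small self-contained illustration of that sort, using a stand-in class rather than the real InputSplit:

```java
import java.util.Arrays;

// Stand-in mirroring InputSplit.getLength(); the real class also declares
// getLocations() and getLocationInfo().
abstract class SizedSplit {
    abstract long getLength();
}

class ByteRangeSplit extends SizedSplit {
    private final long length;
    ByteRangeSplit(long length) { this.length = length; }
    @Override long getLength() { return length; }
}

public class SplitSortDemo {
    // Sort splits by size, largest first, so the biggest tasks start earliest.
    static long[] sortedLengths(long... lengths) {
        SizedSplit[] splits = new SizedSplit[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            splits[i] = new ByteRangeSplit(lengths[i]);
        }
        Arrays.sort(splits, (a, b) -> Long.compare(b.getLength(), a.getLength()));
        long[] out = new long[splits.length];
        for (int i = 0; i < splits.length; i++) out[i] = splits[i].getLength();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedLengths(64, 128, 32)));
    }
}
```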
Press Ctrl+H (in the IDE) to view the class hierarchy:
[class hierarchy screenshot]

Among the implementations of InputSplit is the FileSplit class (the one under the lib package).
As usual, start with the class javadoc:
/** A section of an input file. Returned by {@link
* InputFormat#getSplits(JobContext)} and passed to
* {@link InputFormat#createRecordReader(InputSplit,TaskAttemptContext)}. */

This tells us where the class comes from (InputFormat#getSplits(JobContext)) and where it goes (InputFormat#createRecordReader(InputSplit, TaskAttemptContext)).
Here we cover only the FileSplit class itself; InputFormat is covered in the next section.
public class FileSplit extends InputSplit implements Writable
The class has a few fields, a no-argument constructor, and two parameterized constructors, which we can skip for now.
It also has several overridden methods we will skip; the ones of interest are:
/** The file containing this split's data. */
public Path getPath() { return file; }

/** The position of the first byte in the file to process. */
public long getStart() { return start; }

/** The number of bytes in the file to process. */
@Override
public long getLength() { return length; }
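getStart() and getLength() together define a contiguous byte range [start, start + length) of the file, and FileInputFormat carves a file into such ranges. A hedged sketch of that arithmetic (here splitSize is assumed equal to the block size, which is the common default; the real getSplits() also applies min/max split-size settings):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRanges {
    // Compute (start, length) pairs covering a file of fileSize bytes,
    // splitSize bytes per split; the last split may be shorter.
    static List<long[]> ranges(long fileSize, long splitSize) {
        List<long[]> out = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            long length = Math.min(splitSize, fileSize - start);
            out.add(new long[]{start, length});
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. a 300 MB file with 128 MB splits yields 3 splits
        for (long[] r : ranges(300L << 20, 128L << 20)) {
            System.out.println("start=" + r[0] + " length=" + r[1]);
        }
    }
}
```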


As you can see, FileSplit also gives access to the file's Path; from the Path you can obtain a FileSystem, and from that the input/output streams FSDataInputStream and FSDataOutputStream.

That covers the main contents of Mapper.
