Hadoop's Mapper Class

Source:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
      implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
The main methods are setup (called once per task), map (called once for each key/value pair), cleanup (called once per task), and run (which expert users can override for more complete control over the execution of the Mapper).
There is also an inner class, Context, which implements MapContext.
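The run() method above is a textbook template method: it fixes the call order (setup once, then map for each record, then cleanup in a finally block), while subclasses override the individual steps. A minimal self-contained sketch of that control flow, using plain-Java stand-ins rather than the real Hadoop classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in mirroring Mapper's template method: run() drives setup/map/cleanup.
abstract class MiniMapper<K, V> {
    protected void setup() { /* default: nothing */ }
    protected abstract void map(K key, V value);
    protected void cleanup() { /* default: nothing */ }

    // Mirrors Mapper.run(): setup once, map per record, cleanup in finally.
    public void run(Iterator<K> keys, Iterator<V> values) {
        setup();
        try {
            while (keys.hasNext()) {
                map(keys.next(), values.next());
            }
        } finally {
            cleanup();
        }
    }
}

public class MiniMapperDemo {
    // Records the call order so the template-method flow is visible.
    static List<String> trace(List<String> keys, List<String> vals) {
        final List<String> calls = new ArrayList<>();
        MiniMapper<String, String> m = new MiniMapper<String, String>() {
            @Override protected void setup() { calls.add("setup"); }
            @Override protected void map(String k, String v) { calls.add("map:" + k); }
            @Override protected void cleanup() { calls.add("cleanup"); }
        };
        m.run(keys.iterator(), vals.iterator());
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(trace(List.of("a", "b"), List.of("1", "2")));
    }
}
```

Because cleanup() sits in a finally block, it runs even if map() throws, which is why per-task resources (counters, caches, open files) are conventionally released there.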

MapContext itself declares only one method:
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit();
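A common use of getInputSplit() is to find out which file the current map task is reading, by downcasting the split to FileSplit in setup(). The sketch below uses stand-in types to stay self-contained; the real code would use org.apache.hadoop.mapreduce.lib.input.FileSplit and call context.getInputSplit() on the Mapper's Context:

```java
// Stand-ins for the Hadoop types, so the sketch compiles on its own.
abstract class Split { }

class FileSplitStandIn extends Split {
    private final String path;
    FileSplitStandIn(String path) { this.path = path; }
    String getPath() { return path; }  // the real FileSplit returns a Hadoop Path
}

public class SplitAwareMapper {
    private String currentFile;

    // Mirrors Mapper.setup(Context): when the input comes from FileInputFormat,
    // the split handed to the task is a FileSplit, so the downcast is safe there.
    void setup(Split inputSplit) {
        if (inputSplit instanceof FileSplitStandIn) {
            currentFile = ((FileSplitStandIn) inputSplit).getPath();
        }
    }

    String currentFile() { return currentFile; }

    public static void main(String[] args) {
        SplitAwareMapper m = new SplitAwareMapper();
        m.setup(new FileSplitStandIn("/data/input/part-00000"));
        System.out.println(m.currentFile());
    }
}
```

This pattern is handy when one job reads several input files and the map logic needs to tag each record with its source file.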


Next, the InputSplit class. It is abstract; first, the class javadoc:
/**
 * <code>InputSplit</code> represents the data to be processed by an
 * individual {@link Mapper}.
 *
 * <p>Typically, it presents a byte-oriented view on the input and is the
 * responsibility of {@link RecordReader} of the job to process this and present
 * a record-oriented view.
 */
Source:

public abstract class InputSplit {
  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   *
   * @return a new array of the nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract
    String[] getLocations() throws IOException, InterruptedException;

  /**
   * Gets info about which nodes the input split is stored on and how it is
   * stored at each location.
   *
   * @return list of <code>SplitLocationInfo</code>s describing how the split
   * data is stored at each location. A null value indicates that all the
   * locations have the data stored on disk.
   * @throws IOException
   */
  @Evolving
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    return null;
  }
}
The main methods are:
    public abstract long getLength (the size of the split, so that splits can be sorted by size),
    public abstract String[] getLocations (the names of the nodes where the split's data would be local),
    public SplitLocationInfo[] getLocationInfo (which nodes store the split and how it is stored at each location).
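The getLength() javadoc explains why size matters: splits can be sorted by size so that the largest (longest-running) map tasks are scheduled first. A small self-contained illustration of that sort, using a stand-in class rather than the real InputSplit:

```java
import java.util.Arrays;

// Stand-in mirroring InputSplit.getLength(); the real class also declares
// getLocations() and getLocationInfo().
abstract class SizedSplit {
    abstract long getLength();
}

class ByteRangeSplit extends SizedSplit {
    private final long length;
    ByteRangeSplit(long length) { this.length = length; }
    @Override long getLength() { return length; }
}

public class SplitSortDemo {
    // Sort splits by size, largest first, so the biggest tasks start earliest.
    static long[] sortedLengths(long... lengths) {
        SizedSplit[] splits = new SizedSplit[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            splits[i] = new ByteRangeSplit(lengths[i]);
        }
        Arrays.sort(splits, (a, b) -> Long.compare(b.getLength(), a.getLength()));
        long[] out = new long[splits.length];
        for (int i = 0; i < splits.length; i++) out[i] = splits[i].getLength();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedLengths(64, 128, 32)));
    }
}
```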
Press Ctrl+H (in the IDE) to view the class hierarchy:
[class hierarchy screenshot]

Among the implementations of InputSplit is the FileSplit class (the one under the lib package).
As usual, start with the class javadoc:
/** A section of an input file. Returned by {@link
* InputFormat#getSplits(JobContext)} and passed to
* {@link InputFormat#createRecordReader(InputSplit,TaskAttemptContext)}. */

This tells us where the class comes from (InputFormat#getSplits(JobContext)) and where it goes (InputFormat#createRecordReader(InputSplit, TaskAttemptContext)).
Here we cover only the FileSplit class itself; InputFormat is covered in the next section.
public class FileSplit extends InputSplit implements Writable
The class has a few fields, a no-argument constructor, and two parameterized constructors, which we can skip for now.
It also has several overridden methods we will skip; the ones of interest are:
/** The file containing this split's data. */
public Path getPath() { return file; }

/** The position of the first byte in the file to process. */
public long getStart() { return start; }

/** The number of bytes in the file to process. */
@Override
public long getLength() { return length; }
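getStart() and getLength() together define a contiguous byte range [start, start + length) of the file, and FileInputFormat carves a file into such ranges. A hedged sketch of that arithmetic (here splitSize is assumed equal to the block size, which is the common default; the real getSplits() also applies min/max split-size settings):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRanges {
    // Compute (start, length) pairs covering a file of fileSize bytes,
    // splitSize bytes per split; the last split may be shorter.
    static List<long[]> ranges(long fileSize, long splitSize) {
        List<long[]> out = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            long length = Math.min(splitSize, fileSize - start);
            out.add(new long[]{start, length});
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. a 300 MB file with 128 MB splits yields 3 splits
        for (long[] r : ranges(300L << 20, 128L << 20)) {
            System.out.println("start=" + r[0] + " length=" + r[1]);
        }
    }
}
```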


As you can see, FileSplit also gives access to the file's Path; from the Path you can obtain a FileSystem, and from that the input/output streams FSDataInputStream and FSDataOutputStream.

That covers the main contents of Mapper.
