如何从 pig 脚本中运行 Mapreduce答案

【问题标题】：How to run Mapreduce from within a pig script如何从 pig 脚本中运行 Mapreduce
【发布时间】：2015-09-07 11:44:52
【问题描述】：

我想了解如何在 pig 脚本中集成调用 mapreduce 作业。

我提到了链接 https://wiki.apache.org/pig/NativeMapReduce

但我不知道该怎么做，因为它会如何理解我的映射器或减速器代码。解释不是很清楚。

如果有人可以举例说明，那将是非常有帮助的。

提前致谢，干杯:)

【问题讨论】：

标签： hadoop mapreduce apache-pig

【解决方案1】：

来自pig documentation的示例

A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' 
    AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;

在上面的示例中，pig 会将来自A 的输入数据存储到inputDir 并从outputDir 加载作业的输出数据。

另外，HDFS 中有一个名为 wordcount.jar 的 jar，其中有一个名为 org.myorg.WordCount 的类，主类负责设置映射器和缩减器、输入和输出等。

您也可以通过 hadoop jar mymr.jar org.myorg.WordCount inputDir outputDir 调用 mapreduce 作业。

【讨论】：

嗨@Fred，有一个问题，我能够执行MR作业，但最大的问题是它首先将输入复制到文件夹“inputDir”，然后才，它正在执行 MapReduce 作业（此处为 Wordcount.jar）。因此，复制大数据既耗时又效率低下。您能否建议一个替代方案来不复制数据并仍然使用 MapReduce 代码？？
我不确定 STORE A INTO 'inputDir' 是否是强制性的。如果不是，请跳过它。如果是，只需将一些小的虚拟数据复制到某个位置，但从 mapreduce 程序中的真实/大型输入中读取。
感谢@Fred，虽然我无法避免使用存储功能，但它对我来说很有效。想知道如果我已经实现了我自己的猪阅读器并使用加载命令通过该加载器读取数据，那么这种技术可能不起作用，如果可以找到任何使用猪加载器向 Mapreduce 提供数据的替代方法，那么它将是一个奖励.感谢您的帮助。！！

【解决方案2】：

默认情况下 pig 会预测 map/reduce 程序。然而 hadoop 带有默认的 mapper/reducer 实现；这是 Pig 使用的 - 当 map reduce 类未被识别时。

Further Pig 使用 Hadoop 的属性及其特定属性。尝试设置，在pig脚本中的属性下面，它应该也被Pig选中。

SET mapred.mapper.class="<fully qualified classname for mapper>"
SET mapred.reducer.class="<fully qualified classname for reducer>"

同样可以使用-Dmapred.mapper.class 选项进行设置。综合名单here 根据您的 hadoop 安装，属性也可能是：

mapreduce.map.class
mapreduce.reduce.class

仅供参考...

hadoop.mapred 已被弃用。 0.20.1 之前的版本使用 mapred。之后的版本使用 mapreduce。

另外pig有自己的一组属性，可以使用命令pig -help properties查看

e.g. in my pig installation, below are the properties:

The following properties are supported:
    Logging:
        verbose=true|false; default is false. This property is the same as -v switch
        brief=true|false; default is false. This property is the same as -b switch
        debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
        aggregate.warning=true|false; default is true. If true, prints count of warnings
            of each type rather than logging each warning.
    Performance tuning:
        pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
            Note that this memory is shared across all large bags used by the application.
        pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
            Specifies the fraction of heap available for the reducer to perform the join.
        pig.exec.nocombiner=true|false; default is false.
            Only disable combiner as a temporary workaround for problems.
        opt.multiquery=true|false; multiquery is on by default.
            Only disable multiquery as a temporary workaround for problems.
        opt.fetch=true|false; fetch is on by default.
            Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
        pig.tmpfilecompression=true|false; compression is off by default.
            Determines whether output of intermediate jobs is compressed.
        pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
            Used in conjunction with pig.tmpfilecompression. Defines compression type.
        pig.noSplitCombination=true|false. Split combination is on by default.
            Determines if multiple small files are combined into a single map.
        pig.exec.mapPartAgg=true|false. Default is false.
            Determines if partial aggregation is done within map phase,
            before records are sent to combiner.
        pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
            If the in-map partial aggregation does not reduce the output num records
            by this factor, it gets disabled.
    Miscellaneous:
        exectype=mapreduce|local; default is mapreduce. This property is the same as -x switch
        pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
        udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
        stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
        pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
            Determines the timezone used to handle datetime datatype and UDFs. Additionally, any Hadoop property can be specified.

【讨论】：