【Question title】: hadoop sequence file collection
【Posted】: 2013-12-10 21:40:21
【Question】:

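How can a reducer (with a Text key and an Iterable of MapWritable values) write all of its maps to a sequence file so that they stay grouped by key? For example, suppose the mappers send records to the reducer like this: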

<"dog", {<"name", "Fido">, <"pure bred?", "false">, <"type", "mutt">}>
<"cat", {<"name", "Felix">, <"color", "black">, <"origin", "film">, <"date", "1919">}>
<"dog", {<"name", "Lassie">, <"type", "collie">, <"origin", "short story">}>

I would like the sequence file to be written as:

key = "dog"
value =  {
            {<"name", "Fido">, <"pure bred?", "false">, <"type", "mutt">},
            {<"name", "Lassie">, <"type", "collie">, <"origin", "short story">}
         }

key = "cat"
value = {
            {<"name", "Felix">, <"color", "black">, <"origin", "film">, <"date", "1919">}
        }

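I am guessing I need to create a custom output value class that implements Writable, but I am not sure how to go about it because, as far as I can tell, collections don't really work with sequence files. I want to do this so that the next map/reduce stage reads all the maps associated with each key as a single unit.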

TIA,

【Question discussion】:

    Tags: hadoop sequencefile


    【Solution 1】:

    As you suggest, you can create a custom Writable that extends ArrayWritable:

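    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.MapWritable;

    // An array of MapWritable; the no-arg constructor fixes the element type.
    public class MapWritableArray extends ArrayWritable {
        public MapWritableArray() {
            super(MapWritable.class);
        }
    }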
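To make the "collection as a value" idea concrete, here is a minimal plain-Java sketch of a length-prefixed encoding: an element count first, then each element in turn. As far as I know that is also the general shape ArrayWritable uses, but treat this as an illustration only. `GroupedMapsCodec` and its methods are hypothetical names, it uses `java.io` rather than any Hadoop class, and the byte layout is not Hadoop's actual wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop.
class GroupedMapsCodec {

    // Writes: number of maps, then for each map its entry count and its key/value strings.
    static byte[] write(List<Map<String, String>> maps) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(maps.size());
            for (Map<String, String> map : maps) {
                out.writeInt(map.size());
                for (Map.Entry<String, String> entry : map.entrySet()) {
                    out.writeUTF(entry.getKey());
                    out.writeUTF(entry.getValue());
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the same layout back, using the counts to know where each map ends.
    static List<Map<String, String>> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int mapCount = in.readInt();
            List<Map<String, String>> maps = new ArrayList<>(mapCount);
            for (int i = 0; i < mapCount; i++) {
                int entryCount = in.readInt();
                Map<String, String> map = new LinkedHashMap<>();
                for (int j = 0; j < entryCount; j++) {
                    String key = in.readUTF();
                    String value = in.readUTF();
                    map.put(key, value);
                }
                maps.add(map);
            }
            return maps;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The two "dog" records from the question.
    static List<Map<String, String>> sampleDogs() {
        Map<String, String> fido = new LinkedHashMap<>();
        fido.put("name", "Fido");
        fido.put("pure bred?", "false");
        fido.put("type", "mutt");
        Map<String, String> lassie = new LinkedHashMap<>();
        lassie.put("name", "Lassie");
        lassie.put("type", "collie");
        lassie.put("origin", "short story");
        List<Map<String, String>> dogs = new ArrayList<>();
        dogs.add(fido);
        dogs.add(lassie);
        return dogs;
    }

    public static void main(String[] args) {
        byte[] data = write(sampleDogs());
        System.out.println(read(data).equals(sampleDogs()));
    }
}
```

Because the counts travel with the data, the reader gets all maps for a key back as one unit, which is what the next map/reduce stage needs.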
    

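    Then, in your reducer, you need to accumulate the Iterable of MapWritable values into an array (remember to copy each value, because the contents of the underlying object change with every iteration). Something like this (completely untested, not compiled, and not optimized):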

    // Needs java.util.ArrayList, org.apache.hadoop.io.MapWritable
    // and org.apache.hadoop.util.ReflectionUtils.
    MapWritableArray mapWritableArray = new MapWritableArray();
    ArrayList<MapWritable> valList = new ArrayList<MapWritable>();
    for (MapWritable value : values) {
        // The value object is reused between iterations, so store a copy.
        MapWritable copy = ReflectionUtils.newInstance(MapWritable.class, context.getConfiguration());
        ReflectionUtils.copy(context.getConfiguration(), value, copy);
        valList.add(copy);
    }
    mapWritableArray.set(valList.toArray(new MapWritable[0]));
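To see why the copy matters, here is a small plain-Java sketch with no Hadoop dependency. `reusingValues` is a hypothetical stand-in for the reducer's value iterator, on the assumption that it hands back one shared object that gets refilled at each step, which is the behaviour the note about copying refers to. `ReuseDemo` and its method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop.
class ReuseDemo {

    // Returns an Iterable whose iterator refills ONE shared map on every next() call.
    static Iterable<Map<String, String>> reusingValues(List<Map<String, String>> records) {
        Map<String, String> shared = new HashMap<>();
        return () -> new Iterator<Map<String, String>>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < records.size();
            }

            @Override
            public Map<String, String> next() {
                shared.clear();
                shared.putAll(records.get(index++));
                return shared;
            }
        };
    }

    // Buggy: stores the shared reference, so every list entry is the same object.
    static List<Map<String, String>> collectWithoutCopy(Iterable<Map<String, String>> values) {
        List<Map<String, String>> collected = new ArrayList<>();
        for (Map<String, String> value : values) {
            collected.add(value);
        }
        return collected;
    }

    // Correct: snapshots the contents before the iterator overwrites them.
    static List<Map<String, String>> collectWithCopy(Iterable<Map<String, String>> values) {
        List<Map<String, String>> collected = new ArrayList<>();
        for (Map<String, String> value : values) {
            collected.add(new HashMap<>(value));
        }
        return collected;
    }

    // Two "dog" records, reduced to the name field for brevity.
    static List<Map<String, String>> dogs() {
        Map<String, String> fido = new HashMap<>();
        fido.put("name", "Fido");
        Map<String, String> lassie = new HashMap<>();
        lassie.put("name", "Lassie");
        List<Map<String, String>> records = new ArrayList<>();
        records.add(fido);
        records.add(lassie);
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, String>> noCopy = collectWithoutCopy(reusingValues(dogs()));
        List<Map<String, String>> copied = collectWithCopy(reusingValues(dogs()));
        System.out.println("without copy: " + noCopy.get(0).get("name") + ", " + noCopy.get(1).get("name"));
        System.out.println("with copy:    " + copied.get(0).get("name") + ", " + copied.get(1).get("name"));
    }
}
```

Without the copy, both list entries point at the same object and show only the last record; with the copy, each record survives.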
    

    【Discussion】:
