Best way to distribute a small lookup file using Distributed Cache: answers

【Question title】: Best way to distribute a small lookup file using Distributed Cache
【Posted】: 2014-09-10 08:36:11
【Question body】:

What is the best way to read data from the distributed cache?

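import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Code 1: load the cached file once per task, in setup()
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    ArrayList<String> globalFreq = new ArrayList<String>();
    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]); // keep only the first token of each line
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Access "globalFreq" here and do further processing
    }
}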

or

// Code 2: re-read the cached file inside map(), i.e. once per input record
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    URI[] cacheFiles;
    FileSystem fs; // kept as a field so that map() can use it

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }
}

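So if we do it as in code 2, does that mean that, if we have say 5 map tasks, every map task reads the same copy of the data? Written this way, the data is read multiple times, once per map task (5 times). Am I right?

Code 1: because the loading is written in setup, the data is read once, and map then accesses the global data.

What is the right way to write distributed cache code?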

【Question comments】:

Tags: java caching hadoop mapreduce distributed-cache

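【Answer 1】:

Do as much as you can in the setup method: it is called once by each mapper, and whatever it loads is then cached for every record passed to that mapper. Parsing your data for each record is an overhead you can avoid, because nothing in it depends on the key, value and context variables you receive in the map method.

The setup method is called once per map task, whereas map is called for every record processed by that task (which can obviously be a very high number).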
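To make the difference in call counts concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `SetupVsMapDemo`, its methods and the sample lines are all hypothetical stand-ins: `parseLookup` plays the role of reading the cached file, the outer loop plays the role of map tasks, and the inner loop plays the role of records. It is only an illustration of the counting argument, not how Hadoop schedules tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SetupVsMapDemo {

    /** Counts how many times the lookup data has been parsed. */
    static int parseCount = 0;

    /** Stand-in for reading the cached file: keeps the first space-separated token of each line. */
    static List<String> parseLookup(List<String> lines) {
        parseCount++;
        List<String> firstTokens = new ArrayList<>();
        for (String line : lines) {
            firstTokens.add(line.split(" ")[0]);
        }
        return firstTokens;
    }

    /** Code 1 pattern: parse once per task (as in setup), reuse the result for every record. */
    static int parsesWhenLoadedInSetup(int tasks, int recordsPerTask, List<String> lines) {
        parseCount = 0;
        for (int t = 0; t < tasks; t++) {
            List<String> globalFreq = parseLookup(lines);     // what setup() would do
            for (int r = 0; r < recordsPerTask; r++) {
                globalFreq.contains("a");                      // what map() would do: lookup only
            }
        }
        return parseCount;
    }

    /** Code 2 pattern: parse again for every record (as in map). */
    static int parsesWhenLoadedInMap(int tasks, int recordsPerTask, List<String> lines) {
        parseCount = 0;
        for (int t = 0; t < tasks; t++) {
            for (int r = 0; r < recordsPerTask; r++) {
                List<String> globalFreq = parseLookup(lines); // re-parsed on each map() call
                globalFreq.contains("a");
            }
        }
        return parseCount;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a 10", "b 20", "c 30");
        System.out.println(parsesWhenLoadedInSetup(5, 1000, lines)); // prints 5
        System.out.println(parsesWhenLoadedInMap(5, 1000, lines));   // prints 5000
    }
}
```

With 5 tasks of 1000 records each, the setup-style loop parses the lookup data 5 times, while the map-style loop parses it 5000 times. So under code 2 the file would be read once per record, not once per task.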

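【Comments】:

So it is better to use code 1, right? Is the second one a direct way? As the Distributed Cache documentation says, "each node will have access to a copy of the data" hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/…
I would definitely go for the first option: you cannot avoid the fact that each task has to parse the cache contents once, but once that is done you avoid parsing them again for every record.
What happens if the cached data is too large and cannot be held in a list or similar structure? There may be cases where a large amount of data has to be loaded. For example (correct me if I am wrong), the KNN algorithm: its model is the input data itself. At prediction time we need to load the model data, and in such cases we cannot rely on code 1 because it may exhaust the heap space.
If your data is too large to fit in the D-Cache, then another option is to use a different type of join (e.g. map-side or reduce-side) rather than a replicated join.
But take an algorithm like KNN: we need to find the distance from one input line to all of the model data, so I do not think this can be turned into a map-side join. Otherwise, for the model data, we would need to emit all of the data under the same key.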
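The thread leaves the KNN case open. As a rough illustration of the memory side of that discussion only, here is a sketch of an approach that no commenter proposed: stream the model line by line and keep just the k smallest distances in a bounded max-heap, so memory use depends on k rather than on the model size. The class `StreamingKnn`, the comma-separated line format and the use of squared Euclidean distance are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class StreamingKnn {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Parses one comma-separated line such as "1.0,2.5" into a vector (assumed format). */
    static double[] parse(String line) {
        String[] parts = line.trim().split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
    }

    /**
     * Streams the model and returns the k smallest squared distances to the query,
     * in ascending order. At most k + 1 distances are held in memory at any time.
     */
    static List<Double> kSmallestSquaredDistances(BufferedReader model, double[] query, int k) {
        // Max-heap: the largest of the current candidates sits on top and is evicted first.
        PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
        model.lines().forEach(line -> {
            if (line.trim().isEmpty()) {
                return; // skip blank lines
            }
            heap.add(squaredDistance(parse(line), query));
            if (heap.size() > k) {
                heap.poll(); // drop the farthest candidate
            }
        });
        List<Double> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        BufferedReader model = new BufferedReader(new StringReader("0,0\n3,4\n1,1\n10,10\n"));
        System.out.println(kSmallestSquaredDistances(model, new double[] {0, 0}, 2)); // prints [0.0, 2.0]
    }
}
```

The trade-off is the one the answer warns about: the model has to be re-read for every query record, so this swaps heap pressure for repeated I/O and only makes sense when the model truly cannot be held in memory.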