【发布时间】:2014-09-10 08:36:11
【问题描述】:
获取分布式缓存数据的最佳方式是什么?
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
ArrayList<String> globalFreq = new ArrayList<String>();
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//Accessing "globalFreq" data .and do further processing
}
或
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
URI[] cacheFiles
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
cacheFiles = DistributedCache.getCacheFiles(conf);
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
ArrayList<String> globalFreq = new ArrayList<String>();
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
所以如果我们像 (code 2) 那样做,这是否意味着 Say we have 5 map task every map task reads the same copy of the data 。在为每个地图这样写时,任务会多次读取数据,我是对的(5 次)吗?
代码 1:因为它是在 setup 中编写的,所以它被读取一次,并且在 map 中访问全局数据。
分布式缓存的正确写法是什么。
【问题讨论】:
标签: java caching hadoop mapreduce distributed-cache