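[Posted]: 2021-07-10 02:28:03
[Question]:
I am currently working on a Hadoop project in Java. My goal is to write a MapReduce job that computes the line frequency of each word. That is, instead of outputting the exact number of times a word occurs in the input file, it should output the number of lines the word appears on. If a word occurs more than once on a line, it should be counted only once, because we are counting lines, not occurrences. I have a basic MapReduce word count job working, which I post below, but I am a bit lost on how to count only the line frequency of each word rather than the full word count. Any help would be greatly appreciated, thank you.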
Mapper (MapWordCount)
public class MapWordCount extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private Text wordToken = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        StringTokenizer tokens = new StringTokenizer(value.toString(), "[_|$#0123456789<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']"); // Dividing String into tokens
        while (tokens.hasMoreTokens())
        {
            wordToken.set(tokens.nextToken());
            context.write(wordToken, new IntWritable(1));
        }
    }
}
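Editor's note: the "count once per line" rule described in the question can be applied on the map side, since each call to map() receives one line. Below is a minimal sketch in plain Java (no Hadoop dependency) of collecting the distinct tokens of a single line. The class name LineTokens and the method distinctTokens are hypothetical names introduced for illustration. The delimiter characters are those of the mapper above with whitespace added, which is an assumption so that space-separated words become separate tokens. Note that StringTokenizer treats its second argument as a set of delimiter characters, not as a regular expression.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

class LineTokens {
    // Delimiter characters from the mapper above, plus whitespace (assumption).
    private static final String DELIMITERS =
            " \t\n\r\f_|$#0123456789<>^=[]*/\\,;.-:()?!\"'";

    /** Returns each token of the line once, in order of first appearance. */
    static Set<String> distinctTokens(String line) {
        Set<String> seen = new LinkedHashSet<>();
        StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
        while (tokens.hasMoreTokens()) {
            seen.add(tokens.nextToken()); // a repeated token is ignored by the set
        }
        return seen;
    }

    public static void main(String[] args) {
        // prints [the, cat, and, other]
        System.out.println(distinctTokens("the cat and the other cat"));
    }
}
```

Inside a mapper, one would presumably loop over the returned set and call context.write(word, 1) once per element, so a summing reducer like the one in the question then yields line counts instead of occurrence counts.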
Reducer (ReduceWordCount)
public class ReduceWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable count = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int valueSum = 0;
        for (IntWritable val : values)
        {
            valueSum += val.get();
        }
        count.set(valueSum);
        context.write(key, count);
    }
}
Driver code
public class WordCount {
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        String[] pathArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (pathArgs.length < 2)
        {
            System.err.println("MR Project Usage: wordcount <input-path> [...] <output-path>");
            System.exit(2);
        }
        Job wcJob = Job.getInstance(conf, "MapReduce WordCount");
        wcJob.setJarByClass(WordCount.class);
        wcJob.setMapperClass(MapWordCount.class);
        wcJob.setCombinerClass(ReduceWordCount.class);
        wcJob.setReducerClass(ReduceWordCount.class);
        wcJob.setOutputKeyClass(Text.class);
        wcJob.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < pathArgs.length - 1; ++i)
        {
            FileInputFormat.addInputPath(wcJob, new Path(pathArgs[i]));
        }
        FileOutputFormat.setOutputPath(wcJob, new Path(pathArgs[pathArgs.length - 1]));
        System.exit(wcJob.waitForCompletion(true) ? 0 : 1);
    }
}
[Comments]:
-
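You may be overthinking this. You only need to split the file/document into lines (rather than words), then process/map one line at a time and keep a running count of lines. A very simple approach (in pseudocode) might look something like
for (int i = 0; i < document.length; i++) { if (document[i].contains(yourWord)) wordCount++; }
Obviously you would use a word map for this and trim/ignore any duplicates on a line, but the same logic applies.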
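Editor's note: the "word map" idea in this comment can be sketched in plain Java, outside Hadoop, to check the logic on a small input. This is only an illustration under stated assumptions: the class name LineFrequency and the method lineFrequency are hypothetical, and words are split on whitespace only for simplicity (the question's own tokenizer uses a different delimiter set).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class LineFrequency {
    /** Counts, for every word, the number of lines it appears on. */
    static Map<String, Integer> lineFrequency(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Collect the words of this line into a set so repeats collapse.
            Set<String> wordsOnLine = new HashSet<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    wordsOnLine.add(word);
                }
            }
            // Each distinct word contributes exactly 1 for this line.
            for (String word : wordsOnLine) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c", "a a a");
        System.out.println(lineFrequency(lines)); // prints {a=2, b=2, c=1}
    }
}
```

The inner set plays the role of the per-line deduplication, and the merge step plays the role of the summing reducer. The word "a" occurs five times in the sample input but is reported as 2, because it appears on two lines.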
Tags: java hadoop mapreduce bigdata