Filtering key-value pairs in a Hadoop reducer function in R (answers)

【Question title】: Filtering the key value pair in hadoop reducer function in R
【Posted】: 2015-10-15 03:05:44
【Question description】:

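I would like to know how to put a condition in the Hadoop reducer function to filter out key-value pairs. For example, in the word count example given below, how can I get only those words whose count is greater than some threshold, say 3?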

library(rmr2)
library(rhdfs)

# initiate rhdfs package
hdfs.init()

map <- function(k,lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}

reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}

## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')

## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')

## Submit job
out <- wordcount(hdfs.data, hdfs.out) 

## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')

head(results.df)

【Comments】:

Have you tried adding an if statement to your reduce function? An if statement can go a long way.
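I don't really see how to put an if condition in the reduce function. I did have the same idea and tried something like cnt threshold){ keyval(word, counts) }, but I doubt that it would sum only the counts that belong to the same word.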
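Now it is my turn not to understand. There is only one word in each reducer call, so it only ever handles that one word. Grouping identical words is taken care of in the shuffle.
What I mean is: if I compute sum(counts), won't it sum all the counts regardless of which word they belong to? How are the word and its corresponding frequency kept together?
That is a separate question. Since wordcount works, sum(counts) miraculously does the right thing. By the same miracle, you can write if(sum(counts) > 44). You have to convince yourself that wordcount works before you can modify it.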

Tags: r hadoop mapreduce

【Solution 1】:

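reduce <- function(word, counts) {
  total <- sum(counts)  # counts holds only the values emitted for this word
  # Emit the pair only above the threshold; otherwise return NULL (no output)
  if (total > 3)
    keyval(word, total)
}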
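
To see why an `if` inside the reducer is enough, here is a rough plain-R sketch that mimics the map, shuffle, and reduce phases without Hadoop or rmr2. It is only an illustration under stated assumptions: `make_reduce` and `simulate_wordcount` are hypothetical helper names, a plain list stands in for rmr2's `keyval`, and `split()` stands in for the shuffle. The point is that the reducer is called once per distinct word, with only that word's counts.

```r
# Plain-R stand-in for the shuffle + reduce phases; no Hadoop or rmr2 needed.
# make_reduce and simulate_wordcount are hypothetical names for illustration.

# Build a reducer that keeps a word only if its total exceeds `threshold`.
make_reduce <- function(threshold) {
  function(word, counts) {
    total <- sum(counts)
    # A list stands in for keyval(); when the test fails the result is NULL.
    if (total > threshold) list(key = word, val = total)
  }
}

simulate_wordcount <- function(lines, reduce) {
  # Map phase: emit (word, 1) for every whitespace-separated token.
  words <- unlist(strsplit(lines, "\\s+"))
  words <- words[nzchar(words)]
  ones <- rep(1, length(words))

  # Shuffle phase: group the 1s by word, so each group holds one word's counts.
  grouped <- split(ones, words)

  # Reduce phase: call the reducer once per distinct word.
  reduced <- Map(reduce, names(grouped), grouped)

  # Drop the NULLs returned for words at or below the threshold.
  kept <- Filter(Negate(is.null), reduced)
  data.frame(
    word = vapply(kept, function(kv) kv$key, character(1), USE.NAMES = FALSE),
    count = vapply(kept, function(kv) kv$val, numeric(1), USE.NAMES = FALSE),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

lines <- c("a b a c", "a b a d", "b b c")
result <- simulate_wordcount(lines, make_reduce(3))
print(result)
```

With this sample input, "a" and "b" each occur four times and pass the threshold of 3, while "c" and "d" are dropped. Passing the threshold through a closure, as `make_reduce` does, avoids hard-coding the value inside the reducer.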

【Discussion】: