Retrieving sentence score based on values of words in a dictionary

【Question title】: Retrieving sentence score based on values of words in a dictionary
【Posted】: 2015-03-19 13:01:37
【Question description】:

Edited: df and dict

I have a data frame containing sentences:

df <- data_frame(text = c("I love pandas", "I hate monkeys", "pandas pandas pandas", "monkeys monkeys"))

And a dictionary containing words and their corresponding scores:

dict <- data_frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1,-1,1,-1))

I would like to append a "score" column to df that sums the scores of the words in each sentence:

Expected result

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     3
4      monkeys monkeys    -2

Update

Here are the results so far:

Akrun's approach

Suggestion 1

df %>% mutate(score = sapply(strsplit(text, ' '), function(x) with(dict, sum(score[word %in% x]))))

Note that for this approach to work I had to create df and dict with data_frame() instead of data.frame(); otherwise I get: Error in strsplit(text, " ") : non-character argument

Source: local data frame [4 x 2]

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

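This does not account for multiple matches within a single string. It is close to the expected result, but not quite there.

Suggestion 2

I slightly adapted one of akrun's suggestions from the comments to apply it to the edited post: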
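
The undercount seems to come from the direction of the test: `word %in% x` returns one logical value per dictionary word, so a word repeated in the sentence still contributes only once. Here is a minimal base R sketch of the difference. It assumes the question's dict, rebuilt with data.frame() and stringsAsFactors = FALSE to avoid the dplyr dependency; the names `in_dict`, `score_in` and `score_match` are only illustrative.

```r
# Dictionary from the question (assumption: base data.frame, strings kept as character)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# Tokens of the sentence "pandas pandas pandas"
x <- c("pandas", "pandas", "pandas")

# One TRUE/FALSE per dictionary word: repeats in x are invisible here
in_dict <- dict$word %in% x
score_in <- sum(dict$score[in_dict])

# One index per token (0 when the token is not in the dictionary);
# zero indices are dropped when subsetting, repeated indices are kept
idx <- match(x, dict$word, nomatch = 0L)
score_match <- sum(dict$score[idx])

print(c(with_in = score_in, with_match = score_match))
```

The `%in%` version yields 1 for this sentence, while the `match()` version yields 3, which is the expected score.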

cbind(df, unnest(stri_split_fixed(df$text, ' '), group) %>% 
        group_by(group) %>% 
        summarise(score = sum(dict$score[dict$word %in% x])) %>% 
        ungroup() %>% select(-group) %>% data.frame())

This does not account for multiple matches within a string:

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

Richard Scriven's approach

Suggestion 1

group_by(df, text) %>%
mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))

After updating all packages this now works (although it does not account for multiple matches):

Source: local data frame [4 x 2]
Groups: text

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

Suggestion 2

total <- with(dict, {
  vapply(df$text, function(X) {
    sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
  }, 1)
})

cbind(df, total)

This gives the same result:

                  text total
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

Suggestion 3

s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)

This one does work:

                  text total
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     3
4      monkeys monkeys    -2
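
The same per-token lookup can also be written with a named vector, which some may find easier to read than `match()`. This is only a sketch of a variation, not part of the original suggestion. The name `lookup` is hypothetical, and df and dict are rebuilt with base data.frame() so the block runs on its own.

```r
# Data from the question (assumption: base data.frame, strings kept as character)
df <- data.frame(text = c("I love pandas", "I hate monkeys",
                          "pandas pandas pandas", "monkeys monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# Named numeric vector: names are the words, values are the scores
lookup <- setNames(dict$score, dict$word)

# Split each sentence on single spaces
tokens <- strsplit(df$text, " ", fixed = TRUE)

# Indexing by name returns NA for unknown tokens, so drop those when summing
total <- vapply(tokens, function(x) sum(lookup[x], na.rm = TRUE), numeric(1))

print(cbind(df, total))
```

Because every token is looked up individually, repeated words are counted once per occurrence, as in suggestion 3.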

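Thelatemail's approach

# Count how often each dictionary word occurs in each sentence
# (rows = sentences, columns = dictionary words)
res <- sapply(dict$word, function(x) {
  sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
})

# %*% weights each word column by its own score; res * dict$score would
# recycle the scores down the rows instead
cbind(df, score = drop(res %*% dict$score))

Note that I added the cbind() part. This actually matches the expected result.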

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     3
4      monkeys monkeys    -2
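
A caution when weighting such a count matrix by the scores: multiplying an n-by-k matrix by a length-k vector with `*` recycles the vector down the columns, so the scores line up with the word columns only by coincidence (as happens with four sentences and four alternating scores). The sketch below uses three of the question's sentences to make the difference visible. The names `elementwise` and `by_column` are illustrative, and fixed = TRUE is my assumption so that dictionary words are matched literally.

```r
# Three sentences, four dictionary words (assumption: base data.frame)
df <- data.frame(text = c("I love pandas", "I hate monkeys",
                          "pandas pandas pandas"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# Count matrix: one row per sentence, one column per dictionary word
res <- sapply(dict$word, function(x) {
  sapply(gregexpr(x, df$text, fixed = TRUE), function(y) sum(y != -1L))
})

# Element-wise product: scores are recycled in column-major order,
# so they do not follow the word columns here
elementwise <- rowSums(res * dict$score)

# Matrix product: each word column is weighted by its own score
by_column <- drop(res %*% dict$score)

print(rbind(elementwise, by_column))
```

Here the element-wise version scores "I hate monkeys" as 2, while the matrix product gives the intended -2.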

Final answer

Inspired by akrun's suggestion, here is the most dplyr-esque solution I ended up writing:

library(dplyr)
library(tidyr)
library(stringi)

bind_cols(df, unnest(stri_split_fixed(df$text, ' '), group) %>% 
            group_by(x) %>% mutate(score = sum(dict$score[dict$word %in% x])) %>% 
            group_by(group) %>% 
            summarise(score = sum(score)) %>% 
            select(-group))

That said, I will go with Richard Scriven's suggestion #3, since it is the most efficient.
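
For readers who cannot reproduce the pipeline above (calling unnest() directly on a list relied on the tidyr behaviour of that period, and I am assuming it may not run unchanged on newer releases), the same split, join and sum steps can be sketched in base R. The names `long`, `joined` and `result` are illustrative. The sketch assumes every sentence contains at least one token.

```r
# Data from the question (assumption: base data.frame, strings kept as character)
df <- data.frame(text = c("I love pandas", "I hate monkeys",
                          "pandas pandas pandas", "monkeys monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# Step 1: one row per token, remembering which sentence it came from
tokens <- strsplit(df$text, " ", fixed = TRUE)
long <- data.frame(group = rep(seq_along(tokens), lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)

# Step 2: left join on the dictionary; tokens without an entry get score 0
joined <- merge(long, dict, by = "word", all.x = TRUE)
joined$score[is.na(joined$score)] <- 0

# Step 3: sum the token scores per sentence (groups come back in order 1..n)
score <- as.numeric(tapply(joined$score, joined$group, sum))
result <- cbind(df, score)

print(result)
```

Joining per token rather than testing `dict$word %in% x` per sentence is what lets repeated words count more than once.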

Benchmark

Here are the suggestions applied to a larger dataset (a df of 93 sentences and a dict of 14K words) using microbenchmark():

mbm = microbenchmark(
  akrun = df %>% mutate(score = sapply(stri_split_fixed(text, ' '), function(x) with(dict, sum(score[word %in% x])))),
  akrun2 = cbind(df, unnest(stri_split_fixed(df$text, ' '), group) %>% group_by(group) %>% summarise(score = sum(dict$score[dict$word %in% x])) %>% ungroup() %>% select(-group) %>% data.frame()),
  rscriven1 = group_by(df, text) %>% mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)])),
  rscriven2 = cbind(df, score = with(dict, { vapply(df$text, function(X) { sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])}, 1)})),
  rscriven3 = cbind(df, score = vapply(strsplit(df$text, " "), function(x) sum(with(dict, score[match(x, word, 0L)])), 1)),
  thelatemail = cbind(df, score = drop(sapply(dict$word, function(x) { sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) ) }) %*% dict$score)),
  sbeaupre = bind_cols(df, unnest(stri_split_fixed(df$text, ' '), group) %>% group_by(x) %>% mutate(score = sum(dict$score[dict$word %in% x])) %>% group_by(group) %>% summarise(score = sum(score)) %>% select(-group)),
  times = 10
)

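Results:

【Question discussion】:

What have you tried yourself?
I guess you need to try strsplit. Something like sapply(strsplit(df$text, ' '), function(x) with(dict, sum(score[word %in% x])))
@akrun. That did the trick. df %>% mutate(score = sapply(strsplit(text, ' '), function(x) with(dict, sum(score[word %in% x]))))
@akrun How can I divide the resulting score by the number of distinct words that returned a match in the dictionary for a given sentence?
You can get the number of distinct words with sapply(strsplit(df$text, ' '), function(x) length(unique(x)))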
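
Note that `length(unique(x))` counts every distinct token in the sentence, including words that are not in the dictionary. If the divisor should only count distinct words that actually matched, one possible sketch is the following. The names `n_matched` and `normalized` are illustrative, and returning NA for sentences without any match is my own assumption.

```r
# Data from the question (assumption: base data.frame, strings kept as character)
df <- data.frame(text = c("I love pandas", "I hate monkeys",
                          "pandas pandas pandas", "monkeys monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

tokens <- strsplit(df$text, " ", fixed = TRUE)

normalized <- vapply(tokens, function(x) {
  # Dictionary row for each token, 0 when the token has no entry
  idx <- match(x, dict$word, nomatch = 0L)
  # Number of distinct dictionary words found in this sentence
  n_matched <- length(unique(idx[idx > 0L]))
  # Avoid dividing by zero when nothing matched
  if (n_matched == 0L) NA_real_ else sum(dict$score[idx]) / n_matched
}, numeric(1))

print(cbind(df, normalized))
```

With the question's data, "I love pandas" has a summed score of 2 over two matched words, so its normalized score is 1.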

Tags: r dplyr lapply sapply stringi

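【Solution 1】:

Update: This is the simplest dplyr approach I have found so far, with a stringi function added to speed things up. Provided there are no identical sentences in df$text, we can group by that column and then apply mutate().

Note: package versions are dplyr 0.4.1 and stringi 0.4.1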

library(dplyr)
library(stringi)

group_by(df, text) %>%
    mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
# Source: local data frame [2 x 2]
# Groups: text
#
#             text score
# 1  I love pandas     2
# 2 I hate monkeys    -2

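I removed the do() method I posted last night, but you can find it in the edit history. It seemed unnecessary to me, since the method above works just as well and is more of a dplyr approach.

Also, if you are open to a non-dplyr answer, here are two that use base functions.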

total <- with(dict, {
    vapply(df$text, function(X) {
        sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
    }, 1)
})
cbind(df, total)
#             text total
# 1  I love pandas     2
# 2 I hate monkeys    -2

Or an alternative using strsplit() that produces the same result on this data:

s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)
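
One practical note on the strsplit() call: it expects a character vector. In R versions before 4.0.0, data.frame() converted strings to factors by default, which leads to the "non-character argument" error. The sketch below reproduces that situation explicitly with stringsAsFactors = TRUE and shows one way around it. The names `df_factor` and `failed` are illustrative.

```r
# Force the pre-4.0.0 default so the text column becomes a factor
df_factor <- data.frame(text = c("I love pandas", "I hate monkeys"),
                        stringsAsFactors = TRUE)

# strsplit() refuses factors; capture the error instead of stopping
failed <- inherits(try(strsplit(df_factor$text, " "), silent = TRUE),
                   "try-error")

# Converting to character first makes the split work
tokens <- strsplit(as.character(df_factor$text), " ")

print(failed)
print(tokens)
```

Creating the data frame with stringsAsFactors = FALSE, or with dplyr's data_frame() as in the question, avoids the conversion in the first place.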


【Solution 2】:

A little double loop via sapply and gregexpr:

res <- sapply(dict$word, function(x) {
  sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
})
# %*% weights each word column by its own score
drop(res %*% dict$score)
#[1]  2 -2

This also accounts for cases with multiple matches in a single string:

df <- data.frame(text = c("I love love pandas", "I hate monkeys"))
# run same code as above
#[1]  3 -2
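
Because gregexpr() treats each dictionary word as a regular expression and matches it anywhere in the string, a word such as "love" is also counted inside "lovely". If whole-word matching is wanted, one option is to wrap each word in `\\b` word boundaries. This is a sketch under that assumption. The helper `count_hits` and the example sentence are hypothetical, and dictionary words containing regex metacharacters would need escaping first.

```r
# Example data (assumption: "lovely" added to show the substring effect)
df <- data.frame(text = c("I love lovely pandas", "I hate monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# Number of matches of one pattern in each element of text
count_hits <- function(pattern, text) {
  vapply(gregexpr(pattern, text, perl = TRUE),
         function(m) sum(m != -1L), integer(1))
}

# Substring matching: "love" is found in "love" and in "lovely"
res_sub <- sapply(dict$word, count_hits, text = df$text)

# Whole-word matching: the pattern must start and end at a word boundary
res_word <- sapply(paste0("\\b", dict$word, "\\b"), count_hits, text = df$text)

# Weight each word column by its own score
score_sub <- drop(res_sub %*% dict$score)
score_word <- drop(res_word %*% dict$score)

print(rbind(score_sub, score_word))
```

For the first sentence the substring version gives 3 and the whole-word version gives 2.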
