Apply a user-defined function to one df, using a single column in another df

【Question title】: Apply a user-defined function to one df, using a single column in another df
【Posted】: 2020-07-12 21:00:02
【Question description】:

df1 (1,500 rows) shows questions, the percentage answered correctly, and the number of attempts per question:

qtitle                                   avg_correct                       attempts  

"Asthma and exercise, question 1"         54.32                            893
"COVID-19 and ventilators, q 3"           23.60                            143
"Pedestrian vs. car MVCs"                 74.19                            227
"Hemophilia and monoclonal Abs"           34.56                            78
"COVID-19 and droplets"                   83.21                            234

Using the tidytext library, I identified the most frequent words in the qtitle column and counted them by frequency to create a second data frame (df2, 320 rows).

word                n
COVID-19            68
Trauma              57
Hemophilia          46

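I want to use each entry in df2's word column to match words in the question titles in df1 (qtitle), then find the mean of avg_correct and the sum of attempts, and include the frequency of the search term (n in df2). In other words, map df2 onto df1 through a custom function.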

word            avg_correct        attempts              count(n)
COVID-19        55.23              456                   68
Hemophilia      45.92              123                   46

This doesn't work (obviously):

correct_by_terms <- function(x) {
  filter(df1, str_detect(title, x))
  result <- summarise(df1, mean = mean(average), n = n(), x = x)
  return (result)
}
frequent_terms_by_correct_percent<- map_df(df2$word, correct_by_terms)
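
For reference, a minimal sketch of how the function above might be repaired, assuming the dplyr, stringr and purrr packages and the sample rows shown earlier. The original discards the result of filter() and refers to columns title and average, which df1 does not have. The version below pipes the filtered rows into summarise() and uses qtitle and avg_correct. Passing the frequency as a second argument (named freq here) and matching literally with fixed() are choices made for this illustration, not part of the original post.

```r
library(dplyr)
library(stringr)
library(purrr)

# Sample rows copied from the question (the real df1 has 1,500 rows)
df1 <- data.frame(
  qtitle = c("Asthma and exercise, question 1",
             "COVID-19 and ventilators, q 3",
             "Pedestrian vs. car MVCs",
             "Hemophilia and monoclonal Abs",
             "COVID-19 and droplets"),
  avg_correct = c(54.32, 23.60, 74.19, 34.56, 83.21),
  attempts = c(893, 143, 227, 78, 234),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  word = c("COVID-19", "Trauma", "Hemophilia"),
  n = c(68, 57, 46),
  stringsAsFactors = FALSE
)

# term: search word from df2; freq: its frequency n (hypothetical argument names)
correct_by_terms <- function(term, freq) {
  df1 %>%
    # keep only titles containing the term as a literal substring
    filter(str_detect(qtitle, fixed(term))) %>%
    # one summary row per term; this is the unweighted mean the question asked for
    summarise(
      word = term,
      avg_correct = mean(avg_correct),
      attempts = sum(attempts),
      n = freq
    )
}

# Call the function once per (word, n) pair and stack the one-row results
frequent_terms_by_correct_percent <- bind_rows(
  map2(df2$word, df2$n, correct_by_terms)
)

print(frequent_terms_by_correct_percent)
```

With these sample rows, "Trauma" matches no title, so its row has 0 attempts and an avg_correct of NaN. Such rows can be dropped afterwards with filter(attempts > 0) if they are not wanted.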

【Question discussion】:

It seems that fuzzyjoin::regex_left_join might work better here.

Tags: r dictionary apply tidyverse tidytext

【Solution 1】:

Here is a base R approach that computes what you asked for.

# get total number of correct per question
df1$correct <- df1$avg_correct * df1$attempts / 100

# initialize attempts and correct to 0
df2$attempts <- 0
df2$correct <- 0

# loop over df2
for (df2_index in seq_len(nrow(df2))) {
  df2_row <- df2[df2_index, ]
  # loop over df1
  for (df1_index in seq_len(nrow(df1))) {
    df1_row <- df1[df1_index, ]
    # if df1 qtitle contains df2 word (literal match, not a regex)
    if (grepl(df2_row$word, df1_row$qtitle, fixed = TRUE)) {
      df2[df2_index, "attempts"] <- df2[df2_index, "attempts"] + df1_row$attempts
      df2[df2_index, "correct"] <- df2[df2_index, "correct"] + df1_row$correct
    }
  }
}

# attempt-weighted percentage correct per word
df2$avg_correct <- (df2$correct / df2$attempts) * 100

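【Discussion】:

【Solution 2】:

You can try this base R approach. Using sapply, we loop over each word in df2, match it against the question titles in df1 with grepl, and return the mean of avg_correct and the sum of attempts for the matching rows. The \\b anchors restrict each match to whole words.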

cbind(df2, t(sapply(df2$word, function(x) {
        inds <- grepl(paste0('\\b', x, '\\b'), df1$qtitle)
        c(avg_correct = mean(df1$avg_correct[inds]), 
          attempts = sum(df1$attempts[inds]))
})))

【Discussion】:

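You cannot simply average avg_correct. Consider one avg_correct value of 50 based on 2 attempts and another of 98 based on 100 attempts: mean(avg_correct) gives 74, but once the number of attempts is taken into account the average is about 97.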
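
The gap between the two averages also shows up on the question's own sample rows. A small sketch using base R's weighted.mean() on the two COVID-19 titles from df1; the vector names are made up for illustration.

```r
# avg_correct and attempts for the two sample titles that contain "COVID-19"
covid_avg_correct <- c(23.60, 83.21)
covid_attempts <- c(143, 234)

# Unweighted: every question counts equally, regardless of attempts
simple_mean <- mean(covid_avg_correct)

# Weighted: each question contributes in proportion to its attempts
weighted_mean <- weighted.mean(covid_avg_correct, w = covid_attempts)

print(c(simple = simple_mean, weighted = weighted_mean))
```

The unweighted mean is 53.405, while the attempt-weighted mean is about 60.6, because the question with 234 attempts pulls the result towards its higher score. In the sapply call above, swapping mean(df1$avg_correct[inds]) for weighted.mean(df1$avg_correct[inds], df1$attempts[inds]) would apply the same weighting per word.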

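【Solution 3】:

If the words you want to match are all words that tokenization can identify, as in the example you showed, I would:

tokenize,
inner join, and then
group_by() and summarise.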

library(tidyverse)
library(tidytext)

df1 <- tribble(~qtitle,                                ~avg_correct,   ~attempts,  
               "Asthma and exercise, question 1",      54.32,          893,
               "COVID19 and ventilators, q 3",        23.60,          143,
               "Pedestrian vs. car MVCs",              74.19,          227,
               "Hemophilia and monoclonal Abs",        34.56,          78,
               "COVID19 and droplets",                83.21,          234
)

df2 <- tribble(~word,              ~n,
               "COVID19",         68,
               "Trauma",           57,
               "Hemophilia",       46) %>%
  mutate(word = tolower(word))

df1 %>% 
  unnest_tokens(word, qtitle) %>%
  inner_join(df2) %>%
  group_by(word) %>%
  summarise(avg_correct = mean(avg_correct),
            attempts = sum(attempts),
            n = first(n))
#> Joining, by = "word"
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 4
#>   word       avg_correct attempts     n
#>   <chr>            <dbl>    <dbl> <dbl>
#> 1 covid19           53.4      377    68
#> 2 hemophilia        34.6       78    46

Created on 2020-07-18 by the reprex package (v0.3.0)

【Discussion】: