【Question title】: Apply a user-defined function to one df, using a single column in another df
【Posted】: 2020-07-12 21:00:02
【Question description】:

df1 (1,500 rows) lists the questions, the percentage answered correctly, and the number of attempts per question:

qtitle                                   avg_correct                       attempts  

"Asthma and exercise, question 1"         54.32                            893
"COVID-19 and ventilators, q 3"           23.60                            143
"Pedestrian vs. car MVCs"                 74.19                            227
"Hemophilia and monoclonal Abs"           34.56                            78
"COVID-19 and droplets"                   83.21                            234

Using the tidytext library, I identified the most frequent words in the qtitle column and counted them by frequency to create a second data frame (df2, 320 rows).

word                n
COVID-19            68
Trauma              57
Hemophilia          46

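I would like to use each entry in the word column of df2 to match words in the question titles in df1 (qtitle), then compute the mean of avg_correct and the sum of attempts, and include the frequency of the search term (n in df2). In other words, map df2 onto df1 through a custom function.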

word            avg_correct        attempts              count(n)
COVID-19        55.23              456                   68
Hemophilia      45.92              123                   46

This does not work (obviously):

correct_by_terms <- function(x) {
  filter(df1, str_detect(title, x))
  result <- summarise(df1, mean = mean(average), n = n(), x = x)
  return (result)
}
frequent_terms_by_correct_percent<- map_df(df2$word, correct_by_terms)

【Question comments】:

  • It seems that fuzzyjoin::regex_left_join might work better here.

Tags: r dictionary apply tidyverse tidytext


【Solution 1】:

Here is a way to compute what you asked for using base R.

# get total number of correct per question
df1$correct <- df1$avg_correct * df1$attempts / 100

# initialize attempts and correct to 0
df2$attempts <- 0
df2$correct <- 0

# loop over df2 (seq_len() is safe even if a data frame has zero rows)
for (df2_index in seq_len(nrow(df2))) {
  df2_row <- df2[df2_index, ]
  # loop over df1
  for (df1_index in seq_len(nrow(df1))) {
    df1_row <- df1[df1_index, ]
    # if df1 qtitle contains df2 word (literal match, not a regex)
    if (grepl(df2_row$word, df1_row$qtitle, fixed = TRUE)) {
      df2[df2_index, "attempts"] <- df2[df2_index, "attempts"] + df1_row$attempts
      df2[df2_index, "correct"] <- df2[df2_index, "correct"] + df1_row$correct
    }
  }
}

# attempt-weighted percentage correct per word (NaN if a word matched nothing)
df2$avg_correct <- (df2$correct / df2$attempts) * 100
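The inner loop can probably be dropped, since grepl() is vectorized over the titles. The following is a minimal sketch of the same calculation under that assumption; the sample data is copied from the question, and the names totals and hit are illustrative only.

```r
# Sample data copied from the question
df1 <- data.frame(
  qtitle = c("Asthma and exercise, question 1",
             "COVID-19 and ventilators, q 3",
             "Pedestrian vs. car MVCs",
             "Hemophilia and monoclonal Abs",
             "COVID-19 and droplets"),
  avg_correct = c(54.32, 23.60, 74.19, 34.56, 83.21),
  attempts = c(893, 143, 227, 78, 234),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  word = c("COVID-19", "Trauma", "Hemophilia"),
  n = c(68, 57, 46),
  stringsAsFactors = FALSE
)

# Total number of correct answers per question
df1$correct <- df1$avg_correct * df1$attempts / 100

# For each word, test all titles at once and sum the matching rows.
# The result is a 2 x nrow(df2) matrix with rows "attempts" and "correct".
totals <- vapply(df2$word, function(w) {
  hit <- grepl(w, df1$qtitle, fixed = TRUE)
  c(attempts = sum(df1$attempts[hit]), correct = sum(df1$correct[hit]))
}, numeric(2))

df2$attempts <- unname(totals["attempts", ])
df2$correct <- unname(totals["correct", ])
# Words that match no title have 0 attempts, so this yields NaN for them
df2$avg_correct <- df2$correct / df2$attempts * 100
```

On the sample rows, "Trauma" matches no title, so its avg_correct comes out as NaN; whether such rows should be dropped or kept is a choice the question does not settle.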

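【Comments】:

    【Solution 2】:

    You can try this base R approach. Using sapply, we loop over each word in df2, match it against the question titles in df1 with grepl, and return the mean of avg_correct and the sum of attempts for the matching indices.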

    cbind(df2, t(sapply(df2$word, function(x) {
            inds <- grepl(paste0('\\b', x, '\\b'), df1$qtitle)
            c(avg_correct = mean(df1$avg_correct[inds]), 
              attempts = sum(df1$attempts[inds]))
    })))
    

    【Comments】:

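    • You cannot simply average avg_correct. Consider one avg_correct value of 50 based on 2 attempts and another of 100 based on 100 attempts: mean(avg_correct) yields 75, but once the number of attempts is taken into account the average is about 99.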
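To make the point concrete, here is a small sketch using the two COVID-19 rows from the question's sample df1. It compares the plain mean with an attempt-weighted mean from base R's weighted.mean(); the variable names unweighted and weighted are illustrative.

```r
# The two COVID-19 rows from the question's sample data
avg_correct <- c(23.60, 83.21)
attempts <- c(143, 234)

# Plain mean treats both questions equally
unweighted <- mean(avg_correct)                   # 53.405

# Weighting by attempts gives the share of all attempts answered correctly
weighted <- weighted.mean(avg_correct, attempts)  # about 60.6
```

In the sapply call above, swapping mean(df1$avg_correct[inds]) for weighted.mean(df1$avg_correct[inds], df1$attempts[inds]) would apply the same weighting, assuming that is the average wanted.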
    【Solution 3】:

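    If the words you want to match are all words that tokenization can identify, as in the example you showed, I would:

    • tokenize,
    • inner join, and then
    • group_by() and summarise.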
    library(tidyverse)
    library(tidytext)
    
    df1 <- tribble(~qtitle,                                ~avg_correct,   ~attempts,  
                   "Asthma and exercise, question 1",      54.32,          893,
                   "COVID19 and ventilators, q 3",        23.60,          143,
                   "Pedestrian vs. car MVCs",              74.19,          227,
                   "Hemophilia and monoclonal Abs",        34.56,          78,
                   "COVID19 and droplets",                83.21,          234
    )
    
    df2 <- tribble(~word,              ~n,
                   "COVID19",         68,
                   "Trauma",           57,
                   "Hemophilia",       46) %>%
      mutate(word = tolower(word))
    
    df1 %>% 
      unnest_tokens(word, qtitle) %>%
      inner_join(df2) %>%
      group_by(word) %>%
      summarise(avg_correct = mean(avg_correct),
                attempts = sum(attempts),
                n = first(n))
    #> Joining, by = "word"
    #> `summarise()` ungrouping output (override with `.groups` argument)
    #> # A tibble: 2 x 4
    #>   word       avg_correct attempts     n
    #>   <chr>            <dbl>    <dbl> <dbl>
    #> 1 covid19           53.4      377    68
    #> 2 hemophilia        34.6       78    46
    

    Created on 2020-07-18 by the reprex package (v0.3.0)

    【Comments】:
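If some terms contain punctuation that tokenization may split (the question's original "COVID-19" was written as "COVID19" in the example above), a literal substring match is one possible alternative. The sketch below reworks the function from the question along those lines. It assumes dplyr and stringr are installed, uses weighted.mean() for an attempt-weighted average, and gives the function an extra data argument that is not in the original. The name by_term is illustrative.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question (original "COVID-19" spelling)
df1 <- tibble::tribble(
  ~qtitle,                            ~avg_correct, ~attempts,
  "Asthma and exercise, question 1",  54.32,        893,
  "COVID-19 and ventilators, q 3",    23.60,        143,
  "Pedestrian vs. car MVCs",          74.19,        227,
  "Hemophilia and monoclonal Abs",    34.56,        78,
  "COVID-19 and droplets",            83.21,        234
)
df2 <- tibble::tribble(
  ~word,        ~n,
  "COVID-19",   68,
  "Trauma",     57,
  "Hemophilia", 46
)

# Summarise the rows of `data` whose title contains `term` literally.
# The filtered rows are piped straight into summarise(), which the
# question's version did not do.
correct_by_terms <- function(term, data) {
  data %>%
    filter(str_detect(qtitle, fixed(term))) %>%
    summarise(
      word = term,
      avg_correct = weighted.mean(avg_correct, attempts),
      attempts = sum(attempts)
    )
}

# One summary row per term, then attach the term frequency n from df2
by_term <- lapply(df2$word, correct_by_terms, data = df1) %>%
  bind_rows() %>%
  left_join(df2, by = "word")
```

Terms that match no title (here "Trauma") come back with 0 attempts and NaN for avg_correct; adding filter(attempts > 0) at the end would drop them if that is preferred.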
