【Question title】: String matching to data.frames of different sizes
【Posted】: 2016-09-24 15:25:30
【Question】:

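I have two data.frames of different sizes, and I am looking for the most efficient way to match strings from one data.frame against the other and extract some related information.

Here is an example:

The two initial data.frames, a and b, and the desired result: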

a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
               age = c(30, 24, 52, 44, 73, 44, 33, 12),
               visits = c(5, 1, 3, 2, 8, 5, 19, 3))

b = data.frame(string = c("the red ball went over the fence",
                          "sorry to see that your tent fell down",
                          "the ball fell into the red salad",
                          "serious people eat peanuts on Sundays"))

desired_result = data.frame(string = b$string,
                            num_matches = c(2, 1, 3, 0),
                            avg_age = c(37, 73, 32.66667, NA),
                            avg_visits = c(3.5, 8, 2.66667, NA))

Here are the data.frames in a more readable format:

> a
   term age visits
1   red  30      5
2 salad  24      1
3  rope  52      3
4  ball  44      2
5  tent  73      8
6 plane  44      5
7  gift  33     19
8  meat  12      3

> b
                                 string
1      the red ball went over the fence
2 sorry to see that your tent fell down
3      the ball fell into the red salad
4 serious people eat peanuts on Sundays

> desired_result
                                 string num_matches  avg_age avg_visits
1      the red ball went over the fence           2 37.00000    3.50000
2 sorry to see that your tent fell down           1 73.00000    8.00000
3      the ball fell into the red salad           3 32.66667    2.66667
4 serious people eat peanuts on Sundays           0       NA         NA
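  • num_matches is the number of "terms" found in the "string"
  • avg_age is the average age of the "terms" found in the "string"
  • avg_visits is the average number of visits of the "terms" found in the "string"

Any ideas on how to do this efficiently?

Thanks.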

【Comments on the question】:

  • If no match is found, avg_age and avg_visits may also be zero

Tags: r string dataframe


【Solution 1】:

You can try this in base R (no packages needed):

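res <- t(apply(b, 1, function(x) {
    # split the sentence into words on single spaces
    l <- strsplit(x, " ")
    # rows of a whose term equals one of the words
    r <- unlist(lapply(unlist(l), function(y) which(a$term==y)))
    # match count, mean age, mean visits (NaN when nothing matches)
    rbind(length(r), mean(a$age[r]), mean(a$visits[r]))
}))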

res <- cbind(b, res)
                                 # string 1        2        3
# 1      the red ball went over the fence 2 37.00000 3.500000
# 2 sorry to see that your tent fell down 1 73.00000 8.000000
# 3      the ball fell into the red salad 3 32.66667 2.666667
# 4 serious people eat peanuts on Sundays 0      NaN      NaN

【Comments】:

  • strsplit can be limiting if the strings are a bit messy, e.g. plane's
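One possible way to address the comment above is to normalize each string before comparing words, so that punctuation and capitalization do not block an exact match. The sketch below is only an illustration: `tokenize` is a hypothetical helper that is not part of the answer, and treating "plane's" as a match for "plane" is an assumption about what the commenter wants.

```r
# Hypothetical helper (not from the answer): split a string into lowercase
# word tokens, treating any run of non-alphanumeric characters as a separator.
tokenize <- function(s) {
  cleaned <- gsub("[^[:alnum:]]+", " ", tolower(s))
  tokens <- strsplit(trimws(cleaned), " ", fixed = TRUE)[[1]]
  tokens[nzchar(tokens)]  # drop empty tokens, if any
}

terms <- c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat")

# A deliberately messy sentence (made up for this illustration)
tokens <- tokenize("The plane's red tent, sadly, fell down.")

# Terms that appear as whole tokens after cleaning
matched <- terms[terms %in% tokens]
```

Here `matched` contains "red", "tent" and "plane", whereas splitting on a single space alone would leave "plane's" and "tent," unmatched.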
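【Solution 2】:

Using data.table, process each row with by = string. Store the match results in a list column, then aggregate based on those match results.

Note that the matches column is a list of lists, with each cell containing a list. You need to wrap the match result in .(), which is in effect another list(), because data.table expects a list for an ordinary column.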

library(data.table)
library(stringr)
a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
  age = c(30, 24, 52, 44, 73, 44, 33, 12),
  visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.table(string = c("the red ball went over the fence",
  "sorry to see that your tent fell down",
  "the ball fell into the red salad",
  "serious people eat peanuts on Sundays"))

b[, matches := vector("list", .N)]
b[, matches := .(list(str_detect(string, a[, term]))), by = string]
b[, num_matches := sum(unlist(matches)), by = string]
b[, avg_age := mean(a[unlist(matches), age]), by = string]
b[, avg_visits := mean(a[unlist(matches), visits]), by = string]

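【Comments】:

  • Thanks to everyone who answered... I really like this solution because each step is clear (which makes it easier to add extra steps or complexity). It also uses str_detect, which works on the whole string and generalizes to terms that are not single words. Note: add "na.rm = T" to "sum(unlist(matches))", and add "b[, matches := NULL]" to remove the matches column at the end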
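For readers without data.table or stringr, the same detect-then-aggregate idea can be sketched in base R with a logical match matrix. This is not the answerer's code. It assumes the terms contain no regex metacharacters, and it adds word boundaries (`\\b`), a deliberate change from plain substring detection so that "red" does not match inside a word such as "bored".

```r
a <- data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
                age = c(30, 24, 52, 44, 73, 44, 33, 12),
                visits = c(5, 1, 3, 2, 8, 5, 19, 3),
                stringsAsFactors = FALSE)
b <- data.frame(string = c("the red ball went over the fence",
                           "sorry to see that your tent fell down",
                           "the ball fell into the red salad",
                           "serious people eat peanuts on Sundays"),
                stringsAsFactors = FALSE)

# One row per string, one column per term: TRUE where the term occurs as a
# whole word (multi-word terms work too, as long as they are plain text).
# Note: with a single string in b this would collapse to a vector.
hits <- vapply(a$term,
               function(term) grepl(paste0("\\b", term, "\\b"), b$string, perl = TRUE),
               logical(nrow(b)))

num_matches <- rowSums(hits)
# mean() of an empty selection is NaN, so rows without matches get NaN
avg_age <- apply(hits, 1, function(h) mean(a$age[h]))
avg_visits <- apply(hits, 1, function(h) mean(a$visits[h]))

result <- data.frame(string = b$string, num_matches, avg_age, avg_visits)
```

Each string is scanned once per term, so the cost grows with the number of strings times the number of terms. For the sample data, `num_matches` comes out as 2, 1, 3, 0.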
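【Solution 3】:

I would build desired_result one piece at a time.

For that you need two functions: one to count the matches and one to compute the averages.

First, the counter: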

counter <- function(sentence, pattern)
{
  count <- 0
  for (var in pattern)
  {
    if(grepl(pattern=var, sentence)) count <- count + 1
  }
  return(count)
}

The second computes the averages; the same function can be used for both columns:

average <- function(sentence, look_up)
{ 
  pattern <- look_up[,1]
  count <-0
  summed <- 0
  for (var in pattern)
  {
    if(grepl(pattern=var, sentence)) {
      count <- count + 1
      summed <- sum(look_up[look_up[,1]==var,2]) + summed
    }
  }        
  return(summed/count)
}

You can apply it to your data like this:

First create the result data.frame:

desired_result <- data.frame(string = b$string)

Then fill in the values:

desired_result$num_match<- sapply(b$string,counter,pattern=a$term)
desired_result$avg_age<- sapply(b$string,average,look_up=a[,c(1,2)])
desired_result$avg_visit<- sapply(b$string,average,look_up=a[,c(1,3)])

This gives you the desired_result described in your question:

> desired_result
                                 string num_match  avg_age avg_visit
1      the red ball went over the fence         2 37.00000  3.500000
2 sorry to see that your tent fell down         1 73.00000  8.000000
3      the ball fell into the red salad         3 32.66667  2.666667
4 serious people eat peanuts on Sundays         0      NaN       NaN
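
The NaN in the last row arises because nothing matched, so the average is computed as 0/0. The question asks for NA there, and the asker notes that zero would also be acceptable. A minimal sketch of both options follows; `mean_or` is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: a mean that returns a chosen fallback value
# when there is nothing to average.
mean_or <- function(x, empty = NA_real_) {
  if (length(x) == 0) empty else mean(x)
}

age <- c(30, 24, 52, 44, 73, 44, 33, 12)
no_match <- rep(FALSE, length(age))   # e.g. the "serious people ..." row

na_version   <- mean_or(age[no_match])             # NA instead of NaN
zero_version <- mean_or(age[no_match], empty = 0)  # 0 instead of NaN

# Alternatively, patch an already computed column after the fact
avg_age <- c(37, 73, 98 / 3, NaN)
avg_age[is.nan(avg_age)] <- NA
```

Note that is.nan() is FALSE for NA, so the replacement step only touches the rows that were genuinely 0/0.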

【Comments】:
