【Question title】: Extract names from a string using a list of names with grepl and a loop and add them to a new column in R
【Posted】: 2021-09-18 00:17:37
【Question】:

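I have a dataset with one column containing names and one column describing what that person did during the day. I am trying to work out, using R, who met whom on that day. I created a vector containing the names in my dataset and used grepl in a loop to identify where those names appear in the column describing people's activities.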

name <- c("Dupont","Dupuy","Smith") 

activity <- c("On that day, he had lunch with Dupuy in London.", 
              "She had lunch with Dupont and then went to Brighton to meet Smith.", 
              "Smith remembers that he was tired on that day.")

met_with <- c("Dupont","Dupuy","Smith")

df<-data.frame(name, activity, met_with=NA)


for (i in 1:length(met_with)) {
  df$met_with <- ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}

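However, this solution is unsatisfactory for two reasons. When a person met several other people (e.g. Dupuy in my example), I cannot extract more than one name. I also cannot tell R not to return the person's own name when the activity column uses that name instead of a pronoun (e.g. Smith).

Ideally, I would like df to look like this: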

  name         activity                                            met_with                             
  Dupont       On that day, he had lunch with Dupuy in London.     Dupuy
  Dupuy        She had lunch with Dupont and then (...).           Dupont Smith
  Smith        Smith remembers that he was tired on that day.      NA

I am cleaning these strings in order to build an edge list and a node list for a later network analysis.

Thanks

【Question comments】:

    Tags: r string loops grepl edge-list


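    【Solution 1】:

    You can use setdiff to exclude the name belonging to the current row, and gregexpr together with regmatches to extract the names that match. You may also want to wrap the names in \\b so that only whole words match.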

    for(i in seq_len(nrow(df))) {
      df$met_with[i] <- paste(regmatches(df$activity[i],
       gregexpr(paste(setdiff(name, df$name[i]), collapse="|"),
       df$activity[i]))[[1]], collapse = " ")
    }
    
    df
    #    name                                                           activity     met_with
    #1 Dupont                    On that day, he had lunch with Dupuy in London.        Dupuy
    #2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont Smith
    #3  Smith                     Smith remembers that he was tired on that day.             
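
The \\b suggestion above can be sketched as follows. This is a minimal variant of the loop, not the answerer's exact code: the variable names `others`, `pat` and `hits` are introduced here for illustration, and returning NA instead of an empty string for rows without a match is an assumption based on the desired output in the question.

```r
# Example data from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, met_with = NA_character_,
                 stringsAsFactors = FALSE)

for (i in seq_len(nrow(df))) {
  # All names except the one belonging to the current row
  others <- setdiff(name, df$name[i])
  # \\b restricts each alternative to whole-word matches
  pat <- paste0("\\b(", paste(others, collapse = "|"), ")\\b")
  hits <- regmatches(df$activity[i], gregexpr(pat, df$activity[i]))[[1]]
  # Assumption: keep NA (as in the question's desired output) when nobody is found
  df$met_with[i] <- if (length(hits)) paste(unique(hits), collapse = " ") else NA_character_
}

# Why the boundaries matter: a plain pattern also matches inside longer names
plain_match    <- grepl("Smith", "He met Smithson.")
boundary_match <- grepl("\\bSmith\\b", "He met Smithson.")
```

Here `plain_match` is TRUE while `boundary_match` is FALSE, which is the kind of partial match the word boundaries are meant to prevent. The sketch assumes the names contain no regex metacharacters.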
    

    Another way, using Reduce, might be:

    df$met_with <- Reduce(function(x, y) {
      i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
      x[i] <- lapply(x[i], `c`, y)
      x
    }, unique(name), vector("list", nrow(df)))
    
    df
    #    name                                                           activity      met_with
    #1 Dupont                    On that day, he had lunch with Dupuy in London.         Dupuy
    #2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont, Smith
    #3  Smith                     Smith remembers that he was tired on that day.          NULL
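
Since the question mentions building an edge list, the list column produced by Reduce could be flattened into one row per pair. The following is only a sketch under that assumption: the `met` and `edges` objects and the `from`/`to` column names are hypothetical and not part of the original answer.

```r
# Example data from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, stringsAsFactors = FALSE)

# Same Reduce() logic as above, kept in a separate list instead of a column
met <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], c, y)
  x
}, unique(name), vector("list", nrow(df)))

# Repeat each person once per match; rows with no match (NULL) contribute nothing
edges <- data.frame(from = rep(df$name, lengths(met)),
                    to   = unlist(met),
                    stringsAsFactors = FALSE)
```

Keeping the matches as a list rather than a pasted string makes this step straightforward, because `lengths()` gives the number of matches per row directly.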
    

    【Comments】:

      【Solution 2】:

      Same logic as @Gki, but using stringr functions and mapply instead of a loop.

      library(stringr)
      
      pat <- str_c('\\b', df$name, '\\b', collapse = '|')
      df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '), 
             str_extract_all(df$activity, pat), df$name)
      
      df
      
      #    name                                                           activity
      #1 Dupont                    On that day, he had lunch with Dupuy in London.
      #2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
      #3  Smith                     Smith remembers that he was tired on that day.
      
      #      met_with
      #1        Dupuy
      #2 Dupont Smith
      #3             
      

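      【Comments】:

      • Hi! Thanks for your answer, it works great. However, I run into a problem when I run it on my whole dataset, because I think a pattern with this many names makes R crash. I have been running the code on the full dataset, which has more than 21,000 rows of names + activities, and I have about 10,000 met_with names in total (there are duplicates in my "name" column because the dataset contains many common names such as "Smith"; I will try to work out later which "Smith" the person actually met).
      • If your data contains duplicate names, you can build the pattern from the unique names only with pat <- str_c('\\b', unique(df$name), '\\b', collapse = '|'). And yes, there is a limit on the length of a regular expression, so with a large number of names it may not work unless you break the pattern up.
      • Thank you very much for your help! It has now been running for more than half an hour, so I think breaking it up may be the only way.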
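
One way the "break it up" idea from these comments could look is sketched below in base R. It is not code from the thread: `chunk_size`, `chunks` and `hits_by_chunk` are hypothetical names, the chunk size of 2 is chosen only to fit the toy data, and the names are assumed to contain no regex metacharacters.

```r
# Example data from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, stringsAsFactors = FALSE)

chunk_size <- 2  # hypothetical value; would be much larger on real data
nms <- unique(df$name)
# Split the unique names into groups of at most chunk_size names
chunks <- split(nms, ceiling(seq_along(nms) / chunk_size))

# One moderately sized pattern per chunk, applied to every row
hits_by_chunk <- lapply(chunks, function(ch) {
  pat <- paste0("\\b(", paste(ch, collapse = "|"), ")\\b")
  regmatches(df$activity, gregexpr(pat, df$activity))
})

# Merge the per-chunk matches row by row and drop the row's own name
df$met_with <- vapply(seq_len(nrow(df)), function(i) {
  found <- unlist(lapply(hits_by_chunk, `[[`, i), use.names = FALSE)
  paste(setdiff(found, df$name[i]), collapse = " ")
}, character(1))
```

Each pattern stays short, at the cost of scanning the activity column once per chunk. Note that names are collected chunk by chunk, so with more than one chunk the order within met_with may differ from the order in which the names appear in the text.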