Separate variable in field by character: answers

【Question title】: Separate variable in field by character
【Posted】: 2019-04-18 11:48:35
【Question description】:

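I recently asked this question, Separate contents of field, and got a very quick and very simple answer.

Something I can do easily in Excel is look at a cell, find the first instance of a character, and return all the characters to the left of that character.

For example

Authors

Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

In Excel I can extract Drijgers RL and Aalten P into separate columns. That lets me count the number of times someone is first author and last author.

How can I replicate this in R? From the separate-rows answer above, I can already count the total number of times an author has published.

How do I separate the first and last authors into their own columns? It may also be useful to know the general case. In this answer, Splitting column by separator from right to left in R, the number of columns is known. How do I say "split this string at the commas and put the pieces into an unknown number of columns, to the right of the original field, based on the number of names in the author list"?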

【Question comments】:

Tags: r regex tidyr

【Solution 1】:

Try this function:

extract_authors <- function(df, authors) {

  df[["FirstAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(",.*", "", df[[authors]])), df[[authors]]
  )


  df[["LastAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(".*,", "", df[[authors]])), "No last author"
  )

  return(df)

}

Using it with the sample data from the other answer in this thread:

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

You can call it like this:

extract_authors(sample_df, "authors")

The output has two new columns, FirstAuthor and LastAuthor:

                                                    authors FirstAuthor     LastAuthor
1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL      Aalten P.
2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL       Kahler S
3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL   Leentjens AF
4                                    Drijgers RL, Verhey FR Drijgers RL      Verhey FR
5                                               Drijgers RL Drijgers RL No last author
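The counting step the question mentions is not shown above, so here is a minimal sketch of it. It assumes the two columns are named FirstAuthor and LastAuthor as in the output above; the `papers` data frame and the `clean_name()` helper are hypothetical and exist only for illustration.

```r
# Hypothetical stand-in for the result of extract_authors(): one row per paper.
papers <- data.frame(
  FirstAuthor = c("Drijgers RL", "Drijgers RL", "Verhey FR", "Kahler S"),
  LastAuthor  = c("Aalten P.", "Kahler S", "Aalten P", "No last author"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing period so that "Aalten P." and
# "Aalten P" are counted as the same person.
clean_name <- function(x) sub("\\.$", "", x)

# Count how often each name appears as first author.
first_counts <- table(clean_name(papers$FirstAuthor))

# Exclude the placeholder used for single-author papers before counting.
has_last <- papers$LastAuthor != "No last author"
last_counts <- table(clean_name(papers$LastAuthor[has_last]))

first_counts
last_counts
```

If the function is changed to return NA instead of the "No last author" placeholder, `table()` drops those rows by default and the `has_last` filter is no longer needed.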

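【Comments】:

Fixed it; you can now also call it with just a single author.
Cool. I hope your solution gets upvoted, since it is 6x faster than what I showed :-)
Take a look at the microbenchmark in my answer now. Even with the new stringi solution, yours is still amazingly fast. Serious props!
Thanks @hrbrmstr! I'm surprised to see that, since mine was meant to be a pure convenience function whose robustness would be questioned under further requirements, e.g. if the OP wanted to start extracting the second, third, etc. authors. But good old ifelse and grepl may not be such a performance bottleneck after all, depending on the use case of course.

【Solution 2】: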

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

cbind.data.frame( # add the columns to the original data frame after the do.call() completes
  sample_df,
  do.call( # turn the list created with lapply below into a data frame
    rbind.data.frame, 
    lapply(
      strsplit(sample_df$authors, ", "), # split at comma+space
      function(x) {
        data.frame( # pull first/last into a data frame
          first = x[1],
          last = if (length(x) < 2) NA_character_ else x[length(x)], # NA last if only one author
          stringsAsFactors = FALSE
        )
      }
    )
  )
)
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>

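The above performs terribly. I made a stringi match-group extraction version, but arg0naut's is still faster. I also optimized arg0naut's a little, since whitespace only needs to be stripped on the left: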

library(stringi)

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

# make some copies since we're modifying in-place now
s1 <- s2 <- sample_df

microbenchmark::microbenchmark(

  stri_regex = {
    s1$first <-  stri_match_first_regex(s1$authors, "^([^,]+)")[,2]
    s1$last <- stri_trim_left(stri_match_last_regex(s1$authors, "([^,]+)$")[,2])
    s1$last <- ifelse(s1$last == s1$first, NA_character_, s1$last)
  },

  extract_authors = {
    s2[["first"]] <- ifelse(
      grepl(",", s2[["authors"]]), gsub(",.*", "", s2[["authors"]]), s2[["authors"]]
    )
    s2[["last"]] <- ifelse(
      grepl(",", s2[["authors"]]), trimws(gsub(".*,", "", s2[["authors"]]), "left"), NA_character_
    )

  }

)

Results:

## Unit: microseconds
##             expr     min       lq     mean   median       uq      max neval
##       stri_regex 236.948 265.8055 331.5695 291.6610 334.1685 1002.921   100
##  extract_authors 127.584 150.8490 217.1192 162.4625 227.9995 1130.913   100

identical(s1, s2)
## [1] TRUE

s1
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>
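
The question also asks how to spread every author into an unknown number of columns, which neither answer covers. As a rough sketch only, assuming base R and comma-separated names: split each row, pad the shorter lists with NA up to the longest one, and bind the result to the original data frame. The function name `split_authors()` and the `author_1`, `author_2`, ... column names are hypothetical.

```r
# Hypothetical helper: one column per author position, NA where a row has fewer authors.
split_authors <- function(df, col = "authors") {
  # Split at a comma followed by optional whitespace; one character vector per row.
  parts <- strsplit(df[[col]], ",\\s*")

  # The longest author list decides how many columns are needed.
  width <- max(lengths(parts))

  # Pad each vector with NA so every row has the same length.
  padded <- lapply(parts, function(x) c(x, rep(NA_character_, width - length(x))))

  # Stack the rows into a character matrix and name the columns.
  mat <- do.call(rbind, padded)
  colnames(mat) <- paste0("author_", seq_len(width))

  # Append the new columns to the right of the original data frame.
  cbind(df, as.data.frame(mat, stringsAsFactors = FALSE))
}

sample_df <- data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
)

wide <- split_authors(sample_df)
wide
```

This sketch does not handle a data frame with zero rows, and it makes no claim about speed relative to the benchmarked versions above.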

【Comments】: