How to use tidyr in R to separate a string column into multiple other columns

【Question title】: How to use tidyr in R to separate a string column into multiple other columns
【Posted】: 2020-10-06 12:26:36
【Question description】:

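I am using tidyr in R and trying to split the data in the 'pub_author' column (attached below) into 3 separate columns: 'website_title', 'year', and 'author'. I tried the separate() function with separate('pub_author', c('website_title', 'year', 'author'), '-'), but because R splits on every single '-', it only returns the first three words. Does anyone know how to group the words of the title and of the author so that they end up in the appropriate columns, or any other way to do this?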
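
To make the problem concrete, here is a minimal sketch of the behaviour described above, using a single sample string. The data frame name `df` and the `extra = "drop"` argument are assumptions added for illustration. The default, `extra = "warn"`, discards the same pieces but also emits a warning.

```r
library(tidyr)

# One sample value; `df` is a hypothetical name used only for this sketch
df <- data.frame(pub_author = "nfl-draft-geek-2018-justin-miller",
                 stringsAsFactors = FALSE)

# Splitting on every "-" yields 6 pieces; with only three target columns,
# the first three pieces are kept and the remaining ones are dropped
naive <- separate(df, pub_author,
                  into = c("website_title", "year", "author"),
                  sep = "-", extra = "drop")
naive
```

The three columns end up holding the first three hyphen-separated words rather than title, year, and author, which is why the separator has to be tied to the position of the 4-digit year.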

【Question discussion】:

Tags: r tidyr

【Solution 1】:

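With separate, we can split using regex lookarounds. Here the pattern matches a - that is immediately followed by 4 digits, or a - that is immediately preceded by 4 digits, so the hyphens inside the title and the author name are left alone.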

library(tidyr)
separate(df1, pub_author, into = c('website_title', 'year', 'author'),
         sep = "-(?=\\d{4})|(?<=\\d{4})-")
#        website_title year        author
#1       nfl-draft-geek 2018 justin-miller
#2                  cbs 2019   pete-prisco
#3            sb-nation 2020     dan-kadar
#4    football-fan-spot 2019 steven-lourie
#5             fanspeak 2018       william
#6 acme-packing-company 2020  shawn-wagner

Data

df1 <- structure(list(pub_author = c("nfl-draft-geek-2018-justin-miller",
  "cbs-2019-pete-prisco", "sb-nation-2020-dan-kadar",
  "football-fan-spot-2019-steven-lourie",
  "fanspeak-2018-william", "acme-packing-company-2020-shawn-wagner"
)), class = "data.frame", row.names = c(NA, -6L))
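
As a possible alternative to the lookaround split, the same result can probably be reached with tidyr's extract() and three capture groups. This is a sketch that is not part of the original answer; it rebuilds only the first three sample rows, and the result name `res` is an assumption for illustration.

```r
library(tidyr)

# First three rows of the sample data, rebuilt so the block is self-contained
df1 <- data.frame(
  pub_author = c("nfl-draft-geek-2018-justin-miller",
                 "cbs-2019-pete-prisco",
                 "sb-nation-2020-dan-kadar"),
  stringsAsFactors = FALSE
)

# Three capture groups: everything before the year, the 4-digit year,
# and everything after it
res <- tidyr::extract(df1, pub_author,
                      into = c("website_title", "year", "author"),
                      regex = "^(.*)-(\\d{4})-(.*)$")
res
```

Unlike separate(), extract() returns NA in every new column for rows that do not match the pattern, which can make malformed rows easier to spot.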

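【Discussion】:

Thanks @akrun, that worked! Similarly, how would I separate a column containing data like this: 9Shaq LawsonDE | Clemson into 5 columns: number, first name, last name, position, and school?
@GeorgeCoumantaros Based on the comments, maybe extract(df, pub_author, into = c('number', 'first name', 'last name', 'position', 'school'), "^(\\d+)(\\w+)\\s+([A-Z][a-z]+)([A-Z]{2})\\s+\\|\\s+(\\w+)")
It works partially, but many rows have missing values. The rows that work are perfect, but the rest return NA.
@GeorgeCoumantaros It may be that some rows follow a different pattern. The regex only works on rows that match the pattern.
@GeorgeCoumantaros I would ask that you use dput when posting a question, because others cannot copy data from an image, and it also helps you avoid potential downvotes :=)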
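
The NA rows mentioned in the comments can be tracked down by keeping the original column and listing the rows where the pattern failed. The sketch below reuses the regex from the comment above. The column name `player`, the data frame `df2`, the underscore-style output names, and the second and third rows are hypothetical, invented only to show two ways a row can fail to match.

```r
library(tidyr)

df2 <- data.frame(
  player = c("9Shaq LawsonDE | Clemson",     # row quoted in the comments
             "10Some PlayerOLB | Example",   # hypothetical: 3-letter position
             "11Pat O'NeilQB | Example"),    # hypothetical: apostrophe in last name
  stringsAsFactors = FALSE
)

# Regex taken from the comment thread
pattern <- "^(\\d+)(\\w+)\\s+([A-Z][a-z]+)([A-Z]{2})\\s+\\|\\s+(\\w+)"

# remove = FALSE keeps the original string next to the extracted pieces
res <- tidyr::extract(df2, player,
                      into = c("number", "first_name", "last_name",
                               "position", "school"),
                      regex = pattern, remove = FALSE)

# Rows where the pattern did not match come back as NA in every new column
unmatched <- res$player[is.na(res$number)]
unmatched
```

Inspecting `unmatched` shows which formats the pattern needs to be loosened for, for example a position group that allows more than two capital letters.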