【Question title】: Making a variable from detection of any multiple string pattern in one string
【Posted】: 2018-10-28 12:56:06
【Question】:

end_result_tbl

This end_result_tbl is an example of the ideal format, built from several different voter files.

ID     GEN_16  GEN_14  GEN_08  PP_16  PR_16  PR_15  PR_14
0001     1       1       1       1      0      0       0
0002     0       0       0       0      1      0       1
0003     1       1       1       0      0      0       0
0004     1       0       1       0      0      0       1
0005     1       0       1       1      1      0       1

raw_data_tbl

ID     Voter_History  
0001   GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
0002   2016 GENERAL ELECTION;2014 GENERAL ELECTION
0003   20121106 GE;20081104 GE;General Election 2006
0004   GE 20150910
0005   16 GENERAL ELECTION; 14 PRIMARY ELECTION

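I want to generate a variable for each election by conditionally matching strings within each text string.

Each election is written in roughly 9 different ways. If any variant matches, put a "1" to show a vote in that election; if none matches, put a "0" to show no vote.

Here are the variants for the November 2016 general election: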

GEN_16<-c("20161108 GE",
          "16 GENERAL ELECTION",
          "GENERAL 2016",
          "GENERAL ELECTION 2016", 
          "2016 GENERAL ELECTION", 
          "GENERAL ELECTION, 2016",
          "16 GENERAL ELECTION",
          "GE 20161108")

Here is what I have tried (for the 2016 general election only):

 raw_data_tbl$GEN_16<- 
 as.integer(stri_detect(raw_data_tbl$Voter_History,GEN_16))

 which(GEN_16 %in% raw_data_tbl$Voter_History)

require(dplyr)
Sequences <- GEN_16
Database <- raw_data_tbl$Voter_History

df=as.data.frame(sapply(Sequences, function(x) grep(x,Database)))
stats=df %>% summarise_all(funs(sum))
cbind(Sequences,as.numeric(stats))
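
A note on the attempts above: %in% only tests whether a pattern equals a whole Voter_History string, and vectorised detectors such as stringi's pair a pattern vector with the strings element by element rather than testing every pattern against every string. A minimal base-R sketch of one alternative, assuming a plain substring match on any variant counts as a vote: join the variants into a single alternation pattern and run grepl once over the column. The toy raw_data_tbl below only copies the five sample rows from the question, and gen_16_pattern is an illustrative name.

```r
# Toy copy of the sample rows from the question
raw_data_tbl <- data.frame(
  ID = c("0001", "0002", "0003", "0004", "0005"),
  Voter_History = c(
    "GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004",
    "2016 GENERAL ELECTION;2014 GENERAL ELECTION",
    "20121106 GE;20081104 GE;General Election 2006",
    "GE 20150910",
    "16 GENERAL ELECTION; 14 PRIMARY ELECTION"
  ),
  stringsAsFactors = FALSE
)

GEN_16 <- c("20161108 GE", "16 GENERAL ELECTION", "GENERAL 2016",
            "GENERAL ELECTION 2016", "2016 GENERAL ELECTION",
            "GENERAL ELECTION, 2016", "GE 20161108")

# Join the variants into one regex: "20161108 GE|16 GENERAL ELECTION|..."
# (none of these variants contains a regex metacharacter; escape them if yours do)
gen_16_pattern <- paste(unique(GEN_16), collapse = "|")

# One vectorised pass over the column; TRUE/FALSE becomes 1/0
raw_data_tbl$GEN_16 <- as.integer(grepl(gen_16_pattern, raw_data_tbl$Voter_History))
```

On the sample rows this gives 1, 1, 0, 0, 1. The match is case-sensitive, so "General Election 2006" style entries would need their own variants or an upper-casing step first. The same two lines could be repeated, or looped over a named list of pattern vectors, for each election.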

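This is actually very simple, if very long, code in SQL, but I am having a hard time finding its equivalent in R.

There are about 17 million voters in raw_data_tbl.

Any direction is greatly appreciated. Thanks in advance.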

【Comments】:

Tags: r regex pattern-matching dummy-variable

【Solution 1】:

You can try this:

library(stringr)
library(tidyverse)

#read input file
txt <- readLines("test.txt")

#put a comma delimiter between the two columns and read the result as a data frame
txt <- gsub("\\s+(.*)", ",\\1", txt)
df <- read.table(textConnection(txt), 
                header = TRUE, stringsAsFactors = FALSE, sep = ",", colClasses = c("ID" = "character"))

The initial data frame looks like this:

> df
#    ID                                                         Voter_History
#1 0001 GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
#2 0002                           2016 GENERAL ELECTION;2014 GENERAL ELECTION
#3 0003                         20121106 GE;20081104 GE;General Election 2006
#4 0004                                                           GE 20150910
#5 0005                              16 GENERAL ELECTION; 14 PRIMARY ELECTION

Clean the Voter_History column to extract the useful information:

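election_func <- function(x){
  #split one voter's history into its individual election entries
  entries <- strsplit(x, split = ";")[[1]]
  #extract year: first 4 digits of the first number, dropping the century from 4-digit years
  yr <- sub("^20([0-9]{2})$", "\\1", substr(str_extract(entries, "[0-9]+"), 1, 4))
  #extract election type: first two letters, upper-cased
  elec_type <- toupper(substr(str_extract(entries, "[A-Za-z]+"), 1, 2))

  return(paste(sort(paste(elec_type, yr, sep = "_")), collapse = ";"))
}

#apply to every row; vapply returns a plain character vector
df$Voter_History <- vapply(df$Voter_History, election_func, character(1), USE.NAMES = FALSE)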

The cleaned data is:

> df
#    ID                 Voter_History
#1 0001 GE_04;GE_08;GE_12;GE_16;PP_08
#2 0002                   GE_14;GE_16
#3 0003             GE_06;GE_08;GE_12
#4 0004                         GE_15
#5 0005                   GE_16;PR_14

Finally, convert the data to the desired format:

df1 <- df %>%
  separate_rows("Voter_History", sep= ";") %>%
  distinct(ID, Voter_History) %>%
  mutate(value = 1) %>%
  spread(Voter_History, value, fill = 0)

df1
#    ID GE_04 GE_06 GE_08 GE_12 GE_14 GE_15 GE_16 PP_08 PR_14
#1 0001     1     0     1     1     0     0     1     1     0
#2 0002     0     0     0     0     1     0     1     0     0
#3 0003     0     1     1     1     0     0     0     0     0
#4 0004     0     0     0     0     0     1     0     0     0
#5 0005     0     0     0     0     0     0     1     0     1
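
The reshaping step can also be sketched in base R, which may be useful where tidyr is not available. This is only an illustration, not part of the original answer: it assumes the cleaned ID / Voter_History columns shown above, and the names codes, long and wide are made up for the example.

```r
# Cleaned data as shown above (ID plus ";"-separated election codes)
df <- data.frame(
  ID = c("0001", "0002", "0003", "0004", "0005"),
  Voter_History = c("GE_04;GE_08;GE_12;GE_16;PP_08", "GE_14;GE_16",
                    "GE_06;GE_08;GE_12", "GE_15", "GE_16;PR_14"),
  stringsAsFactors = FALSE
)

# Split each history into its codes, then build one (ID, code) row per vote
codes <- strsplit(df$Voter_History, ";", fixed = TRUE)
long <- data.frame(ID = rep(df$ID, lengths(codes)), code = unlist(codes),
                   stringsAsFactors = FALSE)

# Cross-tabulate IDs against codes; "> 0" caps repeated (ID, code) pairs at 1
wide <- (table(long$ID, long$code) > 0) * 1L
```

Here wide is a 0/1 table with IDs as row names and election codes as column names, for example wide["0001", "GE_16"] is 1 and wide["0004", "GE_16"] is 0.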

Sample data: test.txt contains

ID     Voter_History  
0001   GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
0002   2016 GENERAL ELECTION;2014 GENERAL ELECTION
0003   20121106 GE;20081104 GE;General Election 2006
0004   GE 20150910
0005   16 GENERAL ELECTION; 14 PRIMARY ELECTION

(Update: added logic to resolve "Error: Duplicate identifiers for rows...". The error occurs because the spread call receives duplicate ID and Voter_History combinations.)
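
To illustrate how such duplicates can arise and be removed, here is a small base-R sketch. The rows are hypothetical: voter 0001 is assumed to have two raw spellings (say "GE 20161108" and "2016 GENERAL ELECTION") that both normalise to GE_16.

```r
# Hypothetical long-format rows after cleaning: the (0001, GE_16) pair appears twice
long <- data.frame(
  ID   = c("0001", "0001", "0001", "0002"),
  code = c("GE_16", "GE_16", "GE_12", "GE_16"),
  stringsAsFactors = FALSE
)

# Rows that repeat an earlier (ID, code) pair
dup_rows <- long[duplicated(long), ]

# Keep each (ID, code) pair once, comparable to dplyr's distinct(ID, code)
long_unique <- unique(long)
```

After this step each voter has at most one row per election code, so every cell of the wide table is identified by exactly one row.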

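【Discussion】:

Thanks for your answer, Prem. I believe it is almost there. I keep getting a duplicate identifiers error in the output, for example: "Error: Duplicate identifiers for rows (117, 118, 119, 120, 121), (207, 208, 209), (294, 295), (312, 313), (360, 361)...". I am considering using gsub() to make the initial dataset more uniform.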
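
A rough sketch of what such a gsub() pass might look like, in base R. The mapping of "GENERAL ELECTION" to GE and "PRIMARY ELECTION" to PR is an assumption based on the sample data, and the history and uniform vectors are made up for the example.

```r
# A few raw histories in the styles seen in the sample data
history <- c("2016 GENERAL ELECTION;2014 GENERAL ELECTION",
             "16 GENERAL ELECTION; 14 PRIMARY ELECTION",
             "General Election 2006")

# Upper-case first so "General Election" and "GENERAL ELECTION" are treated alike
uniform <- toupper(history)
# Collapse the long names to short codes (assumed mapping)
uniform <- gsub("GENERAL ELECTION", "GE", uniform, fixed = TRUE)
uniform <- gsub("PRIMARY ELECTION", "PR", uniform, fixed = TRUE)
# Drop stray spaces around the ";" separator
uniform <- gsub("\\s*;\\s*", ";", uniform)
```

This turns the three strings into "2016 GE;2014 GE", "16 GE;14 PR" and "GE 2006". It does not by itself remove repeated elections for the same voter, so a de-duplication step would still be needed before reshaping.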