Coding based on strings and order of strings in R: answers

【Question title】: Coding based on strings and order of strings in R
【Posted】: 2015-02-05 10:32:58
【Question】:

I have to code a lot of data.frames. For example:

tt <- data.frame(V1=c("test1", "test3", "test1", "test4", "wins", "loses"),
             V2=c("someannotation", "othertext", "loads of text including the word winning for the winner and the word losing for the loser", "blablabla", "blablabla", "blablabla"),
             stringsAsFactors=FALSE) # keep V1 and V2 as character, also in R versions before 4.0

tt 
V1       V2
test1    someannotation
test3    othertext
test1    loads of text including the word winning for the winner and the word losing for the loser
test4    blablabla
wins     blablabla
loses    blablabla

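The coding has to go into a new data.frame, and I have to code whether a runner wins or loses. If V1 says wins, he won (if he lost, V1 says loses). However, a runner can also win or lose part of a race, which is indicated by test1 in V1 and specified in V2. If the term winning appears before the term losing in V2, the runner wins that part of the race (and vice versa).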

I tried to implement elements of the answers from here to determine which word/string occurs at which position:

find location of character in string

The implementation looks like this:

result <- data.frame()
for(i in 1:length(tt[,1])){
  if(grepl("wins", tt[i,1])) result[i,1] <- "wins"
  if(grepl("loses", tt[i,1])) result[i,1] <- "loses"
  if(grepl("test1", tt[i,1])&(which(strsplit(tt[i,2], " ")[[1]]=="winning")>which(strsplit(tt[i,2], " ")[[1]]=="losing"))) result[i,1] <- "loses"
  if(grepl("test1", tt[i,1])&(which(strsplit(tt[i,2], " ")[[1]]=="winning")<which(strsplit(tt[i,2], " ")[[1]]=="losing"))) result[i,1] <- "wins"
}

But for cells of column V2 that contain neither winning nor losing, I get an error message:

Error in if (grepl("test1", tt[i, 1]) & (which(strsplit(tt[i, 2], " ")[[1]] ==  : argument is of length zero
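
The message can be reproduced in isolation. The following is a minimal sketch, assuming the cause is that which() returns an empty vector when a word is absent, so the comparison, and with it the whole if condition, has length zero. The variable names are illustrative and not part of the original code.

```r
# V2 of the first row contains neither "winning" nor "losing"
words <- strsplit("someannotation", " ")[[1]]

win_idx  <- which(words == "winning")   # integer(0): the word is absent
lose_idx <- which(words == "losing")    # integer(0)

cmp  <- win_idx > lose_idx              # logical(0)
cond <- TRUE & cmp                      # still logical(0), neither TRUE nor FALSE

# if (cond) ... would stop here, because if() needs a length-one condition
length(cond)

# One possible guard (an assumption, not the original code):
# only compare the positions when both words were found
both_found <- length(win_idx) > 0 && length(lose_idx) > 0
both_found
```

With such a guard in place, the position comparison is only evaluated for rows where both words occur.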

Does anyone have a solution to this problem, even a complicated one? Any help is appreciated, thanks!

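Edit: as @grrgrrbla clarified, there are two ways to win: either V1 == "wins", or V2 contains the word "winning" before the word "losing". Likewise, there are two ways to lose: either V1 == "loses", or V2 contains "losing" before "winning".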

My output should look like this:

result
  V1
  NA
  NA
  wins
  NA
  wins
  loses

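【Comments】:

Please specify your desired output: one column, two columns, do you just need one column saying wins/loses, do you need an index, and so on. As far as I understand, there are two ways to win: one is if V1 == "wins", the other is if V2 contains the word "winning" before "losing". And there are two ways to lose: V1 == "loses", or V2 contains "losing" before "winning", right? The output should be a single column of "wins" or "loses", right?
Why are there NA values in the output? When should NA values appear? Which input should give which result?
Since only rows 3, 5 and 6 contain "wins/winning" or "loses/losing", the other rows should be coded as NA.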

Tags: r regex string

【Solution 1】:

You could try (probably not the simplest solution...) writing a function that returns "wins" if your "win" condition is met, "loses" if your "lose" condition is met, and NA otherwise:

wilo <- function(vec) {
    # if the first variable is "wins" or "loses", return its value
    if (grepl("wins|loses", vec[1])) {
        return(vec[1])
    }
    # start of the first match of each word in the second variable (-1 if absent)
    win_pos  <- gregexpr("winning", vec[2])[[1]][1]
    lose_pos <- gregexpr("losing", vec[2])[[1]][1]
    # assumption: the order is only defined when both words occur,
    # so a row that lacks either word is coded NA
    if (win_pos < 0 || lose_pos < 0) {
        return(NA)
    }
    # "winning" placed before "losing" means "wins", otherwise "loses"
    if (win_pos < lose_pos) "wins" else "loses"
}

You can then apply the function to each row of the data.frame:

apply(tt,1,wilo)
#[1] NA      NA      "wins"  NA      "wins"  "loses"
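
For larger data.frames the same coding can also be done on whole columns without calling a function per row. This is only a sketch under stated assumptions: it uses regexpr() for the first match per row, treats V1 as stating the outcome only when it is exactly "wins" or "loses", and codes rows that lack either word as NA. The names win_pos, lose_pos, both, coded and direct are made up for illustration.

```r
tt <- data.frame(
  V1 = c("test1", "test3", "test1", "test4", "wins", "loses"),
  V2 = c("someannotation", "othertext",
         "loads of text including the word winning for the winner and the word losing for the loser",
         "blablabla", "blablabla", "blablabla"),
  stringsAsFactors = FALSE
)

# Start of the first match in every row; regexpr() gives -1 where there is none
win_pos  <- as.integer(regexpr("winning", tt$V2, fixed = TRUE))
lose_pos <- as.integer(regexpr("losing",  tt$V2, fixed = TRUE))
both     <- win_pos > 0 & lose_pos > 0

# Start with NA everywhere, then fill in rows where the order is defined
coded <- rep(NA_character_, nrow(tt))
coded[both & win_pos < lose_pos] <- "wins"
coded[both & win_pos > lose_pos] <- "loses"

# V1 takes precedence when it already states the outcome
direct <- tt$V1 %in% c("wins", "loses")
coded[direct] <- tt$V1[direct]

result <- data.frame(V1 = coded, stringsAsFactors = FALSE)
result
```

Whether V1 should be matched exactly or as a substring (as grepl does above) depends on the real data, so that choice is left open here.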

Note: as @grrgrrbla suggested, an alternative to the function gregexpr is the function str_locate from the stringr package.
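
A possible shape for that alternative, as a sketch only: it assumes the stringr package is installed, and order_code is a hypothetical helper name. str_locate returns a matrix with start and end columns and NA where the pattern is not found, so rows lacking either word come out as NA.

```r
library(stringr)

# Hypothetical helper: code a character vector by which word comes first
order_code <- function(text) {
  win_start  <- str_locate(text, fixed("winning"))[, "start"]
  lose_start <- str_locate(text, fixed("losing"))[, "start"]
  # the comparison is NA when either word is missing, and ifelse() keeps that NA
  ifelse(win_start < lose_start, "wins", "loses")
}

order_code(c("someannotation",
             "the word winning comes before the word losing",
             "the word losing comes before the word winning"))
```

This only covers the V2 part of the coding; the check on V1 would still be needed, as in wilo above.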

【Comments】:

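You could also use str_locate from the stringr package to find the positions of the winning and losing conditions and see which one is smaller: ifelse(str_locate(vec[2], "winning") - str_locate(vec[2], "losing") < 0, return("wins"), return("loses"))
@grrgrrbla, yes indeed, thanks for your comment. I am not used to that package; I will add the alternative, thanks.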