【Question title】: Merge data frame rows by string parse
【Posted】: 2015-07-07 04:49:02
【Question】:

I am trying to import a conversation with the following structure into a data frame:

conversation<-data.frame(
             uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                         "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                         "01/08/2015 2:59:19 pm: Person 1: Same here"))

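With this structure, the date, time, person, and message are relatively easy to parse. In some cases, however, a message contains line breaks, so the data frame ends up malformed, like this: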

conversation_errors<-data.frame(
                     uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                 "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                 "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                                 "lend me your arms,",
                                 "fast as thunderbolts,",
                                 "for a pillow on my journey."))

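How would you merge these rows? Is there a package I am not aware of?

The desired function would simply detect the missing structure and "merge" the line into the previous row, so that I get: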

conversation_fixed<-data.frame(
                    uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."))

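Any ideas?

【Discussion】:

  • My thinking is that there needs to be a deterministic way to attach the last 3 rows of the data frame conversation_errors to Person 1. Can you explain how that would be known?
  • @TimBiegeleisen Basically, by the missing structure... the data is too dirty to offer anything more specific, sorry.
  • You may want to add stringsAsFactors = FALSE to your data.frame call.
  • @vaettchen Sorry, the reproducible example doesn't have it, but my code does. That's not the problem...

Tags: r string text dataframe string-concatenation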


【Solution 1】:

Assuming the timestamp lets you correctly identify the well-formed rows (expressed as properDataRegex below), this will do it:

mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

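properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"
# TRUE for lines that do not start with a date, i.e. continuation lines
improperDataBool <- !grepl(properDataRegex, mydata)
while (any(improperDataBool)) {
    # continuation lines whose preceding line is well-formed
    mergeWPrevIndex <- which(c(FALSE, !improperDataBool[-length(improperDataBool)]) &
                             improperDataBool)
    # nothing left to merge (e.g. the very first line lacks a timestamp)
    if (length(mergeWPrevIndex) == 0) break
    # append each such line to its predecessor, then drop it
    mydata[mergeWPrevIndex - 1] <-
        paste(mydata[mergeWPrevIndex - 1], mydata[mergeWPrevIndex])
    mydata <- mydata[-mergeWPrevIndex]
    improperDataBool <- !grepl(properDataRegex, mydata)
}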

mydata
## [1] "01/08/2015 2:49:49 pm: Person 1: Hello"                                                                                                    
## [2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you"                                                                                         
## [3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."
## [4] "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method but it will get the job done."

Here mydata is a character vector, but it can of course now be turned into a data.frame as in the question, or parsed with read.table() or read.fwf().
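
To illustrate the parsing step mentioned above, here is a minimal sketch that splits the merged lines into date, time, person, and message columns. The pattern fieldRegex and the names merged and parsed are not from the answer; they are assumptions based on the sample lines, and the sketch assumes every merged line matches that layout.

```r
# Merged lines, as produced by the loop above (first three entries only).
merged <- c(
  "01/08/2015 2:49:49 pm: Person 1: Hello",
  "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
  "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."
)

# Assumed layout: "<date> <time> <am|pm>: <person>: <message>".
# The person field is everything up to the first colon after the timestamp,
# so the message itself may contain further colons.
fieldRegex <- "^(\\d{2}/\\d{2}/\\d{4}) (\\d{1,2}:\\d{2}:\\d{2} [ap]m): ([^:]+): (.*)$"

# regexec() returns the full match plus one entry per capture group;
# dropping the first element keeps only the four fields.
parts <- regmatches(merged, regexec(fieldRegex, merged))
parsed <- as.data.frame(do.call(rbind, lapply(parts, `[`, -1)),
                        stringsAsFactors = FALSE)
names(parsed) <- c("date", "time", "person", "message")

parsed[, c("date", "time", "person")]
```

The date and time are left as strings here, because the sample data does not make clear whether the dates are day/month or month/day. A line that does not match fieldRegex would yield an empty match and break the rbind step, so real data may need a check before this point.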

【Discussion】:

  • As usual I had some trouble with the regular expression, but it works well, thanks!
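
As a possible alternative to the while loop in Solution 1, the same merge can be sketched without iteration by giving every line a group number and pasting each group together. This is not the answerer's code; the names isStart, group, and merged are introduced here for illustration, and the regex is reused from the answer.

```r
mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"

# TRUE where a line starts a new message.
isStart <- grepl(properDataRegex, mydata)

# cumsum() gives every line the index of the most recent start line,
# so continuation lines share a group with the message they belong to.
group <- cumsum(isStart)

# Paste each group into one string. Lines before the first timestamp
# (group 0) would be kept as their own leading entry in this sketch.
merged <- unname(vapply(split(mydata, group), paste, character(1),
                        collapse = " "))

length(merged)
```

On the sample data this yields four strings, one per timestamped message, with the continuation lines joined by single spaces in the same way as the loop.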

【Solution 2】:

Here is another approach:

read.table(text = paste(gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", conversation_errors$uniquerow),
                        collapse = " "), sep = "\n", stringsAsFactors = FALSE)[, 1]

This gives:

[1] "01/08/2015 2:49:49 pm: Person 1: Hello "                                                                                                   
[2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you "                                                                                        
[3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."

(Thanks to Ken for the borrowed regex)
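
One caveat with this approach: read.table() by default treats quote characters and # specially (its quote and comment.char arguments), so messages containing an apostrophe or a # can be read incorrectly. The sketch below keeps the same "insert a newline before each timestamp" idea but splits with strsplit() instead. The second message is a hypothetical example added to show the case; it is not from the question.

```r
conversation_errors <- data.frame(
  uniquerow = c("01/08/2015 2:49:49 pm: Person 1: Hello",
                "01/08/2015 2:51:49 pm: Person 2: Don't forget #5",  # hypothetical message
                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                "lend me your arms,",
                "fast as thunderbolts,",
                "for a pillow on my journey."),
  stringsAsFactors = FALSE
)

# Mark the start of each well-formed row with a newline, as in the answer.
marked <- gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", conversation_errors$uniquerow)

# Join everything, then split on the markers. No quote or comment handling
# is involved, so apostrophes and # are left untouched.
joined <- paste(marked, collapse = " ")
pieces <- strsplit(joined, "\n", fixed = TRUE)[[1]]

# Drop the empty piece before the first marker and trim stray spaces.
merged <- trimws(pieces[nzchar(pieces)])

merged[2]
```

Passing quote = "" and comment.char = "" to read.table() should have a similar effect if you prefer to stay with the original call.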

