【Question title】:Collapse multiple rows of strings into one row based on a condition
【Posted】:2017-10-15 21:22:40
【Question description】:

Suppose I have this data:

df <- data.frame(
   text = c("Treatment1: This text is","on two lines","","Treatment2:This text","has","three lines","","Treatment3: This has one")
                  )
df
                      text
1 Treatment1: This text is
2             on two lines
3                         
4     Treatment2:This text
5                      has
6              three lines
7                         
8 Treatment3: This has one

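How would I parse this text so that each "Treatment" starts its own row, with all of the text that follows it collapsed onto that same row?

For example, this is the desired output: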

text
1 Treatment1: This text is on two lines
2 Treatment2: This text has three lines                
3 Treatment3: This has one

Can anyone recommend a way to do this?

【Question discussion】:

    Tags: r regex split


    【Solution 1】:

    Perhaps something like the following.
    First, the data in dput format, which is the best format for sharing a dataset in a post.

    df <-
    structure(list(text = c("Treatment1: This text is", "on two lines", 
    "", "Treatment2:This text", "has", "three lines", "", "Treatment3: This has one"
    )), .Names = "text", class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8"))
    

    Now the base R code.

    fact <- cumsum(grepl("treatment", df$text, ignore.case = TRUE))
    result <- do.call(rbind, lapply(split(df, fact), function(x)
                         trimws(paste(x$text, collapse = " "))))
    result <- as.data.frame(result)
    names(result) <- "text"
    result
    #                                   text
    #1 Treatment1: This text is on two lines
    #2  Treatment2:This text has three lines
    #3              Treatment3: This has one
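
The grouping step is the part that does the work, so here is a minimal sketch of what `fact` contains for this data. The helper name `is_start` is only for illustration and is not in the answer; `stringsAsFactors = FALSE` is an assumption so the column is plain character on any R version.

```r
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE  # assumption: keep text as character, not factor
)

# TRUE on every row that mentions "treatment" (case-insensitive)
is_start <- grepl("treatment", df$text, ignore.case = TRUE)

# cumsum() counts the TRUEs seen so far, so each row gets the number
# of the most recent "Treatment" row above it (or itself)
fact <- cumsum(is_start)
fact
# [1] 1 1 1 2 2 2 2 3
```

Rows sharing the same number are pasted together by `split()`/`lapply()`. Note that the pattern matches "treatment" anywhere in a row; if continuation lines could contain that word, anchoring it as `"^Treatment"` would probably be safer.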
    

    Edit.
    As Rich Scriven pointed out in his comment, tapply can greatly simplify the code above. (I didn't see it; sometimes I overcomplicate things.)

    result2 <- data.frame(
        text = tapply(df$text, fact, function(x) trimws(paste(x, collapse = " ")))
    )
    
    all.equal(result, result2)
    #[1] "Component “text”: 'current' is not a factor"
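
The `all.equal()` message appears to be about column type (one `text` column is a factor, the other is not) rather than about the strings themselves. A minimal sketch, assuming you only care about content, is to compare both columns as character vectors. The name `same_content` is hypothetical.

```r
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE  # assumption: character column
)
fact <- cumsum(grepl("treatment", df$text, ignore.case = TRUE))

# Version 1: split / lapply / rbind
result <- as.data.frame(do.call(rbind, lapply(split(df, fact), function(x)
  trimws(paste(x$text, collapse = " ")))))
names(result) <- "text"

# Version 2: tapply
result2 <- data.frame(
  text = tapply(df$text, fact, function(x) trimws(paste(x, collapse = " ")))
)

# as.character() drops factor levels, names and dims, so only the
# strings themselves are compared
same_content <- identical(as.character(result$text),
                          as.character(result2$text))
same_content
```

With the sample data this should come out `TRUE`, i.e. both versions produce the same three strings.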
    

    【Discussion】:

    • Have a look at tapply(). It can replace do.call(rbind, lapply(split(...), ...))
    • @RichScriven Thanks, I edited the answer following your suggestion.
    【Solution 2】:
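    # Join all rows into one string, then mark each whitespace run before "Treatment".
    x <- gsub("\\s+Treatment", "*BREAK*Treatment",
              paste(df[[1]], collapse = " "))
    # Split on the marker to get one element per treatment.
    data.frame(text = unlist(strsplit(x, "\\*BREAK\\*")))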
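
A possible variant of the same idea, sketched here as an assumption rather than as part of the answer: split directly on the whitespace that precedes "Treatment" using a Perl lookahead, which avoids the `*BREAK*` marker. The extra `sub()` call normalises the spacing after the first colon so that "Treatment2:This" matches the desired output in the question. The names `parts` and `result3` are hypothetical.

```r
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE  # assumption: character column
)

# One long string; empty rows just contribute extra spaces
x <- paste(df$text, collapse = " ")

# Split on whitespace that is immediately followed by "Treatment";
# the lookahead (?=...) keeps the word itself in the next piece
parts <- strsplit(x, "\\s+(?=Treatment)", perl = TRUE)[[1]]

# Make the first colon in each piece be followed by exactly one space
parts <- sub(":\\s*", ": ", parts)

result3 <- data.frame(text = parts, stringsAsFactors = FALSE)
result3$text
# [1] "Treatment1: This text is on two lines"
# [2] "Treatment2: This text has three lines"
# [3] "Treatment3: This has one"
```

Like the original, this relies on the literal word "Treatment" starting each block, so it would need adjusting for other markers.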
    

    【Discussion】:
