Repeat dataframe rows based on cumsum index: answers

【Question title】: Repeat dataframe rows based on cumsum index
【Posted】: 2019-07-22 13:55:07
【Question description】:

I have a data frame like the following:

data.frame(title="Title", bk=c("Book 1", "Book 1", "Book 3"), ch=c("Chapter 1", "Chapter 2", "Chapter 1"))

  title     bk        ch
1 Title Book 1 Chapter 1
2 Title Book 1 Chapter 2
3 Title Book 3 Chapter 1

How can I repeat each observation according to the cumsum index below:

id=c(1,1,1,2,2,3,3,3,3)

so that the data frame is expanded to accommodate the source vector from which the cumsum index was generated?

  title     bk        ch   source_vector
1 Title Book 1 Chapter 1   ...
1 Title Book 1 Chapter 1   
1 Title Book 1 Chapter 1   
2 Title Book 1 Chapter 2   
2 Title Book 1 Chapter 2   
3 Title Book 3 Chapter 1   
3 Title Book 3 Chapter 1   
3 Title Book 3 Chapter 1   
3 Title Book 3 Chapter 1

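【Question discussion】:

How do you want to use id? Or do you just want to split each word in content into its own row?
The original data is Chinese text, which I split on punctuation with str_split.
@akrun It looks the same to me (separated words == length of group), but since I'm not sure, I reopened it.
@Sotos I think this is different from the question you flagged. I couldn't work out what I need from the answers over there.
I reopened it, but I still can't see what you are trying to accomplish.

Tags: r cumsum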

【Solution 1】:

In base R, you can call rbind through do.call, after running strsplit and cbind on each row:

x <- data.frame(title="Title", bk=c("Book 1", "Book 1", "Book 3"), ch=c("Chapter 1", "Chapter 2", "Chapter 1"), content=c("This is the", "content of", "each chapter in books"))
do.call("rbind", by(x, 1:nrow(x), function(x) {cbind(x[-ncol(x)], str_split_content=strsplit(as.character(x$content[1]), " ")[[1]])}))
#    title     bk        ch str_split_content
#1.1 Title Book 1 Chapter 1              This
#1.2 Title Book 1 Chapter 1                is
#1.3 Title Book 1 Chapter 1               the
#2.1 Title Book 1 Chapter 2           content
#2.2 Title Book 1 Chapter 2                of
#3.1 Title Book 3 Chapter 1              each
#3.2 Title Book 3 Chapter 1           chapter
#3.3 Title Book 3 Chapter 1                in
#3.4 Title Book 3 Chapter 1             books
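
The same result can also be reached without the per-row by() call. The following is a minimal base R sketch, not the answerer's code: it splits content once, builds the row index with rep(), and indexes the data frame with that index. The names words, id and out are made up for illustration, and stringsAsFactors = FALSE is set explicitly so that strsplit() receives character input on any R version.

```r
x <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1"),
  content = c("This is the", "content of", "each chapter in books"),
  stringsAsFactors = FALSE
)

# Split every content string into its words (a list with one vector per row)
words <- strsplit(x$content, " ")

# Repeat each row number once per word: 1 1 1 2 2 3 3 3 3
id <- rep(seq_len(nrow(x)), times = lengths(words))

# Index the rows with id and attach the flattened words as a new column
out <- cbind(x[id, c("title", "bk", "ch")], source_vector = unlist(words))
rownames(out) <- NULL
out
```

Here id is the same index vector as in the question, so x[id, ] on its own already repeats the rows; the cbind() step only adds the source vector next to them.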

【Discussion】:

【Solution 2】:

If you just want to expand the rows according to the number of words in content, here is one way:

library(splitstackshape)
expandRows(ddf, lengths(gregexpr("\\W+", ddf$content)) + 1, count.is.col = FALSE)

#    title     bk        ch               content
#1   Title Book 1 Chapter 1           This is the
#1.1 Title Book 1 Chapter 1           This is the
#1.2 Title Book 1 Chapter 1           This is the
#2   Title Book 1 Chapter 2            content of
#2.1 Title Book 1 Chapter 2            content of
#3   Title Book 3 Chapter 1 each chapter in books
#3.1 Title Book 3 Chapter 1 each chapter in books
#3.2 Title Book 3 Chapter 1 each chapter in books
#3.3 Title Book 3 Chapter 1 each chapter in books
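
One caveat with counting separators and adding one: gregexpr() returns -1 when nothing matches, so a single-word string is counted as two words. The sketch below counts the pieces returned by strsplit() instead and expands the rows with rep() in base R. It is only an illustration: ddf is recreated here from the data in the other answers because this answer does not define it, and count_sep and count_split are hypothetical helper names.

```r
ddf <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1"),
  content = c("This is the", "content of", "each chapter in books"),
  stringsAsFactors = FALSE
)

# Separator-based count, as used in the expandRows() call above
count_sep <- function(s) lengths(gregexpr("\\W+", s)) + 1

# Piece-based count: number of elements strsplit() returns per string
count_split <- function(s) lengths(strsplit(s, "\\W+"))

count_sep(ddf$content)    # 3 2 4
count_split(ddf$content)  # 3 2 4
count_sep("word")         # 2, because the no-match result -1 has length 1
count_split("word")       # 1

# Repeat each row once per word, without any extra package
expanded <- ddf[rep(seq_len(nrow(ddf)), times = count_split(ddf$content)), ]
```

For the example data both counts agree, so the difference only matters when some rows contain a single word.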

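【Discussion】:

@akrun I know, but based on our discussion with the OP, I think all they may need to figure out is how to expand... I guess I'm answering on assumptions until the OP clarifies.
What does that have to do with this answer? Yes, I know you wouldn't downvote. I don't agree...
Yes, plus the reopening/noise etc... but I don't see why we are discussing this...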

【Solution 3】:

This is closer to what I want:

df %>%
  mutate(str_split_content = str_split(content, " ")) %>%
  unnest()

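Someone posted this and then edited/deleted it a little while ago.

It turns out the content was actually split with str_split on punctuation, so it is not split exactly by word count.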
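
In tidyr 1.0.0 and later, unnest() expects the list-column to be named through cols, and calling it with no arguments triggers a warning. A hedged sketch of the same pipeline under that assumption follows. It uses base strsplit() in place of str_split() to avoid the stringr dependency, and df is recreated here from the data in the other answers because this answer does not show it.

```r
library(dplyr)
library(tidyr)

# Assumed input, rebuilt from the data shown in the other answers
df <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1"),
  content = c("This is the", "content of", "each chapter in books"),
  stringsAsFactors = FALSE
)

out <- df %>%
  # Store the split words as a list-column, one character vector per row
  mutate(str_split_content = strsplit(content, " ")) %>%
  # Expand the list-column so that each word gets its own row
  unnest(cols = str_split_content)

out
```

If the text has to be split on punctuation rather than on spaces, only the pattern passed to strsplit() needs to change.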

【Discussion】:

df %>% unnest(str_split_content = str_split(content, " ")) Just read the documentation, and unnest allows this :)

【Solution 4】:

One option is to use separate_rows:

library(tidyverse)
df1 %>%
    separate_rows(content)
#  title     bk        ch content
#1 Title Book 1 Chapter 1    This
#2 Title Book 1 Chapter 1      is
#3 Title Book 1 Chapter 1     the
#4 Title Book 1 Chapter 2 content
#5 Title Book 1 Chapter 2      of
#6 Title Book 3 Chapter 1    each
#7 Title Book 3 Chapter 1 chapter
#8 Title Book 3 Chapter 1      in
#9 Title Book 3 Chapter 1   books

If we need to replicate the original rows instead:

df1 %>% 
    uncount(str_count(content, "\\w+")) %>%
    as_tibble
# A tibble: 9 x 4
#  title bk     ch        content              
#  <fct> <fct>  <fct>     <fct>                
#1 Title Book 1 Chapter 1 This is the          
#2 Title Book 1 Chapter 1 This is the          
#3 Title Book 1 Chapter 1 This is the          
#4 Title Book 1 Chapter 2 content of           
#5 Title Book 1 Chapter 2 content of           
#6 Title Book 3 Chapter 1 each chapter in books
#7 Title Book 3 Chapter 1 each chapter in books
#8 Title Book 3 Chapter 1 each chapter in books
#9 Title Book 3 Chapter 1 each chapter in books

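【Discussion】:

So how do you handle the per id part here? Because if this is the solution, then we agree it's a dupe.
@Sotos I would say, if the OP came up with a huge for loop and wanted to fix something, would it be fair to show a simpler solution without the for loop? My comment on your flag was based on the intent of the OP's post, but the output he/she gets is the same.
Sure. But I don't understand your point. The example works because they are the same as the length of each group. Maybe I don't understand the question.
@Sotos Here, the OP came up with a strsplit, created some 'id's, and then wanted to get the expected output in a roundabout way.
But that doesn't happen when other people post... I see you are moving away from a friendly discussion, so I'll leave. Have a nice day, Arun!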