【Question title】: R: split and duplicate a row
【Posted】: 2016-08-29 00:30:06
【Question】:

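My data frame has a column that I want to split on dashes, duplicating the row for the values on the left and right of each dash. I know how to split and duplicate, but not how to keep the rest of the string. That is a poor description; I think it is easier to show the data frame and the desired output.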

tmp = structure(list(Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can", 
"8 - 12 Pack 11.5 - 16 oz Bottle or Can"), Row.Count = c("899", 
"305"), Test = c("B", "A")), .Names = c("Unit.Types", "Row.Count", 
"Test"), row.names = c(104L, 196L), class = "data.frame") 

library(tidyr)
library(dplyr)

tmp2 = tmp %>% mutate(Unit.Types = strsplit(as.character(Unit.Types), "-")) %>% unnest(Unit.Types)
tmp2

  Row.Count Test             Unit.Types
1       899    B                    10 
2       899    B          12 Pack 11.2 
3       899    B  14.9 oz Bottle or Can
4       305    A                     8 
5       305    A          12 Pack 11.5 
6       305    A    16 oz Bottle or Can

My desired output should look like this:

                     Unit.Types Row.Count Test
1 10 Pack 11.2 oz Bottle or Can       899    B
2 10 Pack 14.9 oz Bottle or Can       899    B
3 12 Pack 11.2 oz Bottle or Can       899    B
4 12 Pack 14.9 oz Bottle or Can       899    B
5  8 Pack 11.5 oz Bottle or Can       305    A
6    8 Pack 16 oz Bottle or Can       305    A
7 12 Pack 11.5 oz Bottle or Can       305    A
8   12 Pack 16 oz Bottle or Can       305    A

Or at least like this, with only the "oz" range split on the dash:

                          Unit.Types Row.Count Test
1 10 - 12 Pack 11.2 oz Bottle or Can       899    B
2 10 - 12 Pack 14.9 oz Bottle or Can       899    B
3  8 - 12 Pack 11.5 oz Bottle or Can       305    A
4    8 - 12 Pack 16 oz Bottle or Can       305    A

Any help is much appreciated!

【Comments】:

  • Are all rows of the form "10 - 12 Pack 11.2 - 14.9 oz Bottle or Can"?
  • It can also be "10 Pack 14 - 16 oz Can".

Tags: r strsplit


【Solution 1】:

Take a look at this function:

f <- function(x){
    strsplit(x, " Pack | oz Bottle or Can")[[1]] %>%
    strsplit(" - ") %>%
    expand.grid() %>%
    mutate(V = paste(Var1, "Pack", Var2, "oz Bottle or Can")) %>%
    `[[`("V")
}

It is applied to a single string from the Unit.Types column. Example:

> f(tmp$Unit.Types[[1]])
[1] "10 Pack 11.2 oz Bottle or Can" "12 Pack 11.2 oz Bottle or Can"
[3] "10 Pack 14.9 oz Bottle or Can" "12 Pack 14.9 oz Bottle or Can"
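
To make the pipeline inside f easier to follow, here is a minimal sketch that unrolls it into separate base R steps for one input string. The intermediate names (parts, ends, grid, result) are introduced only for illustration and do not appear in the answer.

```r
# Step-by-step version of f() in base R (no pipes), for one input string.
x <- "10 - 12 Pack 11.2 - 14.9 oz Bottle or Can"

# 1. Cut the string at " Pack " and at the trailing " oz Bottle or Can".
parts <- strsplit(x, " Pack | oz Bottle or Can")[[1]]
# parts is c("10 - 12", "11.2 - 14.9")

# 2. Split each range on " - " to get its endpoints.
ends <- strsplit(parts, " - ")
# ends is list(c("10", "12"), c("11.2", "14.9"))

# 3. Build every combination of pack count (Var1) and size (Var2).
#    expand.grid varies the first column fastest.
grid <- expand.grid(ends, stringsAsFactors = FALSE)

# 4. Glue the fixed words back around each combination.
result <- paste(grid$Var1, "Pack", grid$Var2, "oz Bottle or Can")
print(result)
```

Because expand.grid varies its first column fastest, the rows come out in the order shown in the example above rather than in the order listed in the question; sort the result afterwards if the order matters.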

Using this function, we can then do the following:

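# Split tmp into one-row data frames, expand each with f(), then stack them.
ans <- tmp %>%
    split(seq_len(nrow(tmp))) %>%
    lapply(function(x) data.frame(Unit.Types = f(x$Unit.Types),
                                  Row.Count = x$Row.Count,
                                  Test = x$Test)) %>%
    do.call(rbind, .)
row.names(ans) <- NULL  # drop the "1.1", "1.2", ... names left by rbind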

ans is the data.frame we want.
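
As an alternative sketch, the same row expansion can be done in base R without split and rbind, by repeating each source row once per generated string. This is not the answerer's code: the variables expanded and idx are names chosen here for illustration, and f is rewritten without pipes so the block runs on its own.

```r
tmp <- data.frame(
  Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can",
                 "8 - 12 Pack 11.5 - 16 oz Bottle or Can"),
  Row.Count = c("899", "305"),
  Test = c("B", "A"),
  stringsAsFactors = FALSE
)

# Same logic as f() above, written without pipes.
f <- function(x) {
  parts <- strsplit(x, " Pack | oz Bottle or Can")[[1]]
  grid <- expand.grid(strsplit(parts, " - "), stringsAsFactors = FALSE)
  paste(grid$Var1, "Pack", grid$Var2, "oz Bottle or Can")
}

# One character vector of expanded strings per input row.
expanded <- lapply(tmp$Unit.Types, f)

# Index of the source row for every output row, e.g. 1 1 1 1 2 2 2 2.
idx <- rep(seq_len(nrow(tmp)), lengths(expanded))

# Repeat the source rows, then overwrite the column with the expanded strings.
ans <- tmp[idx, , drop = FALSE]
ans$Unit.Types <- unlist(expanded)
row.names(ans) <- NULL
print(ans)
```

Indexing with idx copies every other column along automatically, so the sketch does not need to list Row.Count and Test by name.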

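UPD, regarding your comment: we can use a regular expression that matches either a pair of numbers separated by ' - ' or a single number, and rewrite f with it: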

# The dot is escaped so that it matches only a literal decimal point.
regex <- "[0-9]+(\\.[0-9]+)?( - [0-9]+(\\.[0-9]+)?)?"

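f <- function(x){
    m <- gregexpr(regex, x)
    # numeric specifications found in x, e.g. "10 - 12" or "11.2"
    matches <- regmatches(x, m)[[1]]
    # text between the matches; [-1] drops the piece before the first match
    nonmatches <- regmatches(x, m, invert = TRUE)[[1]][-1]
    strsplit(matches, " - ") %>%
    expand.grid(stringsAsFactors = FALSE) %>%
    # interleave each combination of numbers with the text pieces
    apply(MARGIN = 1, function(y) rbind(y, nonmatches) %>%
                                  c %>%
                                  paste(collapse = ""))
}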

This function can even handle strings with three or more numeric specifications:

> x <- "2 - 3 big packs of 10 - 12 Pack 11.2 - 14.9 oz Can"
> f(x)
[1] "2 big packs of 10 Pack 11.2 oz Can" "3 big packs of 10 Pack 11.2 oz Can"
[3] "2 big packs of 12 Pack 11.2 oz Can" "3 big packs of 12 Pack 11.2 oz Can"
[5] "2 big packs of 10 Pack 14.9 oz Can" "3 big packs of 10 Pack 14.9 oz Can"
[7] "2 big packs of 12 Pack 14.9 oz Can" "3 big packs of 12 Pack 14.9 oz Can"
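
The asker's comment mentions strings such as "10 Pack 14 - 16 oz Can", where one specification is a single number. The sketch below is a base R variant of the regex approach that also keeps any text in front of the first number, which f drops through its `[-1]`. The name expand_ranges, the variable gaps, and the escaped dot in the pattern are assumptions made here, not part of the answer.

```r
# A number, optionally followed by " - " and a second number.
# The dot is escaped so it matches only a literal decimal point.
regex <- "[0-9]+(\\.[0-9]+)?( - [0-9]+(\\.[0-9]+)?)?"

# expand_ranges() is a hypothetical name for this variant of f().
expand_ranges <- function(x) {
  m <- gregexpr(regex, x)
  matches <- regmatches(x, m)[[1]]
  # Text before, between and after the matches: length(matches) + 1 pieces.
  gaps <- regmatches(x, m, invert = TRUE)[[1]]
  if (length(matches) == 0) return(x)  # nothing to expand

  # Every combination of range endpoints, one column per specification.
  grid <- expand.grid(strsplit(matches, " - "), stringsAsFactors = FALSE)

  # Interleave each combination with the text pieces, keeping the leading text.
  out <- apply(grid, 1, function(y) {
    paste0(gaps[1], paste(c(rbind(y, gaps[-1])), collapse = ""))
  })
  unname(out)
}

print(expand_ranges("10 Pack 14 - 16 oz Can"))
print(expand_ranges("Case of 10 - 12 oz Keg"))
```

Since nothing in the function depends on the words around the numbers, it applies to other endings (Bottle, Keg and so on) without changes.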

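【Comments】:

  • Thanks! This works for strings that end in "oz Bottle or Can". In the more general case the ending can take other forms (e.g. the string may end with "blah.. 10 - 12 oz Can" or something else: bottle, keg, etc.).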