Separate a shopping list into multiple columns: answers

【Question title】: Separate a shopping list into multiple columns
【Posted】: 2020-03-08 09:04:00
【Question description】:

I have shopping-list data like this:

df <- data.frame(id = 1:5, item = c("apple2milk5", "milk1", "juice3apple5", "egg10juice1", "egg8milk2"), stringsAsFactors = FALSE)

#   id         item
# 1  1  apple2milk5
# 2  2        milk1
# 3  3 juice3apple5
# 4  4  egg10juice1
# 5  5    egg8milk2

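I want to split the variable item into multiple columns that record the number following each product. The problem is that every person buys different products, so I can't solve it with tidyr::separate() or similar functions. My expected output is: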

#   id apple milk  juice egg  
# 1  1 2     5     NA    NA   
# 2  2 NA    1     NA    NA   
# 3  3 5     NA    3     NA   
# 4  4 NA    NA    1     10   
# 5  5 NA    2     NA    8

Note: the product categories on the market are unknown, so don't assume there are only 4 products.

Thanks for your help!

【Discussion】:

Tags: r tidyr

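【Solution 1】:

I'll add another answer. It differs only slightly from @ASuliman's, but it uses some newer tidyr functions and some neat regex to be more direct.

The regex trick is that the pattern "(?<=\\d)\\B(?=[a-z])" matches the non-boundary (i.e. the empty position) between a digit and a letter, which lets you create one row for each "apple5"-style entry. Extract the letters into an item column and the digits into a count column. With the new pivot_wider, which replaces spread, you can convert those counts to numeric while reshaping.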

library(dplyr)
library(tidyr)

df %>%
  separate_rows(item, sep = "(?<=\\d)\\B(?=[a-z])") %>%
  extract(item, into = c("item", "count"), regex = "^([a-z]+)(\\d+)$") %>%
  pivot_wider(names_from = item, values_from = count, values_fn = list(count = as.numeric))
#> # A tibble: 5 x 5
#>      id apple  milk juice   egg
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     5    NA    NA
#> 2     2    NA     1    NA    NA
#> 3     3     5    NA     3    NA
#> 4     4    NA    NA     1    10
#> 5     5    NA     2    NA     8
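
To see what the split pattern does on its own, here is a minimal base R sketch. It uses strsplit() with perl = TRUE as a stand-in for separate_rows() (an assumption made only to keep the example dependency-free), and the variable names are illustrative rather than taken from the answer.

```r
# Illustrative only: apply the answer's pattern with base R instead of tidyr.
pattern <- "(?<=\\d)\\B(?=[a-z])"
items <- c("apple2milk5", "milk1", "egg10juice1")

# Each string is cut at the empty position between a digit and the next letter,
# so every piece is one "name + count" entry.
pieces <- strsplit(items, pattern, perl = TRUE)
print(pieces)
# [[1]] "apple2" "milk5"
# [[2]] "milk1"
# [[3]] "egg10"  "juice1"
```

Note that "egg10" stays in one piece: the position between "1" and "0" is not followed by a letter, so the look-ahead rejects it.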

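【Discussion】:

【Solution 2】:

A cleaner data.table solution, with some help from stringr:

library(data.table)
library(stringr)

dt <- as.data.table(df)  # the snippet needs a data.table, not a plain data.frame

dt[, 
   .(it_count = str_extract_all(item, "[0-9]+")[[1]], 
     it_name = str_extract_all(item, "[^0-9]+")[[1]]), 
   by = id
   ][, dcast(.SD, id ~ it_name, value.var = "it_count")]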

   id apple  egg juice milk
1:  1     2 <NA>  <NA>    5
2:  2  <NA> <NA>  <NA>    1
3:  3     5 <NA>     3 <NA>
4:  4  <NA>   10     1 <NA>
5:  5  <NA>    8  <NA>    2

【Discussion】:

【Solution 3】:

Mostly base R, with some help from stringr and data.table:

library(stringr)
library(data.table)
cbind(
  id = df$id,
  rbindlist(
    lapply(df$item, function(x) as.list(setNames(str_extract_all(x, "[0-9]+")[[1]], strsplit(x, "[0-9]+")[[1]]))),
    fill = TRUE
  )
)

   id apple milk juice  egg
1:  1     2    5  <NA> <NA>
2:  2  <NA>    1  <NA> <NA>
3:  3     5 <NA>     3 <NA>
4:  4  <NA> <NA>     1   10
5:  5  <NA>    2  <NA>    8

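【Discussion】:

Thanks for your help! This is a great solution.

【Solution 4】:

Place a space before each run of digits and a newline after it. Then read that data with read.table and unnest it. Finally, use pivot_wider to convert from long to wide format.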

library(dplyr)
library(tidyr)

df %>%
  mutate(item = gsub("(\\d+)", " \\1\n", item)) %>%
  rowwise %>%
  mutate(item = list(read.table(text = item, as.is = TRUE))) %>%
  ungroup %>%
  unnest(item) %>%
  pivot_wider(names_from = "V1", values_from = "V2")

This gives:

# A tibble: 5 x 5
     id apple  milk juice   egg
  <int> <int> <int> <int> <int>
1     1     2     5    NA    NA
2     2    NA     1    NA    NA
3     3     5    NA     3    NA
4     4    NA    NA     1    10
5     5    NA     2    NA     8
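
The gsub() and read.table() step is easier to follow on a single string. The sketch below uses only base R; the variable names (x, txt, parsed) are illustrative and not part of the answer.

```r
# Illustrative only: the same gsub() + read.table() step on one string.
x <- "egg10juice1"

# Put a space before each number and a newline after it.
txt <- gsub("(\\d+)", " \\1\n", x)
# txt is now "egg 10\njuice 1\n"

# read.table() parses one product per line into two columns, V1 and V2.
parsed <- read.table(text = txt, as.is = TRUE)
print(parsed)
#      V1 V2
# 1   egg 10
# 2 juice  1
```

Because read.table() guesses column types, V2 comes back as integer, which is why the answer needs no explicit as.numeric() call.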

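Variation

Here is a variation of the code above that eliminates unnest. We replace each run of digits with a space, that run of digits, another space, the id, and a newline. Then we read it in with read.table. Note the use of %$% rather than %>% before read.table. Finally, use pivot_wider to convert from long to wide format.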

library(dplyr)
library(magrittr)
library(tidyr)

df %>%
  rowwise %>%
  mutate(item = gsub("(\\d+)", paste(" \\1", id, "\n"), item)) %$%
  read.table(text = item, as.is = TRUE, col.names = c("nm", "no", "id")) %>%
  ungroup %>%
  pivot_wider(names_from = "nm", values_from = "no")

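【Discussion】:

Thanks for your help! This is a great solution.
Hey, the ungroup on the second-to-last line of the variation can be removed, right?
It would still work, but it's best to always finish a group_by or rowwise with ungroup. Otherwise the data remembers that it is grouped in later operations, and you may be in for a surprise.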
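
The kind of surprise meant here can be shown with a small sketch. It assumes dplyr is installed; the toy tibble d and the column total are made up for illustration.

```r
library(dplyr)

d <- tibble(x = 1:3)

# Still row-wise: sum() sees one row at a time.
still_rowwise <- d %>%
  rowwise() %>%
  mutate(total = sum(x))

# After ungroup(): sum() sees the whole column again.
ungrouped <- d %>%
  rowwise() %>%
  ungroup() %>%
  mutate(total = sum(x))

still_rowwise$total  # 1 2 3
ungrouped$total      # 6 6 6
```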

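【Solution 5】:

I just came up with a tidyverse solution. Use str_extract_all() to extract the quantities and set their names to the product names. Then reduce(bind_rows) produces the expected result.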

library(tidyverse)

df$item %>%
  map(~ set_names(str_extract_all(., "\\d+")[[1]], str_extract_all(., "\\D+")[[1]])) %>%
  reduce(bind_rows) %>%
  mutate_all(as.numeric) %>%
  bind_cols(df, .)

#   id         item apple milk juice egg
# 1  1  apple2milk5     2    5    NA  NA
# 2  2        milk1    NA    1    NA  NA
# 3  3 juice3apple5     5   NA     3  NA
# 4  4  egg10juice1    NA   NA     1  10
# 5  5    egg8milk2    NA    2    NA   8
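
The building block of this answer is a named vector per row. As a rough base R equivalent of the set_names(str_extract_all(...)) step, using regmatches() in place of stringr (a substitution made here for illustration only), one row looks like this:

```r
# Illustrative only: build the named quantity vector for one row in base R.
x <- "egg10juice1"

qty <- regmatches(x, gregexpr("[0-9]+", x))[[1]]    # "10" "1"
nm  <- regmatches(x, gregexpr("[^0-9]+", x))[[1]]   # "egg" "juice"

# Quantities as values, product names as names.
row <- setNames(as.numeric(qty), nm)
print(row)
#   egg juice
#    10     1
```

bind_rows() then lines these vectors up by name, filling NA wherever a product is missing from a row.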

【Discussion】:

【Solution 6】:

Here is a simple solution based on base R and stringr:

goods <- unique(unlist(stringr::str_split(df$item, pattern = "[0-9]")))
goods <- goods[goods != ""]
df <- cbind(df$id, sapply(goods,
       function(x) stringr::str_extract(df$item, pattern = paste0(x,"[0-9]*"))))
df <- as.data.frame(df)
df[-1] <- lapply(df[-1], function(x) as.numeric(stringr::str_extract(x, pattern = "[0-9]*$")))
names(df)[1] <- "id"

Output

  id apple milk juice egg
1  1     2    5    NA  NA
2  2    NA    1    NA  NA
3  3     5   NA     3  NA
4  4    NA   NA     1  10
5  5    NA    2    NA   8

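【Discussion】:

Sorry, I actually don't know how many kinds of products there are on the market, so the first line of your code doesn't work in my case.
@DarrenTsai Fixed
Thanks for your help. I noticed that row 4 shows 1 egg, but it should be 10.
@DarrenTsai Fixed one of the patterns. Thanks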

【Solution 7】:

# Insert a comma after each run of digits that is followed by a non-digit
# (a positive look-ahead assertion), then split on that comma.
library(dplyr)
library(tidyr)
df %>% mutate(item=gsub('(\\d+(?=\\D))','\\1,' ,item, perl = TRUE)) %>% 
       separate_rows(item, sep = ",") %>% 
       extract(item, into = c('prod','quan'), '(\\D+)(\\d+)') %>% 
       spread(prod, quan, fill=0)

  id apple egg juice milk
1  1     2   0     0    5
2  2     0   0     0    1
3  3     5   0     3    0
4  4     0  10     1    0
5  5     0   8     0    2
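
The look-ahead step can be checked in isolation. This is a minimal base R sketch on a few sample strings; with_commas and parts are illustrative names, and strsplit() stands in for separate_rows().

```r
# Illustrative only: the comma-insertion step on its own.
x <- c("apple2milk5", "egg10juice1", "milk1")

# A run of digits gets a trailing comma only when a non-digit follows it,
# so the last number in each string is left alone.
with_commas <- gsub("(\\d+(?=\\D))", "\\1,", x, perl = TRUE)
print(with_commas)
# "apple2,milk5" "egg10,juice1" "milk1"

# Splitting on the comma yields one "name + count" entry per element.
parts <- strsplit(with_commas, ",", fixed = TRUE)
```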

【Discussion】:

Thanks for your help! This is a great solution.

【Solution 8】:

You can try:

library(tidyverse)
library(stringi)
df %>% 
  mutate(item2 = gsub("[0-9]", " ", item)) %>% 
  mutate(item3 = gsub("[a-z]", " ", item)) %>% 
  mutate_at(vars(item2, item3), ~stringi::stri_extract_all_words(.) %>% map(paste, collapse=",")) %>% 
  separate_rows(item2, item3, sep = ",") %>% 
  spread(item2, item3)

  id         item apple  egg juice milk
1  1  apple2milk5     2 <NA>  <NA>    5
2  2        milk1  <NA> <NA>  <NA>    1
3  3 juice3apple5     5 <NA>     3 <NA>
4  4  egg10juice1  <NA>   10     1 <NA>
5  5    egg8milk2  <NA>    8  <NA>    2

【Discussion】:

Thanks for your help! This is a great solution.

【Solution 9】:

tmp = lapply(strsplit(df$item, "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)", perl = TRUE),
             function(x) {
                 d = split(x, 0:1)
                 setNames(as.numeric(d[[2]]), d[[1]])
             })
nm = unique(unlist(lapply(tmp, names)))

cbind(df, do.call(rbind, lapply(tmp, function(x) setNames(x[nm], nm))))
#  id         item apple milk juice egg
#1  1  apple2milk5     2    5    NA  NA
#2  2        milk1    NA    1    NA  NA
#3  3 juice3apple5     5   NA     3  NA
#4  4  egg10juice1    NA   NA     1  10
#5  5    egg8milk2    NA    2    NA   8
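
The answer comes without commentary, so here is a short sketch of what happens inside the anonymous function for one row. The vector x is what strsplit() with the two-sided lookaround pattern returns for "apple2milk5"; it is written out by hand here for illustration.

```r
# Illustrative only: the alternating pieces for "apple2milk5".
x <- c("apple", "2", "milk", "5")

# split() recycles 0:1 along x, so odd positions (names) land in group "0"
# and even positions (quantities) land in group "1".
d <- split(x, 0:1)
d[[1]]  # "apple" "milk"
d[[2]]  # "2" "5"

# Quantities become the values, product names become the names.
row <- setNames(as.numeric(d[[2]]), d[[1]])
print(row)
# apple  milk
#     2     5
```

Indexing each such vector with the full set of names (x[nm]) is what produces NA for products a row does not contain.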

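【Discussion】:

Thanks for your help! This is a great solution.

【Solution 10】:

Maybe something like this, which should work for any items/quantities. It only assumes that each quantity follows its item.

Let's use a custom function to extract the items and quantities: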

my_fun <- function(w) {
  # split on runs of digits to get the product names
  items <- stringr::str_split(w, "\\d+", simplify = TRUE)
  items <- items[items != ""] # drop the empty strings the split leaves behind
  # split on runs of non-digits to get the quantities
  quantities <- stringr::str_split(w, "\\D+", simplify = TRUE)
  quantities <- quantities[quantities != ""]

  d <- data.frame(item = items, quantity = quantities, stringsAsFactors = FALSE)

  return(d)
}

Example:

my_fun("apple2milk5")
# gives:
#    item quantity
# 1 apple        2
# 2  milk        5
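
The function filters out empty strings after each split. The sketch below shows where they come from, assuming stringr is installed; by_digits and by_letters are illustrative names.

```r
library(stringr)

w <- "apple2milk5"

# The string ends with a digit, so splitting on digits leaves a trailing "".
by_digits <- str_split(w, "\\d+")[[1]]
print(by_digits)   # "apple" "milk"  ""

# The string starts with letters, so splitting on non-digits leaves a leading "".
by_letters <- str_split(w, "\\D+")[[1]]
print(by_letters)  # ""  "2" "5"
```

str_split() keeps these empty pieces, which is why my_fun() has to remove them before building the data frame.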

Now we can apply the function to each id, using nest and map:

library(dplyr)
library(tidyr)
df_result <- df %>% 
  nest(data = item) %>% 
  mutate(res = purrr::map(data, ~ my_fun(.x$item))) %>% 
  select(-data) %>% 
  unnest(res)

df_result
# # A tibble: 9 x 3
# id item  quantity
# <int> <chr> <chr>   
# 1     1 apple 2       
# 2     1 milk  5       
# 3     2 milk  1       
# 4     3 juice 3       
# 5     3 apple 5       
# 6     4 egg   10      
# 7     4 juice 1       
# 8     5 egg   8       
# 9     5 milk  2

Now we can use dcast() (spread would probably work too):

data.table::dcast(df_result, id~item, value.var="quantity")

#     id apple  egg juice milk
#   1  1     2 <NA>  <NA>    5
#   2  2  <NA> <NA>  <NA>    1
#   3  3     5 <NA>     3 <NA>
#   4  4  <NA>   10     1 <NA>
#   5  5  <NA>    8  <NA>    2

Data:

df <- data.frame(id = 1:5, item = c("apple2milk5", "milk1", "juice3apple5", "egg10juice1", "egg8milk2"), stringsAsFactors = FALSE)

【Discussion】:

My thoughts exactly! An alternative to dcast: tidyr::spread(df, item, quantity)
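
Since Solution 1 notes that pivot_wider replaces spread, the same reshaping step can also be sketched with pivot_wider. The long table below is typed in by hand to match the item/quantity rows shown above, with quantities entered as numbers for convenience; the name long is illustrative.

```r
library(tidyr)

# Illustrative only: the long-format result, rebuilt by hand.
long <- data.frame(
  id       = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  item     = c("apple", "milk", "milk", "juice", "apple",
               "egg", "juice", "egg", "milk"),
  quantity = c(2, 5, 1, 3, 5, 10, 1, 8, 2),
  stringsAsFactors = FALSE
)

# One column per product; missing combinations become NA.
wide <- pivot_wider(long, names_from = item, values_from = quantity)
print(wide)
```

Unlike dcast and spread, which sort the new columns alphabetically, pivot_wider keeps them in order of first appearance (apple, milk, juice, egg).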