【Question title】: Spread returns duplicate identifier error even with unique rows included
【Posted】: 2018-07-01 09:18:46
【Question description】:

I have the following data frame:

       location asset_status count   row
      <chr>          <chr>  <dbl>  <int>
 1  location1        Owned     1     1
 2  location1    Available     1     2
 3  location1        Owned     1     3
 4  location2        Owned     1     4
 5  location2        Owned     1     5
 6  location2        Owned     1     6
 7  location2        Owned     1     7
 8  location2    no status     1     8
 9  location3        Owned     1     9
10  location3        Owned     1    10

When I try to call spread on it, I get the following error:

df <- head(us_can_laptops,10) %>% 
  select(location,asset_status,count) %>% 
  #mutate(row = row_number()) %>% #excluded
  group_by(location) %>% 
  spread(asset_status,count)

Error: Duplicate identifiers for rows (4, 5, 6, 7), (1, 3)

So, following other SO questions on this topic, I added a unique identifier with mutate:

df <- head(us_can_laptops,10) %>% 
  select(location,asset_status,count) %>% 
  mutate(row = row_number()) %>%
  group_by(location) %>% 
  spread(asset_status,count)

But this returns:

    location     row   Available   `no status` Owned
 *        <chr> <int>     <dbl>       <dbl> <dbl>
 1  location2     4        NA          NA     1
 2  location2     5        NA          NA     1
 3  location2     6        NA          NA     1
 4  location2     7        NA          NA     1
 5  location2     8        NA           1    NA
 6  location3    10        NA          NA     1
 7  location3     9        NA          NA     1
 8  location1     1        NA          NA     1
 9  location1     2         1          NA    NA
10  location1     3        NA          NA     1

Also, whenever I add a summarise call, it breaks my spread.

This is the desired result:

 location        Available   `no status` Owned
 *      <chr>     <dbl>       <dbl>      <dbl>
 1  location1        1          NA        2
 2  location2       NA          1         4
 3  location3       NA          NA        2

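Any help would be greatly appreciated. I know this looks like a duplicate, but the answers to the following linked questions still don't solve my problem: "Spread function Error: Duplicate identifiers for rows [duplicate]" and "Spread with duplicate identifiers for rows".

I'm really looking for a solution that uses dplyr rather than dcast.

【Comments】:

  • In the expected output, where does the value 4 in the first row come from?
  • Oh, please ignore the row column. It's just a counter. I'll remove it in the actual code.

Tags: r dataframe dplyr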


【Solution 1】:

The following should work (at least it gives the desired output):

df <- structure(
  list(location = c("location1", "location1", "location1", "location2", "location2",
                    "location2", "location2", "location2", "location3", "location3"),
       asset_status = c("Owned", "Available", "Owned", "Owned", "Owned",
                        "Owned", "Owned", "no status", "Owned", "Owned"),
       count = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
       row = 1:10),
  row.names = c(NA, -10L),
  .Names = c("location", "asset_status", "count", "row"),
  class = "data.frame")

library(dplyr)
library(tidyr)
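df %>% 
  # collapse repeated location/asset_status pairs into a single total
  group_by(location, asset_status) %>% 
  summarise(count = sum(count)) %>% 
  # each pair now occurs once, so spread() has exactly one value per cell
  spread(key = asset_status, value = count)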
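
A minimal sketch of why the aggregation step matters, assuming the same sample data (rebuilt here without the row counter so the block runs on its own). `spread()` needs every combination of the remaining columns and the key to point at a single value. Counting rows per `location`/`asset_status` pair shows the repeated pairs that trigger the error; after `summarise()` each pair occurs once. The names `dupes`, `n_rows` and `wide` are only illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, without the row counter
df <- data.frame(
  location = c(rep("location1", 3), rep("location2", 5), rep("location3", 2)),
  asset_status = c("Owned", "Available", "Owned",
                   "Owned", "Owned", "Owned", "Owned", "no status",
                   "Owned", "Owned"),
  count = 1L,
  stringsAsFactors = FALSE
)

# Pairs that occur more than once are what spread() rejects as duplicate identifiers
dupes <- df %>%
  group_by(location, asset_status) %>%
  summarise(n_rows = n()) %>%
  ungroup() %>%
  filter(n_rows > 1)

# Aggregating first leaves one row per pair, so spread() can fill each cell
wide <- df %>%
  group_by(location, asset_status) %>%
  summarise(count = sum(count)) %>%
  ungroup() %>%
  spread(key = asset_status, value = count)

print(dupes)
print(wide)
```

Adding a `row` identifier avoids the error as well, but it makes every row unique, which is why that attempt returns ten rows instead of one per location.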

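【Discussion】:

  • I posted the same thing, except that I used summarize_at("count", sum) to avoid repeating the variable name :)
  • I often forget about summarise_at, because I find the explicit form more readable... but maybe that's just me. That said, I really like combining summarise_at with regular expressions via vars(matches(...)) :-).
  • Thanks @Tino and Moody_Mudskipper. That worked great. I think I was overcomplicating my code.
  • Yes, it's just a matter of taste here; I only took the opportunity to advertise the function.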
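
The `summarise_at` variants mentioned in the comments could look roughly like the sketch below. It assumes the same sample data (rebuilt so the block is self-contained); the regular expression `"^count"` and the names `wide_at` and `wide_regex` are illustrative assumptions, not taken from the thread. In recent dplyr releases `summarise_at()` is superseded by `across()`, but it remains available.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, without the row counter
df <- data.frame(
  location = c(rep("location1", 3), rep("location2", 5), rep("location3", 2)),
  asset_status = c("Owned", "Available", "Owned",
                   "Owned", "Owned", "Owned", "Owned", "no status",
                   "Owned", "Owned"),
  count = 1L,
  stringsAsFactors = FALSE
)

# Name the column once as a string instead of writing count = sum(count)
wide_at <- df %>%
  group_by(location, asset_status) %>%
  summarise_at("count", sum) %>%
  ungroup() %>%
  spread(key = asset_status, value = count)

# Select the columns to sum by regular expression (the pattern is only an example)
wide_regex <- df %>%
  group_by(location, asset_status) %>%
  summarise_at(vars(matches("^count")), sum) %>%
  ungroup() %>%
  spread(key = asset_status, value = count)

print(wide_at)
```

Both pipelines aggregate before spreading, so they should give the same one-row-per-location table as the accepted answer; the regex form mainly pays off when several columns share a naming pattern.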