【Question title】: Complete dataframe with missing combinations of values
【Posted】: 2018-12-03 19:18:55
【Question】:

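I have a data frame with a two-level factor (distance) and a year column (years). For every level of distance, I would like to add the years values that are missing, with area set to 0.

i.e. go from this: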

    distance years area
1      NPR     3   10
2      NPR     4   20
3      NPR     7   30
4      100     1   40
5      100     5   50
6      100     6   60

to this:

   distance years area
1       NPR     1    0
2       NPR     2    0
3       NPR     3   10
4       NPR     4   20
5       NPR     5    0
6       NPR     6    0
7       NPR     7   30
8       100     1   40
9       100     2    0
10      100     3    0
11      100     4    0
12      100     5   50
13      100     6   60
14      100     7    0

I tried to apply the expand function:

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

expand(df, years = 1:7)

But this only produces a one-column data frame and does not expand the original one:

# A tibble: 7 x 1
  years
  <int>
1     1
2     2
3     3
4     4
5     5
6     6
7     7

expand.grid does not work either:

require(utils)    
expand.grid(df, years = 1:7)

Error in match.names(clabs, names(xi)) : 
  names do not match previous names
In addition: Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Is there a simple way to expand my data frame? And how could I expand it by two categories, distance and uniqueLoc?

distance <- rep(c("NPR", "100"), each = 3)
years <- c(3, 4, 7, 1, 5, 6)
area <- seq(10, 60, 10)
uniqueLoc <- rep(c("a", "b"), 3)

df <- data.frame(uniqueLoc, distance, years, area)

> df
  uniqueLoc distance years area
1         a      NPR     3   10
2         b      NPR     4   20
3         a      NPR     7   30
4         b      100     1   40
5         a      100     5   50
6         b      100     6   60

【Comments】:

Tags: r tidyr


【Solution 1】:

You can use the tidyr::complete function:

complete(df, distance, years = full_seq(years, period = 1), fill = list(area = 0))

# A tibble: 14 x 3
   distance years  area
   <fct>    <dbl> <dbl>
 1 100         1.   40.
 2 100         2.    0.
 3 100         3.    0.
 4 100         4.    0.
 5 100         5.   50.
 6 100         6.   60.
 7 100         7.    0.
 8 NPR         1.    0.
 9 NPR         2.    0.
10 NPR         3.   10.
11 NPR         4.   20.
12 NPR         5.    0.
13 NPR         6.    0.
14 NPR         7.   30.

Or slightly shorter:

complete(df, distance, years = 1:7, fill = list(area = 0))
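The second part of the question, expanding by both distance and uniqueLoc, is not covered above. A minimal sketch, assuming the four-column df from the question: listing both columns in complete() crosses them, while wrapping them in tidyr::nesting() keeps only the pairs that actually occur in the data. The names all_combos and observed_pairs are made up for illustration.

```r
library(tidyr)

# The four-column data frame from the question
df <- data.frame(
  uniqueLoc = rep(c("a", "b"), 3),
  distance  = rep(c("NPR", "100"), each = 3),
  years     = c(3, 4, 7, 1, 5, 6),
  area      = seq(10, 60, 10)
)

# Every uniqueLoc x distance x years combination: 2 * 2 * 7 = 28 rows
all_combos <- complete(df, uniqueLoc, distance, years = 1:7,
                       fill = list(area = 0))

# Only the uniqueLoc/distance pairs present in df, each with years 1 to 7
observed_pairs <- complete(df, nesting(uniqueLoc, distance), years = 1:7,
                           fill = list(area = 0))
```

In this example all four uniqueLoc/distance pairs occur, so both calls return 28 rows. The two forms would only differ if some pair were absent from the data.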

【Comments】:

    【Solution 2】:

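    Combining tidyr::pivot_wider() and tidyr::pivot_longer() also makes implicit missing values explicit. Note that this only recovers years that occur for at least one distance: in the example data year 2 never occurs, so it is not added, and years comes back as a character column.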

    # Load packages
    library(tidyverse)

    # Your data
    df <- tibble(distance = c(rep("NPR", 3), rep("100", 3)),
                 years = c(3, 4, 7, 1, 5, 6),
                 area = seq(10, 60, by = 10))

    # Solution
    df %>%
      pivot_wider(names_from = years,
                  values_from = area) %>% # pivot_wider() makes your implicit missing values explicit
      pivot_longer(-distance, names_to = "years",
                   values_to = "area") %>% # Turn to your desired format (long); -distance selects every year column
      mutate(area = replace_na(area, 0)) # Replace missing values (NA) with 0s
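The expand.grid attempt in the question fails because expand.grid expects vectors, not a whole data frame. As a rough base R sketch of the same "build the full grid, then join" idea, assuming the three-column version of df and years 1 to 7 (grid and out are illustrative names):

```r
# Three-column version of the data from the question
df <- data.frame(
  distance = rep(c("NPR", "100"), each = 3),
  years    = c(3, 4, 7, 1, 5, 6),
  area     = seq(10, 60, 10)
)

# Full grid of distance x years, built from vectors
grid <- expand.grid(distance = unique(df$distance), years = 1:7,
                    stringsAsFactors = FALSE)

# Left join the observed rows onto the grid; unmatched rows get NA for area
out <- merge(grid, df, by = c("distance", "years"), all.x = TRUE)

# Replace the NAs introduced by the join with 0
out$area[is.na(out$area)] <- 0

# Order by group and year for readability
out <- out[order(out$distance, out$years), ]
```

Unlike the pivot round trip, this adds year 2 as well, because the grid is built from the explicit 1:7 range rather than from the years present in the data.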
    

    【Comments】: