【Question title】: How to combine sparse columns tidily
【Posted】: 2017-04-13 20:45:15
【Question】:

A colleague has some data made up of many sparse columns that should be collapsed into a few filled columns. For example:

d1 <- data.frame(X1 = c(rep("Northampton", times=3), rep(NA, times=7)), 
                 X2 = c(rep(NA, times=3), rep("Amherst", times=5), rep(NA, times=2)), 
                 X3 = c(rep(NA, times=8), rep("Hadley", times=2)), 
                 X4 = c(rep("Stop and Shop", times=2), rep(NA, times=6), rep("Stop and Shop", times=2)), 
                 X5 = c(rep(NA, times=2), rep("Whole Foods", times=6), rep(NA, times=2)))

d1
            X1      X2     X3            X4          X5
1  Northampton    <NA>   <NA> Stop and Shop        <NA>
2  Northampton    <NA>   <NA> Stop and Shop        <NA>
3  Northampton    <NA>   <NA>          <NA> Whole Foods
4         <NA> Amherst   <NA>          <NA> Whole Foods
5         <NA> Amherst   <NA>          <NA> Whole Foods
6         <NA> Amherst   <NA>          <NA> Whole Foods
7         <NA> Amherst   <NA>          <NA> Whole Foods
8         <NA> Amherst   <NA>          <NA> Whole Foods
9         <NA>    <NA> Hadley Stop and Shop        <NA>
10        <NA>    <NA> Hadley Stop and Shop        <NA>

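X1:X3 should be collapsed into one column named Town, and X4:X5 into one column named Store. There must be a tidyverse solution for this. I tried gather() and unite(), but did not find anything elegant.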

【Comments】:

    Tags: r dplyr tidyr tidyverse


    【Answer 1】:

    You can use coalesce, which returns the first non-NA value at each position:

    library(dplyr)

    d1 %>% mutate_if(is.factor, as.character) %>%    # coerce explicitly
        transmute(town = coalesce(X1, X2, X3), 
                  store = coalesce(X4, X5))
    
    ##           town         store
    ## 1  Northampton Stop and Shop
    ## 2  Northampton Stop and Shop
    ## 3  Northampton   Whole Foods
    ## 4      Amherst   Whole Foods
    ## 5      Amherst   Whole Foods
    ## 6      Amherst   Whole Foods
    ## 7      Amherst   Whole Foods
    ## 8      Amherst   Whole Foods
    ## 9       Hadley Stop and Shop
    ## 10      Hadley Stop and Shop
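
When there are too many columns to list by hand, the same idea can be written by splicing a list of columns into coalesce with `!!!`. This is only a sketch under a few assumptions: the data is rebuilt with stringsAsFactors = FALSE, and the names `town_cols` and `store_cols` are made up for this illustration.

```r
library(dplyr)

# Same data as in the question, built as character columns
d1 <- data.frame(
  X1 = c(rep("Northampton", 3), rep(NA, 7)),
  X2 = c(rep(NA, 3), rep("Amherst", 5), rep(NA, 2)),
  X3 = c(rep(NA, 8), rep("Hadley", 2)),
  X4 = c(rep("Stop and Shop", 2), rep(NA, 6), rep("Stop and Shop", 2)),
  X5 = c(rep(NA, 2), rep("Whole Foods", 6), rep(NA, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical groupings of the sparse columns
town_cols  <- c("X1", "X2", "X3")
store_cols <- c("X4", "X5")

# !!! splices each list element in as a separate argument to coalesce()
res <- data.frame(
  Town  = coalesce(!!!unname(as.list(d1[town_cols]))),
  Store = coalesce(!!!unname(as.list(d1[store_cols]))),
  stringsAsFactors = FALSE
)
```

The column groups are plain character vectors here, so they could just as well be built from a naming pattern.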
    

    【Comments】:

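      【Answer 2】:

      I think a series of gather() calls and some trimming will get you what you want. The one trick is to use gather()'s na.rm = TRUE argument to drop the unwanted rows.

      library(tidyr)
      library(dplyr)
      d1 %>% 
        # na.rm = TRUE drops the NA cells; the result's row order may differ from d1
        gather(key = "town", value = "town_name", X1:X3, na.rm = TRUE) %>% 
        gather(key = "store", value = "store_name", X4:X5, na.rm = TRUE) %>%
        # the key columns only record the source column names, so drop them
        select(-town, -store)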
      

      Does this work for you?
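
For reference, here is a sketch of the same reshaping with pivot_longer(), which newer tidyr releases (1.0.0 and later) recommend over gather(). It is not part of the original answer. The `row` helper column and the `town_col`/`store_col` names are assumptions added for this illustration, and `row` is only there so the original row order can be restored explicitly.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, built as character columns
d1 <- data.frame(
  X1 = c(rep("Northampton", 3), rep(NA, 7)),
  X2 = c(rep(NA, 3), rep("Amherst", 5), rep(NA, 2)),
  X3 = c(rep(NA, 8), rep("Hadley", 2)),
  X4 = c(rep("Stop and Shop", 2), rep(NA, 6), rep("Stop and Shop", 2)),
  X5 = c(rep(NA, 2), rep("Whole Foods", 6), rep(NA, 2)),
  stringsAsFactors = FALSE
)

res <- d1 %>%
  mutate(row = row_number()) %>%               # remember the original row order
  pivot_longer(X1:X3, names_to = "town_col", values_to = "Town",
               values_drop_na = TRUE) %>%      # stack town columns, drop NA cells
  pivot_longer(X4:X5, names_to = "store_col", values_to = "Store",
               values_drop_na = TRUE) %>%      # stack store columns, drop NA cells
  arrange(row) %>%                             # restore the original order
  select(Town, Store)
```

Like the gather() version, this relies on each row having exactly one non-NA value per column group; a row with two filled town columns would be duplicated.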

      【Comments】:

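        【Answer 3】:

        You can also do this in base R, using apply to work row-wise:

        # For each row, keep the single non-NA value within each group of columns.
        # This assumes exactly one non-NA value per row and group.
        d2 <- data.frame(X1 = apply(d1[,c("X1", "X2", "X3")], 1, function(x) x[!is.na(x)]),
                         X2 = apply(d1[,c("X4", "X5")], 1, function(x) x[!is.na(x)]),
                         stringsAsFactors = FALSE)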
        

        Result:

        > d2
                    X1            X2
        1  Northampton Stop and Shop
        2  Northampton Stop and Shop
        3  Northampton   Whole Foods
        4      Amherst   Whole Foods
        5      Amherst   Whole Foods
        6      Amherst   Whole Foods
        7      Amherst   Whole Foods
        8      Amherst   Whole Foods
        9       Hadley Stop and Shop
        10      Hadley Stop and Shop
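
If some rows might have no value at all in a group, or more than one, the anonymous function above would return vectors of different lengths and apply would no longer give back a simple vector. A slightly more defensive sketch follows; `first_non_na` is a hypothetical helper written for this illustration, not a base R function.

```r
# Same data as in the question, built as character columns
d1 <- data.frame(
  X1 = c(rep("Northampton", 3), rep(NA, 7)),
  X2 = c(rep(NA, 3), rep("Amherst", 5), rep(NA, 2)),
  X3 = c(rep(NA, 8), rep("Hadley", 2)),
  X4 = c(rep("Stop and Shop", 2), rep(NA, 6), rep("Stop and Shop", 2)),
  X5 = c(rep(NA, 2), rep("Whole Foods", 6), rep(NA, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: always returns exactly one value per row,
# the first non-NA entry, or NA if the whole row is empty
first_non_na <- function(x) {
  hit <- x[!is.na(x)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

d2 <- data.frame(
  Town  = apply(d1[, c("X1", "X2", "X3")], 1, first_non_na),
  Store = apply(d1[, c("X4", "X5")], 1, first_non_na),
  stringsAsFactors = FALSE
)
```

Because the helper always returns a length-one result, apply yields a plain character vector for each new column, whatever the pattern of NAs.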
        

        【Comments】:

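          【Answer 4】:

          Here is another base R approach, using pmax/pmin:

          # With one non-NA value per row, the parallel max with na.rm = TRUE is that value.
          # This needs character columns; see the data below (stringsAsFactors = FALSE).
          data.frame(lapply(list(Town = d1[1:3], Store = d1[4:5]), function(x) 
                     do.call(pmax, c(x, na.rm = TRUE))), stringsAsFactors=FALSE)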
          #          Town         Store
          #1  Northampton Stop and Shop
          #2  Northampton Stop and Shop
          #3  Northampton   Whole Foods
          #4      Amherst   Whole Foods
          #5      Amherst   Whole Foods
          #6      Amherst   Whole Foods
          #7      Amherst   Whole Foods
          #8      Amherst   Whole Foods
          #9       Hadley Stop and Shop
          #10      Hadley Stop and Shop
          

          Data

          d1 <- data.frame(X1 = c(rep("Northampton", times=3), rep(NA, times=7)),
                           X2 = c(rep(NA, times=3), rep("Amherst", times=5), rep(NA, times=2)),
                           X3 = c(rep(NA, times=8), rep("Hadley", times=2)),
                           X4 = c(rep("Stop and Shop", times=2), rep(NA, times=6), rep("Stop and Shop", times=2)),
                           X5 = c(rep(NA, times=2), rep("Whole Foods", times=6), rep(NA, times=2)),
                           stringsAsFactors=FALSE)
          

          【Comments】:
