【Question title】: How to expand a data table based on table information
【Posted】: 2018-07-03 14:38:31
【Question】:

I have the following data table:

RowID| Col1   | Col2 |
----------------------
1    | apple  | cow  |
2    | orange | dog  |
3    | apple  | cat  |
4    | cherry | fish |
5    | cherry | ant  |
6    | apple  | rat  |

I want to get to this table:

RowID| Col1   | Col2 | newCol
------------------------------
1    | apple  | cow  | cat
2    | apple  | cow  | rat   
3    | orange | dog  | na        
4    | apple  | cat  | cow
5    | apple  | cat  | rat   
6    | cherry | fish | ant       
7    | cherry | ant  | fish      
8    | apple  | rat  | cow
9    | apple  | rat  | cat   

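To help visualize the logic of the table above: it is essentially the same as the table below, except that the list column is split into rows, one per value. Rows are matched on the value in Col1. For example, rows 1, 3 and 6 of the original table all have "apple" in Col1, so the new "list" column for each of those rows holds the Col2 values of the other matching rows. Each list element is then expanded into its own row. The second table above is my desired result; the third table below is only there to help visualize where the values come from.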

RowID| Col1   | Col2 | newCol
------------------------------
1    | apple  | cow  | cat,rat   (Row 3 & 6 match col1 values)
2    | orange | dog  | na        (No rows match this col1 value)
3    | apple  | cat  | cow,rat   (Row 1 & 6 match col1 values)
4    | cherry | fish | ant       (Row 5 matches col1 values)
5    | cherry | ant  | fish      (Row 4 matches col1 values)
6    | apple  | rat  | cow,cat   (Row 1 & 3 match col1 values)
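
The collapsed view above can be reproduced with a short base R sketch, which may help when checking any solution against it. The object names `dat` and `others` are assumptions for illustration and are not part of the original post.

```r
# Input table from the question (the name `dat` is an assumption)
dat <- data.frame(
  RowID = 1:6,
  Col1  = c("apple", "orange", "apple", "cherry", "cherry", "apple"),
  Col2  = c("cow", "dog", "cat", "fish", "ant", "rat"),
  stringsAsFactors = FALSE
)

# For each row, collect the Col2 values of the *other* rows sharing its Col1
others <- lapply(seq_len(nrow(dat)), function(i) {
  same <- dat$Col1 == dat$Col1[i] & dat$RowID != dat$RowID[i]
  dat$Col2[same]
})

# Collapsed view: one comma-separated string per row, NA when nothing matches
dat$newCol <- vapply(
  others,
  function(x) if (length(x) > 0) paste(x, collapse = ",") else NA_character_,
  character(1)
)

print(dat)
```

Expanding each element of `others` into its own row then gives the desired nine-row table.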

    Tags: r


    【Solution 1】:

    Using the data.table package:

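    library(data.table)
    
    # `dat` is assumed to hold the question's table (columns RowID, Col1, Col2)
    
    # option 1: cross join Col2 with itself within each Col1 group,
    # then drop self-pairs unless the row has no other match
    setDT(dat)[, .SD[CJ(Col2 = Col2, newCol = Col2, unique = TRUE), on = .(Col2)]
               , by = Col1
               ][order(RowID), .SD[Col2 != newCol | .N == 1], by = RowID]
    
    # option 2: collapse Col2 per Col1 group, split it back into rows,
    # then apply the same filter (note: `:=` adds newCol to dat by reference)
    setDT(dat)[, newCol := paste0(Col2, collapse = ","), by = Col1
               ][, .(newCol = unlist(tstrsplit(newCol, ","))), by = .(RowID, Col1, Col2)
                 ][, .SD[Col2 != newCol | .N == 1], by = RowID]
    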
    

    which gives:

       RowID   Col1 Col2 newCol
    1:     1  apple  cow    cat
    2:     1  apple  cow    rat
    3:     2 orange  dog    dog
    4:     3  apple  cat    cow
    5:     3  apple  cat    rat
    6:     4 cherry fish    ant
    7:     5 cherry  ant   fish
    8:     6  apple  rat    cow
    9:     6  apple  rat    cat
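
In the output above, the unmatched row (RowID 2) keeps its own Col2 value, `dog`, whereas the question asks for `na` there. A minimal sketch of one way to post-process the result, assuming Col2 values are unique within each Col1 group (as in the sample data); the names `dat` and `res` are assumptions for illustration:

```r
library(data.table)

# Input table from the question (object names are assumptions)
dat <- data.table(
  RowID = 1:6,
  Col1  = c("apple", "orange", "apple", "cherry", "cherry", "apple"),
  Col2  = c("cow", "dog", "cat", "fish", "ant", "rat")
)

# Same pipeline as option 2, run on a copy so `dat` itself is not modified
res <- copy(dat)[, newCol := paste0(Col2, collapse = ","), by = Col1
  ][, .(newCol = unlist(tstrsplit(newCol, ","))), by = .(RowID, Col1, Col2)
  ][, .SD[Col2 != newCol | .N == 1], by = RowID]

# After the filter, a row can only pair with itself if it had no other match,
# so those rows are exactly the ones that should become NA
res[Col2 == newCol, newCol := NA_character_]

print(res)
```

If Col2 can repeat within a Col1 group, the equality test is no longer a reliable marker for self-pairs, and comparing row identifiers would be the safer choice.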
    

    The equivalent with dplyr and tidyr:

    library(dplyr)
    library(tidyr)
    
    dat %>% 
      group_by(Col1) %>% 
      mutate(newCol = paste0(Col2, collapse = ",")) %>% 
      separate_rows(newCol) %>% 
      group_by(RowID) %>% 
      filter(Col2 != newCol | n() == 1)
    

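      【Solution 2】:

      Self-join the table on the first column, then drop the rows where NewCol equals Col2. The tricky part is keeping the rows whose Col1 value occurs only once in the data.table.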

      library(data.table)
      library(magrittr)
      
      dt_foo = data.table(Col1 = c("apple", "orange","apple","cherry",
                            "cherry", "apple"),
                          Col2 = c("cow","dog","cat","fish",
                            "ant","rat"))
      
      # Col1 values that occur only once; needed later to set NewCol to NA
      single_occ = dt_foo[, .N, Col1] %>% 
        .[N == 1, Col1]
      
      # self-join on Col1, blank out NewCol for single occurrences,
      # then drop the pairs where a row was matched with itself
      dt_foo2 = dt_foo %>% 
        .[., on = "Col1", allow.cartesian = TRUE] %>% 
        setnames("i.Col2", "NewCol") %>% 
        .[Col1 %in% single_occ, NewCol := NA] %>% 
        .[Col2 != NewCol | is.na(NewCol)]
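
The self-join above works on a table without RowID, so the result does not carry the original row identifiers. A possible variant, sketched under the assumption that RowID is available as in the question, keeps RowID and identifies self-pairs by row identifier instead of by Col2 equality. The names `dat`, `res`, `matchID` and `n` are hypothetical and introduced only for this example.

```r
library(data.table)

# Input table from the question, including RowID (object names are assumptions)
dat <- data.table(
  RowID = 1:6,
  Col1  = c("apple", "orange", "apple", "cherry", "cherry", "apple"),
  Col2  = c("cow", "dog", "cat", "fish", "ant", "rat")
)

# Self-join on Col1: i.* columns come from the row being expanded,
# x.* columns come from the rows it matches
res <- dat[dat,
           .(RowID = i.RowID, Col1 = i.Col1, Col2 = i.Col2,
             newCol = x.Col2, matchID = x.RowID),
           on = .(Col1), allow.cartesian = TRUE]

# Number of join partners per original row (1 means only the row itself)
res[, n := .N, by = RowID]

# Drop self-pairs, except for rows that have no other partner
res <- res[RowID != matchID | n == 1]

# The remaining self-pairs are the unmatched rows: mark them as NA
res[RowID == matchID, newCol := NA_character_]

# Remove the helper columns
res[, c("matchID", "n") := NULL]

print(res)
```

Because self-pairs are detected through RowID, this variant should also behave sensibly when two rows of the same Col1 group share a Col2 value.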
      
