【Question title】: Efficient way of labelling based on start and end position
【Posted】: 2020-09-19 05:32:14
【Question】:

I have 2 data frames:

das <- data.frame(val=1:20,
              type =c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C"),
              weigh=c(20,22,23,32,34,54,19,22,24,26,31,34,36,37,51,54,31,35,43,45))

mapper <- data.frame(type=c("A","A","A","A","B","B","B","B","C","C","C","C"),start = c(19,23,27,37   ,17,25,39,50, 17,23,33,39),end = c(23,27,37,55  ,25,39,50,60, 23,33,39,48))

The expected output is:

val type weigh labelweight
1    1    A    20    A_19
2    2    A    22    A_19
3    3    A    23    A_23
4    4    A    32    A_27
5    5    A    34    A_27
6    6    A    54    A_37
7    7    B    19    B_17
8    8    B    22    B_17
9    9    B    24    B_17
10  10    B    26    B_25
11  11    B    31    B_25
12  12    B    34    B_25
13  13    B    36    B_25
14  14    B    37    B_25
15  15    B    51    B_50
16  16    B    54    B_50
17  17    C    31    C_23
18  18    C    35    C_33
19  19    C    43    C_39
20  20    C    45    C_39

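I can get the expected output using the following code:

library(dplyr)

p <- left_join(das, mapper, by = "type")
q <- p %>% filter(weigh >= start & weigh < end) %>% mutate(labelweight = paste0(type, "_", start))

On large datasets this code throws "Error: vector memory exhausted (limit reached?)", because the join pairs every row of das with every mapper row of the same type before anything is filtered out.

Is there a more efficient way to get the desired output without doing the join?

【Comments】: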

    Tags: r join dplyr


    【Solution 1】:

    The intervals appear to be contiguous. Here is a fast option using a rolling join in data.table:

    library(data.table)
    setDT(das)[, weight := 
        setDT(mapper)[.SD, on=.(type, start=weigh), roll=Inf, paste(type, x.start, sep="_")]
    ]
    

    If the intervals are not contiguous, you can use a non-equi join instead:

    setDT(das)[, weight := 
        setDT(mapper)[setDT(das), on=.(type, start<=weigh, end>weigh), paste(type, x.start, sep="_")]        
    ]
    

    Output:

        val type weigh weight
     1:   1    A    20   A_19
     2:   2    A    22   A_19
     3:   3    A    23   A_23
     4:   4    A    32   A_27
     5:   5    A    34   A_27
     6:   6    A    54   A_37
     7:   7    B    19   B_17
     8:   8    B    22   B_17
     9:   9    B    24   B_17
    10:  10    B    26   B_25
    11:  11    B    31   B_25
    12:  12    B    34   B_25
    13:  13    B    36   B_25
    14:  14    B    37   B_25
    15:  15    B    51   B_50
    16:  16    B    54   B_50
    17:  17    C    31   C_23
    18:  18    C    35   C_33
    19:  19    C    43   C_39
    20:  20    C    45   C_39
    

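    【Comments】:

      【Solution 2】:

      Perhaps you could use fuzzyjoin here. Note that the end bound below is matched with `<=`, so a weight that sits exactly on a boundary (23 for type A) falls into two intervals and appears twice in the output, giving 21 rows. Matching the end bound with `<` instead reproduces the half-open intervals of the expected output.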

      library(dplyr)  # provides %>%

      fuzzyjoin::fuzzy_left_join(das, mapper, 
                         by = c('type' = 'type', 'weigh' = 'start', 'weigh' = 'end'), 
                         match_fun = list(`==`, `>=`, `<=`)) %>%
           dplyr::transmute(type = type.x, val, weigh, 
                            labelweight = paste(type.y, start, sep = '_'))
      
      
      #   type val weigh labelweight
      #1     A   1    20        A_19
      #2     A   2    22        A_19
      #3     A   3    23        A_19
      #4     A   3    23        A_23
      #5     A   4    32        A_27
      #6     A   5    34        A_27
      #7     A   6    54        A_37
      #8     B   7    19        B_17
      #9     B   8    22        B_17
      #10    B   9    24        B_17
      #11    B  10    26        B_25
      #12    B  11    31        B_25
      #13    B  12    34        B_25
      #14    B  13    36        B_25
      #15    B  14    37        B_25
      #16    B  15    51        B_50
      #17    B  16    54        B_50
      #18    C  17    31        C_23
      #19    C  18    35        C_33
      #20    C  19    43        C_39
      #21    C  20    45        C_39
      

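      【Comments】:

        【Solution 3】:

        Using base R. Note that the upper bound here is inclusive (`<=`) and `min` picks the smaller of the matching starts, so weigh 23 is labelled A_19 rather than the expected A_23; changing `<=` to `<` in the function matches the half-open intervals.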

        encon <- function(x, y) {
          lower <- y[y[,1] == x[[2]], 2]
          upper <- y[y[,1] == x[[2]], 3]
          paste(as.character(x[[2]]), min(lower[x[[3]] >= lower & x[[3]] <= upper]), sep="_" )
        }
        
        for (i in seq(1, nrow(das))) das[i,"label"] <- encon(das[i,], mapper)
        > das
           val type weigh label
        1    1    A    20  A_19
        2    2    A    22  A_19
        3    3    A    23  A_19
        4    4    A    32  A_27
        5    5    A    34  A_27
        6    6    A    54  A_37
        7    7    B    19  B_17
        8    8    B    22  B_17
        9    9    B    24  B_17
        10  10    B    26  B_25
        11  11    B    31  B_25
        12  12    B    34  B_25
        13  13    B    36  B_25
        14  14    B    37  B_25
        15  15    B    51  B_50
        16  16    B    54  B_50
        17  17    C    31  C_23
        18  18    C    35  C_33
        19  19    C    43  C_39
        20  20    C    45  C_39
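
The loop above calls the function once per row of das, which can be slow on large data. As a rough alternative, here is a vectorised base R sketch built on `findInterval`. It is not part of the original answer: the helper name `label_by_interval` is made up for illustration, and the sketch assumes the intervals within each type do not overlap and are half-open, `[start, end)`, as in the expected output. Weights that fall outside every interval get `NA`.

```r
# Sample data from the question
das <- data.frame(val = 1:20,
                  type = rep(c("A", "B", "C"), times = c(6, 10, 4)),
                  weigh = c(20, 22, 23, 32, 34, 54, 19, 22, 24, 26,
                            31, 34, 36, 37, 51, 54, 31, 35, 43, 45))
mapper <- data.frame(type  = rep(c("A", "B", "C"), each = 4),
                     start = c(19, 23, 27, 37, 17, 25, 39, 50, 17, 23, 33, 39),
                     end   = c(23, 27, 37, 55, 25, 39, 50, 60, 23, 33, 39, 48))

# Hypothetical helper: loops over the (few) types, not over the rows.
# Assumes non-overlapping [start, end) intervals within each type.
label_by_interval <- function(das, mapper) {
  label <- rep(NA_character_, nrow(das))
  for (tp in unique(das$type)) {
    rows <- which(das$type == tp)
    m <- mapper[mapper$type == tp, ]
    m <- m[order(m$start), ]                  # findInterval needs sorted breaks
    idx <- findInterval(das$weigh[rows], m$start)
    idx[idx == 0] <- NA                       # below the first start: no interval
    ok <- !is.na(idx) & das$weigh[rows] < m$end[idx]  # enforce the open upper bound
    label[rows[ok]] <- paste(tp, m$start[idx[ok]], sep = "_")
  }
  label
}

das$labelweight <- label_by_interval(das, mapper)
```

No joined table is ever built, so memory use stays proportional to the size of das; each lookup is a binary search over the starts of one type.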
        
        

        【Comments】:
