【问题标题】:Search indexes in data.table R在 data.table R 中搜索索引
【发布时间】:2020-06-20 00:14:25
【问题描述】:

我有一个 data.table,proce,其中每一行都定义了一个“特殊程序”。现在,我有另一个包含患者程序的 data.table,codes。对于每个人,我想提取与他/她的程序匹配的“特殊程序”的索引(如果有的话)。这是一个例子:

library(data.table)
proce <- data.table(v1 = c('o09513','o721','o701','z370'), v2 = c('0w8nxzz','10d07z6','0tqd7zz','0uqg0zz'),
                         v3 = c('3e030vj','3e033vj',NA,NA))

codes <- data.table(a1 =  c(list(c('o721','10d07z6','3e033vj')),
                            list(c('z370','0uqg0zz',"0tqd7zz","o701")),
                            list(c('o09513','o721','o701','z370','0uqg8zz'))))
> proce
       v1      v2      v3
1: o09513 0w8nxzz 3e030vj
2:   o721 10d07z6 3e033vj
3:   o701 0tqd7zz    <NA>
4:   z370 0uqg0zz    <NA>

> codes
                              a1
1:          o721,10d07z6,3e033vj
2:     z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz

在这里实现,但是由于两个表都有几十万行,所以速度很慢。


index_procedures <- list()     
for(i in 1:nrow(codes)){ # i <- 2
  a2 <- unlist(codes[i,a1])
  index_procedures[[i]] <- which(apply(proce[,.(v1,v2,v3)], 1,function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2

[[2]]
[1] 3 4

[[3]]
integer(0)

【问题讨论】:

    标签: r list data.table match


    【解决方案1】:

    如果我理解正确,

    • codes 包含已应用于患者的程序步骤。 codes 中的一行表示一位患者。
    • proce 包含构成特殊程序的程序步骤。

    OP 希望确定已对每位患者应用了哪些特殊程序(如果有)。因此,只有当所有程序步骤都已应用时,才认为已对患者应用了特殊程序。

    为了解决这个问题,我建议以整齐的格式重塑所有数据,即首先采用长格式。

    然后我们可以加入程序步骤,过滤完整个特殊程序,然后汇总为每位患者获得一个:

    lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
    lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
      , n_steps := .N, by = pid][]
    lp[lc, on = .(step)][
      , .N == first(n_steps), by = .(cid, pid)][
        (V1), .(pid = toString(sort(pid))), by = cid]
    
       cid  pid
    1:   1    2
    2:   2 3, 4
    

    请注意,pids 以压缩形式显示,仅用于演示;其他输出格式也可用,具体取决于后续处理步骤。

    如果需要显示所有名患者,即使他们没有接受特殊程序:

    lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
      V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
    
       cid  pid
    1:   1    2
    2:   2 3, 4
    3:   3
    

    注释代码

    # reshape data to long format, thereby adding a row number to identify patients
    lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
    # reshape data to long format, thereby adding a row number to identify special procdures
    lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
      # count the number of procedure steps which constitute a special procedure
      , n_steps := .N, by = pid][]
    # join on procedure steps
    lp[lc, on = .(step)][
      # group  by patient and special procedure and test for completeness of steps 
      , .N == first(n_steps), by = .(cid, pid)][
        # filter for complete special procedures and aggregate to get one row per patient
        (V1), .(pid = toString(sort(pid))), by = cid]
    

    整形后lc

        cid    step
     1:   1    o721
     2:   1 10d07z6
     3:   1 3e033vj
     4:   2    z370
     5:   2 0uqg0zz
     6:   2 0tqd7zz
     7:   2    o701
     8:   3  o09513
     9:   3    o721
    10:   3    o701
    11:   3    z370
    12:   3 0uqg8zz
    

    lp

        pid variable    step n_steps
     1:   1       v1  o09513       3
     2:   2       v1    o721       3
     3:   3       v1    o701       2
     4:   4       v1    z370       2
     5:   1       v2 0w8nxzz       3
     6:   2       v2 10d07z6       3
     7:   3       v2 0tqd7zz       2
     8:   4       v2 0uqg0zz       2
     9:   1       v3 3e030vj       3
    10:   2       v3 3e033vj       3
    

    【讨论】:

      【解决方案2】:

      我不确定性能,但以下代码可能是替代方案:

      pl <- split(as.matrix(proce), seq_len(nrow(proce)))
      pl <- lapply(pl, na.omit)
      
      codes[, indexes := lapply(a1, function(x) which(unlist(lapply(pl, function(p) all(p %in% x)))) )]
      

      【讨论】:

        猜你喜欢
        • 2016-02-21
        • 1970-01-01
        • 1970-01-01
        • 2016-03-14
        • 1970-01-01
        • 2021-07-23
        • 2012-04-16
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多