【Question title】: Reduce row to unique items
【Posted】: 2012-09-07 04:51:30
【Question】:

I have a data frame:

test <- structure(list(
     y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
     y2003 = c("freshman","junior","junior","sophomore","sophomore","senior"),
     y2004 = c("junior","sophomore","sophomore","senior","senior",NA),
     y2005 = c("senior","senior","senior",NA, NA, NA)), 
              .Names = c("2002","2003","2004","2005"),
              row.names = c(c(1:6)),
              class = "data.frame")
> test
       2002      2003      2004   2005
1  freshman  freshman    junior senior
2  freshman    junior sophomore senior
3  freshman    junior sophomore senior
4 sophomore sophomore    senior   <NA>
5 sophomore sophomore    senior   <NA>
6    senior    senior      <NA>   <NA>

I want to reshape the data so that each row keeps only its distinct steps, like this:

result <- structure(list(
 y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
 y2003 = c("junior","junior","junior","senior","senior",NA),
 y2004 = c("senior","sophomore","sophomore",NA,NA,NA),
 y2005 = c(NA,"senior","senior",NA, NA, NA)), 
               .Names = c("1","2","3","4"),
               row.names = c(c(1:6)),
               class = "data.frame")

> result
          1      2         3      4
1  freshman junior    senior   <NA>
2  freshman junior sophomore senior
3  freshman junior sophomore senior
4 sophomore senior      <NA>   <NA>
5 sophomore senior      <NA>   <NA>
6    senior   <NA>      <NA>   <NA>

I know that if I treat each row as a vector, I can do something like this:

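careerrow <- c(1,2,3,3,4)
# iterate over positions (not values) and pair each element with its predecessor
pairz <- lapply(seq_along(careerrow),function(i){c(careerrow[i],c(NA,careerrow)[i])})
# keep the first element and every element that differs from the one before it
uniquepairz <- careerrow[sapply(pairz,function(x){is.na(x[2]) || x[1]!=x[2]})]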
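
The same idea, dropping a value when it repeats the one just before it, can be sketched without building pairs by using base R's `rle()`. The helper name `drop_consecutive` is hypothetical and not part of the original question. Note that this removes only consecutive repeats, whereas `unique()` removes every repeat; for the data in this question the two should give the same steps.

```r
# Hypothetical helper: collapse runs of identical consecutive values.
# rle() is base R and returns the run lengths and the run values.
drop_consecutive <- function(x) rle(x)$values

careerrow <- c(1, 2, 3, 3, 4)
steps <- drop_consecutive(careerrow)
print(steps)

# Unlike unique(), a value that comes back later is kept:
print(drop_consecutive(c("a", "b", "a")))
print(unique(c("a", "b", "a")))
```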

My difficulty is applying this to every row of my data table. I think lapply is the way to go, but so far I have not been able to work it out.

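【Comments】:

  • Do you need the result to be a valid data.frame padded with NA values, or would a list associated with each ID be enough?
  • I want to count identical rows, so I think having it as a valid data.frame would be a good thing. Or would a list of lists make that kind of counting convenient?

Tags: r dataframe data.table lapply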


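【Solution 1】:

lapply, when passed a data.frame, operates on its columns. That is because a data.frame is a list whose elements are the columns. Instead of lapply, you can use apply with MARGIN=1, which operates on rows: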

unique.padded <- function(x) {
   uniq <- unique(x)
   # pad with NA so every row keeps its original length
   c(uniq, rep(NA, length(x) - length(uniq)))
}

t(apply(test, 1, unique.padded))

#   [,1]        [,2]     [,3]        [,4]    
# 1 "freshman"  "junior" "senior"    NA      
# 2 "freshman"  "junior" "sophomore" "senior"
# 3 "freshman"  "junior" "sophomore" "senior"
# 4 "sophomore" "senior" NA          NA      
# 5 "sophomore" "senior" NA          NA      
# 6 "senior"    NA       NA          NA
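
The result above is a character matrix, while the question asks for a data.frame. A minimal sketch of the conversion follows, assuming the same data as in the question; the padding function is redefined under the hypothetical name `unique_padded` only so that the block runs on its own.

```r
# Same data as in the question, built with data.frame() for brevity.
test <- data.frame(
  `2002` = c("freshman", "freshman", "freshman", "sophomore", "sophomore", "senior"),
  `2003` = c("freshman", "junior", "junior", "sophomore", "sophomore", "senior"),
  `2004` = c("junior", "sophomore", "sophomore", "senior", "senior", NA),
  `2005` = c("senior", "senior", "senior", NA, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical copy of the padding function from the answer.
unique_padded <- function(x) {
  uniq <- unique(x)
  c(uniq, rep(NA, length(x) - length(uniq)))
}

# apply() returns one column per input row, hence the transpose.
res <- as.data.frame(t(apply(test, 1, unique_padded)), stringsAsFactors = FALSE)
names(res) <- seq_along(res)   # name the columns "1".."4" as in the desired result
print(res)
```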

Edit: I saw your comment about the end goal. I would do it like this:

table(sapply(apply(test, 1, function(x)unique(na.omit(x))),
             paste, collapse = "_"))

#           freshman_junior_senior freshman_junior_sophomore_senior 
#                                1                                2 
#                           senior                 sophomore_senior 
#                                1                                2 
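
One caveat with the nested call: `apply()` returns a list only when the rows yield results of different lengths. If every row happened to have the same number of distinct steps, it would return a matrix, and `sapply(..., paste)` would then work cell by cell. A variant that should avoid this, sketched here under the assumption that the data are as in the question, pastes inside the row function so each row always yields one string. The variable name `paths` is hypothetical.

```r
# Same data as in the question, built with data.frame() for brevity.
test <- data.frame(
  `2002` = c("freshman", "freshman", "freshman", "sophomore", "sophomore", "senior"),
  `2003` = c("freshman", "junior", "junior", "sophomore", "sophomore", "senior"),
  `2004` = c("junior", "sophomore", "sophomore", "senior", "senior", NA),
  `2005` = c("senior", "senior", "senior", NA, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# One string per row: drop NAs, keep distinct steps in order, join with "_".
paths <- apply(test, 1, function(x) paste(unique(na.omit(x)), collapse = "_"))

# Count how many rows share each path.
counts <- table(paths)
print(counts)
```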

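【Comments】:

    【Solution 2】:

    If your goal is to count the total number of rows for each path,

    you can use something like the following (using data.table, because it handles lists as elements of a data.table (data.frame-like) object nicely).

    I am using !duplicated(...) to remove duplicates because it is more efficient than unique.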

    library(data.table)
    library(reshape2)
    # make the rownames a column 
    test$id <- rownames(test)
    # put in long format
    DT <- as.data.table(melt(test,id='id'))
    # get the unique steps and concatenate into a unique identifier for each pathway
    DL <- DT[!is.na(value), {.steps <- value[!duplicated(value)]
      stepid <- paste(.steps, sep ='.',collapse = '.')
      list(steps = list(.steps), stepid =stepid)}, by=id]
    ##    id                            steps                           stepid
    ## 1:  1           freshman,junior,senior           freshman.junior.senior
    ## 2:  2 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
    ## 3:  3 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
    ## 4:  4                 sophomore,senior                 sophomore.senior
    ## 5:  5                 sophomore,senior                 sophomore.senior
    ## 6:  6                           senior                           senior
    
    # count the number per path
    
    DL[, .N, by = stepid]
    ##                              stepid N
    ## 1:           freshman.junior.senior 1
    ## 2: freshman.junior.sophomore.senior 2
    ## 3:                 sophomore.senior 2
    ## 4:                           senior 1
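
If only the counts are needed and the list column is not, the same steps can be sketched with data.table alone. This assumes a data.table version that ships its own `melt()` method for data.tables with an `na.rm` argument; the names `long`, `paths` and `counts` are hypothetical.

```r
library(data.table)

# Same data as in the question, built with data.frame() for brevity.
test <- data.frame(
  `2002` = c("freshman", "freshman", "freshman", "sophomore", "sophomore", "senior"),
  `2003` = c("freshman", "junior", "junior", "sophomore", "sophomore", "senior"),
  `2004` = c("junior", "sophomore", "sophomore", "senior", "senior", NA),
  `2005` = c("senior", "senior", "senior", NA, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Make the row names a column, then reshape to long format, dropping NAs.
test$id <- rownames(test)
long <- melt(as.data.table(test), id.vars = "id", na.rm = TRUE)

# One identifier per id, built from the distinct steps in year order.
paths <- long[, .(stepid = paste(unique(value), collapse = ".")), by = id]

# Count the rows per path.
counts <- paths[, .N, by = stepid]
print(counts)
```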
    

    【Comments】:

    • +1 A nice (and, I think, the first) example of list column output (the steps column), with the neat comma-separated printing (new in 1.8.2).