【Question Title】: How can I find the first and last observation within each group in R?
【Posted】: 2015-01-18 03:19:20
【Question】:

Hi, my data set is as follows:

dialled     Ringing     state   duration
NA  NA  NA  0
NA  NA  NA  0
NA  NA  NA  0
NA  NA  NA  0
123 NA  NA  0
123 NA  NA  0
123 NA  NA  0
123 NA  NA  60
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
NA  NA  inactive    0
NA  145 inactive    0
NA  145 inactive    0
NA  145 inactive    56
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
222 NA  inactive    0
222 NA  inactive    0
222 NA  inactive    37
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
123 NA  inactive    0
123 NA  inactive    0
123 NA  active  60
NA  NA  active  0

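I want to get the first and last observation for each dialled number. A number that appears again later counts as a separate group, because each call is different. The answer I am looking for is: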

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
222 NA  inactive    0
222 NA  inactive    37
123 NA  NA  0
123 NA  NA  60   

I used the following:

library(plyr)
ddply(DF, .(Dialled_nbr), function(x) x[c(1,nrow(x)), ])

which gave me

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
222 NA  inactive    0
222 NA  inactive    37

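But this is not the result I want: the two separate runs of 123 are merged into one group. Please help.

The new data is:

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  0
123 NA  NA  60
123 NA  NA  0
123 NA  NA  0
123 NA  NA  70
222 NA  inactive    0
222 NA  inactive    0
222 NA  inactive    37
123 NA  inactive    0
123 NA  inactive    0
123 NA  active  60

The answer is:

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
123 NA  NA  0
123 NA  NA  70
222 NA  inactive    0
222 NA  inactive    37
123 NA  inactive    0
123 NA  active  60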

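【Comments】:

  • Wait, what?? An hour after you posted the question and got two good answers, you completely changed everything.
  • @akrun Yes, it should be.
  • @akrun I am unable to use data.table_1.9.5.
  • @akrun Which versions of R and RStudio does this need?
  • @akrun Thanks..... How can I extract the first row for each group.... i.e.
    dialled Ringing state duration #5 123 NA <NA> 0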

Tags: r


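【Solution 1】:

Here is an option with data.table_1.9.5. Use setDT to convert the "data.frame" to a "data.table", drop the rows where "dialled" is NA (!is.na(dialled)), generate a grouping variable by applying rleid to "dialled", get the row indices of the first and last rows of each group (.I[c(1L, .N)]), and finally subset "dt1" by those row indices.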

library(data.table)
dt1 <- setDT(df)[!is.na(dialled)]
dt1[dt1[,.I[c(1L, .N)],rleid(dialled)]$V1]
#    dialled Ringing    state duration
#1:     123      NA       NA        0
#2:     123      NA       NA       60
#3:     222      NA inactive        0
#4:     222      NA inactive       37
#5:     123      NA inactive        0
#6:     123      NA   active       60

Or using base R:

df1 <- df[!is.na(df$dialled),]
grp<-  inverse.rle(within.list(rle(df1$dialled), 
                    values <- seq_along(values)))

df1[!duplicated(grp)|!duplicated(grp,fromLast=TRUE),]
#    dialled Ringing    state duration
#5      123      NA     <NA>        0
#8      123      NA     <NA>       60
#19     222      NA inactive        0
#21     222      NA inactive       37
#25     123      NA inactive        0
#27     123      NA   active       60
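
To make the grouping step easier to follow, here is a minimal sketch on a toy vector. The vector `x` is made up for illustration and is not taken from the question; the sketch uses `rep()` on the run lengths, which should give the same run ids as the `inverse.rle()` call above.

```r
# Toy input (hypothetical): the value 123 appears in two separate runs
x <- c(123L, 123L, 123L, 222L, 222L, 123L, 123L)

# Number each run of identical consecutive values: 1, 2, 3, ...
r <- rle(x)
grp <- rep(seq_along(r$lengths), r$lengths)
grp
# [1] 1 1 1 2 2 3 3

# TRUE where an element is the first or the last of its run
keep <- !duplicated(grp) | !duplicated(grp, fromLast = TRUE)
which(keep)
# [1] 1 3 4 5 6 7
```

Because the ids come from runs rather than from the values themselves, the second run of 123 gets its own id (3) instead of being merged with the first.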

Update

Based on the new data set:

grp <- cumsum(c(TRUE,df$duration[-nrow(df)]!=0))
df[!duplicated(grp)|!duplicated(grp,fromLast=TRUE),]
#   dialled Ringing    state duration
#1      123      NA     <NA>        0
#3      123      NA     <NA>       60
#4      123      NA     <NA>        0
#6      123      NA     <NA>       70
#7      222      NA inactive        0
#9      222      NA inactive       37
#10     123      NA inactive        0
#12     123      NA   active       60
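
The grouping in the update appears to rest on one assumption: a row with non-zero duration ends a call, so the row after it starts a new group. The sketch below applies the same expression to the duration column of the new data alone, to show the group ids it produces.

```r
# duration column of the new data set
duration <- c(0L, 0L, 60L, 0L, 0L, 70L, 0L, 0L, 37L, 0L, 0L, 60L)

# A row starts a new group if it is the first row,
# or if the previous row had a non-zero duration
starts <- c(TRUE, duration[-length(duration)] != 0)
grp <- cumsum(starts)
grp
# [1] 1 1 1 2 2 2 3 3 3 4 4 4
```

This is why the six consecutive 123 rows split into two groups here, which a run-length grouping on `dialled` alone would not do. If a call could end with a duration of 0, this rule would not separate it from the next call.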

Data

 df <- structure(list(dialled = c(NA, NA, NA, NA, 123L, 123L, 123L, 
 123L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 222L, 222L, 222L, 
 NA, NA, NA, 123L, 123L, 123L, NA), Ringing = c(NA, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, 145L, 145L, 145L, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, "active", "active", "inactive", "inactive", 
 "inactive", "inactive", "inactive", "active", "active", "inactive", 
 "inactive", "inactive", "inactive", "active", "active", "inactive", 
 "inactive", "inactive", "active", "active"), duration = c(0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 60L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 0L, 
 0L, 0L, 0L, 0L, 37L, 0L, 0L, 0L, 0L, 0L, 60L, 0L)), .Names = 
 c("dialled", "Ringing", "state", "duration"), class = "data.frame", 
 row.names = c(NA, -28L))

New data

 df <- structure(list(dialled = c(123L, 123L, 123L, 123L, 123L, 123L, 
 222L, 222L, 222L, 123L, 123L, 123L), Ringing = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, NA, 
 NA, NA, "inactive", "inactive", "inactive", "inactive", "inactive", 
 "active"), duration = c(0L, 0L, 60L, 0L, 0L, 70L, 0L, 0L, 37L, 
 0L, 0L, 60L)), .Names = c("dialled", "Ringing", "state", "duration"
 ), class = "data.frame", row.names = c(NA, -12L))

【Comments】:

    【Solution 2】:

    Here are two options. First, we need to set up a few things that are used in both options.

    ## remove rows where 'dialled' is NA 
    ndf <- DF[!is.na(DF$dialled),]
    ## run-length encoding on the 'dialled' column in 'ndf'
    le <- rle(ndf$dialled)$lengths
    

    Option 1: Create an integer vector of row numbers to use for subsetting.

    ndf[cumsum(mapply(c, 1L, le-1L)), ]
    #    dialled Ringing    state duration
    # 5      123      NA     <NA>        0
    # 8      123      NA     <NA>       60
    # 19     222      NA inactive        0
    # 21     222      NA inactive       37
    # 25     123      NA inactive        0
    # 27     123      NA   active       60
    

    If you would rather not loop, you can replace the mapply call with vec, defined as

    vec <- replace(integer(2*length(le))+1L, c(FALSE, TRUE), le-1L)
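
As a quick check that the two forms agree, here is a small sketch. It hard-codes the run lengths 4, 3, 3, which are what `rle()` should return for the non-NA `dialled` values in the original data; the names `idx_loop` and `idx_vec` are introduced only for this illustration.

```r
# Run lengths of the non-NA 'dialled' values: 123 x 4, 222 x 3, 123 x 3
le <- c(4L, 3L, 3L)

# Loop version: one (1, length - 1) pair per run, flattened column by column
idx_loop <- cumsum(mapply(c, 1L, le - 1L))

# Vectorised version: start from all ones, overwrite every second slot
vec <- replace(integer(2 * length(le)) + 1L, c(FALSE, TRUE), le - 1L)
idx_vec <- cumsum(vec)

idx_vec
# [1]  1  4  5  7  8 10
```

These are positions within `ndf`, not row names of the original data frame, so `ndf[cumsum(vec), ]` selects the rows labelled 5, 8, 19, 21, 25 and 27.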
    

    Option 2: Add a helper id column. Then use dplyr functions to get the first and last rows of each group defined by the new id column.

    library(dplyr)    
    ## updated data with new column
    DF2 <- cbind(id = rep.int(seq_along(le), le), ndf)    
    ## group by id and filter on the first and last rows
    slice(group_by(DF2, id), c(1, n()))
    #   id dialled Ringing    state duration
    # 1  1     123      NA       NA        0
    # 2  1     123      NA       NA       60
    # 3  2     222      NA inactive        0
    # 4  2     222      NA inactive       37
    # 5  3     123      NA inactive        0
    # 6  3     123      NA   active       60
    

    You can drop the helper column if you like, but it may also come in handy later.

    【Comments】:

    • Here is another option using dplyr: ndf %>% group_by(id=cumsum(Dialled_nbr!=lag(Dialled_nbr, default=TRUE))) %>% slice(c(1L, n()))