
【Question title】: How can I find the first and last observation within each group in R?
【Posted】: 2015-01-18 03:19:20
【Question description】:

Hello, my dataset is as follows:

dialled     Ringing     state   duration
NA  NA  NA  0
NA  NA  NA  0
NA  NA  NA  0
NA  NA  NA  0
123 NA  NA  0
123 NA  NA  0
123 NA  NA  0
123 NA  NA  60
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
NA  NA  inactive    0
NA  145 inactive    0
NA  145 inactive    0
NA  145 inactive    56
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
222 NA  inactive    0
222 NA  inactive    0
222 NA  inactive    37
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
123 NA  inactive    0
123 NA  inactive    0
123 NA  active  60
NA  NA  active  0

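I want to get the first and last observation for each dialled number. A number that appears again later counts as a separate group, because each call is a different call. The answer I am looking for is: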

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
222 NA  inactive    0
222 NA  inactive    37
123 NA  NA  0
123 NA  NA  60

I used the following:

library(plyr)
ddply(DF, .(Dialled_nbr), function(x) x[c(1,nrow(x)), ])

which gave me

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
222 NA  inactive    0
222 NA  inactive    37

But this answer is not correct: the second call to 123 is missing. Please help.

The new data is:

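dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  0
123 NA  NA  60
123 NA  NA  0
123 NA  NA  0
123 NA  NA  70
222 NA  inactive    0
222 NA  inactive    0
222 NA  inactive    37
123 NA  inactive    0
123 NA  inactive    0
123 NA  active  60

The answer is

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
123 NA  NA  0
123 NA  NA  70
222 NA  inactive    0
222 NA  inactive    37
123 NA  inactive    0
123 NA  active  60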

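【Comments】:

Wait, what?? An hour after you posted the question and got two good answers, you completely changed everything.
@akrun Yes, it should be.
@akrun I am not able to use data.table_1.9.5.
@akrun Which versions of R and RStudio does this require?
@akrun Thanks..... How can I extract only the first row of each group, i.e.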
```
#    dialled Ringing    state duration
#5      123      NA     <NA>        0
```

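Tags: r

【Solution 1】:

Here is an option with data.table_1.9.5. Use setDT to create a 'data.table' from the 'data.frame', drop the rows where the 'dialled' column is NA (!is.na(dialled)), generate a grouping variable by applying rleid to 'dialled', get the row indices of the first and last row of each group (.I[c(1L, .N)]), and finally subset 'dt1' by those row indices.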

library(data.table)
dt1 <- setDT(df)[!is.na(dialled)]
dt1[dt1[,.I[c(1L, .N)],rleid(dialled)]$V1]
#    dialled Ringing    state duration
#1:     123      NA       NA        0
#2:     123      NA       NA       60
#3:     222      NA inactive        0
#4:     222      NA inactive       37
#5:     123      NA inactive        0
#6:     123      NA   active       60
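
For contrast, here is a minimal base R sketch of why grouping by the dialled value alone merges the two separate calls to 123, while grouping by run keeps them apart. The vector `dialled` below holds made-up toy values, and the names `by_value`, `run`, and `by_run` are chosen only for this illustration.

```r
# Toy values (hypothetical, not the question's data): 123 appears in two separate runs
dialled <- c(123L, 123L, 123L, 222L, 222L, 123L, 123L)

# Grouping by value: both runs of 123 end up in the same group
by_value <- split(seq_along(dialled), dialled)
length(by_value)
# [1] 2

# Grouping by run: a new group starts whenever the value changes
run <- cumsum(c(TRUE, dialled[-1] != dialled[-length(dialled)]))
by_run <- split(seq_along(dialled), run)
length(by_run)
# [1] 3
```

A run-based id such as the one rleid produces is what lets the first and last rows be picked per call rather than per number.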

Or using base R:

df1 <- df[!is.na(df$dialled),]
grp<-  inverse.rle(within.list(rle(df1$dialled), 
                    values <- seq_along(values)))

df1[!duplicated(grp)|!duplicated(grp,fromLast=TRUE),]
#    dialled Ringing    state duration
#5      123      NA     <NA>        0
#8      123      NA     <NA>       60
#19     222      NA inactive        0
#21     222      NA inactive       37
#25     123      NA inactive        0
#27     123      NA   active       60
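
To make the grouping step easier to follow, here is a small sketch of what `grp` looks like and how the two `duplicated` calls pick rows. The vector `x` is a made-up toy input, not the question's data. Using `!duplicated(grp)` on its own would keep only the first row of each run.

```r
# Toy vector (hypothetical values) with three runs: 123, 222, 123
x <- c(123L, 123L, 123L, 222L, 222L, 123L, 123L)

# Run-length encode, then replace each run's value with its position
r <- rle(x)
r$values <- seq_along(r$values)
grp <- inverse.rle(r)
grp
# [1] 1 1 1 2 2 3 3

first <- !duplicated(grp)                   # TRUE at the first element of each run
last  <- !duplicated(grp, fromLast = TRUE)  # TRUE at the last element of each run

which(first | last)
# [1] 1 3 4 5 6 7
which(first)
# [1] 1 4 6
```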

Update

Based on the new dataset,

grp <- cumsum(c(TRUE,df$duration[-nrow(df)]!=0))
df[!duplicated(grp)|!duplicated(grp,fromLast=TRUE),]
#   dialled Ringing    state duration
#1      123      NA     <NA>        0
#3      123      NA     <NA>       60
#4      123      NA     <NA>        0
#6      123      NA     <NA>       70
#7      222      NA inactive        0
#9      222      NA inactive       37
#10     123      NA inactive        0
#12     123      NA   active       60
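
The grouping used in this update can be traced step by step. The sketch below uses only the duration column of the new data and assumes, as the data suggests, that each call ends on exactly one row with a non-zero duration. The names `starts` and `keep` are introduced here for illustration.

```r
# duration column of the new data set from the question
duration <- c(0L, 0L, 60L, 0L, 0L, 70L, 0L, 0L, 37L, 0L, 0L, 60L)

# Assumption: a call ends on a non-zero duration, so the next row starts a new group
starts <- c(TRUE, duration[-length(duration)] != 0)
grp <- cumsum(starts)
grp
# [1] 1 1 1 2 2 2 3 3 3 4 4 4

# Keep the first and the last row of every group
keep <- !duplicated(grp) | !duplicated(grp, fromLast = TRUE)
which(keep)
# [1]  1  3  4  6  7  9 10 12
```

If a call could contain several non-zero durations, this grouping would split it, so a different rule would be needed.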

Data

 df <- structure(list(dialled = c(NA, NA, NA, NA, 123L, 123L, 123L, 
 123L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 222L, 222L, 222L, 
 NA, NA, NA, 123L, 123L, 123L, NA), Ringing = c(NA, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, 145L, 145L, 145L, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, "active", "active", "inactive", "inactive", 
 "inactive", "inactive", "inactive", "active", "active", "inactive", 
 "inactive", "inactive", "inactive", "active", "active", "inactive", 
 "inactive", "inactive", "active", "active"), duration = c(0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 60L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 0L, 
 0L, 0L, 0L, 0L, 37L, 0L, 0L, 0L, 0L, 0L, 60L, 0L)), .Names = 
 c("dialled", "Ringing", "state", "duration"), class = "data.frame", 
 row.names = c(NA, -28L))

New data

 df <- structure(list(dialled = c(123L, 123L, 123L, 123L, 123L, 123L, 
 222L, 222L, 222L, 123L, 123L, 123L), Ringing = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, NA, 
 NA, NA, "inactive", "inactive", "inactive", "inactive", "inactive", 
 "active"), duration = c(0L, 0L, 60L, 0L, 0L, 70L, 0L, 0L, 37L, 
 0L, 0L, 60L)), .Names = c("dialled", "Ringing", "state", "duration"
 ), class = "data.frame", row.names = c(NA, -12L))

【Comments】:

【Solution 2】:

Here are two options. First, we need to set up a couple of objects that are used in both of them.

## remove rows where 'dialled' is NA 
ndf <- DF[!is.na(DF$dialled),]
## run-length encoding on the 'dialled' column in 'ndf'
le <- rle(ndf$dialled)$lengths

Option 1: Create an integer vector of row positions to use for subsetting.

ndf[cumsum(mapply(c, 1L, le-1L)), ]
#    dialled Ringing    state duration
# 5      123      NA     <NA>        0
# 8      123      NA     <NA>       60
# 19     222      NA inactive        0
# 21     222      NA inactive       37
# 25     123      NA inactive        0
# 27     123      NA   active       60

If you would rather not loop, you can replace the mapply call with vec, defined as

vec <- replace(integer(2*length(le))+1L, c(FALSE, TRUE), le-1L)
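
As a quick check that the two forms agree, here is a sketch using the run lengths that the question's data produces after the NA rows are dropped (4, 3 and 3). The names `idx_loop` and `idx_vec` are introduced only for this comparison.

```r
# Run lengths of 'dialled' in the question's data once NA rows are removed
le <- c(4L, 3L, 3L)

# Looping version: for each run, a step of 1 followed by a step of (length - 1)
idx_loop <- cumsum(mapply(c, 1L, le - 1L))

# Vectorised version: start from all 1s, then overwrite the even positions
vec <- replace(integer(2 * length(le)) + 1L, c(FALSE, TRUE), le - 1L)
idx_vec <- cumsum(vec)

idx_vec
# [1]  1  4  5  7  8 10

# Both versions should give the same positions
all.equal(idx_loop, idx_vec)
```

The subset is then `ndf[cumsum(vec), ]`. Note that a run of length 1 contributes a step of 0, so that row would be returned twice by either version.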

Option 2: Add a helper id column, then use dplyr functions to get the first and last rows of each group defined by the new id column.

library(dplyr)    
## updated data with new column
DF2 <- cbind(id = rep.int(seq_along(le), le), ndf)    
## group by id and filter on the first and last rows
slice(group_by(DF2, id), c(1, n()))
#   id dialled Ringing    state duration
# 1  1     123      NA       NA        0
# 2  1     123      NA       NA       60
# 3  2     222      NA inactive        0
# 4  2     222      NA inactive       37
# 5  3     123      NA inactive        0
# 6  3     123      NA   active       60

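You can remove the helper column if you like, but it may also come in handy later.

【Comments】:

Here is another option using dplyr: ndf %>% group_by(id=cumsum(Dialled_nbr!=lag(Dialled_nbr, default=TRUE))) %>% slice(c(1L, n())) (Dialled_nbr is the column name from the asker's own code; in ndf as built above the column is called dialled.)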