Extracting data based on observation values: answers

【Question title】: Extracting data based on observation values
【Posted】: 2014-09-26 12:06:38
【Question description】:

I have a data frame that looks like this:

    deltnr us stone_ny stone_mobility
      1535 63 no_stone             NA 1994-09-21   male 60
      1536 61 no_stone             NA 1983-09-06   male 60
      1536 62 no_stone             NA 1988-08-18   male 60
      1536 63    stone         mobile 1994-03-04   male 70
       154 61 no_stone             NA 1983-06-22   male 40
       154 62 no_stone             NA 1988-06-08   male 45
      1543 61 no_stone             NA 1983-08-17 female 30
      1543 62 no_stone             NA 1988-08-17 female 35
      1336 61 no_stone             NA 1983-08-22   male 60
      1336 62    stone         mobile 1988-11-04   male 65

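I want to extract one unique observation per deltnr that has "stone" in the stone_ny variable. My problem is that each deltnr has multiple observations. I have tried unique() and subset() without luck.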

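【Comments】:

A dput of your data frame would probably make these columns easier to read.
Possible duplicate of How to subset my data with eliminating repeated observations
@Daniel Suppose a deltnr group has multiple "stone" values in stone_ny and the other columns differ. Which observation/row would you select in that case? unique(dat[dat$stone_ny=="stone",]) would return multiple stone rows for each deltnr group.
I would like to select the first and only one "us". That is exactly where I go wrong. Do you know a way out?
@Daniel Does that mean that if you have two us values with 65 and 70, both with stone and 1336, you would select the first one or both observations? I guess you mean the first observation when the us values are the same?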

Tags: r unique subset

【Solution 1】:

You can use dplyr:

    library(dplyr)
    dat %>% 
        group_by(deltnr) %>%
        filter(stone_ny=="stone") %>% #assuming that there are no trailing or leading spaces
        do(head(.,1))

which gives the output

     # deltnr us stone_ny stone_mobility       date  sex val
    #1   1336 62    stone         mobile 1988-11-04 male  65
    #2   1536 63    stone         mobile 1994-03-04 male  70

Using data.table

    library(data.table)
    unique(setDT(dat)[stone_ny == "stone"], by="deltnr") #updated after @Arun's comments
    #   deltnr us stone_ny stone_mobility       date  sex val
    #1:   1536 63    stone         mobile 1994-03-04 male  70
    #2:   1336 62    stone         mobile 1988-11-04 male  65

Or you can use base R

    subset(subset(dat, stone_ny=="stone"), 
             ave(seq_along(us), deltnr, FUN=seq_along)==1)
     #    deltnr us stone_ny stone_mobility       date  sex val
    #4    1536 63    stone         mobile 1994-03-04 male  70
    #10   1336 62    stone         mobile 1988-11-04 male  65
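To see why the ave() condition keeps one row per deltnr, it may help to look at the counter it builds. A minimal sketch on a toy vector (`grp` is made up here for illustration and is not taken from the question):

```r
# Hypothetical group ids, standing in for the deltnr column
grp <- c(1536L, 1336L, 1336L, 1536L, 1336L)

# ave() applies FUN within each group and returns the results
# in the original row order, so this is a within-group row counter
idx <- ave(seq_along(grp), grp, FUN = seq_along)
print(idx)
# 1 1 2 2 3

# TRUE only for the first row of each group
first_in_group <- idx == 1
print(first_in_group)
# TRUE TRUE FALSE FALSE FALSE
```

Passing that logical vector as the second argument of subset() is what drops every row after the first within each deltnr.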

Data

I added some column names to your data

 dat <- structure(list(deltnr = c(1535L, 1536L, 1536L, 1536L, 154L, 154L, 
 1543L, 1543L, 1336L, 1336L, 1336L), us = c(63L, 61L, 62L, 63L, 
 61L, 62L, 61L, 62L, 61L, 62L, 63L), stone_ny = c("no_stone", 
 "no_stone", "no_stone", "stone", "no_stone", "no_stone", "no_stone", 
 "no_stone", "no_stone", "stone", "stone"), stone_mobility = c(NA, 
 NA, NA, "mobile", NA, NA, NA, NA, NA, "mobile", "mobile"), date = c("1994-09-21", 
 "1983-09-06", "1988-08-18", "1994-03-04", "1983-06-22", "1988-06-08", 
 "1983-08-17", "1988-08-17", "1983-08-22", "1988-11-04", "1988-11-05"
 ), sex = c("male", "male", "male", "male", "male", "male", "female", 
 "female", "male", "male", "male"), val = c(60L, 60L, 60L, 70L, 
 40L, 45L, 30L, 35L, 60L, 65L, 66L)), .Names = c("deltnr", "us", 
 "stone_ny", "stone_mobility", "date", "sex", "val"), class = "data.frame", row.names = c(NA, 
 -11L))

【Comments】:

I think the data.table solution would be: unique(setDT(dat)[stone_ny == "stone"], by="deltnr"). Not sure why you subset in j and then use it again in i... and why .SD..
@Arun Thanks. I think I messed up the solution :-)

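【Solution 2】:

You first want to drop all rows that contain "no_stone" in the stone_ny column, using something like

    good_rows <- grepl("\\bstone\\b", df$stone_ny)

Then use unique() to keep only the unique entries

    unique(df[good_rows, ])
    #    deltnr us stone_ny stone_mobility
    # 4    1536 63    stone         mobile
    # 10   1336 62    stone         mobile

Note that I dropped the last three columns because they had no names in your original post.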
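Two details of this approach can be checked on a small example. The word boundary `\\b` does not match between "_" and "s", so "no_stone" is excluded. And unique() compares whole rows, so two "stone" rows for the same deltnr that differ in another column are both kept. A sketch, assuming a toy data frame `toy` invented here for illustration:

```r
# Toy data (hypothetical): deltnr 1336 has two "stone" rows with different us
toy <- data.frame(
  deltnr   = c(1536L, 1536L, 1336L, 1336L),
  us       = c(62L, 63L, 62L, 63L),
  stone_ny = c("no_stone", "stone", "stone", "stone"),
  stringsAsFactors = FALSE
)

# "_" counts as a word character, so \\b does not match inside "no_stone"
good_rows <- grepl("\\bstone\\b", toy$stone_ny)
print(good_rows)

# unique() drops only fully identical rows: both 1336 rows survive
by_row <- unique(toy[good_rows, ])
print(nrow(by_row))

# Deduplicating on deltnr alone keeps the first "stone" row per deltnr
stone_rows <- toy[good_rows, ]
by_id <- stone_rows[!duplicated(stone_rows$deltnr), ]
print(by_id)
```

With the question's ten rows each deltnr has at most one "stone" row, so unique() on whole rows is enough there. The difference only shows up when a deltnr has several "stone" rows.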

【Comments】:

【Solution 3】:

Try:

    > ddf2 = ddf[ddf$stone_ny=='stone',]
    > ddf2[!duplicated(ddf2$deltnr),]
       deltnr us stone_ny stone_mobility
    4    1536 63    stone         mobile
    10   1336 62    stone         mobile

【Comments】: