【Question title】: How to find number of unique ids corresponding to each date in a data frame
【Posted】: 2016-01-19 23:24:14
【Question】:

I have a data frame that looks like this:

      date         time              id            datetime    
1 2015-01-02 14:27:22.130 999000000007628 2015-01-02 14:27:22 
2 2015-01-02 14:41:27.720 989001002807730 2015-01-02 14:41:27 
3 2015-01-02 14:41:27.940 989001002807730 2015-01-02 14:41:27 
4 2015-01-02 14:41:28.140 989001002807730 2015-01-02 14:41:28 
5 2015-01-02 14:41:28.170 989001002807730 2015-01-02 14:41:28 
6 2015-01-02 14:41:28.350 989001002807730 2015-01-02 14:41:28 

I need to find the number of unique "id" values for each "date" in this data frame.

I tried this:

sums<-data.frame(date=unique(data$date), numIDs=0)

for(i in unique(data$date)){
  sums[sums$date==i,]$numIDs<-length(unique(data[data$date==i,]$id))
}

I get the following error:

 Error in `$<-.data.frame`(`*tmp*`, "numIDs", value = 0L) : 
   replacement has 1 row, data has 0
 In addition: Warning message:
 In `==.default`(data$date, i) :
   longer object length is not a multiple of shorter object length

Any ideas? Thanks!

Here is a sample of the data. Hope this helps!

data <- structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(115L, 115L, 115L, 115L, 
    115L, 115L, 115L, 115L, 115L, 115L), wday = c(5L, 5L, 5L, 
    5L, 5L, 5L, 5L, 5L, 5L, 5L), yday = c(1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L), zone = c("PST", "PST", "PST", "PST", "PST", 
    "PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", 
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), time = c("14:27:22.130", 
"14:41:27.720", "14:41:27.940", "14:41:28.140", "14:41:28.170", 
"14:41:28.350", "14:41:28.390", "14:41:28.520", "14:41:28.630", 
"14:41:28.740"), id = c("999000000007628", "989001002807730", 
"989001002807730", "989001002807730", "989001002807730", "989001002807730", 
"989001002807730", "989001002807730", "989001002807730", "989001002807730"
), datetime = structure(list(sec = c(22.13, 27.72, 27.94, 28.14, 
28.17, 28.35, 28.39, 28.52, 28.63, 28.74), min = c(27L, 41L, 
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L), hour = c(14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 14L), mday = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 115L, 
115L, 115L, 115L), wday = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
     5L), yday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), isdst = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("PST", "PST", "PST", 
    "PST", "PST", "PST", "PST", "PST", "PST", "PST"), gmtoff =     c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", 
    "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), site = c("Chivato", 
    "Chivato", "Chivato", "Chivato", "Chivato", "Chivato", "Chivato", 
    "Chivato", "Chivato", "Chivato")), .Names = c("date", "time", 
    "id", "datetime", "site"), row.names = c(NA, 10L), class = "data.frame")

【Comments】:

    Tags: r for-loop dataframe unique


    【Solution 1】:

    You can use the uniqueN function from data.table:

    library(data.table)
    setDT(df)[, uniqueN(id), by = date]
    

    Or (per @Richard Scriven's comment):

    aggregate(id ~ date, df, function(x) length(unique(x)))
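
To make the aggregate() call above concrete, here is a minimal sketch on a small, made-up data frame. The toy values and the column name numIDs (borrowed from the question's own attempt) are assumptions for illustration, and the date column is assumed to already be of class Date.

```r
# Hypothetical toy data: two dates, with repeated ids on each date
df <- data.frame(
  date = as.Date(c("2015-01-02", "2015-01-02", "2015-01-02",
                   "2015-01-03", "2015-01-03")),
  id   = c("a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Count the distinct ids within each date
counts <- aggregate(id ~ date, data = df,
                    FUN = function(x) length(unique(x)))

# Rename the result column so it is not confused with the raw ids
names(counts)[2] <- "numIDs"
print(counts)
```

In this toy data the first date has two distinct ids ("a" and "b") and the second date has one ("c").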
    

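    【Discussion】:

    • I tried setDT(data)[, uniqueN(id), by = date] and got the error: Error in byval[[jj]] : subscript out of bounds
    • @Alexis It should work (and it does work on the example data you provided). Could you include in the question the data and code that reproduce the error?
    • @Alexis Please post dput(head(data)) inside your question.
    • @Alexis You just posted head(data) in the comments. Please post dput(head(data, n = 10)).
    • @Alexis The format of the date column is causing the problem. You can try: setDT(data)[, uniqueN(id), by = as.Date(date)]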
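
The last comment points at the class of the date column. The dput() output in the question shows that date is a POSIXlt object, which R stores internally as a list of components (sec, min, hour, ...); that is likely why both the for loop in the question and grouping by the raw column misbehave. The following base R sketch, on made-up data, converts the column to Date first and then counts distinct ids per date. The toy values are assumptions, and tapply() is used here only as one base R option, not as the method from the answer.

```r
# Hypothetical toy data with a POSIXlt date column, in the spirit of the question
data <- data.frame(
  id = c("999000000007628", "989001002807730", "989001002807730"),
  stringsAsFactors = FALSE
)
data$date <- as.POSIXlt(c("2015-01-02", "2015-01-02", "2015-01-03"), tz = "UTC")

# A POSIXlt value is a list underneath, unlike an atomic Date vector
is_list_based <- is.list(unclass(data$date))

# Convert the list-based POSIXlt column to an atomic Date column
data$date <- as.Date(data$date)

# Number of distinct ids for each date
numIDs <- tapply(data$id, data$date, function(x) length(unique(x)))
print(numIDs)
```

After the conversion, data$date can also be compared with == or used as a grouping key in the data.table call from the comment.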
    【Solution 2】:

    Or we can use n_distinct from library(dplyr):

    library(dplyr) 
    df %>%
       group_by(date) %>%
       summarise(id=n_distinct(id))
    

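    【Discussion】:

    • @Pascal If the OP's "date" column is of class POSIXlt, which dplyr does not allow, it is better to convert it to the "Date" class outside the %>% chain; then it should work.
    • Good to know. Thanks.
    【Solution 3】: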
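
As a hedged illustration of the comment above, this sketch converts a POSIXlt date column to Date before the pipe and then applies the n_distinct() approach from this answer. It assumes dplyr is installed; the toy data and the result column name numIDs are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with a POSIXlt date column
df <- data.frame(id = c("a", "b", "b", "c"), stringsAsFactors = FALSE)
df$date <- as.POSIXlt(c("2015-01-02", "2015-01-02", "2015-01-02", "2015-01-03"),
                      tz = "UTC")

# Convert to Date outside the pipe, as the comment suggests
df$date <- as.Date(df$date)

# Count distinct ids per date
result <- df %>%
  group_by(date) %>%
  summarise(numIDs = n_distinct(id))
print(result)
```

Doing the conversion before the pipe means the grouping step only ever sees an atomic Date column.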

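    This answer is a response to the post "group by and then count unique observations", which was marked as a duplicate while I was writing this draft. It is not a response to the question that the duplicate is based on, "How to find number of unique ids corresponding to each date in a data frame", which asks about finding unique IDs. I am not sure the second post really answers the OP's question, i.e.,

    "I want to create a table with each unique combination of id, group1, and group2."

    The key word here is "combination". The interpretation is that each id has a specific group1 value and a specific group2 value, so the data set of interest is the specific set of values c(id, group1, group2).

    Here is the data.frame provided by the OP: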

    df1 <- data.frame(id = sample(letters, 10000, replace = TRUE),
                      group1 = sample(1:2, 10000, replace = TRUE),
                      group2 = sample(100:101, 10000, replace = TRUE))
    

    Using data.table, inspired by this post: https://stackoverflow.com/a/13017723/5220858

    >library(data.table)
    >DT <- data.table(df1)
    >DT[, .N, by = .(group1, group2)]
    
       group1 group2    N
    1:      1    100 2493
    2:      1    101 2455
    3:      2    100 2559
    4:      2    101 2493
    

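    N is the count of ids that have a specific group1 value and a specific group2 value. Extending the grouping to include id returns a table of 104 unique combinations of id, group1, and group2.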

    >DT[, .N, by = .(id, group1, group2)]
    
         id group1 group2   N
      1:  t      1    100 107
      2:  g      1    101  85
      3:  l      1    101  98
      4:  a      1    100  83
      5:  j      1    101  98
     ---                     
    100:  p      1    101  96
    101:  r      2    101  91
    102:  y      1    101 104
    103:  g      1    100  83
    104:  r      2    100  77
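
For readers without data.table, the same two results can be sketched in base R. This is only an illustrative alternative, not the method used in this answer; set.seed(1) and the helper column n are assumptions added here, so the individual counts will differ from the output shown above.

```r
set.seed(1)  # assumption: seed added only so that reruns give the same sample

df1 <- data.frame(id = sample(letters, 10000, replace = TRUE),
                  group1 = sample(1:2, 10000, replace = TRUE),
                  group2 = sample(100:101, 10000, replace = TRUE))

# Distinct (id, group1, group2) combinations; at most 26 * 2 * 2 = 104
combos <- unique(df1[, c("id", "group1", "group2")])
n_combos <- nrow(combos)

# Rows per combination, analogous to .N grouped by id, group1, group2
counts <- aggregate(n ~ id + group1 + group2,
                    data = transform(df1, n = 1L),
                    FUN = sum)
print(n_combos)
print(head(counts))
```

With 10,000 rows drawn from only 104 possible combinations, it is very likely that every combination appears at least once, which matches the 104 rows reported above.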
    

    【Discussion】:
