【Question title】: Count number of values in R [duplicate]
【Posted】: 2017-01-25 21:54:55
【Question】:

I have the following dataset:

    ClaimType ClaimDay ClaimCost   dates    month      day
1         1        1     10811 1970-01-01     1 1970-01-01
2         1        1     18078 1970-01-01     1 1970-01-01
3         1        2     44579 1970-01-01     1 1970-01-02
4         1        3     23710 1970-01-01     1 1970-01-03
5         1        4     29580 1970-01-01     1 1970-01-04
6         1        4     36208 1970-01-01     1 1970-01-04

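I want to create a new dataset containing the ClaimDay and day columns. ClaimDay should hold the count of each value. For example, since we have two ones, one two, one three, and then two fours, I want the new dataset to look like this: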

ClaimDay   day
2         1970-01-01
1         1970-01-02
1         1970-01-03
2         1970-01-04

As you can see, ClaimDay and day are related.

I tried

mydata <- aggregate(ClaimDay~Day,FUN=sum,data=mydata)$ClaimDay

But the problem is that, when aggregating, it computes the sum.

Can anyone help me with my problem?

【Comments】:

  • You could use table
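
The table suggestion could look roughly like the sketch below. The data frame is rebuilt by hand from the sample in the question (only the two relevant columns), and the name counts is just an illustrative choice, not something from the thread.

```r
# Rebuild the two relevant columns of the sample data from the question.
mydata <- data.frame(
  ClaimDay = c(1, 1, 2, 3, 4, 4),
  day = as.Date(c("1970-01-01", "1970-01-01", "1970-01-02",
                  "1970-01-03", "1970-01-04", "1970-01-04"))
)

# table() counts how often each distinct day occurs;
# as.data.frame() turns the counts into a data frame, with the
# count column named via responseName.
counts <- as.data.frame(table(day = mydata$day), responseName = "ClaimDay")

# Put the columns in the order requested in the question.
counts <- counts[, c("ClaimDay", "day")]
print(counts)
```

Note that day comes back as a factor here; as.Date(as.character(counts$day)) converts it back to a Date if that matters.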

Tags: r dataframe aggregate


【Solution 1】:

You can try any of the following approaches:

base R

aggregate(ClaimDay~day,FUN=length,data=mydata)

tapply

as.data.frame(tapply(mydata$ClaimDay, mydata$day, length), responseName='ClaimDay')

by

by(mydata$ClaimDay, mydata$day, length, simplify = TRUE)

dplyr

library(dplyr)
mydata %>% count(day)

data.table

library(data.table)
data.table(mydata)[,.(ClaimDay=length(ClaimDay)),by=day]

plyr

library(plyr)
ddply(mydata,~day,summarise,ClaimDay=length(day))

sqldf

library(sqldf)
sqldf('select count(ClaimDay) as ClaimDay, day from mydata group by day')

#  ClaimDay        day
#1        2 1970-01-01
#2        1 1970-01-02
#3        1 1970-01-03
#4        2 1970-01-04
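
For anyone wondering why the attempt in the question did not work: FUN=sum adds up the ClaimDay values within each day, while length returns the number of rows per day. A minimal sketch on a hand-rebuilt copy of the sample data (the names sums and counts are only for illustration):

```r
# Rebuild the two relevant columns of the sample data from the question.
mydata <- data.frame(
  ClaimDay = c(1, 1, 2, 3, 4, 4),
  day = as.Date("1970-01-01") + c(0, 0, 1, 2, 3, 3)
)

# FUN = sum adds the ClaimDay values within each day.
sums <- aggregate(ClaimDay ~ day, data = mydata, FUN = sum)

# FUN = length returns how many rows fall on each day.
counts <- aggregate(ClaimDay ~ day, data = mydata, FUN = length)

print(sums$ClaimDay)    # 2 2 3 8
print(counts$ClaimDay)  # 2 1 1 2
```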

And the benchmark results:

library('microbenchmark')
microbenchmark(agg=aggregate(ClaimDay~day,FUN=length,data=mydata), 
               dplyr=mydata %>% dplyr::count(day), 
               data.table=data.table(mydata)[,.(ClaimDay=length(ClaimDay)),by=day], 
               plyr=ddply(mydata,~day,summarise,ClaimDay=length(day)),
               tapply=as.data.frame(tapply(mydata$ClaimDay, mydata$day, length), responseName='ClaimDay'),
               sqldf=sqldf('select count(ClaimDay) as ClaimDay, day from mydata group by day'),
               by=by(mydata$ClaimDay, mydata$day, length, simplify = TRUE),
               times=500)

Unit: microseconds
       expr      min        lq       mean    median        uq       max neval    cld
        agg 1280.399 1408.2675  1655.8207 1458.9445  1845.331  7732.426   500   c   
      dplyr 1019.102 1177.3345  1350.3923 1220.0995  1356.736  3835.208   500  b    
 data.table 1690.092 1883.8190  2208.6055 1957.1630  2234.283  5493.653   500    d  
       plyr 2334.995 2482.7495  2847.0871 2554.5960  2944.404  6620.096   500     e 
     tapply  226.658  273.0580   342.0902  304.0635   353.244  2748.965   500 a     
      sqldf 8395.718 9057.0870 10458.0976 9440.2650 11389.515 61480.071   500      f
         by  353.243  415.0395   492.2115  449.2520   509.765  4331.287   500 a  

【Comments】:

  • When the benchmark is in microseconds, it means the data is too small.
  • Yes, the data is small
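
To follow up on the data size point, here is a rough sketch of how the comparison could be repeated on a larger, synthetic dataset using only base R. The row count, the seed, and the names big, t_tapply and t_agg are all assumptions made for illustration; no timings are claimed here, since they depend on the machine.

```r
set.seed(1)   # arbitrary seed so the synthetic data is reproducible
n <- 1e5      # assumed size; increase to stress the methods further

# Synthetic data: a random claim day between 1 and 365 for each row,
# with the matching calendar date.
big <- data.frame(ClaimDay = sample(1:365, n, replace = TRUE))
big$day <- as.Date("1970-01-01") + big$ClaimDay - 1

# Time two of the base R approaches from the answer on the larger data.
t_tapply <- system.time(
  res_tapply <- tapply(big$ClaimDay, big$day, length)
)
t_agg <- system.time(
  res_agg <- aggregate(ClaimDay ~ day, data = big, FUN = length)
)

# Elapsed seconds for each approach.
print(c(tapply = t_tapply[["elapsed"]], aggregate = t_agg[["elapsed"]]))
```

The same data frame can be passed to microbenchmark with the other expressions from the answer, probably with a smaller times value.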
【Solution 2】:

If you don't mind a dplyr solution, this works for your sample data

library(dplyr)
df %>% select(ClaimDay, day) %>% 
     group_by(day) %>% 
     mutate(ClaimDay.count = n()) %>% 
     slice(1)

【Comments】:

  • The dplyr solution could just use the dedicated count function
  • @David Arenburg Yes, I added a solution