Counting the repetition of a variable within a data frame and working out its proportional occurrence: answers

【Question title】: Counting the repetition of a variable within a data frame and working out its proportional occurrence
【Posted】: 2012-07-01 19:23:30
【Question】:

I am relatively new to R, so apologies in advance for my ineptitude.

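I am working with multiple (very large) datasets of observations made at many sites across a country over a number of years. I need to calculate the proportion of sites that noted a particular species in week x, out of the total number of sites that submitted observations in week x (basically presence/absence data). I have one dataset that gives the details of each individual species observation, and another that gives the total number of observations per week. So I need some function that counts the number of sites at which the species was recorded in a given week, and then divides that by the total number of sites that recorded any species observation in the same week. Observations are recorded by week (1-53) and year (1995-2011).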

Example of species.data (listed in csv format for ease of pasting):

SITE_ID, SPECIES, WEEKNO, YEAR
1289, Attenb., 1, 1995
1538, Attenb., 1, 1995
1894, Attenb., 2, 1995
1286, Attenb., 4, 1995
1238, Attenb., 7, 1995
1892, Attenb., 7, 1995

And an example of total.obs.data:

YEAR, WEEKNO, TOTALOBS
1995, 1, 100
1995, 2, 780
1995, 3, 100
1995, 4, 189
1995, 5, 382
1995, 6, 100
1995, 7, 899
1995, 8, 129

(So I would expect the proportion for week 1 of 1995 to be 2/100, and from there be able to build a GLM or GAM.)

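【Comments】:

Your problem is not difficult. You can do this quite easily with a combination of reshape and some subsetting. But please provide a reproducible sample dataset to work with. For example, where is the species in the second dataset?
If it is a large dataset, the data.table package may be your friend.
As @TylerRinker commented, please define what you mean by a "very large" dataset. There are big, large, and huge datasets.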

Tags: r

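【Solution 1】:

The data as posted is too simple to support any very complex testing. The xtabs function creates a contingency table (an array with one dimension per factor), which can then be divided by the totals for each week. Here dat is presumably the species observations and totobs the weekly totals: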

> xtblspec <-  xtabs( ~ SPECIES+ SITE_ID +WEEKNO + YEAR  , data=dat)     
> xtblspec
, , WEEKNO = 1, YEAR = 1995

         SITE_ID
SPECIES   1238 1286 1289 1538 1892 1894
  Attenb.    0    0    1    1    0    0

, , WEEKNO = 2, YEAR = 1995

         SITE_ID
SPECIES   1238 1286 1289 1538 1892 1894
  Attenb.    0    0    0    0    0    1

, , WEEKNO = 4, YEAR = 1995

         SITE_ID
SPECIES   1238 1286 1289 1538 1892 1894
  Attenb.    0    1    0    0    0    0

, , WEEKNO = 7, YEAR = 1995

         SITE_ID
SPECIES   1238 1286 1289 1538 1892 1894
  Attenb.    1    0    0    0    1    0
#-------------

weekobs <- totobs[ match( as.numeric(dimnames(xtblspec[ 1, ,  ,])$WEEKNO ) ,totobs$WEEKNO) ,
                  "TOTALOBS"]
#[1] 100 780 189 899

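To set up the table of species observations so that the division works correctly, WEEKNO needs to be the first dimension, because R recycles the weekobs vector down the first dimension of the array: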

xtblspec <-  xtabs( ~ WEEKNO +SPECIES+ SITE_ID  + YEAR  , data=dat)
> xtblspec/weekobs
, , SITE_ID = 1238, YEAR = 1995

      SPECIES
WEEKNO     Attenb.
     1 0.000000000
     2 0.000000000
     4 0.000000000
     7 0.001112347

, , SITE_ID = 1286, YEAR = 1995

      SPECIES
WEEKNO     Attenb.
     1 0.000000000
     2 0.000000000
     4 0.005291005
     7 0.000000000

, , SITE_ID = 1289, YEAR = 1995

      SPECIES
WEEKNO     Attenb.
     1 0.010000000
     2 0.000000000
     4 0.000000000
     7 0.000000000

, , SITE_ID = 1538, YEAR = 1995

      SPECIES
WEEKNO     Attenb.
     1 0.010000000
     2 0.000000000
     4 0.000000000
     7 0.000000000

, , SITE_ID = 1892, YEAR = 1995

      SPECIES
WEEKNO     Attenb.
     1 0.000000000
     2 0.000000000
     4 0.000000000
     7 0.001112347

, , SITE_ID = 1894, YEAR = 1995

      SPECIES
WEEKNO     Attenb.
     1 0.000000000
     2 0.001282051
     4 0.000000000
     7 0.000000000
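
The division above depends on R recycling the shorter vector down the first dimension of the array. A minimal sketch of that behaviour follows, using a made-up 3 x 2 array rather than the real table; the names counts and week_totals are illustrative only, and sweep() is shown as one way to state the margin explicitly.

```r
# Toy counts: 3 weeks (rows) by 2 sites (columns); the values are made up.
counts <- array(c(1, 0, 2,
                  0, 3, 1),
                dim = c(3, 2),
                dimnames = list(WEEKNO = c("1", "2", "4"),
                                SITE_ID = c("A", "B")))
week_totals <- c(100, 780, 189)   # one total per week, in row order

# Recycling runs down the first dimension, so row i is divided by week_totals[i].
by_recycling <- counts / week_totals

# sweep() names the margin explicitly and should give the same result.
by_sweep <- sweep(counts, MARGIN = 1, STATS = week_totals, FUN = "/")

all.equal(by_recycling, by_sweep)
```

If WEEKNO were not the first dimension, the plain division would still run without error but would pair counts with the wrong totals, which is why the explicit margin in sweep() can be the safer choice.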

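【Discussion】:

【Solution 2】:

Let me give it a try, bearing in mind all the limitations of the question that have already been pointed out in the comments above: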

#Create the data frame with the total observations
tot.obs<-data.frame(year=rep(1995,10), weekno=1:10, obs=floor(runif(n=10,80,100)))
#Create the variable week-year
tot.obs$week.year<-paste(tot.obs$weekno,tot.obs$year,sep="-")

#Create the data frame species observations
species.data<-data.frame(site=factor(floor(runif(n=5,2000,3000))), week=c(1,1,2,4,7), year=rep(1995,5),observ=rep(1,5))
species.data$week.year<-paste(species.data$week,species.data$year,sep="-")
species.data$total.obs<-NA

#Match the total observations from the tot.obs data frame to the species data frame. You can probably do it much faster but here is a "quick and dirty" way

for (i in 1:dim(species.data)[1]){
  species.data$total.obs[i]<-tot.obs$obs[tot.obs$week.year==species.data$week.year[i]]  
}

#Calculates the percentage of the total observation that each center contributes
species.data$per.obs<-species.data$observ/ species.data$total.obs 

#For the presentation of the data, reshape is your friend
library(reshape)
species.data.melt<-melt(species.data,id.vars=c("site","week.year"), measure.vars="per.obs")

cast(species.data.melt,site~week.year, fun.aggregate=sum)


  site     1-1995     2-1995     4-1995     7-1995
1 2436 0.00000000 0.00000000 0.01010101 0.00000000
2 2501 0.00000000 0.01123596 0.00000000 0.00000000
3 2590 0.00000000 0.00000000 0.00000000 0.01123596
4 2608 0.01030928 0.00000000 0.00000000 0.00000000
5 2942 0.01030928 0.00000000 0.00000000 0.00000000
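
The lookup loop above can probably be replaced by a vectorised match(), as the code comment hints. The sketch below is one way to do it, not the answerer's own method. It reuses the data frame and column names from this answer but fills them with the fixed sample values from the question instead of runif(); idx is an illustrative name.

```r
# Stand-ins for the two data frames above, using the question's sample values.
tot.obs <- data.frame(year = rep(1995, 8), weekno = 1:8,
                      obs = c(100, 780, 100, 189, 382, 100, 899, 129))
tot.obs$week.year <- paste(tot.obs$weekno, tot.obs$year, sep = "-")

species.data <- data.frame(site = c(1289, 1538, 1894, 1286, 1238, 1892),
                           week = c(1, 1, 2, 4, 7, 7),
                           year = rep(1995, 6),
                           observ = rep(1, 6))
species.data$week.year <- paste(species.data$week, species.data$year, sep = "-")

# match() gives, for each species row, the position of its key in tot.obs.
idx <- match(species.data$week.year, tot.obs$week.year)
species.data$total.obs <- tot.obs$obs[idx]   # NA where a week has no total
species.data$per.obs <- species.data$observ / species.data$total.obs
species.data
```

Unlike the loop, this leaves NA rather than raising an error when a week-year key is missing from tot.obs, so missing totals are worth checking for afterwards.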

Otherwise, if you are not interested in the observations for each individual site, things are much easier:

species.data.melt2<-melt(species.data,id.vars=c("week.year"), measure.vars="observ")
species.obs.total<-data.frame(cast(species.data.melt2,week.year~value, fun.aggregate=sum))
colnames(species.obs.total)[2]<-"aggregated.total"
species.obs.total$total<-NA

for (i in 1:dim(species.obs.total)[1]){
  species.obs.total$total[i]<-tot.obs$obs[tot.obs$week.year==species.obs.total$week.year[i]]  
}

species.obs.total$perc<-species.obs.total$aggregated.total/ species.obs.total$total
species.obs.total


  week.year aggregated.total total       perc
1    1-1995                2    97 0.02061856
2    2-1995                1    89 0.01123596
3    4-1995                1    99 0.01010101
4    7-1995                1    89 0.01123596
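
The same weekly proportion can also be sketched in base R with aggregate() and merge(), without loops or the reshape package. This is an alternative under stated assumptions, not the method used in the answer: it takes the sample rows from the question, assumes TOTALOBS is the number of sites reporting in that week (as the question's 2/100 example suggests), and introduces the illustrative names site.counts, weekly, N_SITES and PROPORTION.

```r
# Sample rows from the question, read from text so the sketch is self-contained.
species.data <- read.csv(text = "SITE_ID,SPECIES,WEEKNO,YEAR
1289,Attenb.,1,1995
1538,Attenb.,1,1995
1894,Attenb.,2,1995
1286,Attenb.,4,1995
1238,Attenb.,7,1995
1892,Attenb.,7,1995")

total.obs.data <- read.csv(text = "YEAR,WEEKNO,TOTALOBS
1995,1,100
1995,2,780
1995,3,100
1995,4,189
1995,5,382
1995,6,100
1995,7,899
1995,8,129")

# Count the distinct sites reporting each species in each week of each year.
site.counts <- aggregate(SITE_ID ~ SPECIES + WEEKNO + YEAR, data = species.data,
                         FUN = function(x) length(unique(x)))
names(site.counts)[names(site.counts) == "SITE_ID"] <- "N_SITES"

# Attach the weekly totals; all.y = TRUE keeps weeks with no sightings.
# (With several species, SPECIES is NA in those rows and a full
# species-by-week grid would be needed instead.)
weekly <- merge(site.counts, total.obs.data, by = c("YEAR", "WEEKNO"), all.y = TRUE)
weekly$N_SITES[is.na(weekly$N_SITES)] <- 0
weekly$PROPORTION <- weekly$N_SITES / weekly$TOTALOBS
weekly <- weekly[order(weekly$YEAR, weekly$WEEKNO), ]
weekly
```

For week 1 of 1995 this gives 2 sites out of 100, the proportion the question expects. The N_SITES and TOTALOBS columns could then feed a binomial GLM, for example as a successes/failures pair.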

【Discussion】: