R: calculating mean and sd for each week of the year (answers)

【Question title】: R: calculating mean and sd for each week of the year
【Posted】: 2018-06-18 18:36:03
【Question description】:

I have 4 data frames, one per year. Each data frame contains daily rainfall for five locations.

Generating sample data

    location <- c("A","B","C","D","E")
    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits=2)
    dat.1981 <-as.data.frame(cbind(location,mat)) # rainfall for 1981
    dat.1981$year <- 1981

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1982 <- as.data.frame(cbind(location,mat)) # rainfall for 1982
    dat.1982$year <- 1982

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1983 <-as.data.frame(cbind(location,mat)) # rainfall for 1983
    dat.1983$year <- 1983

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1984 <-as.data.frame(cbind(location,mat)) # rainfall for 1984
    dat.1984$year <- 1984

    dat <- as.data.frame(rbind(dat.1981,dat.1982,dat.1983,dat.1984))

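For each year, I want to classify whether a given day is an extremely wet day.

My method is as follows:

1) For each location, generate the mean and standard deviation of weekly rainfall over the period 1981 to 1984. For example, at location A, the mean rainfall for week 1 is:

(week 1 rainfall at A in 1981 + week 1 rainfall at A in 1982 + week 1 rainfall at A in 1983 + week 1 rainfall at A in 1984) / 4

In R this can be written as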

    mean.week1.loc1 <- mean(rowSums(dat[dat$location=="A",2:8])) # 2:8 selects the first 7 days in each year
    sd.week1.loc1 <- sd(rowSums(dat[dat$location=="A",2:8])) 

    wet.cr <- mean.week1.loc1 + sd.week1.loc1 # this is my threshold for defining a wet day

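If a daily rainfall value at location A in week 1 of 1981 to 1984 is greater than wet.cr, that day is a wet day and gets the value 1.

For example, to check whether the week 1 days at location A in 1981 to 1984 are wet days, I can do the following:

    lapply(dat[dat$location == "A", 2:8], function(x) ifelse(x > wet.cr, 1, 0))

I want to repeat this for every week and every location.

However, I am unable to stitch these separate pieces together. My final result should be a data frame with the same shape as dat, but with 1 or 0 marking whether a day is wet instead of the rainfall values.

EDIT

The solution below gives me the following: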

mean(c(rainfall 1981 day 1 week 1, ...., rainfall 1981 day 7 week 1, rainfall 1982 day 1 week 1,....,rainfall 1982 day 7 week 1,....,rainfall 1984 day 1 week 1,....,rainfall 1984 day 7 week 1))

What I want:

mean(c(mean(total rainfall week 1 1981), mean(total rainfall week 1 1982), mean(total rainfall week 1 1983), mean(total rainfall week 1 1984)))

I hope this is clear now.
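The difference between the two expressions above can be made concrete with a small base R sketch. The matrix `week1` and all variable names here are hypothetical toy data for a single location, not part of the original question.

```r
set.seed(1)
# Hypothetical toy data: 4 years (rows) x 7 days (columns) of week-1 rainfall
# at a single location
week1 <- matrix(round(runif(28), 2), nrow = 4,
                dimnames = list(1981:1984, paste0("V", 1:7)))

# Pooled approach: mean and sd over all 28 daily values
pooled_mean <- mean(as.vector(week1))
pooled_sd   <- sd(as.vector(week1))

# Requested approach: one weekly total per year, then mean and sd
# across the 4 yearly totals
weekly_totals <- rowSums(week1)
target_mean   <- mean(weekly_totals)
target_sd     <- sd(weekly_totals)

c(pooled_mean = pooled_mean, target_mean = target_mean,
  pooled_sd = pooled_sd, target_sd = target_sd)
```

With 7 days in every week, the mean of the weekly totals is exactly 7 times the pooled daily mean, but the two standard deviations are not related by a fixed factor, so the resulting thresholds differ.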

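【Comments】:

This seems harder than it needs to be. If you have all the dates, use ISOweek to get the year and week, then summarise across years with tidyr. If you give a data example that has full dates rather than being broken down by week, I can show you....
I doubt the data is really stored as shown, since not every year has 365 days. If you have dates, you can reshape to long format, add a column for the week of the year, group by that column and get the mean and sd, which you can then compare against the values... Is the data really stored as shown?
It is worth noting that 1984 had 366 days. Which one is missing from your data frame?
I removed the extra day in leap years so that all years have the same 365 days
@Tino This is how I have the data. I can try to manipulate it the way you suggest.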
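If full dates are available, as the comments assume, the day of year and a week index can be derived from them. The following is a minimal base R sketch for one year, assuming the leap day is dropped as described above and assuming weeks are plain 7-day blocks counted from 1 January (this is not the ISO week definition that ISOweek uses). The names `dates`, `doy` and `week` are illustrative.

```r
# All calendar days of the leap year 1984
dates <- seq(as.Date("1984-01-01"), as.Date("1984-12-31"), by = "day")
n_days_with_leap_day <- length(dates)

# Drop 29 February so that every year has the same 365 days
dates <- dates[format(dates, "%m-%d") != "02-29"]

# Day of year after dropping the leap day (1..365 within this single year)
doy <- seq_along(dates)

# 1-based week index in 7-day blocks; day 365 falls alone into week 53
week <- (doy - 1) %/% 7 + 1

head(data.frame(date = dates, doy = doy, week = week), 10)
```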

Tags: r dplyr apply

【Solution 1】:

A tidyverse solution

    library(magrittr)
    library(tidyverse)

    dat_m <- gather(dat, day, rainfall, -location, -year)
    str(dat_m)

    dat_m %<>%
      mutate(day = gsub("V", "", day)) %>%
      mutate(day = as.numeric(day)) %>% 
      mutate(week = as.integer(ceiling(day/7))) %>% 
      group_by(location, week) %>% 
      mutate(wet.cr = mean(rainfall, na.rm = TRUE) + sd(rainfall, na.rm = TRUE) ) %>% 
      mutate(indication = ifelse(rainfall > wet.cr, 1, 0)) %>% 
      ungroup()
    dat_m 

    # A tibble: 7,300 x 7
       location  year   day rainfall  week wet.cr indication
       <fctr>   <dbl> <dbl>    <dbl> <int>  <dbl>      <dbl>
     1 A         1981  1.00    0.880     1  0.845       1.00
     2 B         1981  1.00    0.850     1  0.829       1.00
     3 C         1981  1.00    1.00      1  0.877       1.00
     4 D         1981  1.00    0.100     1  0.755       0   
     5 E         1981  1.00    0.190     1  0.750       0   
     6 A         1982  1.00    0.380     1  0.845       0   
     7 B         1982  1.00    0.760     1  0.829       0   
     8 C         1982  1.00    0.940     1  0.877       1.00
     9 D         1982  1.00    0.900     1  0.755       1.00
    10 E         1982  1.00    0.600     1  0.750       0   
    # ... with 7,290 more rows

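EDIT: for rainfall, I think using sum (total) is better than using mean

So we first calculate the total weekly rainfall for each year. Then we calculate the long-term mean and standard deviation of the total weekly rainfall.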

    dat_m %<>%
      mutate(day = as.numeric(gsub("V", "", day)),
             week = as.integer(ceiling(day/7))) %>%
      group_by(location, week, year) %>% 
      mutate(total_weekly_rainfall = sum(rainfall, na.rm = TRUE)) %>% 
      ungroup() %>% 
      group_by(location, week) %>% 
      mutate(mean_weekly_rainfall = sum(rainfall, na.rm = TRUE)/length(unique(year)), # mean of the yearly weekly totals
             stddev_weekly_rainfall = sd(rainfall, na.rm = TRUE), # sd of the daily values in the group, not of the weekly totals
             wet.cr =  mean_weekly_rainfall + stddev_weekly_rainfall,
             indication = ifelse(total_weekly_rainfall > wet.cr, 1, 0)) %>% 
      arrange(location, year, day) %>% 
      ungroup() %>% 
      distinct(location, year, week, .keep_all = TRUE)
    dat_m 

    # A tibble: 1,060 x 10
       location  year   day rainfall  week total_wee~ mean_wee~ stddev_w~ wet.~ indic~
       <fctr>   <dbl> <dbl>    <dbl> <int>      <dbl>     <dbl>     <dbl> <dbl>  <dbl>
     1 A         1981  1.00   0.880      1     0.880      0.630     0.277 0.907      0
     2 A         1981  8.00   0.190      2     0.190      0.330     0.431 0.761      0
     3 A         1981 15.0    0.630      3     0.630      0.548     0.331 0.878      0
     4 A         1981 22.0    0.0300     4     0.0300     0.290     0.259 0.549      0
     5 A         1981 29.0    0.360      5     0.360      0.308     0.196 0.504      0
     6 A         1981 36.0    0.540      6     0.540      0.500     0.225 0.725      0
     7 A         1981 43.0    0.0300     7     0.0300     0.375     0.289 0.664      0
     8 A         1981 50.0    0.170      8     0.170      0.332     0.375 0.708      0
     9 A         1981 57.0    0.260      9     0.260      0.652     0.272 0.924      0
    10 A         1981 64.0    0.590     10     0.590      0.512     0.202 0.715      0
    # ... with 1,050 more rows
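The block above takes `sd(rainfall)` over the daily values, while the description asks for the standard deviation of the weekly totals. A minimal sketch of that variant follows, assuming dplyr 1.0 or later and tidyr 1.0 or later. The helper `make_year` and the result name `weekly` are hypothetical, and the data is regenerated so the example is self-contained. It classifies each location-year-week total, as in the edit above, rather than individual days.

```r
library(dplyr)
library(tidyr)

set.seed(42)
# Hypothetical helper: toy data in the same wide layout as `dat`
# (5 locations per year, columns V1..V365)
make_year <- function(yr) {
  mat <- as.data.frame(matrix(round(runif(5 * 365), 2), nrow = 5))
  data.frame(location = c("A", "B", "C", "D", "E"), mat, year = yr)
}
dat <- do.call(rbind, lapply(1981:1984, make_year))

weekly <- dat %>%
  pivot_longer(starts_with("V"), names_to = "day", values_to = "rainfall") %>%
  mutate(day  = as.integer(sub("^V", "", day)),
         week = as.integer(ceiling(day / 7))) %>%
  # one total per location, year and week
  group_by(location, year, week) %>%
  summarise(total_weekly_rainfall = sum(rainfall, na.rm = TRUE),
            .groups = "drop") %>%
  # long-term statistics across the yearly totals of each location and week
  group_by(location, week) %>%
  mutate(mean_weekly_rainfall   = mean(total_weekly_rainfall),
         stddev_weekly_rainfall = sd(total_weekly_rainfall),
         wet.cr     = mean_weekly_rainfall + stddev_weekly_rainfall,
         indication = as.integer(total_weekly_rainfall > wet.cr)) %>%
  ungroup()

weekly
```

Summarising to one row per location, year and week before computing `sd()` matters: computing it inside `mutate()` on the daily rows would repeat each total seven times and change the result.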

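【Discussion】:

Sorry, I think there is an error in this function. It is doing something other than what I intended
You want to compare the total rainfall of a week rather than single-day rainfall?
I need this. mean(c(mean(total rainfall week 1 1981), mean(total rainfall week 1 1982), mean(total rainfall week 1 1983), mean(total rainfall week 1 1984)))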

【Solution 2】:

Using data.table:

    library(data.table)
    dat <- setDT(dat)
    # reshape to long format: one row per location, year and day
    newdat <- melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")
    # turn "V1".."V365" into the numbers 1..365
    newdat[,day := as.character(day)]
    newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
    # integer division: days 1-6 give week 0, days 7-13 give week 1, and so on
    newdat[,Week := day %/% 7]
    # threshold per location and week, pooled over all years
    newdat[,threshold := mean(rain) + sd(rain),  by = .(location,Week)]
    newdat[,wet := ifelse(rain > threshold,1,0)]
    print(newdat,topn = 100)


          location year day rain Week threshold wet
       1:        A 1981   1 0.73    0 0.7630065   0
       2:        B 1981   1 0.69    0 0.8599243   0
       3:        C 1981   1 0.45    0 0.8145956   0
       4:        D 1981   1 0.51    0 0.8935058   0
       5:        E 1981   1 0.77    0 0.6992752   1
       6:        A 1982   1 0.47    0 0.7630065   0
       7:        B 1982   1 0.70    0 0.8599243   0
       8:        C 1982   1 0.48    0 0.8145956   0
       9:        D 1982   1 0.92    0 0.8935058   1

Step-by-step explanation: first you need to change the format of your data to simplify the calculation. A long format is more appropriate, because the V## columns really encode a single variable, the day number. This is done using melt

    melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")

         location year  day rain
       1:        A 1981   V1 0.73
       2:        B 1981   V1 0.69
       3:        C 1981   V1 0.45
       4:        D 1981   V1 0.51
       5:        E 1981   V1 0.77
      ---                        
    7296:        A 1984 V365 0.31
    7297:        B 1984 V365 0.99
    7298:        C 1984 V365 0.25
    7299:        D 1984 V365 0.24
    7300:        E 1984 V365 0.87

Then you convert the day into an actual number, so that the week can be calculated

    newdat[,day := as.character(day)]
    newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
    newdat[,.(day,year)]
          day year
       1:   1 1981
       2:   1 1981
       3:   1 1981
       4:   1 1981
       5:   1 1981

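Then calculate the week number, the same way you did

    newdat[,Week := day %/% 7]

The statistics for the threshold are calculated by grouping by week and location (so the statistics are per location for each week of the year)

    newdat[,threshold := mean(rain) + sd(rain), by = .(location,Week)]

And define your wet days as the days on which rainfall is above the threshold

    newdat[,wet := ifelse(rain > threshold,1,0)]

But I agree with the comments that the initial data is surely available in a better format than the one you provided.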

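【Discussion】:

This is really helpful and looks good. You are right. My actual data is separate for each location. For each location I have a csv file in long format with year, doy and rainfall values.
A quick question: the 7th day of the year ends up in week 1, although it should be in week 0.
If I am not mistaken you can do newdat[,Week := (day-1) %/% 7]
I did this: newdat[,Week := day %/% 8]
That will make the weeks 8 days long.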
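The three week formulas from this discussion, plus the `ceiling(day/7)` used in Solution 1, can be compared side by side. This is a small base R sketch; the column names are illustrative.

```r
day <- 1:21

weeks <- data.frame(
  day     = day,
  div7    = day %/% 7,        # Solution 2 as posted: day 7 already starts week 1
  shifted = (day - 1) %/% 7,  # suggested fix: days 1-7 are week 0
  ceil7   = ceiling(day / 7), # Solution 1: days 1-7 are week 1
  div8    = day %/% 8         # asker's attempt: days 8-15 share one week
)

weeks

# Number of days assigned to each week index under each formula
lapply(weeks[-1], table)
```

Only `(day - 1) %/% 7` and `ceiling(day / 7)` put exactly seven days into every full week; they differ only in whether counting starts at 0 or 1.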

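【Solution 3】:

For both the data.table and the tidyverse solution, you could treat this as a scaling exercise (z-scores in many disciplines), because mean + n standard deviations is a well-known benchmark.

For the data.table solution, you would do:

    newdat[,zrain := as.numeric(scale(rain)),  by = .(location,Week)] # scale() returns a one-column matrix, as.numeric() flattens it
    newdat[,zwet := ifelse(zrain > 1.0,1,0)]

You rely on scale from base R and compare against 1.0

For the tidyverse, inside the pipeline grouped by location and week, it becomes:

    mutate(zrain = as.numeric(scale(rainfall))) %>% 
    mutate(indication = ifelse(zrain > 1.0, 1, 0)) %>%

This way, if your criterion for "wet" changes in the future, you only need to change one number in the code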
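To see why the z-score form is equivalent to the explicit threshold, here is a minimal base R sketch on hypothetical toy values for one location and week. `n_sd` is an illustrative name for the single number that would change if the "wet" criterion changes.

```r
set.seed(7)
# Hypothetical daily rainfall values for one location and one week, pooled over years
rain <- round(runif(28), 2)

n_sd <- 1  # number of standard deviations above the mean that counts as wet

# Explicit threshold, as in Solutions 1 and 2
wet_threshold <- as.integer(rain > mean(rain) + n_sd * sd(rain))

# z-score form: scale() centres by the mean and divides by sd()
zrain      <- as.numeric(scale(rain))
wet_zscore <- as.integer(zrain > n_sd)

identical(wet_threshold, wet_zscore)
```

Both forms flag the same days as long as no value sits exactly on the threshold, where floating-point rounding could in principle tip the comparison either way.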

【Discussion】: