【Question title】: Conditional statistics on an R data frame
【Posted】: 2020-06-17 08:58:08
【Question description】:

I have about 50 data frames that I use to analyze air pollution. Here is an example:

> Amsterdam_CO2
   Chemicals Begin.Date   End.Date Less.Than    Value Uncertainty.Value Measuring.Unit
1    CO2 2019-01-31 2019-01-31         <      1.0714000                NA          Mol/KG
2    CO2 2019-02-28 2019-02-28         <      0.4609000                NA          Mol/KG
3    CO2 2019-03-28 2019-03-28         <      0.7020623                NA          Mol/KG
4    CO2 2019-04-25 2019-04-25         <      0.5563282                NA          Mol/KG
5    CO2 2019-05-22 2019-05-22         <      1.6000000                NA          Mol/KG
6    CO2 2019-06-20 2019-06-20         <      0.6000000                NA          Mol/KG
7    CO2 2019-07-09 2019-07-09         <      1.2000000                NA          Mol/KG
8    CO2 2019-08-12 2019-08-12         <      0.8000000                NA          Mol/KG
9    CO2 2019-09-11 2019-09-11         <      1.3000000                NA          Mol/KG
10   CO2 2019-10-10 2019-10-10         <      1.0000000                NA          Mol/KG
11   CO2 2019-11-04 2019-11-04                0.7000000                NA          Mol/KG
12   CO2 2019-12-05 2019-12-05                0.9000000                NA          Mol/KG

I want to create 2 new data frames holding the mean, maximum, minimum and standard deviation for 2 groups:

- rows with "<" in Less.Than: Amsterdam_CO2_BelowDL

- rows without "<" in Less.Than: Amsterdam_CO2_AboveDL

#Filter and statistics for rows without "<" in Less.Than
Amsterdam_CO2_AboveDL <- Amsterdam_CO2 %>% 
              dplyr::filter(Less.Than != "<") %>% 
              (summarise(mean_Mesure = mean(Value), max_Mesure = max(Value), min_Mesure = min(Value), sd_Mesure = sd(Value), nbr_Mesure = n()))

> Amsterdam_CO2_AboveDL
    mean_Mesure max_Mesure min_Mesure     sd_Mesure nbr_Mesure
1       0.8         0.9        0.7           0.05      2

#Filter and statistics for rows with "<" in Less.Than         
Amsterdam_CO2_BelowDL <- Amsterdam_CO2 %>%
              dplyr::filter(Less.Than == "<") %>% 
              summarise(mean_DL = mean(Value), max_DL = max(Value), min_DL = min(Value), sd_DL = sd(Value), nbr_DL = n())

> Amsterdam_CO2_BelowDL
    mean_DL max_DL min_DL     sd_DL nbr_DL
1 0.9075575    1.6 0.4609 0.3396243     10

#export in an Excel file
wb = createWorkbook()
sheet1 = createSheet(wb, "Amsterdam_CO2")
cs3 <- CellStyle(wb) + Font(wb, isBold=TRUE) + Border()  # header

addDataFrame(Amsterdam_CO2, sheet=sheet1, startColumn=1, row.names=F)
addDataFrame(Amsterdam_CO2_AboveDL, sheet=sheet1, startRow=(3+nrow(Amsterdam_CO2)), row.names=F, showNA = F, characterNA = "", colnamesStyle=cs3)
addDataFrame(Amsterdam_CO2_BelowDL, sheet=sheet1, startRow=(5+nrow(Amsterdam_CO2)), row.names=F, showNA = F, characterNA = "", colnamesStyle=cs3)
saveWorkbook(wb, "Amsterdam.xlsx")

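However, for most of the initial data frames, all values are below the detection limit, which means that every row has "<" in Less.Than. In that case I get: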

Error in mean(Value) : object 'Value' not found

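So I would like to add something (if ... else?) saying that if the AboveDL or BelowDL data frame is empty (0 obs. of 7 variables), R must still return a data frame with:

mean = -, max = -, min = -, sd = -, nbr = 0

The goal is to get something fairly automated that works whether or not "<" is present in the initial data frame.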

#Filter and statistics for rows without "<" in Less.Than
Amsterdam_CO2_AboveDL <- Amsterdam_CO2 %>% 
              dplyr::filter(Less.Than != "<") %>% 
 ???? if (nrow(Amsterdam_CO2_AboveDL) > 0) 
{  (summarise(mean_Mesure = mean(Value), max_Mesure = max(Value), min_Mesure = min(Value), sd_Mesure = sd(Value), nbr_Mesure = n())) }

??? else {
mean = "-", max = "-", min = "-", sd = "-", nbr = "0" }


#Filter and statistics for rows with "<" in Less.Than         
Amsterdam_CO2_BelowDL <- Amsterdam_CO2 %>%
              dplyr::filter(Less.Than == "<") %>% 
 ???? if (nrow(Amsterdam_CO2_BelowDL) > 0) ???

              summarise(mean_DL = mean(Value), max_DL = max(Value), min_DL = min(Value), sd_DL = sd(Value), nbr_DL = n())

【Comments】:

  • You are on the right track. Use nrow() instead of length() in the if statement.
  • I tried that, but it did not work: Amsterdam_CO2_AboveDL <- Amsterdam_CO2 %>% dplyr::filter(Less.Than == "") %>% if (nrow(Amsterdam_CO2_AboveDL) > 0) { summarise(mean_Mesure = mean(Value), max_Mesure = max(Value), min_Mesure = min(Value), sd_Mesure = sd(Value), nbr_Mesure = n()) } else { mean_Mesure = "-"; max_Mesure = "-"; min_Mesure = "-"; sd_Mesure = "-"; nbr_Mesure = "0"}
  • `if (nrow(...) >0) {summary % summarise(...) } else {summary
  • Thanks for your help. Indeed, I need to brush up on the basics of R syntax.

Tags: r dplyr


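【Solution 1】:
library(dplyr)  # provides %>%, filter(), summarise(), n()
library(xlsx)   # provides createWorkbook(), addDataFrame(), saveWorkbook()

# Placeholder row used when a group has no measurements
blank_df <- data.frame(mean = "-", max = "-", min = "-", sd = "-", nbr = "0")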

Amsterdam_CO2_AboveDL <- dplyr::filter(Amsterdam_CO2, Less.Than != "<") %>% 
  dplyr::summarise(mean_Mesure = mean(Value),
                   max_Mesure = max(Value),
                   min_Mesure = min(Value),
                   sd_Mesure = sd(Value),
                   nbr_Mesure = n())

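# summarise() on an ungrouped, empty data frame still returns one row
# (with nbr_Mesure = 0), so test the count rather than nrow()
if (Amsterdam_CO2_AboveDL$nbr_Mesure == 0)
  Amsterdam_CO2_AboveDL <- blank_df

Amsterdam_CO2_BelowDL <- dplyr::filter(Amsterdam_CO2, Less.Than == "<") %>% 
  dplyr::summarise(mean_Mesure = mean(Value),
                   max_Mesure = max(Value),
                   min_Mesure = min(Value),
                   sd_Mesure = sd(Value),
                   nbr_Mesure = n())

# Same check for the below-detection-limit group
if (Amsterdam_CO2_BelowDL$nbr_Mesure == 0)
  Amsterdam_CO2_BelowDL <- blank_df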

wb = createWorkbook()
sheet1 = createSheet(wb, "Amsterdam_CO2")
cs3 <- CellStyle(wb) + Font(wb, isBold = TRUE) + Border()

addDataFrame(Amsterdam_CO2, sheet = sheet1, startColumn = 1, row.names = FALSE)
addDataFrame(Amsterdam_CO2_AboveDL,
             sheet = sheet1,
             startRow = (3+nrow(Amsterdam_CO2)),
             row.names = FALSE,
             showNA = FALSE,
             characterNA = "",
             colnamesStyle = cs3)
addDataFrame(Amsterdam_CO2_BelowDL,
             sheet = sheet1,
             startRow = (5 + nrow(Amsterdam_CO2)),
             row.names = FALSE,
             showNA = FALSE,
             characterNA = "",
             colnamesStyle = cs3)
saveWorkbook(wb, "Amsterdam.xlsx")
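
Since the question mentions about 50 data frames, the same logic could be wrapped in a small function instead of being copied for each one. The following is only a sketch under stated assumptions: `summarise_or_blank` and the toy data frame `toy` are hypothetical names that do not appear in the original post, and the helper checks for an empty group before summarising rather than after.

```r
library(dplyr)

# Hypothetical helper (not from the original post): summarise the Value
# column, or return "-" placeholders when the group has no rows.
summarise_or_blank <- function(df) {
  if (nrow(df) == 0) {
    # Empty group: placeholder row with a count of zero
    return(data.frame(mean = "-", max = "-", min = "-", sd = "-", nbr = 0L,
                      stringsAsFactors = FALSE))
  }
  df %>%
    dplyr::summarise(mean = mean(Value),
                     max = max(Value),
                     min = min(Value),
                     sd = sd(Value),
                     nbr = n())
}

# Toy data (assumed for illustration): every row is below the detection limit
toy <- data.frame(Less.Than = c("<", "<", "<"),
                  Value = c(1.0, 0.5, 1.5),
                  stringsAsFactors = FALSE)

above <- summarise_or_blank(dplyr::filter(toy, Less.Than != "<"))
below <- summarise_or_blank(dplyr::filter(toy, Less.Than == "<"))

# The same helper can be mapped over a named list of data frames
all_above <- lapply(list(toy = toy), function(d) {
  summarise_or_blank(dplyr::filter(d, Less.Than != "<"))
})
```

Here `above` holds the placeholder row and `below` holds the numeric summary. As a side note, one possible cause of the "object 'Value' not found" error in the question is the pair of parentheses wrapped around `summarise(...)` in the piped call, which may prevent the pipe from passing the data frame in as the first argument.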

【Comments】:
