[Question title]: R - iteratively apply a function to a list of variables
[Posted]: 2015-10-26 08:19:04
[Question]:

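My goal is to create a function that, when looped over multiple variables of a data frame, returns a new data frame containing the percentage and 95% confidence interval for each level of each variable.

For example, if I applied this function to "cyl" and "am" from the mtcars data frame, I would want this as the final result: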

  variable level                ci.95
1      cyl     4 34.38 (19.50, 53.11)
2      cyl     6 21.88 (10.35, 40.45)
3      cyl     8 43.75 (27.10, 61.94)
4       am     0  59.38 (40.94, 75.5)
5       am     1 40.62 (24.50, 59.06) 

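So far, my function seems to work for a single variable; however, I have two issues that I hope the community can help me with:

  1. Generally R-ifying my code. I'm still an R newbie. I've read enough posts to know that R enthusiasts generally discourage for loops, but I still struggle with the apply functions (which seem to be the alternative to for loops in most cases).

  2. Applying this function to a list of variables, producing a single data frame that contains the function's return values for every level of every variable.

Here is my code so far: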

t1.props <- function(x, data = NULL) {

  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,deparse(substitute(x))]
  } else {
    var <- x
  }

  # Grab variable name for use in output
  var.name <- substitute(x)

  # Omit observations with missing data
  var.clean <- na.omit(var)

  # Number of nonmissing observations
  n <- length(var.clean)

  # Grab levels of variable
  levels <- sort(unique(var.clean))

  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)

  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)

    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)

    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))

    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))

    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))

    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
           qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
           qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))

    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))

    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                round(get(ucl) * 100, 2), ")")

    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }

  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}

data(mtcars)
t1.props(cyl, mtcars)

Thanks for any help or advice you can offer.

[Comments]:

    Tags: r for-loop lapply


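    [Answer 1]:

    The nice thing about the functions you're using is that they are already vectorized (sd is the exception, since it collapses a vector to a single value; qt already accepts vector arguments, and Vectorize can wrap functions that do not). This means you can pass vectors to them without writing a single loop. I've left out the parts of the function that handle preparing the input and prettifying the output.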

    t1.props <- function(var, data=mtcars) {
        N <- nrow(data)
        levels <- names(table(data[,var]))
        count <- unclass(table(data[,var]))        # counts
        prop <- count / N                          # proportions
        se <- sqrt(prop * (1-prop)/(N-1))          # standard errors of props.
        lprop <- log(prop) - log(1-prop)           # logit of prop
        lse <- se / (prop*(1-prop))                # se on the logit scale
        stat <- Vectorize(qt, "df")(0.975, N-1)    # tstats
        llower <- lprop - stat*lse                 # log lower 
        lupper <- lprop + stat*lse                 # log upper
        lower <- exp(llower) / (1 + exp(llower))   # lower ci
        upper <- exp(lupper) / (1 + exp(lupper))   # upper ci
    
        data.frame(variable=var,
                   level=levels,
                   perc=100*prop,
                   lower=100*lower,
                   upper=100*upper)
    }
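
    One piece of input preparation that the function above leaves out is dropping missing values, which the question's version does with na.omit. A minimal sketch of how that could be folded in, assuming each variable should be cleaned on its own; the name t1.props.na is hypothetical, and qlogis/plogis are just the built-in logit and inverse logit:

```r
# Hypothetical NA-aware variant of the vectorized function (name is illustrative).
t1.props.na <- function(var, data = mtcars) {
    x     <- data[[var]]
    x     <- x[!is.na(x)]                        # drop missing values first
    n     <- length(x)                           # nonmissing observations
    count <- table(x)                            # counts per level
    prop  <- as.vector(count) / n                # proportions
    se    <- sqrt(prop * (1 - prop) / (n - 1))   # standard errors of props
    lprop <- qlogis(prop)                        # same as log(prop) - log(1 - prop)
    lse   <- se / (prop * (1 - prop))            # se on the logit scale
    stat  <- qt(0.975, n - 1)                    # t quantile
    lower <- plogis(lprop - stat * lse)          # inverse logit of lower bound
    upper <- plogis(lprop + stat * lse)          # inverse logit of upper bound

    data.frame(variable = var,
               level    = names(count),
               perc     = 100 * prop,
               lower    = 100 * lower,
               upper    = 100 * upper)
}

# Example with two values set to missing
d <- mtcars
d$cyl[1:2] <- NA
res_na <- t1.props.na("cyl", d)
```

    Note that a level with a proportion of exactly 0 or 1 gives an infinite logit, so this sketch assumes every level has at least one observation and no level covers all of them.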
    

    So the only explicit apply/loop appears when you apply the function to several variables, like this:

    ## Apply your function to two variables
    do.call(rbind, lapply(c("cyl", "am"), t1.props))
    #   variable level   perc    lower    upper
    # 4      cyl     4 34.375 19.49961 53.11130
    # 6      cyl     6 21.875 10.34883 40.44691
    # 8      cyl     8 43.750 27.09672 61.94211
    # 0       am     0 59.375 40.94225 75.49765
    # 1       am     1 40.625 24.50235 59.05775
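
    The numeric columns can be turned into the single ci.95 string from the question with one vectorized sprintf call. This is only a sketch: format_ci is a hypothetical helper, and res is rebuilt here from the values printed above so the example runs on its own. Unlike round, sprintf keeps trailing zeros, so every bound is shown with two decimals.

```r
# Values copied from the printed output above; `res` stands in for the combined result.
res <- data.frame(
    variable = c("cyl", "cyl", "cyl", "am", "am"),
    level    = c("4", "6", "8", "0", "1"),
    perc     = c(34.375, 21.875, 43.750, 59.375, 40.625),
    lower    = c(19.49961, 10.34883, 27.09672, 40.94225, 24.50235),
    upper    = c(53.11130, 40.44691, 61.94211, 75.49765, 59.05775)
)

# Hypothetical helper: sprintf is vectorized, so it formats every row at once.
format_ci <- function(perc, lower, upper) {
    sprintf("%.2f (%.2f, %.2f)", perc, lower, upper)
}

res$ci.95 <- format_ci(res$perc, res$lower, res$upper)
res[, c("variable", "level", "ci.95")]
```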
    

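    As for the loop in your code, it doesn't matter much in terms of efficiency, but you can see how much easier code is to read when it is concise, and the apply functions offer a lot of simple one-line solutions.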
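
    To illustrate the readability point, here is a small sketch that counts the levels of each variable, first with a for loop and then with sapply. The variable names are illustrative and not part of the original code.

```r
vars <- c("cyl", "am")

# Loop version: fill a named result vector one element at a time.
n_levels_loop <- integer(0)
for (v in vars) {
    n_levels_loop[v] <- length(unique(mtcars[[v]]))
}

# One-line equivalent: sapply iterates over the selected columns.
n_levels <- sapply(mtcars[vars], function(x) length(unique(x)))

# lapply keeps one result per variable, here a table of proportions per level.
props <- lapply(mtcars[vars], function(x) prop.table(table(x)))
```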

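    I think the most important thing to change in your code is the use of assign and get. Instead, you can store the variables in a list or another data structure and use setNames, `names<-`, or names(...) <- to name the components when needed.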
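
    As a rough sketch of that suggestion, the per-level values can live in a named list or vector instead of variables created with assign. The names prop4, se4 and so on are chosen here for illustration only.

```r
x  <- mtcars$cyl
lv <- sort(unique(x))

# Instead of assign(paste0("prop", i), ...), collect the values in a named list.
props <- setNames(
    lapply(lv, function(l) sum(x == l) / length(x)),
    paste0("prop", lv)
)
props[["prop4"]]   # looked up by name, no get() needed

# The same idea with a plain numeric vector and names(...) <-
se <- vapply(lv, function(l) sd(x == l) / sqrt(length(x)), numeric(1))
names(se) <- paste0("se", lv)
se[["se4"]]
```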

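    [Comments]:

    • The level column doesn't seem to match the output. Maybe levels <- sort(unique...
    • Check the OP's desired output, and note the difference between the row names and the level column.
    • Because the output for cylinder 6 is in row 2, and the output for cylinder 4 is in row 1. This is not just cosmetic: the level column says one thing and the confidence interval says another.
    • Thanks for the feedback and for spotting the discrepancy in the estimates.
    • np, nice job simplifying the whole thing.
    [Answer 2]:

    You can also keep the function essentially unchanged and use lapply over it: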

    vars <- c("cyl", "am")
    lapply(vars, t1.props, data=mtcars)
    [[1]]
      variable level                ci.95
    1      cyl     4 34.38 (19.50, 53.11)
    2      cyl     6 21.88 (10.35, 40.45)
    3      cyl     8 43.75 (27.10, 61.94)
    
    [[2]]
      variable level                ci.95
    1       am     0  59.38 (40.94, 75.5)
    2       am     1 40.62 (24.50, 59.06)
    

    And to combine them all into a single data frame:

    lst <- lapply(vars, t1.props, data=mtcars)
    do.call(rbind,lst)
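
    For readers new to do.call: it calls rbind with the list elements as its arguments, so do.call(rbind, lst) is the same as rbind(lst[[1]], lst[[2]], ...) for a list of any length. A small self-contained sketch with stand-in data frames (the contents are placeholders, not the real function output):

```r
# Stand-in pieces with the same shape as the per-variable results.
lst <- list(
    data.frame(variable = "cyl", level = c("4", "6", "8")),
    data.frame(variable = "am",  level = c("0", "1"))
)

combined <- do.call(rbind, lst)   # same as rbind(lst[[1]], lst[[2]])
rownames(combined) <- NULL        # reset to plain 1..n row names
combined
```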
    

    Data

    You would have to simplify the var and var.name assignments to:

    t1.props <- function(x, data = NULL) {
    
      # Grab dataframe and/or variable name
      if(!missing(data)){
        var <- data[,x]
      } else {
        var <- x
      }
    
      # Grab variable name for use in output
      var.name <- x
    
      # Omit observations with missing data
      var.clean <- na.omit(var)
    
      # Number of nonmissing observations
      n <- length(var.clean)
    
      # Grab levels of variable
      levels <- sort(unique(var.clean))
    
      # Create an empty data frame to store values
      out <- data.frame(variable = NA,
                        level = NA,
                        ci.95 = NA)
    
      # Estimate prop, se, and ci for each level of the variable
      for(i in seq_along(levels)) {
        prop <- paste0("prop", i)
        se <- paste0("se", i)
        log.prop <- paste0("log.trans", i)
        log.se <- paste0("log.se", i)
        log.l <- paste0("log.l", i)
        log.u <- paste0("log.u", i)
        lcl <- paste0("lcl", i)
        ucl <- paste0("ucl", i)
    
        # Find the proportion for each level of the variable
        assign(prop, sum(var.clean == levels[i]) / n)
    
        # Find the standard error for each level of the variable
        assign(se, sd(var.clean == levels[i]) /
                 sqrt(length(var.clean == levels[i])))
    
        # Perform a logit transformation of the original percentage estimate
        assign(log.prop, log(get(prop)) - log(1 - get(prop)))
    
        # Transform the standard error of the percentage to a standard error of its
        # logit transformation
        assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
    
        # Calculate the lower and upper confidence bounds of the logit
        # transformation
        assign(log.l,
               get(log.prop) -
                 qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
        assign(log.u,
               get(log.prop) +
                 qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    
        # Finally, perform inverse logit transformations to get the confidence bounds
        assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
        assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
    
        # Create a combined 95% CI variable for easy copy/paste into Word tables
        ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                        "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                        round(get(ucl) * 100, 2), ")")
    
        # Populate the "out" data frame with values
        out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
      }
    
      # Remove first (empty) row from out
      # But only in the first iteration
      if (is.na(out[1,1])) {
        out <- out[-1, ]
        rownames(out) <- 1:nrow(out)
      }
      out
    }
    

    [Comments]:

    • This seems to be exactly what I need. Thank you!