【问题标题】：Speeding up a function used to create convenient groups for mapping and charting加速用于创建方便的组以进行映射和图表的功能
【发布时间】：2016-01-15 13:31:52
【问题描述】：

背景

以下有用的讨论，以及我从 SO 同事那里得到的帮助：

我结合了一个便利功能。这需要一个数字向量并生成与组有关的分解向量。

功能

函数体如下所示。

nice.cuts <- function(variable, cuts = 10, thousands.separator = FALSE) {

  # Load required packages (useful when used independently of context)
  Vectorize(require)(package = c("gsubfn", "Hmisc", "scales"),
                     character.only = TRUE)

  # Destring this variable
  destring <- function(x) {
    ## convert factor to strings
    if (is.character(x)) {
      as.numeric(x)
    } else if (is.factor(x)) {
      as.numeric(levels(x))[x]
    } else if (is.numeric(x)) {
      x
    } else {
      stop("could not convert to numeric")
    }
  }

  # Apply function
  variable <- destring(variable)

  # Check whether to disable scientific notation
  if (mean(variable) > 100000) {
    options(scipen = 999)
  } else {
    options(scipen = 0)
  }

  # Create pretty breaks
  cut_breaks <- pretty_breaks(n = cuts)(variable)

  # Round it two decimal places
  variable <- round(variable, digits = 2)

  # Develop cuts according to the provided object
  cuts_variable <- cut2(x = variable, cuts = cut_breaks)

  # Check if variable is total or with decimals
  if (all(cut_breaks %% 1 == 0)) {
    # Variable is integer
    clean_cuts <- gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+',
                         ~paste0(x, '-',as.numeric(y)-1),
                         as.character(cuts_variable))
  } else {
    # Variable is not integer
    # Create clean cuts
    clean_cuts <- gsubfn('\\[\\s*([0-9]+\\.*[0-9]*),\\s*(\\d+\\.\\d+).*',
                         ~paste0(x, '-', as.numeric(y)- 0.01),
                         as.character(cuts_variable))
  }

  # Clean Inf
  clean_cuts <- gsub("Inf", max(variable), clean_cuts)

  # Clean punctuation
  clean_cuts <- sub("\\[(.*), (.*)\\]", "\\1 - \\2", clean_cuts)

  # Replace strings with spaces
  clean_cuts <- gsub("-"," - ",clean_cuts, fixed = TRUE)

  # Trim white spaces
  clean_cuts <- trimws(clean_cuts)

  # Order factor before returning
  clean_cuts <- factor(clean_cuts, levels = unique(clean_cuts[order(variable)]))

  if (thousands.separator == TRUE) {
    levels(clean_cuts) <- sapply(strsplit(levels(clean_cuts), " - "),
                                 function(x) paste(prettyNum(x,
                                                             big.mark = ",",
                                                             preserve.width = "none"),
                                                   collapse = " - "))
  }

  # Return
  return(clean_cuts)
}

结果

该函数在生成用于映射的因子时非常有用。例如以下值：

set.seed(1)
dta <- data.frame(values=floor(runif(100, 10000,90000)))

该函数将产生漂亮的中断

> dta$cuts <- nice.cuts(dta$values, thousands.separator = TRUE)
> t(t(table(dta$cuts))) #' t() for presentation

                  [,1]
  10,000 - 19,999    9
  20,000 - 29,999   11
  30,000 - 39,999   12
  40,000 - 49,999   20
  50,000 - 59,999    6
  60,000 - 69,999   15
  70,000 - 79,999   17
  80,000 - 89,999   10

这可以用来生成惊人的图例：

这在为choropleth maps 生成数据时非常有用，我一直都在使用它。

问题

挑战与性能不佳有关。该功能似乎很慢。

非常小的数据集

100 个观察的小型数据集的性能并不惊人：

> require(microbenchmark)
> dta <- data.frame(values=floor(runif(100, 10000,90000)))
> microbenchmark(nice.cuts(dta$values, thousands.separator = TRUE))
Unit: milliseconds
                                              expr      min       lq     mean   median       uq      max neval
 nice.cuts(dta$values, thousands.separator = TRUE) 32.67988 58.25709 99.26317 95.25195 136.7998 222.2178   100

小数据集

对于稍微大一点的数据集变得非常慢：

> dta <- data.frame(values=floor(runif(1000, 10000,90000)))
> microbenchmark(nice.cuts(dta$values, thousands.separator = TRUE),
+                times = 10)
Unit: milliseconds
                                              expr      min       lq     mean   median       uq      max neval
 nice.cuts(dta$values, thousands.separator = TRUE) 428.6821 901.2123 1154.097 1068.845 1679.052 1708.836    10

因此我的问题相当简单，我想保留nice.cuts 函数的当前功能，但我想让它运行得更快。

建议

我认为gsubfn 元素需要很长时间，但我还没有想出如何提高效率。
我还认为，采用变量的唯一值可能会加快速度。在我的真实数据中，我经常使用重复某些值的向量

【问题讨论】：

这似乎更适合Code Review 姐妹网站。如果你想在这里拥有它，你应该把它归结为代码问题，即更小的问题。
@Roland 感谢您的建议，我很乐意在 CR 删除并重新发布。也许我会等几分钟，看看 SO 社区的成员是否有兴趣。我被performance 标签的可用性误导了，因为这是我的问题所在。
@NicE 非常感谢您的建议。事实上，我对构建有序因子向量很感兴趣。实际上，我经常在形状文件中的ggfortify 之后的data.frames 中已经可用的值上使用该函数，所以我想要一个因子列，用于所有具有括号值而不是核心值的观察值。基本思想是有一组整洁的颜色应用于地图/图表图例。

标签： r performance function vector hmisc

【解决方案1】：

您对完整输入向量上的标签进行所有清理：您首先在cut2 之后生成一个字符向量，然后在此向量上执行大量正则表达式。但是，您只是在修改标签。

因此，在生成cut_breaks 之后，我将首先生成正确格式的标签：cut_labels。我在下面的cut.labels 的新版本中完成了这项工作。与原始版本进行基准测试显示了巨大的改进：

> require(microbenchmark)
> dta <- data.frame(values=floor(runif(1000, 10000,90000)))
> microbenchmark(nice.cuts(dta$values, thousands.separator = TRUE),
+   nice.cuts2(dta$values, thousands.separator = TRUE))
Unit: milliseconds
                                               expr      min        lq     mean    median        uq        max neval cld
  nice.cuts(dta$values, thousands.separator = TRUE) 720.1378 815.51782 902.9218 923.97881 968.39036 1208.00434   100   b
 nice.cuts2(dta$values, thousands.separator = TRUE)  11.4147  15.18232  16.6196  16.46937  17.05305   29.91089   100  a 
>

新版本的 nice.cuts

我取了cuts_variable 的标签，并将原始函数的所有步骤应用于这些标签。然后我用这些新标签覆盖cuts_variable 的标签。

nice.cuts2 <- function(variable, cuts = 10, thousands.separator = FALSE) {

  # Load required packages (useful when used independently of context)
  Vectorize(require)(package = c("gsubfn", "Hmisc", "scales"),
                     character.only = TRUE)

  # Destring this variable
  destring <- function(x) {
    ## convert factor to strings
    if (is.character(x)) {
      as.numeric(x)
    } else if (is.factor(x)) {
      as.numeric(levels(x))[x]
    } else if (is.numeric(x)) {
      x
    } else {
      stop("could not convert to numeric")
    }
  }

  # Apply function
  variable <- destring(variable)

  # Check whether to disable scientific notation
  if (mean(variable) > 100000) {
    options(scipen = 999)
  } else {
    options(scipen = 0)
  }

  # Create pretty breaks
  cut_breaks <- pretty_breaks(n = cuts)(variable)

  # Round it two decimal places
  variable <- round(variable, digits = 2)

  # Develop cuts according to the provided object
  cuts_variable <- cut2(x = variable, cuts = cut_breaks)

  cuts_labels <- levels(cuts_variable)

  # Check if variable is total or with decimals
  if (all(cut_breaks %% 1 == 0)) {
    # Variable is integer
    cuts_labels <- gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+',
                         ~paste0(x, '-',as.numeric(y)-1),
                         as.character(cuts_labels))
  } else {
    # Variable is not integer
    # Create clean cuts
    cuts_labels <- gsubfn('\\[\\s*([0-9]+\\.*[0-9]*),\\s*(\\d+\\.\\d+).*',
                         ~paste0(x, '-', as.numeric(y)- 0.01),
                         as.character(cuts_labels))
  }

  # Clean Inf
  cuts_labels <- gsub("Inf", max(variable), cuts_labels)

  # Clean punctuation
  cuts_labels <- sub("\\[(.*), (.*)\\]", "\\1 - \\2", cuts_labels)

  # Replace strings with spaces
  cuts_labels <- gsub("-"," - ",cuts_labels, fixed = TRUE)

  # Trim white spaces
  cuts_labels <- trimws(cuts_labels)


  if (thousands.separator == TRUE) {
    cuts_labels <- sapply(strsplit(cuts_labels, " - "),
                                 function(x) paste(prettyNum(x,
                                                             big.mark = ",",
                                                             preserve.width = "none"),
                                                   collapse = " - "))
  }

  levels(cuts_variable) <- cuts_labels
  cuts_variable
}

【讨论】：

非常感谢您的关注和宝贵的贡献。您愿意发布基准测试结果以显示速度的提升吗？
@Konrad 添加了一个新版本的nice.cuts 并将其与原始版本进行了对比。
非常感谢 cmets，这对性能来说是一个非常显着的提升！