SuperImpose Histogram fits in one plot ggplot

【Question title】: SuperImpose Histogram fits in one plot ggplot
【Posted】: 2012-11-19 14:37:19
【Question】:

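I have ~5 very large vectors (~108 MM entries each), so any plotting or other work I do with them in R takes a long time.

I am trying to visualize their distributions (histograms), and I was wondering what the best way is to superimpose their histogram distributions in R without it taking too long. I am thinking of first fitting a distribution to each histogram, and then plotting all the fitted distribution lines together in one plot.

Do you have any suggestions on how to do that?

Let's say my vectors are: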

x1, x2, x3, x4, x5.

I am trying to use the code from this question: Overlaying histograms with ggplot2 in R

An example of the code I am using for 3 vectors (R fails to produce the plot):

n = length(x1)
dat <- data.frame(xx = c(x1, x2, x3),yy = rep(letters[1:3],each = n))
ggplot(dat,aes(x=xx)) + 
    geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)

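But it takes forever to produce the plot, and eventually it kicks me out of R. Any ideas on how to use ggplot2 efficiently for large vectors? It seems to me that I would have to create a data frame with 5*108MM entries and then plot it, which is highly inefficient in my case.

Thanks!

【Question comments】:

Tags: r plot histogram ggplot2

【Solution 1】: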

Here's a small snippet of Rcpp that bins data very efficiently. On my computer, it takes about a second to bin 100,000,000 observations:

library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;

    NumericVector::iterator x_it = x.begin(), x_end;
    for(; x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;

        // Make sure there\'s enough space
        if (bin >= out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }

    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')

x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
#   user  system elapsed 
#  1.373   0.000   1.373
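
For readers without a C++ toolchain, the same fixed-width binning can be roughly sketched in base R. This is only an illustration: `bin_r` is a hypothetical name that does not appear in the answer, it uses `floor()` where `bin3` truncates, and it should be expected to run slower than the Rcpp version.

```r
# Hypothetical base-R stand-in for bin3 (not part of the original answer).
# Counts observations in fixed-width bins starting at `origin`; the last
# element holds the number of missing values, mirroring bin3's layout.
bin_r <- function(x, width, origin = 0) {
  nmissing <- sum(is.na(x))                       # NA and NaN both count as missing
  idx <- floor((x[!is.na(x)] - origin) / width)   # zero-based bin index
  idx <- idx[idx >= 0]                            # drop values below the origin
  counts <- tabulate(idx + 1L)                    # tabulate() expects 1-based bins
  c(counts, nmissing)
}

set.seed(1)
x <- c(runif(1e5), NA, NA)
counts <- bin_r(x, 1/100)
```

As with `bin3`, the result is one count per bin followed by the missing-value count, so the last element should be dropped before plotting.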

That said, hist is pretty fast here too:

system.time(hist(x8, breaks = 100, plot = F))
#   user  system elapsed 
#  7.281   1.362   8.669
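
Since `hist(..., plot = FALSE)` returns the counts without drawing anything, one way to avoid the huge data frame from the question is to bin each vector against a shared set of breaks and plot only the counts. The sketch below assumes small simulated vectors in place of the real `x1` ... `x5`; the names `vecs`, `breaks`, and `binned` are illustrative, and the choice of 100 bins is arbitrary.

```r
# Sketch: pre-bin each vector with base hist(), then plot only the counts.
# The vectors below are small simulated stand-ins for the real x1, x2, ...
set.seed(42)
vecs <- list(x1 = rnorm(1e5), x2 = rnorm(1e5, 1), x3 = rnorm(1e5, -1, 2))

# Shared breaks (padded to whole numbers) so counts are comparable
rng <- range(unlist(lapply(vecs, range)))
breaks <- seq(floor(rng[1]), ceiling(rng[2]), length.out = 101)

# One small data frame: 100 rows per vector instead of one row per observation
binned <- do.call(rbind, lapply(names(vecs), function(nm) {
  h <- hist(vecs[[nm]], breaks = breaks, plot = FALSE)
  data.frame(mid = h$mids, count = h$counts, variable = nm)
}))

# Plot the binned counts as overlaid lines (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(binned, ggplot2::aes(mid, count, colour = variable)) +
    ggplot2::geom_line()
  print(p)
}
```

The plotting layer then only handles a few hundred rows, regardless of how long the input vectors are.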

It's easy to use bin3 to make a histogram or frequency polygon:

# First we create some sample data, and bin each column

library(reshape2)
library(ggplot2)

df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))

# Next we match up the bins with the breaks
binsdf <- data.frame(
  breaks = c((0:99) / 100, NA),  # left edge of each 1/100-wide bin; NA marks the missing-value count
  bins)

# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)

FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)

【Discussion】:

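Thanks, looks fast and nice ;-) Love Rcpp, though I'm not familiar with it.
@hadley One small bug: the code needs double quotes to be correct.
Guess we need to add apostrophes embedded in comments to the list of reasons not to use single quotes.
This is nice, but it doesn't solve the problem of overlaying the plots. I rephrased the Q to see if there are more ideas, thanks!
@Dnaiel Use bin3 to compute the bins yourself, then plot.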
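
The suggestion in the last comment might look roughly like the following. This is a sketch under stated assumptions: `tabulate()` stands in for `bin3` so the example runs without compiling anything, the data are simulated, and the names `counts` and `long` are illustrative rather than taken from the thread.

```r
# Sketch: overlaid translucent histograms drawn from pre-computed counts.
# `counts` plays the role of the matrix returned by vapply(df, bin3, ...),
# with the trailing missing-value row already dropped (an assumption).
set.seed(1)
width <- 1/100
df <- as.data.frame(replicate(3, rbeta(1e5, 2, 5)))  # simulated data in (0, 1)
counts <- vapply(
  df,
  function(x) tabulate(floor(x / width) + 1L, nbins = 100L),
  FUN.VALUE = integer(100)
)

# Long format: one row per (bin, variable) pair
long <- data.frame(
  left     = rep((seq_len(100) - 1) * width, times = ncol(counts)),  # bin left edges
  count    = as.vector(counts),
  variable = rep(colnames(counts), each = 100)
)

# Bars drawn on top of each other rather than stacked
# (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(left + width / 2, count, fill = variable)) +
    ggplot2::geom_col(position = "identity", alpha = 0.3, width = width)
  print(p)
}
```

Using `position = "identity"` keeps the bars overlaid, and the low `alpha` lets all distributions stay visible where they overlap.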