Integrating ggplot2 with user-defined stat_function() (answers)

【Question title】: Integrating ggplot2 with user-defined stat_function()
【Posted】: 2014-08-29 12:57:54
【Question】:

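I am trying to use the ggplot2 package with a user-defined function passed to its stat_function(). I have tried two approaches. In both cases the identification of the component distributions works fine: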

number of iterations= 11 
summary of normalmixEM object:
         comp 1  comp 2
lambda 0.348900 0.65110
mu     2.019878 4.27454
sigma  0.237472 0.43542
loglik at estimate:  -276.3643

A) However, in the first approach the output contains the following error:

Error in eval(expr, envir, enclos) : object 'comp.number' not found

A reproducible example of this approach follows (faithful is a built-in R dataset):

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(mix, comp.number) {
  g <- stat_function(fun = function(mix, comp.number) 
  {mix$lambda[comp.number] *
     dnorm(x, mean = mix$mu[comp.number],
           sd = mix$sigma[comp.number])}, 
  geom = "line", aes(colour = DISTRIB_COLORS[comp.number]))
  return (g)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS),
                    function(i) plot.components(mix.info, i))
print(g + distComps)

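B) The second approach does not produce any errors. However, the only plot visible is that of the mixed distribution. The plots of its component distributions are not generated or not displayed (it seems to me that a horizontal line at y = 0 is also visible, but I am not 100% sure):

Here is a reproducible example of this approach: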

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(x, mix, comp.number, ...) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number],
          sd = mix$sigma[comp.number], ...)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)))
print(g + distComps)

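Question: What is wrong with each approach, and which one is (more) correct?

UPDATE: Within minutes of posting I realized that I had forgotten to include the line-drawing part of stat_function() in the second approach, so the corresponding lines are as follows: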

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)),
  geom = "line", aes(colour = DISTRIB_COLORS[i]))

However, this update produces an error whose source I don't quite understand:

Error in FUN(1:2[[1L]], ...) : 
  unused arguments (geom = "line", list(colour = DISTRIB_COLORS[i]))
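
The wording of this message fits how the parentheses are placed in the updated snippet: stat_function( is closed right after args = list(...), so geom = "line" and aes(...) become extra arguments of lapply(), which forwards them to the anonymous function(i). The following base R sketch reproduces the same kind of error without ggplot2. The helper double_it() is hypothetical and exists only for illustration.

```r
# Hypothetical stand-in for the function that builds one layer.
double_it <- function(i) i * 2

# Misplaced parenthesis: `geom = "line"` is an argument of lapply(), which
# forwards it to double_it(); double_it() does not accept it, so R stops
# with an "unused argument" error. tryCatch() captures the message.
msg <- tryCatch(
  lapply(1:2, double_it, geom = "line"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Corrected placement: the extra settings stay inside the inner call.
layers <- lapply(1:2, function(i) list(value = double_it(i), geom = "line"))
str(layers)
```

Applied to the snippet above, the same idea means moving the closing parenthesis of stat_function() so that it comes after the aes(...) argument.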

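【Question comments】:

You have quite a mess here. Your normalmixEM function is called on $eruptions, so it looks at the distribution of that variable, but your plot is based on x = waiting, which is a completely different variable. Look at the means and standard deviations in the summary output: they are nowhere near your X-axis values. You are probably seeing the tails of distributions centred at 2.019 and 4.275. Fix all of that, and then we will deal with the various scoping issues and the fact that fun should just be a function of x...
@Spacedman: Thanks! I have already started looking into this.
@Spacedman: I fixed the wrong-variable problem (both approaches now use $waiting) and I see an improvement in the component identification. But the error messages remain the same. Still trying to figure out the scaling/scoping problems.
Fixed the error in approach 2 by allowing extra arguments (...). After reading the information on StackOverflow (stackoverflow.com/a/25091231/2872891) and Hadley's linked comments, I understand that all calculations should be done outside stat_function() and the other ggplot2 functions, because of environment scoping. This partly matches my approach 2, so I focused on fixing it by building a supplementary data frame with the calculated results and passing it to geom_line().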
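
The remark that fun should be a function of x only can be sketched with a small closure factory: each component's parameters are captured when the function is created, so nothing has to be looked up later inside ggplot2. The name make_component_fun() is hypothetical, and the parameter values are copied from the normalmixEM summary quoted in the question (a fit to faithful$eruptions). This is a minimal sketch, not the code from any of the answers.

```r
# Hypothetical helper: returns a function of x alone for one component.
make_component_fun <- function(lambda, mu, sigma) {
  # force() evaluates the arguments now, so each closure keeps its own values
  force(lambda); force(mu); force(sigma)
  function(x) lambda * dnorm(x, mean = mu, sd = sigma)
}

# Parameters copied from the summary output shown in the question.
params <- data.frame(
  lambda = c(0.348900, 0.65110),
  mu     = c(2.019878, 4.27454),
  sigma  = c(0.237472, 0.43542)
)

# One single-argument function per mixture component.
component_funs <- lapply(seq_len(nrow(params)), function(i) {
  make_component_fun(params$lambda[i], params$mu[i], params$sigma[i])
})

# Each function can be evaluated on its own, with no extra arguments.
print(component_funs[[1]](c(2, 3, 4)))
print(component_funs[[2]](c(2, 3, 4)))
```

Each element can then be handed to stat_function(fun = component_funs[[i]], colour = ...), with the colour given as a fixed parameter outside aes(), so no variable such as comp.number has to be resolved while the plot is being built.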

Tags: r plot ggplot2 distribution data-visualization

【Solution 1】:

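I finally figured out how to do what I wanted and reworked my solution. I adapted parts of the answers by @Spacedman and @jlhoward to the following question (which I had not seen when I posted mine): Any suggestions for how I can plot mixEM type data using ggplot2. However, my solution is a bit different. On the one hand, I used @Spacedman's approach based on stat_function(), the same idea I tried in my original version, and I prefer it to the alternative, which seems a bit too complex (although more flexible). On the other hand, similarly to @jlhoward's approach, I simplified the parameter passing. I also introduced some visual improvements, such as automatic selection of distinct colors to make the component distributions easier to identify. For my EDA, I have refactored this code into an R module.

One issue remained that I was still trying to figure out: why the component distribution curves sit lower than the expected density plot, as shown in the figure below. Any suggestions on this issue would be much appreciated!

UPDATE: Finally, I figured out the scaling issue and updated the code and the figure accordingly: the y values need to be multiplied by the binwidth (0.5 in this case) to account for the number of observations per bin.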
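
The relationship between density, bin width and counts behind this update can be checked numerically. The sketch below uses base R's hist() rather than ggplot2, so the bin alignment may differ slightly from geom_histogram(); it only illustrates the arithmetic and is not part of the original answer.

```r
x <- faithful$waiting   # built-in data, recorded in whole minutes
n <- length(x)
binwidth <- 0.5

# Compute the histogram without drawing it.
breaks <- seq(min(x), max(x), by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# density * binwidth is the proportion of observations in each bin ...
proportion_per_bin <- h$density * binwidth
# ... and density * binwidth * n recovers the raw counts.
counts_from_density <- h$density * binwidth * n

print(sum(proportion_per_bin))                    # proportions add up to 1
print(all.equal(counts_from_density, as.numeric(h$counts)))

# Because the values are integers, narrow bins are often empty.
print(mean(h$counts == 0))                        # share of empty bins
```

In plot terms, a density curve drawn over a count histogram would be multiplied by n * binwidth, while a histogram drawn with y = density needs no rescaling of the curve. One thing worth checking for this particular data set: since faithful$waiting holds whole minutes, bins of width 0.5 alternate between occupied and empty, which by itself roughly doubles the density of the occupied bins compared with binwidth = 1.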

这里是完全重新设计的可重现解决方案：

library(ggplot2)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data <- faithful$waiting # use R built-in data

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

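# map the data frame column 'x' rather than the global vector 'data';
# after_stat(density) replaces the deprecated ..density.. (ggplot2 >= 3.3.0)
g <- ggplot(data.frame(x = data), aes(x = x)) +
  geom_histogram(aes(y = 0.5 * after_stat(density)),
                 fill = "white", color = "black", binwidth = 0.5)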

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

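# or, better, use a palette with more color differentiation
# (brewer.pal() warns when asked for fewer than 3 colors, so request
# at least 3 and keep only the first numComponents):
DISTRIB_COLORS <- brewer.pal(max(3, numComponents),
                             "Set1")[seq_len(numComponents)]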

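distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                # the parameter is named 'args'; 'arg' is not matched
                # by current ggplot2 versions
                args = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                linewidth = 2, # use size = 2 with ggplot2 < 3.4.0
                color = DISTRIB_COLORS[i]))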
print(g + distComps)
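
For comparison, the data-frame route mentioned in the comments (compute the curves first, then draw them with geom_line()) might look roughly like the sketch below. It is not the code from the linked answers. To stay independent of mixtools, it reuses the parameter values printed in the question's summary output, which describe a fit to faithful$eruptions; the binwidth of 0.25 and the grid of 200 points are arbitrary choices.

```r
# Parameters copied from the normalmixEM summary shown in the question
# (a fit to faithful$eruptions); reused here for illustration only.
lambda <- c(0.348900, 0.65110)
mu     <- c(2.019878, 4.27454)
sigma  <- c(0.237472, 0.43542)

# Grid of x values spanning the observed data.
x_grid <- seq(min(faithful$eruptions), max(faithful$eruptions),
              length.out = 200)

# One block of rows per component: weighted normal density on the grid.
curves <- do.call(rbind, lapply(seq_along(lambda), function(i) {
  data.frame(
    x = x_grid,
    y = lambda[i] * dnorm(x_grid, mean = mu[i], sd = sigma[i]),
    component = factor(i)
  )
}))

# Draw only if ggplot2 is installed (after_stat() needs ggplot2 >= 3.3.0).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(faithful, aes(x = eruptions)) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
                   fill = "white", colour = "black") +
    geom_line(data = curves, aes(x = x, y = y, colour = component))
  print(p)
}
```

Because the histogram is drawn on the density scale here, the precomputed curves need no extra scaling, and mapping colour to the component column yields a legend without managing the colors by hand.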

【Comments】: