How to work out the theta value of my data for use in a negative binomial GLM? (Answers)

[Question title]: How to work out theta value of my data for use in negative binomial GLM?
[Posted]: 2021-11-03 17:11:35
[Question description]:

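I'm trying to fit a GLM to a count data set, but found that my data are overdispersed and therefore not suitable for a Poisson GLM. I understand that I have to use a negative binomial GLM instead, which requires a theta value. However, when I try to run a summary of my model, I get the series of errors below and cannot find the theta value. Any help with this would be greatly appreciated. I'll summarise my data set and the code used to produce the model summary, along with the errors, below.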

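Summary of the data set:

The data used for the GLM are Total (count data) and Treatment (letters representing the different treatments, e.g. C, M, F).

Code used to produce theta: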

    summary(m1 <- glm.nb(Total ~ Treatment, data = twohour))

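Output of this code, with the errors at the bottom:

Any help with producing a theta value would be much appreciated. Thanks in advance.

As requested, the summary and model output as text:

Summary: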

> summary(twohour)


 Treatment          Length             ID              Block1          Block2           Fertility        Notes              Total
 Length:252         Length:252         Min.   : 1.00   Min.   :  0.0   Min.   :  0.00   Min.   :0.0000   Length:252         Min.   :  0.0  
 Class :character   Class :character   1st Qu.:10.00   1st Qu.:125.8   1st Qu.: 39.50   1st Qu.:1.0000   Class :character   1st Qu.:172.2  
 Mode  :character   Mode  :character   Median :19.50   Median :154.0   Median :104.50   Median :1.0000   Mode  :character   Median :263.0  
                                       Mean   :19.89   Mean   :143.5   Mean   : 94.66   Mean   :0.9683                      Mean   :238.1  
                                       3rd Qu.:30.00   3rd Qu.:179.2   3rd Qu.:146.00   3rd Qu.:1.0000                      3rd Qu.:309.5  
                                       Max.   :40.00   Max.   :227.0   Max.   :228.00   Max.   :1.0000                      Max.   :434.0

Model output:

Call:
glm.nb(formula = Total ~ Treatment, data = twohour, init.theta = 2055605.705, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-23.001   -4.624    1.650    4.567   12.571  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   5.577987   0.009846 566.534  < 2e-16 ***
TreatmentC   -0.102625   0.014394  -7.130 1.01e-12 ***
TreatmentF   -0.154580   0.014396 -10.737  < 2e-16 ***
TreatmentF30 -0.298972   0.019920 -15.008  < 2e-16 ***
TreatmentM   -0.158733   0.014613 -10.862  < 2e-16 ***
TreatmentM30 -0.044795   0.013992  -3.201  0.00137 ** 
TreatmentMxF -0.105191   0.014211  -7.402 1.34e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(2055606) family taken to be 1)

    Null deviance: 15127  on 251  degrees of freedom
Residual deviance: 14799  on 245  degrees of freedom
AIC: 16542

Number of Fisher Scoring iterations: 1

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  : 
  invalid 'nsmall' argument
In addition: Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :
  iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :
  iteration limit reached
4: In sqrt(1/i) : NaNs produced

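[Comments]:

Could you post your summary() output and model output as text rather than as screenshots? Screenshots can't be searched and aren't accessible to people using screen readers...
I've done that now; let me know if it helps.
Hmm. Is there a way you can cut and paste while keeping the formatting (tabs or spaces)?
Done, I hope that helps.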

Tags: r glm

[Solution 1]:

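tl;dr I suspect this is driven by outliers, especially (??) a few zero values that are inconsistent with the rest of the data set. If the zero values aren't errors/weird cases, you might consider a zero-inflated model ...???

We would probably need your data to be sure what's going on. Here is what I can gather so far: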

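Some aspects of your results look like underdispersion (the ridiculously large theta estimate, the "iteration limit reached" warnings) ...
... but I agree that your data seem to be overdispersed (the ratio of residual deviance to residual df is large; the range runs from 0 to 434, with a mean of 238)
... the extreme range of the deviance residuals (-23 to +12) suggests outliers (deviance residuals are basically on a log scale ...)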
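To make the "ratio of residual deviance to residual df" check concrete, here is a minimal sketch that just redoes the arithmetic with the numbers printed in the question's model output. The variable names are mine, and the rule of thumb (values far above 1 hint at overdispersion or at a badly fitting model) is only a rough guide.

```r
# Values copied from the model output in the question
resid_dev <- 14799   # residual deviance
resid_df  <- 245     # residual degrees of freedom

# A ratio near 1 is what a well-fitting Poisson-like model would give;
# here it comes out at roughly 60
ratio <- resid_dev / resid_df
ratio

# For a fitted model object `fit` (hypothetical name), the same quantity
# would be deviance(fit) / df.residual(fit)
```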

I can get most of the way there by constructing a data set that is mostly Poisson but has a few extreme outliers:

n <- 252     ## total number of obs
ng <- 7      ## number of groups/treatments
mu <- exp(6)    ## mean response
   ## NOTE: this doesn't match your data, I did it accidentally,
   ##  but it does reproduce the errors.
set.seed(101)
dd <- data.frame(
    ## mostly Poisson, but with 2 values at the min and max values
    y = c(rpois(n-4, lambda=mu), rep(c(0,434), each=2)),
    f = factor(rep(1:ng, length.out = n))
)
summary(dd)
library(MASS)
m2 <- glm.nb(y~f, data = dd)

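(The zero values look like the biggest problem. I can reproduce the problem with 2 (but not 1) zero outliers, with the rest of the data Poisson with a large mean ...)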

Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :
  iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :
  iteration limit reached
4: In sqrt(1/i) : NaNs produced

Results:

Call:
glm.nb(formula = y ~ f, data = dd, init.theta = 12474197.56, 
    link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-28.1363   -0.4887    0.0444    0.7153    3.4771  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  6.0005206  0.0082958 723.319  < 2e-16 ***
f2           0.0006192  0.0117302   0.053  0.95790    
f3           0.0015129  0.0117276   0.129  0.89736    
f4          -0.0328793  0.0118297  -2.779  0.00545 ** 
f5          -0.0195274  0.0117898  -1.656  0.09766 .  
f6           0.0068583  0.0117120   0.586  0.55816    
f7          -0.0087784  0.0117579  -0.747  0.45531    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(12474198) family taken to be 1)

    Null deviance: 1837.2  on 251  degrees of freedom
Residual deviance: 1820.0  on 245  degrees of freedom
AIC: 3795.6

Number of Fisher Scoring iterations: 1

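Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  : invalid 'nsmall' argument

A little digging shows that this particular error occurs because the standard error of the theta estimate can't be computed (it is NaN) ...

Looking at the diagnostics (plot(m2)) shows the outliers clearly:

The following works OK (more or less: it gives a ridiculous theta estimate because, once zero-inflation is accounted for, the data are not overdispersed).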
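If you want to look at theta and its standard error directly rather than through summary(), something like the sketch below should do it. It rebuilds the simulated data from above so that it runs on its own, and reads the `theta` and `SE.theta` components that MASS stores on a glm.nb fit. Treat it as a sketch: the exact numbers depend on your R and MASS versions.

```r
library(MASS)

# Rebuild the simulated data set from above (mostly Poisson, 4 outliers)
n <- 252; ng <- 7; mu <- exp(6)
set.seed(101)
dd <- data.frame(
    y = c(rpois(n - 4, lambda = mu), rep(c(0, 434), each = 2)),
    f = factor(rep(1:ng, length.out = n))
)

# The warnings are expected here, so silence them for this check
m2 <- suppressWarnings(glm.nb(y ~ f, data = dd))

m2$theta       # the theta estimate (implausibly large in the run above)
m2$SE.theta    # its standard error (NaN in the run above)
```

Printing the fit with print(m2) or coef(m2) instead of summary(m2) also avoids the formatting step that trips over the NaN.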

library(pscl)
zeroinfl(y~f, dist="negbin",data = dd)
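
Since the original question was how to get at theta, here is a hedged sketch of pulling it out of the zero-inflated fit. It reuses the simulated data from above, not your real data, and `zi` is just a name I picked. As far as I know, pscl stores the estimate in the `theta` component and the standard error of log(theta) in `SE.logtheta`. Expect optimizer warnings on this data set and don't read much into the value itself.

```r
library(pscl)

# Rebuild the simulated data set so this block runs on its own
n <- 252; ng <- 7; mu <- exp(6)
set.seed(101)
dd <- data.frame(
    y = c(rpois(n - 4, lambda = mu), rep(c(0, 434), each = 2)),
    f = factor(rep(1:ng, length.out = n))
)

# Same call as above, stored so that its components can be inspected
zi <- suppressWarnings(zeroinfl(y ~ f, dist = "negbin", data = dd))

zi$theta         # negative binomial theta for the count component
zi$SE.logtheta   # standard error of log(theta); may be NA/NaN here
```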

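[Comments]:

Thank you very much for your help. The 0s aren't outliers but biological zeros: individuals that produced no offspring under the treatment they received. I'll look into zero-inflated models. Thanks again,
If this solved your problem (see the addition), you're encouraged to click the check mark to accept it...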