Intuition from some examples

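Given a random variable $X$ that follows some distribution, we can estimate its mean or variance by sampling it. Suppose we draw $N$ samples $X_1,\dots,X_N$; the sample mean is
$$\overline{X}=\frac{1}{N}\sum_{n=1}^N X_n$$
As $N$ grows, one would expect $\overline{X}$ to get closer and closer to the true mean $\mathbb{E}[X]$ and eventually equal it. Simulation results, however, indicate otherwise.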
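Example 1: Bernoulli distribution, $X\sim \text{Bernoulli}(p=0.6)$. As $N$ increases, the sample mean follows the curve shown in the figure below.
[Figure: Sample mean of a random variable]
The sample mean does appear to converge to $p=0.6$. Zooming in, however, shows that the curve is still jittering; that is, at any finite $N$ it has not settled on a single point.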

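Example 2: Gaussian distribution, $X\sim N(0,1)$. The sample mean curve:
[Figure: Sample mean of a random variable]

Even beyond $N=10^6$ samples the sample mean is still fluctuating. The fluctuations are small, but this is not the convergence to a single point one might have imagined.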

These simulations show that, for any finite $N$, the sample mean does not land exactly on the population mean (i.e., the true mean). It should be close to the population mean, but it will generally not equal it exactly.
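The experiment above can be reproduced with a short script. The following is only a sketch, not the code behind the original figures: the helper names (`running_sample_mean`, `bernoulli`, `standard_normal`), the seed, and the reduced sample count of 100,000 (chosen to keep pure Python fast) are all assumptions made for illustration.

```python
import random

def running_sample_mean(draw, n_samples, seed=0):
    """Return the sample means after 1, 2, ..., n_samples draws."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for reproducibility
    total = 0.0
    means = []
    for n in range(1, n_samples + 1):
        total += draw(rng)
        means.append(total / n)
    return means

def bernoulli(p):
    """Build a sampler that returns 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0

def standard_normal(rng):
    """Draw one sample from N(0, 1)."""
    return rng.gauss(0.0, 1.0)

# Assumed sample count; the figures in the text go further (about 10^6).
N_SAMPLES = 100_000
bern_means = running_sample_mean(bernoulli(0.6), N_SAMPLES)
gauss_means = running_sample_mean(standard_normal, N_SAMPLES)

# Print the running mean at a few checkpoints to see it hover near 0.6 and 0.
for n in (100, 10_000, 100_000):
    print(n, bern_means[n - 1], gauss_means[n - 1])
```

Plotting `bern_means` and `gauss_means` against $N$ (for instance with a plotting library of your choice) should give curves of the same shape as the figures: close to the population mean, but still jittering when zoomed in.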

Main Results

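Put another way: even when $N$ is large, the $\overline{X}$ obtained from one batch of $N$ samples is not a fixed value. It is itself a random variable with its own distribution.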

Theorem 1 (mean of sample mean). If $\mathbb{E}[X]=\mu$, then $\mathbb{E}[\overline{X}]=\mu$.

Theorem 1 is easy to interpret: the mean of the sample mean of $X$ is the mean of $X$. It is also easy to verify:

$$\mathbb{E}[\overline{X}]= \mathbb{E}\bigg[\frac{1}{N}\sum_{n=1}^N X_n\bigg] =\frac{1}{N}\sum_{n=1}^N \mathbb{E}[X_n]=\mathbb{E}[X]=\mu,$$

because the samples are i.i.d., so $\mathbb{E}[X_n]=\mathbb{E}[X]$ for every $n$.

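Theorem 2 (variance of sample mean). If $\text{var}[X]=\sigma^2$, then $\text{var}[\overline{X}]=\frac{\sigma^2}{N}$.

$$\text{var}[\overline{X}]= \text{var}\bigg[\frac{1}{N}\sum_{n=1}^N X_n\bigg] =\frac{1}{N^2}\sum_{n=1}^N \text{var}[X_n]=\frac{N\sigma^2}{N^2}=\frac{\sigma^2}{N},$$

where the second equality uses $\text{var}[aX]=a^2\,\text{var}[X]$ together with the independence of the samples, which lets the variance of the sum split into a sum of variances.

Theorem 2 also shows why drawing more samples helps: the larger $N$ is, the smaller the variance of the sample mean, and the more tightly it concentrates around the population mean.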
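Both theorems can be checked empirically by repeating the "draw $N$ samples and average" experiment many times and looking at the mean and variance of the resulting sample means. The sketch below assumes $X\sim N(0,1)$ (so $\mu=0$ and $\sigma^2=1$); the function names, the number of trials, and the seed are illustrative choices, not part of the original post.

```python
import random
import statistics

def sample_mean(rng, n):
    """Average of n i.i.d. draws from N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n

def sample_mean_stats(n, trials=2000, seed=0):
    """Empirical mean and variance of the sample mean over many trials."""
    rng = random.Random(seed)
    means = [sample_mean(rng, n) for _ in range(trials)]
    return statistics.fmean(means), statistics.pvariance(means)

# Theorem 1 predicts a mean near mu = 0; Theorem 2 predicts a variance
# near sigma^2 / N = 1 / N.
for n in (10, 100, 1000):
    m, v = sample_mean_stats(n)
    print(f"N={n}: mean={m:.4f}  var={v:.6f}  sigma^2/N={1 / n:.6f}")
```

The empirical variance should shrink by roughly a factor of ten each time $N$ grows tenfold, while the empirical mean stays near zero.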

Conclusion

Overall, the sample mean is not a robust statistic, meaning that it is sensitive to outliers. From a finite sample we cannot pin down the population mean exactly. We can only give a lower bound and an upper bound, and say how confident we are (in %) that the population mean lies between them, i.e., inside the confidence interval.

[Figure: Sample mean of a random variable]

The confidence interval is $\big[\overline{X}-E, \overline{X}+E\big]$, where $E$ is called the margin of error and is given by
$$E=z_{\alpha/2}\frac{\sigma}{\sqrt{N}}$$

$z_{\alpha/2}$: critical value, which can be computed from the standard normal distribution given $\alpha/2$.
$\alpha$: significance level.
$CL=1-\alpha$: confidence level.

As shown in the figure,

  1. Given $CL=95\%$;
  2. Calculate $\alpha = 0.05$ and $\alpha/2 = 0.025$;
  3. Look up the standard normal table to find $z_{\alpha/2}=z_{0.025}=1.96$;
  4. Compute $E=z_{\alpha/2}\frac{\sigma}{\sqrt{N}}$, and the confidence interval.
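The four steps above can be sketched in Python using only the standard library. This assumes $\sigma$ is known, as in the formula for $E$, and takes $z_{\alpha/2}$ to be the upper-tail critical value of the standard normal. The function name `confidence_interval` and the input numbers in the usage lines are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, confidence_level=0.95):
    """Return (lower, upper) bounds of the interval for a known sigma."""
    alpha = 1.0 - confidence_level               # step 2: significance level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # step 3: about 1.96 for CL = 95%
    margin = z * sigma / math.sqrt(n)            # step 4: margin of error E
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: a Bernoulli(0.6) population, whose standard deviation
# is sqrt(0.6 * 0.4), and an observed sample mean of 0.58 from N = 1000 draws.
low, high = confidence_interval(sample_mean=0.58, sigma=math.sqrt(0.6 * 0.4), n=1000)
print(f"95% confidence interval: [{low:.4f}, {high:.4f}]")
```

Using `NormalDist().inv_cdf` replaces the table lookup in step 3; for other confidence levels only the `confidence_level` argument needs to change.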
