Zipf distribution: how do I measure a Zipf distribution

【Question title】: Zipf Distribution: How do I measure Zipf Distribution
【Posted】: 2017-04-28 02:54:27
【Question description】:

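How do I measure or find the Zipf distribution? For example, I have a corpus of English words. How do I find the Zipf distribution? I need to find the Zipf distribution and then plot a graph of it, but I am stuck at the first step, which is finding the Zipf distribution.

Edit: From the frequency count of each word, it is obvious that the corpus obeys Zipf's law. But my aim is to plot a Zipf distribution graph, and I do not know how to calculate the data for that graph.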

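【Comments】:

Is this your homework? Please tell us what you have tried.
No, this is not my homework. It is more of a hobby project. I am analyzing an ancient script called the Indus Script. More details here: (journals.plos.org/plosone/article?id=10.1371/…). The script is made of symbols rather than words. I first translated the corpus of symbols into a sequence of numbers and ran a numerical analysis on it. For the Zipf distribution, I have already computed the frequency of each symbol, and I do not know how to proceed from there.
If your goal is only to plot it, just plot a histogram of the counts: matplotlib.org/1.2.1/examples/api/histogram_demo.html
No, I need to plot a Zipf distribution graph to show that the data in the corpus conforms to Zipf's law. From the frequency counts I can clearly see that it does obey Zipf's law, but I should be able to fit it to a Zipf distribution plot.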

Tags: python numpy scipy statistics zipf

【Answer 1】:

I won't pretend to understand statistics. However, based on what I read on the scipy site, here is a naive attempt in Python.

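Build the data

First we get the data. As an example, we download the National Library of Medicine MeSH (Medical Subject Headings) ASCII file d2016.bin (28 MB).
Next, we open the file and read it into a string.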

with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()

Next we locate the individual words in the file and extract them.

words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

Finally we build a dictionary with the unique words as keys and the number of occurrences of each word as values.

frequency = {}
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
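As an alternative sketch of the counting step, the standard library's collections.Counter builds the same word-to-count mapping in one call. The function name count_words and the sample sentence are illustrative assumptions; the regular expression is the one used above (without the capturing group), which keeps words of 3 to 10 letters whose letters after the first are lowercase.

```python
import re
from collections import Counter

# Same pattern as above: one letter followed by 2 to 9 lowercase letters.
WORD_PATTERN = re.compile(r'\b[A-Za-z][a-z]{2,9}\b')


def count_words(text):
    """Return a Counter mapping each matched word to its number of occurrences."""
    return Counter(WORD_PATTERN.findall(text))


# Hypothetical sample text, used only to show the call.
frequency = count_words("The cat saw the other cat in a hat")
print(frequency.most_common(2))
```

Note that the match is case-sensitive, so "The" and "the" are counted as different words, and short words such as "in" are skipped by the pattern.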

Build the Zipf distribution data
For speed, we limit the data to 1000 words.

n = 1000
# list(...) is needed on Python 3, where dict views cannot be sliced
frequency = dict(list(frequency.items())[:n])
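Slicing the dictionary items keeps n entries in dictionary order, which are not necessarily the n most frequent ones. If the most frequent words are wanted instead, a sketch along these lines may help; top_n_by_count and the toy counts are hypothetical names and data used only for illustration.

```python
from collections import Counter


def top_n_by_count(frequency, n):
    """Return (rank, word, count) tuples for the n most frequent words, rank starting at 1."""
    ranked = Counter(frequency).most_common(n)
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


# Hypothetical counts, not taken from the MeSH file.
toy_counts = {"cell": 40, "acid": 19, "gene": 13, "virus": 10, "lung": 8}
for rank, word, count in top_n_by_count(toy_counts, 3):
    print(rank, word, count)
```

The rank column produced here is also what a rank-frequency view of Zipf's law needs, since the law relates a word's frequency to its rank.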

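After that we take the frequency values, convert them to a numpy array, and use the numpy.random.zipf function to draw samples from a Zipf distribution.

We use a distribution parameter a = 2. as an example, since it needs to be greater than 1. To keep the plot readable, we restrict the data to counts below 50.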

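a = 2.  # distribution parameter
s = np.array(list(frequency.values()))  # list(...) is needed on Python 3

# density=True is the current spelling; older Matplotlib used normed=True
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
# zetac(a) equals zeta(a) - 1; the exact Zipf normalizer is special.zeta(a)
y = x**(-a) / special.zetac(a)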
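The text above mentions numpy.random.zipf, but the snippet does not call it. As a hedged sketch of what drawing from that distribution looks like, the example below uses NumPy's Generator API (np.random.default_rng) and compares the empirical frequencies of the first few values with the probability k**(-a) / zeta(a). The sample size, the seed, and the partial-sum approximation of zeta(a) are assumptions made for illustration; scipy.special.zeta(a) returns the same constant directly.

```python
import numpy as np

a = 2.0              # distribution parameter, must be greater than 1
n_samples = 100_000  # illustrative sample size

# Draw integer samples k >= 1 with P(k) proportional to k**(-a).
rng = np.random.default_rng(seed=0)
samples = rng.zipf(a, size=n_samples)

# Approximate the normalizing constant zeta(a) by a long partial sum.
zeta_a = np.sum(np.arange(1.0, 1_000_001.0) ** (-a))

# Compare empirical and theoretical probabilities for k = 1..5.
k = np.arange(1.0, 6.0)
empirical = np.array([np.mean(samples == value) for value in k])
theoretical = k ** (-a) / zeta_a

for value, emp, theo in zip(k, empirical, theoretical):
    print(int(value), round(float(emp), 4), round(float(theo), 4))
```

For a = 2 the constant zeta(a) is pi**2 / 6, so roughly 61% of the samples are expected to equal 1.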

Finally, plot the data.

Putting it all together

import re
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Get our corpus of medical words
with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

# Build a dict mapping each word to its number of occurrences
frequency = {}
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Limit the dict to n words (list(...) is needed on Python 3,
# where dict views cannot be sliced)
n = 1000
frequency = dict(list(frequency.items())[:n])

# Convert the counts to a numpy array
s = np.array(list(frequency.values()))

# Calculate the zipf curve and plot the data
a = 2.  # distribution parameter
# density=True is the current spelling; older Matplotlib used normed=True
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
# zetac(a) equals zeta(a) - 1; the exact Zipf normalizer is special.zeta(a),
# but the constant cancels in y/max(y) below
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()

Plot
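Since the question asks for a plot showing that the counts follow Zipf's law, a rank-frequency view on log-log axes is a common way to look at it: if frequency is roughly proportional to rank**(-s), the points fall near a straight line with slope -s. The sketch below estimates that slope with an ordinary least-squares fit on the logarithms. The function name fit_zipf_exponent and the synthetic counts are assumptions for illustration, and a least-squares fit of this kind is only a rough estimate, not a rigorous test of the law.

```python
import numpy as np


def fit_zipf_exponent(counts):
    """Fit log(frequency) = intercept + slope * log(rank) by least squares.

    counts must be positive. Returns (ranks, sorted_counts, slope, intercept).
    """
    sorted_counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(sorted_counts) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(sorted_counts), 1)
    return ranks, sorted_counts, slope, intercept


# Synthetic counts that follow frequency = 1000 / rank exactly.
toy_counts = [1000.0 / rank for rank in range(1, 51)]
ranks, sorted_counts, slope, intercept = fit_zipf_exponent(toy_counts)
print(round(float(slope), 3), round(float(np.exp(intercept)), 1))
```

With matplotlib, plt.loglog(ranks, sorted_counts, 'o') followed by plt.loglog(ranks, np.exp(intercept) * ranks**slope) would draw the data and the fitted line on log-log axes. With real data, passing the values of the frequency dictionary as counts gives the corresponding estimate.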

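【Comments】:

Thanks for the detailed explanation. I get an error at: count, bins, ignored = plt.hist(s[s<50], 50, normed=True) TypeError: unorderable types: dict_values() < int(). I am using Python 3.5.2, but I got the same error on Python 2.x. I will try to fix it and see how it goes.
Great, that worked. Thank you very much for your help.
This answer is missing a lot of variables. The line y = x**(-a) / special.zetac(a) does not compile because special is not defined. Also, despite mentioning it, you never use numpy.random.zipf. It would be great if you could update this answer with all the variables and steps included.
Hi, the solution still works for me with Python 2.7. For Python 3.x it may need some changes. The static value of special.zetac(a) is 0.6449340668482264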