Efficient calculation of pointwise mutual information in a text corpus in Python

【Question title】: Efficient calculation of pointwise mutual information in a text corpus in Python
【Posted】: 2019-01-22 09:51:30
【Question description】:

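I have a corpus in which I count the frequencies of unigrams and skipgrams, normalize these values by dividing them by the sum of all frequencies, and store them in pandas DataFrames. Now I would like to calculate the pointwise mutual information (PMI) of each skipgram, which is the logarithm of the skipgram's normalized frequency divided by the product of the normalized frequencies of the two unigrams in the skipgram.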
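Written out, the quantity described above is the following. The symbols are introduced here only for illustration: $p(x, y)$ stands for the normalized skipgram frequency and $p(x)$, $p(y)$ for the normalized frequencies of its two unigrams. The base-10 logarithm is an assumption taken from the `math.log10` call in the code further down.

```latex
\mathrm{PMI}(x, y) = \log_{10} \frac{p(x, y)}{p(x)\, p(y)}
```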

My DataFrames look like this:

unigram_df.head()
              word  count      prob
0          nordisk      1  0.000007
1           lments      1  0.000007
2             four     91  0.000593
3          travaux      1  0.000007
4  cancerestimated      1  0.000007

skipgram_df.head()
                      words  count      prob
0                 (o, odds)      1  0.000002
1  (reported, pretreatment)      1  0.000002
2       (diagnosis, simply)      1  0.000002
3           (compared, sbx)      1  0.000002
4             (imaging, or)      1  0.000002

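At the moment I calculate the PMI value of each skipgram by iterating over every row of skipgram_df, extracting the skipgram's prob value, extracting the prob values of the two unigrams, computing the logarithm, and appending the result to a list.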

The code looks like this, and it works fine:

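import math

pmi_list = []
for row in skipgram_df.itertuples():
    # row[0] is the index, row[1] the (x, y) word tuple, row[3] the skipgram prob
    skipgram_prob = float(row[3])
    # Each lookup filters the whole unigram_df, which is what makes the loop slow
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][0]), 'prob'].iloc[0])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][1]), 'prob'].iloc[0])
    pmi = math.log10(skipgram_prob/(x_unigram_prob*y_unigram_prob))
    pmi_list.append(pmi)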

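The problem is that iterating over the entire DataFrame takes a long time (about 30 minutes for 300,000 skipgrams). I will have to process corpora 10 to 20 times larger than this, so I am looking for a more efficient way to do it. Can anyone suggest a faster solution? Thanks.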

【Question comments】:

Are the values in skipgram_df['words'] strings or tuples?
@wwii They are tuples.

Tags: python nlp

【Solution 1】:

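I am also trying to solve a similar problem. I don't know how to improve the performance of the code itself, but you can parallelize it, because each calculation is independent of the others. See: Pandas df.iterrow() parallelization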
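Because each row's PMI depends only on that row, the same independence also allows the computation to be written column-wise. The following is a minimal sketch of that idea. It is vectorization, not the parallelization suggested above, and it is not taken from the original thread. It assumes the column names shown in the question (word, words, prob), that every word in unigram_df is unique, and that the words column holds 2-tuples. The toy frames and the names unigram_prob and pairs are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the frames in the question (values are illustrative only).
unigram_df = pd.DataFrame({
    "word": ["imaging", "or", "diagnosis", "simply"],
    "count": [2, 4, 2, 2],
    "prob": [0.2, 0.4, 0.2, 0.2],
})
skipgram_df = pd.DataFrame({
    "words": [("imaging", "or"), ("diagnosis", "simply")],
    "count": [1, 1],
    "prob": [0.5, 0.5],
})

# Build the word -> probability lookup once, instead of filtering
# unigram_df for every skipgram (assumes each word appears only once).
unigram_prob = unigram_df.set_index("word")["prob"]

# Split the (x, y) tuples into two aligned columns of words.
pairs = pd.DataFrame(
    skipgram_df["words"].tolist(),
    columns=["x", "y"],
    index=skipgram_df.index,
)

# Look up both unigram probabilities for all rows in bulk.
x_prob = pairs["x"].map(unigram_prob)
y_prob = pairs["y"].map(unigram_prob)

# Same formula as the loop in the question, applied to whole columns.
skipgram_df["pmi"] = np.log10(skipgram_df["prob"] / (x_prob * y_prob))
print(skipgram_df)
```

With this layout, a skipgram whose word is missing from unigram_df gets NaN in the pmi column, so such rows may need to be checked separately. If the result is still too slow for a larger corpus, the skipgram frame could be split into chunks and each chunk handed to a separate worker, as the answer above suggests.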

【Discussion】: