Tool for calculating co-occurrence matrix of words for NLP task: answers

【Question title】: Tool for calculating co-occurrence matrix of words for NLP task
【Posted】: 2023-03-10 18:30:01
【Question description】:

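I have a 15 GB text file of words. I need to compute the co-occurrence counts of words within a fixed-size window and then process them further. For example, this is my text: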

"foo says hoo, bar says what?"

To build bigrams with co-occurrence frequencies from this text with window size = 4, the output should look like this:

word1-word2-count

foo,says,1

foo,hoo,1

foo,bar,1

says,hoo,2

says,bar,2

says,says,1

hoo,bar,1

hoo,what,1

bar,what,1

says,what,1
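
The counting rule implied by this example can be sketched in a few lines. Reading the text as the six tokens foo, says, hoo, bar, says, what, the listed counts are consistent with two assumptions: each word is paired with the next (window size - 1) words, and a pair is unordered, so (says, hoo) and (hoo, says) are the same entry. The function name `window_pair_counts` is hypothetical, and this is only a minimal in-memory illustration, not something that would scale to 15 GB as written.

```python
from collections import Counter


def window_pair_counts(tokens, window_size):
    """Count unordered pairs of tokens that fall within the same window.

    Each token is paired with the next (window_size - 1) tokens, and
    (a, b) is treated as the same pair as (b, a).
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + window_size]:
            # Sort so that ("says", "hoo") and ("hoo", "says") share one key.
            counts[tuple(sorted((first, second)))] += 1
    return counts


tokens = "foo says hoo bar says what".split()
counts = window_pair_counts(tokens, window_size=4)
for (word1, word2), n in sorted(counts.items()):
    print(f"{word1},{word2},{n}")
```

Because each key is sorted alphabetically, a pair may be printed as hoo,says rather than says,hoo; the counts themselves agree with the listing above.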

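I already know that some tools can do this, such as NLTK, but it is not multithreaded and therefore not suitable for a 15 GB text. Is there a tool that can quickly give me the co-occurrence matrix of words for a given window size?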

【Discussion】:

Tags: python nlp text-processing

【Solution 1】:

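I have looked for such a tool myself and never found one. I usually just write a small script to do it. The following example may be useful to you, although it comes with some limitations: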

import concurrent.futures
from collections import Counter

tokens = []

for _ in range(10):
    tokens.extend(['lazy', 'old', 'fart', 'lying', 'on', 'the', 'bed'])


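def cooccurrances(idx, tokens, window_size):

    # beware this will backfire if you feed it large files (token lists)
    window = tokens[idx:idx+window_size]
    first_token = window.pop(0)

    # return a list rather than yield: a generator would be handed back
    # unevaluated, and the pairing would then run in the main thread
    return [(first_token, second_token) for second_token in window]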

def harvest_cooccurrances(tokens, window_size=3, n_workers=5):
    l = len(tokens)
    harvest = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        future_cooccurrances = {
            executor.submit(cooccurrances, idx, tokens, window_size): idx
            for idx
            in range(l)
        }
        for future in concurrent.futures.as_completed(future_cooccurrances):
            try:
                harvest.extend(future.result())
            except Exception as exc:
                # you may want to add some logging here
                continue


    return harvest

def count(harvest):
    return [
        (first_word, second_word, count) 
        for (first_word, second_word), count 
        in Counter(harvest).items()
    ]


harvest = harvest_cooccurrances(tokens, 3, 5)
counts = count(harvest)

print(counts)

If you simply run the code, you should see something like this (the order of the entries may vary between runs):

[('lazy', 'old', 10),
 ('lazy', 'fart', 10),
 ('fart', 'lying', 10),
 ('fart', 'on', 10),
 ('lying', 'on', 10),
 ('lying', 'the', 10),
 ('on', 'the', 10),
 ('on', 'bed', 10),
 ('old', 'fart', 10),
 ('old', 'lying', 10),
 ('the', 'bed', 10),
 ('the', 'lazy', 9),
 ('bed', 'lazy', 9),
 ('bed', 'old', 9)]

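Limitations:

Because of the slicing, this script does not handle large token lists well
Splitting the window list works here, but you should be aware of it if you plan to do anything else with the sliced window list
You may need to implement something more specific to replace the Counter object so that it does not choke (again, the large-list limitation)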
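
One possible way around the slicing limitation is to stream the tokens and keep only the last (window_size - 1) of them in a bounded `collections.deque`. The sketch below is an illustration under stated assumptions, not part of the script above: `stream_pair_counts` is a hypothetical name, tokens are assumed to be whitespace-separated, the window is allowed to cross line boundaries, and pairs are ordered (earlier word, later word) as in the script's output.

```python
from collections import Counter, deque


def stream_pair_counts(lines, window_size):
    """Count ordered (earlier, later) token pairs without slicing a big list.

    `lines` is any iterable of strings, for example an open file object.
    Tokens are split on whitespace (an assumption), and the window carries
    over from one line to the next.
    """
    counts = Counter()
    # Holds the previous (window_size - 1) tokens; older ones fall off the left.
    recent = deque(maxlen=window_size - 1)
    for line in lines:
        for token in line.split():
            for earlier in recent:
                counts[(earlier, token)] += 1
            recent.append(token)
    return counts


# Same toy data as the script above, fed in line by line.
lines = ["lazy old fart lying on the bed"] * 10
stream_counts = stream_pair_counts(lines, window_size=3)
print(stream_counts[("lazy", "old")], stream_counts[("the", "lazy")])
```

Since an open file is itself an iterable of lines, the same function could be passed a file handle instead of a list. The Counter still grows with the number of distinct pairs, so for a very large vocabulary it may need to be flushed to disk or replaced, as noted in the last limitation.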

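Speculation:

You might be able to write something similar with the spaCy Matcher, but I am not sure it would work, because the wildcards you would need are still a bit shaky (in my experience).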

【Discussion】: