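I looked for a tool like that myself but never found one. I usually just write a script to do it. Here is an example, along with some limitations that may be useful to you: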
import concurrent.futures
from collections import Counter

# toy corpus: the same seven tokens repeated ten times
tokens = []
for _ in range(10):
    tokens.extend(['lazy', 'old', 'fart', 'lying', 'on', 'the', 'bed'])


def cooccurrances(idx, tokens, window_size):
    # beware this will backfire if you feed it large files (token lists)
    window = tokens[idx:idx+window_size]
    first_token = window.pop(0)
    # build the list here so the work happens in the worker thread
    # (a generator would defer it to whoever consumes the result)
    return [(first_token, second_token) for second_token in window]


def harvest_cooccurrances(tokens, window_size=3, n_workers=5):
    l = len(tokens)
    harvest = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        # one task per token position; each returns the pairs for its window
        future_cooccurrances = {
            executor.submit(cooccurrances, idx, tokens, window_size): idx
            for idx
            in range(l)
        }
        for future in concurrent.futures.as_completed(future_cooccurrances):
            try:
                harvest.extend(future.result())
            except Exception as exc:
                # you may want to add some logging here
                continue
    return harvest


def count(harvest):
    # flatten Counter entries into (first_word, second_word, count) tuples
    return [
        (first_word, second_word, count)
        for (first_word, second_word), count
        in Counter(harvest).items()
    ]


harvest = harvest_cooccurrances(tokens, 3, 5)
counts = count(harvest)
print(counts)
If you simply run the code, you should see something like this (the order of the pairs can differ between runs):
[('lazy', 'old', 10),
('lazy', 'fart', 10),
('fart', 'lying', 10),
('fart', 'on', 10),
('lying', 'on', 10),
('lying', 'the', 10),
('on', 'the', 10),
('on', 'bed', 10),
('old', 'fart', 10),
('old', 'lying', 10),
('the', 'bed', 10),
('the', 'lazy', 9),
('bed', 'lazy', 9),
('bed', 'old', 9)]
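Limitations:
- this script does not handle large token lists well, because of the slicing
- splitting the window list with pop works here, but be aware of it if you intend to do anything else with the window slice
- you may need to implement something specific to replace the Counter object in case it chokes (again, the large-list limitation)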
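One way around the slicing limitation, as a minimal sketch rather than a tested replacement: keep the current window in a collections.deque and feed the pairs straight into the Counter, so neither the per-position slices nor the intermediate harvest list are built. The names iter_cooccurrences and count_cooccurrences are made up for this illustration, and the sketch is single-threaded, on the assumption that the pairing itself is cheap enough not to need a thread pool.

```python
from collections import Counter, deque
from itertools import islice


def iter_cooccurrences(tokens, window_size=3):
    """Lazily yield (first, second) pairs from any iterable of tokens."""
    it = iter(tokens)
    # The deque is the current window: one head token plus its followers.
    window = deque(islice(it, window_size))
    while window:
        first = window.popleft()
        for second in window:
            yield first, second
        # Pull in at most one new token; near the end the window just shrinks.
        for token in islice(it, 1):
            window.append(token)


def count_cooccurrences(tokens, window_size=3):
    # Counter consumes the generator directly, so no pair list is stored.
    return Counter(iter_cooccurrences(tokens, window_size))


tokens = ['lazy', 'old', 'fart', 'lying', 'on', 'the', 'bed'] * 10
counts = count_cooccurrences(tokens)
print(counts[('lazy', 'old')])
print(counts[('the', 'lazy')])
```

Because tokens can be any iterable here, it could also be a generator reading a file line by line, which keeps memory use proportional to the window size plus the number of distinct pairs.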
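Speculation:
You could probably write something similar with the spaCy Matcher (see here), but I am not sure it would work, because the wildcards you would need are still a bit unstable (in my experience).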