Scaled co-occurrence matrix with window size calculation in Python: answers

【Question title】: Scaled co-occurrence matrix with window size calculation in Python
【Posted】: 2020-07-17 08:15:27
【Question】:

Suppose I have a dataset in CSV format that contains sentences/paragraphs, one per row. Say it looks like this:

df = ['A B X B', 'X B B']

Now I can generate a co-occurrence matrix from it.

Here, A, B and X are the words. The matrix says that B occurs near X 4 times. The code I used:

from collections import defaultdict

import numpy as np
import pandas as pd


def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a real tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of this sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next `window_size` tokens to the right of the current one
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                # sort the pair so (a, b) and (b, a) share one key
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # turn the pair counts into a symmetric dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
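
As a quick sanity check of the count quoted above, the following minimal sketch counts unordered token pairs inside a forward window with `collections.Counter`. It is not the asker's function, just a compact equivalent of its counting loop; `pair_counts` is a hypothetical name, and `window_size=2` is an assumption, since the question does not say which window produced the matrix.

```python
from collections import Counter


def pair_counts(sentences, window_size):
    """Count unordered token pairs that fall within a forward window."""
    counts = Counter()
    for text in sentences:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            # Pair the current token with the next `window_size` tokens.
            for other in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, other)))] += 1
    return counts


# window_size=2 is assumed here; the question does not state it.
counts = pair_counts(['A B X B', 'X B B'], window_size=2)
print(counts[('b', 'x')])
```

With these inputs the (b, x) count comes out as 4, which matches the figure quoted in the question. With `window_size=1` it would be 3, so the original matrix was presumably built with a window of at least 2.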

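The nice thing about this code is that it lets me choose the window size: if a word does not occur within a fixed range of the target word, it is ignored. But I want to extend it with scaling.

That is, if a word is far away from the target word, it should be given a smaller value. Unfortunately, I couldn't find a suitable solution. Is this possible with a package such as scikit-learn? Or is there any way other than coding it by hand?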

【Comments】:

You're awfully close to asking for a tool, library, etc. ;-)
Does the co-occurrence matrix come from the df declared above?
@thebjorn Oh! Been playing this game for a long time ;-)
@thebjorn No. It is an example of what I am trying to achieve.
Have you already seen this: stackoverflow.com/a/49667439/75103?

Tags: python matrix nlp stanford-nlp find-occurrences

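【Solution 1】:

Here is an implementation that can optionally scale the accumulated co-occurrence values by the distance between word tokens in the input sentences: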

In [11]: sentences = ['from swerve of shore to bend of bay , brings'.split()]                                    

In [12]: index, matrix = co_occurence_matrix(sentences, window=3, scale=True)                                    

In [13]: cell = index['bend'], index['of']                                                                       

In [14]: matrix[cell]                                                                                            
Out[14]: 1.3333333333333333

In [15]: index, matrix = co_occurence_matrix(sentences, window=3, scale=False)                                   

In [16]: matrix[cell]                                                                                            
Out[16]: 2.0

In [17]: {w: matrix[index['to']][i] for w, i in index.items()}                                                   
Out[17]: 
{',': 0.0,
 'bend': 1.0,
 'of': 1.0,
 'bay': 0.3333333333333333,
 'brings': 0.0,
 'to': 0.0,
 'from': 0.0,
 'shore': 1.0,
 'swerve': 0.3333333333333333}
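
The implementation itself is not reproduced in this post, so here is a minimal sketch of how such a function might look. It assumes that `window` is the maximum token distance that counts, and that each pair contributes 1/distance when `scale=True` and 1 otherwise; under those assumptions it gives the same values as the scaled and unscaled cells in the session above. The name `cooccurrence_sketch` and its internals are hypothetical, not the answerer's actual code.

```python
import numpy as np


def cooccurrence_sketch(sentences, window=3, scale=False):
    """Return (index, matrix) for a list of tokenized sentences.

    Assumption: each pair of tokens at distance d <= window adds
    1/d to its cell when scale=True, and 1 otherwise.
    """
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=float)

    for sent in sentences:
        for i, left in enumerate(sent):
            # Look ahead only, then mirror, so each pair is visited once.
            for j in range(i + 1, min(i + window + 1, len(sent))):
                weight = 1.0 / (j - i) if scale else 1.0
                a, b = index[left], index[sent[j]]
                matrix[a, b] += weight
                if a != b:
                    matrix[b, a] += weight
    return index, matrix


sentences = ['from swerve of shore to bend of bay , brings'.split()]
index, scaled = cooccurrence_sketch(sentences, window=3, scale=True)
_, raw = cooccurrence_sketch(sentences, window=3, scale=False)

cell = index['bend'], index['of']
print(scaled[cell], raw[cell])  # 1/3 + 1/1 when scaled, 2 when not
```

In this sketch 'bend' sees 'of' at distances 3 and 1, which is where the scaled value of about 1.33 comes from. The per-word values listed for 'to' correspond to the scaled matrix.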

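【Comments】:

Works like a charm. Thanks.
@AtanuCSE, the distances method could probably be optimized further (there may be a cleaner iterative solution that checks tokens instead of looping over all pairwise combinations), but I'm glad it's what you were looking for. You could also use other distance metrics, such as math.sqrt(distance).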
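
To illustrate the two suggestions in the comment above, this sketch compares each token only against the tokens inside its window (rather than over all pairwise combinations) and takes the distance weighting as a parameter. `weighted_pair_score` and its `weight` argument are hypothetical names introduced for illustration, and using `math.sqrt(distance)` as a divisor, i.e. a weight of 1/sqrt(d), is an assumption about how the alternative metric would be applied.

```python
import math


def weighted_pair_score(tokens, first, second, window, weight):
    """Sum weight(d) over occurrences of `first` and `second` at distance d <= window.

    Assumes `first` and `second` are different words.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # Inspect only the positions within `window` of i, on both sides.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == second:
                total += weight(abs(j - i))
    return total


tokens = 'from swerve of shore to bend of bay , brings'.split()
inverse = weighted_pair_score(tokens, 'bend', 'of', 3, lambda d: 1.0 / d)
inverse_sqrt = weighted_pair_score(tokens, 'bend', 'of', 3, lambda d: 1.0 / math.sqrt(d))
print(inverse, inverse_sqrt)
```

The square-root weighting decays more slowly with distance, so the 'of' that sits three tokens away from 'bend' contributes more than it does under plain 1/d.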