【Question title】: numpy - how to combine multiple indices (replace multiple one-by-one matrix access with one access)
【Posted】: 2021-01-01 06:47:36
【Question】:

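Update

The implementation does not account for multiple occurrences of the same word inside a window, nor for the word at the position occurring again in its own window.

For example, with stride=2 and the word at the position being W (the middle one below), the co-occurrence with X needs +2 and the self co-occurrence of W needs +1.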

X|Y|W|X|W
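
A likely cause of the undercount, sketched below: with integer array indexing, `a[i, idx] += 1` is buffered, so an index that repeats inside `idx` is incremented only once. `np.add.at` performs the unbuffered equivalent. The word ids (X=0, Y=1, W=2) and the small matrices are made up for this illustration.

```python
import numpy as np

# Hypothetical word ids: X=0, Y=1, W=2; the sentence is X|Y|W|X|W.
sequence = np.array([0, 1, 2, 0, 2])
position, stride = 2, 2                      # centre word is the middle W
window = sequence[max(0, position - stride): position + stride + 1]

buffered = np.zeros((3, 3), dtype=np.int32)
buffered[sequence[position], window] += 1    # repeated ids are counted once

unbuffered = np.zeros((3, 3), dtype=np.int32)
np.add.at(unbuffered, (sequence[position], window), 1)  # every occurrence counts

print(buffered[2])    # [1 1 1]
print(unbuffered[2])  # [2 1 2]
```

The unbuffered count for W still includes the centre word itself; subtracting one leaves the +1 described above.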

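Question

To update the m * m matrix (co_occurrence_matrix), the code currently uses a loop and accesses the matrix one row at a time. The entire code is at the bottom.

How can I remove the loop and update multiple rows at once? I believe there should be a way to combine the individual indices into one matrix and replace the loop with a single vectorized update.

Please suggest possible approaches.

Current implementation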

for position in range(0, n):       
    co_ccurrence_matrix[
        sequence[position],                                                # position  to the word
        sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
    ] += 1
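  1. Loop through sequence, the array of word indices (a word index is the integer code of a word).
  2. For each word at position in the loop, look at the words that co-occur on both sides within stride distance.
    This is an N-gram context window, shown as the purple box in the figure. N = context_size = stride*2 + 1
  3. Increment the count of each co-occurring word in co_occurrence_matrix, following the blue lines in the figure.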
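
The window arithmetic in step 2 can be sketched in isolation. The helper name `context_window` is made up for this example; the slice bounds are the ones used in the loop, and the sequence is the word-index sequence of the example corpus.

```python
import numpy as np

def context_window(sequence, position, stride):
    """Return the part of `sequence` within `stride` of `position`, clipped at both ends."""
    n = len(sequence)
    start = max(0, position - stride)
    stop = min(position + stride, n - 1) + 1
    return sequence[start:stop]

sequence = np.array([0, 1, 2, 3, 0, 1, 4, 5, 6, 7])
stride = 1                                  # context_size = stride*2 + 1 = 3
print(context_window(sequence, 0, stride))  # [0 1]
print(context_window(sequence, 4, stride))  # [3 0 1]
print(context_window(sequence, 9, stride))  # [6 7]
```

The window includes the word at `position` itself, which is why the full code zeroes the diagonal afterwards, and the windows at the two ends are shorter than the rest.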

Attempt

It looks like Integer array indexing could be a way to access multiple rows at the same time.

x = np.array([[ 0,  1,  2],
              [ 3,  4,  5],
              [ 6,  7,  8],
              [ 9, 10, 11]])
rows = np.array([[0, 0],
                 [3, 3]], dtype=np.intp)
columns = np.array([[0, 2],
                    [0, 2]], dtype=np.intp)
x[rows, columns]
---
array([[ 0,  2],
       [ 9, 11]])

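I tried to build a multi-dimensional index by combining the indices from each loop iteration, but it fails with the error below. Please point out the cause and the mistake, or tell me if the attempt does not make sense.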

    indices = np.array([
        [
            sequence[0],                                         # position  to the word
            sequence[max(0, 0-stride) : min((0+stride),n-1) +1]  # positions to co-occurrence words
        ]]
    )
    assert n > 1
    for position in range(1, n):
        co_occurrence_indices = np.array([
            [
                sequence[position],                                                # position  to the word
                sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
            ]]
        )
        indices = np.append(
            indices,
            co_occurrence_indices,
            axis=0
        )

    print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
        indices,
        indices.dtype
    ))
    co_ccurrence_matrix[  
        indices              # <---- Error
    ] += 1
 

Output

Updating the co_occurrence_matrix: indices 
[[0 array([0, 1])]
 [1 array([0, 1, 2])]
 [2 array([1, 2, 3])]
 [3 array([2, 3, 0])]
 [0 array([3, 0, 1])]
 [1 array([0, 1, 4])]
 [4 array([1, 4, 5])]
 [5 array([4, 5, 6])]
 [6 array([5, 6, 7])]
 [7 array([6, 7])]] 
indices.dtype object

<ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  indices = np.array([
<ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  co_occurrence_indices = np.array([

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-88-d9b081bf2f1a> in <module>
     84 sequence, word_to_id, id_to_word = preprocess(corpus)
     85 vocabrary_size = max(word_to_id.values()) + 1
---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)

<ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
     70         indices.dtype
     71     ))
---> 72     co_ccurrence_matrix[
     73         indices
     74     ] += 1

IndexError: arrays used as indices must be of integer (or boolean) type
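
The traceback is consistent with `indices` having dtype object: the windows at the two ends are shorter than the others, so NumPy cannot build a rectangular integer array and falls back to an array of Python objects, which is not a valid index. One possible way around it, sketched here and not taken from the original post, is to build flat integer (row, column) pairs, one pair per (word, neighbour) occurrence, and apply them with `np.add.at`. The function name `cooccurrence_by_offsets` is made up for this sketch; unlike the loop, it counts repeated neighbours every time and leaves out the centre word.

```python
import numpy as np

def cooccurrence_by_offsets(sequence, vocabulary_size, stride=1):
    """Count (word, neighbour) pairs within `stride` positions, one shifted slice per offset."""
    sequence = np.asarray(sequence)
    matrix = np.zeros((vocabulary_size, vocabulary_size), dtype=np.int32)
    for offset in range(1, stride + 1):      # loops over offsets, not positions
        left, right = sequence[:-offset], sequence[offset:]
        np.add.at(matrix, (left, right), 1)  # neighbour `offset` steps to the right
        np.add.at(matrix, (right, left), 1)  # neighbour `offset` steps to the left
    return matrix

sequence = np.array([0, 1, 2, 3, 0, 1, 4, 5, 6, 7])
print(cooccurrence_by_offsets(sequence, 8, stride=1))
```

For this sequence with stride=1 the result equals the matrix that the current code prints. The remaining Python loop runs `stride` times, independent of the corpus length.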

Current code

import numpy as np
 
def preprocess(text):
    """
    Args:
        text: A string including sentences to process. corpus
    Returns:
        sequence:
            A numpy array of word indices to every word in the original text as they appear in the text.
            The objective of corpus is to preserve the original text but as numerical indices.
        word_to_id: A dictionary to map a word to a word index
        id_to_word: A dictionary to map a word index to a word
    """
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')
 
    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
 
    sequence= np.array([word_to_id[w] for w in words])
 
    return sequence, word_to_id, id_to_word
 
 
def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
    """
    Args:
        sequence: word index sequence of the original corpus text
        vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
        context_size: context (N-gram size N) within which to check co-occurrences.         
    """
    n = sequence_size = len(sequence)
    co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
 
    stride = int((context_size - 1)/2 )
    assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
        n, stride
    )
 
    for position in range(0, n):       
        co_ccurrence_matrix[
            sequence[position],                                                # position  to the word
            sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
        ] += 1
 
    np.fill_diagonal(co_ccurrence_matrix, 0)
    return co_ccurrence_matrix
 
 
corpus= "To be, or not to be, that is the question"
 
sequence, word_to_id, id_to_word = preprocess(corpus)
vocabrary_size = max(word_to_id.values()) + 1
create_cooccurrence_matrix(sequence, vocabrary_size , 3)
---
[[0 2 0 1 0 0 0 0]
 [2 0 1 0 1 0 0 0]
 [0 1 0 1 0 0 0 0]
 [1 0 1 0 0 0 0 0]
 [0 1 0 0 0 1 0 0]
 [0 0 0 0 1 0 1 0]
 [0 0 0 0 0 1 0 1]
 [0 0 0 0 0 0 1 0]]

Profiling

Profiled with ptb.train.txt as the input corpus.

Timer unit: 1e-06 s

Total time: 23.0015 s
File: <ipython-input-8-27f5e530d4ff>
Function: create_cooccurrence_matrix at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
     2                                               """
     3                                               Args: 
     4                                                   sequence: word index sequence of the original corpus text
     5                                                   vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
     6                                                   context_size: context (N-gram size N) within to check co-occurrences.
     7                                               Returns:
     8                                                   co_occurrence matrix
     9                                               """
    10         1          4.0      4.0      0.0      n = sequence_size = len(sequence)
    11         1         98.0     98.0      0.0      co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
    12                                           
    13         1          5.0      5.0      0.0      stride = int((context_size - 1)/2 )
    14         1          1.0      1.0      0.0      assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
    15                                                   n, stride
    16                                               )
    17                                           
    18                                               """
    19                                               # Handle position=slice(0 : (stride-1) +1),       co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
    20                                               # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
    21                                               indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
    22                                               #print(indices)
    23                                               
    24                                               for position in indices:
    25                                                   debug(sequence, position, stride, False)
    26                                                   co_occurrence_matrix[
    27                                                       sequence[position],                                             # position to the word
    28                                                       sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # indices to co-occurance words 
    29                                                   ] += 1
    30                                           
    31                                               
    32                                               # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
    33                                               for position in range(stride, (sequence_size-1) - stride + 1):        
    34                                                   co_occurrence_matrix[
    35                                                       sequence[position],                                 # position to the word
    36                                                       sequence[(position-stride) : (position + stride + 1)]  # indices to co-occurance words 
    37                                                   ] += 1
    38                                               """        
    39                                               
    40    929590    1175326.0      1.3      5.1      for position in range(0, n):        
    41   2788767   15304643.0      5.5     66.5          co_occurrence_matrix[
    42   1859178    2176964.0      1.2      9.5              sequence[position],                                                # position  to the word
    43    929589    3280181.0      3.5     14.3              sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurance words 
    44    929589    1062613.0      1.1      4.6          ] += 1
    45                                           
    46         1       1698.0   1698.0      0.0      np.fill_diagonal(co_occurrence_matrix, 0)
    47                                               
    48         1          2.0      2.0      0.0      return co_occurrence_matrix

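【Comments】:

  • You can solve this in a fully vectorized way with a stride_tricks rolling window, then use np.eye and np.sum to get the multi-hot vectors of the windows, and finally use np.tensordot to turn the multi-hot vectors into the co-occurrence matrix. See my detailed answer below.
  • Hi, did the solution work for you?

Tags: python arrays numpy matrix-indexing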


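【Solution 1】:

EDIT: You can do this quite easily with built-in sklearn functions, but looking at your question history, I believe you are looking for a pure NumPy vectorized implementation.


IIUC, you want to create a co-occurrence matrix based on the context window around each word. So if the vocabulary has 12 words, there are 100 sentences, and the context size is 2, you want to look at a rolling window of size 5 (2 left, 1 center, 2 right) over each sentence and iteratively (or in a vectorized way) add up the context words to get a (12, 12) matrix that tells you how many times a word occurs in the context window of another word.

Vectorized implementation

You can do this in a completely vectorized way (explanation in the last section) -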

#Definitions
sentences, vocab, length, context_size = 100, 12, 15, 2

#Create dummy corpus (label encoded)
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length))  #(100, 15)

#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window  #(100, 11, 5) 
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]  #(120, 8, 8)
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)  #(100, 11, 5)

#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size]  #(100, 11)
context = np.delete(rolling_window, center_idx, -1)  #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))  #(12, 12)
np.fill_diagonal(cooccurence,0)  #(12, 12)
print(cooccurence)
[[  0.  94. 100. 114.  91.  92.  90. 128. 100. 114.  91.  84.]
 [ 94.   0.  78.  96.  90.  65.  76.  68.  76. 108.  58.  68.]
 [100.  78.   0. 125. 107.  93.  83.  84.  73.  84.  97. 110.]
 [114.  96. 125.   0.  84.  97.  76. 110.  80.  94. 117.  97.]
 [ 91.  90. 107.  84.   0.  84.  87. 103.  60. 127. 123.  97.]
 [ 92.  65.  93.  97.  84.   0.  67.  87.  72.  87.  74.  92.]
 [ 90.  76.  83.  76.  87.  67.   0.  83.  73. 118.  81. 108.]
 [128.  68.  84. 110. 103.  87.  83.   0.  72. 100. 115.  69.]
 [100.  76.  73.  80.  60.  72.  73.  72.   0.  83.  81. 100.]
 [114. 108.  84.  94. 127.  87. 118. 100.  83.   0. 109. 110.]
 [ 91.  58.  97. 117. 123.  74.  81. 115.  81. 109.   0. 104.]
 [ 84.  68. 110.  97.  97.  92. 108.  69. 100. 110. 104.   0.]]

Testing on the given example

Let's test it on the single-sentence corpus to be or not to be that is the question

sentence = 'to be or not to be that is the question'
corpus = np.array([[0, 1, 2, 3, 0, 1, 4, 5, 6, 7]])

#Definitions
vocab, context_size = 8, 2
window = context_size*2+1

#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)

#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size]  
context = np.delete(rolling_window, center_idx, -1)  
context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 5. 1. 3. 1. 2. 1. 0.]
 [5. 0. 3. 2. 2. 1. 2. 1.]
 [1. 3. 0. 1. 1. 0. 0. 0.]
 [3. 2. 1. 0. 2. 1. 0. 0.]
 [1. 2. 1. 2. 0. 1. 1. 1.]
 [2. 1. 0. 1. 1. 0. 1. 0.]
 [1. 2. 0. 0. 1. 1. 0. 1.]
 [0. 1. 0. 0. 1. 0. 1. 0.]]

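Detailed explanation

Let's start by creating some label-encoded dummy data. There are 100 sentences and the vocabulary size is 12. Each sentence has length 15, and the window I am taking is 5 (2+1+2) -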

sentences, vocab, length, context_size = 100, 12, 15, 2
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length))
corpus[0:2]
#top 2 sentences
array([[ 9,  8,  9,  4,  2, 10,  9,  0,  7,  1, 11,  0,  7,  3,  1],
       [ 7,  9,  4,  0,  1,  9, 10,  7,  4,  2,  2,  3,  5,  8,  8]])

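Next, we want to create a rolling window view with the window size so that we can move on to the next stage. The shape of this new view is (sentences, number of windows, window size), and with stride_tricks we can easily create such a rolling window view of the matrix.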

#Create shape and stride definitions
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
print(shape, stride)

#create view
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)  #(100, 11, 5)
print('\nView for first sequence ->')
print(rolling_window[0])
(100, 11, 5) (120, 8, 8)

View for first sequence ->
[[ 9  8  9  4  2]
 [ 8  9  4  2 10]
 [ 9  4  2 10  9]
 [ 4  2 10  9  0]
 [ 2 10  9  0  7]
 [10  9  0  7  1]
 [ 9  0  7  1 11]
 [ 0  7  1 11  0]
 [ 7  1 11  0  7]
 [ 1 11  0  7  3]
 [11  0  7  3  1]]
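
`as_strided` performs no bounds checking, so a wrong shape or stride silently reads unrelated memory. On NumPy 1.20 or later, `np.lib.stride_tricks.sliding_window_view` builds the same kind of view from the window size alone. A minimal sketch, assuming that NumPy version and a freshly generated dummy corpus (not the one printed above):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

rng = np.random.default_rng(0)
corpus = rng.integers(0, 12, size=(100, 15))   # dummy label-encoded corpus
window = 5

# Manual view, built the same way as in the answer.
shape = corpus.shape[0], corpus.shape[1] - window + 1, window
strides = corpus.strides[0], corpus.strides[1], corpus.strides[1]
manual = as_strided(corpus, shape=shape, strides=strides)

# Bounds-checked equivalent: slide a length-5 window along axis 1.
checked = sliding_window_view(corpus, window_shape=window, axis=1)

print(checked.shape)                    # (100, 11, 5)
print(np.array_equal(manual, checked))  # True
```

Both results are views on `corpus`; neither copies the data.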

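Next, let's first look at just one sentence and turn it into a co-occurrence matrix. After that we can scale the approach to higher-dimensional arrays.

For a single sentence, we can perform the following steps -

  1. Get the position words (center words)
  2. Delete the center word column to get the context words
  3. Use np.eye(vocab) to create a one-hot matrix and index it with the context labels
  4. Sum over the context-word axis (axis=1 in the code below) to get a multi-hot vector for each window
  5. Take the dot product of the per-window multi-hot context vectors to get the (word, word) co-occurrence matrix.
  6. Fill the diagonal with 0 to ignore a word co-occurring with itself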
position = rolling_window[0][:,2]
context = np.delete(rolling_window[0], 2, 1)
context_multihot = np.sum(np.eye(vocab)[context], axis=1)
cooccurence = context_multihot.T@context_multihot
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 3. 2. 1. 1. 0. 0. 5. 0. 2. 1. 4.]
 [3. 0. 0. 2. 0. 0. 0. 4. 0. 2. 1. 3.]
 [2. 0. 0. 0. 2. 0. 0. 1. 2. 3. 2. 0.]
 [1. 2. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2.]
 [1. 0. 2. 0. 0. 0. 0. 0. 1. 4. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [5. 4. 1. 1. 0. 0. 0. 0. 0. 1. 2. 2.]
 [0. 0. 2. 0. 1. 0. 0. 0. 0. 2. 1. 0.]
 [2. 2. 3. 0. 4. 0. 0. 1. 2. 0. 4. 1.]
 [1. 1. 2. 0. 1. 0. 0. 2. 1. 4. 0. 0.]
 [4. 3. 0. 2. 0. 0. 0. 2. 0. 1. 0. 0.]]
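
Steps 3 and 4 can be seen on a tiny example: indexing `np.eye(vocab)` with an integer array replaces each label by its one-hot row, and summing over the context-word axis gives one vector per window. A label that occurs twice in a window contributes 2, so the vectors hold counts, not 0/1 flags. The `context` values below are invented for illustration.

```python
import numpy as np

vocab = 4
context = np.array([[0, 1, 1, 3],    # window 0: label 1 appears twice
                    [2, 2, 2, 0]])   # window 1: label 2 appears three times

one_hot = np.eye(vocab)[context]     # shape (2, 4, 4): windows x context words x vocab
multihot = one_hot.sum(axis=1)       # shape (2, 4): per-window label counts

print(multihot)
# [[1. 2. 0. 1.]
#  [1. 0. 3. 0.]]
```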

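We have now done the whole thing for 1 sentence. Now we just need to scale to 100 sentences without a for loop. For this, only a few things need to change.

  1. Create dynamic indices for the position and context words (previously hard-coded)
  2. Take care of the axes, since we are now dealing with a 3D tensor instead of a 2D one
  3. Transpose context_multihot over its last 2 axes before the dot product
  4. Change np.dot to np.tensordot so that we can reduce over the specified axes. In this case we have to perform (100, 12, 11) @ (100, 11, 12) -> (12, 12), so choose the axes accordingly.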
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size]  #(100, 11)
context = np.delete(rolling_window, center_idx, -1)  #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))  #(12, 12)
np.fill_diagonal(cooccurence,0)  #(12, 12)
print(cooccurence)
[[  0.  94. 100. 114.  91.  92.  90. 128. 100. 114.  91.  84.]
 [ 94.   0.  78.  96.  90.  65.  76.  68.  76. 108.  58.  68.]
 [100.  78.   0. 125. 107.  93.  83.  84.  73.  84.  97. 110.]
 [114.  96. 125.   0.  84.  97.  76. 110.  80.  94. 117.  97.]
 [ 91.  90. 107.  84.   0.  84.  87. 103.  60. 127. 123.  97.]
 [ 92.  65.  93.  97.  84.   0.  67.  87.  72.  87.  74.  92.]
 [ 90.  76.  83.  76.  87.  67.   0.  83.  73. 118.  81. 108.]
 [128.  68.  84. 110. 103.  87.  83.   0.  72. 100. 115.  69.]
 [100.  76.  73.  80.  60.  72.  73.  72.   0.  83.  81. 100.]
 [114. 108.  84.  94. 127.  87. 118. 100.  83.   0. 109. 110.]
 [ 91.  58.  97. 117. 123.  74.  81. 115.  81. 109.   0. 104.]
 [ 84.  68. 110.  97.  97.  92. 108.  69. 100. 110. 104.   0.]]
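
The `tensordot` call contracts the sentence axis and the window axis at once. Assuming `context_multihot` has shape `(sentences, windows, vocab)`, the same result can be obtained by flattening the first two axes and taking an ordinary matrix product, or with `np.einsum`. A small sketch on random stand-in data to check the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
context_multihot = rng.integers(0, 3, size=(100, 11, 12)).astype(float)

# Form used in the answer: (100, 12, 11) against (100, 11, 12) -> (12, 12)
via_tensordot = np.tensordot(context_multihot.transpose(0, 2, 1),
                             context_multihot, axes=([0, 2], [0, 1]))

# Flatten sentences and windows into one axis, then a plain matrix product.
flat = context_multihot.reshape(-1, context_multihot.shape[-1])   # (1100, 12)
via_matmul = flat.T @ flat

# Same contraction spelled out with einsum.
via_einsum = np.einsum('swi,swj->ij', context_multihot, context_multihot)

print(np.allclose(via_tensordot, via_matmul))  # True
print(np.allclose(via_tensordot, via_einsum))  # True
```

The flattened form makes it explicit that sentence boundaries only matter when the windows are built, not when they are summed.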

【Discussion】:
