【发布时间】:2021-01-01 06:47:36
【问题描述】:
更新
该实现没有考虑同一个词的多次出现,以及自身词的出现。
比如stride=2,该位置的单词是W,X的同现需要+2,W的自同需要+1。
X|Y|W|X|W
问题
要更新m * m 矩阵(co_occurance_matrix),当前使用循环逐行访问。整个代码在底部。
如何删除循环并一次更新多行?我相信应该有一种方法可以将每个索引组合成一个矩阵,用一个矢量化更新替换循环。
请建议可能的方法。
当前实现
for position in range(0, n):
co_ccurrence_matrix[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
] += 1
- 循环遍历单词索引数组
sequence(单词索引是每个单词的整数代码)。 - 对于循环中
position处的每个单词,检查stride距离内两边同时出现的单词。
这是一个 N-gramcontext窗口,如图中的紫色框所示。N = context_size = stride*2 + 1。 - 按照图中的蓝线,增加
co_occurrence_matrix中每个共现词的计数。
尝试
看来Integer array indexing 可能是一种同时访问多行的方法。
x = np.array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
rows = np.array([[0, 0],
[3, 3]], dtype=np.intp)
columns = np.array([[0, 2],
[0, 2]], dtype=np.intp)
x[rows, columns]
---
array([[ 0, 2],
[ 9, 11]])
通过组合循环中的每个索引来创建多维索引,但它不适用于错误。请告知原因和错误,或者如果尝试没有意义。
indices = np.array([
[
sequence[0], # position to the word
sequence[max(0, 0-stride) : min((0+stride),n-1) +1] # positions to co-occurrence words
]]
)
assert n > 1
for position in range(1, n):
co_occurrence_indices = np.array([
[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
]]
)
indices = np.append(
indices,
co_occurrence_indices,
axis=0
)
print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
indices,
indices.dtype
))
co_ccurrence_matrix[
indices <---- Error
] += 1
输出
Updating the co_occurrence_matrix: indices
[[0 array([0, 1])]
[1 array([0, 1, 2])]
[2 array([1, 2, 3])]
[3 array([2, 3, 0])]
[0 array([3, 0, 1])]
[1 array([0, 1, 4])]
[4 array([1, 4, 5])]
[5 array([4, 5, 6])]
[6 array([5, 6, 7])]
[7 array([6, 7])]]
indices.dtype object
<ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
indices = np.array([
<ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
co_occurrence_indices = np.array([
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-88-d9b081bf2f1a> in <module>
84 sequence, word_to_id, id_to_word = preprocess(corpus)
85 vocabrary_size = max(word_to_id.values()) + 1
---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)
<ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
70 indices.dtype
71 ))
---> 72 co_ccurrence_matrix[
73 indices
74 ] += 1
IndexError: arrays used as indices must be of integer (or boolean) type
当前代码
import numpy as np
def preprocess(text):
"""
Args:
text: A string including sentences to process. corpus
Returns:
sequence:
A numpy array of word indices to every word in the original text as they appear in the text.
The objective of corpus is to preserve the original text but as numerical indices.
word_to_id: A dictionary to map a word to a word index
id_to_word: A dictionary to map a word index to a word
"""
text = text.lower()
text = text.replace('.', ' .')
words = text.split(' ')
word_to_id = {}
id_to_word = {}
for word in words:
if word not in word_to_id:
new_id = len(word_to_id)
word_to_id[word] = new_id
id_to_word[new_id] = word
sequence= np.array([word_to_id[w] for w in words])
return sequence, word_to_id, id_to_word
def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
"""
Args:
sequence: word index sequence of the original corpus text
vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
context_size: context (N-gram size N) within which to check co-occurrences.
"""
n = sequence_size = len(sequence)
co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
stride = int((context_size - 1)/2 )
assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
n, stride
)
for position in range(0, n):
co_ccurrence_matrix[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
] += 1
np.fill_diagonal(co_ccurrence_matrix, 0)
return co_ccurrence_matrix
corpus= "To be, or not to be, that is the question"
sequence, word_to_id, id_to_word = preprocess(corpus)
vocabrary_size = max(word_to_id.values()) + 1
create_cooccurrence_matrix(sequence, vocabrary_size , 3)
---
[[0 2 0 1 0 0 0 0]
[2 0 1 0 1 0 0 0]
[0 1 0 1 0 0 0 0]
[1 0 1 0 0 0 0 0]
[0 1 0 0 0 1 0 0]
[0 0 0 0 1 0 1 0]
[0 0 0 0 0 1 0 1]
[0 0 0 0 0 0 1 0]]
分析
使用来自enter link description here的ptb.train.txt。
Timer unit: 1e-06 s
Total time: 23.0015 s
File: <ipython-input-8-27f5e530d4ff>
Function: create_cooccurrence_matrix at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
2 """
3 Args:
4 sequence: word index sequence of the original corpus text
5 vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
6 context_size: context (N-gram size N) within to check co-occurrences.
7 Returns:
8 co_occurrence matrix
9 """
10 1 4.0 4.0 0.0 n = sequence_size = len(sequence)
11 1 98.0 98.0 0.0 co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
12
13 1 5.0 5.0 0.0 stride = int((context_size - 1)/2 )
14 1 1.0 1.0 0.0 assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
15 n, stride
16 )
17
18 """
19 # Handle position=slice(0 : (stride-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
20 # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
21 indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
22 #print(indices)
23
24 for position in indices:
25 debug(sequence, position, stride, False)
26 co_occurrence_matrix[
27 sequence[position], # position to the word
28 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # indices to co-occurance words
29 ] += 1
30
31
32 # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
33 for position in range(stride, (sequence_size-1) - stride + 1):
34 co_occurrence_matrix[
35 sequence[position], # position to the word
36 sequence[(position-stride) : (position + stride + 1)] # indices to co-occurance words
37 ] += 1
38 """
39
40 929590 1175326.0 1.3 5.1 for position in range(0, n):
41 2788767 15304643.0 5.5 66.5 co_occurrence_matrix[
42 1859178 2176964.0 1.2 9.5 sequence[position], # position to the word
43 929589 3280181.0 3.5 14.3 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurance words
44 929589 1062613.0 1.1 4.6 ] += 1
45
46 1 1698.0 1698.0 0.0 np.fill_diagonal(co_occurrence_matrix, 0)
47
48 1 2.0 2.0 0.0 return co_occurrence_matrix
【问题讨论】:
-
您可以使用
stride_tricks滚动窗口以完全矢量化的方式解决此问题,然后使用np.eye和np.sum获取窗口的多热向量,最后使用np.tensordot转换多热向量到共现矩阵。在下面查看我的详细答案。 -
您好,该解决方案对您有用吗?
标签: python arrays numpy matrix-indexing