Creating a |N| x |M| matrix from a hash-table: answers

【Question title】: Creating a |N| x |M| matrix from a hash-table
【Posted】: 2017-03-05 17:03:49
【Question description】:

Let's say I have a dictionary/hash-table of string pairs (keys) and their respective probabilities (values):

import numpy as np
import random
import uuid

# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(8000,10000)
m_vocab_size = random.randint(8000,10000)

def random_word(): 
    return str(uuid.uuid4().get_hex().upper()[0:random.randint(1,max_word_len)])

# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]


# Let's hallucinate probabilities for each word pair.
hashes =  {(n, m): random.random() for n in n_vocab for m in m_vocab}

The hashes hash-table looks like this:

{('585F', 'B4867'): 0.7582038699473549,
 ('69', 'D98B23C5809A'): 0.7341569569849136,
 ('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
 ('DD8F8AFA5CF', 'CB'): 0.4609114677237601,
...
}

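Imagine that this is the input hash-table that I'll read from a CSV file, where the first and second columns are the word pairs (keys) of the hash-table and the third column is the probability.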

If I were to put the probabilities into some sort of numpy matrix, I would have to do this from the hash-table:

 n_words, m_words = zip(*hashes.keys())
 probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])

Is there another way to get the probs into a |N| x |M| matrix from the hash-table without doing a nested loop through m_vocab and n_vocab?

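(Note: I'm creating random words and random probabilities here, but imagine that I have read the hash-table from a file and loaded it into that hash-table structure.)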

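Assume both scenarios, where:

The hash-table comes from a csv file (@bunji's answer resolves this)
The hash-table comes from a pickled dictionary, or was computed in some other way before reaching the part where it needs to be converted into a matrix.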

It's important that the final matrix be queryable. The following is not desirable:

$ echo -e 'abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t\0.23\nabc\tjkl\t0.5\n' > test.txt

$ cat test.txt
abc xyz 0.9
efg xyz 0.3
lmn opq .23
abc jkl 0.5


$ python
Python 2.7.10 (default, Jul 30 2016, 18:31:42) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pt = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack().as_matrix()
>>> pt
array([[ 0.5,  nan,  0.9],
       [ nan,  nan,  0.3],
       [ nan,  nan,  nan]])
>>> pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
       2         
1    jkl opq  xyz
0                
abc  0.5 NaN  0.9
efg  NaN NaN  0.3
lmn  NaN NaN  NaN

>>> df = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()

>>> df
       2         
1    jkl opq  xyz
0                
abc  0.5 NaN  0.9
efg  NaN NaN  0.3
lmn  NaN NaN  NaN

>>> df['abc', 'jkl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1617, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('abc', 'jkl')
>>> df['abc']['jkl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
    loc = level_index.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 163, in pandas.index.IndexEngine.get_loc (pandas/index.c:4090)
KeyError: 'abc'

>>> df[0][2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
    loc = level_index.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0

>>> df[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
    loc = level_index.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0

The resulting matrix/dataframe should be queryable, i.e. it should be possible to do something like:

probs[('585F', 'B4867')] = 0.7582038699473549

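【Comments】:

Can you do this with pandas? Create a dataframe from the dictionary? Use the two keys as two columns and the probabilities as another column. After that, you might be able to create a composite index. Just a guess.
And then pandas.DataFrame.tonumpy()? Is there such a function? Let me try.
For Python 3.5 or so, your uuid4().get_hex().upper() may need to be changed to uuid4().hex.upper().
Why do you want to put them in such a table?
Actually, I have another |M| x 1 vector that corresponds to M, and it needs to go through a matrix multiplication: |N| x |M| * |M| x 1 = |N| x 1. The matrix is also needed for other statistical computations; in NLP this matrix is called a co-occurrence matrix.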

Tags: python csv numpy matrix hash

【Solution 1】:

I'm not sure there is a way to avoid looping completely, but I imagine it could be optimized by using itertools:

import itertools
nested_loop_iter = itertools.product(n_vocab,m_vocab)
#note that because it iterates over n_vocab first we will need to transpose it at the end
probs = np.fromiter(map(hashes.get, nested_loop_iter),dtype=float)
probs.resize((len(n_vocab),len(m_vocab)))
probs = probs.T
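
A minimal sketch of the same idea on a toy table, assuming Python 3 and assuming that some pairs may be missing from the dictionary. The toy words and the `0.0` default are assumptions for illustration, not part of the answer above. The last lines check the result against the nested-loop version from the question.

```python
import itertools
import numpy as np

# Toy vocabularies and a deliberately incomplete table (hypothetical data).
n_vocab = ["a", "b", "c"]
m_vocab = ["x", "y"]
hashes = {("a", "x"): 0.1, ("b", "y"): 0.2, ("c", "x"): 0.3}

# product() varies m fastest, so the flat array holds |N| runs of |M| values.
pairs = itertools.product(n_vocab, m_vocab)
flat = np.fromiter((hashes.get(pair, 0.0) for pair in pairs),
                   dtype=float, count=len(n_vocab) * len(m_vocab))

probs_nm = flat.reshape(len(n_vocab), len(m_vocab))  # shape (|N|, |M|)
probs_mn = probs_nm.T                                # shape (|M|, |N|), as in the question

# Reference: the nested loops from the question, using the same default.
expected = np.array([[hashes.get((n, m), 0.0) for n in n_vocab] for m in m_vocab])
assert np.array_equal(probs_mn, expected)
```

Using `hashes.get(pair, 0.0)` instead of `hashes.get` avoids feeding `None` to `np.fromiter` when a pair is absent. Whether zero is the right fill value depends on what the matrix is used for.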

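【Discussion】:

【Solution 2】:

[A short extension of dr-xorile's answer]

Most of the solutions look fine to me. It depends a bit on whether you need speed or convenience.

I agree that you basically have a matrix in coo sparse format. You may want to look at https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html

The only problem is that matrices need integer indices. So as long as your hashes are small enough to be expressed quickly as an np.int64, that should work. And the sparse format should allow $O(1)$ access to all elements.

(Sorry for the brevity!)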

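Rough outline

This could be fast, but it's somewhat hacky.

Get the data into a sparse representation. I think you should pick coo_matrix to hold your 2D hash map.

a. Load the CSV with numpy.genfromtxt and use, e.g., the dtype ['>u8', '>u8', np.float32] to treat the hashes as string representations of unsigned 8-byte integers. If that doesn't work, you could load the strings and convert them with numpy. In the end you have three arrays with one entry per (n, m) pair, i.e. N * M entries each, just like your hash-table, and you use them with the scipy sparse matrix representation of your choice.

b. If your objects are already in memory, you can use the sparse constructor directly.

To access an element you need to parse your strings again: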

prob = matrix[np.fromstring(key1, dtype='>u8'), np.fromstring(key2, dtype='>u8')]
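
A rough sketch of option b, with one substitution that should be stated plainly: instead of reinterpreting the strings as 8-byte integers, it maps each distinct word to a row or column number with an ordinary dict. The toy data and the names `n_index`, `m_index` and `lookup` are hypothetical. It assumes scipy is installed, and it converts to CSR because `coo_matrix` itself does not support element indexing.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy table (hypothetical data); in practice this is the loaded dict.
hashes = {("585F", "B4867"): 0.75, ("69", "D98B"): 0.73, ("585F", "CB"): 0.46}

# Map each distinct word to an integer row / column number.
n_index = {w: i for i, w in enumerate(sorted({n for n, _ in hashes}))}
m_index = {w: j for j, w in enumerate(sorted({m for _, m in hashes}))}

# Three parallel arrays: row numbers, column numbers, values.
rows = np.array([n_index[n] for n, _ in hashes])
cols = np.array([m_index[m] for _, m in hashes])
vals = np.array(list(hashes.values()), dtype=float)

# Build in COO format, then convert to CSR for indexing and products.
matrix = coo_matrix((vals, (rows, cols)),
                    shape=(len(n_index), len(m_index))).tocsr()

def lookup(n, m):
    """Return the stored probability for the word pair (n, m), or 0.0 if absent."""
    return matrix[n_index[n], m_index[m]]

# The |N| x |M| times |M| x 1 product mentioned in the comments.
v = np.ones(len(m_index))
result = matrix.dot(v)   # 1-D array of length |N|
```

The two dicts play the role of the headers, so the matrix stays queryable by word while the storage stays sparse.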

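【Discussion】:

【Solution 3】:

I reduced the sample size so that I could compare the different approaches quickly. I wrote a dataframe method, which may still use for loops inside the pandas functions, and compared it with the original code and with the itertools code provided by Tadhg McDonald-Jensen. The fastest is the itertools version.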

In [3]: %timeit itertool(hashes,n_vocab,m_vocab)
1000 loops, best of 3: 1.12 ms per loop

In [4]: %timeit baseline(hashes,n_vocab,m_vocab)
100 loops, best of 3: 3.23 ms per loop

In [5]: %timeit dataframeMethod(hashes,n_vocab,m_vocab)
100 loops, best of 3: 5.49 ms per loop

Here is the code I used for the comparison.

import numpy as np
import random
import uuid
import pandas as pd
import itertools

# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(80,100)
m_vocab_size = random.randint(80,100)

def random_word(): 
    return str(uuid.uuid4().get_hex().upper()[0:random.randint(1,max_word_len)])

# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]


# Let's hallucinate probabilities for each word pair.
hashes =  {(n, m): random.random() for n in n_vocab for m in m_vocab}

def baseline(hashes,n_vocab,m_vocab):
    n_words, m_words = zip(*hashes.keys())
    probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])
    return probs

def itertool(hashes,n_vocab,m_vocab):
    nested_loop_iter = itertools.product(n_vocab,m_vocab)
    #note that because it iterates over n_vocab first we will need to transpose it at the end
    probs = np.fromiter(map(hashes.get, nested_loop_iter),dtype=float)
    probs.resize((len(n_vocab),len(m_vocab)))
    return probs.T  

def dataframeMethod(hashes,n_vocab,m_vocab):
    # build dataframe from hashes
    id1 = pd.MultiIndex.from_tuples(hashes.keys())
    df=pd.DataFrame(hashes.values(),index=id1)
    # make dataframe with one index and one column
    df2=df.unstack(level=0)
    df2.columns = df2.columns.levels[1]
    return df2.loc[m_vocab,n_vocab].values

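【Discussion】:

【Solution 4】:

Going through the whole n_vocab x m_vocab space to build a sparse matrix seems a bit inefficient! You could loop over the original hash-table instead. Of course, it would be good to know a couple of things first:

Do you know the sizes of n_vocab and m_vocab up front? Or are you going to figure them out as you build the matrix?
Do you know whether there are any repeats in your hash-table, and if so, how will you handle them? It looks like hashes is a dictionary, in which case the keys are obviously unique. In practice, that probably means you are overwriting each time, so the last value is the one that stands.

In any case, here is a comparison of the two options: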

from collections import defaultdict
import numpy as np

hashes = defaultdict(float,{('585F', 'B4867'): 0.7582038699473549,
 ('69', 'D98B23C5809A'): 0.7341569569849136,
 ('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
 ('DD8F8AFA5CF', 'CB'): 0.4609114677237601})

#Double loop approach
n_vocab, m_vocab = zip(*hashes.keys())
probs1 = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])

#Loop through the hash approach
n_hash = dict()  #Create a hash table to find the correct row number
for i,n in enumerate(n_vocab):
    n_hash[n] = i
m_hash = dict()  #Create a hash table to find the correct col number
for i,m in enumerate(m_vocab):
    m_hash[m] = i
probs2 = np.zeros((len(n_vocab),len(m_vocab)))
for (n,m) in hashes: #Loop through the hashes and put the values into the probs table
    probs2[n_hash[n],m_hash[m]] = hashes[(n,m)]

The outputs of probs1 and probs2 are, of course, the same:

>>> probs1
array([[ 0.73415696,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.46091147,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.75820387,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.91060772]])
>>> probs2
array([[ 0.73415696,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.46091147,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.75820387,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.91060772]])

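Your probs1 code is, of course, very terse. However, the two loops differ a lot in size: the double loop visits every cell of the |N| x |M| grid, while the second version visits only the entries actually present in the hash-table, and that could make a big difference to the run time.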
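
A small sketch of the single-pass version wrapped in a function, assuming Python 3. The name `dict_to_matrix` and the `default` parameter are hypothetical. Unlike the snippet above, it deduplicates the words before numbering them, since `zip(*hashes.keys())` repeats a word once for every pair it appears in.

```python
import numpy as np

def dict_to_matrix(hashes, default=0.0):
    """Build a dense |N| x |M| array from {(n, m): prob} in one pass over the dict."""
    n_vocab = sorted({n for n, _ in hashes})   # distinct row words
    m_vocab = sorted({m for _, m in hashes})   # distinct column words
    n_hash = {n: i for i, n in enumerate(n_vocab)}   # word -> row number
    m_hash = {m: j for j, m in enumerate(m_vocab)}   # word -> column number
    probs = np.full((len(n_vocab), len(m_vocab)), default, dtype=float)
    for (n, m), p in hashes.items():
        probs[n_hash[n], m_hash[m]] = p
    return probs, n_hash, m_hash

hashes = {('585F', 'B4867'): 0.7582038699473549,
          ('69', 'D98B23C5809A'): 0.7341569569849136,
          ('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
          ('DD8F8AFA5CF', 'CB'): 0.4609114677237601}

probs, n_hash, m_hash = dict_to_matrix(hashes)

# Query by word pair through the two index dicts.
value = probs[n_hash['585F'], m_hash['B4867']]
```

Keeping `n_hash` and `m_hash` next to the array is what makes the result queryable by label, which is the requirement stated in the question.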

【Discussion】:

【Solution 5】:

If your end goal is to read your data in from a .csv file, it might be easier to read the file directly with pandas.

import pandas as pd

df = pd.read_csv('coocurence_data.csv', index_col=[0,1], header=None).unstack()
probs = df.as_matrix()

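This reads your data in from the csv and makes the first two columns into a multi-index, which corresponds to your two sets of words. It then unstacks the multi-index so that one set of words becomes the column labels and the other set becomes the index labels. This gives you your |N|*|M| matrix, which can then be converted into a numpy array with the .as_matrix() function.

This doesn't really answer your question about changing your {(n,m):prob} dictionary into a numpy array, but given your intentions, it lets you avoid creating that dictionary altogether.

Also, if you're going to be reading in a csv anyway, reading it with pandas in the first place will be faster than using the built-in csv module: see these benchmarks here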
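
For the other scenario in the question, where the table is already a dict in memory (for example after unpickling), a similar unstack can be sketched directly from the dictionary. This assumes a reasonably recent pandas on Python 3; the toy data mirrors the question's test.txt, and filling missing pairs with `0.0` is an assumption.

```python
import pandas as pd

# Toy table (same pairs as the question's test.txt).
hashes = {("abc", "xyz"): 0.9, ("efg", "xyz"): 0.3,
          ("lmn", "opq"): 0.23, ("abc", "jkl"): 0.5}

# A dict with tuple keys becomes a Series with a two-level index;
# unstack() moves the second level into the columns.
table = pd.Series(hashes).unstack()

probs = table.values                # plain numpy array, NaN where a pair is missing
dense = table.fillna(0.0).values    # same array with missing pairs set to 0.0
```

`table` keeps the row and column labels, so it can still be queried by word, while `probs` or `dense` can be passed to numpy for the matrix product.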

EDIT

To query a specific value in the DataFrame by its row and column labels, use df.loc:

df.loc['xyz', 'abc']

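where 'xyz' is the word in your row labels and 'abc' is the word in your column labels. Also check out df.ix and df.iloc for other ways of querying specific cells in a DataFrame.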
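
A hedged sketch of a label lookup on the question's sample data, assuming Python 3 and a recent pandas. The column names `n`, `m` and `prob` are made up for illustration. In the question's session the whole frame was unstacked, so the columns became a two-level index headed by `2`, which is likely why `df['abc', 'jkl']` raised a KeyError. Selecting the single value column before unstacking leaves plain word labels on both axes.

```python
import io
import pandas as pd

# Same rows as the question's test.txt, kept inline so the sketch is self-contained.
data = "abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t0.23\nabc\tjkl\t0.5\n"

df = pd.read_csv(io.StringIO(data), sep="\t", header=None,
                 names=["n", "m", "prob"])   # hypothetical column names

# Pick the value column first, then unstack: rows are n words, columns are m words.
table = df.set_index(["n", "m"])["prob"].unstack()

value = table.loc["abc", "jkl"]   # row label first, then column label
probs = table.values              # numpy array for the numeric work
```

Note that `.loc` takes the row label first, so the order of the two words has to match how the table was unstacked.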

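【Discussion】:

Indeed, if I read the file into a pandas dataframe, then only the diagonal vector would be populated, but if the hash-table is pickled as a dict it might not be as convenient. It's still a great way to read the csv =)
Ah, here's the catch: the headers aren't defined in the final probability matrix, which makes it non-queryable =(
@alvas You can use the .loc function to query cells in the DataFrame by label. I'll edit my answer above to demonstrate this.
.values is preferred over .as_matrix()
Something is a bit odd when I try this with my dataset: when I use df.loc, it keeps saying my words are not in the index.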