Match strings from one numpy array to another - Answers

【Question title】: Match strings from one numpy array to another
【Posted】: 2018-02-21 18:37:05
【Question】:

Hello, I am working with Python 3 and have been stuck on this problem for a while; I can't seem to figure it out.

I have 2 numpy arrays containing strings.

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

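If you look closely, array_one is actually an array containing the 1-grams, 2-grams, 3-grams, 4-grams and 5-grams of the sentence alice in a wonder land.

I have deliberately written wonderland as two words, wonder and land.

Now I have another numpy array that contains some locations and names.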

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

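What I want to do is get all the elements of array_one that exist in array_two.

If I take the intersection of the two arrays using np.intersect1d, I get no matches, because wonderland is two separate words in array_one but a single word in array_two.

Is there a way to do this? I have tried the solutions from Stack Overflow (this), but they don't seem to work with Python 3.

array_one will have at most 60-100 items, while array_two will have at most roughly 1 million items, with 250,000 - 500,000 items on average.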

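EDIT

I have used a very naive approach, since I haven't been able to find a solution so far: I removed the white space from both arrays and then used the resulting boolean array ([True, False, True]) to filter the original array. Below is the code: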

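import numpy as np

# strip spaces so that 'wonder land' and 'wonderland' compare equal
# (np.char is the public alias of numpy.core.defchararray)
array_two_wr = np.char.replace(array_two, ' ', '')
array_one_wr = np.char.replace(array_one, ' ', '')
# keep the original array_two entries whose stripped form occurs in array_one
intersections = array_two[np.isin(array_two_wr, array_one_wr)]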

But given the number of elements in array_two, I'm not sure this is the way to go.
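
On the scale concern, the same whitespace-stripping idea can also be written with a plain dictionary that is built once from array_two, so each of the 60-100 queries becomes a single hash lookup. This is only a minimal sketch: the helper names strip_ws and build_lookup are made up for illustration, plain lists stand in for the numpy arrays, and it has not been measured on a million entries.

```python
import re
from collections import defaultdict

def strip_ws(s):
    """Remove all whitespace so 'wonder land' and 'wonderland' compare equal."""
    return re.sub(r"\s+", "", s)

def build_lookup(names):
    """Map each whitespace-free key to the original strings that produce it."""
    lookup = defaultdict(list)
    for name in names:
        lookup[strip_ws(name)].append(name)
    return lookup

# shortened sample data; plain lists stand in for the numpy arrays
array_one = ['alice', 'in', 'a', 'wonder', 'land', 'wonder land', 'a wonder land']
array_two = ['new york', 'las vegas', 'wonderland', 'florida']

lookup = build_lookup(array_two)  # built once, reused for every query
matches = {q: lookup[strip_ws(q)] for q in array_one if strip_ws(q) in lookup}
print(matches)
```

Like the np.in1d version, this finds only exact matches after whitespace removal; it does nothing for misspellings or partial matches.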

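【Comments】:

Could you try using the Levenshtein distance? en.wikipedia.org/wiki/Levenshtein_distance
@EspoirMurhabazi I thought of Levenshtein distance and cosine string matching. The first problem is how to implement them without using two for loops. The second is that I need something that can handle white space, since a Levenshtein distance of 1 would also match block A with block B, and cosine would match them at 0.90.
Maybe you could use locality-sensitive hashing, as discussed in this SO question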
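
The white-space concern raised in the comments can be made concrete with a small edit-distance function. The sketch below is a textbook dynamic-programming Levenshtein distance written for illustration, not code from the thread; it shows that the wanted pair and the unwanted pair sit at the same distance, so a distance cutoff alone cannot tell them apart.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein('wonder land', 'wonderland'))  # the pair that should match
print(levenshtein('block A', 'block B'))         # a pair that should not
```

Both calls return 1. The function also still needs one call per pair of strings, which is the double-loop cost mentioned above.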

Tags: python numpy

【Solution 1】:

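Sorry to post two answers, but after adding my other answer on locality-sensitive hashing, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.

A bloom filter is a nifty object that lets you pass in some objects and then query whether a given object has been added to the bloom filter. Here's an awesome visual demo of a bloom filter.

In your case, we can add each member of array_two to the bloom filter, then query each member of array_one to see if it's in the bloom filter. Using pip install bloom-filter: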

from bloom_filter import BloomFilter # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Clean a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize bloom filter with particular size
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each member of array_two to bloom filter
for i in array_two:
  bloom.add(clean(i))
# find the members in array_one in array_two
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements, this could be a very efficient (and highly scalable) solution.
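
To show what such a filter does internally, here is a toy bloom filter built only on the standard library. It is a sketch for illustration and makes no claim about how the bloom-filter package is implemented; the class name TinyBloom and its parameters are invented here. The key property is that a membership test can return a false positive but never a false negative, which is what a setting like error_rate=0.1 trades against memory.

```python
import hashlib

class TinyBloom:
    """Toy bloom filter: a bit array plus k salted hashes. Illustration only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode("utf8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly added"; False means "definitely not added".
        return all(self.bits[pos] for pos in self._positions(item))

bloom = TinyBloom()
for name in ['newyork', 'lasvegas', 'wonderland', 'florida']:
    bloom.add(name)

print('wonderland' in bloom)
```

Adding the same string twice sets the same bits again, so duplicates leave this toy filter unchanged.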

【Discussion】:

You set max_elements=10000; is there any significance to that? Can I set it to 1 million?
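Yes, you should set the max_elements parameter to the number of elements you intend to pass in (you may need decent hardware to do everything in main memory). Bloom filters are simpler than LSH but far less intelligent; looking at your data will help you decide what's best...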
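Awesome. I have one last query: how does BloomFilter handle duplicates? I'm sorry if my comments are a bit naive, as I've only just been introduced to these techniques. Also, why doesn't it match anything when the strings are short? For example, it doesn't match a c c road to acc road, and even when I put in acc road it still doesn't match. I ran it a few times and then it started matching all the cases mentioned. Is there something I'm missing?
@iam.Carrot Check out the visual demo I linked above. Your acc road example works as expected on my machine (i.e. it finds the match). Duplicate entries should have no effect on the bloom filter. However, bloom filters won't handle substring matching; that's what more abstract methods like LSH are for. If you just need to find the intersection of the original sets, a bloom filter will be fast, but if you need something fuzzier, the LSH technique is more flexible. I hope this helps!
No, it does match, but I had to run the code a few times before it matched. I thought I was missing something, maybe something trivial. Anyway, you've been a great help. Thanks a lot.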
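
On the max_elements question above, the usual textbook sizing formulas give a feel for the memory involved. The sketch below applies those formulas directly; the function name bloom_size is made up for illustration, and the bloom-filter package may size its bit array differently.

```python
import math

def bloom_size(max_elements, error_rate):
    """Textbook optimal bit count and hash count for a bloom filter."""
    bits = math.ceil(-max_elements * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / max_elements * math.log(2)))
    return bits, hashes

for n in (10_000, 1_000_000):
    bits, hashes = bloom_size(n, 0.1)
    print(n, bits, hashes, f"{bits / 8 / 1e6:.2f} MB")
```

By these formulas, one million elements at a 10% error rate need about 4.8 million bits, which is roughly 0.6 MB. A lower error rate increases the size only logarithmically.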

【Solution 2】:

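Minhashing could definitely be used here. Here's the very general idea behind minhashing: for each object in a list, hash the object multiple times, and update an object that keeps track of the hashes computed for each list member. Then examine the set of resulting hashes, and for each hash find all the objects for which that hash was computed (we just stored this data). If the hash function is chosen carefully, objects that produce the same hash will be very similar.

For a more detailed explanation of minhashing, see Chapter 3 of Mining Massive Datasets.

Here's a sample Python 3 implementation using your data and datasketch (pip install datasketch), which computes the hashes: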
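
The idea can be sketched from scratch with the standard library. This is a simplified illustration rather than how datasketch implements it: the salted SHA-1 hashes stand in for random permutations, and all function names are invented here. The fraction of positions where two signatures agree approximates the Jaccard similarity of the underlying 3-character shingle sets.

```python
import hashlib

def shingles(s, n=3):
    """Set of overlapping n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def _hash(seed, item):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_perm=128):
    """For each seeded hash function keep the minimum value (items must be non-empty)."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree; approximates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a, b = shingles('wonder land'), shingles('wonderland')
print(jaccard(a, b))  # exact: 6 shared shingles out of 11 distinct ones
print(estimate_similarity(minhash_signature(a), minhash_signature(b)))  # approximation
```

The exact similarity here is 6/11, about 0.55. The estimate gets closer to the exact value as num_perm grows, at the cost of more hashing.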

import numpy as np
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

def build_minhash(s):
  '''Given a string `s` build and return a minhash for that string'''
  new_minhash = MinHash(num_perm=256)
  # hash each 3-character gram in `s`
  for chargram in ngrams(s, 3):
    new_minhash.update(''.join(chargram).encode('utf8'))
  return new_minhash

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# create a structure that lets us query for similar minhashes
lsh = MinHashLSH(threshold=0.3, num_perm=256)

# loop over the index and value of each member in array two
for idx, i in enumerate(array_two):
  # add the minhash to the lsh index
  lsh.insert(idx, build_minhash(i))

# find the items in array_one with 1+ matches in array_two
for i in array_one:
  result = lsh.query(build_minhash(i))
  if result:
    matches = ', '.join([array_two[j] for j in result])
    print(' *', i, '--', matches)

Results (array_one members on the left, array_two matches on the right):

 * wonder -- wonderland
 * a wonder -- wonderland
 * wonder land -- wonderland
 * a wonder land -- wonderland
 * in a wonder land -- wonderland
 * alice in a wonder land -- wonderland

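The easiest way to tune precision/recall here is to change the threshold argument to MinHashLSH. You can also try modifying the hashing technique itself. Here I used 3-character hashes when building the minhash for each ngram, a technique the Yale Digital Humanities Lab found quite powerful at capturing text similarity: https://github.com/YaleDHLab/intertext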
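
To get a feel for what the threshold means, the exact Jaccard similarity on 3-character shingles can be computed directly for this small example. The sketch below is an illustration with made-up helper names; it applies an exact cutoff, whereas MinHashLSH is approximate, so the two can disagree on items whose similarity sits near the threshold.

```python
def shingles(s, n=3):
    """Set of overlapping n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

array_one = ['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
             'a wonder', 'wonder land', 'alice in a', 'in a wonder',
             'a wonder land', 'alice in a wonder', 'in a wonder land',
             'alice in a wonder land']
target = 'wonderland'

def matches_at(threshold):
    """array_one members whose exact similarity to target meets the threshold."""
    return [s for s in array_one
            if jaccard(shingles(s), shingles(target)) >= threshold]

for threshold in (0.3, 0.5):
    print(threshold, matches_at(threshold))
```

At 0.3 the exact filter keeps six strings, and at 0.5 only wonder and wonder land remain. The exact list at 0.3 is not identical to the LSH output above: in a wonder has an exact similarity of about 0.31 and alice in a wonder land about 0.27, which is the kind of borderline case where an approximate index can land on either side.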

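【Discussion】:

I'm having some trouble using this code (yes, the problem statement is as you described, and this seems to be the more viable solution). The problem is that array_one can change based on the input, but array_two stays the same in all cases. I had trouble creating a combined numpy.array, since I figured I would process array_two into an lsh object and store it as a pickle.
@iam.Carrot I just updated the above so array_two is static and array_one can be dynamic. When array_one updates, just build minhashes for the new elements and query the LSH index, which now contains only the array_two elements. Does that make sense?
Ah! I had almost the same code (minor differences). I'll test it shortly.
It works well. Thanks for the snippet; I had mistakenly put array_one instead of array_two in join([array_two[j] for j in result]). I have two last questions: 1. How much can MinHash hold? Can it hold, say, 1 million or even 10 million? 2. If I have to apply Cosine or Levenshtein or even Jaro-Winkler, does LSH have a provision for them, or do I do that as post-processing?
Perfect. Thank you so much for taking the time to help me. I really appreciate it.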
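
The thread does not answer the post-processing question, but one common pattern is to treat the LSH result as a candidate list and re-score it with a stricter string measure. The sketch below assumes that approach and uses difflib.SequenceMatcher from the standard library as a stand-in for Levenshtein or Jaro-Winkler, which would come from other packages; the function name rerank, the min_ratio default and the candidate list are all hypothetical.

```python
from difflib import SequenceMatcher

def rerank(query, candidates, min_ratio=0.8):
    """Score candidates against query; keep those at or above min_ratio, best first."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    return sorted([(r, c) for r, c in scored if r >= min_ratio], reverse=True)

# stand-in for the strings looked up from an lsh.query(...) result
candidates = ['wonderland', 'las vegas']
print(rerank('wonder land', candidates))
```

Because only the handful of candidates returned per query is re-scored, the pairwise cost stays small even when array_two is large.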