This does what you want:
import numpy as np
arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}
# get the actual counts of values in arr1
counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
# determine what values to keep, as their count matches the desired count
keep = [x for x in d if x in counts and d[x] == counts[x]]
# filter down the array
result = arr1[list(map(lambda x: x[1] in keep, arr1))]
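There is quite possibly a more optimised way to do this in numpy, but I don't know how large the collection you're applying this to is, or how often you need to run it, so I can't say whether it's worth looking for.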
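For reference, here is a minimal sketch of what a more numpy-oriented version might look like. It is an untimed illustration rather than a tuned implementation. The function name `filter_by_count` and the sentinel value `-1` are my own choices, and it assumes every desired count in the dict is non-negative.

```python
import numpy as np


def filter_by_count(arr, wanted):
    """Keep rows whose label (column 1) occurs exactly wanted[label] times."""
    # labels: sorted unique labels
    # inverse: for each row, the index of its label within `labels`
    # counts: how often each label occurs
    labels, inverse, counts = np.unique(
        arr[:, 1], return_inverse=True, return_counts=True
    )
    # Desired count per label; -1 (assumed sentinel) for labels that are
    # missing from `wanted`, so they can never match a real count.
    target = np.array([wanted.get(label, -1) for label in labels.tolist()])
    # Boolean per unique label, expanded back to one boolean per row.
    keep_label = counts == target
    return arr[keep_label[inverse]]


arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}
result = filter_by_count(arr1, d)
```

On the example data this keeps the `x` and `z` rows and drops the `y` rows, since `y` occurs twice but `d` asks for one.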
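Edit: note that you need to scale things up to decide what a good solution is. Your original solution is great for the toy example, where it outperforms both answers. However, if you scale up to what may be a more realistic workload, the numpy solution provided by @NewbieAF handily beats the others: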
from random import randint
from timeit import timeit
import numpy as np


def original(arr1, d):
    return [[x1, x2] for x1, x2 in arr1 if np.count_nonzero(arr1 == x2) == d[x2]]


def f1(arr1, d):
    # get the actual counts of values in arr1
    counts = dict(zip(*np.unique(arr1[:, 1], return_counts=True)))
    # determine what values to keep, as their count matches the desired count
    keep = [x for x in d if x in counts and d[x] == counts[x]]
    # filter down the array
    return arr1[list(map(lambda x: x[1] in keep, arr1))]


def f2(arr1, d):
    # create arrays from d
    keys, vals = np.array(list(d.keys())), np.array(list(d.values()))
    # count the unique elements in arr1[:,1]
    unqs, cts = np.unique(arr1[:, 1], return_counts=True)
    # only keep track of elements that appear in arr1
    mask = np.isin(keys, unqs)
    keys, vals = keys[mask], vals[mask]
    # sort the unique values and corresponding counts according to keys
    idx1 = np.argsort(np.argsort(keys))
    idx2 = np.argsort(unqs)
    unqs, cts = unqs[idx2][idx1], cts[idx2][idx1]
    # filter values by whether the counts match
    correct = unqs[vals == cts]
    return arr1[np.isin(arr1[:, 1], correct)]


def main():
    # toy example: 5 rows, timed over 10000 runs
    arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
    d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}
    print(timeit(lambda: original(arr1, d), number=10000))
    print(timeit(lambda: f1(arr1, d), number=10000))
    print(timeit(lambda: f2(arr1, d), number=10000))
    # larger example: 10000 distinct values, each occurring 1 to 3 times, timed over 10 runs
    counts = [randint(1, 3) for _ in range(10000)]
    arr1 = np.array([['x', f'{n}'] for n in range(10000) for _ in range(counts[n])])
    d = {f'{n}': randint(1, 3) for n in range(10000)}
    print(timeit(lambda: original(arr1, d), number=10))
    print(timeit(lambda: f1(arr1, d), number=10))
    print(timeit(lambda: f2(arr1, d), number=10))


main()
Result:
0.14045359999999998
0.2402685
0.5027185999999999
46.7569239
5.893172499999999
0.08729539999999503
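The numpy solution (f2) is slower on the toy example, but on the large input it is roughly 70 times faster than f1 and over 500 times faster than the original. Your solution looks fine, but when scaling up it loses to the non-numpy solution (f1), which avoids the extra calls by counting each value once instead of once per row.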
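One thing that probably holds f1 back at scale is that `keep` is a list, so every `x[1] in keep` test is a linear scan. A hedged variant that swaps in a set is sketched below. The name `f1_set` is my own, and I have not timed it, so treat any speed-up as an expectation rather than a measurement.

```python
import numpy as np


def f1_set(arr1, d):
    # count each distinct label once, as plain Python str -> int
    values, value_counts = np.unique(arr1[:, 1], return_counts=True)
    counts = dict(zip(values.tolist(), value_counts.tolist()))
    # a set gives constant-time membership tests on average (a list scans linearly)
    keep = {x for x in d if counts.get(x) == d[x]}
    # one boolean per row, then boolean-mask indexing
    mask = np.array([label in keep for label in arr1[:, 1].tolist()], dtype=bool)
    return arr1[mask]


arr1 = np.array([['a1', 'x'], ['a2', 'x'], ['a3', 'y'], ['a4', 'y'], ['a5', 'z']])
d = {'x': 2, 'z': 1, 'y': 1, 'w': 2}
result = f1_set(arr1, d)
```

It keeps the readable three-step structure of f1 (count, decide what to keep, filter), so it may be a reasonable middle ground if the all-numpy version feels too opaque.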
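Consider the scale of your problem. If it is small, you should stick with your own solution, for readability. If it is medium-sized, you might pick mine for the better performance. If it is large (whether in size or in how often it runs), you should go with the all-numpy solution, trading readability for speed.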