这可以表述为所有记录之间的成对汉明距离计算,分离出低于某个阈值的后续对。幸运的是,numpy/scipy/sklearn 已经完成了繁重的工作。我包含了两个产生相同输出的函数——一个是完全矢量化的(但消耗 O(N^2) 内存),另一个是消耗 O(N) 内存但仅沿单个维度矢量化的函数。在您的规模下,您几乎可以肯定不想要完全矢量化的版本 - 它可能会导致 OOM 错误。在这两种情况下,基本算法如下:
- 将每个特征值编码为整数值(感谢 sklearn!)
- 对于所有行对,计算汉明距离(不同值的总和)
- 如果在
threshold 或汉明距离以下找到两行,则丢弃后者,直到没有行低于该阈值
代码:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from scipy.spatial.distance import pdist, squareform
import numpy as np
def dedupe_fully_vectorized(df, threshold=1):
"""
fully vectorized memory hog version - best not to use for n > 10k
"""
# convert field data to integers
enc = OrdinalEncoder()
X = enc.fit_transform(df.to_numpy())
# calc the (unnormalized) hamming distance for all row pairs
d = pdist(X, metric="hamming") * df.shape[1]
s = squareform(d)
# s contains all pairs (j,k) and (k,j); exclude all pairs j < k as "duplicates"
s[np.triu_indices_from(s)] = -1
dupe_pair_matrix = (0 <= s) * (s <= threshold)
df_dupes = df[np.any(dupe_pair_matrix, axis=1)]
df_deduped = df.drop(df_dupes.index).sort_index()
return (df_deduped, df_dupes)
def dedupe_partially_vectorized(df, threshold=1):
"""
- Iterate through each row starting from the last; examine all previous rows for duplicates.
- If found, it is appended to a list of duplicate indices.
"""
# convert field data to integers
enc = OrdinalEncoder()
X = enc.fit_transform(df.to_numpy())
"""
- loop through each row, starting from last
- for each `row`, calculate hamming distance to all previous rows
- if any such distance is `threshold` or less, mark `idx` as duplicate
- loop ends at 2nd row (1st is by definition not a duplicate)
"""
dupe_idx = []
for j in range(len(X) - 1):
idx = len(X) - j - 1
row = X[idx]
prev_rows = X[0:idx]
dists = np.sum(row != prev_rows, axis=1)
if min(dists) <= threshold:
dupe_idx.append(idx)
dupe_idx = sorted(dupe_idx)
df_dupes = df.iloc[dupe_idx]
df_deduped = df.drop(dupe_idx)
return (df_deduped, df_dupes)
现在让我们测试一下。首先进行健全性检查:
df = pd.DataFrame(
[
["john", "doe", "m", 23],
["john", "dupe", "m", 23],
["jane", "doe", "f", 29],
["jane", "dole", "f", 28],
["jon", "dupe", "m", 23],
["tom", "donald", "m", 12],
["john", "dupe", "m", 65],
],
columns=["first", "last", "s", "age"],
)
(df_deduped_fv, df_dupes_fv) = dedupe_fully_vectorized(df)
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)
df_deduped_fv == df_deduped # True
# df_deduped
# first last s age
# 0 john doe m 23
# 2 jane doe f 29
# 3 jane dole f 28
# 5 tom donald m 12
# df_dupes
# first last s age
# 1 john dupe m 23
# 4 jon dupe m 23
# 6 john dupe m 65
我已经在最多约 40k 行的数据帧上对此进行了测试(如下所示),它似乎有效(两种方法给出了相同的结果),但可能需要几秒钟。我还没有尝试过你的规模,但它可能会很慢:
arr = np.array("abcdefgh")
df = pd.DataFrame(np.random.choice(arr, (40000, 15))
# (df_deduped, df_dupes) = dedupe_partially_vectorized(df)
如果您可以避免进行所有成对比较,例如按名称分组,那将显着提高性能。
有趣的一面/方法问题
您可能会注意到您可以得到有趣的“汉明链”(我不知道这是不是一个术语),其中非常不同的记录通过一键编辑差异记录链连接:
df_bad_news = pd.DataFrame(
[
["john", "doe", "m", 88],
["jon", "doe", "m", 88],
["jan", "doe", "m", 88],
["jane", "doe", "m", 88],
["jane", "doe", "m", 12],
],
columns=["first", "last", "s", "age"],
)
(df_deduped, df_dupes) = dedupe(df)
# df_deduped
# first last s age
# 0 john doe m 88
# df_dupes
# first last s age
# 1 jon doe m 88
# 2 jan doe m 88
# 3 jane doe m 88
# 4 jane doe m 12
如果有一个可以分组的字段,性能将大大提高(在 cmets 中提到 name 应该是相同的)。这里的成对计算是内存中的 n^2 。可以根据需要用一些时间效率换取内存效率。