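You can group the items by first letter and search only within each sublist. No string can start with another string unless the two share at least the same first letter: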
from collections import defaultdict

def find(l):
    d = defaultdict(list)
    # group by first letter
    for ele in l:
        d[ele[0]].append(ele)
    for val in d.values():
        for v in val:
            # check each substring in the sublist
            if not any(v.startswith(s) and v != s for s in val):
                yield v
print(list(find(l)))
['sdfdg', 'xc', 'ab']
This filters the words correctly. As the session below shows, the reduce approach does not: 'tool' should not appear in its output:
In [56]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew"]
In [57]: reduce(r,l)
Out[57]: ['tool', 'ab', 'too', 'xc', 'sdfdg', 'xrew']
In [58]: list(find(l))
Out[58]: ['sdfdg', 'too', 'xc', 'xrew', 'ab']
It is also efficient:
In [59]: l = ["".join(sample(ascii_lowercase, randint(2,25))) for _ in range(5000)]
In [60]: timeit reduce(r,l)
1 loops, best of 3: 2.12 s per loop
In [61]: timeit list(find(l))
1 loops, best of 3: 203 ms per loop
In [66]: %%timeit
....: result = []
....: for element in lst:
....:     is_prefixed = False
....:     for possible_prefix in lst:
....:         if element is not possible_prefix and element.startswith(possible_prefix):
....:             is_prefixed = True
....:             break
....:     if not is_prefixed:
....:         result.append(element)
....:
1 loops, best of 3: 4.39 s per loop
In [92]: timeit list(my_filter(l))
1 loops, best of 3: 2.94 s per loop
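If you want to rerun this kind of comparison outside IPython, here is a minimal sketch using the standard timeit module. The names make_words and naive are my own, naive uses the same v != s test as find so the two are directly comparable, and the word count is kept small so it finishes quickly. Absolute timings will differ from the ones above.

```python
from collections import defaultdict
from random import Random
from string import ascii_lowercase
from timeit import timeit


def make_words(n, seed=0):
    # hypothetical helper: n random strings of 2-25 distinct lowercase letters
    rnd = Random(seed)
    return ["".join(rnd.sample(ascii_lowercase, rnd.randint(2, 25)))
            for _ in range(n)]


def find(l):
    d = defaultdict(list)
    # group by first letter
    for ele in l:
        d[ele[0]].append(ele)
    for val in d.values():
        for v in val:
            if not any(v.startswith(s) and v != s for s in val):
                yield v


def naive(l):
    # compare every word against the whole list
    return [v for v in l if not any(v.startswith(s) and v != s for s in l)]


words = make_words(500)
print("grouped:", timeit(lambda: list(find(words)), number=3))
print("naive:  ", timeit(lambda: naive(words), number=3))
```

Both functions keep the same words (possibly in a different order), so only the running time should differ.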
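If you know the minimum string length is always > 1, you can optimise further. If the shortest string has length 2, a word can only be prefixed by another word that shares at least its first two letters: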
def find(l):
    d = defaultdict(list)
    # find shortest length string to use as key length
    mn = len(min(l, key=len))
    for ele in l:
        d[ele[:mn]].append(ele)
    for val in d.values():
        for v in val:
            if not any(v.startswith(s) and v != s for s in val):
                yield v
In [84]: timeit list(find(l))
100 loops, best of 3: 14.6 ms per loop
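To see why the longer key helps, this small sketch prints the buckets produced by a one-letter key and by a key as long as the shortest word. The buckets helper is hypothetical and only mirrors the grouping step of find. How much this saves depends on the data, since a single one-letter word brings the key length back to 1.

```python
from collections import defaultdict

words = ["tool", "ab", "too", "xc", "abb", "toot", "abed", "abel",
         "sdfdg", "abfdsdg", "xccc", "xcew", "xrew"]


def buckets(l, key_len):
    # hypothetical helper: group words by their first key_len characters
    d = defaultdict(list)
    for ele in l:
        d[ele[:key_len]].append(ele)
    return dict(d)


mn = len(min(words, key=len))  # 2 for this list
by_letter = buckets(words, 1)
by_prefix = buckets(words, mn)

print(by_letter["x"])   # ['xc', 'xccc', 'xcew', 'xrew']
print(by_prefix["xc"])  # ['xc', 'xccc', 'xcew']
print(by_prefix["xr"])  # ['xrew']
```

With the two-letter key, 'xrew' is never compared against the 'xc' words at all, which is where the saving comes from.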
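Lastly, if you have dupes, you may want to filter the duplicated words out of the result, but you need to keep them in the list for the comparison: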
from collections import defaultdict,Counter
def find(l):
    d = defaultdict(list)
    mn = len(min(l, key=len))
    cn = Counter(l)
    for ele in l:
        d[ele[:mn]].append(ele)
    for val in d.values():
        for v in val:
            if not any(v.startswith(s) and v != s for s in val):
                # make sure v is not a dupe
                if cn[v] == 1:
                    yield v
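So if speed matters, an implementation using some variation of the code above will be significantly faster than your naive approach. It also keeps more data in memory, so you should take that into account as well.
To save memory, we can create a Counter for each val/sublist, so only a single counter dict is stored at a time: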
def find(l):
    d = defaultdict(list)
    mn = len(min(l, key=len))
    for ele in l:
        d[ele[:mn]].append(ele)
    for val in d.values():
        # we only need check each grouping of words for dupes
        cn = Counter(val)
        for v in val:
            if not any(v.startswith(s) and v != s for s in val):
                if cn[v] == 1:
                    yield v
Creating a dict on each loop adds about 5 ms, so it is still fast for 5k words.
The reduce approach should work if the data is sorted:
reduce(r,sorted(l)) # -> ['ab', 'sdfdg', 'too', 'xc', 'xrew']
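The function r passed to reduce is not shown in this answer. As a rough sketch of why sorting helps, and not a reconstruction of r, the hypothetical keep_shortest below relies on the fact that after sorting, every word starting with a given prefix sits directly after that prefix, so each word only needs to be compared with the last word kept:

```python
def keep_shortest(l):
    # hypothetical helper, not the r used with reduce above
    result = []
    for word in sorted(l):
        # in sorted order, words sharing a prefix follow it contiguously,
        # so the last kept word is the only possible prefix to check
        if result and word.startswith(result[-1]):
            continue
        result.append(word)
    return result


l = ["tool", "ab", "too", "xc", "abb", "toot", "abed", "abel",
     "sdfdg", "abfdsdg", "xccc", "xcew", "xrew"]
print(keep_shortest(l))  # ['ab', 'sdfdg', 'too', 'xc', 'xrew']
```

Note that in this sketch a repeated word collapses to a single copy, because the second copy starts with the first.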
To make the difference in behaviour clear:
In [202]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',
"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew","ab"]
In [203]: list(filter_list(l))
Out[203]: ['ab', 'too', 'xc', 'sdfdg', 'xrew', 'ab']
In [204]: list(find(l))
Out[204]: ['sdfdg', 'too', 'xc', 'xrew', 'ab', 'ab']
In [205]: reduce(r,sorted(l))
Out[205]: ['ab', 'sdfdg', 'too', 'xc', 'xrew']
In [206]: list(find_dupe(l))
Out[206]: ['too', 'xrew', 'xc', 'sdfdg']
In [207]: list(my_filter(l))
Out[207]: ['sdfdg', 'xrew', 'too', 'xc']
In [208]: "ab".startswith("ab")
Out[208]: True
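So ab is repeated twice. Using a set or a dict without keeping track of how many times ab appears would mean we consider that no other element satisfies the condition "ab".startswith(other) == True for ab, which we can see is incorrect.
You can also use itertools.groupby to group on the minimum-length prefix: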
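As a minimal illustration of that point, the sketch below keeps the counts in a Counter and treats a second copy of the same word as a prefix of the first. The helper name has_prefix_elsewhere is my own and is not part of any of the functions compared above.

```python
from collections import Counter

l = ["tool", "ab", "too", "xc", "abb", "toot", "abed", "abel",
     "sdfdg", "abfdsdg", "xccc", "xcew", "xrew", "ab"]
counts = Counter(l)


def has_prefix_elsewhere(word):
    # hypothetical helper: another word is a prefix of `word`,
    # or `word` itself occurs more than once in the list
    return any(word.startswith(other) and (other != word or n > 1)
               for other, n in counts.items())


print(counts["ab"])  # 2
print([w for w in l if not has_prefix_elsewhere(w)])
# ['too', 'xc', 'sdfdg', 'xrew']
```

A plain set(l) would hold "ab" only once, so the information that a second "ab" exists would be lost.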
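from collections import Counter
from itertools import groupby

def find_dupe(l):
    # groupby only merges adjacent items, so sort first (this sorts l in place)
    l.sort()
    mn = len(min(l, key=len))
    for k, val in groupby(l, key=lambda x: x[:mn]):
        val = list(val)
        # count once per group rather than once per word
        cn = Counter(val)
        for v in val:
            if not any(v.startswith(s) and v != s for s in val) and cn[v] == 1:
                yield v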
Based on your comments, we can adjust my first code if you don't think "dd".startswith("dd") should be True for repeated elements:
l = ['abbb', 'xc', 'abb', 'abed', 'sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd','sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd']
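def find_with_dupe(l):
    d = defaultdict(list)
    # drop repeats up front so a word is never compared with its own copy
    srt = sorted(set(l))
    # use the length of the shortest string as the key length
    # (srt[0] is the smallest string alphabetically, not the shortest)
    ind = len(min(srt, key=len))
    for ele in srt:
        d[ele[:ind]].append(ele)
    for val in d.values():
        for v in val:
            # check each substring in the sublist
            if not any(v.startswith(s) and v != s for s in val):
                yield v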
print(list(find_with_dupe(l)))
['abfdsdg', 'abed', 'abb', 'd', 'sdfdg', 'xc']
On a random sample of text, it runs in a fraction of the time your own code takes:
In [15]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [16]: timeit list(find(l))
100 loops, best of 3: 19 ms per loop
In [17]: %%timeit
....: l = open("/home/padraic/Downloads/sample.txt").read().split()
....: for i in range(0, len(l) - 1):
....:     for j in range(i + 1, len(l)):
....:         if l[j].startswith(l[i]):
....:             l[j] = l[i]
....:         else:
....:             if l[i].startswith(l[j]):
....:                 l[i] = l[j]
....:
1 loops, best of 3: 4.92 s per loop
Both return the same output:
In [41]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [42]: for i in range(0, len(l) - 1):
....:     for j in range(i + 1, len(l)):
....:         if l[j].startswith(l[i]):
....:             l[j] = l[i]
....:         else:
....:             if l[i].startswith(l[j]):
....:                 l[i] = l[j]
....:
In [43]: l2 = open("/home/padraic/Downloads/sample.txt").read().split()
In [44]: sorted(set(l)) == sorted(find(l2))
Out[44]: True