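Limitations
If you refuse to use a dictionary, your algorithm will require a lot of computation. Beyond that, it is impossible to distinguish a keyword that occurs only once (e.g. "karl") from a junk sequence (e.g. "e2bo"). My solution is best-effort and only works if your list of URLs contains the keywords multiple times.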
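To make the second limitation concrete, here is a minimal sketch that counts how often a candidate sequence occurs in the sample list used later in this answer. The helper name `occurrences` is made up for this illustration and is not part of the solution itself.

```python
sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe",
             "bobbyisawesome", "david", "bobbyjoe"]

def occurrences(candidate, sentences):
    """Count (possibly overlapping) occurrences of candidate across all strings."""
    total = 0
    for sentence in sentences:
        for i in range(len(sentence) - len(candidate) + 1):
            if sentence[i:i + len(candidate)] == candidate:
                total += 1
    return total

# A real name and a junk sequence are indistinguishable by frequency alone:
print(occurrences("karl", sentences))  # occurs once
print(occurrences("e2bo", sentences))  # also occurs once
print(occurrences("bob", sentences))   # a repeated keyword stands out
```

Since "karl" and "e2bo" both occur exactly once, frequency gives no basis for keeping one and dropping the other.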
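The basic idea
I assume that a word is a frequently occurring sequence of at least 3 characters. This prevents a single letter such as "o" from being the most popular word.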
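The effect of the minimum length can be checked with a short sketch. It assumes the sample list used later in this answer, and `ngram_counts` is a hypothetical helper built on `collections.Counter`, not a function from the solution.

```python
from collections import Counter

sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe",
             "bobbyisawesome", "david", "bobbyjoe"]

def ngram_counts(sentences, n):
    """Count every character sequence of length n over all strings."""
    counter = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counter[sentence[i:i + n]] += 1
    return counter

top_char, top_char_count = ngram_counts(sentences, 1).most_common(1)[0]
top_word, top_word_count = ngram_counts(sentences, 3).most_common(1)[0]

# Without a minimum length, a single character out-counts every real word.
print(top_char, top_char_count)
print(top_word, top_word_count)
```

With a minimum length of 1, the winner is a single letter; with a minimum length of 3, the top entry is an actual keyword.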
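The basic idea is as follows.
- Count all n-letter sequences and keep those that occur more than once.
- Discard every sequence that is part of a longer sequence (the code below does this only when both occur equally often).
- Sort them by popularity and you have a solution that comes close to solving your problem. (Left as an exercise to the reader.)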
In code
sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe", "bobbyisawesome", "david", "bobbyjoe"]
counts = {}  # maps each character sequence to its number of occurrences

def countWords(n):
    """Count all possible character sequences/words of length n occurring in all given sentences."""
    for sentence in sentences:
        countWordsSentence(sentence, n)

def countWordsSentence(sentence, n):
    """Count all possible character sequences/words of length n occurring in a sentence."""
    for i in range(len(sentence) - n + 1):
        word = sentence[i:i + n]
        counts[word] = counts.get(word, 0) + 1

def cropDictionary():
    """Remove all words that occur only once."""
    # Iterate over a copy of the keys: deleting while iterating the dict itself raises RuntimeError.
    for key in list(counts):
        if counts[key] == 1:
            del counts[key]

def removePartials(word):
    """Remove every partial (3+ characters) of the given word that occurs exactly as often as the word."""
    if word not in counts:  # already removed as a partial of a longer word
        return
    for i in range(3, len(word)):
        for j in range(len(word) - i + 1):
            part = word[j:j + i]
            if counts.get(part) == counts[word]:
                del counts[part]

def removeAllPartials():
    """Remove all partial words from the dictionary."""
    for word in list(counts):
        removePartials(word)

# Count sequences of every length from 3 up to the length of the longest sentence.
for n in range(3, max(len(s) for s in sentences) + 1):
    countWords(n)
cropDictionary()
removeAllPartials()
print(counts)
Output
>>> print(counts)
{'bob': 6, 'joe': 5, 'mike': 3, 'david': 2, 'bobby': 2}
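The sorting step that was left as an exercise could look roughly like this. The sketch starts from the result dictionary shown in the output above, copied here as a literal named `counts`, and ranks the words by their number of occurrences; it is one possible approach, not the only one.

```python
import operator

# Result of the counting and pruning steps, copied from the output above.
counts = {'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}

# Sort (word, count) pairs by count, most popular first.
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)

for word, count in ranked:
    print(word, count)
```

Words with equal counts keep their relative order from the dictionary, because Python's sort is stable.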
Some challenges for the reader