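Greedy algorithm
Your first idea, testing every possible combination of sentences, is too slow. If you have n sentences, there are 2**n (2 to the power n) possible combinations of sentences. For instance, with n=1000 there are 2**1000 ≈ 10**301 possible combinations. That is a 1 followed by 301 zeros: more than the number of particles in the universe, and more than the number of possible chess games!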
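The size of that number is easy to check, since Python integers have arbitrary precision. This is only a quick sanity check of the magnitude; the variable names are mine.

```python
n = 1000
combinations = 2 ** n  # each sentence is either in or out of the selection

# Number of decimal digits of 2**1000
print(len(str(combinations)))  # 302
```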
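Here is a suggestion for a greedy algorithm. It is not particularly optimised, and runs in O(k * n**2), where n is the number of sentences and k is the length of the longest sentence.
The idea is the following:
- Attribute to every sentence the score number of useful characters - number of superfluous characters. For instance, if a sentence contains 20 'a' and the target requires only 15 'a', we count 15 useful 'a' and 5 superfluous 'a', so the character 'a' contributes 10 to the score of that sentence.
- Add the sentence with the highest score to the result;
- Update the target to remove the characters that are already in the result;
- Update the score of every sentence to reflect the updated target;
- Loop until no sentence has a positive score.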
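To make the scoring rule concrete, here is a minimal sketch of the contribution of a single character, written with plain min/max so that it carries over directly to a language without a multiset type. The function name `char_contribution` is hypothetical and is not used in the code further down.

```python
def char_contribution(in_sentence, in_target):
    """Contribution of one character to a sentence's score."""
    useful = min(in_sentence, in_target)           # occurrences the target still needs
    superfluous = max(in_sentence - in_target, 0)  # occurrences beyond what is needed
    return useful - superfluous

# The example from the list above: 20 'a' in the sentence, 15 'a' needed
print(char_contribution(20, 15))  # 15 useful - 5 superfluous = 10
```

The score of a sentence is then the sum of this quantity over all characters that appear in it.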
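I was too lazy to implement it in C++, so here it is in Python, using a max-heap and a Counter. After the code I wrote a quick explanation to help you translate it into C++.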
from collections import Counter
import heapq
sentences = ['More RVs were seen in the storage lot than at the campground.', 'She did her best to help him.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.', 'The swirled lollipop had issues with the pop rock candy.', 'The two walked down the slot canyon oblivious to the sound of thunder in the distance.', 'Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'He is no James Bond; his name is Roger Moore.', 'The tumbleweed refused to tumble but was more than willing to prance.', 'She was disgusted he couldn’t tell the difference between lemonade and limeade.', 'He didn’t want to go to the dentist, yet he went anyway.']
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(target)
# Counter({'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 10, 'f': 10, 'g': 10, 'h': 10, 'i': 10, 'j': 10, 'k': 10, 'l': 10, 'm': 10, 'n': 10, 'o': 10, 'p': 10, 'q': 10, 'r': 10, 's': 10, 't': 10, 'u': 10, 'v': 10, 'w': 10, 'x': 10, 'y': 10, 'z': 10})
counts = [Counter(''.join(filter(str.isalpha, s)).lower()) for s in sentences] # remove punctuation, spaces, uncapitalize, then count frequencies
def get_score(sentence_count, target):
    # useful characters minus superfluous characters
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())
candidates = []
for sentence, count in zip(sentences, counts):
    score = get_score(count, target)
    candidates.append((-score, sentence, count))
heapq.heapify(candidates) # order candidates by score
# python's heapq only handles min-heap
# but we need a max-heap
# so I added a minus sign in front of every score
selection = []
while candidates and candidates[0][0] < 0: # while there is a candidate with positive score
    score, sentence, count = heapq.heappop(candidates) # greedily selecting best candidate
    selection.append(sentence)
    target = target - count # update target by removing characters already accounted for
    candidates = [(-get_score(c,target), s, c) for _,s,c in candidates] # update scores of remaining candidates
    heapq.heapify(candidates) # reorder candidates according to new scores
# HERE ARE THE SELECTED SENTENCES:
print(selection)
# ['Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.']
# HERE ARE THE TOTAL FREQUENCIES FOR THE SELECTED SENTENCES:
final_frequencies = Counter(filter(str.isalpha, ''.join(selection).lower()))
print(final_frequencies)
# Counter({'e': 22, 't': 15, 'a': 12, 'h': 11, 's': 10, 'o': 10, 'n': 10, 'd': 10, 'i': 9, 'r': 8, 'y': 7, 'm': 5, 'w': 5, 'c': 4, 'b': 4, 'f': 3, 'l': 3, 'g': 2, 'p': 2, 'v': 2, 'u': 2, 'z': 1})
# CHARACTERS IN EXCESS:
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(final_frequencies - target)
# Counter({'e': 12, 't': 5, 'a': 2, 'h': 1})
# CHARACTERS IN DEFICIT:
print(target - final_frequencies)
# Counter({'j': 10, 'k': 10, 'q': 10, 'x': 10, 'z': 9, 'g': 8, 'p': 8, 'u': 8, 'v': 8, 'f': 7, 'l': 7, 'b': 6, 'c': 6, 'm': 5, 'w': 5, 'y': 3, 'r': 2, 'i': 1})
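Explanations:
- Python's Counter( ) transforms a sentence into a map character -> frequency;
- For two Counters a and b, a & b is the multiset intersection, and a - b is the multiset difference;
- For a Counter a, sum(a.values()) is the total count (the sum of all frequencies);
- heapq.heapify transforms a list into a min-heap, a data structure that gives easy access to the element with the lowest score. We actually want the sentence with the highest score, not the lowest, so I replaced every score with its negative.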
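As a small illustration of the Counter operations listed above, here is what they give on two tiny multisets. The strings are arbitrary examples of mine, chosen only to show the per-character minimum and the clipped subtraction.

```python
from collections import Counter

a = Counter('aaabbc')  # a=3, b=2, c=1
b = Counter('abbbd')   # a=1, b=3, d=1

intersection = a & b   # per-character minimum: a=1, b=2
difference = a - b     # per-character subtraction, non-positive counts dropped: a=2, c=1
total = sum(a.values())  # total number of characters in a: 6

print(intersection, difference, total)
```

In C++, the same three operations can be written as loops over a fixed array of 26 counts, taking the minimum, the subtraction clipped at zero, and the sum respectively.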
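Non-optimality of the greedy algorithm
I should mention that this greedy algorithm is an approximation algorithm. At every iteration, it chooses the sentence with the highest score, but there is no guarantee that the optimal solution actually contains that sentence.
It is easy to build an example where the greedy algorithm fails to find the optimal solution: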
target = Counter('abcdefghijklmnopqrstuvwxyz')
print(target)
# Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 1, 'p': 1, 'q': 1, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1})
sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz'
]
With this target, the scores are as follows:
[
    (17, 'The quick brown fox jumps over the lazy dog.'),
    (13, 'abcdefghijklm'),
    (13, 'nopqrstuvwxyz')
]
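The two "half-alphabets" have a score of 13 each, because they contain 13 letters of the alphabet. The sentence "The quick brown fox..." has a score of 17 = 26 - 9, because it contains the 26 letters of the alphabet, plus 9 superfluous letters (for instance, there are 3 superfluous 'o' and 2 superfluous 'e').
Clearly, the optimal solution is to cover the target perfectly with the two halves of the alphabet. But our greedy algorithm will select the "quick brown fox" sentence first, because it has a higher score.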
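To see the gap in numbers, here is a self-contained sketch that runs the same greedy rule on this example and compares it with the two half-alphabets. It is a simplification rather than the exact code above: it picks the best sentence with max() instead of a heap, and the helpers `letter_counts`, `greedy_select` and `excess_and_deficit` are hypothetical names introduced for this illustration.

```python
from collections import Counter

def letter_counts(text):
    # Keep letters only, lowercase them, then count frequencies
    return Counter(filter(str.isalpha, text.lower()))

def get_score(count, target):
    # useful characters minus superfluous characters
    return sum((count & target).values()) - sum((count - target).values())

def greedy_select(sentences, target):
    remaining = list(sentences)
    selection = []
    while remaining:
        best = max(remaining, key=lambda s: get_score(letter_counts(s), target))
        if get_score(letter_counts(best), target) <= 0:
            break  # no sentence with a positive score is left
        selection.append(best)
        remaining.remove(best)
        target = target - letter_counts(best)  # characters still needed
    return selection

def excess_and_deficit(selection, target):
    total = letter_counts(''.join(selection))
    return sum((total - target).values()), sum((target - total).values())

target = Counter('abcdefghijklmnopqrstuvwxyz')
sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz',
]

greedy = greedy_select(sentences, target)
print(greedy)                                     # only the "quick brown fox" sentence
print(excess_and_deficit(greedy, target))         # (9, 0): 9 letters in excess
print(excess_and_deficit(sentences[1:], target))  # (0, 0): the two halves fit exactly
```

Once the pangram has been selected, the target is empty, so both half-alphabets drop to a score of -13 and the loop stops with 9 superfluous letters that the optimal selection avoids.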