【Question title】: Is there a better algorithm for exercise 9.3 in "Think Python: How to Think Like a Computer Scientist"?
【Posted】: 2023-03-26 13:15:01
【Question description】:

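Exercise 9.3 in this book asks the reader to find the combination of 5 forbidden letters that excludes the smallest number of words in this file.

Here is my solution for the first part, which I believe is correct: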

# return True if the word contains any letter in letters,
# otherwise return False
def contain(word, letters):
    for letter in letters:
        if letter in word:
            return True
    return False

# return the number of words that contain any letter in letters
def ncont(words, letters):
    count = 0
    for word in words:
        if contain(word, letters):
            count += 1
    return count

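But for the problem above I can only think of a brute-force algorithm, i.e. trying every possible combination. There are exactly 26! / (5! * 21!) = 65780 combinations. Here is the implementation: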

def get_lset(nlt, alphabet, cur_set):
    global min_n, min_set
    # once enough letters have been chosen, evaluate this combination
    if nlt <= 0:
        cur_n = ncont(words, ''.join(cur_set))
        if min_n == -1 or cur_n < min_n:
            min_n = cur_n
            min_set = cur_set.copy()
        print(''.join(cur_set), cur_n, ' *->', min_n, ''.join(min_set))
    # otherwise, extend the combination recursively
    else:
        cur_set.append(None)
        for i in range(len(alphabet)):
            cur_set[-1] = alphabet[i]
            get_lset(nlt-1, alphabet[i+1:], cur_set)
        cur_set.pop()

Then call the function above like this:

import string

if __name__ == '__main__':
    min_n = -1
    min_set = []
    with open('words.txt', 'r') as fin:
        words = [line.strip() for line in fin]
    get_lset(5, list(string.ascii_lowercase), [])
    print(min_set, min_n)

But this solution is slow. Is there a better algorithm for this problem? Any suggestions would be welcome!
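
The number of ways to choose 5 letters out of 26 is the binomial coefficient C(26, 5). As a quick sanity check on the count quoted in the question, the following sketch computes it two ways with the standard `math` module:

```python
import math

# Number of ways to choose 5 letters out of 26, ignoring order.
n_combinations = math.comb(26, 5)

# The same value from the factorial formula 26! / (5! * 21!).
by_factorials = math.factorial(26) // (math.factorial(5) * math.factorial(21))

print(n_combinations, by_factorials)
```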

【Comments】:

    Tags: python algorithm


    【Solution 1】:

    First, let's rewrite it more concisely:

    def contain(word, letters):
        return any(letter in word for letter in letters)
    
    def ncont(words, letters):
        return sum(contain(word, letters) for word in words)
    

    Currently your algorithm has average complexity

    O(len(letters) * len(a_word) * len(words))
      ---+----------------------   -+--------
         contain(word, letters)     ncont(words, letters)
    

    We can reduce this by using sets:

    def contain(word, letters):
        return not set(letters).isdisjoint(set(word))
    

    which reduces it to:

    O(min(len(letters), len(a_word)) * len(words))
      ---+--------------------------   -+--------
         contain(word, letters)        ncont(words, letters)
    

    according to https://wiki.python.org/moin/TimeComplexity
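
One caveat to the estimate above: building `set(word)` inside `contain` costs time proportional to the word length on every call. A minimal sketch, assuming the word list fits in memory, converts each word to a `frozenset` once and reuses it across calls. The names `to_letter_sets` and `ncont_sets` are illustrative and not part of the original answer.

```python
def to_letter_sets(words):
    """Convert each word to a frozenset of its letters, once."""
    return [frozenset(word) for word in words]


def ncont_sets(word_sets, letters):
    """Count the words that contain at least one of the given letters."""
    forbidden = set(letters)
    return sum(not forbidden.isdisjoint(ws) for ws in word_sets)


# Tiny made-up word list for illustration.
word_sets = to_letter_sets(["apple", "kiwi", "plum", "fig"])
print(ncont_sets(word_sets, "pq"))  # counts words containing 'p' or 'q'
```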


    As for the second part, the algorithm is easier to follow with itertools:

    import itertools
    import string
    
    def minimum_letter_set(words, n):
        attempts = itertools.combinations(string.ascii_lowercase, n)
        return min(attempts, key=lambda attempt: ncont(words, attempt))
    

    However, we can do better:

    def minimum_letter_set(words, n):
        # build a lookup table for each letter to the set of words it features in
        by_letter = {
            letter: {
                word
                for word in words
                if letter in word
            }
            for letter in string.ascii_lowercase
        }
    
        # allowing us to define a function that finds words that match multiple letters
        def matching_words(letters):
            return set.union(*(by_letter[l] for l in letters))
    
        # find all n-letter combinations
        attempts = itertools.combinations(string.ascii_lowercase, n)
    
        # and return the one that matches the fewest words
        return min(attempts, key=lambda a: len(matching_words(a)))
    

    I don't believe this reduces the algorithmic complexity, but it does save the repeated work of filtering the word list.
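
To see the lookup-table idea on a small input, here is a self-contained sketch with a made-up word list and a 4-letter alphabet. `fewest_excluded` is a hypothetical name; the logic mirrors the function above, with the alphabet passed in as a parameter so the example stays small.

```python
import itertools


def fewest_excluded(words, alphabet, n):
    """Return the n-letter combination that appears in the fewest words."""
    # Map each letter to the set of words containing it (built once).
    by_letter = {
        letter: {word for word in words if letter in word}
        for letter in alphabet
    }

    def matching_words(letters):
        # Union of the per-letter word sets.
        return set().union(*(by_letter[l] for l in letters))

    attempts = itertools.combinations(alphabet, n)
    return min(attempts, key=lambda a: len(matching_words(a)))


words = ["ab", "ac", "ad", "bc", "a"]
best = fewest_excluded(words, "abcd", 2)
print(best)
```

Several pairs tie here at three matching words; `min` returns the first one it encounters in the order produced by `itertools.combinations`.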

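    【Discussion】:

    • Thank you very much. This is a great example of the power of Python's conciseness, and the functions became faster than the original implementation. Do you have any suggestions for the second part, finding the set of 5 letters that excludes the fewest words? I feel the brute-force algorithm isn't good enough.
    • Great, a functional programming style seems to be a good way to make programs more readable. Thanks a lot.
    • I'm not sure I'd call it a functional programming style.
    • I can't say exactly what the time complexity of your final solution is, but it is a new perspective on the problem, which matters, and the readability of the program is greatly improved. Thank you very much!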
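    【Solution 2】:

    Here is my idea:

    First compute exclude[l], which maps each letter l to the set of words excluded by that letter.

    Compute the union of the five smallest of these 26 sets. This gives you a fair "provisional minimum result".

    Then, instead of using itertools.combinations to explore all combinations of 5 letters, write your own algorithm to do it, computing the union of the "exclude" sets as it goes. In this algorithm, if for the first i letters (i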
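
The last sentence above is cut off in the source. The sketch below assumes the intended idea is pruning: a partial combination is abandoned as soon as the union of its "exclude" sets is already at least as large as the best result found so far, because adding more letters can only grow the union. This is one reading of the answer, not the answerer's own code, and `best_letters` is a hypothetical name.

```python
import string


def best_letters(words, n=5, alphabet=string.ascii_lowercase):
    """Search for n letters whose combined excluded-word set is smallest."""
    # exclude[l]: indices of the words that contain letter l.
    exclude = {
        l: {i for i, w in enumerate(words) if l in w} for l in alphabet
    }
    # Try rarely used letters first so a good bound is found early.
    order = sorted(alphabet, key=lambda l: len(exclude[l]))

    # Provisional minimum: union of the n smallest sets.
    best = list(order[:n])
    best_size = len(set().union(*(exclude[l] for l in best)))

    def search(start, chosen, union):
        nonlocal best, best_size
        if len(chosen) == n:
            if len(union) < best_size:
                best, best_size = list(chosen), len(union)
            return
        for i in range(start, len(order)):
            letter = order[i]
            new_union = union | exclude[letter]
            # Prune: this branch cannot beat the current best.
            if len(new_union) >= best_size:
                continue
            chosen.append(letter)
            search(i + 1, chosen, new_union)
            chosen.pop()

    search(0, [], set())
    return "".join(best), best_size


# Made-up data where the two individually rarest letters are not the best pair.
print(best_letters(["xz", "xz", "y", "y", "z"], n=2, alphabet="xyz"))
```

In the toy data, `x` and `y` each appear in two words but never together, so their union covers four words, while `x` and `z` overlap and cover only three.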

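    【Discussion】:

    • Glad you like it. If you implement it, please give some feedback.
    • OK, I'll implement it as soon as possible!
    • Hello, I have implemented your algorithm, and it is surprisingly faster than the original one. Thanks again. Here is a link to it; any suggestions?
    • Faster than I thought. It works well because letter usage varies so much. In fact the first guess was correct. That is not the case for 6 letters, but for 5, 7, 8, 9 and 10 the first guess is enough. This means that for rarely used letters, the correlation between them is not high.
    • Thanks, I learned a lot from this conversation. It gave me some guidance on designing algorithms. First, pay attention to the data and exploit its properties; for this problem, the large variation in letter usage in natural language greatly reduces the number of attempts. Second, try to cut down the number of attempts partway through, which is a more general guideline.
    【Solution 3】:

    My solution: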

        from string import ascii_lowercase

        def smallest_set(filename):
            # for each letter, count the words that do NOT contain it
            avoid_dict = dict.fromkeys(ascii_lowercase, 0)
            with open(filename) as file_handler:
                for line in file_handler:
                    for key in avoid_dict:
                        if key not in line:
                            avoid_dict[key] += 1
            # letters missing from the most words come first
            avoid_stats_sorted = sorted(avoid_dict, key=avoid_dict.get,
                                        reverse=True)
            # take the 5 letters that are individually rarest
            return ''.join(avoid_stats_sorted[:5])
    

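    【Discussion】:

    • Welcome to Stack Overflow. While this may answer the question, it is better to include an explanation rather than only code in your answer.
    【Solution 4】:

    I think I have a faster solution. Here is the code with comments...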

    import itertools
    import string
    import timeit
    
    if __name__ == '__main__':
        # Start timestamp
        start_ts = timeit.default_timer()
    
        #
        # Small function to calculate the factorial of a number
        # Used in debugging
        #
        # Math: the number of unique combinations of x elements from y elements is calculated as
        #       y! / (y - x)! / x!
        #
        # Or, in 'school' notation
        #
        #          y!
        #    _____________
        #    (y - x)! . x!
        #
        fac = lambda num: 1 if num <= 1 else num * fac(num - 1)
    
        #
        # Open the file and read the content in memory as a list of strings
        #
        with open("words.txt", "r") as file:
            words = file.readlines()
    
        #
        # Create a dictionary containing the 26 letters of the English alphabet
        # For each of the letters, set the number the letter appears to 0
        #
        # I prefer to initialize this here instead of dynamically adding them to the dictionary later,
        # as normally this text file will contain all letters and having to check if the element exists will take longer
        #
        appearances = {}
    
        for letter in string.ascii_lowercase:
            appearances[letter] = 0
    
        #
        # For each of the words, each of the unique letters, count them into appearances
        # If a letter appears twice or even more, it does not matter.  We count the words that contain the letter
        # at least once.  For our letter set, it does not matter whether the letter appears once or more
        #
        for word in words:
            for letter in list(set(word.strip().lower())):
                appearances[letter] += 1
    
        # Debug: you will see Q has the least appearances, E has the most
        print(appearances)
    
        #
        # Let's sort this.  It's key to this algorithm
        #
        # In short:
        #
        # Suppose we only have 5 letters, A to E
        # Suppose we have counted our appearances and this is how many times they show up
        #   A : 10
        #   B : 5
        #   C : 3
        #   D : 7
        #   E : 12
        #
        # Sorted:
        #   C : 3, B: 5, D : 7, A : 10, E : 12
        #
        # Suppose we need combinations of only 2 letters
        # Take C + B
        # In worst case, you have in total 8 words that contain either C or B.  This is the case where no words have both.
        # In best case, you have 5.  This is the case where 3 words contain B and C, 2 words contain only B
        #
        # Given the above, it makes no sense to check any combination with A or E
        # You know they have either 10 or 12 words.  They can't beat B+C in number of appearances
        # So don't include them in the combinations.  This will significantly lower the number of combinations
        #
        # Given the above, you must include D, as you don't know how many words have either B or C (between 5 and 8)
        #
        # On the words.txt, this approach resulted in only 252 combinations to check.  So with "brute force", you only
        # needed 252 iterations over the possible combinations of 5 characters.  You can verify with the debug code
        #
        #
    
        #
        # appearances_sorted is a list, we can't calculate on it
        #
        appearances_sorted = sorted(appearances, key=lambda x: appearances[x])
    
        print(appearances_sorted)
        print(appearances_sorted[:5])
    
        #
        # Calculate the least amount possible.  This is the sum of the 5 lowest appearances
        # As we are looping over the first 5, we already put them in our list of combinations to check
        #
        sum_least = 0
        appearances_least = {}
    
        for k in appearances_sorted[:5]:
            v = appearances[k]
            sum_least += v
            appearances_least[k] = v
    
        print(sum_least)
        print(appearances_least)
    
        #
        # For the rest of the sorted appearances, we add them, unless the appearance of the character by itself
        # is already higher than the sum we calculated
        #
        for k in appearances_sorted[5:]:
            if appearances[k] > sum_least:
                break
    
            appearances_least[k] = appearances[k]
    
        print(appearances_least)
    
        #
        # Debug code to check the math against the len of the calculations Python will provide
        #
        # f1 = fac(len(appearances_least))
        # f2 = fac(len(appearances_least) - 5)
        # f3 = fac(5)
        # print(f1 / f2 / f3)
        #
    
        #
        # Create all the possible combinations using itertools
        # One advantage is also that we can do this on a sorted list, the combinations with the smallest possible
        # appearances appear first.  But as said, as we don't know the words that have multiple letters combined, we
        # cannot be sure we only need to check the first
        #
        combinations = list(itertools.combinations(appearances_least, 5))
    
        # This will print 252 on the words.txt file
        print(len(combinations))
    
        #
        # How many words in total do we have
        # This total will be used as a starting point to see how a combination is done
        # The worst combination possible will never be in more words than the file contains
        #
        total_words = len(words)
        min_found = total_words
        print(total_words)
    
        #
        # Just to avoid that PyCharm complains that best_combo might not be set later
        #
        best_combo = combinations[0]
    
        #
        # Loop over all the combos we have, as we cannot be sure on the words that have multiple letters
        # When we calculated the appearances, we were calculating only per letter
        #
        for combo in combinations:
            count_matches = 0
    
            #
            # Loop over the words, then over the letters in the combo
            # If one of the letters is found, add the counter and stop the loop as it does not matter if other characters
            # of the combo also appear.  One is enough to count it.
            #
            #
            for word in words:
                for letter in combo:
                    if letter in word:
                        count_matches += 1
                        break
    
                #
                # If we already found more words than the minimum we have detected already, we can stop the loop.  This
                # combo will not be better, it will only get worse.
                #
                if count_matches > min_found:
                    break
    
            #
            # If we found a better one, store it
            #
            if count_matches < min_found:
                best_combo = combo
                min_found = count_matches
    
        # End timestamp
        end_ts = timeit.default_timer()
    
        #
        # Print the results
        #
        print(best_combo)
        print(min_found)
        print(end_ts - start_ts)
        #
        # I have:
        #
        # ('q', 'j', 'x', 'z', 'w')
        # 17382
        # 4.387889001052827
        #
        # Enjoy !
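
The candidate-pruning step described in the comments above can be sketched on the toy counts from that example (letters A to E, combinations of 2 letters). `candidate_letters` is an illustrative helper name and not part of the answer's code.

```python
def candidate_letters(appearances, n):
    """Keep only the letters that could be part of the best n-letter combination."""
    ordered = sorted(appearances, key=appearances.get)
    # Upper bound: the n rarest letters together appear in at most this many words.
    bound = sum(appearances[k] for k in ordered[:n])
    candidates = ordered[:n]
    for k in ordered[n:]:
        if appearances[k] > bound:
            break  # this letter alone already exceeds the bound
        candidates.append(k)
    return candidates, bound


counts = {"A": 10, "B": 5, "C": 3, "D": 7, "E": 12}
print(candidate_letters(counts, 2))
```

With these counts the bound is 3 + 5 = 8, so D (7) stays in the candidate list while A (10) and E (12) are dropped, matching the reasoning in the comments.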
    

    【Discussion】:
