Top K Frequent Words using heaps in Python [duplicate]: answers

【Question title】: Top K Frequent Words using heaps in Python [duplicate]
【Posted】: 2021-02-22 23:05:00
【Question】:

I am trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time, but I am getting an undesirable result. My Python3 code and console output are below:

from collections import Counter
from typing import List
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        
        counts = Counter(words)
        print('Word counts:', counts)
        
        result = []
        for word in counts:
            print('Word being added:', word)
            if len(result) < k:
                heapq.heappush(result, (-counts[word], word))
                print(result)
            else:
                heapq.heappushpop(result, (-counts[word], word))
        result = [r[1] for r in result]
        
        return result

----------- Console output -----------

Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]

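When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"] with K = 4, I find that the word the gets shifted to the end of the list (after day) once is is added, even though both have a count of 3. This makes sense: the heap only requires a parent to be <= its children, and the children are not ordered relative to each other. Since (-2, 'sunny') and (-3, 'the') are both > (-3, 'is'), the heap invariant is in fact maintained, even though (-3, 'the') < (-2, 'sunny') and is the right child of (-3, 'is'). The expected result is ["is","the","sunny","day"], while the output of my code is ["is","sunny","the","day"].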
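The behaviour described above can be reproduced in isolation. The following is a minimal sketch that pushes the same four tuples in the same order as the console output; the names snapshot and ordered are introduced here for illustration only. It shows that the backing list of a heapq heap satisfies the heap invariant without being sorted, and that popping repeatedly is what produces the sorted order.

```python
import heapq

# Push the (negated count, word) pairs in the same order as the console output.
heap = []
for item in [(-3, "the"), (-1, "day"), (-3, "is"), (-2, "sunny")]:
    heapq.heappush(heap, item)

# The backing list satisfies the heap invariant but is not sorted.
snapshot = list(heap)
print(snapshot)  # [(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]

# Popping repeatedly yields the entries in fully sorted order.
ordered = [heapq.heappop(heap)[1] for _ in range(len(heap))]
print(ordered)  # ['is', 'the', 'sunny', 'day']
```

Reading the words straight out of the list, as the code in the question does, therefore returns heap order rather than sorted order.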

Should I be using heaps to solve this problem in O(N log K) time, and if so, how can I modify my code to achieve the desired result?

【Question comments】:

Tags: python heap

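【Solution 1】:

You are on the right track with heapq and Counter; you just need a slight change in how you use them in relation to k (you need to iterate over all of counts before adding anything to result):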

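from collections import Counter
from typing import List
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        counts = Counter(words)
        # Negate the counts so that heapq's min-heap pops the most frequent word
        # first; ties are broken alphabetically by the word itself.
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, (-val, key))

        # Pop the k best entries in order.
        result = []
        while k > 0:
            result.append(heapq.heappop(max_heap)[1])
            k -= 1

        return result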

I had not read the O(N log k) requirement before; here is a modification of the above solution that achieves it:

from collections import Counter, deque
from operator import lt
from typing import List
import heapq

class WordWithFrequency(object):
    def __init__(self, word, frequency):
        self.word = word
        self.frequency = frequency

    def __lt__(self, other):
        # Equal frequencies: the alphabetically later word counts as "smaller",
        # so it is evicted from the heap first.
        if self.frequency == other.frequency:
            return lt(other.word, self.word)
        else:
            return lt(self.frequency, other.frequency)

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        counts = Counter(words)

        # Despite its name, max_heap is a min-heap holding at most k + 1 entries.
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, WordWithFrequency(key, val))
            if len(max_heap) > k:
                heapq.heappop(max_heap)

        result = deque([]) # can also use a list and just reverse at the end
        while k > 0:
            result.appendleft(heapq.heappop(max_heap).word)
            k -= 1

        return list(result)

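【Comments】:

Thanks for your answer. Assuming all elements of words are unique, isn't this O(N log N) rather than O(N log K) if you push them all onto the heap?
max_heap only ever holds at most K + 1 elements. It does a heappop as soon as it reaches K + 1.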
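The bound mentioned in the last comment can be checked empirically. Below is a minimal sketch, not the answerer's code: Entry and top_k_with_peak are hypothetical names introduced here, and the function records the largest size the heap ever reaches while applying the same push-then-pop pattern.

```python
import heapq
from collections import Counter


class Entry:
    """Hypothetical wrapper: 'smaller' means a worse candidate for the top k."""

    def __init__(self, word, count):
        self.word = word
        self.count = count

    def __lt__(self, other):
        if self.count != other.count:
            return self.count < other.count
        return self.word > other.word  # on ties, the alphabetically later word is worse


def top_k_with_peak(words, k):
    """Return (top-k words, largest heap size observed)."""
    heap, peak = [], 0
    for word, count in Counter(words).items():
        heapq.heappush(heap, Entry(word, count))
        peak = max(peak, len(heap))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the worst candidate
    # Entries pop from worst to best, so reverse at the end.
    result = []
    while heap:
        result.append(heapq.heappop(heap).word)
    result.reverse()
    return result, peak


words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
print(top_k_with_peak(words, 2))  # (['is', 'the'], 3)
```

With k = 2 the heap peaks at 3 entries, so each push or pop costs O(log K); counting the words is a separate O(N) pass.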

【Solution 2】:

You don't need to bother with a heap. Counter() already has a method that returns the most common elements.

>>> c = Counter(["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"])
>>> c.most_common(4)
[('the', 3), ('is', 3), ('sunny', 2), ('day', 1)]
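Note that the output above lists 'the' before 'is', whereas the question expects ties to be broken alphabetically. One possible way to get that order while still relying on the standard library is heapq.nsmallest with a composite key. This is a sketch under that assumption, not part of the original answer; the variable names are illustrative.

```python
import heapq
from collections import Counter

words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
counts = Counter(words)

# most_common orders equal counts by first occurrence, so 'the' precedes 'is'.
by_first_seen = [word for word, _ in counts.most_common(4)]

# Composite key: higher count first, then alphabetical order on ties.
by_count_then_alpha = heapq.nsmallest(4, counts, key=lambda w: (-counts[w], w))

print(by_first_seen)        # ['the', 'is', 'sunny', 'day']
print(by_count_then_alpha)  # ['is', 'the', 'sunny', 'day']
```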

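【Comments】:

+1, but for context: the platform the OP is using, Leetcode, is meant for interview prep, and interviewers usually won't let you use library functions for the core aspect of a problem. E.g., if a problem is not about binary search, I'd wager most interviewers would allow bisect.bisect_left, but most_common would be stretching it for this problem, imo, unless it's a Python expert role.
@ShashSinha I wasn't aware of the context. On the other hand, back when I was doing interviews, I asked one particular question that always got long, complicated answers. One candidate just answered, "I'd put them in a heap." That was all I needed to hear!
Does it have the required time complexity?
@superbrain. Counter uses a heap to find the n largest. So I suspect it uses the standard N log n algorithm, where N is the number of items you have and n is "I want the n largest".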
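The suspicion in the last comment can be probed with a small comparison. To my knowledge, CPython implements most_common(n) for a given n by delegating to heapq.nlargest keyed on the count, but treat that as an assumption about one implementation rather than a documented guarantee. The sketch below only checks that the two calls agree on this input.

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter(["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"])
k = 2

via_counter = counts.most_common(k)
# Select the k (word, count) pairs with the largest counts using a heap-based helper.
via_heapq = heapq.nlargest(k, counts.items(), key=itemgetter(1))

print(via_counter == via_heapq)  # True
```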

【Solution 3】:

Using Counter() and sort(), this also passes, in O(N log N):

import collections

class Solution:
    def topKFrequent(self, words, k):
        words_countmap = collections.Counter(words)
        items = list(words_countmap.items())
        # Sort by descending count, then alphabetically on ties.
        items.sort(key=lambda item: (-item[1], item[0]))
        return [item[0] for item in items[0:k]]

Here is a Java solution that uses a PriorityQueue:

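import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class Solution {
    public static final List<String> topKFrequent(
        final String[] words,
        final int k
    ) {
        LinkedList<String> frequentWords = new LinkedList<>();
        HashMap<String, Integer> wordsMap = new HashMap<>();

        // Count the occurrences of each word.
        for (int i = 0; i < words.length; i++) {
            wordsMap.put(words[i], wordsMap.getOrDefault(words[i], 0) + 1);
        }

        // Min-heap on count; on ties the alphabetically larger word is polled first.
        // Integer values are compared with equals(), because == compares references.
        PriorityQueue<Map.Entry<String, Integer>> wordCounterQueue = new PriorityQueue<>(
            (a, b) -> a.getValue().equals(b.getValue())
                ? b.getKey().compareTo(a.getKey())
                : a.getValue() - b.getValue()
        );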

        for (Map.Entry<String, Integer> key : wordsMap.entrySet()) {
            wordCounterQueue.offer(key);

            if (wordCounterQueue.size() > k) {
                wordCounterQueue.poll();
            }
        }

        while (!wordCounterQueue.isEmpty()) {
            frequentWords.add(0, wordCounterQueue.poll().getKey());
        }

        return frequentWords;
    }
}

Similarly in C++:

// Most of headers are already included;
// Can be removed;
#include <algorithm>  // std::reverse
#include <iostream>
#include <cstdint>
#include <vector>
#include <string>
#include <unordered_map>
#include <queue>
#include <utility>

// The following block might slightly improve the execution time;
// Can be removed;
static const auto __optimize__ = []() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::cout.tie(nullptr);
    return 0;
}();

struct Solution {
    using ValueType = std::uint_fast16_t;
    using Pair = std::pair<std::string, ValueType>;
    static const std::vector<std::string> topKFrequent(
        const std::vector<std::string>& words,
        const ValueType k
    ) {
        std::unordered_map<std::string, ValueType> word_counts;

        for (const auto& word : words) {
            ++word_counts[word];
        }

        std::priority_queue<Pair, std::vector<Pair>, Comparator> word_freqs;

        for (const auto& word_count : word_counts) {
            word_freqs.push(word_count);


            if (std::size(word_freqs) > k) {
                word_freqs.pop();
            }
        }

        std::vector<std::string> k_frequent;

        while (!word_freqs.empty()) {
            k_frequent.emplace_back(word_freqs.top().first);
            word_freqs.pop();
        }

        std::reverse(std::begin(k_frequent), std::end(k_frequent));

        return k_frequent;
    }

  private:
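    struct Comparator {
        // Treats "better" entries (higher count, then alphabetically smaller word)
        // as smaller, so the priority queue keeps the worst candidate on top.
        bool operator()(
            const Pair& a,
            const Pair& b
        ) const {
            return a.second > b.second || (a.second == b.second && a.first < b.first);
        }
    };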
};

int main() {
    std::vector<std::string> words = {"i", "love", "leetcode", "i", "love", "coding"};
    std::vector<std::string> top_k_frequents = Solution().topKFrequent(words, 2);

    for (const auto& word : top_k_frequents) {
        std::cout << word << "\n";
    }
}

【Comments】: