【Question title】:Top K Frequent Words using heaps in Python [duplicate]
【Posted】:2021-02-22 23:05:00
【Question】:

I am trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time, but I am getting an undesirable result. My Python3 code and console output are below:

from collections import Counter
from typing import List  # Leetcode pre-imports this; needed to run elsewhere
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        
        counts = Counter(words)
        print('Word counts:', counts)
        
        result = []
        for word in counts:
            print('Word being added:', word)
            if len(result) < k:
                heapq.heappush(result, (-counts[word], word))
                print(result)
            else:
                heapq.heappushpop(result, (-counts[word], word))
        result = [r[1] for r in result]
        
        return result

----------- Console output -----------

Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]

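When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"] with K = 4, I find that once is is added, the word the gets shifted to the end of the list (after day), even though both have a count of 3. This makes sense: a parent only needs to be <= its children, and since (-2, 'sunny') and (-3, 'the') are both > (-3, 'is'), the heap invariant is in fact maintained, even though (-3, 'the') < (-2, 'sunny') and is the right child of (-3, 'is'). The expected result is ["is","the","sunny","day"], while the output of my code is ["is","sunny","the","day"].

Should I be using heaps to solve this in O(N log K) time, and if so, how can I modify my code to achieve the desired result?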
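
The observation above can be reproduced in isolation. The following is a minimal sketch, assuming the same (negated count, word) tuples as in the console output; the variable names `raw_order`, `scratch`, and `drained` are illustrative and not part of the original code. It shows that the list backing a heap is only partially ordered, while popping repeatedly yields the fully sorted order.

```python
import heapq

# Same (negated count, word) tuples, pushed in the same order as in the question.
heap = []
for item in [(-3, "the"), (-1, "day"), (-3, "is"), (-2, "sunny")]:
    heapq.heappush(heap, item)

# The backing list only guarantees heap[i] <= heap[2*i + 1] and heap[i] <= heap[2*i + 2].
raw_order = [word for _, word in heap]

# Popping repeatedly returns items in ascending tuple order.
scratch = list(heap)  # a copy of a valid heap is still a valid heap
drained = []
while scratch:
    drained.append(heapq.heappop(scratch)[1])

print(raw_order)  # ['is', 'sunny', 'the', 'day']
print(drained)    # ['is', 'the', 'sunny', 'day']
```

Reading the heap list directly therefore gives a valid heap layout, not a ranking; the ranking only appears once the items are popped or sorted.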

【Question comments】:

    Tags: python heap


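    【Solution 1】:

    You are on the right track using heapq and Counter. You only need a slight change in how you use them in relation to k: iterate through all of the counts before adding anything to result: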

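    from collections import Counter
    from typing import List
    import heapq
    
    class Solution:
        def topKFrequent(self, words: List[str], k: int) -> List[str]:
            counts = Counter(words)
            # Negated counts turn heapq's min-heap into a max-heap on frequency;
            # ties fall back to alphabetical order of the word.
            max_heap = []
            for key, val in counts.items():
                heapq.heappush(max_heap, (-val, key))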
            
            result = []
            while k > 0:
                result.append(heapq.heappop(max_heap)[1])
                k -= 1
            
            return result
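
To illustrate why the loop from the question needs this change, here is a minimal sketch that re-runs that loop with k smaller than the number of distinct words. The helper name `original_loop` is hypothetical. With negated counts, `heapq.heappushpop` returns the smallest tuple, which corresponds to the most frequent candidate, so the strongest words are the ones that get discarded.

```python
import heapq
from collections import Counter

def original_loop(words, k):
    """Re-runs the loop from the question (hypothetical helper name)."""
    counts = Counter(words)
    heap = []
    for word in counts:
        if len(heap) < k:
            heapq.heappush(heap, (-counts[word], word))
        else:
            # Returns the smallest tuple among the heap and the new item,
            # i.e. the most frequent candidate, and keeps the rest.
            heapq.heappushpop(heap, (-counts[word], word))
    return sorted(word for _, word in heap)

words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
# Prints ['day', 'sunny'], although the two most frequent words are 'is' and 'the'.
print(original_loop(words, 2))
```

With k = 4 this branch is never reached because there are only four distinct words, which is why the console output in the question shows all four words.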
    

    I had not read the O(N log k) requirement earlier. Here is a modification of the solution above that achieves it:

    from collections import Counter, deque
    from typing import List
    import heapq
    
    class WordWithFrequency(object):
        def __init__(self, word, frequency):
            self.word = word
            self.frequency = frequency
    
        def __lt__(self, other):
            # "Smaller" means a weaker candidate: lower frequency, or on a tie
            # the alphabetically later word, so it is evicted first.
            if self.frequency == other.frequency:
                return other.word < self.word
            else:
                return self.frequency < other.frequency
    
    class Solution:
        def topKFrequent(self, words: List[str], k: int) -> List[str]:    
            counts = Counter(words)
            
            # Despite the name, heapq keeps the weakest kept word at the root,
            # so popping whenever the size exceeds k drops the weakest word.
            max_heap = []
            for key, val in counts.items():
                heapq.heappush(max_heap, WordWithFrequency(key, val))
                if len(max_heap) > k:
                    heapq.heappop(max_heap)
            
            result = deque([]) # can also use a list and just reverse at the end
            # Pops come out weakest first, so prepend to get strongest first.
            while max_heap:
                result.appendleft(heapq.heappop(max_heap).word)
            
            return list(result)
    

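    【Comments】:

    • Thanks for your answer. Suppose all the elements of words are unique: if you push them all onto the heap, isn't this O(N log N) rather than O(N log K)?
    • max_heap only ever holds at most K + 1 elements at any point in time, so each heappop runs against at most K + 1 elements.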
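
The bound mentioned in the last comment can be checked directly. Below is a minimal sketch, assuming a wrapper class with the same ordering as WordWithFrequency; the names `Entry` and `top_k_with_peak` are hypothetical and exist only for this illustration. It records the largest heap size seen while processing the counts.

```python
import heapq
from collections import Counter


class Entry:
    """Heap entry; the 'smallest' entry is the weakest candidate for the top k."""

    def __init__(self, word, frequency):
        self.word = word
        self.frequency = frequency

    def __lt__(self, other):
        if self.frequency != other.frequency:
            return self.frequency < other.frequency
        return self.word > other.word  # on a tie, the later word is weaker


def top_k_with_peak(words, k):
    """Returns (top k words, largest heap size observed)."""
    heap = []
    peak = 0
    for word, freq in Counter(words).items():
        heapq.heappush(heap, Entry(word, freq))
        peak = max(peak, len(heap))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the weakest entry
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap).word)  # weakest first
    ordered.reverse()
    return ordered, peak


# 100 distinct words, each appearing once: the heap still never grows past k + 1.
unique_words = [f"w{i:03d}" for i in range(100)]
print(top_k_with_peak(unique_words, 3))  # (['w000', 'w001', 'w002'], 4)
```

Since every push and pop works on at most k + 1 entries, each heap operation costs O(log k), which gives O(N log k) over N distinct words.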
    【Solution 2】:

    You don't need to bother with a heap. Counter() already has a method that returns the most common elements.

    >>> from collections import Counter
    >>> c = Counter(["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"])
    >>> c.most_common(4)
    [('the', 3), ('is', 3), ('sunny', 2), ('day', 1)]
    

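    【Comments】:

    • +1, but for context: the platform OP is using, Leetcode, is for interview preparation, and interviewers usually won't let you use library functions for the core aspect of a problem. E.g. if a question is not about binary search, I bet most interviewers would let you use bisect.bisect_left, but most_common would be stretching it for this question, imo, unless it is for a Python specialist role.
    • @ShashSinha I didn't realize the context. On the other hand, back when I was doing interviews, I asked one particular question that always got long, complicated answers. One candidate simply replied, "I would put them in a heap." That was all I needed to hear!
    • Does it have the required time complexity?
    • @superbrain. Counter uses a heap to find the n largest. So I suspect it uses the standard N log n algorithm, where N is the number of items you have and n is "I want the n largest".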
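
One caveat about most_common for this particular problem: it orders equal counts by first occurrence, not alphabetically, which is why 'the' appears before 'is' in the output above. A minimal sketch of a heap-based library alternative is shown below, assuming the tie-break required is alphabetical order; the function name `top_k_words` is hypothetical. It uses `heapq.nsmallest` with a composite key, which returns its result already sorted.

```python
import heapq
from collections import Counter

def top_k_words(words, k):
    """Top k words by count, ties broken alphabetically (hypothetical helper)."""
    counts = Counter(words)
    # Key: higher count first (negated), then alphabetical order of the word.
    best = heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in best]

words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
print(top_k_words(words, 4))  # ['is', 'the', 'sunny', 'day']
```

As far as I know, CPython's `nsmallest` keeps a bounded heap internally when k is small relative to the input, so this should stay close to O(N log k), but that is an implementation detail and not something the sketch depends on.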
    【Solution 3】:

    Using Counter() and sort(), this would also pass, in O(N log N) time:

    import collections
    
    class Solution:
        def topKFrequent(self, words, k):
            words_countmap = collections.Counter(words)
            items = list(words_countmap.items())
            items.sort(key=lambda item: (-item[1], item[0]))
            return [item[0] for item in items[0:k]]
    

    Here is a Java solution using a PriorityQueue:

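    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;
    
    class Solution {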
        public static final List<String> topKFrequent(
            final String[] words,
            final int k
        ) {
            LinkedList<String> frequentWords = new LinkedList<>();
            HashMap<String, Integer> wordsMap = new HashMap<>();
    
            for (int i = 0; i < words.length; i++) {
                wordsMap.put(words[i], wordsMap.getOrDefault(words[i], 0) + 1);
            }
    
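            // Weakest entry first: lower count, or on a tie the alphabetically later word.
            // Integer values are compared with equals(), since == compares object references.
            PriorityQueue<Map.Entry<String, Integer>> wordCounterQueue = new PriorityQueue<>(
                (a, b) -> a.getValue().equals(b.getValue())
                    ? b.getKey().compareTo(a.getKey())
                    : Integer.compare(a.getValue(), b.getValue())
            );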
    
            for (Map.Entry<String, Integer> key : wordsMap.entrySet()) {
                wordCounterQueue.offer(key);
    
                if (wordCounterQueue.size() > k) {
                    wordCounterQueue.poll();
                }
            }
    
            while (!wordCounterQueue.isEmpty()) {
                frequentWords.add(0, wordCounterQueue.poll().getKey());
            }
    
            return frequentWords;
        }
    }
    

    Similarly in C++:

    // Most of headers are already included;
    // Can be removed;
    #include <iostream>
    #include <cstdint>
    #include <vector>
    #include <string>
    #include <unordered_map>
    #include <algorithm>  // std::reverse
    #include <queue>
    #include <utility>
    
    // The following block might slightly improve the execution time;
    // Can be removed;
    static const auto __optimize__ = []() {
        std::ios::sync_with_stdio(false);
        std::cin.tie(nullptr);
        std::cout.tie(nullptr);
        return 0;
    }();
    
    
    
    
    struct Solution {
        using ValueType = std::uint_fast16_t;
        using Pair = std::pair<std::string, ValueType>;
        static const std::vector<std::string> topKFrequent(
            const std::vector<std::string>& words,
            const ValueType k
        ) {
            std::unordered_map<std::string, ValueType> word_counts;
    
            for (const auto& word : words) {
                ++word_counts[word];
            }
    
            std::priority_queue<Pair, std::vector<Pair>, Comparator> word_freqs;
    
            for (const auto& word_count : word_counts) {
                word_freqs.push(word_count);
    
    
                if (std::size(word_freqs) > k) {
                    word_freqs.pop();
                }
            }
    
            std::vector<std::string> k_frequent;
    
            while (!word_freqs.empty()) {
                k_frequent.emplace_back(word_freqs.top().first);
                word_freqs.pop();
            }
    
            std::reverse(std::begin(k_frequent), std::end(k_frequent));
    
            return k_frequent;
        }
    
      private:
        struct Comparator {
            Comparator() {}
            ~Comparator() {}
            const bool operator()(
                const Pair& a,
                const Pair& b
            ) {
                return a.second > b.second || (a.second == b.second && a.first < b.first);
            }
        };
    };
    
    int main() {
        std::vector<std::string> words = {"i", "love", "leetcode", "i", "love", "coding"};
        std::vector<std::string> top_k_frequents = Solution().topKFrequent(words, 2);
    
        for (const auto& word : top_k_frequents) {
            std::cout << word << "\n";
        }
    }
    
    
    

    【Comments】:
