【Question title】: Optimizing the calculation of document frequency
【Posted】: 2021-02-20 15:36:13
【Question】:

This takes far too long:

# Document-frequency
phrases_final["doc_freq"] = len(phrases_final) * [0]


# for each phrase, compute the number of clusters that phrase occurs in

for phrase in phrases_final["extracted_phrases"]:
    for i in cluster_name:
        all_tweets = ""
        for tweet in df["tweets_to_consider"][df.cl_num == i]:
            all_tweets = all_tweets + tweet + ". "
        if phrase in all_tweets:
            phrases_final["doc_freq"][
                (phrases_final.extracted_phrases == phrase) & (phrases_final.cluster_num == i)
            ] = (
                phrases_final["doc_freq"][
                    (phrases_final.extracted_phrases == phrase) & (phrases_final.cluster_num == i)
                ]
                + 1
            )

【Question comments】:

    Tags: python nlp


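    【Answer 1】:
    • You should probably precompute all_tweets once per cluster, rather than rebuilding it for every phrase.
      • Alternatively, you may not want to build all_tweets at all, since if phrase in (long_string_here) will be slow; consider a dict of sets, perhaps?
    • Rather than computing the results directly into the dataframe (at least I assume it is a dataframe, judging by the indexing, but honestly you initialize phrases_final as a list of integers, so the indexing may be completely bogus anyway), consider a collections.Counter() keyed by (cluster_num, phrase) tuples (or a collections.defaultdict(collections.Counter) keyed by cluster_num, then by phrase).
    • If it is still too slow, use multiprocessing.Pool() to parallelize over phrases or clusters.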
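
A minimal sketch of the first and third suggestions, assuming the goal is what the comment in the question states (for each phrase, the number of clusters it occurs in) rather than the exact behaviour of the original loop. The function name document_frequency and the toy data are hypothetical; only the column names come from the question.

```python
from collections import Counter, defaultdict


def document_frequency(cluster_tweets, phrases):
    """Return a Counter mapping each phrase to the number of clusters containing it.

    cluster_tweets: iterable of (cluster_id, tweet_text) pairs.
    phrases: iterable of phrase strings (duplicates are checked only once).
    """
    # Build each cluster's text once, instead of once per phrase.
    parts = defaultdict(list)
    for cluster_id, tweet in cluster_tweets:
        parts[cluster_id].append(tweet)
    # Same "tweet. tweet. " layout that the question's inner loop produces.
    cluster_text = {cid: ". ".join(tweets) + ". " for cid, tweets in parts.items()}

    doc_freq = Counter()
    for phrase in set(phrases):
        # Substring test against each cluster's text, as in the question.
        doc_freq[phrase] = sum(phrase in text for text in cluster_text.values())
    return doc_freq


# Hypothetical toy data standing in for df and phrases_final.
cluster_tweets = [
    (0, "big data is fun"),
    (0, "data science"),
    (1, "big data again"),
    (2, "hello world"),
]
phrases = ["big data", "data science", "hello", "big data"]

doc_freq = document_frequency(cluster_tweets, phrases)
print(doc_freq)
```

With the frames from the question, the pairs could presumably be built with zip(df["cl_num"], df["tweets_to_consider"]), and the counts written back in one step with phrases_final["extracted_phrases"].map(doc_freq), which avoids the repeated boolean-mask assignment. The substring test is still linear in the length of each cluster's text, so the dict-of-sets idea or multiprocessing.Pool() may still be worth trying if this is not fast enough.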

    【Comments】:
