How to optimise the runtime of feature extraction from text for an NLP project

【Question title】: How to optimise feature extraction from text runtime for NLP project
【Posted】: 2021-11-03 05:32:04
【Question description】:

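I'm working on an NLP project in Google Colab (Python) with a text dataset of about 100,000 instances. For each instance I extract roughly 5-10 features, and each run of the code takes about 5-10 minutes. Because I'm experimenting with different kinds of features, I rerun the extraction many times, and the total runtime adds up.

I suspect this is because my code is inefficient: it currently relies on list comprehensions, map, and iteration. It also uses a lot of memory, both because of the size of the data and because it stores multiple copies of the text.

So I'd like to know whether there is a better way to do the feature extraction that speeds up processing (and saves space). I've heard that numpy has vectorized operations, but I don't know how to apply them here.

Here is a skeleton version of my code.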

import nltk
import numpy as np
import pandas as pd

df = pd.DataFrame([["The quick brown fox jumps over the lazy dog.",
                    "Energy is sustainable if it meets the needs of the present without compromising the ability of future generations to meet their needs."],
                   ["The scientific literature on limiting global warming describes pathways in which the world rapidly phases out coal-fired power plants, produces more electricity from clean sources such as wind and solar, shifts towards using electricity instead of fuels in sectors such as transport and heating buildings, and takes measures to conserve energy.",
                    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s"]], columns=['text1', 'text2'])


def process(text):
    tokens = nltk.word_tokenize(text)

    # Other techniques like stemming and lemmatization

    return tokens

def get_features(text1, text2):
    features = []

    feature1 = len(text1) + len(text2)
    features.append(feature1)
    feature2 = len([word1 for word1 in text1 if word1 in text2])
    features.append(feature2)

    # Continued for about 5-10 features. Some features involve multiple steps like doing named entity recognition and creating features from there

    return features

df.loc[:, 'text1_tokens'] = df.loc[:, 'text1'].apply(process)
df.loc[:, 'text2_tokens'] = df.loc[:, 'text2'].apply(process)

features = df.apply(lambda x: get_features(x['text1_tokens'], x['text2_tokens']), axis='columns')

df.loc[:, 'feature1'] = list(map(lambda x: x[0], features))
df.loc[:, 'feature2'] = list(map(lambda x: x[1], features))

【Question comments】:

Tags: python dataframe numpy

【Solution 1】:

feature2 = len([word1 for word1 in text1 if word1 in text2])

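That line has a runtime complexity of words_in_text1 * words_in_text2, because every membership test against text2 scans the whole list. Depending on the size of the texts, you can probably get a large speedup just by building a set of the words in text2 and testing membership against that instead.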
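As a minimal sketch of the set-based lookup (the function names and the sample token lists are made up for illustration and are not part of the original code):

```python
def count_shared_naive(text1, text2):
    # Original form: each membership test scans the whole text2 list.
    return len([word1 for word1 in text1 if word1 in text2])


def count_shared_set(text1, text2):
    # Build the set once; each membership test is then O(1) on average.
    unique = set(text2)
    return len([word1 for word1 in text1 if word1 in unique])


# Hypothetical token lists standing in for text1_tokens / text2_tokens.
tokens1 = ["the", "quick", "brown", "fox", "the"]
tokens2 = ["the", "lazy", "dog", "and", "the", "fox"]

# Both forms count the same tokens; only the lookup cost differs.
assert count_shared_naive(tokens1, tokens2) == count_shared_set(tokens1, tokens2)
```

The result is unchanged because a set answers the same "is this word in text2?" question as the list, just without the linear scan.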

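The list comprehension in that feature2 line also builds a list that is used only for its length and then thrown away. If the order of the words in the texts never matters, using collections.Counter or a similar object may speed things up further.

For example: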

from collections import Counter


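# text1 and text2 are the token lists passed to get_features
text1_counts = Counter(text1)
text2_counts = Counter(text2)
# Sum the text1 counts of the words that also occur in text2
feature2 = sum(count for word, count in text1_counts.items()
               if word in text2_counts)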

If more of your features have similar problems, fixing them should speed up your feature extraction.
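One way to check that such a rewrite leaves the feature value unchanged, and to see whether it pays off on your data, is to compare it against the original on synthetic token lists. This is only a sketch: the function names, the vocabulary, and the list sizes are assumptions chosen for illustration, and the timings will vary by machine.

```python
import random
import timeit
from collections import Counter


def feature2_original(text1, text2):
    # List-against-list membership test, as in the question.
    return len([word1 for word1 in text1 if word1 in text2])


def feature2_counter(text1, text2):
    text1_counts = Counter(text1)
    text2_counts = Counter(text2)
    # Sum the text1 multiplicities of every word that also occurs in text2.
    return sum(count for word, count in text1_counts.items()
               if word in text2_counts)


random.seed(0)
vocab = [f"w{i}" for i in range(500)]      # hypothetical vocabulary
text1 = random.choices(vocab, k=2000)      # hypothetical token lists
text2 = random.choices(vocab, k=2000)

# The rewrite must produce the same feature value as the original.
assert feature2_original(text1, text2) == feature2_counter(text1, text2)

for fn in (feature2_original, feature2_counter):
    seconds = timeit.timeit(lambda: fn(text1, text2), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 calls")
```

For short texts the cost of building the Counter objects may outweigh the savings, so it is worth measuring on token lists of realistic length before changing every feature.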

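【Comments】:

Thanks. Can I check that you mean unique = set(text2) and feature2 = len([word1 for word1 in text1 if word1 in unique])? Or is there a way to simplify the list comprehension further?
Yes, looking words up in unique will be much faster than looking them up in text2. See my edit, which adds Counter objects so that the iteration over text1 is faster too (of course, that only helps if you plan to use the counts more than once).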