【Posted】: 2021-11-03 05:32:04
【Question】:
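I'm working on an NLP project in Google Colab (Python) with a text dataset of about 100,000 instances. For each instance I extract roughly 5-10 features, and each run of the code takes about 5-10 minutes. Because I'm experimenting with different kinds of features, I rerun the feature extraction many times, and the total running time adds up.
I suspect the cause is that my code is inefficient: it currently relies on list comprehensions, map, and iteration. It also uses a lot of memory, both because of the size of the data and because it stores multiple copies of the text.
So I'd like to know whether there is a better way to do the feature extraction that speeds up processing (and saves space). I've heard that numpy has vectorized operations, but I don't know how to apply them here.
Here is a skeleton version of my code.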
import nltk
import numpy as np
import pandas as pd

# word_tokenize needs the Punkt tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

df = pd.DataFrame([["The quick brown fox jumps over the lazy dog.",
                    "Energy is sustainable if it meets the needs of the present without compromising the ability of future generations to meet their needs."],
                   ["The scientific literature on limiting global warming describes pathways in which the world rapidly phases out coal-fired power plants, produces more electricity from clean sources such as wind and solar, shifts towards using electricity instead of fuels in sectors such as transport and heating buildings, and takes measures to conserve energy.",
                    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s"]], columns=['text1', 'text2'])

def process(text):
    tokens = nltk.word_tokenize(text)
    # Other techniques like stemming and lemmatization
    return tokens

def get_features(text1, text2):
    # text1 and text2 are token lists produced by process()
    features = []
    feature1 = len(text1) + len(text2)
    features.append(feature1)
    feature2 = len([word1 for word1 in text1 if word1 in text2])
    features.append(feature2)
    # Continued for about 5-10 features. Some features involve multiple steps like doing named entity recognition and creating features from there
    return features

# Tokenize each text column once
df.loc[:, 'text1_tokens'] = df.loc[:, 'text1'].apply(process)
df.loc[:, 'text2_tokens'] = df.loc[:, 'text2'].apply(process)

# Compute one feature list per row, then split it into separate columns
features = df.apply(lambda x: get_features(x['text1_tokens'], x['text2_tokens']), axis='columns')
df.loc[:, 'feature1'] = list(map(lambda x: x[0], features))
df.loc[:, 'feature2'] = list(map(lambda x: x[1], features))
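For reference, here is a minimal sketch of one way the two features in the skeleton could be computed with less overhead. It is an illustration under stated assumptions, not a measured fix: `get_features_fast` and `extract_features` are hypothetical names, and `str.split` stands in for `nltk.word_tokenize` so the snippet runs without NLTK data. The idea is to build a `set` from the second token list so each membership test no longer scans the whole list, and to iterate over the token pairs directly instead of going through a row-wise `df.apply`. String features like these do not map onto numpy's numeric vectorized operations, and tokenization or named entity recognition may well dominate the run time, so profiling first is advisable.

```python
def get_features_fast(tokens1, tokens2):
    """Compute the same two features as get_features, using a set for lookups."""
    vocab2 = set(tokens2)  # average O(1) membership test instead of a list scan
    total_len = len(tokens1) + len(tokens2)
    shared = sum(1 for word in tokens1 if word in vocab2)
    return total_len, shared


def extract_features(token_pairs):
    """Compute all features in one pass and return them column-wise."""
    rows = [get_features_fast(t1, t2) for t1, t2 in token_pairs]
    if not rows:
        return [], []
    feature1, feature2 = zip(*rows)  # transpose rows into one tuple per feature
    return list(feature1), list(feature2)


# Assumption: str.split is a stand-in tokenizer for this demo only
pairs = [
    ("the quick brown fox".split(), "the lazy dog and the fox".split()),
    ("energy is sustainable".split(), "lorem ipsum".split()),
]
feature1, feature2 = extract_features(pairs)
print(feature1)  # [10, 5]
print(feature2)  # [2, 0]
```

With the DataFrame from the question, the pairs would come from `zip(df['text1_tokens'], df['text2_tokens'])`, and each returned list can be assigned directly as a column, for example `df['feature1'] = feature1`.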
【Discussion】: