Optimizing string manipulations in pandas

【Question title】: Optimizing string manipulations in pandas
【Posted】: 2021-03-06 12:34:03
【Question description】:

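I have a dataset of 10M records. The first step is to clean the data and cap each text at fewer than 400 words (where it is longer). Can this be done faster in plain Python/pandas, without numba, dask, or other multiprocessing libraries?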

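import numpy as np
from cleantext import clean

def func_vect(val):
    # Clean the text, then split it into whitespace-delimited words.
    temp = clean(val, no_line_breaks=True, no_urls=True, no_emails=True, lower=True).split()

    # Texts with more than 400 words keep only the first and last 175 words.
    if len(temp) > 400:
        temp = temp[:175] + temp[-175:]

    # Drop any word longer than 15 characters.
    return " ".join(u for u in temp if len(u) <= 15)

# np.vectorize wraps func_vect in an element-wise loop; otypes=[str] sets the output dtype.
ufunc_vec = np.vectorize(func_vect, otypes=[str])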

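【Comments on the question】:

Why not use np.select(condition, choice) for the if/else condition? It should speed things up.
Can you give an example? I tried np.select, but for the else branch I would have to split the text twice, right?
It's hard to give you code without data. But the idea is to drop the loop and use numpy to optimize it.
Don't use np.vectorize to speed up code. In pandas, string values are stored as Python strings in an object-dtype Series. pandas does have methods for applying string methods to a Series. numpy doesn't do anything fast or fancy with strings.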
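To illustrate the last comment: np.vectorize still calls the Python function once per element, so a plain list comprehension produces the same result. The sketch below assumes a hypothetical stand-in cleaner, `drop_long_words`, in place of the real `clean(...)` call, and a toy Series; any speed difference would need to be timed on the real data.

```python
import numpy as np
import pandas as pd

def drop_long_words(text, max_len=15):
    # Hypothetical stand-in for the real cleaning step: lowercase, split on
    # whitespace, and drop words longer than max_len characters.
    return " ".join(w for w in text.lower().split() if len(w) <= max_len)

s = pd.Series(["Short TEXT here", "keep this averyveryverylongtoken"])

# np.vectorize is a convenience wrapper around a per-element Python call.
via_vectorize = np.vectorize(drop_long_words, otypes=[str])(s.to_numpy())

# A list comprehension performs the same per-element calls directly.
via_listcomp = [drop_long_words(t) for t in s]

# Wrap the result back into a Series aligned with the original index.
result = pd.Series(via_listcomp, index=s.index)
```

Both paths run the same Python code per row, so neither is vectorized in the NumPy sense.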

Tags: python python-3.x pandas numpy

【Solution 1】:

This might work:

df['truncated_string'] = df['string'].str[:400]
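Note that `.str[:400]` slices characters, not words. A minimal sketch of the difference, assuming whitespace-delimited words and a toy DataFrame (the column names `first_chars` and `first_words` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"string": ["one two three four five"]})

# Slicing the string itself keeps the first 7 characters.
df["first_chars"] = df["string"].str[:7]

# Splitting first yields lists; .str[:3] then slices each list,
# and .str.join puts the kept words back together.
words = df["string"].str.split()
df["first_words"] = words.str[:3].str.join(" ")
```

The `.str` accessor methods still loop over Python objects internally, so this is mainly a readability gain rather than a guaranteed speedup.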

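【Comments】:

I want to clean the text first, then take the first 175 words and the last 175 words, join them, and feed the result to an ML model as input.
Is there a reason the column can't be cleaned first? Something like df['clean_string'] = df['dirty_string'].apply(cleaning_func)?
I wrote a function to reduce computation time. apply also works for the case above, but it takes more time.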
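The requirement described in these comments (first 175 and last 175 words of an already cleaned text) can be sketched in plain Python with a single split per text. The function name `head_tail_words` and its parameters are hypothetical, and the cleaning step is assumed to have run beforehand.

```python
def head_tail_words(text, max_words=400, keep=175, max_len=15):
    # Split once and reuse the list for both the length check and the slices.
    words = text.split()
    if len(words) > max_words:
        # Keep only the first and last `keep` words of long texts.
        words = words[:keep] + words[-keep:]
    # Drop words longer than max_len characters.
    return " ".join(w for w in words if len(w) <= max_len)

# Toy input: 1000 distinct short words, w0 through w999.
sample = " ".join(f"w{i}" for i in range(1000))
out = head_tail_words(sample).split()
```

Applied to a column, this could be used as `[head_tail_words(t) for t in df['clean_string']]`, which avoids splitting any text twice.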