Pandas str.extract for np.where: Whitespace outside regex capturing group throws AttributeError

【Question title】: Pandas str.extract for np.where: Whitespace outside regex capturing group throws AttributeError
【Posted】: 2016-05-22 16:53:30
【Question】:

From these two strings, I want to capture the 5X part in the first row, but not the X50 part in the second row:

    "name"
1   LONG YOX 5X AAA
2   LONG YOX50 AAA

For a pandas.DataFrame.loc operation, I use numpy.where to extract that part, with long_keyword as the locator and str.extract for the regex:

long_keyword = df.loc[df["name"].str.contains("LONG", case=False), "name"]

df.loc[df["name"].str.contains(long_keyword, case=False), "result_column"] = np.where(long_keyword.str.extract(r"\s(\d+X|X\d+)", flags=re.IGNORECASE).str.strip("Xx").str.isdigit(), "+" + long_keyword.str.extract(r"\s(\d+X|X\d+)", flags=re.IGNORECASE).str.strip("Xx") + "00", "+100")

When I use the regex \s(\d+X|X\d+), I get:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

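But when I use the same regex without the leading whitespace \s outside the capture group, i.e. (\d+X|X\d+), I get no error. However, that means a part of the string I don't want gets captured.

Q: How can I resolve this error? Is the problem the whitespace \s, or the fact that I have a regex token outside the capture group ()?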
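
For reference, here is a minimal sketch of how the two patterns behave on the two sample rows in isolation. The DataFrame construction is an assumption rebuilt from the rows shown above, and `expand=False` is added because recent pandas versions return a DataFrame from str.extract unless it is passed; neither appears in the original post.

```python
import re
import pandas as pd

# Assumed sample data, rebuilt from the two rows shown in the question
df = pd.DataFrame({"name": ["LONG YOX 5X AAA", "LONG YOX50 AAA"]}, index=[1, 2])

# expand=False returns a Series when the pattern has a single capture group
with_space = df["name"].str.extract(r"\s(\d+X|X\d+)", flags=re.IGNORECASE, expand=False)
without_space = df["name"].str.extract(r"(\d+X|X\d+)", flags=re.IGNORECASE, expand=False)

print(with_space.tolist())     # ['5X', nan]
print(without_space.tolist())  # ['5X', 'X50']
```

On these two rows the pattern with the leading \s captures only "5X" and leaves the second row missing, while the pattern without it also picks up "X50".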

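【Comments】:

Please post a minimal reproducible example that we can run to reproduce the problem. Avoid including issues unrelated to the question. The df.loc and np.where bits look unrelated to your problem. Your syntax and your use of np.where are incorrect. sourceString.str.extract works for me.
@Goyo I have revised the question and added more accurate details. Not sure whether that changes anything.
Whatever the regex is, your code raises TypeError: 'Series' objects are mutable, thus they cannot be hashed for me. Anyway, you don't expect me to help you debug a 280-character line of code containing 16 operations/attribute accesses/method calls, do you? Why do you keep posting code that cannot possibly produce the problem you describe?
@Goyo I don't expect anything from you; you are the one who responded to my question.
Sorry, that was too sarcastic. The point is that the problem you describe cannot be reproduced with the code you posted, so I can't help. I can't even tell exactly what your problem is.

Tags: python regex numpy pandas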

【Solution 1】:

Suppose you have a file like this

10,"ABC YOX 5X AAA"
20,"ABC YOX50 AAA"

So the DataFrame looks like this

           string
10  ABC YOX 5X AAA
20   ABC YOX50 AAA
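
A minimal sketch of loading such a file into that DataFrame. The in-memory `raw` string stands in for the file, and the column names `id` and `string` passed to `read_csv` are assumptions, since the answer does not show how the file was read.

```python
import io
import pandas as pd

# Stand-in for the two-line file shown above (assumed to have no header row)
raw = '10,"ABC YOX 5X AAA"\n20,"ABC YOX50 AAA"\n'

# The first column becomes the index, the second the "string" column
df = pd.read_csv(io.StringIO(raw), header=None, names=["id", "string"], index_col=0)
df.index.name = None  # drop the index label so the frame prints like the one above

print(df)
```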

Do you want this?

df['size']=df['string'].apply(lambda x: len(x.split()))
df['interest']=df[df['size']==4]['string'].str.split(" ").str.get(2)

Output

           string  size interest
10  ABC YOX 5X AAA     4       5X
20   ABC YOX50 AAA     3      NaN

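Is this what you are after?

【Comments】:

Sorry, I looked at your code, but I'm not sure what it is supposed to do. What I want is to take the "5X" part of the string and convert it to "+500". I have edited the question to include more of the original code.
"YOX50" should not be converted to "+5000", because there is no \s before the number.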
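
The goal described in these comments could be sketched roughly as follows. This is not taken from the thread: the sample DataFrame, the boolean mask `is_long`, the use of `expand=False`, and replacing np.where with `fillna` for the "+100" default are all assumptions made for illustration.

```python
import re
import pandas as pd

# Assumed sample data, rebuilt from the two rows shown in the question
df = pd.DataFrame({"name": ["LONG YOX 5X AAA", "LONG YOX50 AAA"]}, index=[1, 2])

# Boolean mask of the rows to update (used instead of passing a Series to str.contains)
is_long = df["name"].str.contains("LONG", case=False)
long_names = df.loc[is_long, "name"]

# Capture a token such as "5X" only when whitespace precedes it; expand=False gives a Series
token = long_names.str.extract(r"\s(\d+X|X\d+)", flags=re.IGNORECASE, expand=False)
digits = token.str.strip("Xx")  # "5X" -> "5"; rows without a match stay missing

# Build "+500" where a token was found, and fall back to "+100" elsewhere
result = ("+" + digits + "00").fillna("+100")
df.loc[is_long, "result_column"] = result

print(df)
```

Computing the extraction once and filling the missing rows afterwards keeps the default value in one place and avoids repeating the long str.extract chain.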