Extracting data from a column in a pandas DataFrame using regular expressions

【Question title】: extracting the data from a column in pandas dataframe using regular expression
【Posted】: 2019-11-27 07:26:23
【Question description】:

I have a DataFrame df defined as follows:

import pandas as pd
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ],
    }
)

df
Out[100]: 
   ID                                     name
0   1           Hello Kitty how=1234 when=2345
1   2           how=3456 Hello Puppy when=7685
2   3  how=646 It is an Helloexample when=9089
3   4     for how=6574 stackoverflow when=5764
4   5           Hello  when=3632 World how=7654

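I want to extract the values that follow how= and when= into two separate columns, how and when. How can I do this with a regular expression?

For example: in the first record I should get 1234 in the how column and 2345 in the when column. In the last record I should get 7654 in the how column and 3632 in the when column.

【Comments】:

Tags: python regex pandas

【Solution 1】: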

Use str.extract

For example:

import pandas as pd
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ],
    }
)
df['when'] = df['name'].str.extract(r"when=(\w+)")  #If only int use `(\d+)`
df['how'] = df['name'].str.extract(r"how=(\w+)")    #If only int use `(\d+)`
print(df)

Output:

   ID                                     name  when   how
0   1           Hello Kitty how=1234 when=2345  2345  1234
1   2           how=3456 Hello Puppy when=7685  7685  3456
2   3  how=646 It is an Helloexample when=9089  9089   646
3   4     for how=6574 stackoverflow when=5764  5764  6574
4   5          Hello  when=3632 World how=7654  3632  7654
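Note that str.extract returns the captured text as strings, so the new columns above hold "1234" rather than 1234. A minimal sketch of converting them to integers follows, assuming every row contains both keys and that the values are digits only (astype(int) raises an error on missing values). The two-row frame is a shortened stand-in for the one in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": [
            "Hello Kitty how=1234 when=2345",
            "Hello  when=3632 World how=7654",
        ]
    }
)

# expand=False returns a Series (not a one-column DataFrame) for a single group.
# astype(int) assumes every row matched; a row without a match would be NaN.
df["how"] = df["name"].str.extract(r"how=(\d+)", expand=False).astype(int)
df["when"] = df["name"].str.extract(r"when=(\d+)", expand=False).astype(int)
print(df)
```

If some rows might lack a key, pd.to_numeric(..., errors="coerce") is a more forgiving alternative to astype(int).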

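【Comments】:

How can I extract the number if it contains a decimal point, such as 32.46 or .786?
Use \d*\.?\d*?
Do you mean something like this: df['when'] = df['name'].str.extract(r"when=(\d)|(\d*\.\d* )")
When I use df['when'] = df['name'].str.extract(r"when=(\d*\.\d*)"), it only extracts numbers that have a decimal point. How do I extract numbers with or without a decimal point?
You missed the ?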
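
The point of the exchange above is that the decimal point must be optional (\.?). Here is a small sketch of one way to apply that. The pattern \d*\.?\d+ is a slight variation on the one suggested in the comments: it requires at least one trailing digit so that it cannot match an empty string. The sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values: with a decimal, with a leading dot, and a plain integer
samples = pd.Series(["a when=32.46 b", "when=.786", "x when=2345 y"])

# \d*  optional leading digits
# \.?  optional decimal point
# \d+  at least one digit, so the group is never empty
pattern = r"when=(\d*\.?\d+)"

values = samples.str.extract(pattern, expand=False)  # strings
numbers = values.astype(float)                       # floats
print(values.tolist())
print(numbers.tolist())
```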

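【Solution 2】:

Use df.name.str.extract(...). The first argument to this method is the pattern. Include two named capture groups, one for each value you want to capture.

Something like: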

df.name.str.extract(r'(?P<how>(?<=how=)[\d.]+)|(?P<when>(?<=when=)[\d.]+)')

Because the pattern contains backslashes, it should be passed as a raw string.
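
One caveat with the alternation-based pattern above: str.extract only uses the first match in each string, so each row fills only one of the two named groups and leaves the other as NaN. The sketch below shows that behaviour, then one possible workaround (not part of the original answer): two lookaheads anchored at the start of the string, so both groups are captured regardless of the order in which how= and when= appear.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ]
    }
)

# Alternation: only the first match per row is used, so one group stays NaN.
alternation = df["name"].str.extract(
    r"(?P<how>(?<=how=)[\d.]+)|(?P<when>(?<=when=)[\d.]+)"
)

# Lookaheads: each one scans from the start of the string for its own key,
# so the order of how= and when= in the text does not matter.
lookahead = r"^(?=.*?how=(?P<how>[\d.]+))(?=.*?when=(?P<when>[\d.]+))"
extracted = df["name"].str.extract(lookahead)

# Named groups become column names; attach them to the original frame.
df = df.join(extracted)
print(df)
```

With this variant, a row that lacks either key does not match at all, so both columns are NaN for that row.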

【Comments】:

How can I extract the number if it contains a decimal point, such as 32.46 or .786?