[Posted]: 2021-07-23 20:57:26
[Question]:
I need some help splitting a field with regular expressions and building a DataFrame in pandas.
| A | B | C |
|---|---|---|
| 1129 | 19-APR-2021 | Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524 |
| 1139 | 20-APR-2021 | Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381 |
Column C holds several City and Zip Code details per row. I need to extract each pair into its own row, in the following format:
| No | Date | City | Zip |
|---|---|---|---|
| 1129 | 19-APR-2021 | Huntsville_Alabama | 35808 |
| 1129 | 19-APR-2021 | Anchorage_Alaska | 99506 |
| 1139 | 20-APR-2021 | Miami_Florida | 33128 |
| 1139 | 20-APR-2021 | Atlanta_Georgia | 30301 |
My re.findall expressions are below and work fine:
city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*" (https://regex101.com/r/VM8oFF/1)
zip_regex_extract = r"[0-9]{5}" (https://regex101.com/r/oBYJZX/1)
Here is my code so far. It produces the City rows, but I have not been able to add the Zip field.
import pandas as pd
import json, re, sys, time
df = pd.DataFrame({
    'No': ['1129', '1139'],
    'Date': ['19-APR-2021','20-APR-2021'],
    'C': ['Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524','Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381']
})
city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"
df['City'] = [re.findall(city_regex_extract, str(x)) for x in df['C']]
df['Zip'] = [re.findall(zip_regex_extract, str(x)) for x in df['C']]
df = (df
      .set_index(['No','Date'])['City']
      .apply(pd.Series)
      .stack()
      .reset_index()
      .drop('level_2', axis=1)
      .rename(columns={0:'City'}))
print(df)
Any help is appreciated.
[Discussion]:
Tags: python regex pandas dataframe
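One possible approach, sketched here under assumptions rather than offered as the definitive fix: capture each City and Zip pair in a single pattern so the two values stay aligned, then let `Series.str.extractall` produce one row per match. The combined pattern `pair_regex` and the variable names `pairs` and `result` are hypothetical and not part of the original code. The pattern assumes every entry follows the `City: <name> , Zip: <5 digits>` layout shown in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "No": ["1129", "1139"],
    "Date": ["19-APR-2021", "20-APR-2021"],
    "C": [
        "Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 "
        "City: Anchorage_Alaska , Zip: 99506 , 501thru524",
        "Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 "
        "City: Atlanta_Georgia , Zip: 30301 , 301thru381",
    ],
})

# Hypothetical combined pattern: one match per "City: ... , Zip: ..." pair.
# The named groups become the column names of the extracted frame.
pair_regex = r"City:\s*(?P<City>\w+)\s*,\s*Zip:\s*(?P<Zip>[0-9]{5})"

# extractall returns one row per match, indexed by (original row, match number).
pairs = df["C"].str.extractall(pair_regex)

# Drop the match counter so the index lines up with df, then join No and Date
# back on; rows of df are repeated once per extracted pair.
result = (
    df[["No", "Date"]]
    .join(pairs.droplevel("match"))
    .reset_index(drop=True)
)
print(result)
```

Matching both fields in one pattern avoids running two separate `re.findall` calls and pairing their results by position afterwards. It also sidesteps the leading space that the original `city_regex_extract` includes in each match.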