str.extract starting from the back in pandas DataFrame

【Question title】: str.extract starting from the back in pandas DataFrame
【Posted】: 2018-02-12 08:06:28
【Question description】:

I have a DataFrame with several thousand rows and two columns, like this:

                                          string       state
0      the best new york cheesecake rochester ny          ny
1      the best dallas bbq houston tx random str          tx
2   la jolla fish shop of san diego san diego ca          ca
3                                   nothing here          dc

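For each state, I have a regular expression containing all of its city names (in lowercase), structured like (city1|city2|city3|...), where the order of the cities is arbitrary (but can be changed if needed). For example, the regex for New York state contains 'new york' and 'rochester' (likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).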

I want to find the last city that appears in each string (for observations 1, 2, 3, and 4, I want 'rochester', 'houston', 'san diego', and NaN or anything similar, respectively).

I started with str.extract and considered things like reversing the string, but I have hit a dead end.

Thanks very much for any help!

【Discussion】:

Tags: python regex string pandas series

【Solution 1】:

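You can use str.findall, but it returns an empty list when nothing matches, so an apply step is needed to substitute a default value. Finally, select the last item of each list with .str[-1]: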

cities = r"new york|dallas|rochester|houston|san diego"

print (df['string'].str.findall(cities)
                   .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                   .str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object

(Corrected > 1 to >= 1.)
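
If NaN is an acceptable result for rows without a match, as the question allows, the apply step can probably be skipped: .str[-1] on an empty list should yield NaN rather than raise. A minimal sketch, with the DataFrame rebuilt here from the question's sample and `last_city` as a name chosen only for illustration:

```python
import pandas as pd

# Sample data rebuilt from the question (only the 'string' column is needed here)
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
})

cities = r"new york|dallas|rochester|houston|san diego"

# findall returns a list per row; .str[-1] picks the last element,
# and rows whose list is empty come back as NaN
last_city = df["string"].str.findall(cities).str[-1]
print(last_city)
```

The placeholder version above remains the better fit when a specific string is wanted for unmatched rows.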

Another solution is a small hack: use radd to prepend the no-match string to the start of each string, and add the same string to the cities pattern:

a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a

print (df['string'].radd(a).str.findall(cities).str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object
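
A third option, sketched here as an assumption rather than as part of the original answer, stays with str.extract as the question first attempted. A greedy `.*` in front of the capture group consumes as much of the string as possible, so the group ends up on the right-most city. The DataFrame is rebuilt from the question's sample, and `last_city` is an illustrative name:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# The greedy ".*" backtracks from the end of the string, so the capture
# group lands on the last city; rows without a match become NaN.
last_city = df["string"].str.extract(r".*(" + cities + r")", expand=False)
print(last_city)
```

One caveat with this sketch: `.` does not match newlines by default, so multi-line strings would need the `re.DOTALL` flag passed through the `flags` argument.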

【Discussion】:

【Solution 2】:

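import re

# "..." stands for the remaining city names, elided in the original answer
cities = r"new york|dallas|..."

def last_match(s):
    # Collect every city match in the string, left to right
    found = re.findall(cities, s)
    # Return the last match, or an empty string if nothing matched
    return found[-1] if found else ""

df['string'].apply(last_match)
#0    rochester
#1      houston
#2    san diego
#3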
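
The question mentions a separate regex per state, which the single `cities` pattern above does not use. The following is a rough sketch of how the same last-match idea might be applied row by row with per-state patterns. The `state_patterns` dictionary and the `last_city_for_state` helper are hypothetical names, filled only with the example cities given in the question:

```python
import re
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

# Hypothetical per-state patterns, using only the cities named in the question
state_patterns = {
    "ny": re.compile(r"new york|rochester"),
    "tx": re.compile(r"dallas|houston"),
    "ca": re.compile(r"san diego|la jolla"),
}

def last_city_for_state(row):
    # Look up the pattern for this row's state; unknown states give None
    pattern = state_patterns.get(row["state"])
    if pattern is None:
        return None
    found = pattern.findall(row["string"])
    # Last match in reading order, or None when nothing matched
    return found[-1] if found else None

df["last_city"] = df.apply(last_city_for_state, axis=1)
print(df[["state", "last_city"]])
```

A row-wise apply is simple but slow on large frames; grouping by state and calling str.findall once per group may scale better.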

【Discussion】: