CASE statement in Python based on Regex: answers

【Question title】: CASE statement in Python based on Regex
【Posted】: 2021-03-11 07:52:51
【Question description】:

So I have a DataFrame like this:

FileName  
01011RT0TU7  
11041NT4TU8  
51391RST0U2  
01011645RT0TU9  
11311455TX0TU8  
51041545ST3TU9

What I want is another column in the DataFrame, like this:

FileName      |RdwyId  
01011RT0TU7   |01011000  
11041NT4TU8   |11041000  
51391RST0U2   |51391000  
01011645RT0TU9|01011645   
11311455TX0TU8|11311455    
51041545ST3TU9|51041545

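Essentially, if only the first 5 characters are digits, concatenate them with "000"; if the first 8 characters are digits, just copy them into the RdwyId column.

I'm a newbie, so I have been playing around with this:
Test 1: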

rdwyre1=re.compile(r'\d\d\d\d\d')  
rdwyre2=re.compile(r'\d\d\d\d\d\d\d\d')  
rdwy1=rdwyre1.findall(str(thous["FileName"]))  
rdwy2=rdwyre2.findall(str(thous["FileName"]))  
thous["RdwyId"]=re.sub(r'\d\d\d\d\d', str(thous["FileName"].loc[:4])+"000",thous["FileName"])

Test 2:

thous["RdwyId"]=np.select(  
    [  
        re.search(r'\d\d\d\d\d',thous["FileName"])!="None",  
        rdwyre2.findall(str(thous["FileName"]))!="None"  

    ],  
    [  
        rdwyre1.findall(str(thous["FileName"]))+"000",  
        rdwyre2.findall(str(thous["FileName"])),  
    ],  
    default="Unknown"  
)

Test 3:

thous=thous.assign(RdwyID=lambda x: str(rdwyre1.search(x).group())+"000" if bool(rdwyre1.search(x))==True else str(rdwyre2.search(x).group()))

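None of the above works. Can anyone help me figure out where I went wrong, and how to fix it?

【Discussion】:

Are there any other cases besides 5 or 8 characters?
@ombk No. Only those two conditions.
I'll post a very naive approach; if it doesn't help, let me know and I'll delete it.

Tags: python regex pandas numpy dataframe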

【Solution 1】:

You can use numpy's select, which replicates CASE WHEN for multiple conditions, together with pandas' str.isnumeric method:

cond1 = df.FileName.str[:8].str.isnumeric() # first condition
choice1 = df.FileName.str[:8] # result if first condition is met
cond2 = df.FileName.str[:5].str.isnumeric() # second condition
choice2 = df.FileName.str[:5] + "000" # result if second condition is met

condlist = [cond1, cond2]
choicelist = [choice1, choice2]

df.loc[:, "RdwyId"] = np.select(condlist, choicelist)

df

    FileName         RdwyId
0   01011RT0TU7     01011000
1   11041NT4TU8     11041000
2   51391RST0U2     51391000
3   01011645RT0TU9  01011645
4   11311455TX0TU8  11311455
5   51041545ST3TU9  51041545

【Discussion】:

I get the following error: AttributeError: 'Series' object has no attribute 'isnumeric'
Did you put str in front of it? Note how I wrote the code with str: df.FileName.str[:8].str.isnumeric()
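
The error quoted in the comment above comes from calling isnumeric directly on a Series; the string methods live under the .str accessor. As a minimal self-contained sketch of the approach (the DataFrame is rebuilt from the question's sample values, and default="Unknown" is an assumption borrowed from the asker's Test 2, not part of this answer):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"FileName": [
    "01011RT0TU7", "11041NT4TU8", "51391RST0U2",
    "01011645RT0TU9", "11311455TX0TU8", "51041545ST3TU9",
]})

first8 = df["FileName"].str[:8]
first5 = df["FileName"].str[:5]

# np.select takes the first condition that is True for each row,
# so the stricter 8-digit test has to come before the 5-digit one.
condlist = [first8.str.isnumeric(), first5.str.isnumeric()]
choicelist = [first8, first5 + "000"]

# Assumption: rows matching neither condition are labelled "Unknown".
df["RdwyId"] = np.select(condlist, choicelist, default="Unknown")
print(df)
```

Without an explicit default, np.select fills unmatched rows with 0, which is easy to overlook in a column of strings.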

【Solution 2】:

def filt(list1):
    for i in list1:
        if i[:8].isdigit():
            print(i[:8])
        else:
            print(i[:5]+"000")
# output

01011000
11041000
51391000
01011645
11311455
51041545

I mean, if your case is this specific, you can adapt it and apply it to your DataFrame.

Applied to a DataFrame:

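import pandas as pd

def filt(i):
    # Eight leading digits: keep them; otherwise pad the first five with "000".
    if i[:8].isdigit():
        return i[:8]
    else:
        return i[:5]+"000"

# list_1 is assumed to hold the file names from the question
d = pd.DataFrame({"names": list_1})
d["filtered"] = d.names.apply(filt)  # .apply(lambda x: filt(x)) is equivalent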

#output

    names           filtered
0   01011RT0TU7     01011000
1   11041NT4TU8     11041000
2   51391RST0U2     51391000
3   01011645RT0TU9  01011645
4   11311455TX0TU8  11311455
5   51041545ST3TU9  51041545
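
One thing to be aware of: the else branch above appends "000" even when the first five characters are not digits. A hedged variant that checks both cases explicitly might look like the sketch below; the name to_rdwy_id and the "Unknown" fallback are illustrative choices, not part of the original answer, and the third sample name is made up.

```python
def to_rdwy_id(name, fallback="Unknown"):
    """Return the 8-character id for a file name, or `fallback` if no rule applies."""
    if len(name) >= 8 and name[:8].isdigit():
        return name[:8]            # eight leading digits: use them as-is
    if len(name) >= 5 and name[:5].isdigit():
        return name[:5] + "000"    # five leading digits: pad with "000"
    return fallback                # neither rule matched


# The last name is hypothetical and only demonstrates the fallback.
names = ["01011RT0TU7", "01011645RT0TU9", "AB011RT0TU7"]
print([to_rdwy_id(n) for n in names])  # ['01011000', '01011645', 'Unknown']
```

With pandas, such a function could be applied to a column in the same way as filt, for example with d.names.apply(to_rdwy_id).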

【Discussion】:

【Solution 3】:

Using regular expressions:

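import re

c1 = re.compile(r'\d{5}')  # five leading digits
c2 = re.compile(r'\d{8}')  # eight leading digits
rdwyId = []
for f in thous['FileName']:
    m = c2.match(f)  # try the longer pattern first
    if m:
        rdwyId.append(m[0])
        continue
    m = c1.match(f)
    if m:
        rdwyId.append(m[0] + "000")
    else:
        rdwyId.append(None)  # keep the list the same length as the column
thous['RdwyId'] = rdwyId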

Edit: replaced re.search with re.match, which is more efficient here because we only look for a match at the start of the string.
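
To illustrate the difference mentioned in the edit: re.match only matches at the start of the string, whereas re.search scans the whole string and can pick up digits that appear later. A small sketch with a made-up file name (the value "RT01011645TU9" is hypothetical, chosen only to show the difference):

```python
import re

eight_digits = re.compile(r"\d{8}")

name = "RT01011645TU9"  # hypothetical name whose digits are not at the start

print(eight_digits.match(name))           # None: the string does not start with 8 digits
print(eight_digits.search(name).group())  # 01011645: found in the middle of the string
```

So for this task match is not only cheaper but also closer to the stated rule, which is about the first 5 or 8 characters.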

【Discussion】:

【Solution 4】:

Let's try findall and ljust:

df['new'] = df.FileName.str.findall(r"(\d+)[A-Za-z]").str[0].str.ljust(8,'0')
Out[226]: 
0    01011000
1    11041000
2    51391000
3    01011645
4    11311455
5    51041545
Name: FileName, dtype: object
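
A related sketch, offered as an alternative rather than as this answer's method: str.extract with an anchored pattern pulls out the leading run of digits directly, so the pattern does not depend on which character follows the digits. The sample frame and the column name RdwyId are taken from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"FileName": [
    "01011RT0TU7", "11041NT4TU8", "51391RST0U2",
    "01011645RT0TU9", "11311455TX0TU8", "51041545ST3TU9",
]})

# ^(\d+) captures the run of digits at the start of each name;
# expand=False returns a Series instead of a one-column DataFrame.
leading = df["FileName"].str.extract(r"^(\d+)", expand=False)

# Pad on the right with "0" up to 8 characters.
df["RdwyId"] = leading.str.ljust(8, "0")
print(df)
```

Note that padding treats 6 or 7 leading digits the same way as 5, which is only safe because the asker confirmed that 5 and 8 are the only cases.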

【Discussion】: