String split with exceptions: answers

[Question title]: String split with exceptions
[Posted]: 2023-03-29 07:50:01
[Question description]:

I am splitting strings into rows, using a comma as the delimiter.

for col in [col for col in df.loc[:,df.columns.str.contains(">")]]: #only on colnames containing ">"
    df[col] = df[col].str.split(", ")
    df = df.explode(col).reset_index(drop=True)

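However, there are three substrings in which the comma occurs "naturally" and should not cause a split:

Data related to sexual preferences, sex life, and/or sexual orientation
Contract, salary and benefits
Procurement, subcontracting and vendor management

Since there are only these three instances, I was wondering whether there is a way to make exceptions with something like "preferences,", "sex life,", "Contract," and "Procurement,". Or is there a more elegant workaround?

Here is an example df: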

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

This is what it should output:

+-------------------------------------------------------------------------+
|                                 col > 1                                 |
+-------------------------------------------------------------------------+
| Personals                                                               |
| Financials                                                              |
| Data related to sexual preferences, sex life, and/or sexual orientation |
| Personals                                                               |
| Financials                                                              |
| Vendors                                                                 |
| Procurement, subcontracting and vendor management                       |
+-------------------------------------------------------------------------+

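[Question discussion]:

I have a similar problem, but I would like to use " to mark text whose commas should be ignored. The answers below do not seem to take " into account.

Tags: python pandas split

[Solution 1]: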

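You can pass Series.str.split() a regex pattern with several negative lookbehind assertions, which essentially says: split on "," unless the "," is preceded by ...

In Python this is best done with several separate negative lookbehind assertions. Python's re module requires every lookbehind to be fixed-width, so you cannot simply write a single negative lookbehind whose alternatives are separated by | when those alternatives differ in length.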
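The fixed-width restriction can be checked directly with the standard-library re module. The snippet below is a small sketch; the sample string and the variable names are made up for illustration and are not part of the original answer.

```python
import re

# A single lookbehind with alternatives of different lengths is rejected
# by Python's built-in re module ("preferences" has 11 chars, "sex life" 8).
try:
    re.compile(r"(?<!preferences|sex life),")
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False

# Chaining one fixed-width lookbehind per phrase compiles without error.
chained = re.compile(r"(?<!preferences)(?<!sex life),")

# Only the comma after "a" is a split point; the other two are protected.
parts = chained.split("a, sexual preferences, sex life, b")
print(single_lookbehind_ok, parts)
```

Each lookbehind in the chain is checked at the same position, so the comma is a split point only if none of the listed phrases ends immediately before it.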

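To split on "," except where it is preceded by one of the phrases from your example, you can use:

r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),\s*"

The trailing \s* also consumes the space after each comma, so the resulting pieces carry no leading whitespace.

Full code example:

import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

# Split on commas that are not preceded by one of the protected phrases
df["col > 1"] = df["col > 1"].str.split(r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),\s*")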

df = df.explode("col > 1").reset_index(drop=True)

This gives you df with the desired values in ["col > 1"], as outlined in your question, and a fresh index 0...n.

That is:

                                             col > 1
0                                          Personals
1                                         Financials
2   Data related to sexual preferences, sex life,...
3                                          Personals
4                                         Financials
5                                            Vendors
6   Procurement, subcontracting and vendor manage...
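If the list of protected phrases is likely to grow, the pattern can be assembled from a plain Python list instead of being typed by hand. The following is a minimal sketch under that assumption; build_split_pattern is a hypothetical helper name, and the loop over columns containing ">" mirrors the loop from the question.

```python
import re
import pandas as pd

def build_split_pattern(protected_prefixes):
    """Return a regex that matches a comma (plus trailing whitespace)
    unless one of the given phrases ends right before the comma."""
    # One fixed-width negative lookbehind per phrase, chained together.
    lookbehinds = "".join(f"(?<!{re.escape(p)})" for p in protected_prefixes)
    return lookbehinds + r",\s*"

df = pd.DataFrame({"col > 1": [
    "Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation",
    "Personals, Financials",
    "Vendors, Procurement, subcontracting and vendor management",
]})

pattern = build_split_pattern(["preferences", "sex life", "Contract", "Procurement"])

# Apply the split only to columns whose name contains ">".
for col in df.columns[df.columns.str.contains(">")]:
    # A multi-character pattern is interpreted as a regex by default.
    df[col] = df[col].str.split(pattern)
    df = df.explode(col).reset_index(drop=True)

print(df)
```

Using re.escape keeps phrases that contain regex metacharacters from changing the meaning of the pattern. Note that a lookbehind only inspects the text directly before the comma, so any other occurrence of, say, "Contract," in the data would be protected as well.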

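[Discussion]:

[Solution 2]:

You can temporarily replace the commas inside those exceptions with something else (let's use ;).
Split on commas to create the lists
Explode the dataframe
Replace the semicolons with commas again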

import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})
# r1: the exception phrases; r2: the same phrases with ";" in place of ","
r1 = ['Data related to sexual preferences, sex life, and/or sexual orientation',
      'Contract, salary and benefits',
      'Procurement, subcontracting and vendor management']
r2 = ['Data related to sexual preferences; sex life; and/or sexual orientation',
      'Contract; salary and benefits',
      'Procurement; subcontracting and vendor management']
# Protect the commas inside the exception phrases
df = df.replace(r1,r2, regex=True)
# Split on ", " so the pieces carry no leading space
df['col > 1'] = df['col > 1'].str.split(', ')
# One row per piece, then restore the protected commas (the index is not reset)
df = df.explode('col > 1').replace(r2,r1,regex=True)
df
Out[1]: 
                                             col > 1
0                                          Personals
0                                         Financials
0   Data related to sexual preferences, sex life,...
1                                          Personals
1                                         Financials
2                                            Vendors
2   Procurement, subcontracting and vendor manage...
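The placeholder steps above can also be wrapped in a function that derives the semicolon versions from the exception phrases, so that only one list has to be maintained. This is a sketch, not part of the original answer: explode_with_protected_phrases is a hypothetical name, and it assumes the placeholder character ";" never occurs in the real data.

```python
import pandas as pd

def explode_with_protected_phrases(series, phrases, placeholder=";"):
    """Split each string on ", " into rows, leaving the commas inside
    the given phrases untouched. Assumes `placeholder` is absent from the data."""
    protected = series
    for phrase in phrases:
        # Literal (non-regex) substring replacement: "," -> placeholder.
        masked = phrase.replace(",", placeholder)
        protected = protected.str.replace(phrase, masked, regex=False)
    # Split into lists and turn every list element into its own row.
    parts = protected.str.split(", ").explode()
    # Restore the protected commas and renumber the rows 0...n.
    return parts.str.replace(placeholder, ",", regex=False).reset_index(drop=True)

phrases = [
    "Data related to sexual preferences, sex life, and/or sexual orientation",
    "Contract, salary and benefits",
    "Procurement, subcontracting and vendor management",
]
s = pd.Series([
    "Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation",
    "Personals, Financials",
    "Vendors, Procurement, subcontracting and vendor management",
], name="col > 1")

result = explode_with_protected_phrases(s, phrases)
print(result)
```

Unlike the lookbehind approach, this variant protects only the full phrases, so a stray "Procurement," elsewhere in the data would still be split.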

[Discussion]: