【Question title】: String split with exceptions
【Posted】: 2023-03-29 07:50:01
【Question description】:

I am splitting strings into rows, using a comma as the delimiter.

for col in df.columns[df.columns.str.contains(">")]:  # only columns whose name contains ">"
    df[col] = df[col].str.split(", ")
    df = df.explode(col).reset_index(drop=True)

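However, there are three substrings in which the comma occurs "naturally" and should not cause a split:

  1. Data related to sexual preferences, sex life, and/or sexual orientation
  2. Contract, salary and benefits
  3. Procurement, subcontracting and vendor management

Since there are only these three cases, I was wondering whether there is a way to define exceptions with something like "preferences,", "sex life,", "Contract,", "Procurement,". Or is there a more elegant workaround?

Here is an example df: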

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

This is what it should output:

+-------------------------------------------------------------------------+
|                                 col > 1                                 |
+-------------------------------------------------------------------------+
| Personals                                                               |
| Financials                                                              |
| Data related to sexual preferences, sex life, and/or sexual orientation |
| Personals                                                               |
| Financials                                                              |
| Vendors                                                                 |
| Procurement, subcontracting and vendor management                       |
+-------------------------------------------------------------------------+

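【Question comments】:

  • I have a similar problem, but I would like to use " to mark text whose commas should be ignored. The answers below do not seem to take " into account.

Tags: python pandas split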


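【Solution 1】:

You can pass a regular expression with multiple negative lookbehind assertions to Series.str.split() to essentially say "split the row on , unless the , is preceded by ...".

In Python this is best done with several separate negative lookbehind assertions. Python's re module requires every lookbehind to be fixed-width, so a single negative lookbehind whose alternatives are separated by | does not work when the phrases differ in length.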
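The fixed-width restriction can be checked directly with the standard `re` module. The following is a minimal sketch, not part of the original answer: it uses two of the phrases from the question, and the trailing `\s*` is an added assumption so that the space after each comma is consumed as well.

```python
import re

text = "Vendors, Procurement, subcontracting and vendor management"

# A single lookbehind whose alternatives differ in length is rejected by `re`.
try:
    re.compile(r"(?<!Contract|Procurement),\s*")
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False

# Chaining one fixed-width lookbehind per phrase is accepted.
chained = re.compile(r"(?<!Contract)(?<!Procurement),\s*")
parts = chained.split(text)

print(single_lookbehind_ok)
print(parts)
```

Each chained lookbehind is checked at the same position, so the comma is only used as a split point when none of the phrases ends right before it.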

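To split on , except where it is preceded by one of the phrases from your example, you can use:

r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),\s*"

The trailing \s* also consumes the space after each comma, so the resulting pieces carry no leading whitespace.

Full code example:

import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

# Split on "," unless it directly follows one of the protected phrases
df["col > 1"] = df["col > 1"].str.split(r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),\s*")

# One row per list element, with a fresh 0...n index
df = df.explode("col > 1").reset_index(drop=True)

This gives you a df with the desired ["col > 1"] values outlined in your question and a new index 0...n: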

                                             col > 1
0                                          Personals
1                                         Financials
2   Data related to sexual preferences, sex life,...
3                                          Personals
4                                         Financials
5                                            Vendors
6   Procurement, subcontracting and vendor manage...
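If the list of exceptions grows, the pattern could be generated instead of typed by hand. The sketch below is one possible way to do that, assuming the phrases contain no text that should be treated as regex syntax (hence `re.escape`). The names `build_split_pattern` and `protected_endings` are hypothetical helpers introduced for illustration; they are not part of pandas or of the original answer.

```python
import re

import pandas as pd


def build_split_pattern(protected_endings):
    """Return a regex that splits on commas unless one follows a protected ending."""
    # One fixed-width negative lookbehind per phrase, chained back to back.
    lookbehinds = "".join(f"(?<!{re.escape(ending)})" for ending in protected_endings)
    # \s* swallows the whitespace after the comma (an assumption about the data).
    return lookbehinds + r",\s*"


# Text that directly precedes each comma that must survive (from the question).
protected_endings = ["preferences", "sex life", "Contract", "Procurement"]
pattern = build_split_pattern(protected_endings)

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

# Same idea as the loop in the question: only columns whose name contains ">".
for col in df.columns[df.columns.str.contains(">")]:
    df[col] = df[col].str.split(pattern)
    df = df.explode(col).reset_index(drop=True)

result = df["col > 1"].tolist()
print(result)
```

Because the lookbehinds only inspect the text immediately before the comma, a phrase such as "Procurement" protects every comma that follows that word, not just the one in the full sentence from the question.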

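【Comments】:

    【Solution 2】:
    1. Temporarily replace the commas in the exception phrases with something else (let's use ;)
    2. Split the strings into lists on the comma
    3. Explode the DataFrame
    4. Replace the semicolons with commas again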

    import pandas as pd

    df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})
    # Phrases whose commas must be kept
    r1 = ['Data related to sexual preferences, sex life, and/or sexual orientation',
          'Contract, salary and benefits',
          'Procurement, subcontracting and vendor management']
    # The same phrases with the commas temporarily swapped for semicolons
    r2 = ['Data related to sexual preferences; sex life; and/or sexual orientation',
          'Contract; salary and benefits',
          'Procurement; subcontracting and vendor management']
    df = df.replace(r1, r2, regex=True)             # step 1: hide the protected commas
    df['col > 1'] = df['col > 1'].str.split(', ')   # step 2: split on the remaining commas
    # steps 3 and 4: explode, then restore the commas
    df = df.explode('col > 1').replace(r2, r1, regex=True)
    df
    Out[1]: 
                                                 col > 1
    0                                          Personals
    0                                         Financials
    0   Data related to sexual preferences, sex life,...
    1                                          Personals
    1                                         Financials
    2                                            Vendors
    2   Procurement, subcontracting and vendor manage...
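The four steps above can also be sketched with the masked phrases derived automatically, so the two lists cannot drift apart. This is an illustrative variant, not the answer's own code: it assumes that ";" never occurs in the real data, and the names `protected` and `masked` are introduced here only for the example.

```python
import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

# Phrases whose internal commas must survive the split (from the question).
protected = [
    "Data related to sexual preferences, sex life, and/or sexual orientation",
    "Contract, salary and benefits",
    "Procurement, subcontracting and vendor management",
]
# Derive the masked versions instead of typing them by hand.
masked = [phrase.replace(",", ";") for phrase in protected]

s = df["col > 1"]

# Step 1: hide the protected commas behind the placeholder.
for plain, mask in zip(protected, masked):
    s = s.str.replace(plain, mask, regex=False)

# Steps 2 and 3: split on the remaining commas, one row per piece, fresh index.
s = s.str.split(", ").explode().reset_index(drop=True)

# Step 4: put the original commas back.
for plain, mask in zip(protected, masked):
    s = s.str.replace(mask, plain, regex=False)

result = s.tolist()
print(result)
```

Using `regex=False` treats the phrases as literal text, which avoids surprises if a phrase ever contains characters such as "(" or "+".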
    

    【Comments】:
