Splitting a pandas DataFrame column depending on whether a word is uppercase: answers

【Question title】: Splitting a pandas Data frame column depending if word is capital or not
【Posted】: 2022-01-17 13:53:51
【Question description】:

I have a dataframe column in the following format:

header
THIS IS an example
ALSO this
ONE LAST
J. one more

I want to split it into two columns:

header1	header2
THIS IS	an example
ALSO	this
ONE LAST	null
null	J. one more

I tried extracting the information like this:

df1['header'].str.split('[A-Z]', expand=True)

But my regex is not up to the task. Any help is much appreciated!

【Discussion】:

Tags: regex pandas

【Solution 1】:

Note on Greek letters

To match only Greek uppercase letters, replace [A-Z] in the patterns below with [\u0391-\u03A1\u03A3-\u03A9]. To match both ASCII and Greek uppercase letters, use [A-Z\u0391-\u03A1\u03A3-\u03A9].

That is:

rx = r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9]+\b(?!\.)(?:\s+[\u0391-\u03A1\u03A3-\u03A9]+)*\b)?)(?:\s+(?P<header2>.*))?'
new_df = df['header'].str.extract(rx, expand=True)

See the regex demo.
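As a quick check of the Greek character class outside pandas, here is a minimal sketch using Python's built-in re module. The sample strings are taken from the asker's comments further down; the helper name split_greek_header is hypothetical and only there for illustration.

```python
import re

# Same pattern as above, split over two lines for readability.
rx = re.compile(
    r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9]+\b(?!\.)'
    r'(?:\s+[\u0391-\u03A1\u03A3-\u03A9]+)*\b)?)(?:\s+(?P<header2>.*))?'
)

def split_greek_header(text):
    """Hypothetical helper: return (header1, header2) for one string."""
    m = rx.match(text)  # every part is optional, so this always matches
    # header2 is None when nothing follows the uppercase words
    return m.group('header1'), m.group('header2')

samples = ["ΓΙΑΝΝΙΤΣΩΝ", "ΕΛΕΥΘΕΡΙΟ Άγιος Γεώργιος"]
results = [split_greek_header(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), '->', r)
```

Note that accented capitals such as Ά (U+0386) fall outside these ranges, so a word starting with one ends up in header2 rather than header1.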

You can use

df[['header1', 'header2']] = df['header'].str.extract(r'^\s*((?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?)\s*(.*)', expand=True)

Output:

>>> df
               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST             
3         J. one more            J. one more

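See the regex demo.

Details:

^ - start of string
\s* - zero or more whitespace characters
((?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?) - Group 1 (header1): an optional sequence of one or more uppercase ASCII letters that ends at a word boundary and is not followed by a . character, then zero or more repetitions of one or more whitespace characters followed by one or more uppercase ASCII letters
\s* - zero or more whitespace characters
(.*) - Group 2 (header2): any zero or more characters other than line break characters, as many as possible.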
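The breakdown above can be tried on the question's four sample strings with the standard re module. This is a minimal sketch, not part of the original answer; it only shows which text each group captures.

```python
import re

# Pattern from the answer: group 1 = leading uppercase words, group 2 = the rest
pattern = re.compile(r'^\s*((?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?)\s*(.*)')

samples = ["THIS IS an example", "ALSO this", "ONE LAST", "J. one more"]
results = [pattern.match(s).groups() for s in samples]
for s, (h1, h2) in zip(samples, results):
    print(f"{s!r}: header1={h1!r}, header2={h2!r}")
```

In the last sample the (?!\.) lookahead rejects "J" because it is followed by a dot, so group 1 stays empty and the whole string lands in group 2.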

You can use named capture groups to extract into a new dataframe:

>>> new_df = df['header'].str.extract(r'^\s*(?P<header1>(?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?)\s*(?P<header2>.*)', expand=True)
>>> new_df
    header1      header2
0   THIS IS   an example
1      ALSO         this
2  ONE LAST             
3            J. one more

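【Discussion】:

When I run your example it works fine, but on my dataframe it does not, although I have covered all the cases. It returns two empty columns. Could it be because of the encoding? My dataset is utf8 and you mentioned ascii. Or should I look elsewhere for the root of the problem? Thanks again for your help.
@PeterGould [A-Z] only matches ASCII uppercase letters. For example, it does not match Ф. Whether the file is encoded as ASCII or UTF8 does not matter; that is unrelated. If you get empty columns, it means the data you have is not the data you shared with us. We can only test what you share.
@PeterGould Here, you show Greek uppercase letters. Do you need to support Greek?
@PeterGould I added a note at the top of the answer.
@PeterGould Use r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9]+\b(?!\.)(?:\s+[\u0391-\u03A1\u03A3-\u03A9]+)*\b)?)(?:\s+(?P<header2>.*))?'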

【Solution 2】:

You can also use 2 named capture groups and join the resulting columns to the dataframe.

^(?P<header1>[A-Z]+(?:[^\S\n]+[A-Z]+)*)?(?:(?:^|[^\S\n]+)(?P<header2>.+))?$

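(If header2 must start with a lowercase character, use [a-z].* instead of .+)

^ Start of string
(?P<header1>[A-Z]+ Named capture group header1, match 1+ chars A-Z
(?:[^\S\n]+[A-Z]+)*)? Optionally repeat 1+ whitespace chars (other than a newline) followed by 1+ chars A-Z, then close group header1 and make it optional
(?: Non-capturing group
- (?:^|[^\S\n]+) Assert the start of the string or match 1+ whitespace chars (other than a newline)
- (?P<header2>.+) Named group header2, match 1+ chars
)? Close the group and make it optional
$ End of string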
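To see how the two optional groups interact, the pattern can be run over the question's sample strings with plain re. This is a minimal sketch; replacing unmatched groups with '' mirrors what .fillna('') does in the pandas version.

```python
import re

pattern = re.compile(
    r'^(?P<header1>[A-Z]+(?:[^\S\n]+[A-Z]+)*)?'
    r'(?:(?:^|[^\S\n]+)(?P<header2>.+))?$'
)

samples = ["THIS IS an example", "ALSO this", "ONE LAST", "J. one more"]
results = []
for s in samples:
    m = pattern.match(s)
    # Unmatched optional groups come back as None; show them as ''
    results.append((m.group('header1') or '', m.group('header2') or ''))

for s, r in zip(samples, results):
    print(repr(s), '->', r)
```

For "J. one more", header1 first tries "J", but neither the end of the string nor whitespace follows it, so the engine backtracks, skips header1, and lets the ^ alternative hand the whole string to header2.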

See the regex demo and the Python demo.

Example

import pandas as pd

strings = [
    "THIS IS an example",
    "ALSO this",
    "ONE LAST",
    "J. one more"
]
df1 = pd.DataFrame(strings, columns=["header"])
df1 = df1.join(
    df1['header'].str.extract(
        r'^(?P<header1>[A-Z]+(?:[^\S\n]+[A-Z]+)*)?(?:(?:^|[^\S\n]+)(?P<header2>.+))?$',
        expand=True
    )
    .fillna('')
)

print(df1)

Output

               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST             
3         J. one more            J. one more

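【Discussion】:

I get the following error here: ValueError: columns overlap but no suffix specified: Index(['area_lvl_2', 'area_lvl_1'], dtype='object')
@PeterGould I cannot reproduce that; I get this output: ideone.com/Jqptim. Are you using the same column names?
Yes, I changed the column names and now it runs. But it does not populate columns header1 and header2. This is what I get:
|header|header1|header2|
|--------|------------|----------|
|ΓΙΑΝΝΙΤΣΩΝ |null |null|
|ΕΛΕΥΘΕΡΙΟ Άγιος Γεώργιος|null |null|
|ΕΛΕΥΘΕΡΙΟ |null |null|
|ΕΛΕΥΘΕΡΙΟ |null |null|
@PeterGould You have updated the question, and I have updated my answer accordingly. I added an example showing that it works for the given data. I cannot reproduce the result from your comments. What you can do is update the question with the code that does not work for you, so that you can get an answer that works with your code.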

【Solution 3】:

Using str.extract, we can try:

df["header1"] = df["header"].str.extract(r'^([A-Z]+(?: [A-Z]+)?)')
df["header2"] = df["header"].str.extract(r'\b([a-z]+(?: [a-z]+)?)')
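For comparison with the expected output in the question, here is a rough sketch of what these two patterns capture on the sample strings. It uses re.search, which is how str.extract picks the first match; the helper first_group is hypothetical, and None stands in for the NaN that pandas would return.

```python
import re

upper = re.compile(r'^([A-Z]+(?: [A-Z]+)?)')
lower = re.compile(r'\b([a-z]+(?: [a-z]+)?)')

def first_group(rx, text):
    """Hypothetical helper: first capture group of the first match, or None."""
    m = rx.search(text)
    return m.group(1) if m else None

samples = ["THIS IS an example", "ALSO this", "ONE LAST", "J. one more"]
results = [(first_group(upper, s), first_group(lower, s)) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), '->', r)
```

Two things stand out in this sketch: "J. one more" yields "J" for header1 rather than an empty value, and because the second word is matched by a single optional group, each column captures at most two words.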

【Discussion】: