Python pandas: split a DataFrame column into two new columns and drop the original column (answers)

[Question title]: python pandas split dataframe column into two new columns and drop the original column
[Posted]: 2021-02-19 12:56:16
[Question]:

I have the following DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Steve Smith', 'Joe Nadal',
                            'Roger Federer'],
                  'birthdat/company': ['1995-01-26Sharp, Reed and Crane',
                                      '1955-08-14Price and Sons',
                                      '2000-06-28Pruitt, Bush and Mcguir']})

df[['data_time','full_company_name']] = df['birthdat/company'].str.split('[0-9]{4}-[0-9]{2}-[0-9]{2}', expand=True)
df

With my code, I get the following:

____|____Name______|__birthdat/company_______________|_birthdate_|____company___________
0   |Steve Smith   |1995-01-26Sharp, Reed and Crane  |           |Sharp, Reed and Crane
1   |Joe Nadal     |1955-08-14Price and Sons         |           |Price and Sons
2   |Roger Federer |2000-06-28Pruitt, Bush and Mcguir|           |Pruitt, Bush and Mcguir

What I want is for the part matching this regex ('[0-9]{4}-[0-9]{2}-[0-9]{2}') to go into one column and the rest to go into the column "full_company_name", like this:

____|____Name______|_birthdate_|____company_name_______
0   |Steve Smith   |1995-01-26 |Sharp, Reed and Crane
1   |Joe Nadal     |1955-08-14 |Price and Sons
2   |Roger Federer |2000-06-28 |Pruitt, Bush and Mcguir

Updated question: how do I handle a missing birthdate or company name, for example birthdate/company = "NaApple" or birthdate/company = "2003-01-15Na"? The missing-value markers are not limited to Na.

[Comments]:

Tags: python python-3.x regex pandas dataframe

[Solution 1]:

You can use

df[['data_time','full_company_name']] = df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)', expand=False)
>>> df
            Name  Age  ...   data_time        full_company_name
0    Steve Smith   32  ...  1995-01-26    Sharp, Reed and Crane
1      Joe Nadal   34  ...  1955-08-14           Price and Sons
2  Roger Federer   36  ...  2000-06-28  Pruitt, Bush and Mcguir

[3 rows x 5 columns]

Series.str.extract is used here because you need to get both parts without losing the date.
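Since the title also asks to drop the original column, here is a minimal end-to-end sketch of one way to do it. The variable names `parts` and `result` are illustrative, and the new column names are taken from the desired output in the question; none of this is part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Steve Smith', 'Joe Nadal', 'Roger Federer'],
    'birthdat/company': ['1995-01-26Sharp, Reed and Crane',
                         '1955-08-14Price and Sons',
                         '2000-06-28Pruitt, Bush and Mcguir'],
})

# Two capture groups produce a two-column frame (columns 0 and 1).
parts = df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)')
parts.columns = ['birthdate', 'company_name']

# Drop the combined column, then attach the two new columns by index.
result = df.drop(columns='birthdat/company').join(parts)
print(result)
```

Building a separate `parts` frame and joining it keeps the original `df` untouched, which can make it easier to inspect rows where the pattern did not match.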

The regex is

^ - start of string
([0-9]{4}-[0-9]{2}-[0-9]{2}) - your date pattern, captured into Group 1
(.*) - the rest of the string, captured into Group 2.

See the regex demo.
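For the missing values raised in the updated question, one possible approach (not part of the original answer) is to let the first position match either the date or a known placeholder token. The sketch below assumes the placeholders can be listed up front; the `placeholders` list, the `DATE` and `pattern` names, and the sample rows are hypothetical.

```python
import re
import pandas as pd

s = pd.Series(['1995-01-26Sharp, Reed and Crane', 'NaApple', '2003-01-15Na'])

DATE = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'
placeholders = ['Na', 'None']  # assumed markers for "missing"; extend as needed
alt = '|'.join(re.escape(p) for p in placeholders)

# Start of string is either a date (captured as group 1) or a placeholder
# (not captured); everything after it is captured as group 2.
pattern = '^(?:(' + DATE + ')|(?:' + alt + '))(.*)$'

parts = s.str.extract(pattern)
parts.columns = ['birthdate', 'company_name']

# A company part that is exactly a placeholder token is treated as missing.
parts['company_name'] = parts['company_name'].mask(
    parts['company_name'].isin(placeholders))
print(parts)
```

Rows that start with neither a date nor a listed placeholder do not match at all and come back as missing in both columns. A company name that itself begins with a placeholder token and has no date in front of it would be misread, so the list should be chosen with the real data in mind.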

[Comments]:

[Solution 2]:

split splits the string at the delimiters and discards them. I think you want extract with two capture groups:

df[['data_time','full_company_name']] = \
   df['birthdat/company'].str.extract('^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)')

Output:

    Name           birthdat/company                   data_time    full_company_name
--  -------------  ---------------------------------  -----------  -----------------------
 0  Steve Smith    1995-01-26Sharp, Reed and Crane    1995-01-26   Sharp, Reed and Crane
 1  Joe Nadal      1955-08-14Price and Sons           1955-08-14   Price and Sons
 2  Roger Federer  2000-06-28Pruitt, Bush and Mcguir  2000-06-28   Pruitt, Bush and Mcguir
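To see the difference between the two methods side by side, here is a small sketch on a single value from the question. The names `date_re`, `split_parts` and `extract_parts` are illustrative only.

```python
import pandas as pd

s = pd.Series(['1995-01-26Sharp, Reed and Crane'])
date_re = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'

# split treats the date as a delimiter: the date itself is removed, and the
# text before it (nothing here, since the date comes first) becomes the
# first piece.
split_parts = s.str.split(date_re, expand=True)

# extract returns what the capture groups matched, so the date is kept.
extract_parts = s.str.extract('^(' + date_re + ')(.*)')

print(split_parts.iloc[0].tolist())
print(extract_parts.iloc[0].tolist())
```

This is why the question's code produced an empty first column: the date was consumed as the separator rather than returned as data.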

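[Comments]:

Isn't this a duplicate of the other answer?
@RyszardCzech Not to nitpick, but according to the timestamps, my answer was posted at the same time as the other one. Why is mine automatically the duplicate?
Well, the timestamps are not the same; I see a slight difference.
@RyszardCzech OK, now I see the one minute. That is, the time difference is less than a minute, which is about the time it takes to copy/paste an answer. When I posted my answer, I certainly had not seen the other one, although it may already have been in the system. Not to mention that no edit/re-edit shows up after the answer was posted, and that my answer explicitly explains the split behavior. I don't think it is fair to downvote my answer for that.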