Split a numeric ID column into two using pandas

【Question title】: Split a numeric ID column into two using pandas
【Posted】: 2018-05-02 09:34:22
【Question description】:

              DateTime  Junction  Vehicles           ID
0  2015-11-01 00:00:00         1        15  20151101001
1  2015-11-01 01:00:00         1        13  20151101011
2  2015-11-01 02:00:00         1        10  20151101021
3  2015-11-01 03:00:00         1         7  20151101031
4  2015-11-01 04:00:00         1         9  20151101041
5  2015-11-01 05:00:00         1         6  20151101051
6  2015-11-01 06:00:00         1         9  20151101061
7  2015-11-01 07:00:00         1         8  20151101071
8  2015-11-01 08:00:00         1        11  20151101081
9  2015-11-01 09:00:00         1        12  20151101091

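I want to split the ID column into two separate columns, so that the first 4 digits are in one and the remaining digits are in the other.

The code I tried: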

new_ID = data.apply(lambda x: x.rsplit(4))

But it doesn't work. How can I do this with pandas?

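【Question comments】:

Also, this is a fairly simple question. It would help to show about 5-10 rows of data, along with the expected output and the code you tried that did not work.
Also, there is a pandas tag; use it. The other tags are not relevant to your question.
No, not as an image. As text.
@cᴏʟᴅsᴘᴇᴇᴅ I'm relatively new here and couldn't add a table. I tried .str.rsplit(), but to no avail.
Just run print(df.head(10)), paste the result here, highlight it, and press Ctrl+K.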

Tags: python pandas data-cleaning

【Solution 1】:

Option 1
The simplest and most direct approach: use the str accessor.

v = df.ID.astype(str)
df['Year'], df['ID'] = v.str[:4], v.str[4:]

df

              DateTime  Junction  Vehicles       ID  Year
0 2015-11-01  00:00:00         1        15  1101001  2015
1 2015-11-01  01:00:00         1        13  1101011  2015
2 2015-11-01  02:00:00         1        10  1101021  2015
3 2015-11-01  03:00:00         1         7  1101031  2015
4 2015-11-01  04:00:00         1         9  1101041  2015
5 2015-11-01  05:00:00         1         6  1101051  2015
6 2015-11-01  06:00:00         1         9  1101061  2015
7 2015-11-01  07:00:00         1         8  1101071  2015
8 2015-11-01  08:00:00         1        11  1101081  2015
9 2015-11-01  09:00:00         1        12  1101091  2015
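
For anyone who wants to run Option 1 end to end, here is a minimal sketch. The frame is rebuilt by hand from the first three rows of the question, and the column name `Rest` is an assumption made for illustration (the answer above overwrites `ID` instead). Slicing with `.str` produces strings, so the sketch also shows a cast back to integers for the year.

```python
import pandas as pd

# Sample data rebuilt from the first three rows of the question
df = pd.DataFrame({
    "Junction": [1, 1, 1],
    "Vehicles": [15, 13, 10],
    "ID": [20151101001, 20151101011, 20151101021],
})

# Cast to string once, then slice with the .str accessor
v = df["ID"].astype(str)
df["Year"] = v.str[:4]   # first four characters
df["Rest"] = v.str[4:]   # everything after them ("Rest" is a hypothetical name)

# Slicing yields strings; cast back if integers are needed
df["Year"] = df["Year"].astype(int)

print(df)
```

Leaving the remainder as a string keeps any leading zeros, which a cast with `astype(int)` would drop.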

Option 2
str.extract

v = df.ID.astype(str).str.extract(r'(?P<Year>\d{4})(?P<ID>.*)', expand=True)
df = pd.concat([df.drop(columns='ID'), v], axis=1)

df

              DateTime  Junction  Vehicles  Year       ID
0 2015-11-01  00:00:00         1        15  2015  1101001
1 2015-11-01  01:00:00         1        13  2015  1101011
2 2015-11-01  02:00:00         1        10  2015  1101021
3 2015-11-01  03:00:00         1         7  2015  1101031
4 2015-11-01  04:00:00         1         9  2015  1101041
5 2015-11-01  05:00:00         1         6  2015  1101051
6 2015-11-01  06:00:00         1         9  2015  1101061
7 2015-11-01  07:00:00         1         8  2015  1101071
8 2015-11-01  08:00:00         1        11  2015  1101081
9 2015-11-01  09:00:00         1        12  2015  1101091
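
A self-contained sketch of the `str.extract` route, in case it helps to see it in isolation. The sample IDs come from the question; the group name `Rest` and the variable names `pattern`, `parts` and `out` are assumptions for illustration. The pattern is written as a raw string so that `\d` reaches the regex engine unchanged.

```python
import pandas as pd

df = pd.DataFrame({"ID": [20151101001, 20151101011, 20151101021]})

# Two named groups: the first four digits, then whatever digits remain
pattern = r"(?P<Year>\d{4})(?P<Rest>\d*)"

# With named groups, str.extract returns a DataFrame whose columns
# carry the group names
parts = df["ID"].astype(str).str.extract(pattern, expand=True)

# Attach the extracted columns next to the original ID column
out = pd.concat([df, parts], axis=1)
print(out)
```

Rows that do not match the pattern (for example an ID with fewer than four digits) come back as NaN in both extracted columns, so this variant also doubles as a rough validity check.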

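【Comments】:

@chdeepak See option one and adapt it as needed.
I think the first solution is the best, +1

【Solution 2】:

Here is a numeric solution (assuming every value in the ID column has the same number of digits):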

In [10]: df['Year'], df['ID'] = df['ID'] // 10**7, df['ID'] % 10**7

In [11]: df
Out[11]:
              DateTime  Junction  Vehicles       ID  Year
0 2015-11-01  00:00:00         1        15  1101001  2015
1 2015-11-01  01:00:00         1        13  1101011  2015
2 2015-11-01  02:00:00         1        10  1101021  2015
3 2015-11-01  03:00:00         1         7  1101031  2015
4 2015-11-01  04:00:00         1         9  1101041  2015
5 2015-11-01  05:00:00         1         6  1101051  2015
6 2015-11-01  06:00:00         1         9  1101061  2015
7 2015-11-01  07:00:00         1         8  1101071  2015
8 2015-11-01  08:00:00         1        11  1101081  2015
9 2015-11-01  09:00:00         1        12  1101091  2015
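
Since the arithmetic split depends on the constant-length assumption, it may be worth checking that assumption before dividing. The following is only a sketch: the names `lengths`, `n_digits`, `divisor` and `Rest` are made up for illustration, and the divisor is derived from the data rather than hard-coded as `10**7`.

```python
import pandas as pd

df = pd.DataFrame({"ID": [20151101001, 20151101011, 20151101021]})

# The arithmetic split is only valid if every ID has the same digit count
lengths = df["ID"].astype(str).str.len()
assert lengths.nunique() == 1, "IDs have varying lengths"

n_digits = int(lengths.iloc[0])   # 11 for the sample data
divisor = 10 ** (n_digits - 4)    # 10**7 here: keeps only the first 4 digits

df["Year"] = df["ID"] // divisor  # leading four digits
df["Rest"] = df["ID"] % divisor   # remaining digits, still integers
print(df)
```

Both new columns stay integer-typed, which is the advantage of this approach over the string-based options.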

【Comments】:

Nice, this also keeps the values numeric.
@cᴏʟᴅsᴘᴇᴇᴅ, thanks! Your solution is more general!

【Solution 3】:

# id_col holds the name of the ID column, e.g. id_col = 'ID'
df[id_col].map(lambda x: int(str(x)[:4])) # first 4 digits as an integer
df[id_col].map(lambda x: str(x)[:4]) # first 4 digits as a string
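
As a sketch of how the `map` approach could be applied to the sample data, taking the first four digits as the question asks: `id_col = "ID"` and the new column names `Year` and `Rest` are assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"ID": [20151101001, 20151101011, 20151101021]})
id_col = "ID"  # assumed name of the ID column

# map applies a plain Python function to each element in turn
df["Year"] = df[id_col].map(lambda x: int(str(x)[:4]))  # first 4 digits, as an integer
df["Rest"] = df[id_col].map(lambda x: str(x)[4:])       # remaining digits, as a string
print(df)
```

This calls a Python function once per row, so on large frames the integer arithmetic from Solution 2 is likely to be faster.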

【Comments】: