【发布时间】:2021-11-29 23:56:08
【问题描述】:
我有一个多索引数据框
df
2021-06-28 2021-07-05
total_orders total_sales last_year_sales last_year_total_orders total_orders total_sales last_year_sales last_year_total_orders
group category
A Amazon 195.000 1,268.850 5,194.950 195.000 ... ... ... ...
Netflix 37.000 15,323.800 11,461.250 20.00
Apple 61.000 18,274.385 19,102.220 30.00
Facebook 106.000 19,722.650 22,796.420 50.00
Tesla 13.000 1,688.675 992.600 25.00
Uber 4.000 1,906.000 3,671.150 15.00
Google 61.000 55,547.255 30,853.115 10.00
total 477.000 113,731.615 94,071.705 56.00
B Amazon 50.000 3,219.650 6,022.300 400.00
Netflix 17.000 5,918.500 4,749.000 30.00
Apple 50.000 15,852.060 7,373.600 27.00
Facebook 75.000 17,743.700 33,514.000 15.00
Tesla 14.000 1,708.750 902.010 66.00
Uber 3.000 937.010 349.300 94.00
Google 43.000 37,795.150 35,943.450 65.00
total 252.000 83,174.820 88,853.660 61.00
我正在尝试创建一个year_to_date 列,其中sum 将是total_sales 的sum,从2021-06-28 开始直到今天。正如你在上面看到的,我每周有 4 列,从 2020-06-29 到本周 2021-10-11 有几个星期。因此,从2021-06-28 开始的每个星期,我都会有一个列来简单地保存从2021-06-28 开始的所有前几周的sales 的累积总和。
我希望结果看起来像这样(我删除了一些列以获得更好的视觉效果):
2021-06-28 2021-07-05 2021-07-12
total_orders total_sales total_orders total_sales year_to_date_sales total_orders total_sales year_to_date_sales
group category
A Amazon 195.000 1,268.850 ... 1000 1,268.850 + 1000 ... 5000 1,268.850 + 1000 + 5000
Netflix 37.000 15,323.800
Apple 61.000 18,274.385 ... 2000 15,323.800 + 2000 ... 6000 1,268.850 + 1000 + 6000
Facebook 106.000 19,722.650
Tesla 13.000 1,688.675
Uber 4.000 1,906.000
Google 61.000 55,547.255
total 477.000 113,731.615
B Amazon 50.000 3,219.650
Netflix 17.000 5,918.500
Apple 50.000 15,852.060
Facebook 75.000 17,743.700
Tesla 14.000 1,708.750
Uber 3.000 937.010
Google 43.000 37,795.150
total 252.000 83,174.820
我试过了:
# Adding year to date sales
s = df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
s = np.sum(s, axis = 1)
s = s.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
df = df.combine_first(s)
# I tried ['2021-06-28':'2021-10-11'] to select all columns from - to but it does not let me
我假设我会按行使用np.sum(s, axis = 1) 到sum,所以我会得到每个group 和category 的结果。但目前,我无法选择所有感兴趣的范围,我的方法会为每周创建相同的值,而不是从 2021-06-28 的每个下周创建具有累积总和的值。
我怎样才能达到这样的结果?
更新
使用后
df1 = (df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
由于某种原因,我没有添加新列:
df.loc[:,'2021-07-05'].columns
MultiIndex([('2021-07-05', 'total_orders'),
('2021-07-05', 'total_sales'),
('2021-07-05', 'last_year_sales'),
('2021-07-05', 'last_year_total_orders')]
names=['created_at', None])
当我检查 df1 只为 2021-06-28 保留一列时,我预计每周会有多列。最终df 中的新列仅存在于2021-06-28,但不存在于未来的其他日期。
df data
df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'A',
8: 'A',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'B',
15: 'B',
16: 'B',
17: 'B',
18: 'all',
19: 'all'},
('category', ''): {0: 'Amazon',
1: 'Apple',
2: 'Facebook',
3: 'Google',
4: 'Netflix',
5: 'Tesla',
6: 'Total',
7: 'Uber',
8: 'total',
9: 'Amazon',
10: 'Apple',
11: 'Facebook',
12: 'Google',
13: 'Netflix',
14: 'Tesla',
15: 'Total',
16: 'Uber',
17: 'total',
18: 'Total',
19: 'total'},
(pd.Timestamp('2020-06-29 00:00:00'), 'last_sales'): {0: 195.0,
1: 61.0,
2: 106.0,
3: 61.0,
4: 37.0,
5: 13.0,
6: 954.0,
7: 4.0,
8: 477.0,
9: 50.0,
10: 50.0,
11: 75.0,
12: 43.0,
13: 17.0,
14: 14.0,
15: 504.0,
16: 3.0,
17: 252.0,
18: 2916.0,
19: 2916.0},
(pd.Timestamp('2020-06-29 00:00:00'), 'total_sales'): {0: 1268.85,
1: 18274.385000000002,
2: 19722.65,
3: 55547.255,
4: 15323.800000000001,
5: 1688.6749999999997,
6: 227463.23,
7: 1906.0,
8: 113731.615,
9: 3219.6499999999996,
10: 15852.060000000001,
11: 17743.7,
12: 37795.15,
13: 5918.5,
14: 1708.75,
15: 166349.64,
16: 937.01,
17: 83174.82,
18: 787625.7400000001,
19: 787625.7400000001},
(pd.Timestamp('2020-06-29 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'last_sales'): {0: 26.0,
1: 39.0,
2: 79.0,
3: 49.0,
4: 10.0,
5: 10.0,
6: 436.0,
7: 5.0,
8: 218.0,
9: 89.0,
10: 34.0,
11: 133.0,
12: 66.0,
13: 21.0,
14: 20.0,
15: 732.0,
16: 3.0,
17: 366.0,
18: 2336.0,
19: 2336.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'total_sales'): {0: 3978.15,
1: 12138.96,
2: 19084.175,
3: 40033.46000000001,
4: 4280.15,
5: 1495.1,
6: 165548.29,
7: 1764.15,
8: 82774.145,
9: 8314.92,
10: 12776.649999999996,
11: 28048.075,
12: 55104.21000000002,
13: 6962.844999999999,
14: 3053.2000000000003,
15: 231049.11000000002,
16: 1264.655,
17: 115524.55500000001,
18: 793194.8000000002,
19: 793194.8000000002},
(pd.Timestamp('2020-07-06 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_sales'): {0: 96.0,
1: 56.0,
2: 106.0,
3: 44.0,
4: 34.0,
5: 13.0,
6: 716.0,
7: 9.0,
8: 358.0,
9: 101.0,
10: 22.0,
11: 120.0,
12: 40.0,
13: 13.0,
14: 8.0,
15: 610.0,
16: 1.0,
17: 305.0,
18: 2652.0,
19: 2652.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 5194.95,
1: 19102.219999999994,
2: 22796.420000000002,
3: 30853.115,
4: 11461.25,
5: 992.6,
6: 188143.41,
7: 3671.15,
8: 94071.705,
9: 6022.299999999998,
10: 7373.6,
11: 33514.0,
12: 35943.45,
13: 4749.000000000001,
14: 902.01,
15: 177707.32,
16: 349.3,
17: 88853.66,
18: 731701.46,
19: 731701.46},
(pd.Timestamp('2021-06-28 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'last_sales'): {0: 45.0,
1: 47.0,
2: 87.0,
3: 45.0,
4: 13.0,
5: 8.0,
6: 494.0,
7: 2.0,
8: 247.0,
9: 81.0,
10: 36.0,
11: 143.0,
12: 56.0,
13: 9.0,
14: 9.0,
15: 670.0,
16: 1.0,
17: 335.0,
18: 2328.0,
19: 2328.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'total_sales'): {0: 7556.414999999998,
1: 14985.05,
2: 16790.899999999998,
3: 36202.729999999996,
4: 4024.97,
5: 1034.45,
6: 163960.32999999996,
7: 1385.65,
8: 81980.16499999998,
9: 5600.544999999999,
10: 11209.92,
11: 32832.61,
12: 42137.44500000001,
13: 3885.1499999999996,
14: 1191.5,
15: 194912.34000000003,
16: 599.0,
17: 97456.17000000001,
18: 717745.3400000001,
19: 717745.3400000001},
(pd.Timestamp('2021-07-07 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0}}).set_index(['group','category'])
【问题讨论】:
-
请以可重复的形式提供您的数据
df.to_dict() -
对不起,忘记了,其实是故意的!
标签: python pandas multi-index