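[Posted]: 2018-05-01 17:15:19
[Question]:
I am working with a large Excel file that has four columns, but I only need two of them: DATE and HPCP. The goal of the program is to convert the dates to date objects, remove duplicate dates, and sum the HPCP values of the duplicates. I feel this code should work, but the output is very wrong. The code successfully converts the dates to date objects and removes the duplicates, but it does not sum correctly. Any help would be greatly appreciated.
Link to the Excel file: https://drive.google.com/open?id=1P5-k9Zyz8iFwx6Y-9yhnRozGGSvqpXLz
Some example rows from the Excel file: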
STATION STATION_NAME DATE HPCP
COOP:305801 NY CITY CENTRAL PARK NY US 20000101 01:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 15:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 16:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 17:00 0.03
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 18:00 0.04
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 19:00 0.12
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 20:00 0.17
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 21:00 0.13
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 22:00 0.04
COOP:305801 NY CITY CENTRAL PARK NY US 20000104 23:00 0.09
COOP:305801 NY CITY CENTRAL PARK NY US 20000105 00:00 0.07
COOP:305801 NY CITY CENTRAL PARK NY US 20000105 01:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000109 21:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000109 22:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000110 00:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000110 13:00 0.15
COOP:305801 NY CITY CENTRAL PARK NY US 20000110 14:00 0.29
COOP:305801 NY CITY CENTRAL PARK NY US 20000110 15:00 0.24
COOP:305801 NY CITY CENTRAL PARK NY US 20000110 16:00 0.15
COOP:305801 NY CITY CENTRAL PARK NY US 20000110 17:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000113 08:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000113 09:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000113 10:00 0.02
COOP:305801 NY CITY CENTRAL PARK NY US 20000113 15:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000113 16:00 0.01
COOP:305801 NY CITY CENTRAL PARK NY US 20000113 17:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000120 07:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000120 08:00 0
COOP:305801 NY CITY CENTRAL PARK NY US 20000120 09:00 0
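Code:
import sys
import pandas as pd

# Read the file named on the command line and keep only the two needed columns
data = pd.read_csv(sys.argv[1])
data = data[['DATE', 'HPCP']]
# Parse strings such as "20000101 01:00" into timestamps
data['DATE'] = pd.to_datetime(data['DATE'])
# Drop the time of day, row by row, so each DATE is a plain date object
for index, row in data.iterrows():
    print(index)
    data.loc[index, 'DATE'] = data.loc[index, 'DATE'].date()
# Sum HPCP over all rows that share a date
data = data.groupby(['DATE'], as_index=False).sum()
print(data)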
Output:
DATE HPCP
0 2000-01-01 11999.88
1 2000-01-03 0.00
2 2000-01-04 1002.97
3 2000-01-05 1.25
4 2000-01-09 1000.01
5 2000-01-10 4.72
6 2000-01-11 0.00
7 2000-01-13 0.17
8 2000-01-16 0.00
9 2000-01-20 1000.11
10 2000-01-21 0.12
... ...
2871 2013-12-17 0.66
2872 2013-12-21 0.01
2873 2013-12-22 0.04
2874 2013-12-23 2.06
2875 2013-12-24 0.00
2876 2013-12-26 0.00
2877 2013-12-29 4.90
2878 2013-12-30 0.00
2879 2013-12-31 0.00
2880 2014-01-01 3999.96
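Several of the sums above sit on or near multiples of 999.99 (11999.88 = 12 × 999.99, 3999.96 = 4 × 999.99, and 1002.97 = 999.99 + 2.98), which hints that the HPCP column may contain 999.99 as a placeholder for missing readings. That is an assumption; nothing in the post confirms it. A minimal sketch that masks such values before summing, using made-up rows and a hypothetical `MISSING` constant:

```python
import io
import pandas as pd

MISSING = 999.99  # assumed placeholder for "no reading"; not confirmed by the post

# Made-up rows standing in for the real file
SAMPLE = io.StringIO(
    "DATE,HPCP\n"
    "20000101 01:00,999.99\n"
    "20000101 02:00,999.99\n"
    "20000104 16:00,0.01\n"
    "20000104 17:00,0.03\n"
    "20000104 18:00,999.99\n"
)

data = pd.read_csv(SAMPLE)
# Parse the timestamps and keep only the calendar date
data["DATE"] = pd.to_datetime(data["DATE"], format="%Y%m%d %H:%M").dt.date

# Compare with a tolerance instead of exact float equality
is_missing = (data["HPCP"] - MISSING).abs() < 1e-6
n_missing = int(is_missing.sum())
print("rows treated as missing:", n_missing)

# Replace the placeholder with NaN; sum() skips NaN by default,
# so a day with only missing readings sums to 0.0
data["HPCP"] = data["HPCP"].mask(is_missing)
daily = data.groupby("DATE", as_index=False)["HPCP"].sum()
print(daily)
```

If a day with no valid readings should show up as NaN rather than 0.0, passing `min_count=1` to `sum()` is one option.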
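[Discussion]:
-
Why do you think it is wrong?
-
All of the HPCP values are very small (every one in the Excel file is less than 1). The values in the output after using the sum function are wrong.
-
Without your data I can neither confirm nor deny that (and I would certainly need more than one row). You only posted your code and your output, and both look correct to me. What are you asking me to do now?
-
Pick a date that you think is wrong (for example, 2014-01-01), filter just those rows into a list, and print the list of HPCP values along with their final sum. That will help you find the problem in either your code logic or your expectations.
-
I just uploaded more rows of data. If you look at the data I posted, you will see that the output is not what it should be.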
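The check suggested in the comments (pick one suspicious date, list its HPCP values, and sum them) could be sketched roughly as follows. The inline rows are a made-up stand-in for the real file, the large value is invented only to show how an outlier would stand out, and `rows_for_day` is an illustrative helper name rather than anything from the original code.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file; the 999.99 entry is hypothetical
SAMPLE = io.StringIO(
    "DATE,HPCP\n"
    "20000104 15:00,0\n"
    "20000104 16:00,0.01\n"
    "20000104 17:00,999.99\n"
    "20000105 00:00,0.07\n"
)

def rows_for_day(frame, day):
    """Return the HPCP values recorded on one calendar day (illustrative helper)."""
    on_day = frame["DATE"].dt.normalize() == pd.Timestamp(day)
    return frame.loc[on_day, "HPCP"]

data = pd.read_csv(SAMPLE)
data["DATE"] = pd.to_datetime(data["DATE"], format="%Y%m%d %H:%M")

values = rows_for_day(data, "2000-01-04")
print(values.tolist())  # every individual reading for that day
print(values.sum())     # the total that groupby would report for that day
```

Looking at the individual values, rather than only the total, makes it easier to tell whether the summing logic is at fault or whether the data contains entries that were not expected.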
Tags: python pandas date dataframe sum