【Question title】: Pandas resample not working - drop_duplicates also not working
【Posted】: 2021-07-14 12:55:12
【Question description】:

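I have a .csv with roughly 500k rows whose timestamps look like this: 2021-02-01 00:00:29.159 UTC

I want to resample the data every 300 ms.

I convert the "timestamp" column to datetime:

df.timestamp = pd.to_datetime(df.timestamp)

Now the values look like this: 2021-02-01 00:00:29.159000+00:00

Then I resample:

df = df.set_index(['timestamp']).resample("300ms").backfill()

and get the error:

ValueError: cannot reindex a non-unique index with a method or limit

I assume this means there are duplicate timestamps?

So I call drop_duplicates:

print(df.drop_duplicates(subset=['timestamp'], keep='first').duplicated().any())

and get:

False

That looks fine? I ran the resample again and got the same error. So I did a quick check on the duplicate removal:

duplicatedRows = df[df.duplicated((['timestamp']))]
print(duplicatedRows, sep=' ')

It prints 22 duplicated rows. When I inspect the result, none of them look like duplicates at all?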
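
A small sketch of why both checks above can mislead. The three-row frame and its "value" column are made up for illustration; the real CSV columns are not shown in the question. The points it tries to show: drop_duplicates returns a new frame and leaves df alone, duplicated() with no subset compares whole rows, and duplicated(subset=...) with the default keep='first' flags only the later copies, so each printed row appears without its twin.

```python
import pandas as pd

# Hypothetical data: the first two rows share a timestamp but differ in "value".
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-02-01 00:00:29.159", "2021-02-01 00:00:29.159", "2021-02-01 00:00:29.800"],
        utc=True,
    ),
    "value": [1.0, 2.0, 3.0],
})

# drop_duplicates returns a NEW frame; df itself still has all three rows.
deduped = df.drop_duplicates(subset=["timestamp"], keep="first")

# With no subset, duplicated() compares every column, so rows that share
# only a timestamp are not reported.
whole_row_dupes = df.duplicated().any()

# Restricting the comparison to the timestamp column finds the repeat.
ts_dupes = df.duplicated(subset=["timestamp"]).any()

# keep='first' (the default) marks only the later copies; keep=False marks
# every row that takes part in a repeat, which makes the pairs visible.
later_copies = df[df.duplicated(subset=["timestamp"])]
all_copies = df[df.duplicated(subset=["timestamp"], keep=False)]

print(len(df), len(deduped), whole_row_dupes, ts_dupes)
print(all_copies)
```

Printing with keep=False is probably the quickest way to confirm that the 22 rows really do share timestamps with other rows.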

So my question is: am I doing this right? And what is a better way to reach my goal of resampling data like this to 300 ms (one row every 300 ms)?
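
One possible way to get one row per 300 ms, sketched under assumptions: the sample rows and the "value" column are invented, and the " UTC" suffix is stripped before parsing only to keep the example independent of how a given pandas version handles that suffix. The idea is to keep the de-duplicated frame (here by chaining) so the index is unique before resample(...).bfill() reindexes it.

```python
import pandas as pd

# Hypothetical rows in the same timestamp format as the CSV.
df = pd.DataFrame({
    "timestamp": [
        "2021-02-01 00:00:29.159 UTC",
        "2021-02-01 00:00:29.159 UTC",  # repeated timestamp
        "2021-02-01 00:00:30.010 UTC",
    ],
    "value": [1.0, 2.0, 3.0],
})

# Parse as timezone-aware UTC datetimes.
df["timestamp"] = pd.to_datetime(
    df["timestamp"].str.replace(" UTC", "", regex=False), utc=True
)

resampled = (
    df.drop_duplicates(subset=["timestamp"], keep="first")  # unique timestamps
      .set_index("timestamp")
      .sort_index()
      .resample("300ms")
      .bfill()  # each 300 ms label takes the next available observation
)

print(resampled)
```

Which duplicate to keep (first, last, or some aggregate) depends on what the repeated rows mean in the data, so keep='first' here is only a placeholder choice.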

I'm an intermediate programmer but new to Python, so it is quite possibly something simple.

Cheers

    Tags: python pandas duplicates resampling


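    【Solution 1】:

    df.timestamp = pd.to_datetime(df.timestamp)  # this could not parse the values as times for me; I got NaT, so I converted to ISO format first

    from datetime import datetime
    import pandas as pd

    df=pd.DataFrame({'timestamp':['2021-02-01T00:00:29.159 UTC','2021-02-01T00:00:35.159 UTC']})
    # Rewrite the " UTC" suffix as "Z" so the strings are ISO 8601
    df['timestamp']=df['timestamp'].apply(lambda row: row.replace(' UTC','Z').replace(' ','T'))
    df['timestamp']=df['timestamp'].apply(lambda timestamp: datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S.%f%z'))
    df=df.set_index('timestamp')
    # resample() on its own returns a Resampler; unpacking it yields (bin label, sub-frame) pairs
    df = df.resample('300ms')
    print(*df)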
    

    Output:

    Name: timestamp, dtype: datetime64[ns, UTC]
    (Timestamp('2021-02-01 00:00:29.100000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [2021-02-01 00:00:29.159000+00:00])
    (Timestamp('2021-02-01 00:00:29.400000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:29.700000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:30+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:30.300000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:30.600000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:30.900000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:31.200000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:31.500000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:31.800000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:32.100000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:32.400000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:32.700000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:33+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:33.300000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:33.600000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:33.900000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:34.200000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:34.500000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:34.800000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [])
    (Timestamp('2021-02-01 00:00:35.100000+0000', tz='UTC', freq='300L'), Empty DataFrame Columns: [] Index: [2021-02-01 00:00:35.159000+00:00])
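
    The output above is the Resampler's bins, not a resampled frame: no aggregation or fill has been applied yet. A hedged sketch of one way to finish the step follows. The "value" column and the repeated first timestamp are assumptions added for illustration. It relies on aggregating each bin with last(), which, unlike the reindex-based bfill(), does not require a unique index.

```python
import pandas as pd

# Same two timestamps as the answer, plus a hypothetical repeat and a
# hypothetical "value" column so there is something to aggregate.
df = pd.DataFrame({
    "timestamp": [
        "2021-02-01T00:00:29.159 UTC",
        "2021-02-01T00:00:29.159 UTC",
        "2021-02-01T00:00:35.159 UTC",
    ],
    "value": [1.0, 2.0, 3.0],
})
df["timestamp"] = pd.to_datetime(
    df["timestamp"].str.replace(" UTC", "", regex=False), utc=True
)

# Aggregate each 300 ms bin; repeated timestamps simply fall into one bin.
per_bin = df.set_index("timestamp").resample("300ms").last()

# Bins with no observations are NaN; back-fill them from the next
# non-empty bin.
filled = per_bin.bfill()

print(filled)
```

    Whether last(), first(), mean() or another aggregate is appropriate depends on the data; last() is used here only as an example.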

    【Solution 2】:

    The call to drop_duplicates was missing inplace=True, which is why the duplicates were not removed.
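
    To make this concrete, here is a minimal sketch of the options, using a made-up three-row frame (the "value" column is an assumption). Either reassign the result or pass inplace=True; if the timestamps are already the index, filtering on index.duplicated does the same job.

```python
import pandas as pd

def make_frame():
    # Hypothetical data with one repeated timestamp.
    return pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2021-02-01 00:00:29.159", "2021-02-01 00:00:29.159", "2021-02-01 00:00:29.800"],
            utc=True,
        ),
        "value": [1.0, 2.0, 3.0],
    })

# Option A: reassign the returned frame.
df_a = make_frame()
df_a = df_a.drop_duplicates(subset=["timestamp"], keep="first")

# Option B: modify the frame in place (the call itself returns None).
df_b = make_frame()
df_b.drop_duplicates(subset=["timestamp"], keep="first", inplace=True)

# Option C: timestamps already moved into the index.
df_c = make_frame().set_index("timestamp")
df_c = df_c[~df_c.index.duplicated(keep="first")]

print(len(df_a), len(df_b), len(df_c))
```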
