【Question title】: Forward fill pandas dataframe without duplicating values in rows
【Posted】: 2019-03-14 14:29:30
【Question】:

I have the following dataframe, where every blank cell is np.nan.

         coupler_id   25       26         28        29
timestamp               
2015-12-05 03:02:29                     12017.0     12008.0
2015-12-05 03:04:47                     12017.0     12008.0
2015-12-05 03:09:14                     12017.0     12008.0
2015-12-05 03:12:12                     12017.0     12008.0
2015-12-05 03:23:06                                 12008.0
2015-12-05 03:24:45                                 12017.0
2015-12-05 06:31:20                     12017.0 
2015-12-05 09:36:29                     12011.0 
2015-12-05 23:59:35                                 12017.0
2015-12-06 23:59:38                                 12017.0
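
For anyone who wants to reproduce this, the frame above could be rebuilt roughly as follows. This is only a sketch: the integer column labels and the index/column names are assumptions read off the printed table, not something stated in the question.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Timestamps copied from the table above.
timestamps = pd.to_datetime([
    "2015-12-05 03:02:29",
    "2015-12-05 03:04:47",
    "2015-12-05 03:09:14",
    "2015-12-05 03:12:12",
    "2015-12-05 03:23:06",
    "2015-12-05 03:24:45",
    "2015-12-05 06:31:20",
    "2015-12-05 09:36:29",
    "2015-12-05 23:59:35",
    "2015-12-06 23:59:38",
])

# Assumption: column labels are integers; blanks in the table are NaN.
df = pd.DataFrame(
    {
        25: [nan] * 10,
        26: [nan] * 10,
        28: [12017.0, 12017.0, 12017.0, 12017.0, nan, nan, 12017.0, 12011.0, nan, nan],
        29: [12008.0, 12008.0, 12008.0, 12008.0, 12008.0, 12017.0, nan, nan, 12017.0, 12017.0],
    },
    index=pd.Index(timestamps, name="timestamp"),
)
df.columns.name = "coupler_id"

print(df)
```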

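I want to forward fill the missing values (limit 1) without duplicating values within a row. So the dataframe above should end up like this: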

         coupler_id   25       26         28        29
timestamp               
2015-12-05 03:02:29                     12017.0     12008.0
2015-12-05 03:04:47                     12017.0     12008.0
2015-12-05 03:09:14                     12017.0     12008.0
2015-12-05 03:12:12                     12017.0     12008.0
2015-12-05 03:23:06                     12017.0     12008.0
2015-12-05 03:24:45                                 12017.0
2015-12-05 06:31:20                     12017.0 
2015-12-05 09:36:29                     12011.0 
2015-12-05 23:59:35                     12011.0     12017.0
2015-12-06 23:59:38                                 12017.0

Edit:

What if columns 25 and 26 contain data, and column 28 has no preceding NaN at index 2015-12-05 03:24:45?

         coupler_id   25       26         28        29
timestamp               
2015-12-05 03:02:29                     12017.0     12008.0
2015-12-05 03:04:47                     12017.0     12008.0
2015-12-05 03:09:14                     12017.0     12008.0
2015-12-05 03:12:12                     12017.0     12008.0
2015-12-05 03:23:06   12007.0 12018.0               12008.0
2015-12-05 03:24:45   12033.0 12050.0   12025.0     12017.0
2015-12-05 06:31:20           12033.0   12017.0 
2015-12-05 09:36:29   12008.0           12011.0 
2015-12-05 23:59:35                                 12017.0
2015-12-06 23:59:38                                 12017.0

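【Comments】:

  • Just use the limit parameter of the fillna function. pandas.pydata.org/pandas-docs/stable/generated/…
  • That satisfies the limit-1 requirement, but it produces a duplicated 12017.0 in columns 28 and 29 at index 2015-12-05 06:31:20.
  • So if a fill would produce col28 = col29, that row must not be filled?
  • @Yuca Yes, and in addition, no forward fill should produce a case where col28 = any(col25, col26, col29).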
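
The objection raised in the comments can be checked directly. The sketch below applies a plain `ffill(limit=1)` to columns 28 and 29 of the sample data and lists the rows where both columns end up with the same value. The integer column labels are an assumption, and `dup_rows` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Columns 28 and 29 from the question (25 and 26 are all NaN there).
df = pd.DataFrame(
    {
        28: [12017.0, 12017.0, 12017.0, 12017.0, nan, nan, 12017.0, 12011.0, nan, nan],
        29: [12008.0, 12008.0, 12008.0, 12008.0, 12008.0, 12017.0, nan, nan, 12017.0, 12017.0],
    },
    index=pd.to_datetime([
        "2015-12-05 03:02:29", "2015-12-05 03:04:47", "2015-12-05 03:09:14",
        "2015-12-05 03:12:12", "2015-12-05 03:23:06", "2015-12-05 03:24:45",
        "2015-12-05 06:31:20", "2015-12-05 09:36:29", "2015-12-05 23:59:35",
        "2015-12-06 23:59:38",
    ]),
)

# Plain forward fill, at most one consecutive NaN per gap.
filled = df.ffill(limit=1)

# Rows where both columns hold the same non-NaN value after filling.
dup_rows = filled[filled[28].notna() & (filled[28] == filled[29])]
print(dup_rows)
```

With this data the only offending row is 2015-12-05 06:31:20, where the 12017.0 carried forward in column 29 collides with the original 12017.0 in column 28.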

Tags: python pandas dataframe


【Solution 1】:

Updated answer

Here is a more general version that checks all columns:

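import numpy as np

def remove_duplicates(data, ix, names):
    # with at most one entry in the row, no comparison is needed
    if data.notnull().sum() <= 1:
        return data
    # mark every non-NaN value that occurs more than once in the row
    dupes = data.dropna().duplicated(keep=False)
    if dupes.any():
        data = data.copy()  # do not modify the row handed in by apply
        for name in names:
            # if the ORIGINAL cell in df was NaN (so this value was filled in)
            # AND the filled value is a duplicate, reset it to NaN;
            # dupes only covers non-NaN cells, so .get avoids a KeyError
            if np.isnan(df.loc[ix, name]) and dupes.get(name, False):
                data[name] = np.nan
    return data

# df is the original frame; filled is its forward-filled copy
filled = df.ffill(limit=1)
result = filled.apply(lambda row: remove_duplicates(row, row.name, row.index), axis=1)
result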

                          25       26       28       29
2015-12-05 03:02:29      NaN      NaN  12017.0  12008.0
2015-12-05 03:04:47      NaN      NaN  12017.0  12008.0
2015-12-05 03:09:14      NaN      NaN  12017.0  12008.0
2015-12-05 03:12:12      NaN      NaN  12017.0  12008.0
2015-12-05 03:23:06  12007.0  12018.0  12017.0  12008.0
2015-12-05 03:24:45  12033.0  12050.0  12025.0  12017.0
2015-12-05 06:31:20      NaN  12033.0  12017.0      NaN
2015-12-05 09:36:29  12008.0  12033.0  12011.0      NaN
2015-12-05 23:59:35  12008.0      NaN  12011.0  12017.0
2015-12-06 23:59:38      NaN      NaN      NaN  12017.0
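
As an alternative to mutating rows inside the helper, the same rule can be expressed with boolean masks. This is not the answer's method, just a sketch of the same idea under the assumption that the column labels are integers; `was_filled`, `is_dup` and `result` are names introduced here. The data is the edited example from the question.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Edited example from the question (columns 25 and 26 now contain data).
df = pd.DataFrame(
    {
        25: [nan, nan, nan, nan, 12007.0, 12033.0, nan, 12008.0, nan, nan],
        26: [nan, nan, nan, nan, 12018.0, 12050.0, 12033.0, nan, nan, nan],
        28: [12017.0, 12017.0, 12017.0, 12017.0, nan, 12025.0, 12017.0, 12011.0, nan, nan],
        29: [12008.0, 12008.0, 12008.0, 12008.0, 12008.0, 12017.0, nan, nan, 12017.0, 12017.0],
    },
    index=pd.to_datetime([
        "2015-12-05 03:02:29", "2015-12-05 03:04:47", "2015-12-05 03:09:14",
        "2015-12-05 03:12:12", "2015-12-05 03:23:06", "2015-12-05 03:24:45",
        "2015-12-05 06:31:20", "2015-12-05 09:36:29", "2015-12-05 23:59:35",
        "2015-12-06 23:59:38",
    ]),
)

filled = df.ffill(limit=1)

# Cells that were NaN originally and received a value from the fill.
was_filled = df.isna() & filled.notna()

# Per row: non-NaN values that occur more than once.
# duplicated() treats NaNs as equal to each other, hence the notna() guard.
is_dup = filled.apply(
    lambda row: row.duplicated(keep=False) & row.notna(), axis=1
).astype(bool)

# Undo only those fills that introduced a duplicate within the row.
result = filled.mask(was_filled & is_dup)
print(result)
```

On this data the mask version gives the same table as shown above: at 2015-12-05 06:31:20 the filled values in columns 25 and 29 are reset to NaN, while the original 12033.0 and 12017.0 are kept.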

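Previous answer
You can use ffill(limit=1) and then check for duplicates: where the two columns are equal, look at which of them was NaN in the previous row and reset that one.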

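import numpy as np
import pandas as pd

def remove_duplicates(data, ix, names):
    # values differ (or one of them is NaN): nothing to undo
    if data.iloc[0] != data.iloc[1]:
        return data
    # the first row has no previous row to inspect (assumes a 0..n-1 integer index)
    if ix == 0:
        return data
    # the column that is NaN in the previous row holds an original value here,
    # so the other column is the one that was filled: reset it
    if np.isnan(filled.loc[ix - 1, names[0]]):
        return pd.Series([data.iloc[0], np.nan], index=names)
    elif np.isnan(filled.loc[ix - 1, names[1]]):
        return pd.Series([np.nan, data.iloc[1]], index=names)
    return data

filled = df[["28","29"]].ffill(limit=1)

df[["28","29"]] = filled.apply(
    lambda row: remove_duplicates(row, row.name, row.index), axis=1
)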

df
            coupler_id  25  26       28       29
0  2015-12-05 03:02:29 NaN NaN  12017.0  12008.0
1  2015-12-05 03:04:47 NaN NaN  12017.0  12008.0
2  2015-12-05 03:09:14 NaN NaN  12017.0  12008.0
3  2015-12-05 03:12:12 NaN NaN  12017.0  12008.0
4  2015-12-05 03:23:06 NaN NaN  12017.0  12008.0
5  2015-12-05 03:24:45 NaN NaN      NaN  12017.0
6  2015-12-05 06:31:20 NaN NaN  12017.0      NaN
7  2015-12-05 09:36:29 NaN NaN  12011.0      NaN
8  2015-12-05 23:59:35 NaN NaN  12011.0  12017.0
9  2015-12-06 23:59:38 NaN NaN      NaN  12017.0

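【Comments】:

  • Almost there. What if column 28 has no preceding NaN at index 2015-12-05 03:24:45 (index 5 in your example)? See the edit to the original post. Also, once I get further into the dataframe and columns 25 and 26 contain values, the code no longer works.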