【Question title】: Display the 5 largest values of a DataFrame for each month
【Posted】: 2020-09-12 04:39:33
【Question】:

I am working with a DataFrame that has many columns (505), and I want to select only the 5 largest values for each month.

A sample is shown below:

  Dates         1        2       3           4       5     6
2002-07-31  -31.710916  NaN  -5.208684  -29.773404  NaN -7.308558   
2002-08-31  -44.941351  NaN   3.665286  -23.987135  NaN 3.134669    
2002-09-30  -36.725548  NaN   4.114474  -19.536571  NaN -0.986986   
2002-10-31  -25.377286  NaN  -0.486158  -5.887594   NaN -0.787117   
2002-11-30  19.766328   NaN  -5.298877  -10.672174  NaN -21.057946  
2002-12-31  1.996514    NaN  -7.570497  -9.257122   NaN -19.630112  
2003-01-31  -0.366083   NaN -14.124492  -5.434475   NaN -8.053424   
2003-02-28  -17.869297  NaN -20.075997  1.009837    NaN -11.616974  

How can I do this? I have already tried df.max(axis=1), but in addition to the maximum I also want the 4 next-largest values. Thanks for your help.

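【Comments】:

  • Please post a sample of the DataFrame that can be copied, not an image.
  • Sorry, you can now find a sample of my DataFrame.
  • Your question is still not very clear. It seems that one month = one row. Is that right? For each row, you want to extract the five largest values from the 505 columns? Is that correct? What is the expected output for the data table you provided in the question?

Tags: python pandas dataframe time-series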


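【Solution 1】:

I am assuming you want the 5 largest columns for each row, since that is how I read your question. The code below picks the 2 largest values per row on the sample input, because the sample only has 4 non-NaN columns.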

import io
import re
import pandas as pd


# First read in the data you supplied. 
data=io.StringIO(re.sub(" +","\t",
"""Dates         1        2       3           4       5     6
2002-07-31  -31.710916  NaN  -5.208684  -29.773404  NaN -7.308558
2002-08-31  -44.941351  NaN   3.665286  -23.987135  NaN 3.134669
2002-09-30  -36.725548  NaN   4.114474  -19.536571  NaN -0.986986
2002-10-31  -25.377286  NaN  -0.486158  -5.887594   NaN -0.787117
2002-11-30  19.766328   NaN  -5.298877  -10.672174  NaN -21.057946
2002-12-31  1.996514    NaN  -7.570497  -9.257122   NaN -19.630112
2003-01-31  -0.366083   NaN -14.124492  -5.434475   NaN -8.053424
2003-02-28  -17.869297  NaN -20.075997  1.009837    NaN -11.616974"""))
df = pd.read_csv(data,sep="\t")

# Then we preprocess the data, so it is in a long format instead of a wide
df = df.melt(id_vars='Dates',var_name='Column_name',value_name='Value')

# Finally extract the top 2 values for each date, but first set the index so the output knows what column the input came from
print(df.set_index('Column_name').groupby('Dates')['Value'].apply(lambda grp: grp.nlargest(2)))

The output is:

Dates       Column_name
2002-07-31  3              -5.208684
            6              -7.308558
2002-08-31  3               3.665286
            6               3.134669
2002-09-30  3               4.114474
            6              -0.986986
2002-10-31  3              -0.486158
            6              -0.787117
2002-11-30  1              19.766328
            3              -5.298877
2002-12-31  1               1.996514
            3              -7.570497
2003-01-31  1              -0.366083
            4              -5.434475
2003-02-28  4               1.009837
            6             -11.616974
Name: Value, dtype: float64

It is hard to give a more fitting answer unless you state more explicitly what output you want.
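If a flat table is easier to work with than the MultiIndex Series above, one possible variant is to sort the melted frame and keep the first n rows of each date. This is only a sketch: the helper name top_n_per_row and the small three-row frame are made up for illustration, and the sort-then-head step is swapped in for the nlargest call used in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame shaped like the question's sample (Dates is a column).
df = pd.DataFrame({
    "Dates": ["2002-07-31", "2002-08-31", "2002-11-30"],
    "1": [-31.710916, -44.941351, 19.766328],
    "2": [np.nan, np.nan, np.nan],
    "3": [-5.208684, 3.665286, -5.298877],
    "4": [-29.773404, -23.987135, -10.672174],
    "5": [np.nan, np.nan, np.nan],
    "6": [-7.308558, 3.134669, -21.057946],
})


def top_n_per_row(frame, n, id_col="Dates"):
    """Return the n largest non-NaN values of each row, in long format."""
    long = frame.melt(id_vars=id_col, var_name="Column_name", value_name="Value")
    # Drop NaN cells so they can never be selected.
    long = long.dropna(subset=["Value"])
    # Within each date, put the largest values first.
    long = long.sort_values([id_col, "Value"], ascending=[True, False])
    # head(n) keeps the first n rows of every group, i.e. the n largest.
    return long.groupby(id_col).head(n).reset_index(drop=True)


top2 = top_n_per_row(df, 2)
print(top2)
```

With the real 505-column data, calling top_n_per_row(df, 5) would give up to five rows per date, each carrying the originating column name.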

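【Comments】:

  • One last question: I did not know the melt function. In my case "Dates" is the index, so if I use id_vars='Dates' I get KeyError: 'Dates'. Is it possible to make your approach work with an index? Thanks again.
  • Use reset_index and Dates becomes a column of df. melt also has a lot of options you can play with. In general the examples in the documentation are good, so have a look at them!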
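To make the reset_index suggestion concrete, here is a minimal sketch. The two-row frame is hypothetical, and it assumes the index is actually named "Dates"; an unnamed index would come out of reset_index as a column called "index" instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the dates are the index rather than a column.
df = pd.DataFrame(
    {
        "1": [-31.710916, -44.941351],
        "2": [np.nan, np.nan],
        "3": [-5.208684, 3.665286],
    },
    index=pd.Index(["2002-07-31", "2002-08-31"], name="Dates"),
)

# reset_index() turns the named index into an ordinary "Dates" column,
# so melt(id_vars="Dates") no longer raises a KeyError.
long_df = df.reset_index().melt(
    id_vars="Dates", var_name="Column_name", value_name="Value"
)
print(long_df)
```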
【Solution 2】:

From reading the method's docstring, perhaps you are looking for the nlargest method.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html

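【Comments】:

  • Thanks for your help. The problem is that the nlargest method looks for values per column, and I am trying to find values per row (not per column) for each month.
  • So basically what I want is 5 values per row.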
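For the "5 values per row" reading, one way to avoid a per-column method is to sort each row with NumPy. This is a sketch, not part of either answer: the helper name row_top_n, the top_1 ... top_n column labels and the sample frame are all assumptions. It relies on np.sort placing NaN last, so rows with fewer than n valid numbers are padded with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the dates as the index, as in the asker's real data.
df = pd.DataFrame(
    {
        "1": [-31.710916, -44.941351, 19.766328],
        "2": [np.nan, np.nan, np.nan],
        "3": [-5.208684, 3.665286, -5.298877],
        "4": [-29.773404, -23.987135, -10.672174],
        "5": [np.nan, np.nan, np.nan],
        "6": [-7.308558, 3.134669, -21.057946],
    },
    index=pd.Index(["2002-07-31", "2002-08-31", "2002-11-30"], name="Dates"),
)


def row_top_n(frame, n):
    """Return the n largest values of every row as columns top_1 .. top_n."""
    values = frame.to_numpy(dtype=float)
    # Sorting the negated values ascending puts the largest originals first;
    # NaN stays NaN under negation and np.sort moves it to the end of each row.
    ordered = -np.sort(-values, axis=1)
    top = ordered[:, :n]
    columns = [f"top_{i + 1}" for i in range(top.shape[1])]
    return pd.DataFrame(top, index=frame.index, columns=columns)


top5 = row_top_n(df, 5)
print(top5)
```

This keeps only the values. If the originating column names matter, the long-format melt approach from Solution 1 is the better fit.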
【Solution 3】:

You can try this:

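# Parse the dates so the rows can be grouped by calendar month
df['Dates'] = pd.to_datetime(df['Dates'])
# Within each month, sort the rows in descending order by columns '1'..'6' in turn
grouped = df.groupby(pd.Grouper(key='Dates', freq='1M'))
df2 = grouped.apply(lambda x: x.sort_values(['1', '2', '3', '4', '5', '6'], ascending=False))
df3 = df2.reset_index(drop=True)
# Keep the first 5 rows of each month
print(df3.groupby(pd.Grouper(key='Dates', freq='1M')).head(5))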

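Output (rows 8 to 11 are not in the question's sample, so the input used here apparently had extra rows added for 2003-02-28):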

        Dates          1   2          3          4   5          6
0  2002-07-31 -31.710916 NaN  -5.208684 -29.773404 NaN  -7.308558
1  2002-08-31 -44.941351 NaN   3.665286 -23.987135 NaN   3.134669
2  2002-09-30 -36.725548 NaN   4.114474 -19.536571 NaN  -0.986986
3  2002-10-31 -25.377286 NaN  -0.486158  -5.887594 NaN  -0.787117
4  2002-11-30  19.766328 NaN  -5.298877 -10.672174 NaN -21.057946
5  2002-12-31   1.996514 NaN  -7.570497  -9.257122 NaN -19.630112
6  2003-01-31  -0.366083 NaN -14.124492  -5.434475 NaN  -8.053424
7  2003-02-28 -17.869297 NaN -20.075997   1.009837 NaN -11.616974
8  2003-02-28 -18.869297 NaN -20.075997   1.009837 NaN -11.616974
9  2003-02-28 -19.869297 NaN -20.075997   1.009837 NaN -11.616974
10 2003-02-28 -20.869297 NaN -20.075997   1.009837 NaN -11.616974
11 2003-02-28 -21.869297 NaN -20.075997   1.009837 NaN -11.616974
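If a month can hold several rows and the goal is the largest individual values pooled over the whole month, rather than whole rows, one possible sketch is to melt first and then group by a monthly period. The frame below, the "Month" helper column and n = 2 are made up for illustration; this is an alternative reading of the question, not what the snippet above does.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two rows falling in February 2003.
df = pd.DataFrame({
    "Dates": ["2003-01-31", "2003-02-14", "2003-02-28"],
    "1": [-0.366083, -17.869297, -18.869297],
    "2": [np.nan, np.nan, np.nan],
    "4": [-5.434475, 1.009837, 2.5],
})
df["Dates"] = pd.to_datetime(df["Dates"])

# Long format: one row per (date, column) cell, without the NaN cells.
long = df.melt(id_vars="Dates", var_name="Column_name", value_name="Value")
long = long.dropna(subset=["Value"])

# Label every cell with its calendar month, then keep the n largest per month.
long["Month"] = long["Dates"].dt.to_period("M")
long = long.sort_values(["Month", "Value"], ascending=[True, False])
top = long.groupby("Month").head(2).reset_index(drop=True)
print(top)
```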

【Comments】:
