Flag rows based on criteria: answers

【Question title】: Flag rows based on criteria
【Posted】: 2015-09-10 07:54:41
【Question description】:

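I have a dataframe that I loop through one day at a time, determining which of that day's items qualify based on specific criteria. I then need to flag the items that qualify. The dataframe: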

        date           abc    xyz    rth
index
apple   2015-01-27     23     5712   713  
        2015-01-28     234    1357   9541
        2015-01-29     489    185    278
        2015-01-30     154    951    754
pear    2015-01-27     4786   7531   4751
        2015-01-28     476    367    45
        2015-01-29     15     37     783
        2015-01-30     489    185    421
grape   2015-01-27     2513   57     513
        2015-01-28     237    587    733
        2015-01-29     7869   472    759
        2015-01-30     489    185    278

For example, for each date I need to flag every item that meets the following criteria:

abc > 50
xyz > 700
rth = once I have a shortlist based on the criteria above, select from that shortlist the single item with the largest rth value

The output for the criteria above would be:

        date           abc    xyz    rth    meets_criteria
index
apple   2015-01-27     23     5712   713  
        2015-01-28     234    1357   9541   True
        2015-01-29     489    185    278
        2015-01-30     154    951    754    True
pear    2015-01-27     4786   7531   4751   True
        2015-01-28     476    367    45
        2015-01-29     15     37     783
        2015-01-30     489    185    421
grape   2015-01-27     2513   57     513
        2015-01-28     237    587    733
        2015-01-29     7869   472    759
        2015-01-30     489    185    278

As you can see, on each of the 27th, 28th, and 30th exactly one item meets the criteria. On the 29th no item qualifies.

So far, to be able to evaluate each day separately, I have done the following:

unique_dates = df['date'].unique()

for i in range(0, len(unique_dates)):
    today_df = df.loc[df['date'] == unique_dates[i]]

    today_df = today_df.loc[today_df['abc'] > 50]
    today_df = today_df.loc[today_df['xyz'] > 700]

    today_df = today_df.sort_values('rth')  # DataFrame.sort() was removed in pandas 0.20
    today_df = today_df.tail(1)

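This gives me each day's qualifying item (if there is one). My problem is that I don't know how to take the qualifying item from today_df and flag it on the correct row in the original dataframe.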
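One possible way to write the loop's result back to the original frame, sketched under a few assumptions. The fruit labels in the index repeat across dates, so the sketch first moves them into a column with reset_index() to give every row a unique integer label. The names flat, top_n, and result are introduced here for illustration, the sample data is retyped from the question, and non-qualifying rows are set to False rather than left blank.

```python
import pandas as pd

# Sample data retyped from the question
rows = [
    ("apple", "2015-01-27", 23, 5712, 713),
    ("apple", "2015-01-28", 234, 1357, 9541),
    ("apple", "2015-01-29", 489, 185, 278),
    ("apple", "2015-01-30", 154, 951, 754),
    ("pear", "2015-01-27", 4786, 7531, 4751),
    ("pear", "2015-01-28", 476, 367, 45),
    ("pear", "2015-01-29", 15, 37, 783),
    ("pear", "2015-01-30", 489, 185, 421),
    ("grape", "2015-01-27", 2513, 57, 513),
    ("grape", "2015-01-28", 237, 587, 733),
    ("grape", "2015-01-29", 7869, 472, 759),
    ("grape", "2015-01-30", 489, 185, 278),
]
df = pd.DataFrame(rows, columns=["index", "date", "abc", "xyz", "rth"]).set_index("index")

# The fruit index repeats across dates, so give every row a unique integer label
flat = df.reset_index()
flat["meets_criteria"] = False

top_n = 1  # illustrative parameter: how many rows to flag per day

for day in flat["date"].unique():
    today_df = flat.loc[flat["date"] == day]
    today_df = today_df.loc[(today_df["abc"] > 50) & (today_df["xyz"] > 700)]
    today_df = today_df.sort_values("rth").tail(top_n)
    # today_df keeps the row labels of `flat`, so they identify the rows to flag;
    # an empty today_df (no qualifying item that day) flags nothing
    flat.loc[today_df.index, "meets_criteria"] = True

# Restore the fruit labels as the index
result = flat.set_index("index")
```

Setting top_n to 2 would flag up to two qualifying rows per day, which corresponds to the today_df.tail(2) case.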

【Comments on the question】:

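Does df.loc[(df['abc']> 50) & (df['xyz']> 700), 'rth'].max(level='date') do what you want?
Thanks, but I get an error: ValueError: level name date is not the name of the index. Also, these criteria are only an example. There may be cases where I need the top X rows by 'rth', not just the single maximum. For instance, the last line in my example could be today_df = today_df.tail(2).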
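The error reported above is consistent with 'date' being an ordinary column rather than an index level in the question's frame. Below is a minimal sketch of the same idea that groups on the column instead. It assumes the sample data from the question (retyped here); flat, mask, qualifying, max_rth, and winners are illustrative names.

```python
import pandas as pd

# Sample data retyped from the question
rows = [
    ("apple", "2015-01-27", 23, 5712, 713),
    ("apple", "2015-01-28", 234, 1357, 9541),
    ("apple", "2015-01-29", 489, 185, 278),
    ("apple", "2015-01-30", 154, 951, 754),
    ("pear", "2015-01-27", 4786, 7531, 4751),
    ("pear", "2015-01-28", 476, 367, 45),
    ("pear", "2015-01-29", 15, 37, 783),
    ("pear", "2015-01-30", 489, 185, 421),
    ("grape", "2015-01-27", 2513, 57, 513),
    ("grape", "2015-01-28", 237, 587, 733),
    ("grape", "2015-01-29", 7869, 472, 759),
    ("grape", "2015-01-30", 489, 185, 278),
]
df = pd.DataFrame(rows, columns=["index", "date", "abc", "xyz", "rth"]).set_index("index")

# Unique integer row labels make it possible to point at individual rows
flat = df.reset_index()
mask = (flat["abc"] > 50) & (flat["xyz"] > 700)
qualifying = flat.loc[mask]

# Largest 'rth' per date among the qualifying rows; 'date' is grouped as a column
max_rth = qualifying.groupby("date")["rth"].max()

# Row labels of those maxima, used to flag the matching rows
winners = qualifying.groupby("date")["rth"].idxmax()
flat["meets_criteria"] = flat.index.isin(winners)

result = flat.set_index("index")
```

Dates with no qualifying row (the 29th in the sample) simply do not appear in max_rth or winners, so nothing is flagged for them.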

Tags: python pandas dataframe

【Solution 1】:

import numpy as np
import pandas as pd

def meets_criteria(group, top_n=1):
    # top_n is an illustrative parameter (default 1 = only the single largest 'rth')
    # Check each row against every condition and collect the results in one frame
    criteria_df = pd.concat([group['abc'] > 50, group['xyz'] > 700], axis=1)
    # A row qualifies only if all conditions hold
    qualifies = np.all(criteria_df.to_numpy(), axis=1)
    # `group` arrives sorted by 'rth' descending, so the first top_n qualifying
    # positions hold the largest 'rth' values; nothing is flagged if none qualify
    flags = np.zeros(len(group), dtype=bool)
    flags[np.flatnonzero(qualifies)[:top_n]] = True
    group = group.copy()
    group['meets_criteria'] = flags
    return group

# sort_values replaces the old sort_index(by=...) call
sorted_df = df.sort_values('rth', ascending=False)

# Flag each date separately, then stitch the groups back together
flagged = pd.concat([meets_criteria(group) for _, group in sorted_df.groupby('date')])

【Discussion】:

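Could you post an explanation of your code? Code-only answers are counterproductive.
Thanks, Nader. This works for the first two criteria, but I don't understand how you apply the final 'rth' criterion, i.e. selecting the largest 'rth' value, or the top X 'rth' values. Your code seems to filter only on the 'abc' and 'xyz' conditions?
So you want to select only one value, the maximum of rth? In this case that should be 9541?
No, each day is evaluated separately. 'abc' and 'xyz' are evaluated first. Then, within those results (for a given day), I need the row with the largest 'rth' value. However, this is only an example. In some cases I need the top X 'rth' values for the day, e.g. the top 2 or top 3. That is why I used today_df.tail(1) in the example in my original question, so that I am not limited to selecting only the maximum. I hope that makes sense in the original question?
Sorry for the late reply. I think I understand what you mean now. If this is the right answer for you, let me know so I can explain in detail how it works.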
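For the "top X per day" variant discussed above, one option is to rank the qualifying rows within each date and flag those whose rank is at most X. This is a sketch rather than the answerer's method: the function name flag_top_n and its parameter n are hypothetical, the thresholds are hard-coded from the question's example, and the sample data is retyped from the question.

```python
import pandas as pd

# Sample data retyped from the question
rows = [
    ("apple", "2015-01-27", 23, 5712, 713),
    ("apple", "2015-01-28", 234, 1357, 9541),
    ("apple", "2015-01-29", 489, 185, 278),
    ("apple", "2015-01-30", 154, 951, 754),
    ("pear", "2015-01-27", 4786, 7531, 4751),
    ("pear", "2015-01-28", 476, 367, 45),
    ("pear", "2015-01-29", 15, 37, 783),
    ("pear", "2015-01-30", 489, 185, 421),
    ("grape", "2015-01-27", 2513, 57, 513),
    ("grape", "2015-01-28", 237, 587, 733),
    ("grape", "2015-01-29", 7869, 472, 759),
    ("grape", "2015-01-30", 489, 185, 278),
]
df = pd.DataFrame(rows, columns=["index", "date", "abc", "xyz", "rth"]).set_index("index")


def flag_top_n(frame, n=1):
    """Flag, per date, up to n qualifying rows with the largest 'rth' (hypothetical helper)."""
    # Unique integer labels are needed because the fruit index repeats across dates
    flat = frame.reset_index()
    mask = (flat["abc"] > 50) & (flat["xyz"] > 700)
    # Rank only the qualifying rows within each date; rank 1 is the largest 'rth'
    rank = flat.loc[mask].groupby("date")["rth"].rank(method="first", ascending=False)
    # Rows that failed the filter become NaN after reindexing, and NaN <= n is False
    flat["meets_criteria"] = rank.reindex(flat.index) <= n
    return flat.set_index("index")


result = flag_top_n(df, n=1)
```

Using method="first" breaks ties by row order, so exactly n rows are flagged when at least n rows qualify on a given date.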