【Question title】: Apply filtering on a DataFrame based on input from another sheet
【Posted】: 2020-05-30 06:40:48
【Question】:

I am trying to filter a main dataset (a pandas DataFrame) by applying filters that are supplied as input in another spreadsheet.

Main dataset:

+---------+--------+-----+-----------+-------------+-------+-------------+-------------+----------+
| Cust Id | gender | Age | Indicator | X Indicator | State | foreign_ind | Eu Resident | address1 |
+---------+--------+-----+-----------+-------------+-------+-------------+-------------+----------+
|  987685 | M      |  65 | Y         | N           | TX    | N           | N           | XYZ,USA  |
|  987686 | F      |  54 | Y         | N           | NJ    | N           | N           | XYZ,USA  |
|  987687 | M      |  75 | Y         | Y           | NJ    | N           | N           | XYZ,USA  |
|  987688 | M      |  45 | N         | Y           | NY    | N           | N           | XYZ,USA  |
|  987689 | F      |  45 | Y         | Y           | NJ    | N           | N           | XYZ,USA  |
+---------+--------+-----+-----------+-------------+-------+-------------+-------------+----------+

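Below is the configuration list. We receive it from end users as a spreadsheet, and its conditions should be applied to the main dataset.

Condition input from the other spreadsheet: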

+-------------+-----------+--------+------------------------------+---------+-----------+--------+
|   column1   | operator1 | value1 | Logical Condition(And or OR) | column2 | operator2 | value2 |
+-------------+-----------+--------+------------------------------+---------+-----------+--------+
| gender      | ==        | F      |                              |         |           |        |
| gender      | ==        | M      |                              |         |           |        |
| Age         | >=        | 75     | ||                           | Age     | >=        |     45 |
| Indicator   | ==        | Y      |                              |         |           |        |
| X Indicator | ==        | Y      |                              |         |           |        |
| State       | ==        | NJ     |                              |         |           |        |
+-------------+-----------+--------+------------------------------+---------+-----------+--------+

Expected output DataFrame after applying the filters:

+---------+--------+-----+-----------+-------------+-------+-------------+-------------+----------+
| Cust Id | gender | Age | Indicator | X Indicator | State | foreign_ind | Eu Resident | address1 |
+---------+--------+-----+-----------+-------------+-------+-------------+-------------+----------+
|  987687 | M      |  75 | Y         | Y           | NJ    | N           | N           | XYZ,USA  |
|  987689 | F      |  45 | Y         | Y           | NJ    | N           | N           | XYZ,USA  |
+---------+--------+-----+-----------+-------------+-------+-------------+-------------+----------+

【Comments】:

  • Is the input spreadsheet always in the same format?

Tags: python pandas dataframe filtering


【Solution 1】:

I came up with a solution, but you have to change the input spreadsheet slightly. It should look like the following table:

         Label Operator Condition Value
0       gender                 ==     F
1       gender        |        ==     M
2          Age        &        >=    75
3          Age        |        >=    45
4    Indicator        &        ==     Y
5  X Indicator        &        ==     Y
6        State        &        ==    NJ

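The table lists the filter conditions, one per row. Label is the column to filter on, Operator is the logical operator (| or &) that joins the row to the one before it, Condition is the comparison, and Value is the value to compare against. For the code to work, conditions that apply to the same label must be written in consecutive rows; in your example, that applies to gender and Age.

Your table: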
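If the users keep sending the original wide sheet, the reshaping into the four-column layout above could be automated. The following is only a rough sketch under several assumptions: `to_long` is a hypothetical helper, `logic` is a shortened stand-in for the `Logical Condition(And or OR)` column, blank cells are empty strings (a real spreadsheet read would give NaN), a row that repeats the previous label is joined with `|`, and every other row is joined with `&`.

```python
import pandas as pd

# Wide layout from the question; "logic" stands in for the
# "Logical Condition(And or OR)" column (name shortened for this sketch).
wide = pd.DataFrame(
    [
        ["gender", "==", "F", "", "", "", ""],
        ["gender", "==", "M", "", "", "", ""],
        ["Age", ">=", 75, "||", "Age", ">=", 45],
        ["Indicator", "==", "Y", "", "", "", ""],
        ["X Indicator", "==", "Y", "", "", "", ""],
        ["State", "==", "NJ", "", "", "", ""],
    ],
    columns=["column1", "operator1", "value1", "logic",
             "column2", "operator2", "value2"],
)


def to_long(wide):
    """Hypothetical helper: flatten the wide sheet into Label/Operator/Condition/Value rows."""
    symbol = {"||": "|", "&&": "&"}  # spreadsheet symbols -> pandas operators
    rows = []
    for rec in wide.to_dict("records"):
        # Assumption: a row repeating the previous label is OR-ed with it,
        # any other row is AND-ed with what came before.
        if not rows:
            joiner = ""
        elif rows[-1]["Label"] == rec["column1"]:
            joiner = "|"
        else:
            joiner = "&"
        rows.append({"Label": rec["column1"], "Operator": joiner,
                     "Condition": rec["operator1"], "Value": rec["value1"]})
        if rec["column2"]:
            # The optional second condition uses the row's own logical symbol.
            rows.append({"Label": rec["column2"], "Operator": symbol[rec["logic"]],
                         "Condition": rec["operator2"], "Value": rec["value2"]})
    return pd.DataFrame(rows, columns=["Label", "Operator", "Condition", "Value"])


conditions = to_long(wide)
print(conditions)
```

The join rules are a guess at what the sheet is meant to express, so they should be checked against how the end users actually read their own rows.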

        gender  Age Indicator X Indicator State foreign_ind Eu Resident address1
Cust Id
987685       M   65         Y           N    TX           N           N  XYZ,USA
987686       F   54         Y           N    NJ           N           N  XYZ,USA
987687       M   75         Y           Y    NJ           N           N  XYZ,USA
987688       M   45         N           Y    NY           N           N  XYS,USA
987689       F   45         Y           Y    NJ           N           N  XYS,USA

Output:

        gender  Age Indicator X Indicator State foreign_ind Eu Resident address1
Cust Id
987687       M   75         Y           Y    NJ           N           N  XYZ,USA
987689       F   45         Y           Y    NJ           N           N  XYS,USA

Code:

import re

import pandas as pd

# `df` is the table above (indexed by Cust Id); `conditions` is the condition table.
n = conditions.shape[0]
filter_string = 'df['
for i in range(n):
    label, operator, condition, value = conditions.iloc[i, :4]
    # The first row has no logical operator; a blank cell may be read as NaN.
    operator = '' if pd.isna(operator) else str(operator)
    # Consecutive rows with the same label share one pair of parentheses.
    same_as_prev = i > 0 and conditions.iloc[i - 1, 0] == label
    same_as_next = i < n - 1 and conditions.iloc[i + 1, 0] == label
    opening = '(' if same_as_next and not same_as_prev else ''
    closing = ')' if same_as_prev and not same_as_next else ''
    # Every value is quoted here; the quotes around numbers are stripped below.
    term = "(df['" + str(label) + "']" + str(condition) + "'" + str(value) + "')"
    filter_string += operator + opening + term + closing
filter_string += ']'

# Remove the quotes around purely numeric values so they compare as numbers.
print(eval(re.sub(r"'(\d+)'", r'\1', filter_string)))
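Because the code above hands spreadsheet text to `eval`, it may be worth considering a variant that builds boolean masks directly. The sketch below is not the original answer's method. `COMPARE`, `combine` and `build_mask` are hypothetical names, numeric values are assumed to already be numbers rather than strings, and groups are combined strictly left to right, which can differ from Python's operator precedence if `|` is used between different labels.

```python
import operator

import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["M", "F", "M", "M", "F"],
        "Age": [65, 54, 75, 45, 45],
        "Indicator": ["Y", "Y", "Y", "N", "Y"],
        "X Indicator": ["N", "N", "Y", "Y", "Y"],
        "State": ["TX", "NJ", "NJ", "NY", "NJ"],
    },
    index=pd.Index([987685, 987686, 987687, 987688, 987689], name="Cust Id"),
)

conditions = pd.DataFrame(
    [
        ["gender", "", "==", "F"],
        ["gender", "|", "==", "M"],
        ["Age", "&", ">=", 75],
        ["Age", "|", ">=", 45],
        ["Indicator", "&", "==", "Y"],
        ["X Indicator", "&", "==", "Y"],
        ["State", "&", "==", "NJ"],
    ],
    columns=["Label", "Operator", "Condition", "Value"],
)

# Only these comparison strings are accepted; anything else raises KeyError.
COMPARE = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
           ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def combine(left, right, logic):
    """Join two boolean masks with '|' or '&' (anything but '|' is treated as '&')."""
    if left is None:
        return right
    return (left | right) if logic == "|" else (left & right)


def build_mask(df, conditions):
    """Hypothetical helper: group consecutive rows by label, then join the groups."""
    mask, group, joiner, previous = None, None, "", None
    for label, logic, cond, value in conditions.itertuples(index=False):
        term = COMPARE[cond](df[label], value)
        if label == previous:
            # Same label as the previous row: extend the current group.
            group = combine(group, term, logic)
        else:
            # New label: fold the finished group into the mask, start a new one.
            mask = combine(mask, group, joiner)
            group, joiner = term, logic
        previous = label
    mask = combine(mask, group, joiner)
    # With no conditions at all, keep every row.
    return pd.Series(True, index=df.index) if mask is None else mask


result = df[build_mask(df, conditions)]
print(result)
```

Restricting the comparisons to a fixed lookup table means an unexpected cell value fails with a `KeyError` instead of being executed.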

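Edit: Note that this only works for queries in which multiple conditions are applied to the same label, as in your example. It would not work if, for instance, you wanted to select men over 45 and women over 60. If needed, that can be added with a few changes.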
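For mixed conditions of that kind, one possible route (separate from the code above) is to let the user supply a whole expression and pass it to `DataFrame.query`. This is a minimal sketch, assuming the expression arrives as a single string, for example from one spreadsheet cell. Both expressions below are made-up illustrations, and column names containing spaces need backticks, which `query` accepts in pandas 0.25 and later.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["M", "F", "M", "M", "F"],
        "Age": [65, 54, 75, 45, 45],
        "Indicator": ["Y", "Y", "Y", "N", "Y"],
        "X Indicator": ["N", "N", "Y", "Y", "Y"],
        "State": ["TX", "NJ", "NJ", "NY", "NJ"],
    },
    index=pd.Index([987685, 987686, 987687, 987688, 987689], name="Cust Id"),
)

# Hypothetical free-form expression: men over 45 or women over 60.
expression = "(gender == 'M' and Age > 45) or (gender == 'F' and Age > 60)"
mixed = df.query(expression)
print(mixed)

# A column name with a space has to be wrapped in backticks.
spaced = df.query("`X Indicator` == 'Y' and State == 'NJ'")
print(spaced)
```

Like `eval`, `query` evaluates the text it is given, so this only suits spreadsheets that come from trusted users.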
