【Question title】: Compare multiple records in pandas
【Posted】: 2018-05-05 15:33:16
【Question】:

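I am trying to compare df1 and df2 on "Cntr No". In addition, the df1 "Total" must match a value in any one of the df2 columns ["Labour Cost", "Material Cost", "Amount in Estimate Currency"].

For example, df1 OOLU 3868088 matches df2 OOLU 3868088, and the df1 Total value "28" matches the df2 "Labour Cost" value "28".

DataFrames: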

import pandas as pd

df1 = pd.DataFrame({'Cntr No': ['OOLU 3868088','OOLU 3868088','OOLU 3868088','TRIU 0625840','TRIU 0625840','TRIU 0625840','TRIU 1234567','OOLU 6232016','OOLU 0981231','OOLU 1212444'],
                    'Total': [12,28,48,119,82.5,11.0,18.0,11.0,13.0,10.0]})

df2 = pd.DataFrame({'Cntr No': ['OOLU 3868088','OOLU 3868088','OOLU 3868088','TRIU 0625840','TRIU 0625840','TRIU 0625840','TRIU 1234567'],  
                  'Labour Cost': [0.0,0.0,28.0,0.0,54.0,0.0,0.0], 
                  'Material Cost':[0.00,12.0,58.91,82.5,54.0,0.0,16.0],
                  'Amount in Estimate Currency':[48.00,12.00,87.81,82.5,119.0,12.0,16.0]})

Expected output:

    Cntr No        Total    Tally_with_df2
0   OOLU 3868088    12.0    Yes
1   OOLU 3868088    28.0    Yes
2   OOLU 3868088    48.0    Yes
3   TRIU 0625840    119.0   Yes
4   TRIU 0625840    82.5    Yes
5   TRIU 0625840    11.0    No
6   TRIU 1234567    18.0    No

Code used: this is what I tried, but it does not do what I need.

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

d = {k: set(v.values()) for k, v in
     df_co.set_index('Cntr No')[cols].to_dict(orient='index').items()}

df['Tally'] = [j in d.get(i, set()) for i, j in zip(df['Cntr No'], df['Total'])]
df['Tally'] = df['Tally'].map({True: 'Yes', False: 'No'})

Actual df1 columns (dtypes):

Cntr No                       object
Serviced By                   object
Location                      object
WO No                         object
WASH - CHEMICAL              float64
PTI - CHILL                  float64
WASHING CONTAINER AGENT      float64
WASH - CHEMICAL AGENT        float64
WASHING CONTAINER -AGENT     float64
BUNDLING/UNBUNDLING OF FR    float64
PTI - AUTO                   float64
PTI                          float64
Struct Repair - Labour       float64
Struct Repair - Material     float64
Machy Repair - Labour        float64
Total                        float64
Vendor                        object
Sz                            object
Ty                            object
CO                            object
WO Date                       object
WO ID                         object

Actual df2 columns (dtypes):

Cntr No                            object
Equipment Size/type Group Code     object
Labour Cost                       float64
Material Cost                     float64
Amount in Estimate Currency       float64
Remarks                            object

【Comments】:

    Tags: pandas string-comparison


    【Solution 1】:

    IIUC, we can build grouped data from df2: one set of values for each unique Cntr No.

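    import numpy as np

    ## string (object) columns to exclude, so only numeric values are collected
    to_remove = df2.select_dtypes(['object']).columns.values.tolist()
    
    ## group by container and flatten each group's numeric values into a set
    df3 = (df2
    .groupby('Cntr No')
    .apply(lambda df: set(np.concatenate(df.loc[:, df.columns.difference(to_remove)].values))))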
    
    ## df3 looks like this - sets are used for fast membership tests
    print(df3)
    
    Cntr No
    OOLU 3868088    {0.0, 12.0, 48.0, 87.81, 58.91, 28.0}
    TRIU 0625840           {0.0, 12.0, 82.5, 54.0, 119.0}
    TRIU 1234567                              {16.0, 0.0}
    
    
    ## this function ensures all cases are handled
    def get_value(x, data):
        if x['Cntr No'] not in data.index:
            return 'Not Found'
        else:
            if x['Total'] in data[x['Cntr No']]:
                return 'Yes'
            else:
                return 'No'
    
    ## next we do a simple look-up
    df1['Tally_with_df2'] = df1.apply(lambda x: get_value(x, df3), axis=1)
    
    print(df1)
    
            Cntr No  Total Tally_with_df2
    0  OOLU 3868088   12.0            Yes
    1  OOLU 3868088   28.0            Yes
    2  OOLU 3868088   48.0            Yes
    3  TRIU 0625840  119.0            Yes
    4  TRIU 0625840   82.5            Yes
    5  TRIU 0625840   11.0             No
    6  TRIU 1234567   18.0             No
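
A possible vectorized alternative to the row-wise apply above, offered only as a sketch and not as the answerer's method: reshape df2 to long form with melt, then left-merge df1 against the resulting (Cntr No, value) pairs. The names cols, pairs, merged and result are introduced here for illustration, and the data is the sample from the question. Like the set lookup, it assumes exact float equality is acceptable.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'OOLU 3868088',
                'TRIU 0625840', 'TRIU 0625840', 'TRIU 0625840',
                'TRIU 1234567', 'OOLU 6232016', 'OOLU 0981231', 'OOLU 1212444'],
    'Total': [12, 28, 48, 119, 82.5, 11.0, 18.0, 11.0, 13.0, 10.0],
})
df2 = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'OOLU 3868088',
                'TRIU 0625840', 'TRIU 0625840', 'TRIU 0625840', 'TRIU 1234567'],
    'Labour Cost': [0.0, 0.0, 28.0, 0.0, 54.0, 0.0, 0.0],
    'Material Cost': [0.00, 12.0, 58.91, 82.5, 54.0, 0.0, 16.0],
    'Amount in Estimate Currency': [48.00, 12.00, 87.81, 82.5, 119.0, 12.0, 16.0],
})

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

# Long format: one row per (container, cost value); duplicates are dropped
# so the merge below cannot multiply rows of df1.
pairs = (df2.melt(id_vars='Cntr No', value_vars=cols, value_name='Total')
            [['Cntr No', 'Total']]
            .drop_duplicates())

# Left merge keeps every df1 row in order; '_merge' is 'both' when the
# (Cntr No, Total) pair also exists in df2.
merged = df1.merge(pairs, on=['Cntr No', 'Total'], how='left', indicator=True)
tally = (merged['_merge'] == 'both').map({True: 'Yes', False: 'No'})
result = df1.assign(Tally_with_df2=tally.to_numpy())

# Containers that never appear in df2 are flagged separately,
# mirroring the 'Not Found' branch of get_value.
known = result['Cntr No'].isin(df2['Cntr No'])
result.loc[~known, 'Tally_with_df2'] = 'Not Found'

print(result)
```

With the full ten-row df1 from the question, the three containers that are absent from df2 come out as 'Not Found'; drop the last two statements if a plain 'No' is preferred for them.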
    

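    【Discussion】:

    • Thanks, but the lookup code raises errors: TypeError: 'str' object cannot be interpreted as an integer. KeyError: ('OOLU 6232016', 'occurred at index 1')
    • @leong I can't reproduce the error on my side. Could you check the dtypes of the values, and whether the value "OOLU 6232016" is present in both DataFrames?
    • df3 sample: TRIU 0783320 {GP, 70.0, 40FL, 48.0, 118.0}. Maybe it is because my df2 has 2 more string columns?
    • @leong Make sure you add all string columns to the list in df.columns.difference(['Cntr No', 'add_string_column']), because in the end we want a set of all the numeric values.
    • I have added the actual df1 and df2 columns to my post. Do I have to add all of them? Please see the update to my earlier post.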
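
On the string-column question raised in the comments, one option (a sketch, not something stated in the thread) is to pick the numeric columns by dtype instead of listing every string column by hand. The single-row df2 below is made-up sample data that mirrors the df3 sample quoted in the comments; numeric_cols and value_sets are hypothetical names.

```python
import pandas as pd

# Made-up one-row df2 with the extra string columns from the question's
# dtype listing ('Equipment Size/type Group Code' and 'Remarks').
df2 = pd.DataFrame({
    'Cntr No': ['TRIU 0783320'],
    'Equipment Size/type Group Code': ['40FL'],
    'Labour Cost': [48.0],
    'Material Cost': [70.0],
    'Amount in Estimate Currency': [118.0],
    'Remarks': ['GP'],
})

# Keep only numeric columns, so string columns never reach the sets.
numeric_cols = df2.select_dtypes(include='number').columns

# One set of numeric values per container number.
value_sets = {
    cntr: set(group[numeric_cols].to_numpy().ravel().tolist())
    for cntr, group in df2.groupby('Cntr No')
}

print(value_sets)
```

If df2 could contain other numeric columns that should not take part in the comparison, passing the three cost column names explicitly would be the safer choice.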