【Question title】: Pandas: Creating an index of unique values based off of two different columns
【Posted】: 2021-01-12 04:13:04
【Question】:

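I am optimizing my code for a project that creates shipping lanes. What I currently have is a dataframe grouped by the index values from c_match. At first glance everything looks correct.

A shipping lane is a group of states that share the same discount and the same minimum charge. My code returns the states that share a discount. Most states with the same discount also have the same minimum charge. The outliers, however, are states with the same discount but different minimum charges.

Goal: create shipping lanes in which both the minimum charge and the discount percentage are the same.

My idea: create a logical operation that joins the names of the states sharing the same rate and cost, and returns their rate and cost. States that have the same rate but different costs still need to be accounted for.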
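
The idea above amounts to grouping on two keys at once. A minimal sketch, assuming a hypothetical toy frame `toy` (its values are made up for illustration and are not taken from the tables below), might look like this:

```python
import pandas as pd

# Hypothetical toy data: rate and cost per destination state for one origin.
toy = pd.DataFrame({
    "State": ["AL", "AR", "AZ", "GA", "KY"],
    "Rate": ["50.80%", "50.80%", "56.70%", "50.80%", "50.80%"],
    "Cost": [120.0, 120.0, 155.0, 125.0, 120.0],
})

# Grouping on both keys keeps states with equal rates but different costs
# in separate lanes (GA shares the rate here but not the cost).
lanes = (
    toy.groupby(["Rate", "Cost"])["State"]
    .agg("_".join)
    .reset_index(name="Shipping Lane")
)
print(lanes)
```

Grouping on `Rate` alone would place GA in the same lane as AL, AR and KY, which is the outlier case described above.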

Desired output:

Shipping Lane                                 Rate  Cost
20_21_RDWY_Purple_AL_AR_KY_LA_MS_SC_TN_PE   50.80%  120
20_21_RDWY_Purple_AZ                        50.80%  155
20_21_RDWY_Purple_CA                        62.40%  145
20_21_RDWY_Purple_CO_ND_WY_MB_NF_PQ         62.40%  155
20_21_RDWY_Purple_CT_DE_MN_NE               50.00%  145
20_21_RDWY_Purple_DC_IA_KS_MD_MI_OH_OK_WI   49.00%  125
20_21_RDWY_Purple_FL                        48.30%  125

Current code:

import itertools

import pandas as pd


def remove_dups(input, output):
    input.sort()
    n_list = list(input for input, _ in itertools.groupby(input))
    output.append(n_list)


def get_matches_discount(state):
    state_groups = []
    state_rates = []
    state_cost = []
    final_format = []

    match = []
    c_match = []
   
    for i, x in enumerate(df_d[state]):
        #checks within the column for identical values then maps where the identical values are
        match1 = [j for j, y in enumerate(df_d[state].isin([x])) if y is True]
        match.append(match1)
        remove_dups(match, c_match)


    for list in c_match:

        for elements in list:
            r = elements[0]
            state_g = df_d.index[elements]
            state_groups.append(state_g)

            state_r = df_d[state][r]
            state_rates.append(state_r)
            print(state_rates)
            match_cost = df_m[state][r]
            state_cost.append(match_cost)

    for i in state_groups:
        delimiter = "_"

        join_str = delimiter.join(i)

        j_str = "20_21_RDWY_Purple_" + join_str

        final_format.append(j_str)

    master_frame = pd.DataFrame(
        {'Shipping Lane': final_format,
         'Rate': state_rates,
         'Cost': state_cost,
         }
    )
    print(master_frame)
    return master_frame


m_col_names = ['AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA',
               'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH',
               'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'AB', 'BC',
               'MB', 'NB', 'NF', 'NS', 'ON', 'PE', 'PQ', 'SK']
# calls the function in a loop to process one column at a time
# creates the master data frame outside of the function calling for loop
master_dataframe0 = pd.DataFrame()
for state in m_col_names:
    temp_df = get_matches_discount(state)
    # Stores the function call as a variable
    master_dataframe0 = master_dataframe0.append(temp_df)
    # Creates an appended dataframe outside of the function
print(master_dataframe0)
master_dataframe0.to_excel("shipping_lanes_revised00.xlsx")

Sample input:

Minimum charge table

This is the dataframe: df_m

State   AL     AR     AZ     CA       CO      CT     DC
AL  120.00  120.00  155.00  145.00  155.00  145.00  125.00
AR  120.00  120.00  155.00  155.00  145.00  155.00  145.00
AZ  155.00  155.00  120.00  120.00  125.00  185.00  185.00
CA  145.00  164.30  120.00  120.00  170.00  185.00  185.00
CO  155.00  145.00  125.00  145.00  120.00  155.00  155.00
CT  145.00  155.00  185.00  185.00  155.00  120.00  120.00
DC  125.00  155.00  185.00  185.00  155.00  120.00  185.00
DE  145.00  155.00  185.00  185.00  155.00  120.00  120.00
FL  125.00  145.00  145.00  185.00  145.00  155.00  145.00
GA  120.00  120.00  155.00  145.00  155.00  145.00  120.00
IA  125.00  125.00  155.00  145.00  125.00  155.00  145.00
ID  145.00  155.00  145.00  145.00  125.00  185.00  185.00
IL  120.00  120.00  155.00  145.00  145.00  125.00  125.00
IN  120.00  120.00  155.00  145.00  145.00  125.00  120.00
KS  125.00  120.00  155.00  155.00  120.00  155.00  145.00
KY  120.00  120.00  155.00  145.00  145.00  125.00  125.00
LA  120.00  120.00  155.00  145.00  155.00  155.00  155.00
MA  155.00  155.00  185.00  185.00  145.00  120.00  120.00
MD  125.00  145.00  185.00  185.00  155.00  120.00  120.00
ME  155.00  155.00  185.00  185.00  145.00  120.00  125.00
MI  125.00  125.00  145.00  145.00  155.00  125.00  120.00
MN  145.00  125.00  155.00  145.00  145.00  155.00  145.00
MO  120.00  120.00  155.00  155.00  125.00  145.00  145.00
MS  120.00  120.00  155.00  155.00  145.00  155.00  145.00
MT  145.00  155.00  155.00  155.00  125.00  185.00  185.00
NC  120.00  125.00  145.00  185.00  155.00  125.00  120.00
ND  155.00  155.00  145.00  145.00  155.00  155.00  155.00
NE  145.00  125.00  155.00  155.00  120.00  155.00  155.00
NH  155.00  155.00  185.00  185.00  145.00  120.00  120.00
NJ  145.00  155.00  185.00  185.00  155.00  120.00  120.00
NM  155.00  125.00  120.00  145.00  120.00  145.00  145.00
NV  145.00  155.00  120.00  120.00  145.00  185.00  185.00
NY  145.00  145.00  185.00  185.00  155.00  120.00  120.00
OH  125.00  125.00  145.00  145.00  155.00  120.00  120.00
OK  125.00  120.00  145.00  155.00  120.00  155.00  155.00
OR  185.00  145.00  155.00  125.00  155.00  185.00  185.00
PA  145.00  145.00  185.00  185.00  155.00  120.00  120.00
RI  155.00  155.00  185.00  185.00  145.00  120.00  120.00
SC  120.00  120.00  145.00  185.00  155.00  125.00  120.00
SD  155.00  145.00  155.00  155.00  120.00  155.00  145.00
TN  120.00  120.00  155.00  145.00  155.00  145.00  125.00
TX  125.00  120.00  145.00  155.00  125.00  145.00  155.00
UT  170.00  164.30  132.50  132.50  127.20  145.00  145.00
VA  120.00  145.00  145.00  185.00  155.00  120.00  120.00
   

Discount table

This is the dataframe: df_d

State   AL      AR      AZ      CA      CO     CT     DC
    AL  50.80%  44.10%  54.30%  73.10%  53.90%  50.00%  49.00%
    AR  50.80%  50.80%  53.90%  65.70%  50.00%  53.90%  50.00%
    AZ  56.70%  55.80%  50.80%  54.10%  49.60%  59.50%  64.40%
    CA  62.40%  61.00%  54.30%  61.40%  43.00%  52.30%  54.30%
    CO  54.30%  67.10%  49.00%  65.70%  50.80%  54.30%  54.30%
    CT  50.00%  53.90%  64.40%  72.50%  54.30%  50.80%  50.80%
    DC  49.00%  53.90%  64.40%  64.40%  54.30%  50.80%  64.40%
    DE  50.00%  53.90%  64.40%  64.40%  54.30%  50.80%  50.80%
    FL  48.30%  35.00%  55.50%  55.50%  55.10%  66.40%  62.30%
    GA  67.90%  44.10%  71.00%  64.60%  56.00%  50.00%  44.10%
    IA  49.00%  49.00%  54.30%  61.80%  49.00%  53.90%  50.00%
    ID  61.80%  54.30%  50.00%  75.90%  49.00%  64.40%  64.40%
    IL  44.10%  44.10%  54.30%  64.00%  50.00%  49.00%  49.00%
    IN  44.10%  1.60%   11.70%  26.10%  -0.70%  49.00%  44.10%
    KS  49.00%  63.40%  61.00%  67.70%  72.50%  72.20%  50.00%
    KY  50.80%  44.10%  54.30%  61.50%  50.00%  49.00%  49.00%
    LA  50.80%  44.10%  54.30%  61.80%  53.90%  54.30%  53.90%
    MA  63.50%  53.90%  67.70%  63.90%  53.00%  63.50%  44.10%
    MD  49.00%  50.00%  64.40%  73.80%  54.30%  50.80%  50.80%
    ME  53.90%  54.30%  64.40%  64.40%  61.80%  50.80%  49.00%
    MI  49.00%  49.00%  61.80%  55.10%  53.90%  49.00%  44.10%
    MN  50.00%  49.00%  54.30%  61.80%  50.00%  53.90%  50.00%
    MO  44.10%  50.80%  53.90%  56.10%  49.00%  50.00%  50.00%
    MS  50.80%  50.80%  54.30%  63.90%  50.00%  53.90%  50.00%
    MT  61.80%  54.30%  53.90%  75.80%  49.00%  64.40%  64.40%
    NC  44.10%  59.20%  53.50%  58.60%  57.90%  42.90%  69.60%
    ND  54.30%  53.90%  61.80%  61.80%  54.30%  53.90%  53.90%
    NE  50.00%  49.00%  54.30%  54.30%  44.10%  53.90%  53.90%
    NH  53.90%  54.30%  64.40%  64.40%  61.80%  50.80%  44.10%
    NJ  50.50%  51.50%  70.50%  66.20%  59.70%  67.10%  50.80%
    NM  53.90%  49.00%  44.10%  68.20%  44.10%  61.80%  61.80%
    NV  61.80%  54.30%  52.70%  73.50%  50.00%  64.40%  64.40%
    NY  61.10%  69.00%  65.50%  68.90%  63.00%  68.40%  50.80%
    OH  49.00%  49.00%  68.50%  71.50%  72.30%  60.70%  44.10%
    OK  49.00%  50.80%  50.00%  54.30%  44.10%  54.30%  54.30%
    OR  64.40%  61.80%  53.90%  64.00%  53.90%  64.40%  64.40%
    PA  47.20%  57.00%  33.70%  51.90%  45.50%  50.80%  50.80%
    RI  53.90%  54.30%  64.40%  64.40%  61.80%  50.80%  44.10%
    SC  50.80%  44.10%  61.80%  58.70%  54.30%  49.00%  44.10%
    SD  53.90%  50.00%  54.30%  54.30%  44.10%  54.30%  61.80%
    TN  50.80%  50.80%  52.50%  62.60%  61.30%  53.30%  49.00%
    TX  56.60%  46.00%  51.40%  58.30%  53.20%  63.10%  65.10%
    UT  45.00%  60.60%  73.50%  73.50%  70.30%  44.40%  61.90%
    VA  57.90%  50.00%  61.80%  72.10%  54.30%  44.10%  50.80%

Current output:

                                                                      Shipping Lane     Rate   Cost
0                                             20_21_RDWY_Purple_AL_AR_KY_LA_MS_SC_TN_PE   50.80%  120.0
1                                                                  20_21_RDWY_Purple_AZ   56.70%  155.0
2                                                                  20_21_RDWY_Purple_CA   62.40%  145.0
3                                                   20_21_RDWY_Purple_CO_ND_WY_MB_NF_PQ   54.30%  155.0
4                                                         20_21_RDWY_Purple_CT_DE_MN_NE   50.00%  145.0
5                                             20_21_RDWY_Purple_DC_IA_KS_MD_MI_OH_OK_WI   49.00%  125.0
6                                                                  20_21_RDWY_Purple_FL   48.30%  125.0
7                                                                  20_21_RDWY_Purple_GA   67.90%  120.0
8                                                      20_21_RDWY_Purple_ID_MT_NV_AB_SK   61.80%  145.0
9                                                      20_21_RDWY_Purple_IL_IN_MO_NC_WV   44.10%  120.0
10                                                                 20_21_RDWY_Purple_MA   63.50%  155.0
11                                            20_21_RDWY_Purple_ME_NH_NM_RI_SD_VT_NB_NS   53.90%  155.0
12                                                                 20_21_RDWY_Purple_NJ   50.50%  145.0
13                                                                 20_21_RDWY_Purple_NY   61.10%  145.0
14                                                           20_21_RDWY_Purple_OR_WA_BC   64.40%  185.0
15                                                                 20_21_RDWY_Purple_PA   47.20%  145.0
16                                                                 20_21_RDWY_Purple_TX   56.60%  125.0
17                                                                 20_21_RDWY_Purple_UT   45.00%  170.0
18                                                                 20_21_RDWY_Purple_VA   57.90%  120.0
19                                                                 20_21_RDWY_Purple_ON   37.30%  145.0
0                                                   20_21_RDWY_Purple_AL_GA_IL_KY_LA_SC   44.10%  120.0
1                                          20_21_RDWY_Purple_AR_MO_MS_OK_TN_NB_NF_NS_PE   50.80%  120.0
2                                                                  20_21_RDWY_Purple_AZ   55.80%  155.0
3                                                                  20_21_RDWY_Purple_CA   61.00%  164.3
4                                                                  20_21_RDWY_Purple_CO   67.10%  145.0
5                                                   20_21_RDWY_Purple_CT_DC_DE_MA_ND_MB   53.90%  155.0
6                                                                  20_21_RDWY_Purple_FL   35.00%  145.0
7                                             20_21_RDWY_Purple_IA_MI_MN_NE_NM_OH_WI_WV   49.00%  125.0
8                                          20_21_RDWY_Purple_ID_ME_MT_NH_NV_RI_VT_PQ_SK   54.30%  155.0
9                                                                  20_21_RDWY_Purple_IN    1.60%  120.0
10                                                                 20_21_RDWY_Purple_KS   63.40%  120.0
11                                                        20_21_RDWY_Purple_MD_SD_VA_WY   50.00%  145.0
12                                                                 20_21_RDWY_Purple_NC   59.20%  125.0
13                                                                 20_21_RDWY_Purple_NJ   51.50%  155.0
14                                                                 20_21_RDWY_Purple_NY   69.00%  145.0
15                                                           20_21_RDWY_Purple_OR_WA_AB   61.80%  145.0
16                                                                 20_21_RDWY_Purple_PA   57.00%  145.0
17                                                                 20_21_RDWY_Purple_TX   46.00%  120.0
18                                                                 20_21_RDWY_Purple_UT   60.60%  164.3
19                                                                 20_21_RDWY_Purple_BC   64.40%  185.0
20                                                                 20_21_RDWY_Purple_ON   32.10%  145.0
0                              20_21_RDWY_Purple_AL_CA_IA_IL_KY_LA_MN_MS_NE_SD_WA_AB_BC   54.30%  155.0
1                                                         20_21_RDWY_Purple_AR_MO_MT_OR   53.90%  155.0
2                                                      20_21_RDWY_Purple_AZ_NB_NF_NS_PE   50.80%  120.0
3                                                                  20_21_RDWY_Purple_CO   49.00%  125.0
4                                    20_21_RDWY_Purple_CT_DC_DE_MD_ME_NH_RI_VT_ON_PQ_SK   64.40%  185.0
5                                                                  20_21_RDWY_Purple_FL   55.50%  145.0
6                                                                  20_21_RDWY_Purple_GA   71.00%  155.0
7                                                            20_21_RDWY_Purple_ID_OK_WY   50.00%  145.0
8                                                                  20_21_RDWY_Purple_IN   11.70%  155.0
9                                                                  20_21_RDWY_Purple_KS   61.00%  155.0
10                                                                 20_21_RDWY_Purple_MA   67.70%  185.0
11                                                  20_21_RDWY_Purple_MI_ND_SC_VA_WV_MB   61.80%  145.0
12                                                                 20_21_RDWY_Purple_NC   53.50%  145.0
13                                                                 20_21_RDWY_Purple_NJ   70.50%  185.0
14                                                                 20_21_RDWY_Purple_NM   44.10%  120.0
15                                                                 20_21_RDWY_Purple_NV   52.70%  120.0
16                                                                 20_21_RDWY_Purple_NY   65.50%  185.0
17                                                                 20_21_RDWY_Purple_OH   68.50%  145.0
18                                                                 20_21_RDWY_Purple_PA   33.70%  185.0
19                                                                 20_21_RDWY_Purple_TN   52.50%  155.0
20                                                                 20_21_RDWY_Purple_TX   51.40%  145.0


  

【Comments】:

    Tags: python pandas dataframe logic data-science


    【Solution 1】:

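    You have the states as rows, but they are also the columns. It looks like you are only showing sample output for the AL column? You can merge the two dataframes on State and then .groupby Rate and Cost. Then return the joined string of the states in each group (with .apply(lambda x: '_'.join(x))); because you group by rate and cost, every state in a group shares both: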

    master_dataframe0 = (pd.merge(df_d[['State', 'AL']], df_m[['State', 'AL']], how='inner', on='State')
                        .rename({'AL_x' : 'Rate', 'AL_y' : 'Cost'}, axis=1)
                        .groupby(['Rate', 'Cost'])['State'].apply(lambda x: '_'.join(x)).reset_index()
                        .sort_values('State'))
    master_dataframe0 = master_dataframe0[['State', 'Rate', 'Cost']].assign(State='20_21_RDWY_Purple_' + master_dataframe0['State'])
    master_dataframe0
    Out[1]: 
                                         State    Rate   Cost
    7   20_21_RDWY_Purple_AL_AR_KY_LA_MS_SC_TN  50.80%  120.0
    11                    20_21_RDWY_Purple_AZ  56.70%  155.0
    15                    20_21_RDWY_Purple_CA  62.40%  145.0
    9                  20_21_RDWY_Purple_CO_ND  54.30%  155.0
    5            20_21_RDWY_Purple_CT_DE_MN_NE  50.00%  145.0
    4   20_21_RDWY_Purple_DC_IA_KS_MD_MI_OH_OK  49.00%  125.0
    3                     20_21_RDWY_Purple_FL  48.30%  125.0
    18                    20_21_RDWY_Purple_GA  67.90%  120.0
    14              20_21_RDWY_Purple_ID_MT_NV  61.80%  145.0
    0            20_21_RDWY_Purple_IL_IN_MO_NC  44.10%  120.0
    16                    20_21_RDWY_Purple_MA  63.50%  155.0
    8         20_21_RDWY_Purple_ME_NH_NM_RI_SD  53.90%  155.0
    6                     20_21_RDWY_Purple_NJ  50.50%  145.0
    13                    20_21_RDWY_Purple_NY  61.10%  145.0
    17                    20_21_RDWY_Purple_OR  64.40%  185.0
    2                     20_21_RDWY_Purple_PA  47.20%  145.0
    10                    20_21_RDWY_Purple_TX  56.60%  125.0
    1                     20_21_RDWY_Purple_UT  45.00%  170.0
    12                    20_21_RDWY_Purple_VA  57.90%  120.0
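
The same merge-and-group step can be wrapped in a function so that it works for any origin column, not only AL. The sketch below is one possible arrangement, not the answerer's code: the helper name `lanes_for` and the `prefix` parameter are hypothetical, and the two small frames hold only the first three rows of the AL and AR columns from the question's tables. Renaming each column before the merge avoids the `_x`/`_y` suffixes.

```python
import pandas as pd


def lanes_for(origin, df_d, df_m, prefix="20_21_RDWY_Purple_"):
    """Group destination states sharing both rate and cost for one origin column."""
    merged = pd.merge(
        df_d[["State", origin]].rename(columns={origin: "Rate"}),
        df_m[["State", origin]].rename(columns={origin: "Cost"}),
        on="State",
        how="inner",
    )
    out = (
        merged.groupby(["Rate", "Cost"])["State"]
        .agg("_".join)
        .reset_index()
    )
    out["State"] = prefix + out["State"]
    return out[["State", "Rate", "Cost"]].sort_values("State", ignore_index=True)


# Small subsets of the question's tables (first three rows, AL and AR columns).
df_d_small = pd.DataFrame({
    "State": ["AL", "AR", "AZ"],
    "AL": ["50.80%", "50.80%", "56.70%"],
    "AR": ["44.10%", "50.80%", "55.80%"],
})
df_m_small = pd.DataFrame({
    "State": ["AL", "AR", "AZ"],
    "AL": [120.0, 120.0, 155.0],
    "AR": [120.0, 120.0, 155.0],
})

al_lanes = lanes_for("AL", df_d_small, df_m_small)
ar_lanes = lanes_for("AR", df_d_small, df_m_small)
print(al_lanes)
print(ar_lanes)
```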
    

    【Comments】:

    • How do you do the rename for every state, not just AL?
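    • @AdamZuckerman What would the output look like? 50 different rate columns and 50 different cost columns? Or, as rows, you could perhaps label each one as AL - 0_21_RDWY_Purple_AL_AR_KY_LA_MS_SC_TN, CA - 20_21_RDWY_Purple_AZ, CA - 0_21_RDWY_Purple_AL_AR_KY_LA_MS_SC_TN, AL - 20_21_RDWY_Purple_AZ. You can see what I mean: prefix the state in the column. Either way, I think it would be best to create a new SO question that extends this one and includes the desired output. You don't have to show all 50 states in the columns and rows, just a few so that people get the idea.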
    • The output looks great now! I figured it out. Simple fix: .rename({state + '_x': 'Rate', state + '_y': 'Cost'}, axis=1)
    【Solution 2】:

    With Erickson's help on .groupby and the lambda function, we arrived at the correct solution:

    import pandas as pd
    
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)
    
    df_d = pd.read_excel(path,
                            sheet_name=0,
                            header=0,
                            index_col=False,
                            keep_default_na=True)
    df_m = pd.read_excel(path2,
                           sheet_name=0,
                           header=0,
                           index_col=False,
                           keep_default_na=True)
    
    m_col_names = ['AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA',
                   'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH',
                   'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'AB', 'BC',
                   'MB', 'NB', 'NF', 'NS', 'ON', 'PE', 'PQ', 'SK']
    
    # Collect one frame per origin state and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0).
    frames = []
    for state in m_col_names:
    
        master_dataframe0 = (pd.merge(df_d[['State', state]], df_m[['State', state]], how='inner', on='State')
                             .rename({state + '_x': 'Rate', state + '_y': 'Cost'}, axis=1)
                             .groupby(['Rate', 'Cost'])['State'].apply(lambda x: '_'.join(x)).reset_index()
                             .sort_values('State'))
        master_dataframe0['Origin'] = state
        master_dataframe0 = master_dataframe0[['State', 'Rate', 'Cost', 'Origin']].assign(
            State='20_21_RDWY_Purple_' + master_dataframe0['State'])
    
        frames.append(master_dataframe0)
    
    # Print and write the combined result once, after the loop has finished.
    final_frame = pd.concat(frames)
    print(final_frame)
    final_frame.to_excel("w3llshipmeright.xlsx")
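
As a possible alternative to looping over every origin column, both tables could be reshaped to long form and grouped in one pass. This is only a sketch under assumptions: the frames `df_d_small` and `df_m_small` are small subsets of the question's tables (first three rows, AL and AR columns), and the column names `Origin`, `Rate` and `Cost` are chosen here for illustration.

```python
import pandas as pd

# Small subsets of the question's tables (first three rows, AL and AR columns).
df_d_small = pd.DataFrame({
    "State": ["AL", "AR", "AZ"],
    "AL": ["50.80%", "50.80%", "56.70%"],
    "AR": ["44.10%", "50.80%", "55.80%"],
})
df_m_small = pd.DataFrame({
    "State": ["AL", "AR", "AZ"],
    "AL": [120.0, 120.0, 155.0],
    "AR": [120.0, 120.0, 155.0],
})

# Reshape each wide table to long form: one row per (State, Origin) pair.
rates = df_d_small.melt(id_vars="State", var_name="Origin", value_name="Rate")
costs = df_m_small.melt(id_vars="State", var_name="Origin", value_name="Cost")
long_df = rates.merge(costs, on=["State", "Origin"], how="inner")

# One groupby covers every origin column at once.
lanes = (
    long_df.groupby(["Origin", "Rate", "Cost"])["State"]
    .agg("_".join)
    .reset_index()
)
lanes["State"] = "20_21_RDWY_Purple_" + lanes["State"]
print(lanes)
```

Whether this is preferable to the loop depends on the data; it trades the explicit per-state loop for a single reshape and merge.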
    

    Correct output:

    【Comments】:
