合并最接近的值 Pandas答案

【问题标题】：Merging on closest value Pandas合并最接近的值 Pandas
【发布时间】：2021-10-06 09:55:03
【问题描述】：

我目前正在开展一个项目，我正在按米指示（公里）匹配并合并两个 excel 文档。这两个数据帧包含相同类型的数据，但略有不同。我对这个问题的解决方案是将每个数据帧除以 100，以便更轻松地匹配列。但是，正如我预测的那样，在某些情况下这不起作用，因此这是一个临时解决方案。

我有两个数据框：

Meter_indication      Fuel1
1180784               275
1181278               280
1181791               300
1182285               280
1182801               300
1183295               280
1183717               250

Meter_indication      Fuel2
1180785               278
1181282               282
1181800               310
1182401               282
1182824               320
1183310               215
1183727               250

如您所见，这些数据帧包含相同类型的信息，但略有不同。我以前使用过合并merged_df = df1filt2.merge(df2filt, on='Meter_indication') 的方法，它只合并完全匹配的值。但是，在这种情况下，这种方法是无关紧要的。

我想要的输出是：

Meter_indication      Fuel1      Fuel2
1180784               275        278
1181278               280        282
1181791               300        310
1182285               280        282
1182801               300        320
1183295               280        215
1183717               250        250

如您所见，数据框已在“Meter_indication”上合并，并通过找到与其自身相比最接近的值进行合并。

我四处寻找有类似问题的其他人，并尝试了许多不同的建议，例如https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html

Merging pandas dataframes based on nearest value(s)

但都没有成功。

我当前的代码（必要部分）是：

    filepathname1=input1.variable
    filepathname2=input2.variable
    filepathname3=output1.variable
    filepathname4=output2.variable

    #Creating filepaths for Automatic and Manual doc (1 = Automatic, 2 = Manual).
    print("You have chosen to mix", filepathname1, "and", filepathname2)

    #Changes the option of pd (max rows and columns).
    pd.set_option("display.max_rows", None, "display.max_columns", None)

    #READS PROVIDED DOCUMENTS.
    df1 = pd.read_excel(
        filepathname1, sheetname, na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
    df2 = pd.read_excel(
        filepathname2,
        na_values=["NA"],
        skiprows=2,
        usecols="D, AG, AH")

    # Drop NaN rows.
    df2.dropna(inplace=True)
    df1.dropna(inplace=True)

    print(df2)

    #df100 = pd.DataFrame()

    #df100['Bränslenivå (%)'] = df1['Bränslenivå (%)']

    #df100['Bränslenivå (%)'] = (df100['Bränslenivå (%)'] >= 99)


    #print(df100)
    #input()

    #Filters out rows with the keywords listed in 'blacklist'.
    df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
    df1 = df1[~df1.Bränsle.isin(blacklist)]
    df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)

    df2.rename(columns={"Unnamed 32": "Actual refuel"}, inplace=True)
    df2.rename(columns={"Unnamed 33": "Mätarställning"}, inplace=True)

    #Creates new column for the difference in fuellevel column.
    df1["Difference (%)"] = df1["Bränslenivå (%)"]
    df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()

    # Renames time-column so that they match.
    df2.rename(columns={"Datum": "Tid"}, inplace=True)



    # Drops rows where the difference is equal to 0.
    df1filt = df1[(df1["Difference (%)"] != 0)]

    # Converts time-column to only year, month and date.
    df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)

    df1filt.reset_index(level=0, inplace=True)

    #Renames the index column to "row" in order to later use the "row" column 
    df1filt.rename(columns={"index": "row"}, inplace=True)

    # Creates a new column for the difference in total driven kilometers (used for matching)
    df1filt["Match"] = df1filt["Vägmätare (km)"]
    df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()

    #Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
    ROWRANGE = len(df1filt)+1
    thevalue = 0
    for currentrow in range(ROWRANGE-1):
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
            thevalue = 0
            thevalue += df1filt.loc[currentrow,'Difference (%)']
            df1filt.loc[currentrow,'Match'] = "SUMMED"
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            thevalue += df1filt.loc[currentrow,'Difference (%)']
        if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
            df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"        
            thevalue = 0

    #Removes single "refuels" that are lower than 5
    df1filt = df1filt[(df1filt['Difference (%)'] > 5)]

    #Creates a new dataframe for the summed values 
    df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]

    #Creates a estimated refueled amount column for the automatic
    df1filt2["Apparent refuel"] = df1filt2["Difference (%)"]
    df1filt2["Apparent refuel"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity

    #Renames total kilometer column so that the two documents can match
    df1filt2.rename(columns={"Vägmätare (km)": "Mätarställning"}, inplace=True)

    #Filters out rows where refuel and kilometer = NaN (Manual)

    df2.rename(columns={"x": "Actual refuel"}, inplace=True)
    df2.rename(columns={"x.1": "Mätarställning"}, inplace=True)
    df2.rename(columns={"Datum.1": "Tid"}, inplace=True)
    for col in df2.columns:
        print(col)
    input()
    df2filt = df2[(df2['Actual refuel'] != NaN) & (df2['Mätarställning'] != NaN)]


    #Drops first row
    df2filt.drop(df2filt.index[0], inplace=True)

    #Adds prefix for the time column so that they match (not used anymore because km is used to match)
    df2filt['Tid'] = '20' + df2filt['Tid'].astype(str) 

    #Rounds numeric columns
    decimals = 0
    
    df2filt['Mätarställning'] = pd.to_numeric(df2filt['Mätarställning'],errors='coerce')
    df2filt['Actual refuel'] = pd.to_numeric(df2filt['Actual refuel'],errors='coerce')
    df2filt['Mätarställning'] = df2filt['Mätarställning'].apply(lambda x: round(x, decimals))
    df2filt['Actual refuel'] = df2filt['Actual refuel'].apply(lambda x: round(x, decimals))
    

    #This is my temporary fix to the problem
    df2filt['Mätarställning'] //= 100
    df1filt2['Mätarställning'] //= 100

    merged_df = df1filt2.merge(df2filt, on='Mätarställning')

    merged_df.to_excel(filepathname3, index = False)

希望这是足够的信息。提前谢谢！

~更新（结果/问题）~

感谢@Rob Raymond 解决了我的问题。然而，通过提供的解决方案，问题得到了解决，但同时又出现了新的问题。下面的解决方案部分解决了“合并”问题。

现在某些地方的输出似乎重复，并且行似乎在同一个 Meter_indication 上合并。我不确定是什么原因造成的，但它可能与“merge_asof”的工作方式有关。我有很多数据，这意味着一些数据会有不同的位数。我对“merge_asof”方法不是很熟悉，但一种猜测是它合并了相似但不完全相同的数字。换句话说，如果是这种情况，“1234”将与“12345”合并，因为它们都包含“1234”。

以下是我使用数据创建的一些图表，用于阐明和说明问题。

当前图形输出：

所需的图形输出：

数据（用于图表）有时看起来像这样：

【问题讨论】：

标签： python pandas dataframe

【解决方案1】：

merge_asof() 确实有效。符合您的预期输出
没有对排序做出假设，sort_values() 也是如此
为了透明度，在输出中包含正确的Meter_indication

df1 = pd.read_csv(io.StringIO("""Meter_indication      Fuel1
1180784               275
1181278               280
1181791               300
1182285               280
1182801               300
1183295               280
1183717               250"""), sep="\s+")

df2 = pd.read_csv(io.StringIO("""Meter_indication      Fuel2
1180785               278
1181282               282
1181800               310
1182401               282
1182824               320
1183310               215
1183727               250"""), sep="\s+")

pd.merge_asof(
    df1.sort_values("Meter_indication"),
    df2.sort_values("Meter_indication").assign(mi=lambda d: d["Meter_indication"]),
    on="Meter_indication",
    direction="nearest",
)

Meter_indication	Fuel1	Fuel2	mi
1180784	275	278	1180785
1181278	280	282	1181282
1181791	300	310	1181800
1182285	280	282	1182401
1182801	300	320	1182824
1183295	280	215	1183310
1183717	250	250	1183727

【讨论】：

这似乎解决了问题。但是，在某些行上，我得到了奇怪的数字。 “mi”选项卡中的某些值完全关闭。并且似乎在同一个“mi”值上合并。换句话说，它有时会卡在几行相同的数字上。这很常见吗？
我在输出中添加 mi 的原因是为了获得更好的透明度。查看direction 参数的其他选项。 Meter_indication 是否总是相同的有效位数？如果不是，则可能值得确保它可以让合并更一致地工作
我不确定我是否很关注，你想让我改个“mi”的名字吗？我已经查看了不同的direction-参数，但据我了解，nearest 对我的情况来说是最合理的。但是，关于有效数字，数字的数量并不总是相同的。 “merge_asof”是否仅通过检查数字的顺序来匹配单元格？换句话说，它可以匹配“12345”和“1234”，因为两者都包含“1234”吗？
也许用一些你质疑结果的例子来更新问题。多个候选匹配的概率越高，值越小
我现在进行了编辑！ @Rob Raymond