【Question title】: Join two dataframes based on closest combination that sums up to a target value
【Posted】: 2021-12-16 00:19:06
【Question】:

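I am trying to join the two dataframes below based on the combination of rows in the df2 column Sales whose sum is closest to the target value in the df1 column Total Sales. When joining, the Name and Date columns must be the same in both dataframes (as shown in the expected output).

For example, row 0 of df1 should be matched only against rows 0 and 1 of df2, because those are the rows with the same Name and Date (Name: John, Date: 2021-10-01).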

df1:

df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
                    "Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})

    Name    Date        Total Sales
0   John    2021-10-01  15500
1   John    2021-11-01  5500
2   Jack    2021-10-10  17600
3   Nancy   2021-10-12  20700
4   Ahmed   2021-10-30  12000

df2:

df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
                          "7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
                    "Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
                            "8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
                            "8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
                    "Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
                             "7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})

    ID  Name    Date        Sales
0   JO1 John    2021-10-01  10000
1   JO2 John    2021-10-01  5000
2   JO3 John    2021-11-01  1000
3   JO4 John    2021-11-01  5500
4   JA1 Jack    2021-10-10  10000
5   JA2 Jack    2021-10-10  7000
6   NA1 Nancy   2021-10-12  20000
7   NA2 Nancy   2021-10-12  100
8   NA3 Nancy   2021-10-12  500
9   NA4 Nancy   2021-10-12  100
10  AH1 Ahmed   2021-10-30  5000
11  AH2 Ahmed   2021-10-30  7000
12  AH3 Ahmed   2021-10-30  10000
13  AH3 Ahmed   2021-10-29  12000

Expected output:

    Name    Date        Total Sales Comb IDs            Comb Total
0   John    2021-10-01  15500       JO1, JO2            15000.0
1   John    2021-11-01  5500        JO4                 5500.0
2   Jack    2021-10-10  17600       JA1, JA2            17000.0
3   Nancy   2021-10-12  20700       NA1, NA2, NA3, NA4  20700.0
4   Ahmed   2021-10-30  12000       AH1, AH2            12000.0

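The approach I tried below works for only one row at a time, and I am not sure how to apply it to pandas dataframes to get the expected output.

In the script below, the variable numbers represents the Sales column in df2, and the variable target represents the Total Sales column in df1.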

import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000

best_combination = ((None,))
best_result = math.inf
best_sum = 0

for L in range(0, len(numbers)+1):
    for combination in itertools.combinations(numbers, L):
        # use the built-in sum() instead of shadowing it with a local variable
        total = sum(combination)
        result = target - total
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = total

print("\nbest sum{} = {}".format(best_combination, best_sum))


[Out] best sum(1000, 5000) = 6000

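【Comments】:

  • You should look at the knapsack problem.
  • It seems that in the most general case you have one optimization problem per person and date: you must choose a subset of a set of numbers so as to minimize the difference between its sum and a target. This is similar to the knapsack problem. Unless your problem is small enough to brute-force, getting the optimal answer is not trivial, and even then I doubt you will get an elegant solution.

Tags: python-3.x pandas dataframe numpy itertools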


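【Solution 1】:

Take the code you wrote that finds the best sum and turn it into a function (let's call it opt) that takes a target and a dataframe (which will be a subset of df2) as parameters. It needs to return the list of IDs corresponding to the best combination.

Write another function (let's call it calc) that takes 3 parameters: name, date, and target. This function filters df2 by name and date, passes the filtered rows together with the target to opt, and returns that function's result. Finally, iterate over the rows of df1 and call calc with the values from each row (or use pandas.DataFrame.apply).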
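
The two functions described above could be sketched roughly as follows. This is only one possible reading of the answer, not its author's code: opt brute-forces every combination exactly like the script in the question, so it is practical only for small groups. Several details are assumptions made for illustration: opt returns the combined total alongside the IDs, calc receives df2 as an explicit fourth argument instead of reading a global, and the helper join_closest is a hypothetical name.

```python
import itertools

import pandas as pd


def opt(target, subset):
    """Brute-force the combination of rows in `subset` whose Sales sum is
    closest to `target`. Returns (list_of_ids, combined_total)."""
    rows = list(zip(subset["ID"], subset["Sales"]))
    # Start from the empty combination (total 0), as the question's loop does.
    best_ids, best_total, best_diff = [], 0, abs(target)
    for size in range(1, len(rows) + 1):
        for combo in itertools.combinations(rows, size):
            total = sum(sales for _, sales in combo)
            diff = abs(target - total)
            if diff < best_diff:  # strict: the first best combination wins ties
                best_ids = [row_id for row_id, _ in combo]
                best_total, best_diff = total, diff
    return best_ids, best_total


def calc(name, date, target, df2):
    """Filter df2 to the rows matching name and date, then delegate to opt.
    Passing df2 explicitly is an assumption; the answer implies a global."""
    subset = df2[(df2["Name"] == name) & (df2["Date"] == date)]
    return opt(target, subset)


def join_closest(df1, df2):
    """Hypothetical helper: call calc once per df1 row and attach the result."""
    results = [
        calc(name, date, target, df2)
        for name, date, target in zip(df1["Name"], df1["Date"], df1["Total Sales"])
    ]
    out = df1.copy()
    out["Comb IDs"] = [", ".join(ids) for ids, _ in results]
    out["Comb Total"] = [total for _, total in results]
    return out
```

With df1 and df2 as defined in the question, join_closest(df1, df2) should reproduce the Comb IDs column of the expected output, with Comb Total as integers rather than floats. Because the number of combinations doubles with every extra row in a Name/Date group, larger groups would need a knapsack-style approach like the one hinted at in the comments.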

【Comments】:
