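How to pass DataFrame as input to Spark UDF? Answers

【Question title】: How to pass DataFrame as input to Spark UDF?
【Posted】: 2018-05-10 14:12:58
【Question description】:

I have a DataFrame and I want to apply a function to each row. This function depends on other DataFrames.

Simplified example. I have three DataFrames, as follows: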

df = sc.parallelize([
    ['a', 'b', 1],
    ['c', 'd', 3]
    ]).toDF(('feat1', 'feat2', 'value'))

df_other_1 = sc.parallelize([
        ['a', 0, 1, 0.0],
        ['a', 1, 3, 0.1],
        ['a', 3, 10, 1.0],
        ['c', 0, 10, 0.2],
        ['c', 10, 25, 0.5]
        ]).toDF(('feat1', 'lower', 'upper', 'score'))

df_other_2 = sc.parallelize([
        ['b', 0, 4, 0.1],
        ['b', 4, 20, 0.5],
        ['b', 20, 30, 1.0],
        ['d', 0, 5, 0.05],
        ['d', 5, 22, 0.9]
        ]).toDF(('feat1', 'lower', 'upper', 'score'))

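For each row of df, I want to collect the unique upper values for feat1 and feat2 from df_other_1 and df_other_2. For the first row, the unique values are (1, 3, 10, 4, 20, 30). Then I sort them in descending order, (30, 20, 10, 4, 3, 1), and prepend a number one above the largest value. df would then look like this: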

df = sc.parallelize([
        ['a', 'b', 1, [31, 30, 20, 10, 4, 3, 1]],
        ['c', 'd', 3, [26, 25, 22, 10, 5]]
        ]).toDF(('feat1', 'feat2', 'value', 'lst'))

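Then, for each row of df and for each value of its lst, I want to calculate the sum of score from both df_other_1 and df_other_2, taking in each table the bucket whose lower and upper bounds the value falls between. My goal is to find the lowest value in each lst whose total score is above some threshold (e.g. 1.4).

Here is how the total score is calculated. For the first row of df, the first value of lst is 31. In df_other_1, for feat1, it lies above the highest bucket, so it gets a score of 1. The same holds for df_other_2. So the total score is 1 + 1 = 2. For the value 10 (again for the first row), the total score is 1 + 0.5 = 1.5.

This is how df would look in the end: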

df = sc.parallelize([
            ['a', 'b', 1, [31, 30, 20, 10, 4, 3, 1], [2.0, 2.0, 2.0, 1.5, 1.5, 1.1, 0.2], 4],
            ['c', 'd', 3, [26, 25, 22, 10, 5], [2.0, 1.5, 1.4, 1.4, 1.1], 25]
            ]).toDF(('feat1', 'feat2', 'value', 'lst', 'total_scores', 'target_value'))

What I am actually looking for are the target values 4 and 25. The intermediate steps do not really matter.
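The rules described above can be pinned down with a small plain-Python sketch, independent of Spark. It is only a reference for the intended logic, not a distributed solution. The names `OTHER_1`, `OTHER_2`, `score_for_value`, `candidate_values` and `target_value` are hypothetical. Two assumptions are made because they reproduce the expected output in the question: a value equal to the highest `upper` takes the last bucket's score, and "above the threshold" is read as strictly greater than it.

```python
# Lookup tables from the question, keyed by feature value. Each bucket is
# (lower, upper, score), listed in ascending order of upper.
OTHER_1 = {
    "a": [(0, 1, 0.0), (1, 3, 0.1), (3, 10, 1.0)],
    "c": [(0, 10, 0.2), (10, 25, 0.5)],
}
OTHER_2 = {
    "b": [(0, 4, 0.1), (4, 20, 0.5), (20, 30, 1.0)],
    "d": [(0, 5, 0.05), (5, 22, 0.9)],
}


def score_for_value(buckets, value):
    """Score of `value` within one key's buckets."""
    if not buckets:
        return 0.0
    top_upper = buckets[-1][1]
    if value > top_upper:
        return 1.0  # above the highest bucket
    if value == top_upper:
        return buckets[-1][2]  # assumption: top edge belongs to the last bucket
    for lower, upper, score in buckets:
        if lower <= value < upper:
            return score
    return buckets[0][2]  # assumption: below the lowest bucket, reuse its score


def candidate_values(b1, b2):
    """Unique upper values, descending, with max + 1 prepended."""
    uppers = {upper for _, upper, _ in b1 + b2}
    if not uppers:
        return []
    ordered = sorted(uppers, reverse=True)
    return [ordered[0] + 1] + ordered


def target_value(feat1, feat2, threshold):
    """Lowest candidate whose total score is strictly above `threshold`."""
    b1 = OTHER_1.get(feat1, [])
    b2 = OTHER_2.get(feat2, [])
    lowest = None
    for value in candidate_values(b1, b2):
        total = score_for_value(b1, value) + score_for_value(b2, value)
        if total > threshold:
            lowest = value  # candidates are descending, so the last hit is the lowest
    return lowest


print(target_value("a", "b", 1.4))  # 4
print(target_value("c", "d", 1.4))  # 25
```

Note that the second row has totals of exactly 1.4 for the values 22 and 10, so switching the comparison to `>=` would return 10 instead of 25.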

==============================================================================

This is what I have tried so far:

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lead, when

def get_threshold_for_row(feat1, feat2, threshold):

    this_df_other_1 = df_other_1.filter(col('feat1') == feat1)
    this_df_other_2 = df_other_2.filter(col('feat1') == feat2)

    values_feat_1 = [i[0] for i in this_df_other_1.select('upper').collect()]
    values_feat_1.append(values_feat_1[-1] + 1)
    values_feat_2 = [i[0] for i in this_df_other_2.select('upper').collect()]
    values_feat_2.append(values_feat_2[-1] + 1)

    values = values_feat_1 + values_feat_2
    values = list(set(values)) #Keep unique values
    values.sort(reverse=True)  #Sort from largest to smallest

    df_1_score = df_2_score = 0
    prev_value = 10000 #Any large number
    prev_score = 10000

    for value in values:
        df_1_score = get_score_for_key(this_df_other_1, 'feat1', feat1, value)
        df_2_score = get_score_for_key(this_df_other_2, 'feat1', feat2, value)

        total_score = df_1_score + df_2_score

        if total_score < threshold and prev_score >= threshold:
            return prev_value

        prev_score = total_score
        prev_value = value


def is_dataframe_empty(df):
    return len(df.take(1)) == 0

def get_score_for_key(scores_df, grouping_key, this_id, value):

    if is_dataframe_empty(scores_df):
        return 0.0

    w = Window.partitionBy([grouping_key]).orderBy(col('upper'))

    scores_df_tmp = scores_df.withColumn("prev_value", lead(scores_df.upper).over(w))\
                        .withColumn("is_last", when(col('prev_value').isNull(), 1).otherwise(0))\
                        .drop('prev_value')

    scores_df_tmp = scores_df_tmp.withColumn("next_value", lag(scores_df_tmp.upper).over(w))\
                        .withColumn("is_first", when(col('next_value').isNull(), 1).otherwise(0))\
                        .drop('next_value').cache()

    # Bucket bounds live in the 'lower' and 'upper' columns of the lookup tables
    grouping_key_score = scores_df_tmp.filter((col(grouping_key) == this_id) &
                              (((value >= col('lower')) & (value < col('upper'))) |
                                ((value >= col('upper')) & (col('is_last') == 1)) |
                                ((value < col('lower')) & (col('is_first') == 1)) |
                                (col('lower').isNull()))) \
                    .withColumn('final_score', when(value <= col('upper'), col('score')).otherwise(1.0)) \
                    .collect()[0]['final_score']

    return grouping_key_score

df.rdd.map(lambda r: (r['feat1'], r['feat2'])) \
    .map(lambda v: (v[0], v[1], get_threshold_for_row(v[0], v[1], 1.4))) \
    .toDF()

But I get: AttributeError: 'Py4JError' object has no attribute 'message'

Sorry for the long post. Any ideas?

【Question comments】:

Tags: python apache-spark pyspark user-defined-functions

【Solution 1】:

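"I have a DataFrame and I want to apply a function to each row. This function depends on other DataFrames."

tl;dr That is not possible in a UDF.

In the broadest sense, a UDF is a function (a Catalyst expression, actually) that accepts zero or more column values (as Column references).

A UDF works on individual records only. The one case where it can see more, in the broadest case an entire DataFrame, is when it is a user-defined aggregate function (UDAF).

If you want to work with more than one DataFrame in a UDF, you have to join the DataFrames first so that the columns the UDF needs end up on the same row.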

【Comments】: