【Question title】: PySpark DataFrame: How to map array elements to columns and format a string with the values
【Posted】: 2022-10-01 00:57:26
【Question】:

I have a PySpark DataFrame that looks like this:

sdf1 = sc.parallelize([["toto", "tata", ["table", "column"], "SELECT {1} FROM {0}"], ["titi", "tutu", ["table", "column"], "SELECT {1} FROM {0}"]]).toDF(["table", "column", "parameters", "statement"])

+-----+------+---------------+-------------------+
|table|column|     parameters|          statement|
+-----+------+---------------+-------------------+
| toto|  tata|[table, column]|SELECT {1} FROM {0}|
| titi|  tutu|[table, column]|SELECT {1} FROM {0}|
+-----+------+---------------+-------------------+

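I am trying to map the elements of the "parameters" array to the columns they name, and then format "statement" with the values from those columns.

This is the result I expect after the transformation: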

sdf2 = sc.parallelize([["toto", "tata", ["table", "column"], "SELECT {1} FROM {0}", "SELECT tata FROM toto"], ["titi", "tutu", ["table", "column"], "SELECT {1} FROM {0}", "SELECT tutu FROM titi"]]).toDF(["table", "column", "parameters", "statement", "result"])

+-----+------+---------------+-------------------+---------------------+
|table|column|     parameters|          statement|               result|
+-----+------+---------------+-------------------+---------------------+
| toto|  tata|[table, column]|SELECT {1} FROM {0}|SELECT tata FROM toto|
| titi|  tutu|[table, column]|SELECT {1} FROM {0}|SELECT tutu FROM titi|
+-----+------+---------------+-------------------+---------------------+

    Tags: arrays dataframe dictionary pyspark format-string


    【Solution 1】:

    One approach, using the RDD API.

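    def addParamsToQuery(param_ls, query, r):
        # Look up each column named in param_ls on the row r
        new_param_ls = [r[k] for k in param_ls]
        # Fill the positional placeholders {0}, {1}, ... with those values
        new_query = query.format(*new_param_ls)
        return new_query

    # data_sdf is assumed to be the question's DataFrame (named sdf1 there)
    columns = data_sdf.columns

    # Keep every original column and append the formatted statement as 'result'
    data_sdf. \
        rdd. \
        map(lambda r: [r[c] for c in columns] + [addParamsToQuery(r.parameters, r.statement, r)]). \
        toDF(columns + ['result']). \
        show(truncate=False)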
    
    # +-----+------+---------------+-------------------+---------------------+
    # |table|column|parameters     |statement          |result               |
    # +-----+------+---------------+-------------------+---------------------+
    # |toto |tata  |[table, column]|SELECT {1} FROM {0}|SELECT tata FROM toto|
    # |titi |tutu  |[table, column]|SELECT {1} FROM {0}|SELECT tutu FROM titi|
    # +-----+------+---------------+-------------------+---------------------+
    

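    The function addParamsToQuery looks up each name listed in parameters on the current row to build a list of values, then inserts those values into the statement with .format().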
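
    The per-row logic can be tried without a Spark session. The following is a minimal sketch, not part of the original answer: it uses plain dicts as a stand-in for Spark Row objects, and the names add_params_to_query and rows are made up for illustration.

```python
# Plain-Python sketch of the per-row logic.
# Assumption: a dict stands in for a Spark Row; both support row[name] lookup.
def add_params_to_query(param_names, query, row):
    # Resolve each parameter name to this row's value for that column.
    values = [row[name] for name in param_names]
    # str.format fills {0}, {1}, ... positionally from the unpacked list.
    return query.format(*values)


# Hypothetical sample data mirroring the two rows in the question.
rows = [
    {"table": "toto", "column": "tata",
     "parameters": ["table", "column"], "statement": "SELECT {1} FROM {0}"},
    {"table": "titi", "column": "tutu",
     "parameters": ["table", "column"], "statement": "SELECT {1} FROM {0}"},
]

results = [add_params_to_query(r["parameters"], r["statement"], r) for r in rows]
for line in results:
    print(line)
```

    Because the placeholders are positional, the order of the names in parameters decides which value lands in {0} and which in {1}.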
