【Question title】: Filter dataframe by key in a list of dictionaries in pyspark
【Posted】: 2022-07-21 21:04:44
【Question】:

In pyspark, how can I filter a dataframe that has a column containing a list of dictionaries, based on a specific dictionary key?

+------------------------------------+---------------+
|foo_dic_list                        |text           |
+------------------------------------+---------------+
|[{'1': [1,2,3],'4': [2,3,4]}]       |teacher        |
|[{'2': [5,2,3] }]                   |student        |
|[{'4': [2,2,2]}]                    |gamer          |
|[{'3': [3,3,3]}]                    |robot          | 
+------------------------------------+---------------+

I want to select the rows shown below, i.e. those where the keys in the foo_dic_list column include '4'.

+------------------------------------+---------------+
|foo_dic_list                        |text           |
+------------------------------------+---------------+
|[{'1': [1,2,3],'4': [2,3,4]}]       |teacher        |
|[{'4': [2,2,2]}]                    |gamer          |
+------------------------------------+---------------+

【Question comments】:

  • What is the data type of that column?

Tags: dictionary pyspark


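【Solution 1】:

This may not be the best approach, but we can use a UDF to get the list of keys and then filter on it with array_contains(). The approach below works only when the array contains a single dictionary, because it reads just the first array element.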

data_ls = [
    (['''{'1': [1,2,3],'4': [2,3,4]}'''], 'teacher'),
    (['''{'2': [5,2,3] }'''], 'student'),
    (['''{'4': [2,2,2]}'''], 'gamer')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['foo_dic_list', 'text'])

# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'2': [5,2,3] }]            |student|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+

# root
#  |-- foo_dic_list: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- text: string (nullable = true)

Create a function that parses each element as a JSON string, which yields a dictionary, then use dict.keys() to get the list of keys.

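import json

from pyspark.sql import functions as func
from pyspark.sql.types import ArrayType, StringType

def getDictKeys(json_str):
    # The elements use single quotes, so swap them for double quotes to get valid JSON.
    json_dict = json.loads(json_str.replace("'", '"'))
    # Return the dictionary's keys as a list of strings.
    return list(json_dict.keys())

# Register the function as a UDF that returns array<string>.
getDictKeys_udf = func.udf(getDictKeys, ArrayType(StringType()))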

data_sdf. \
    withColumn('arr_element', func.col('foo_dic_list').getItem(0)). \
    withColumn('keys_arr', getDictKeys_udf(func.col('arr_element'))). \
    filter(func.array_contains('keys_arr', '4')). \
    select('foo_dic_list', 'text'). \
    show(truncate=False)

# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+

data_sdf. \
    withColumn('arr_element', func.col('foo_dic_list').getItem(0)). \
    withColumn('keys_arr', getDictKeys_udf(func.col('arr_element'))). \
    show(truncate=False)

# +-----------------------------+-------+---------------------------+--------+
# |foo_dic_list                 |text   |arr_element                |keys_arr|
# +-----------------------------+-------+---------------------------+--------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|{'1': [1,2,3],'4': [2,3,4]}|[1, 4]  |
# |[{'2': [5,2,3] }]            |student|{'2': [5,2,3] }            |[2]     |
# |[{'4': [2,2,2]}]             |gamer  |{'4': [2,2,2]}             |[4]     |
# +-----------------------------+-------+---------------------------+--------+

# root
#  |-- foo_dic_list: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- text: string (nullable = true)
#  |-- arr_element: string (nullable = true)
#  |-- keys_arr: array (nullable = true)
#  |    |-- element: string (containsNull = true)
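
The UDF above only looks at the first array element. A minimal sketch of a helper that gathers keys from every dictionary string in the array is given here. It assumes each element is a Python-literal dict string like the sample data; `collect_dict_keys` is a hypothetical name, and `ast.literal_eval` is swapped in for `json.loads` plus quote replacement so that single-quoted strings parse directly.

```python
import ast

def collect_dict_keys(dict_strs):
    """Return the keys of every dict literal in dict_strs, in order, without duplicates."""
    keys = []
    for dict_str in dict_strs or []:  # tolerate a null (None) array
        parsed = ast.literal_eval(dict_str)  # parse the single-quoted dict literal
        for key in parsed:
            key = str(key)
            if key not in keys:
                keys.append(key)
    return keys

# Hypothetical input with two dictionaries in one array.
two_dicts = ["{'1': [1,2,3]}", "{'4': [2,3,4], '1': [0]}"]
print(collect_dict_keys(two_dicts))  # ['1', '4']
```

Wrapped as `func.udf(collect_dict_keys, ArrayType(StringType()))` and applied to `foo_dic_list` itself, a helper like this should make the `getItem(0)` step unnecessary, though it still pays the usual Python UDF serialization cost.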

【Comments】:

    【Solution 2】:

    Take the simple route: use locate on the string column as shown here, then keep only the rows where location > 0.

    from pyspark.sql.functions import col, locate

    d1 = [
        ("[{'1': [1,2,3],'4': [2,3,4]}]", "teacher"),
        ("[{'2': [5,2,3] }]",             "student"),
        ("[{'4': [2,2,2]}]",              "gamer"),
        ("[{'3': [3,3,3]}]",              "robot"),
    ]
    
    df1 = spark.createDataFrame(d1, ['foo_dic_list', 'text'])
    df1.printSchema()
    # root
    #  |-- foo_dic_list: string (nullable = true)
    #  |-- text: string (nullable = true)
    df1.withColumn('location', locate("\'4\':", col('foo_dic_list'))).show(10, False)
    # +-----------------------------+-------+--------+
    # |foo_dic_list                 |text   |location|
    # +-----------------------------+-------+--------+
    # |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|16      |
    # |[{'2': [5,2,3] }]            |student|0       |
    # |[{'4': [2,2,2]}]             |gamer  |3       |
    # |[{'3': [3,3,3]}]             |robot  |0       |
    # +-----------------------------+-------+--------+
    

    【Comments】:
