【发布时间】:2019-04-07 13:52:46
【问题描述】:
我一直在四处寻找,但还没有找到一种方法来重构数据框的列,以便根据数组内容动态地向数据框添加新列。我是 python 新手,所以我可能在搜索错误的术语,这也是我还没有找到明确示例的原因。请让我知道这是否是重复的和找到它的参考链接。我想我只需要指出正确的方向。
好的,详细的。
环境是pyspark 2.3.2和python 2.7
示例列包含 2 个数组,它们彼此 1 对 1 相关。我想为 titles 数组中的每个值创建一个列并放入相应的名称(在person 数组)各自的列。
我拼凑了一个例子来专注于我更改数据框的问题。
import json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql import functions as f
input = { "sample": { "titles": ["Engineer", "Designer", "Manager"], "person": ["Mary", "Charlie", "Mac"] }, "location": "loc a"},{ "sample": { "titles": ["Engineer", "Owner"],
"person": ["Tom", "Sue"] }, "location": "loc b"},{ "sample": { "titles": ["Engineer", "Designer"], "person": ["Jane", "Bill"] }, "location": "loc a"}
a = [json.dumps(input)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
这是我的数据框的架构:
In [4]: df.printSchema()
root
|-- location: string (nullable = true)
|-- sample: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- titles: array (nullable = true)
| | |-- element: string (containsNull = true)
我的数据框数据:
In [5]: df.show(truncate=False)
+--------+-----------------------------------------------------+
|location|sample |
+--------+-----------------------------------------------------+
|loc a |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|
|loc b |[[Sue, Tom], [Owner, Engineer]] |
|loc a |[[Jane, Bill], [Engineer, Designer]] |
+--------+-----------------------------------------------------+
我希望我的数据框看起来像什么:
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|location|sample |Engineer |Desginer |Manager | Owner |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|loc a |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|Mary |Charlie |Mac | |
|loc b |[[Sue, Tom], [Owner, Engineer]] |Tom | | |Sue |
|loc a |[[Jane, Bill], [Engineer, Designer]] |Jane |Bill | | |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
我尝试过使用explode 函数,结果却是在每条记录中都有更多的带有数组字段的记录。 stackoverflow 中有一些示例,但它们具有静态列名。该数据集可以按任何顺序排列它们,并且以后可以添加新标题。
【问题讨论】:
标签: python-2.7 apache-spark pyspark