我想将大小列表(数组)的情况添加到 pault 答案中。
如果我们的列包含中等大小的数组(或大型数组),仍然可以将它们拆分为列。
from pyspark.sql.types import * # Needed to define DataFrame Schema.
from pyspark.sql.functions import expr
# Define schema to create DataFrame with an array typed column.
mySchema = StructType([StructField("V1", StringType(), True),
StructField("V2", ArrayType(IntegerType(),True))])
df = spark.createDataFrame([['A', [1, 2, 3, 4, 5, 6, 7]],
['B', [8, 7, 6, 5, 4, 3, 2]]], schema= mySchema)
# Split list into columns using 'expr()' in a comprehension list.
arr_size = 7
df = df.select(['V1', 'V2']+[expr('V2[' + str(x) + ']') for x in range(0, arr_size)])
# It is posible to define new column names.
new_colnames = ['V1', 'V2'] + ['val_' + str(i) for i in range(0, arr_size)]
df = df.toDF(*new_colnames)
结果是:
df.show(truncate= False)
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|V1 |V2 |val_0|val_1|val_2|val_3|val_4|val_5|val_6|
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|A |[1, 2, 3, 4, 5, 6, 7]|1 |2 |3 |4 |5 |6 |7 |
|B |[8, 7, 6, 5, 4, 3, 2]|8 |7 |6 |5 |4 |3 |2 |
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+