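Spark itself does not support renaming individual nested fields. You have to either cast or rebuild the whole struct. For simplicity, let's assume the data looks like this: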
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")  # assumption: a local Spark connection
cat('{"contributors": "foo", "coordinates": "bar", "entities": {"hashtags": ["foo", "bar"], "media": "missing"}}', file = "/tmp/example.json")
df <- spark_read_json(sc, "df", "/tmp/example.json", overwrite = TRUE)
df %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
|-- contributors: string (nullable = true)
|-- coordinates: string (nullable = true)
|-- entities: struct (nullable = true)
| |-- hashtags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- media: string (nullable = true)
with the simple string representation:
df %>%
  spark_dataframe() %>%
  invoke("schema") %>%
  invoke("simpleString") %>%
  cat(sep = "\n")
struct<contributors:string,coordinates:string,entities:struct<hashtags:array<string>,media:string>>
With a cast, you have to define the expression using a matching type description:
expr_cast <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "CAST(entities AS struct<e_hashtags:array<string>,media:string>)"
)
df_cast <- df %>%
  spark_dataframe() %>%
  invoke("withColumn", "entities", expr_cast) %>%
  sdf_register()
df_cast %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
|-- contributors: string (nullable = true)
|-- coordinates: string (nullable = true)
|-- entities: struct (nullable = true)
| |-- e_hashtags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- media: string (nullable = true)
To rebuild the struct, you have to match all of its components:
expr_struct <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "struct(entities.hashtags AS e_hashtags, entities.media)"
)
df_struct <- df %>%
  spark_dataframe() %>%
  invoke("withColumn", "entities", expr_struct) %>%
  sdf_register()
df_struct %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
|-- contributors: string (nullable = true)
|-- coordinates: string (nullable = true)
|-- entities: struct (nullable = false)
| |-- e_hashtags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- media: string (nullable = true)
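Writing the struct(...) expression by hand gets tedious once the struct has more than a couple of fields. Below is a minimal sketch of a helper that assembles the expression string in plain R. The name rename_struct_expr and its arguments are hypothetical (they are not part of sparklyr), and it assumes you supply the list of field names yourself rather than reading it from the schema.

```r
# Hypothetical helper (not part of sparklyr): builds a Spark SQL struct(...)
# expression string that renames selected fields of a struct column.
#   column:  name of the struct column, e.g. "entities"
#   fields:  character vector with ALL field names of the struct, in order
#   renames: named character vector, c(old_name = "new_name")
rename_struct_expr <- function(column, fields, renames) {
  # Fields without an entry in `renames` keep their original name
  new_names <- ifelse(fields %in% names(renames), renames[fields], fields)
  # Alias every field explicitly so the resulting names are unambiguous
  parts <- sprintf("%s.%s AS %s", column, fields, new_names)
  sprintf("struct(%s)", paste(parts, collapse = ", "))
}

expr_string <- rename_struct_expr(
  "entities",
  c("hashtags", "media"),
  c(hashtags = "e_hashtags")
)
cat(expr_string, "\n")
```

The resulting string can be passed to expr through invoke_static in the same way as the hand-written expression above. Note that the helper only handles one level of nesting; deeper structs would need the same treatment applied recursively.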