Accessing nested columns in a PySpark DataFrame

[Question title]: Accessing nested columns in a PySpark DataFrame
[Posted]: 2017-07-03 13:48:03
[Question]:

I have an XML document that looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Position>
    <Search>
        <Location>
            <Region>OH</Region>
            <Country>us</Country>
            <Longitude>-816071</Longitude>
            <Latitude>415051</Latitude>
        </Location>
    </Search>
</Position>

I read it into a DataFrame:

df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='Position').load('1.xml')

I can see one column:

df.columns
['Search']

print(df.select("Search"))
DataFrame[Search: struct<Location:struct<Country:string,Latitude:bigint,Longitude:bigint,Region:string>>]

How do I access the nested columns, e.g. Location.Region?

[Comments]:

Could you post a sample row from the DataFrame you get?
That was very helpful, thank you.

Tags: apache-spark dataframe pyspark

[Solution 1]:

You can expand every field of the nested Location struct with the star syntax:

df.select("Search.Location.*").show()

Output:

+-------+--------+---------+------+
|Country|Latitude|Longitude|Region|
+-------+--------+---------+------+
|     us|  415051|  -816071|    OH|
+-------+--------+---------+------+

[Comments]: