使用 Pyspark 访问 Dataframe 的行内行（嵌套 JSON）答案

【问题标题】：Access Dataframe's Row inside Row (nested JSON) with Pyspark使用 Pyspark 访问 Dataframe 的行内行（嵌套 JSON）
【发布时间】：2018-08-31 04:20:57
【问题描述】：

使用 pyspark，我正在从一个文件夹 contentdata2 中读取多个包含一个 JSON 对象的文件，

df = spark.read\
.option("mode", "DROPMALFORMED")\
.json("./data/contentdata2/")

df.printSchema()
content = df.select('fields').collect()

df.printSchema() 产生的地方

root
|-- fields: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- field: string (nullable = true)
|    |    |-- type: string (nullable = true)
|    |    |-- value: string (nullable = true)
|-- id: string (nullable = true)
|-- score: double (nullable = true)
|-- siteId: string (nullable = true)

我希望访问fields.element.field，并存储每个等于body的字段，以及等于urlhash的字段（对于每个JSON对象）。

content的格式是一个Row（字段），包含其他Row，像这样：

[Row(fields=[Row(field=‘body’, type=None, value=’[“First line of text“,”Second line of text”]), Row(field='urlhash', type=None, value='0a0b774c21c68325aa02cae517821e78687b2780')]),  Row(fields=[Row(field=‘body’, type=None, value=’[“First line of text“,”Second line of text”]), Row(field='urlhash', type=None, value='0a0b774c21c6caca977e7821e78687b2780')]), ...

“[Row(fields=[Row(field=....)”重新出现的原因是因为来自不同文件的 JSON 对象被合并到一个列表中。有还有很多我不感兴趣的其他 Row 元素，因此没有包含在示例中。

JSON 对象的结构如下所示：

{
  "fields": [
    {
      "field": "body",
      "value": [
        "Some text",
        "Another line of text",
        "Third line of text."
      ]
    },
    {
      "field": "urlhash",
      "value": "0a0a341e189cf2c002cb83b2dc529fbc454f97cc"
    }
  ],
  "score": 0.87475455,
  "siteId": "9222270286501375973",
  "id": "0a0a341e189cf2c002cb83b2dc529fbc454f97cc"
}

我希望存储每个 url 正文中的所有单词，以便稍后删除停用词并将其提供给 K 最近邻算法。

如何解决为每个 url 存储正文中的单词的问题，最好是作为 tsv 或 csv 列 urlhash 和单词（这是正文中的单词列表）？

【问题讨论】：

标签： json dataframe pyspark row

【解决方案1】：

您可以通过两种方式解决此问题：

您可以explodearray 每行获取一条记录，然后展平嵌套数据框
或直接访问子字段（对于 Spark > 2.X）

让我们从您的示例数据框开始：

from pyspark.sql import Row
from pyspark.sql.types import *
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('field', StringType()), 
        StructField('type', StringType()), 
        StructField('value', StringType())])))])

content = spark.createDataFrame(
    sc.parallelize([
        Row(
            fields=[
                Row(
                    field='body', 
                    type=None, 
                    value='["First line of text","Second line of text"]'), 
                Row(
                    field='urlhash', 
                    type=None, 
                    value='0a0b774c21c68325aa02cae517821e78687b2780')]), 
        Row(
            fields=[
                Row(
                    field='body', 
                    type=None, 
                    value='["First line of text","Second line of text"]'), 
                Row(
                    field='urlhash', 
                    type=None, 
                    value='0a0b774c21c6caca977e7821e78687b2780')])]), schema=schema)
content.printSchema()

    root
     |-- fields: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- field: string (nullable = true)
     |    |    |-- type: string (nullable = true)
     |    |    |-- value: string (nullable = true)

1.展开和展平

可以使用. 访问嵌套数据框的字段，* 允许您展平所有嵌套字段并将它们带到root 级别。

import pyspark.sql.functions as psf
content \
    .select(psf.explode('fields').alias('tmp')) \
    .select('tmp.*') \
    .show()

    +-------+----+--------------------+
    |  field|type|               value|
    +-------+----+--------------------+
    |   body|null|["First line of t...|
    |urlhash|null|0a0b774c21c68325a...|
    |   body|null|["First line of t...|
    |urlhash|null|0a0b774c21c6caca9...|
    +-------+----+--------------------+

    root
     |-- field: string (nullable = true)
     |-- type: string (nullable = true)
     |-- value: string (nullable = true)

2。直接访问子字段

在更高版本的 Spark 中，您可以访问嵌套的 StructTypes 字段，即使它们包含在 ArrayType 中。您最终会得到子字段值的ArrayType。

content \
    .select('fields.field') \
    .show()

    +---------------+
    |          field|
    +---------------+
    |[body, urlhash]|
    |[body, urlhash]|
    +---------------+

    root
     |-- field: array (nullable = true)
     |    |-- element: string (containsNull = true)

【讨论】：

天哪，谢谢。我花了一天时间划分文件的子列。你拯救了我的一天。