【Question title】: How do I explode multiple columns of arrays in a Spark Scala dataframe when the columns contain arrays that line up with one another?
【Posted】: 2020-05-20 17:39:01
【Question】:

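I had some trouble phrasing this question, but I'll try to explain. I know how to explode a single array column, but I have several array columns whose arrays line up with one another by index. In my dataframe, exploding each column just performs a useless cross join, producing dozens of invalid rows. So I'll start by showing the data.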

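This shows some results from SparkNLP: a piece of text and four sets of text features. Each of the columns tr through nr contains an array, and each of those arrays is aligned with the others.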

+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|ID|                 text|                   tr|                    lr|                  pr|                  nr|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+

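What I want is a new dataframe that contains ID and text plus the i-th item from every array on a single row. For the dataframe above, it would look like this: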

+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|ID|                 text|                   tr|                    lr|                  pr|                  nr| token|  lemma|pos|  ner|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]| thing|  thing| NN|    O|
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]|     :|      :|  :|    O|
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]|MacKay| MacKay|NNP|I-PER|
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]| rolls|   roll|NNS|    O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| thing|  thing| NN|    O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|     :|      :|  :|    O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|MacKay| MacKay|NNP|I-PER|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|  roll|   roll|NNS|    O|
|11|...
...
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|     *|      *| NN|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|     I|      I|PRP|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| would|  would| MD|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|  like|   like| VB|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|    to|    ...|...|    O|
|12|...
...
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+

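I don't need the tr through nr columns in the output, but I kept them here for clarity.

Is there a way to do this?

Also, is there a way to extract the array index at the same time (added to the output row)?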

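【Question comments】:

  • You can use the Spark SQL posexplode function.
  • @Vamsi Prabhala, the result of posexplode is not what I want, unless I coded it incorrectly. For the first row in my example it produces four rows, one per column, and the index identifies the column rather than the position within the array (0 for tr, 1 for lr, and so on). I used ann3.select($"ID", $"text", posexplode(array($"tr", $"lr", $"pr", $"nr")))
  • You have to posexplode each column and join the results on the index.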
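
The last comment's suggestion (posexplode each column separately, then join on the index) could look roughly like the sketch below. It is only an illustration: the object name, the `idx` column name, and the sample row are assumptions, and it assumes `ID` is unique per input row and that the four columns are arrays of strings. The index is called `idx` rather than posexplode's default `pos`, because the desired output already uses `pos` for the part-of-speech tag.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, posexplode}

object PosexplodeJoinSketch {
  // Explode each aligned array column on its own, keeping the element index,
  // then join the four results back together on (ID, idx).
  def explodeAligned(df: DataFrame): DataFrame = {
    val tokens = df.select(col("ID"), col("text"), posexplode(col("tr")).as(Seq("idx", "token")))
    val lemmas = df.select(col("ID"), posexplode(col("lr")).as(Seq("idx", "lemma")))
    val tags   = df.select(col("ID"), posexplode(col("pr")).as(Seq("idx", "pos")))
    val ners   = df.select(col("ID"), posexplode(col("nr")).as(Seq("idx", "ner")))

    tokens
      .join(lemmas, Seq("ID", "idx"))
      .join(tags, Seq("ID", "idx"))
      .join(ners, Seq("ID", "idx"))
      .select("ID", "text", "idx", "token", "lemma", "pos", "ner")
      .orderBy("ID", "idx")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("posexplode-join").getOrCreate()
    import spark.implicits._

    // Hypothetical sample row modelled on ID 10 from the question.
    val df = Seq(
      (10, "thing: MacKay rolls",
        Seq("thing", ":", "MacKay", "rolls"),
        Seq("thing", ":", "MacKay", "roll"),
        Seq("NN", ":", "NNP", "NNS"),
        Seq("O", "O", "I-PER", "O"))
    ).toDF("ID", "text", "tr", "lr", "pr", "nr")

    explodeAligned(df).show(false)
    spark.stop()
  }
}
```

The three joins make this heavier than it needs to be on large data, so it is mainly useful for seeing what the comment means.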

Tags: scala dataframe apache-spark


【Solution 1】:

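In this case, what you want to do is explode the individual columns using withColumn expressions. Suppose you have loaded your dataset as an initial dataframe df. You would then do something like the following.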

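      import org.apache.spark.sql.functions.explode
      import spark.implicits._  // enables $"..."; assumes a SparkSession named spark

      // df is the initial dataframe, loaded however your dataset requires
      // Each withColumn replaces an array column with one row per element.
      val df1 = df.select($"ID", $"text", $"tr", $"lr", $"pr", $"nr")
        .withColumn("tr", explode($"tr"))
        .withColumn("lr", explode($"lr"))
        .withColumn("pr", explode($"pr"))
        .withColumn("nr", explode($"nr"))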

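This adds the array values to the records, each tagged with its ID and text. One drawback of this approach is that it increases the number of records and duplicates the non-array columns. Note also that each explode is applied independently, so chaining four of them yields every combination of elements across the four arrays (the cross join described in the question) rather than one row per array index.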
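
For the index-aligned output the question asks for, one alternative is to zip the four arrays element-wise with arrays_zip (available since Spark 2.4) and posexplode the zipped array once, which gives one row per index along with the index itself. The following is a minimal sketch and not the method of the answer above; the object name, the `idx` and `z` column names, and the sample row are assumptions for illustration, and it assumes the four columns are arrays of strings. The cast is there only to rename the struct fields by position instead of relying on the field names arrays_zip generates.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col, posexplode}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object ZipExplodeSketch {
  // One struct per array index, using the output names from the question.
  private val elemType = StructType(
    Seq("token", "lemma", "pos", "ner").map(name => StructField(name, StringType)))

  def explodeAligned(df: DataFrame): DataFrame = {
    // Zip the aligned arrays into a single array of structs, renaming fields by position.
    val zipped = arrays_zip(col("tr"), col("lr"), col("pr"), col("nr"))
      .cast(ArrayType(elemType))

    // posexplode yields two columns: the element index and the struct at that index.
    df.select(col("ID"), col("text"), posexplode(zipped).as(Seq("idx", "z")))
      .select(col("ID"), col("text"), col("idx"),
        col("z.token"), col("z.lemma"), col("z.pos"), col("z.ner"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("zip-explode").getOrCreate()
    import spark.implicits._

    // Hypothetical sample row modelled on ID 10 from the question.
    val df = Seq(
      (10, "thing: MacKay rolls",
        Seq("thing", ":", "MacKay", "rolls"),
        Seq("thing", ":", "MacKay", "roll"),
        Seq("NN", ":", "NNP", "NNS"),
        Seq("O", "O", "I-PER", "O"))
    ).toDF("ID", "text", "tr", "lr", "pr", "nr")

    explodeAligned(df).show(false)
    spark.stop()
  }
}
```

If the arrays in a row differ in length, arrays_zip pads the shorter ones with nulls, and posexplode drops rows whose zipped array is null or empty; posexplode_outer would keep those rows.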

【Discussion】:
