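【Posted】: 2020-05-20 17:39:01
【Question】:
I had some trouble phrasing this question, but I'll try to explain. I understand how to explode a single array column, but I have multiple array columns whose arrays are aligned with one another by index. In my DataFrame, exploding each column separately essentially just performs a useless cross join, producing dozens of invalid rows. So I'll start by showing the data.
This shows some results from SparkNLP: some text and four sets of text features. Each column from tr through nr contains an array, and each of these arrays is aligned with the others.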
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|ID| text| tr| lr| pr| nr|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
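What I want is a new DataFrame that contains ID and text plus, on a single row, the i-th item of every array. For the DataFrame above it would look like this: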
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|ID| text| tr| lr| pr| nr| token| lemma|pos| ner|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]| thing| thing| NN| O|
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]| :| :| :| O|
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]|MacKay| MacKay|NNP|I-PER|
|10| thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...| [NN, :, NNP, NNS]| [O, O, I-PER, O]| rolls| roll|NNS| O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| thing| thing| NN| O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| :| :| :| O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|MacKay| MacKay|NNP|I-PER|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| roll| roll|NNS| O|
|11|...
...
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| *| *| NN| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| I| I|PRP| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| would| would| MD| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| like| like| VB| O|
|12| * I would like to...| [*, I, would, lik...| [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| to| ...|...| O|
|12|...
...
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
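I don't need the tr through nr columns in the output, but I kept them here for clarity.
Is there a way to do this?
Also, is there a way to extract the array index at the same time (added to each output row)?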
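One possible way to get this shape, sketched under assumptions rather than taken from the thread: zip the four aligned arrays into a single array of structs with `arrays_zip` (available from Spark 2.4), then `posexplode` that one array, so each output row carries the index and the i-th item of every array. The object name, the helper `zipExplode`, the `idx` and `z` aliases, and the sample row are all hypothetical; the struct field names are assumed to follow the input column names, which can depend on the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col, posexplode}

object ZipExplodeSketch {
  // Zip the aligned arrays element-wise, then explode once with the position.
  def zipExplode(df: DataFrame): DataFrame =
    df.select(
        col("ID"), col("text"),
        // posexplode yields two columns: the array index and the zipped struct.
        posexplode(arrays_zip(col("tr"), col("lr"), col("pr"), col("nr")))
          .as(Seq("idx", "z")))
      .select(
        col("ID"), col("text"), col("idx"),
        // Assumption: struct fields are named after the source columns.
        col("z.tr").as("token"), col("z.lr").as("lemma"),
        col("z.pr").as("pos"), col("z.nr").as("ner"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("zip-explode-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the first row of the question's data.
    val ann3 = Seq(
      (10, "thing: MacKay rolls",
        Seq("thing", ":", "MacKay", "rolls"),
        Seq("thing", ":", "MacKay", "roll"),
        Seq("NN", ":", "NNP", "NNS"),
        Seq("O", "O", "I-PER", "O"))
    ).toDF("ID", "text", "tr", "lr", "pr", "nr")

    zipExplode(ann3).show(false)
    spark.stop()
  }
}
```

Because only one generator runs per input row, there is no cross join between the arrays, and `idx` answers the second part of the question.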
【Comments】:
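- You can use the Spark SQL posexplode function.
- @Vamsi Prabhala, the result of posexplode isn't what I want, unless I coded it incorrectly. For the first row in my example it produces four rows, one per column, and the index identifies the column rather than the position within the array (0 for tr, 1 for lr, and so on). I used ann3.select($"ID", $"text", posexplode(array($"tr", $"lr", $"pr", $"nr")))
- You have to posexplode each column and join the results on the index.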
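The last comment's suggestion could be sketched roughly as follows. This is only an illustration of "posexplode each column and join on the index", not code from the thread: the object name, the helpers `explodeWithIndex` and `explodeAndJoin`, and the `idx` alias are hypothetical, and the join assumes that `ID` is unique per input row.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, posexplode}

object PosexplodeJoinSketch {
  // Explode one array column into (ID, idx, <outName>) rows.
  def explodeWithIndex(df: DataFrame, arrayCol: String, outName: String): DataFrame =
    df.select(col("ID"), posexplode(col(arrayCol)).as(Seq("idx", outName)))

  // Explode every array column separately, then join the pieces on (ID, idx).
  // Assumption: ID uniquely identifies an input row.
  def explodeAndJoin(df: DataFrame): DataFrame = {
    val parts = Seq("tr" -> "token", "lr" -> "lemma", "pr" -> "pos", "nr" -> "ner")
      .map { case (src, out) => explodeWithIndex(df, src, out) }
    val joined = parts.reduce((l, r) => l.join(r, Seq("ID", "idx")))
    // Re-attach the text column and restore a stable row order.
    df.select("ID", "text").join(joined, Seq("ID")).orderBy("ID", "idx")
  }
}
```

Calling `PosexplodeJoinSketch.explodeAndJoin(ann3)` should give one row per (ID, index) pair. It needs several joins where a single explode over zipped arrays needs none, so it is likely the heavier option on large data.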
Tags: scala dataframe apache-spark