How to index Spark CoreNLP analysis? Answers

【Question title】: How to index Spark CoreNLP analysis?
【Posted】: 2017-09-11 14:28:24
【Question description】:

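I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER (named entity recognition) analysis, and it works well. However, I want to extend the simple example so that the analysis can be mapped back to the id of the original DataFrame row. See below, where I have added two more rows to the simple example.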

val input = Seq(
  (1, "<xml>Apple is located in California. It is a great company.</xml>"),
  (2, "<xml>Google is located in California. It is a great company.</xml>"),
  (3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")

input.show()

input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|<xml>Apple is loc...|
|  2|<xml>Google is lo...|
|  3|<xml>Netflix is l...|
+---+--------------------+

I can then run this DataFrame through the Spark CoreNLP wrapper to do both sentiment and NER analysis.

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

However, in the output below I have lost the connection to the row ids of the original DataFrame.

+--------------------+--------------------+--------------------+---------+
|                 sen|               words|             nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--------------------+--------------------+--------------------+---------+
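
The id is lost because the first select keeps only the derived doc column, so no later stage can see id. The effect can be modeled with plain Scala collections, without Spark. In this sketch the regex split is only a naive stand-in for ssplit, and IdLossDemo, splitSentences, withoutId and withId are hypothetical names that are not part of the wrapper.

```scala
// Plain-collections model of the pipeline; no Spark or CoreNLP involved.
object IdLossDemo {
  val rows: Seq[(Int, String)] = Seq(
    (1, "Apple is located in California. It is a great company."),
    (2, "Google is located in California. It is a great company.")
  )

  // Naive stand-in for ssplit: break after a period that is followed by whitespace.
  def splitSentences(text: String): Seq[String] =
    text.split("(?<=\\.)\\s+").toSeq

  // Like select(explode(ssplit('doc))): only the sentences survive, the id is dropped.
  val withoutId: Seq[String] =
    rows.flatMap { case (_, text) => splitSentences(text) }

  // Carrying the key through the explode step keeps each sentence tied to its row.
  val withId: Seq[(Int, String)] =
    rows.flatMap { case (id, text) => splitSentences(text).map(s => (id, s)) }
}
```

Here withoutId is a flat list of sentences with no way back to the source row, while withId pairs every sentence with the id it came from.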

Ideally, I would like the following:

+--+---------------------+--------------------+--------------------+---------+
|id|                  sen|               words|             nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
| 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
| 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
| 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--+---------------------+--------------------+--------------------+---------+

I have tried creating a UDF, but could not get it to work.

【Question discussion】:

Tags: scala apache-spark stanford-nlp

【Solution 1】:

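Using the UDFs defined in the Stanford CoreNLP wrapper for Apache Spark, you can generate the desired output with the following code. Because withColumn adds columns instead of replacing the projection, id is carried through every step: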

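import org.apache.spark.sql.functions.explode
import com.databricks.spark.corenlp.functions._

val output = input
  .withColumn("doc", cleanxml('text))
  .withColumn("sen", explode(ssplit('doc)))  // one row per sentence, id is kept
  .withColumn("words", tokenize('sen))
  .withColumn("nerTags", ner('sen))          // withColumn's name overrides any .as alias
  .withColumn("sentiment", sentiment('sen))
  .drop("text", "doc")                       // remove the intermediate columns

// show() returns Unit, so call it separately to keep output a DataFrame
output.show()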

This produces the following DataFrame:

+--+---------------------+--------------------+--------------------+---------+
|id|                  sen|               words|             nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
| 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
| 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
| 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--+---------------------+--------------------+--------------------+---------+
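
The same columns can likely also be obtained while keeping the question's chain of select calls, by listing id explicitly in every projection. The following is a minimal sketch under that assumption. SentenceAnnotations and annotateWithId are hypothetical names, and the sketch assumes the wrapper's UDFs are imported from com.databricks.spark.corenlp.functions and that the input has columns named id and text.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import com.databricks.spark.corenlp.functions._

object SentenceAnnotations {
  // Hypothetical helper: expects columns "id" and "text" in the input.
  def annotateWithId(input: DataFrame): DataFrame =
    input
      // Keep id next to the cleaned document instead of projecting doc alone.
      .select(col("id"), cleanxml(col("text")).as("doc"))
      // explode yields one row per sentence; id is repeated on each of them.
      .select(col("id"), explode(ssplit(col("doc"))).as("sen"))
      .select(
        col("id"),
        col("sen"),
        tokenize(col("sen")).as("words"),
        ner(col("sen")).as("nerTags"),
        sentiment(col("sen")).as("sentiment")
      )
}
```

Calling SentenceAnnotations.annotateWithId(input).show() should then print one row per sentence with its id, provided the wrapper and the CoreNLP model jars are on the classpath. Compared with withColumn, this form needs no trailing drop, at the cost of repeating id in each select.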

【Discussion】: