【Question title】: How to index Spark CoreNLP analysis?
【Posted】: 2017-09-11 14:28:24
【Question description】:

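I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER analysis and found that it works well. However, I want to extend the simple example so that the analysis can be mapped back to the id of the original DataFrame row. See below: I have added two more rows to the simple example.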

val input = Seq(
  (1, "<xml>Apple is located in California. It is a great company.</xml>"),
  (2, "<xml>Google is located in California. It is a great company.</xml>"),
  (3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")

input.show()

input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|<xml>Apple is loc...|
|  2|<xml>Google is lo...|
|  3|<xml>Netflix is l...|
+---+--------------------+

I can then run this DataFrame through the Spark CoreNLP wrapper to do both sentiment and NER analysis.

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

However, in the output below I have lost the connection back to the row ids of the original DataFrame.

+--------------------+--------------------+--------------------+---------+
|                 sen|               words|             nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--------------------+--------------------+--------------------+---------+

Ideally, I would like the following:

+--+---------------------+--------------------+--------------------+---------+
|id|                  sen|               words|             nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
| 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
| 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
| 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--+---------------------+--------------------+--------------------+---------+

I have tried creating a UDF, but could not get it to work.

【Comments on the question】:

    Tags: scala apache-spark stanford-nlp


    【Solution 1】:

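    Using the UDFs defined in the Stanford CoreNLP wrapper for Apache Spark, you can produce the desired output with the following code. Unlike select, withColumn keeps the existing columns, so id survives every step: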

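    import org.apache.spark.sql.functions._
    import com.databricks.spark.corenlp.functions._
    import spark.implicits._  // enables 'col syntax; assumes a SparkSession named spark

    // withColumn appends to the existing columns, so id is carried along
    val output = input
      .withColumn("doc", cleanxml('text))
      .withColumn("sen", explode(ssplit('doc)))  // one row per sentence
      .withColumn("words", tokenize('sen))
      .withColumn("nerTags", ner('sen))
      .withColumn("sentiment", sentiment('sen))
      .drop("text", "doc")

    output.show()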
    

    This produces the following DataFrame:

    +--+---------------------+--------------------+--------------------+---------+
    |id|                  sen|               words|             nerTags|sentiment|
    +--+---------------------+--------------------+--------------------+---------+
    | 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
    | 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
    | 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
    | 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
    | 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
    | 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
    +--+---------------------+--------------------+--------------------+---------+
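
The same result should also be reachable with the select chain from the question, as long as id is listed explicitly in each select next to the exploded column. Below is a minimal sketch of that pattern. It needs only Spark, not the CoreNLP wrapper: the built-in split on "whitespace after a period" is merely a stand-in for ssplit, and the object name KeepIdExample and the helper sentencesWithId are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object KeepIdExample {
  // Hypothetical helper: one output row per sentence, each still carrying its source id.
  // split() on "whitespace after a period" is only a stand-in for CoreNLP's ssplit.
  def sentencesWithId(df: DataFrame): DataFrame =
    df.select(
      col("id"),
      explode(split(col("text"), "(?<=\\.)\\s+")).as("sen")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("keep-id-example")
      .getOrCreate()
    import spark.implicits._

    val input = Seq(
      (1, "Apple is located in California. It is a great company."),
      (2, "Google is located in California. It is a great company.")
    ).toDF("id", "text")

    // Each sentence row is printed together with the id of the row it came from
    sentencesWithId(input).show(false)
    spark.stop()
  }
}
```

With the wrapper on the classpath, the analogous change to the question's code would be to add 'id to each of its three select calls.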
    

    【Discussion】:
