How do I evaluate a text summarization tool? Answers

【Question title】: How do I evaluate a text summarization tool?
【Posted】: 2012-04-10 09:05:15
【Question description】:

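I have written a system that summarizes long documents containing thousands of words. Are there any norms for how such a system should be evaluated in the context of a user survey?

In short, is there a metric for the time my tool saves a human reader? Currently, I am thinking of using (time taken to read the original document / time taken to read the summary) to determine the time saved, but are there better metrics?

Currently, I am asking the users subjective questions about the accuracy of the summary.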

【Question discussion】:

Tags: language-agnostic nlp information-retrieval evaluation

【Solution 1】:

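In general:

Bleu measures precision: how much the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.

Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.

Naturally, these results complement each other, as is often the case with precision versus recall. If many words/n-grams from the system results appear in the human references, Bleu will be high; if many words/n-grams from the human references appear in the system results, Rouge will be high.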
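The precision/recall distinction above can be sketched in a few lines. This is a minimal illustration, not the official Bleu or Rouge implementation: it assumes whitespace tokenization and a single reference, and the function names and the example sentences are made up for the sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as a Counter (multiset)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_precision_recall(candidate, reference, n=1):
    """Clipped n-gram overlap, divided by the candidate size (precision,
    Bleu-style) and by the reference size (recall, Rouge-style)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps min(count in candidate, count in reference)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

# Hypothetical example: a short candidate that only uses reference words
p, r = overlap_precision_recall("the cat sat", "the cat sat on the mat")
print(p, r)
```

A short candidate built only from reference words gets perfect precision but low recall, which is why the two numbers are usually reported together.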

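There is something called the brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results that are shorter than the general length of a reference (read more about it here). This complements the n-gram metric behavior, which in effect penalizes results longer than the reference, since the denominator grows the longer the system result is.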
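As a rough sketch, the brevity penalty as defined for Bleu is 1 when the candidate is longer than the reference and exp(1 - r/c) otherwise, where c is the candidate length and r the reference length. The function name below is an assumption for illustration; real implementations also decide which reference length to use when there are several references.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Bleu brevity penalty: 1 if the candidate is longer than the
    reference, otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0  # an empty candidate gets no credit
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# A candidate half as long as the reference is scaled down by about 0.37
print(brevity_penalty(5, 10))
```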

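You could also implement something similar for Rouge, but this time penalizing system results that are longer than the general reference length, which would otherwise let them obtain artificially higher Rouge scores (since the longer the result, the higher the chance of hitting some word that appears in the references). In Rouge we divide by the length of the human references, so we would need an additional penalty for longer system results, which could otherwise artificially raise their Rouge score.

Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)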
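A minimal sketch of that combination, treating the Bleu score as the precision term and the Rouge score as the recall term. The input values are hypothetical, and the zero check is an added safeguard rather than part of the formula.

```python
def f1(precision, recall):
    """Harmonic mean of a precision-like and a recall-like score."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores: Bleu (precision) = 1.0, Rouge (recall) = 0.5
print(f1(1.0, 0.5))
```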

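【Discussion】:

You have posted the exact same answer to two questions. If you think one of them is a duplicate of the other, you should flag them as such (and not post the same answer twice).
The answers are not exactly the same, and neither are the questions. It is true that one of the answers contains the other, but I don't see a clear way to merge the two questions.

【Solution 2】: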

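Historically, summarization systems have often been evaluated by comparing them to human-generated reference summaries. In some cases, the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.

Those two techniques are analogous to the two major categories of automatic summarization systems: extractive vs. abstractive (more details are available on Wikipedia).

One standard tool is Rouge, a script (or a set of scripts; I can't remember offhand) that computes n-gram overlap between the automatic summary and a reference summary. Rouge can optionally compute overlap while allowing word insertions or deletions between the two summaries (e.g. if a 2-word skip is allowed, 'installed pumps' would be credited as a match to 'installed defective flood-control pumps').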
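The skip-allowing overlap can be sketched as follows. This is an illustration of the idea only, not the Rouge script itself: `skip_bigrams` and its `max_skip` parameter are names chosen for this sketch, and the two English phrases stand for the example phrases above.

```python
def skip_bigrams(tokens, max_skip):
    """All ordered word pairs with at most `max_skip` words between them."""
    pairs = set()
    for i, first in enumerate(tokens):
        # j runs over the next max_skip + 1 positions after i
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.add((first, tokens[j]))
    return pairs

reference = "installed defective flood-control pumps".split()
candidate = "installed pumps".split()

# With a 2-word skip allowed, the candidate bigram is found in the reference
shared = skip_bigrams(candidate, 2) & skip_bigrams(reference, 2)
print(shared)
```

With `max_skip=1` the pair is no longer found in the reference, because two words separate 'installed' and 'pumps' there.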

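My understanding is that Rouge's n-gram overlap scores correlate fairly well with human evaluation of summaries up to some level of accuracy, but that the relationship may break down as summarization quality improves. That is, beyond some quality threshold, summaries that human evaluators judge to be better may score similarly to, or be outscored by, summaries judged inferior. Nevertheless, Rouge scores can be a helpful first cut at comparing 2 candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.

Your approach of collecting human judgements is probably the best evaluation, if you can afford the time / monetary cost. To add a little rigor to that process, you might look at the scoring criteria used in recent summarization tasks (see the various conferences mentioned by @John Lehmann). The scoresheets used by those evaluators might help guide your own evaluation.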

【Discussion】:

【Solution 3】:

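I'm unsure about the time evaluation, but regarding accuracy you might consult the literature under the topic Automatic Document Summarization. The primary evaluation was the Document Understanding Conference (DUC) until the summarization task was moved into the Text Analysis Conference (TAC) in 2008. Most of these focus on advanced summarization topics such as multi-document, multi-lingual, and update summaries.

You can find the evaluation guidelines for each of these events posted online. For single-document summarization tasks, look at DUC 2002-2004.

Or, you might consult the ADS evaluation section in Wikipedia.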

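【Discussion】:

Thanks for sharing. You mentioned that the summarization task was moved to the Text Retrieval Conference (TREC) in 2008, but the link you gave points to TAC (Text Analysis Conference). I also can't find a summarization task at trec.nist.gov/data.html (other than the Temporal Summarization task).
Thanks, I've fixed it.

【Solution 4】: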

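BLEU

Bleu measures precision
Bilingual Evaluation Understudy
Originally for machine translation (bilingual)
W(machine-generated summary) in W(human reference summary)
That is, how much the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries
The closer a machine translation is to a professional human translation, the better it is

ROUGE

Rouge measures recall
Recall-Oriented Understudy for Gisting Evaluation - W(human reference summary) in W(machine-generated summary)
That is, how much the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.

Overlap of n-grams between the system and reference summaries - Rouge-N, where N is the length of the n-gram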

reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. 
Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""

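Abstractive summarization

   # Abstractive summarization with the Hugging Face transformers pipeline
   from transformers import pipeline

   # Word count of the source text, for comparison with the summary length
   len(reference_text.split())

   # pipeline("summarization") loads the library's default summarization model
   summarization = pipeline("summarization")
   abstractve_summarization = summarization(reference_text)[0]["summary_text"]

Abstractive output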

   In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)

Extractive summarization

   # Extractive summarization with sumy's LexRank implementation
   from sumy.parsers.plaintext import PlaintextParser
   from sumy.nlp.tokenizers import Tokenizer
   from sumy.summarizers.lex_rank import LexRankSummarizer

   # Tokenizer("english") relies on NLTK's punkt data being installed
   parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
   summarizer = LexRankSummarizer()
   # Keep the 2 highest-ranked sentences
   extractve_summarization = summarizer(parser.document, 2)
   # Join the selected sentences into a single string
   extractve_summarization = ' '.join(str(s) for s in extractve_summarization)

Extractive output

Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.

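Evaluating the abstractive summary with ROUGE

  from rouge import Rouge
  r = Rouge()
  # get_scores(hypothesis, reference); here the full source text serves as the reference
  r.get_scores(abstractve_summarization, reference_text)

ROUGE output for the abstractive summary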

  [{'rouge-1': {'f': 0.22299651364421083,
  'p': 0.9696969696969697,
  'r': 0.12598425196850394},
  'rouge-2': {'f': 0.21328671127225052,
  'p': 0.9384615384615385,
  'r': 0.1203155818540434},
  'rouge-l': {'f': 0.29041095634452996,
  'p': 0.9636363636363636,
  'r': 0.17096774193548386}}]

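Evaluating the extractive summary with ROUGE

  from rouge import Rouge
  r = Rouge()
  # get_scores(hypothesis, reference); here the full source text serves as the reference
  r.get_scores(extractve_summarization, reference_text)

ROUGE output for the extractive summary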

  [{'rouge-1': {'f': 0.27860696251962963,
  'p': 0.8842105263157894,
  'r': 0.16535433070866143},
  'rouge-2': {'f': 0.22296172781038814,
  'p': 0.7127659574468085,
  'r': 0.13214990138067062},
  'rouge-l': {'f': 0.354755780824869,
  'p': 0.8734177215189873,
  'r': 0.22258064516129034}}]

Interpreting ROUGE scores

ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:
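One way to write the ROUGE-N ratio, reconstructed here from the verbal description of its numerator and denominator. The symbols are assumptions for this sketch: r ranges over the reference summaries, s over the n-grams of one reference, c is the candidate summary, match counts the occurrences of an n-gram that are also found in the candidate, and count counts its occurrences in the reference.

```latex
\text{ROUGE-}N =
  \frac{\sum_{r} \sum_{s} \operatorname{match}(\text{gram}_{s,c})}
       {\sum_{r} \sum_{s} \operatorname{count}(\text{gram}_{s})}
```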

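I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, i.e. bigram matches. The numerator ∑s loops through all bigrams in a single reference summary and counts the number of times a matching bigram is found in the candidate summary (the one proposed by the summarization algorithm). If there is more than one reference summary, ∑r ensures we repeat the process over all reference summaries.

The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents and average all the scores, which gives you a ROUGE-N score. So a higher score means that, on average, there is a high overlap of n-grams between your summaries and the references.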

   Example:

   S1. police killed the gunman
   
   S2. police kill the gunman
   
   S3. the gunman kill police

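S1 is the reference, and S2 and S3 are candidates. Note that S2 and S3 each have one overlapping bigram with the reference, so they have the same ROUGE-2 score, even though S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and the last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram, so it scores 2/4.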
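The S1/S2/S3 numbers can be checked with a small sketch. It is a simplified, recall-only version written for this example (whitespace tokens, one reference, no clipping of repeated bigrams), not the `rouge` package's implementation; the function names are assumptions.

```python
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge_2_recall(candidate, reference):
    """Fraction of reference bigrams that also occur in the candidate."""
    cand = set(bigrams(candidate.split()))
    ref = bigrams(reference.split())
    return sum(1 for b in ref if b in cand) / len(ref)

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the number of reference words."""
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref)

s1 = "police killed the gunman"   # reference
s2 = "police kill the gunman"     # candidate
s3 = "the gunman kill police"     # candidate

for name, cand in (("S2", s2), ("S3", s3)):
    print(name, rouge_2_recall(cand, s1), rouge_l_recall(cand, s1))
```

Both candidates share one of the three reference bigrams, so ROUGE-2 cannot separate them, while the LCS-based score gives 3/4 for S2 and 2/4 for S3.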

【Discussion】:

【Solution 5】:

There is also the recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations), which does not suffer from the well-known issues of ROUGE and BLEU.

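Abstract from the paper:

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.

Paper: https://arxiv.org/pdf/1904.09675.pdf
Code: https://github.com/Tiiiger/bert_score
Full reference:

Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "BERTScore: Evaluating text generation with BERT." arXiv preprint arXiv:1904.09675 (2019).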
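To make the "similarity instead of exact match" idea concrete, here is a toy sketch of greedy matching over token vectors. It is not the bert_score package: the 2-d vectors are hypothetical stand-ins for contextual embeddings, and refinements described in the paper, such as idf weighting and baseline rescaling, are left out.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    """Pair each token with its most similar token on the other side,
    then average: over candidate tokens (precision) and over
    reference tokens (recall)."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical 2-d "embeddings"; a real run would take them from a BERT model
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_match_scores(candidate_vecs, reference_vecs)
print(p, r, f)
```

The third reference token has no exact counterpart in the candidate, yet it still earns partial credit from its nearest neighbour, which an exact n-gram match would not give.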

【Discussion】:

Just wondering, how can we evaluate whether BERTScore is better than ROUGE? Is there a proper way to do so?

【Solution 6】:

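You can evaluate your summarization system on a number of parameters, such as: Precision = number of important sentences in the summary / total number of sentences in the summary. Recall = number of important sentences retrieved / total number of important sentences present.

F-score = 2 * (Precision * Recall) / (Precision + Recall). Compression rate = total number of words in the summary / total number of words in the original document.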
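These ratios can be sketched as below. The sketch assumes that the set of important sentences comes from human annotation (gold labels), which the answer does not specify; the function names and the index values are hypothetical.

```python
def sentence_level_scores(selected, important):
    """selected: indices of the sentences the system put in the summary.
    important: indices a human annotator marked as important (assumed gold labels)."""
    selected, important = set(selected), set(important)
    hits = len(selected & important)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(important) if important else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def compression_rate(summary, original):
    """Words in the summary divided by words in the original document."""
    return len(summary.split()) / len(original.split())

# Hypothetical labels: 4 sentences selected, 5 marked important, 3 in common
p, r, f = sentence_level_scores(selected=[0, 3, 7, 9], important=[0, 2, 3, 5, 7])
print(p, r, f)
```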

【Discussion】:

How does the program find the number of important sentences, etc.?

【Solution 7】:

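When you evaluate an automatic summarization system, you typically look at the content of the summary rather than time.

Your idea:

(time taken to read the original document / time taken to read the summary)

doesn't tell you much about your summarization system; it really only gives you an idea of the system's compression rate (i.e. the summary is 10% of the original document).

You may want to consider the time it takes your system to summarize a document vs. the time it would take a human (system: 2 seconds, human: 10 minutes).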

【Discussion】: