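Authors: Florian Boudin and Emmanuel Morin
Source: NAACL-HLT 2013
Overview:
This paper extends Filippova's (2010) word graph-based multi-sentence compression (MSC) method with a reranking step, so that the compression containing the most relevant keyphrases is selected.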
Resources:
1. Code: https://github.com/boudinfl/takahe
2. Dataset: https://github.com/boudinfl/lina-msc
Related work:
1. Multi-sentence compression
a) Use a syntactic parser (control the grammaticality of the output)
b) Word graph-based approaches that only require a POS tagger (the key assumption is that redundancy provides a reliable way of generating grammatical sentences)
2. Keyphrase extraction
Supervised: treats keyphrase extraction as a binary classification problem. Drawbacks: the need for training data; the bias towards the training domain
Unsupervised: a) language modeling. b) graph-based ranking. c) clustering
Model:
Given a set of redundant sentences, a word graph is constructed by adding the sentences to it one at a time. The best compression is obtained by finding the shortest path in the word graph. The original algorithm is described in:
Katja Filippova. Multi-Sentence Compression: Finding Shortest Paths in Word Graphs. COLING 2010.
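The construction and search described above can be sketched as follows. This is a simplified illustration, not Filippova's exact method: nodes are lowercased words rather than (word, POS) pairs, the edge cost 1 / count is an assumed stand-in for the original weighting, and the names `build_word_graph` and `shortest_path` are hypothetical. It also assumes no word repeats inside a sentence.

```python
import heapq
from collections import defaultdict

START, END = "<s>", "</s>"

def build_word_graph(sentences):
    """Merge tokenized sentences into one directed graph.

    Returns {(word_a, word_b): number of sentences containing that bigram}.
    Identical words share a node (simplifying assumption).
    """
    counts = defaultdict(int)
    for tokens in sentences:
        path = [START] + [t.lower() for t in tokens] + [END]
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1
    return counts

def shortest_path(counts):
    """Dijkstra from START to END; frequent edges are cheap (cost = 1 / count)."""
    adj = defaultdict(list)
    for (a, b), c in counts.items():
        adj[a].append((b, 1.0 / c))
    heap = [(0.0, [START])]
    seen = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == END:
            return cost, path[1:-1]  # drop the START/END markers
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in adj[node]:
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, path + [nxt]))
    return None

sentences = [
    ["obama", "visited", "paris", "on", "monday"],
    ["obama", "visited", "paris"],
    ["president", "obama", "visited", "france"],
]
cost, compression = shortest_path(build_word_graph(sentences))
print(" ".join(compression))
```

Because redundant word sequences accumulate higher counts, their edges become cheaper, so the shortest path tends to follow the wording that most sentences agree on.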
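A keyphrase-based reranking method can then be applied to generate more informative compressions.
Step 1: Use TextRank to compute a salience score for each node.
Step 2: Generate keyphrase candidates and compute a score for each one.
Step 3: Consider more candidate paths than (Filippova, 2010) and rerank them by the final score computed for each sentence compression c.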
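The three steps can be sketched roughly as below. Several things here are assumptions and should be checked against the paper: the co-occurrence graph is passed in as a plain dict, keyphrase candidates are supplied by the caller, a keyphrase score is taken to be the sum of its words' salience divided by (length + 1), and the final score of a compression is taken to be its path cost divided by (length × sum of the scores of the keyphrases it contains), with lower being better. All function names are hypothetical.

```python
from collections import defaultdict

def textrank(edges, d=0.85, iters=50):
    """Step 1: weighted TextRank over an undirected graph {(u, v): weight}."""
    nbrs = defaultdict(dict)
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    score = {n: 1.0 for n in nbrs}
    for _ in range(iters):
        # Each neighbor m passes on its score in proportion to the edge weight.
        score = {
            n: (1 - d) + d * sum(
                w / sum(nbrs[m].values()) * score[m] for m, w in nbrs[n].items()
            )
            for n in nbrs
        }
    return score

def keyphrase_score(phrase, salience):
    """Step 2: sum of word saliences, normalized by length + 1 (assumed form)."""
    return sum(salience.get(w, 0.0) for w in phrase) / (len(phrase) + 1)

def contains(tokens, phrase):
    """True if `phrase` occurs as a contiguous subsequence of `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def rerank(candidates, keyphrases, salience):
    """Step 3: rescore (path_cost, tokens) candidates; lower final score is better."""
    rescored = []
    for cost, tokens in candidates:
        kp = sum(keyphrase_score(k, salience) for k in keyphrases if contains(tokens, k))
        final = cost / (len(tokens) * kp) if kp > 0 else float("inf")
        rescored.append((final, tokens))
    return sorted(rescored, key=lambda x: x[0])

edges = {("obama", "visited"): 3, ("visited", "paris"): 2, ("visited", "france"): 1}
salience = textrank(edges)
keyphrases = [["obama"], ["paris"], ["france"]]
candidates = [
    (1.0, ["obama", "visited"]),            # cheapest path, but few keyphrases
    (2.3, ["obama", "visited", "paris"]),
    (2.8, ["obama", "visited", "france"]),
]
ranked = rerank(candidates, keyphrases, salience)
print(" ".join(ranked[0][1]))
```

In this toy input the cheapest path is not ranked first after rescoring: the longer candidate covers more salient keyphrases, which is the effect the reranking step is meant to achieve.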
When computing ROUGE scores, stopwords were removed and words were stemmed: http://snowballstem.org/
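The preprocessing can be illustrated with a minimal unigram-recall sketch. The stopword list is a small illustrative subset, `crude_stem` is a hypothetical placeholder that only strips a few English suffixes (it is not the Snowball stemmer), and `rouge1_recall` is a simplified stand-in for the ROUGE toolkit.

```python
from collections import Counter

# Illustrative subset only; a real evaluation would use a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was"}

def crude_stem(word):
    """Placeholder stemmer: strips a few English suffixes. Not Snowball."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, stem=crude_stem):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]

def rouge1_recall(candidate, reference, stem=crude_stem):
    """Fraction of reference unigrams (after preprocessing) found in the candidate."""
    cand = Counter(preprocess(candidate, stem))
    ref = Counter(preprocess(reference, stem))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())

print(rouge1_recall("Obama visits Paris", "Obama visited the city of Paris"))
```

Passing a different `stem` callable makes it straightforward to swap in a real stemmer without changing the scoring code.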
Related articles: