How to extract all possible noun phrases from text

【Question title】: How to extract all possible noun phrases from text
【Posted】: 2020-10-28 04:31:38
【Question description】:

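I want to automatically extract certain desired concepts (noun phrases) from text. My plan is to extract all noun phrases, label each one as belonging to one of two classes (desired or undesired phrases), and then train a classifier on them. What I am trying to do now is extract all possible phrases to serve as the training set. For example, given the sentence "Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.", I would like to get every phrase, such as shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams, or any other possible phrase. The desired phrases are shoulder, junctions, and junctions of columns and beams. I do not care about precision at this step, though; I just want to obtain the training set first. Is there a tool available for this kind of task?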

I tried Rake from rake_nltk, but the result did not include the phrases I want (that is, it did not extract all possible phrases):

from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)

Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams'] (junctions of columns and beams is missing here)

I also tried phrasemachine, and its result also missed some of the desired phrases.

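import spacy
import phrasemachine

# The question never shows how `nlp` was created; the model name below is an
# assumption borrowed from the answer further down.
nlp = spacy.load('en_core_web_sm')
# `data` is the sentence defined in the Rake snippet above.
doc = nlp(data)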
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start,end = out['token_spans'].pop()
    print(tokens[start:end])

Result:

[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']

(Many noun phrases are missing here.)
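
As a rough illustration of the exhaustive extraction the question is after, the sketch below enumerates every token span whose POS tags match a simple noun-phrase pattern. The pattern, the tag-to-letter mapping, the function name `candidate_phrases`, and the hand-assigned tags in the example are all assumptions made for illustration. This is not phrasemachine's actual grammar, and in practice the tags would come from a tagger (for example spaCy's `token.pos_`).

```python
import re

# Assumed mapping from coarse universal POS tags to one-letter codes, so that
# a regular expression can be run over the tag sequence as a string.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "N",
             "ADP": "P", "DET": "D", "CCONJ": "C"}

# Hypothetical noun-phrase pattern: optional modifiers, a noun, then any number
# of "preposition/conjunction + optional determiners + modifiers + noun" groups.
NP_PATTERN = re.compile(r"[AN]*N(?:[PC]D*[AN]*N)*")


def candidate_phrases(words, tags):
    """Return every contiguous span whose tag sequence fully matches NP_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "O") for tag in tags)  # "O" = other
    found = set()
    for start in range(len(codes)):
        for end in range(start + 1, len(codes) + 1):
            # fullmatch with pos/endpos tests exactly the span [start, end).
            if NP_PATTERN.fullmatch(codes, start, end):
                found.add(" ".join(words[start:end]))
    return found


# Hand-tagged fragment of the example sentence (tags are an assumption).
words = ["junctions", "of", "columns", "and", "beams"]
tags = ["NOUN", "ADP", "NOUN", "CCONJ", "NOUN"]
print(sorted(candidate_phrases(words, tags)))
```

Because every start/end pair is tested, nested phrases such as columns and columns and beams are returned alongside the longest match. The price is noise: spans like junctions of columns also match, which is acceptable here only because the candidates are filtered by a classifier afterwards.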

【Question discussion】:

Could you explain what you mean by a phrase?
Basically, I mean noun phrases.

Tags: python nlp spacy named-entity-recognition information-extraction

【Solution 1】:

You may want to use the noun_chunks attribute:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

phrases = set() 
for nc in doc.noun_chunks:
    phrases.add(nc.text)
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i+1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
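
The slice `doc[nc.root.left_edge.i:nc.root.right_edge.i+1]` in the snippet above covers the whole syntactic subtree of each chunk's head noun. A minimal, spaCy-free sketch of that idea follows, applied to every noun token rather than only to chunk heads. The function names and the hand-written parse (words, POS tags, head indices) are assumptions for illustration. With spaCy the same information would come from `token.text`, `token.pos_`, and `token.head.i`, and the result still depends on how the parser analyses the sentence.

```python
def subtree_spans(heads):
    """Return the (first, last) token index covered by each token's subtree.

    heads[i] is the index of token i's syntactic head; the root points to itself.
    """
    n = len(heads)
    left = list(range(n))
    right = list(range(n))
    for i in range(n):
        j = i
        # Walk up to the root, widening every ancestor's span to include token i.
        while heads[j] != j:
            j = heads[j]
            left[j] = min(left[j], i)
            right[j] = max(right[j], i)
    return list(zip(left, right))


def noun_subtree_phrases(words, pos, heads):
    """Collect each noun on its own plus the full subtree it heads."""
    spans = subtree_spans(heads)
    phrases = set()
    for i, tag in enumerate(pos):
        if tag in {"NOUN", "PROPN"}:
            start, end = spans[i]
            phrases.add(words[i])
            phrases.add(" ".join(words[start:end + 1]))
    return phrases


# Hand-written parse of a fragment of the example sentence (an assumption):
# "of" attaches to "junctions", "columns" to "of", "and"/"beams" to "columns".
words = ["junctions", "of", "columns", "and", "beams"]
pos = ["NOUN", "ADP", "NOUN", "CCONJ", "NOUN"]
heads = [0, 0, 1, 2, 2]
print(sorted(noun_subtree_phrases(words, pos, heads)))
```

Determiners are kept whenever they are part of a subtree, just as in the output above ("a shoulder", "these junctions"); they can be stripped afterwards if the bare noun phrase is wanted.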

【Discussion】:

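Thank you for your answer, Sergey! It turns out, though, that some noun phrases are still missed. For example, for the sentence "Edges and breaks in slabs and walls, including soffits of openings and sides of kerbs.", the phrase "edges and breaks in slabs and walls" is missed. Do you know how to improve on this?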