[Posted]: 2020-10-28 04:31:38
[Question]:
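I want to automatically extract certain desired concepts (noun phrases) from text. My plan is to extract all noun phrases, label them as one of two classes (desirable or undesirable), and then train a classifier on them. Right now I am trying to extract every possible phrase to use as the training set. For example, given the sentence "Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.", I want to get all phrases such as shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams, and any other possible ones. The desirable phrases are shoulder, junctions, and junctions of columns and beams. I do not care about precision at this step, though; I just want to get the training set first. Is there a tool available for this kind of task?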
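To make the labeling step of this plan concrete, here is a minimal sketch that turns the candidate phrases listed above into (phrase, label) pairs. The names `label_candidates` and `training_set` are hypothetical, and the 1/0 encoding is just one possible choice.

```python
# Candidate phrases and the desirable subset, both copied from the example above.
candidates = [
    "shoulder", "richer mix", "shoulder of richer mix", "junctions",
    "junctions of columns and beams", "columns and beams", "columns", "beams",
]
desirable = {"shoulder", "junctions", "junctions of columns and beams"}


def label_candidates(candidates, desirable):
    """Return (phrase, label) pairs: 1 = desirable, 0 = undesirable."""
    return [(phrase, int(phrase in desirable)) for phrase in candidates]


training_set = label_candidates(candidates, desirable)
for phrase, label in training_set:
    print(label, phrase)
```

Pairs in this shape could then be fed to whichever classifier is chosen later.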
I tried Rake from rake_nltk, but the result does not contain the phrases I want (that is, it does not extract all possible phrases):
from rake_nltk import Rake

data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
# Phrases are returned ordered from highest to lowest score
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams']
(junctions of columns and beams is missing here)
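A likely reason for the miss: RAKE builds its candidates by cutting the text at stopwords and punctuation, so a phrase containing "of" or "and" cannot come out in one piece. The sketch below imitates only that splitting step, not RAKE's scoring. The stopword list is a small hand-picked assumption (rake_nltk uses NLTK's English stopword list by default), and `split_on_stopwords` is a hypothetical helper, not part of rake_nltk.

```python
import re

data = ('Where a shoulder of richer mix is required at these junctions, '
        'or at junctions of columns and beams, the items are so described.')

# Assumption: a tiny stopword list that covers just this sentence.
STOPWORDS = {"where", "a", "of", "is", "at", "these", "or", "and", "the", "are", "so"}


def split_on_stopwords(text):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases, current = [], []
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token in STOPWORDS or not token.isalnum():
            # A stopword or punctuation mark closes the current phrase.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


print(split_on_stopwords(data))
```

Because "of" and "and" act as boundaries, "junctions", "columns", and "beams" can only appear as separate candidates under this scheme.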
I also tried phrasemachine, and its result misses some of the desired phrases as well:
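import spacy
import phrasemachine

# Load a spaCy English pipeline (the model name is an assumption; use whichever model is installed)
nlp = spacy.load("en_core_web_sm")
# data is the same sentence as in the Rake snippet above
doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
# Print the tokens covered by each matched span, last span first
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])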
Result:
[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']
(many noun phrases are missing here)
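For comparison, here is a minimal sketch of the kind of exhaustive extraction described in the question: it enumerates every token span whose POS sequence matches a noun-phrase pattern. As far as I understand, phrasemachine works in a similar spirit (a regular expression over POS tags), but this is not its grammar. The pattern here is my own and additionally allows single nouns and conjunctions. The one-letter tags are assigned by hand for this sentence (an assumption; in practice they would be mapped from a tagger's output), and `candidate_phrases` and `NP_PATTERN` are hypothetical names.

```python
import re

# Hand-assigned coarse tags (assumption, for illustration only):
# A = adjective, N = noun, P = preposition, C = conjunction, D = determiner, O = other
TAGGED = [
    ("Where", "O"), ("a", "D"), ("shoulder", "N"), ("of", "P"),
    ("richer", "A"), ("mix", "N"), ("is", "O"), ("required", "O"),
    ("at", "P"), ("these", "D"), ("junctions", "N"), (",", "O"),
    ("or", "C"), ("at", "P"), ("junctions", "N"), ("of", "P"),
    ("columns", "N"), ("and", "C"), ("beams", "N"), (",", "O"),
    ("the", "D"), ("items", "N"), ("are", "O"), ("so", "O"),
    ("described", "O"), (".", "O"),
]

# Base NP: optional adjectives/nouns ending in a noun.
# Full NP: base NPs linked by a preposition or conjunction (optional determiner).
NP_PATTERN = re.compile(r"[AN]*N(?:[PC]D?[AN]*N)*")


def candidate_phrases(tagged, max_len=8):
    """Return every span of up to max_len tokens whose tag string matches NP_PATTERN."""
    tokens = [tok for tok, _ in tagged]
    tags = "".join(tag for _, tag in tagged)  # one character per token
    phrases = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if NP_PATTERN.fullmatch(tags[start:end]):
                phrases.append(" ".join(tokens[start:end]))
    return phrases


for phrase in candidate_phrases(TAGGED):
    print(phrase)
```

Since every matching span is kept, nested phrases such as "columns", "columns and beams", and "junctions of columns and beams" all appear. The list over-generates on purpose, which suits a training set that will be filtered by a classifier afterwards.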
[Discussion]:
-
Can you explain what you mean by a phrase?
-
Basically, I mean a noun phrase.
Tags: python nlp spacy named-entity-recognition information-extraction