Filtering SpaCy noun_chunks by pos_tag (answers)

【Question title】: Filtering SpaCy noun_chunks by pos_tag
【Posted】: 2020-08-28 14:52:12
【Question】:

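As the subject line says, I am trying to extract the elements of noun_chunks according to their individual POS tags. It seems that the elements of a noun_chunk do not have access to the sentence-level POS tags.

To demonstrate the problem: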


[i.pos_ for i in nlp("Great coffee at a place with a great view!").noun_chunks]
>>> 
AttributeError: 'spacy.tokens.span.Span' object has no attribute 'pos_'

Here is my inefficient solution:

def parse(text):
    doc = nlp(text.lower())
    tags = [(idx,i.text,i.pos_) for idx,i in enumerate(doc)]

    chunks = [i for i in doc.noun_chunks]

    indices = []
    for c in chunks:
        indices.extend(j for j in range(c.start_char,c.end_char))
    non_chunks = [w for w in ''.join([i for idx,i in enumerate(text) if idx not in indices]).split(' ') 
                  if w != '']

    chunk_words = [tup[1] for tup in tags if tup[1] not in non_chunks and tup[2] not in ['DET','VERB','SYM','NUM']] #these are the POS tags which I wanted to filter out from the beginning!

    new_chunks = []
    for c in chunks:
        new_words = [w for w in str(c).split(' ') if w in chunk_words]
        if len(new_words) > 1:
            new_chunk = ' '.join(new_words)
            new_chunks.append(new_chunk)
    return new_chunks

parse(
"""
I may be biased about Counter Coffee since I live in town, but this is a great place that makes a great cup of coffee. I have been coming here for about 2 years and wish I would have found it sooner. It is located right in the heart of Forest Park and there is a ton of street parking. The coffee here is great....many other words could describe it, but that sums it up perfectly. You can by coffee by the pound, order a hot drink, and they also have food. On the weekend, there are donuts brought in from Do-Rite Donuts which have almost a cult like following. The food is a little on the high end price wise, but totally worth it. I am a self admitted latte snob and they make an amazing latte here. You can add skim, whole, almond or oat milk and they will make it happen. I always order easy foam and they always make it perfectly. My girlfriend loves the Chai Latte with Oat Milk and I will admit it is pretty good. Give them a try.
""")

>>>
['counter coffee',
 'great place',
 'great cup',
 'forest park',
 'street parking',
 'many other words',
 'hot drink',
 'almost cult',
 'high end price',
 'latte snob',
 'amazing latte',
 'oat milk',
 'easy foam',
 'chai latte',
 'oat milk']

Any faster way to get the same result is welcome!

【Comments】:

Tags: python nlp spacy chunks pos-tagger

【Solution 1】:

This does not work:

[i.pos_ for i in nlp("Great coffee at a place with a great view!").noun_chunks]

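because noun_chunks yields Span objects, not Token objects, and only a Token has a pos_ attribute.

You can get the POS tags inside each noun chunk by iterating over its tokens: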

import spacy

nlp = spacy.load("en_core_web_md")
for i in nlp("Great coffee at a place with a great view!").noun_chunks:
    # i is a Span; iterating over it yields its Token objects
    print(i, [t.pos_ for t in i])

This gives you

Great coffee ['ADJ', 'NOUN'] 
a place ['DET', 'NOUN'] 
a great view ['DET', 'ADJ', 'NOUN']
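Building on this, the filtering the question asks for can be done chunk by chunk in a single pass. The following is only a minimal sketch: `filter_chunks` and `EXCLUDED_POS` are hypothetical names, the excluded tags and the "more than one word" rule are copied from the question's own code, and the `en_core_web_sm` model in `demo` is an assumption (any English pipeline with a tagger and parser should do).

```python
# POS tags to drop, copied from the question's parse() function.
EXCLUDED_POS = {"DET", "VERB", "SYM", "NUM"}


def filter_chunks(doc, excluded=EXCLUDED_POS, min_words=2):
    """Return the noun chunks of `doc` with tokens of excluded POS removed.

    `doc` is expected to be a spaCy Doc: it must expose `noun_chunks`,
    whose items iterate over tokens that have `.text` and `.pos_`.
    """
    results = []
    for chunk in doc.noun_chunks:
        # A Span has no pos_, but every Token inside it does.
        words = [tok.text for tok in chunk if tok.pos_ not in excluded]
        # Keep only chunks that still have at least `min_words` words.
        if len(words) >= min_words:
            results.append(" ".join(words))
    return results


def demo():
    # Imported here so filter_chunks itself does not depend on spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    doc = nlp("Great coffee at a place with a great view!")
    print(filter_chunks(doc))
```

Given the tags printed above, "a place" shrinks to a single word and is dropped, while "Great coffee" and "great view" are kept. Tagging can differ between models, so the exact result may vary. Unlike the question's code, this sketch does not lowercase the text first.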

【Comments】:

Wow, that was so simple. You're a hero!

【Solution 2】:

Original credit goes to this link: Phrase extraction

import spacy

def get_nns(doc):
    nns = []
    for token in doc:
        # Try this with other parts of speech for different subtrees.
        if token.pos_ == 'NOUN':
            # token.subtree yields the token and all of its syntactic descendants
            pp = ' '.join([tok.orth_ for tok in token.subtree])
            nns.append(pp)
    return nns

nlp = spacy.load('en_core_web_sm')
ex = 'I am having a Great coffee at a place with a great view!'
doc = nlp(ex)
print(get_nns(doc))

Output:

['a Great coffee', 'a place with a great view', 'a great view']
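The POS filter from the question can be combined with this subtree approach as well. This is a rough sketch under assumptions: `get_filtered_nns` is a hypothetical name, not part of spaCy, and the default excluded tags are the ones listed in the question.

```python
def get_filtered_nns(doc, excluded=("DET", "VERB", "SYM", "NUM")):
    """Collect the subtree of every NOUN token, skipping excluded POS tags."""
    phrases = []
    for token in doc:
        if token.pos_ == "NOUN":
            # Walk the noun's subtree in document order and drop unwanted tags.
            words = [t.text for t in token.subtree if t.pos_ not in excluded]
            phrases.append(" ".join(words))
    return phrases
```

Because one noun's subtree can contain another noun's subtree, the returned phrases may overlap, as the output above already shows. Whether that is desirable depends on the use case; noun_chunks does not nest in this way.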

【Comments】: