【Question title】:Repeating entity when replacing entities with their entity labels using spaCy
【Posted】:2020-12-22 12:14:59
【Question】:

Code:

import spacy
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

#substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt","w") as f:
    f.write("\n".join(out))

Input file contents:

Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion

Output file contents:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY

I need to avoid this repetition in the sentences above. For example, in sentence 2, 'PERSON PERSON PERSON' should become a single entity PERSON. Thanks.

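【Comments】:

  • If you are fine with a post-processing step, you can use import re and then re.sub(r'(?<!\S)([A-Z]+)(?: \1)+(?!\S)', r'\1', out_)
  • @3832970 Thanks, but I don't think that is a good fit, because the NER step itself should return the label without repetition.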
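
For reference, the regex from the first comment can be tried on the question's output lines as a post-processing step. This is a minimal sketch only; the names REPEATED_LABEL and collapse_repeats are hypothetical and not part of the original comment.

```python
import re

# Pattern from the comment: an all-caps word followed by one or more
# space-separated copies of itself, bounded by whitespace or string edges.
REPEATED_LABEL = re.compile(r'(?<!\S)([A-Z]+)(?: \1)+(?!\S)')

def collapse_repeats(line):
    """Collapse runs like 'PERSON PERSON PERSON' into a single 'PERSON'."""
    return REPEATED_LABEL.sub(r'\1', line)

lines = [
    'His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.',
    'ORG is looking at buying GPE startup for MONEYMONEY MONEY',
]
for line in lines:
    print(collapse_repeats(line))
```

The first line becomes "His friend PERSON is here with PERSON and PERSON.", but the second is left unchanged, because "MONEYMONEY" has no space between the two labels. The regex would also merge two genuinely separate adjacent entities of the same type, which is a reason to prefer a fix at the entity level.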

Tags: python nlp spacy named-entity-recognition


【Solution 1】:

Let's try:

import spacy
# spaCy 2.x import; in spaCy 3.x the equivalent helper is
# spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

# read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

docs = nlp.pipe(texts)
out_text = ""
for doc in docs:
    # character offsets and label of every entity span in the doc
    offsets = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    # one BILUO tag per token, e.g. "B-PERSON", "L-PERSON", "U-GPE", "O"
    tags = biluo_tags_from_offsets(doc, offsets)
    out = []
    for tok, tag in zip(doc, tags):
        prefix = tag.split("-")[0]
        if prefix == "O":
            # not part of an entity: keep the token text
            out.append(tok.text + tok.whitespace_)
        elif prefix in ("U", "L"):
            # single-token entity, or last token of a longer one: emit the label once
            out.append(tok.ent_type_ + tok.whitespace_)
        # "B" and "I" tokens are skipped, so each entity yields exactly one label
    out_text += "".join(out) + "\n"

with open("out_try.txt","w") as f:
    f.write(out_text)

Output file contents:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON is here with PERSON and PERSON.
ORG is looking at buying GPE startup for MONEY
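
A possible alternative, sketched here under stated assumptions, is to skip the BILUO helper and splice the labels in by character offset. The function name replace_spans is hypothetical. The spans are built by hand with str.index so the sketch runs without a model; with spaCy they would presumably come from (ent.start_char, ent.end_char, ent.label_) for each ent in doc.ents, which are non-overlapping.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # the whole entity becomes one label
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last entity
    return "".join(pieces)

text = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
# Hand-built stand-ins for the entity offsets a NER model would return.
names = ("Nicolas J. Smith", "Bart Simpon", "Fred")
spans = [(text.index(n), text.index(n) + len(n), "PERSON") for n in names]
print(replace_spans(text, spans))
```

Because the replacement works on the original string, whitespace and punctuation between entities are preserved as they are, and a multi-token entity such as "$1 billion" collapses to a single MONEY.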

