[Posted]: 2020-12-22 12:14:59
[Question]:
Code:
import spacy

nlp = spacy.load("en_core_web_md")

# read txt file, each string on its own line
with open("./try.txt", "r") as f:
    texts = f.read().splitlines()

# substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt", "w") as f:
    f.write("\n".join(out))
Input file contents:
Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion
Output file contents:
GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY
I need to avoid this problem in the sentences above. For example, in sentence 2, 'PERSON PERSON PERSON' should become a single entity PERSON. Thanks.
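One possible way to get a single tag per entity is sketched below. It assumes that working with whole entity spans (`doc.ents`, each with `start_char`, `end_char`, and `label_`) instead of individual tokens is acceptable. The helper names `replace_spans` and `tag_texts` are hypothetical and not part of spaCy, and the exact entity boundaries depend on the model.

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label.

    Spans are assumed not to overlap, which holds for spaCy's doc.ents.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # one tag for the whole entity
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)


def tag_texts(texts, model="en_core_web_md"):
    """Run spaCy NER over texts and yield each line with entities replaced."""
    import spacy  # imported lazily so replace_spans is usable without spaCy

    nlp = spacy.load(model)
    for doc in nlp.pipe(texts):
        spans = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        yield replace_spans(doc.text, spans)
```

With this sketch, `"\n".join(tag_texts(texts))` would take the place of the token loop in the question. Because replacement happens per entity span, a multi-token name contributes one tag as long as the model recognizes it as one entity.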
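[Discussion]:
- If you are OK with a post-processing step, you can import re and then use re.sub(r'(?<!\S)([A-Z]+)(?: \1)+(?!\S)', r'\1', out_)
- @3832970 Thanks, but I don't think that is good, because the NER should return without repetition.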
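To make the suggested post-processing step concrete, here is a minimal sketch that applies the regex from the comment to the output lines shown in the question. The function name `collapse_repeated_tags` is hypothetical.

```python
import re

# Regex from the comment: an all-caps word repeated one or more times,
# separated by single spaces and delimited by whitespace or string edges.
REPEATED_TAG = re.compile(r"(?<!\S)([A-Z]+)(?: \1)+(?!\S)")


def collapse_repeated_tags(line):
    """Collapse runs such as 'PERSON PERSON PERSON' into a single 'PERSON'."""
    return REPEATED_TAG.sub(r"\1", line)


tagged_lines = [
    'GPE recently became the ORDINAL GPE state to "ban NORP culture.',
    "His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.",
    "ORG is looking at buying GPE startup for MONEYMONEY MONEY",
]

for line in tagged_lines:
    print(collapse_repeated_tags(line))
```

Note the limits of this approach: `MONEYMONEY MONEY` is left unchanged because the first two tags are not separated by a space, two distinct adjacent entities with the same label would be merged into one, and any repeated all-caps word in the original text would be collapsed as well.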
Tags: python nlp spacy named-entity-recognition