【发布时间】:2020-01-02 19:28:05
【问题描述】:
我有超过 10 万个句子的语料库,我有字典。我想匹配语料库中的单词并在句子中标记它们
语料库文件“sentences.txt”
Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems
字典文件“dict.csv”
abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom
我的python程序
import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs
with open('dictionary.csv','r') as csvFile:
reader = csv.reader(csvFile)
myfile = open("sentences.txt", "rt")
my3file = open("tagged_sentences.txt", "w")
hay = myfile.read()
myfile.close()
for row in reader:
needle = row[1]
needle_length = len(needle.split())
max_sim_val = 0.9
max_sim_string = u""
for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
hay_ngram = u" ".join(ngram)
similarity = SM(None, hay_ngram, needle).ratio()
if similarity > max_sim_val:
max_sim_val = similarity
max_sim_string = hay_ngram
str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
str1 = max_sim_string , row[2]
for line in hay.splitlines():
if max_sim_string in line:
tag_sent = line.replace(max_sim_string, str1.__str__())
my3file.writelines(tag_sent + '\n')
print(tag_sent)
break
csvFile.close()
我现在的输出是
he has ('anxiety', ' disorder') thats why he is behaving like that.
('Malaria', ' virus') can be cure
Hello how are you doing. ('Headache', ' symptom') is dangerous
我希望我的输出为。我希望它在同一文件“sentences.txt”中标记句子中的单词或将其写入新文件“myfile3.txt”中。而不会干扰句子的顺序或完全忽略(不添加)它
Hello how are you doing. ('Headache', 'symptom') is dangerous
('Malaria', ' virus') can be cure.
he has ('anxiety', ' disorder') thats why he is behaving like that
she is doing well
he has psychological problems
【问题讨论】:
标签: python dictionary tagging named-entity-recognition