Tagging words in sentences using dictionaries: answers

【Question title】: Tagging words in sentences using dictionaries
【Posted】: 2020-01-02 19:28:05
【Question】:

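I have a corpus of more than 100,000 sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.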

Corpus file "sentences.txt"

Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems

Dictionary file "dict.csv"

abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom

My Python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

import codecs

with open('dict.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

    # The loop runs while csvFile is still open. Each dictionary row is matched
    # against the whole corpus, so the output follows dictionary order.
    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)

            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                str1 = max_sim_string , row[2]
                for line in hay.splitlines():
                    if max_sim_string in line:
                        tag_sent = line.replace(max_sim_string, str1.__str__())
                        my3file.writelines(tag_sent + '\n')
                        print(tag_sent)
                break

my3file.close()

My current output is

 he has ('anxiety', ' disorder') thats why he is behaving like that.
 ('Malaria', ' virus') can be cure
 Hello how are you doing. ('Headache', ' symptom') is dangerous

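This is the output I want. The words should be tagged in the sentences either in the same file "sentences.txt" or in a new file "myfile3.txt", without changing the order of the sentences and without dropping the sentences that have no match.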

 Hello how are you doing. ('Headache', 'symptom') is dangerous
 ('Malaria', ' virus') can be cure.
 he has ('anxiety', ' disorder') thats why he is behaving like that
 she is doing well
 he has psychological problems

【Comments】:

Tags: python dictionary tagging named-entity-recognition

【Solution 1】:

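If you want the output in the same order as the input sentences, you need to build the output in that order. Instead, you designed the program to report results in dictionary order. You need to swap the inner and outer loops.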

Read the dict file into an internal data structure, so you don't have to keep resetting and re-reading the file.
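A minimal sketch of that loading step, assuming the three-column layout of "dict.csv" shown in the question (id, term, tag). The helper name `load_dictionary` is hypothetical, and stripping whitespace from each field is an added assumption so that tags do not carry a leading space such as ' disorder'. An in-memory `io.StringIO` stands in for the real file.

```python
import csv
import io


def load_dictionary(csv_file):
    """Return a list of (term, tag) pairs read from an open CSV file."""
    entries = []
    # skipinitialspace drops the blank that follows each comma in dict.csv.
    for row in csv.reader(csv_file, skipinitialspace=True):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        entries.append((row[1].strip(), row[2].strip()))
    return entries


# Stand-in for open("dict.csv"), using the rows from the question.
sample = io.StringIO(
    "abc, anxiety, disorder\n"
    "def, Headache, symptom\n"
    "hij, Malaria, virus\n"
    "klm, headache, symptom\n"
)
entries = load_dictionary(sample)
print(entries)
```

Once the pairs are in a list, the dictionary can be scanned once per sentence without touching the file again.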

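Then read the sentence file one line at a time. Look for the words to tag (you already do that part well), make the replacement the way you intend, and write out the modified sentence.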
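The steps above can be sketched as follows, with the sentence loop on the outside and the dictionary loop on the inside. This is only an illustration under stated assumptions: `tag_line` is a hypothetical helper, word windows are built with plain slicing rather than `nltk.util.ngrams`, the dictionary entries are hard-coded with whitespace already stripped, and `io.StringIO` objects stand in for "sentences.txt" and the output file.

```python
import io
from difflib import SequenceMatcher


def tag_line(line, entries, threshold=0.9):
    """Tag, for each term, the first word window in `line` that is similar enough."""
    words = line.split()
    for term, tag in entries:
        n = len(term.split())
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if SequenceMatcher(None, candidate, term).ratio() > threshold:
                # Replace the matched window with a (text, tag) tuple string.
                words[i:i + n] = [str((candidate, tag))]
                break
    return " ".join(words)


# Assumed in-memory dictionary: (term, tag) pairs from the question's dict.csv.
entries = [
    ("anxiety", "disorder"),
    ("Headache", "symptom"),
    ("Malaria", "virus"),
    ("headache", "symptom"),
]

# Stand-ins for the input and output files.
sentences = io.StringIO(
    "Hello how are you doing. Headache is dangerous\n"
    "Malaria can be cure\n"
    "he has anxiety thats why he is behaving like that.\n"
    "she is doing well\n"
    "he has psychological problems\n"
)
tagged = io.StringIO()

# Outer loop over sentences: every line is written exactly once, in input order.
for line in sentences:
    tagged.write(tag_line(line.rstrip("\n"), entries) + "\n")

print(tagged.getvalue(), end="")
```

Because each sentence is written whether or not anything matched, unmatched sentences are kept and the original order is preserved. Note that splitting and re-joining on whitespace normalizes spacing within a line.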

【Comments】:

I was about to say the same thing!
Thanks, bro. I'll try it now.

【Solution 2】:

Without changing your code much, this should make it work:

...
phrases = []  # (matched text, tag) pairs collected from the dictionary
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            phrases.append((max_sim_string, row[2]))

# Walk the sentences in their original order so the output keeps that order.
for line in hay.splitlines():
    if any(max_sim_string in line for max_sim_string, _ in phrases):
        for phrase in phrases:
            max_sim_string, _ = phrase
            if max_sim_string in line:
                # Only the first phrase that matches this line is tagged.
                tag_sent = line.replace(max_sim_string, str(phrase))
                my3file.write(tag_sent + '\n')
                print(tag_sent)
                break
    else:
        # No dictionary match: write the sentence unchanged.
        my3file.write(line + '\n')

my3file.close()
csvFile.close()

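【Comments】:

Thank you very much, bro. This solves half of my problem: it gives me the results in order. But it skips the sentences that have no dictionary match, and I want those sentences too, i.e. the last two sentences in "sentences.txt".
@SubhaanKhan How about now? Edited.
Wow, awesome. You are great, bro. Thank you so much. Stay happy.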