How to identify sentences related to a topic? Answers

【Question title】: How to identify sentences related to a topic?
【Posted】: 2019-02-04 19:25:32
【Question】:

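I'm working on a project that requires me to sort documents so that they match topics.

For example, I have 4 topics: Lecture, Tutor, Lab and Exam. I have some sentences:

Lecture was engaging
Tutor is very nice and active
The content of lecture was too much for 2 hours.
Exam seem to be too difficult compare with weekly lab.

Now I want to sort these sentences into the topics above. The result should be:

Lecture: 2
Tutor: 1
Exam: 1

I did some research, and the advice I found most often is to use LDA topic modeling. But that does not seem to solve my problem: as far as I know, LDA discovers the topics in a document, and I don't know how to pre-select the topics manually.

Could anyone help me? I'm stuck on this.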

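【Question comments】:

Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. On topic, how to ask, and ... the perfect question apply here. StackOverflow is an archive of solutions to specific programming problems. Your question seems to ask for general guidance on attacking a loosely defined application.
The way to do this may vary with the type of document. What are we working with?
I'm working with a csv file, and I know how to read the file.
Possible dupe of stackoverflow.com/questions/3113428/…. Also, too broad.

Tags: python nltk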

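【Solution 1】:

This is a great example of where to use something smarter than string matching =)

Let's consider this:

Is there a way to convert each word into a vector form (i.e. an array of floats)?
Is there a way to convert each sentence into the same vector form (i.e. an array of floats with the same dimensions as the word vectors)?

First, let's get a vocabulary of all the possible words in your list of sentences (let's call it the corpus):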

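>>> from itertools import chain
>>> from nltk import word_tokenize  # word_tokenize is used below; needs NLTK and its tokenizer data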
>>> s1 = "Lecture was engaging"
>>> s2 = "Tutor is very nice and active"
>>> s3 = "The content of lecture was too much for 2 hours."
>>> s4 = "Exam seem to be too difficult compare with weekly lab."
>>> list(map(word_tokenize, [s1, s2, s3, s4]))
[['Lecture', 'was', 'engaging'], ['Tutor', 'is', 'very', 'nice', 'and', 'active'], ['The', 'content', 'of', 'lecture', 'was', 'too', 'much', 'for', '2', 'hours', '.'], ['Exam', 'seem', 'to', 'be', 'too', 'difficult', 'compare', 'with', 'weekly', 'lab', '.']]
>>> vocab = sorted(set(token.lower() for token in chain(*list(map(word_tokenize, [s1, s2, s3, s4])))))
>>> vocab
['.', '2', 'active', 'and', 'be', 'compare', 'content', 'difficult', 'engaging', 'exam', 'for', 'hours', 'is', 'lab', 'lecture', 'much', 'nice', 'of', 'seem', 'the', 'to', 'too', 'tutor', 'very', 'was', 'weekly', 'with']

Now let's represent the 4 keywords as vectors, using the index of each word in the vocabulary:

>>> lecture = [1 if token == 'lecture' else 0 for token in vocab]
>>> lab = [1 if token == 'lab' else 0 for token in vocab]
>>> tutor = [1 if token == 'tutor' else 0 for token in vocab]
>>> exam = [1 if token == 'exam' else 0 for token in vocab]
>>> lecture
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> lab
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> tutor
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
>>> exam
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Similarly, we loop through each sentence and convert it into vector form:

>>> [token.lower() for token in word_tokenize(s1)]
['lecture', 'was', 'engaging']
>>> s1_tokens = [token.lower() for token in word_tokenize(s1)]
>>> s1_vec = [1 if token in s1_tokens else 0  for token in vocab]
>>> s1_vec
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

Repeat the same for all the sentences:

>>> s2_tokens = [token.lower() for token in word_tokenize(s2)]
>>> s3_tokens = [token.lower() for token in word_tokenize(s3)]
>>> s4_tokens = [token.lower() for token in word_tokenize(s4)]
>>> s2_vec = [1 if token in s2_tokens else 0  for token in vocab]
>>> s3_vec = [1 if token in s3_tokens else 0  for token in vocab]
>>> s4_vec = [1 if token in s4_tokens else 0  for token in vocab]
>>> s2_vec
[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
>>> s3_vec
[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
>>> s4_vec
[1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

Now, given the vector forms of the sentences and the words, you can use a similarity function, e.g. cosine similarity:

>>> from numpy import dot
>>> from numpy.linalg import norm
>>> 
>>> cos_sim = lambda x, y: dot(x,y)/(norm(x)*norm(y))
>>> cos_sim(s1_vec, lecture)
0.5773502691896258
>>> cos_sim(s1_vec, lab)
0.0
>>> cos_sim(s1_vec, exam)
0.0
>>> cos_sim(s1_vec, tutor)
0.0
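To see where these numbers come from: cosine similarity is the dot product divided by the product of the two vector lengths. The sketch below spells that out with the standard library only. It assumes a toy 4-word vocabulary (hypothetical, much shorter than the 27-word one above) so the arithmetic is easy to follow; the function mirrors the numpy one-liner, not any particular library API.

```python
import math

def cos_sim(x, y):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical toy vocabulary: ["engaging", "lecture", "tutor", "was"]
s1_vec = [1, 1, 0, 1]   # "Lecture was engaging"
lecture = [0, 1, 0, 0]  # one-hot vector for the keyword "lecture"
tutor = [0, 0, 1, 0]    # one-hot vector for the keyword "tutor"

# The sentence shares one word with "lecture" and has 3 distinct words,
# so the score is 1 / (sqrt(3) * sqrt(1)), roughly 0.577.
print(cos_sim(s1_vec, lecture))
# No shared word gives a dot product of 0, hence a score of 0.
print(cos_sim(s1_vec, tutor))
```

With a one-hot keyword vector and a 0/1 sentence vector, the score is always 1/sqrt(n) for a sentence with n distinct tokens that contains the keyword, which is why longer sentences score lower.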

Now, doing it more systematically:

>>> topics = {'lecture': lecture, 'lab': lab, 'exam': exam, 'tutor':tutor}
>>> topics
{'lecture': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'lab':     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'exam':    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'tutor':   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}


>>> sentences = {'s1':s1_vec, 's2':s2_vec, 's3':s3_vec, 's4':s4_vec}

>>> for s_num, s_vec in sentences.items():
...     print(s_num)
...     for name, topic_vec in topics.items():
...         print('\t', name, cos_sim(s_vec, topic_vec))
... 
s1
     lecture 0.5773502691896258
     lab 0.0
     exam 0.0
     tutor 0.0
s2
     lecture 0.0
     lab 0.0
     exam 0.0
     tutor 0.4082482904638631
s3
     lecture 0.30151134457776363
     lab 0.0
     exam 0.0
     tutor 0.0
s4
     lecture 0.0
     lab 0.30151134457776363
     exam 0.30151134457776363
     tutor 0.0

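I guess you get the idea. But we see that the scores are still tied for s4-lab vs s4-exam. So the question becomes, "Is there a way to make them diverge?" and you will jump down the rabbit hole of:

How best to represent a sentence/word as a fixed-size vector?
What similarity function to use to compare a "topic"/word with a sentence?
What is a "topic"? What does the vector actually represent?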

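The answer above uses what is usually called one-hot vectors to represent the words/sentences (strictly, each keyword vector is one-hot, while each sentence vector is a binary bag-of-words over the same vocabulary). There is a lot more complexity to "identifying sentences related to a topic" (a.k.a. document clustering/classification) than simply comparing strings. For example, can a document/sentence have more than one topic?

Do look up these keywords to understand the problem further: "natural language processing", "document classification", "machine learning". Meanwhile, if you don't mind, I think this question is close to being "too broad".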
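As one deliberately simple way to turn the per-sentence scores into the counts the question asks for, the sketch below assigns each sentence to its best-scoring topic and tallies the result. Two things here are assumptions, not part of the answer: a plain regex tokenizer stands in for `word_tokenize`, and ties (such as lab vs exam in s4) are broken by whichever keyword appears first in the sentence, which is an arbitrary rule chosen for illustration.

```python
import math
import re
from collections import Counter

TOPICS = ["lecture", "tutor", "lab", "exam"]
SENTENCES = [
    "Lecture was engaging",
    "Tutor is very nice and active",
    "The content of lecture was too much for 2 hours.",
    "Exam seem to be too difficult compare with weekly lab.",
]

def tokenize(text):
    # Assumption: a simple regex tokenizer instead of nltk.word_tokenize.
    return re.findall(r"[a-z0-9]+", text.lower())

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def best_topic(sentence, vocab):
    """Return the best-scoring topic for a sentence, or None if none matches."""
    tokens = tokenize(sentence)
    s_vec = [1 if word in tokens else 0 for word in vocab]
    candidates = []
    for topic in TOPICS:
        t_vec = [1 if word == topic else 0 for word in vocab]
        score = cos_sim(s_vec, t_vec)
        if score > 0:
            # Sort key: highest score first, then earliest position in the
            # sentence (the hypothetical tie-break rule).
            candidates.append((-score, tokens.index(topic), topic))
    return min(candidates)[2] if candidates else None

vocab = sorted({word for s in SENTENCES for word in tokenize(s)})
counts = Counter(best_topic(s, vocab) for s in SENTENCES)
for topic in TOPICS:
    print(f"{topic.title()}: {counts[topic]}")
```

Under that tie-break, s4 goes to "exam" because "Exam" is the first word of the sentence, so the tally matches the counts requested in the question (Lecture: 2, Tutor: 1, Exam: 1, with Lab at 0). A different tie-break, or allowing several topics per sentence, would give different counts.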

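【Discussion】:

Shameless plug to help you further with the questions above: kaggle.com/alvations/basic-nlp-with-nltk and drive.google.com/file/d/1lxRclJablHF-veuRzWBgJ9gaqMNo6fPa/…
Thanks, this may be the answer for me.

【Solution 2】:

I'm assuming you are reading from a text file or something similar. Here is how I would go about it.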

keywords = {"lecture": 0, "tutor": 0, "exam": 0}

with open("file.txt", "r") as f:
  for line in f:
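    for key in keywords:
      if key in line.lower():
        # Update the dict entry itself; incrementing a loop variable
        # bound to the value would leave the dict unchanged.
        keywords[key] += 1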

print(keywords)

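This searches each line for any of the words in the keywords dictionary and, if a match is found, increments the value stored under that key.

You don't need any external libraries or anything.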
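One caveat with a substring test such as `key in line.lower()` is that it also matches inside longer words ("exam" is a substring of "example"). A minimal variant, assuming whole-word matching is what is wanted, compares against a set of tokens instead; the set also means each line counts at most once per keyword. The list of strings stands in for the lines of file.txt, and the fifth line is a hypothetical addition to show both effects.

```python
import re

KEYWORDS = ("lecture", "tutor", "exam")

# Assumption: a list of strings stands in for the lines of file.txt.
lines = [
    "Lecture was engaging",
    "Tutor is very nice and active",
    "The content of lecture was too much for 2 hours.",
    "Exam seem to be too difficult compare with weekly lab.",
    # Hypothetical extra line: "example" must not count as "exam",
    # and the repeated "lecture" must count only once for this line.
    "For example, the lecture notes repeat the lecture slides.",
]

counts = dict.fromkeys(KEYWORDS, 0)
for line in lines:
    # Lowercased whole words; a set drops repeats within the line.
    tokens = set(re.findall(r"[a-z0-9]+", line.lower()))
    for key in KEYWORDS:
        if key in tokens:
            counts[key] += 1

print(counts)
```

Whole-word matching is stricter than the substring test: it would no longer match plurals such as "lectures", so stemming or a list of variants may be needed depending on the data.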

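【Discussion】:

Thanks, this would solve my problem. However, there may be an issue: if a sentence repeats the word "lecture" twice, it will affect the result.
Also, what if I want to extend my program so that it can analyze whether a sentence is positive or negative? Do you know which library would support this? I have tried using nltk with a scikit classifier, but it can only be checked against results that have already been seen. For example, I labeled 200 sentences as a mix of positive and negative, then let the classifier learn and checked what percentage of another 50 labeled comments it got right.
If you want to find the same keyword twice in a sentence, you could always split the sentence further and evaluate each word. I'm not sure about libraries... maybe you could check the Software Recommendations Stack Exchange site?

【Solution 3】:

Solution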

filename = "information.txt"


library = {"lecture": 0, "tutor": 0, "exam": 0}

with open(filename) as f_obj:
    content = f_obj.read() # read text into contents

words  = (content.lower()).split() # create list of all words in content

for k, v in library.items():
    for i in words:
        if k in i:
            v += 1 
            library[k] = v # without this line code count will not update 

for k, v in library.items():
    print(k.title() + ": "  + str(v))

Output

(xenial)vash@localhost:~/pcc/12/alien_invasion_2$ python3 helping_topic.py 
Tutor: 1
Lecture: 2
Exam: 1
(xenial)vash@localhost:~/pcc/12/alien_invasion_2$

This method will count duplicates for you

Enjoy!

【Discussion】:

【Solution 4】:

Just name variables after the topics you want
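For comparison, the same count of every occurrence can be sketched with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the code above: an in-memory string stands in for information.txt, words are matched whole rather than as substrings, and the last line of the sample text is a hypothetical addition that repeats a keyword.

```python
import re
from collections import Counter

TOPICS = ("lecture", "tutor", "exam")

# Assumption: this string stands in for the contents of information.txt.
content = """Lecture was engaging
Tutor is very nice and active
The content of lecture was too much for 2 hours.
Exam seem to be too difficult compare with weekly lab.
The lecture after the lecture break was better.
"""

# Count every lowercased word once per occurrence, repeats included.
word_counts = Counter(re.findall(r"[a-z0-9]+", content.lower()))

# Counter returns 0 for words it has not seen, so no KeyError here.
totals = {topic: word_counts[topic] for topic in TOPICS}
for topic, n in totals.items():
    print(f"{topic.title()}: {n}")
```

Because the hypothetical last line mentions "lecture" twice, that topic ends up with 4 occurrences in total, which illustrates the difference between counting occurrences and counting sentences.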

lecture = 2
tutor = 1
exam = 1

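You can use variable_name += 1 to increment a variable

【Discussion】:

This does not answer the question. How do they identify whether a sentence contains the term? Also, it would be better to use a dict for this imo
Sorry, I thought that when you sort documents in python, the text is read as a string
@Frogmonkey Yes, but you did not explain how to read the file or how to check whether each word is present in a sentence. You only told them how to increment a variable, which I'm sure they already know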