n-grams from text in Python: answers

【Question title】: n-grams from text in python
【Posted】: 2018-08-12 00:20:56
【Question description】:

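This is an update to my previous post, with some changes:

Say I have 100 tweets. From those tweets, I need to extract: 1) food names and 2) beverage names. I also need to attach a type (drink or food) and an ID number (each item has a unique ID) to every extraction.

I already have a lexicon with names, types and ID numbers: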

lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}

Example tweet:

After various processing of "tweet_1", I have these sentences:

sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream', 
'coca cola and banana is not a good combo']

My requested output (it can be a type other than list):

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_123"]],
  [["coca cola"], ["drink", "d_234"]],
  [["banana split"], ["food", "f_567"]],
  [["ice cream"], ["food", "f_789"]]],

 "tweet_id_1",
 [[["coca cola"], ["drink", "d_234"]],
  [["banana"], ["food", "f_456"]]]]

It is important that the output does NOT extract unigrams that sit inside n-grams (n>1). In other words, the following is not wanted:

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_123"]],
  [["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana split"], ["food", "f_567"]],
  [["banana"], ["food", "f_456"]],
  [["ice cream"], ["food", "f_789"]],
  [["cream"], ["food", "f_678"]]],

 "tweet_id_1",
 [[["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana"], ["food", "f_456"]]]]

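Ideally, I would like to be able to run my sentences through various nltk filters, such as lemmatize() and pos_tag(), BEFORE the extraction, to get output like the one below. But with this regex solution, if I do that, either all words get split into unigrams, or the string "coca cola" yields 1 unigram and 1 bigram, which produces output I do not want (as in the example above). The ideal output (again, the type of the output is not important):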

["tweet_id_1",
 [[[("dr pepper", "NN")], ["drink", "d_123"]],
  [[("coca cola", "NN")], ["drink", "d_234"]],
  [[("banana split", "NN")], ["food", "f_567"]],
  [[("ice cream", "NN")], ["food", "f_789"]]],

 "tweet_id_1",
 [[[("coca cola", "NN")], ["drink", "d_234"]],
  [[("banana", "NN")], ["food", "f_456"]]]]

【Question comments】:

Duplicate of stackoverflow.com/questions/49064114/… ?
Not a duplicate, but very similar.

Tags: python regex nlp nltk n-gram

【Solution 1】:

Probably not the most efficient solution, but this will definitely get you started:

sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream', 
'coca cola and banana is not a good combo']

lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}

# Phrases to search for, ordered by word count (longest first)
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key = lambda s: len(s.split()), reverse=True)

chunks = []

for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values()) })
            # Remove the matched phrase so its sub-phrases are not matched again
            sentence = sentence.replace(lex, '')

print(chunks)

Output

[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]

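Explanation

lexicon_list = list(lexicon.keys()) collects the phrases to search for, and the sort orders them by number of words, so that the larger chunks are found first. Each matched phrase is then removed from the sentence, which is why 'cola' is not reported again after 'coca cola'.

The output is a list of dicts, where each dict maps a phrase to a list holding its type and ID.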
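One caveat with the "in" test in the loop above: it matches substrings, so 'cola' in 'chocolate' is true. Below is a minimal sketch, assuming whole-word matches are what is wanted, that uses a word-boundary regex with the longer phrases listed first. The function name extract_phrases is made up for illustration and is not part of the original answer.

```python
import re

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# Longest phrases first, so 'banana split' is tried before 'banana'.
phrases = sorted(lexicon, key=lambda s: len(s.split()), reverse=True)
# \b keeps matches on word boundaries; re.escape guards against regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(p) for p in phrases) + r')\b')

def extract_phrases(sentence):
    """Return (phrase, type, id) tuples in the order they occur in the sentence."""
    return [(m, lexicon[m]['type'], lexicon[m]['id'])
            for m in pattern.findall(sentence)]

print(extract_phrases('coca cola and banana is not a good combo'))
print(extract_phrases('hot chocolate with cream'))
```

Because the regex scans left to right and consumes each match, 'cola' inside 'coca cola' is never reported separately, and results come back in sentence order rather than lexicon order.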

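【Comments】:

【Solution 2】:

Unfortunately, my reputation is too low to leave comments, but Vivek's answer can be improved by 1) using a regex, 2) including the pos_tag tags (NN), and 3) using a dictionary structure from which you can select the results tweet by tweet: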

import re
import nltk
from collections import OrderedDict

tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream', 'coca cola and banana is not a good combo']}

lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key = lambda s: len(s.split()), reverse=True)

# one compiled regex pass per sentence, instead of one "in" test per phrase
pattern = "(" + "|".join(re.escape(lex) for lex in lexicon_list) + ")"
pattern = re.compile(pattern)

# Here we build a dictionary of our phrases and their tagged equivalents
lexicon_pos_tag = {word:nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# If you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than as ("banana", "NN") and ("split", "NN"), you could use the following instead
# lexicon_pos_tag = {word:nltk.pos_tag(word) for word in lexicon_list}

# chunks is keyed by tweet ID; each value holds one OrderedDict per sentence
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)

The final output is:

OrderedDict([('tweet_1',
          [OrderedDict([('dr pepper',
                         [[('dr', 'NN'), ('pepper', 'NN')],
                          ['drink', 'd_123']]),
                        ('coca cola',
                         [[('coca', 'NN'), ('cola', 'NN')],
                          ['drink', 'd_234']]),
                        ('banana split',
                         [[('banana', 'NN'), ('split', 'NN')],
                          ['food', 'f_567']]),
                        ('ice cream',
                         [[('ice', 'NN'), ('cream', 'NN')],
                          ['food', 'f_789']])]),
           OrderedDict([('coca cola',
                         [[('coca', 'NN'), ('cola', 'NN')],
                          ['drink', 'd_234']]),
                        ('banana',
                         [[('banana', 'NN')], ['food', 'f_456']])])])])

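【Comments】:

Thanks for the reply. However, the point of pos_tag is not that every "banana" should be NN, but to find only those bananas that the pre-trained model tags as NN.
Of course, but as I pointed out in the comment above lexicon_pos_tag ... if you run the code above after training a pos_tag model, then the line lexicon_pos_tag = {word:nltk.pos_tag(word) for word in lexicon_list} will create a dictionary like {"banana split": ("banana split", "NN")}, which will then be used correctly in the line temp[word] = [lexicon_pos_tag[word],...
Thanks! For now I am working with your original regex solution, but I will try the update too! Very nice input! :)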
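To illustrate the point raised in these comments (tag first, then keep only phrases tagged as nouns), here is a rough sketch that works on tokens that have already been tagged. It assumes the input is a list of (token, tag) pairs, the shape nltk.pos_tag returns; the tags in the example are written by hand, not produced by a tagger, and the name match_tagged and the rule of checking only the last token's tag are assumptions made for illustration.

```python
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# Longest phrase in the lexicon, measured in tokens.
MAX_N = max(len(p.split()) for p in lexicon)

def match_tagged(tagged, noun_only=True):
    """Greedy longest match over (token, tag) pairs.

    With noun_only=True, a phrase is kept only if its last token
    carries a noun tag (one starting with 'NN').
    """
    results, i = [], 0
    while i < len(tagged):
        # Try the longest window first, then shrink it.
        for n in range(min(MAX_N, len(tagged) - i), 0, -1):
            window = tagged[i:i + n]
            phrase = ' '.join(tok for tok, _ in window)
            if phrase in lexicon:
                head_tag = window[-1][1]
                if not noun_only or head_tag.startswith('NN'):
                    entry = lexicon[phrase]
                    results.append([[(phrase, head_tag)], [entry['type'], entry['id']]])
                i += n  # consume the whole phrase so inner unigrams are skipped
                break
        else:
            i += 1  # no lexicon phrase starts at this token
    return results

# Hand-written tags, for illustration only.
tagged = [('coca', 'NN'), ('cola', 'NN'), ('and', 'CC'), ('banana', 'NN'),
          ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('good', 'JJ'), ('combo', 'NN')]
print(match_tagged(tagged))
```

Because matching happens on tokens rather than on the raw string, the same function could in principle be fed lemmatized tokens, as long as the lexicon keys are lemmatized the same way.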

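【Solution 3】:

I would filter with a for loop.

Use an if statement to check whether each key occurs in the sentence. If you want unigrams to be included as well, remove

len(k.split()) > 1

If you only want to include unigrams, change it to:

len(k.split()) == 1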

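# Assumes lexicon and sentences are defined as in the question
filtered_list = ['tweet_id_1']

for k, v in lexicon.items():
    for s in sentences:
        # Keep only multi-word keys that occur in the sentence
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))

print(filtered_list)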

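【Comments】:

This will not find "banana" in the second sentence. The goal is to detect all n-grams, but without producing duplicates from the same string.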
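One way to meet the requirement in this comment while keeping the loop style of Solution 3 is to remember which character spans have already been claimed by a longer phrase. The sketch below is not from the original answer; the helper name non_overlapping_matches is hypothetical, and it assumes that whole-word matches are wanted.

```python
import re

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

def non_overlapping_matches(sentence, lexicon):
    """Collect lexicon phrases, skipping any match that overlaps an accepted longer one."""
    taken = []   # (start, end) character spans already claimed
    found = []   # (start, phrase) pairs
    # Longest phrases first, so they claim their spans before shorter ones are tried.
    for phrase in sorted(lexicon, key=lambda p: len(p.split()), reverse=True):
        for m in re.finditer(r'\b' + re.escape(phrase) + r'\b', sentence):
            if all(m.end() <= s or m.start() >= e for s, e in taken):
                taken.append((m.start(), m.end()))
                found.append((m.start(), phrase))
    # Report phrases in the order they appear in the sentence.
    return [phrase for _, phrase in sorted(found)]

filtered_list = ['tweet_id_1']
for s in sentences:
    for k in non_overlapping_matches(s, lexicon):
        filtered_list.extend((k, lexicon[k]))

print(filtered_list)
```

With this approach a standalone "banana" is still reported even when "banana split" occurs elsewhere in the same sentence, because only overlapping spans are rejected.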