Counting all co-occurrences of a large list of nouns and verbs/adjectives within reviews: answers

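【Question title】: Counting all co-occurrences of a large list of nouns and verbs/adjectives within reviews
【Posted】: 2021-05-28 14:43:55
【Question description】:

I have a dataframe containing a large number of reviews, a large list of nouns (1000 words), and another large list of verbs/adjectives (1000 words).

Example dataframe and lists: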

import pandas as pd

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

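I want to create a dictionary that stores, for each review, all co-occurrences of the nouns with the verbs/adjectives. For example, the review

'Very professional operation. Room is very clean and comfortable'

should give

{'room': {'is': 1, 'clean': 1, 'comfortable': 1}}

I am using the following code: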

from collections import Counter
from copy import deepcopy
from pprint import pprint

def count_co_occurences(reviews):
    # Iterate on each review and count
    occurences_per_review = {
        f"review_{i+1}": {
            noun: dict(Counter(review.lower().split(" ")))
            for noun in nouns
            if noun in review.lower()
        }
        for i, review in enumerate(reviews)
    }
    # Remove verb_adj not found in main list
    opr = deepcopy(occurences_per_review)
    for review, occurences in opr.items():
        for noun, counts in occurences.items():
            for verb_adj in counts.keys():
                if verb_adj not in verbs_adj:
                    del occurences_per_review[review][noun][verb_adj]
                    
    return occurences_per_review

pprint(count_co_occurences(data["reviews"]))

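This works when the lists and the number of reviews are small, but my notebook crashes when the function is applied to large lists or a large number of reviews. How can I modify the code to handle this?

【Question comments】:

Tags: python pandas nltk

【Solution 1】:

I think you may want to use a few libraries to make your life easier. In this example I use nltk and collections, in addition to pandas of course: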

import pandas as pd
import nltk
from collections import Counter

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

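# Sets give constant-time membership tests, which matters with ~1000 words per list
noun_set = set(nouns)
verb_adj_set = set(verbs_adj)

def buildict(x):
    # word_tokenize needs the NLTK tokenizer data downloaded once,
    # e.g. nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
    tokens = nltk.word_tokenize(x)
    tokenslower = [token.lower() for token in tokens]
    # Count every listed verb/adjective in the review once
    allverbs_adj = Counter(word for word in tokenslower if word in verb_adj_set)
    # Attach a copy of those counts to each listed noun found in the review.
    # Tokens are lowercased, so capitalized list entries such as 'Great' never match.
    return {word: dict(allverbs_adj) for word in tokenslower if word in noun_set}

df['words'] = df['reviews'].apply(buildict)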

Output:

0   Very professional operation. Room is very clea...   {'room': {'is': 1, 'clean': 1, 'comfortable': 1}}
1   Daniel is the most amazing host! His place is ...   {'host': {'is': 3, 'amazing': 1, 'clean': 1, '...
2   The room is very quiet, and well decorated, ve...   {'room': {'is': 1, 'quiet': 1, 'clean': 1}}
3   He provides the room with towels, tea, coffee ...   {'room': {}}
4   Daniel is a great host. Always recomendable.    {'host': {'is': 1, 'great': 1}}
5   My friend and I were very satisfied with our s...   {'stay': {'were': 1, 'stay': 1}, 'apartment': ...
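The output above holds one dictionary per review. If a single dictionary over all reviews is wanted, the per-review counts can be summed. The following is only a minimal sketch under some assumptions: `count_cooccurrences` is a hypothetical helper name, a simple `\w+` regex stands in for `nltk.word_tokenize` (so tokens such as `'s` are not produced), and both word lists are lowercased before matching.

```python
import re
from collections import Counter, defaultdict

def count_cooccurrences(reviews, nouns, verbs_adj):
    """Sum noun -> verb/adjective co-occurrence counts over all reviews."""
    # Lowercase the word lists so they can match lowercased tokens
    noun_set = {w.lower() for w in nouns}
    verb_adj_set = {w.lower() for w in verbs_adj}
    totals = defaultdict(Counter)
    for review in reviews:
        # Assumption: a plain regex tokenizer replaces nltk.word_tokenize
        tokens = re.findall(r"\w+", review.lower())
        counts = Counter(t for t in tokens if t in verb_adj_set)
        # Each distinct noun in the review receives the review's counts once
        for noun in {t for t in tokens if t in noun_set}:
            totals[noun].update(counts)
    return {noun: dict(c) for noun, c in totals.items()}

reviews = ['Very professional operation. Room is very clean and comfortable',
           'The room is very quiet, and well decorated, very clean.',
           'Daniel is a great host. Always recomendable.']
result = count_cooccurrences(reviews, ['room', 'host'],
                             ['is', 'clean', 'comfortable', 'quiet', 'great'])
print(result)
```

Only one `Counter` per noun is kept in memory, rather than one full word count per noun per review, which is the part of the original function that grows quickly with large inputs.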

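【Comments】:

This is exactly what I was looking for, thank you. Is it also possible to convert the dict of dicts into a dataframe, so that each row is a noun and each column is a verb/adjective?
Yes, it is possible. It would be something like dfdict=pd.DataFrame(occurdict).transpose(), where occurdict is what the function buildict returns (a dict of dicts).
I have tried your solution and similar code, but they all output only 1 row and 1 column for each dictionary value, and I am not sure why. Thanks for your help anyway!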
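
A single noun-by-verb/adjective table needs the per-review dictionaries combined before the conversion. The sketch below is one possible way to do that, not the answerer's code: `per_review` is hypothetical sample data shaped like the values of df['words'], and `pd.DataFrame.from_dict(..., orient='index')` is used in place of the `transpose()` call suggested above.

```python
import pandas as pd
from collections import Counter, defaultdict

# Hypothetical per-review dicts, shaped like the values in df['words']
per_review = [
    {'room': {'is': 1, 'clean': 1, 'comfortable': 1}},
    {'room': {'is': 1, 'quiet': 1, 'clean': 1}},
    {'host': {'is': 1, 'great': 1}},
]

# Sum the counts noun by noun across all reviews
totals = defaultdict(Counter)
for occurdict in per_review:
    for noun, counts in occurdict.items():
        totals[noun].update(counts)

# Rows are nouns, columns are verbs/adjectives; pairs never seen become 0
nested = {noun: dict(c) for noun, c in totals.items()}
dfdict = pd.DataFrame.from_dict(nested, orient='index').fillna(0).astype(int)
print(dfdict)
```

With the real data, `per_review` would be replaced by `df['words']`.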