Find matches and scores between 2 csv files (answers)

【Question title】: find matches and scores between 2 csv files
【Posted】: 2021-03-11 12:11:27
【Question description】:

text                                                    number
0   very nice house, and great garden                       3
1   the book is very boring                                 4
2   it was very interesting final end                       5
3   I have no idea which    book do you prefer              4

I have 2 csv files: text.csv (shown above) and words.csv (shown below):

       word              score
0      boring           -1.0
1      very             -1.0
2      interesting       1.0
3      great             1.0
4      book              0.5

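I want to count how many positive and negative words from words.csv match each text.

For example, "the book is very boring" contains one positive word (book, 0.5) and two negative words (very and boring, -1 each). So my output, as [positive matches, negative matches], should be [1, 2], based on matching text.csv against the scores in words.csv.

I am new to pandas and do not know how to get this.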

【Question discussion】:

Tags: python-3.x pandas csv

【Solution 1】:

Create a function that scores a sentence, then apply it to the text column:

import pandas as pd

# import data
text, number = zip(
    ("very nice house, and great garden",                       3),
    ("the book is very boring",                                 4),
    ("it was very interesting final end",                       5),
    ("I have no idea which    book do you prefer",              4),
)
df = pd.DataFrame(dict(text=text, number=number))

word, score = zip(
    ("boring",           -1.0),
    ("very",             -1.0),
    ("interesting",       1.0),
    ("great",             1.0),
    ("book",              0.5),
)
df2 = pd.DataFrame(dict(word=word, score=score))

# convert score data frame to a dictionary for faster indexing
word2score = dict(zip(df2['word'], df2['score']))

def score_text(sentence):
    score = 0
    for word in sentence.split():
        token = word.strip(",.:;!?()'/") # you probably want to do a more professional tokenization here
        if token in word2score:
            score += word2score[token]
    return score

df['score'] = df['text'].apply(score_text)

print(df)

#                                          text  number  score
# 0           very nice house, and great garden       3    0.0
# 1                     the book is very boring       4   -1.5
# 2           it was very interesting final end       5    0.0
# 3  I have no idea which    book do you prefer       4    0.5
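
The snippet above builds both frames inline. With the actual files, loading them might look like the sketch below. It assumes that text.csv has a header row `text,number`, that words.csv has a header row `word,score`, and that neither file stores the index column shown in the question; none of this is confirmed by the question. In-memory buffers stand in for the files so the sketch runs on its own.

```python
import io
import pandas as pd

# Stand-ins for text.csv and words.csv (assumed layout: header row, no index column).
# With real files, pass the file paths to pd.read_csv instead of these buffers.
text_csv = io.StringIO(
    'text,number\n'
    '"very nice house, and great garden",3\n'
    'the book is very boring,4\n'
    'it was very interesting final end,5\n'
    'I have no idea which    book do you prefer,4\n'
)
words_csv = io.StringIO(
    'word,score\n'
    'boring,-1.0\n'
    'very,-1.0\n'
    'interesting,1.0\n'
    'great,1.0\n'
    'book,0.5\n'
)

df = pd.read_csv(text_csv)
df2 = pd.read_csv(words_csv)

# Same lookup table as in the answer, built from the loaded frame
word2score = dict(zip(df2['word'], df2['score']))
```

Note that a text containing a comma has to be quoted in the file, as in the first row here, or `read_csv` will split it into extra fields.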

Edit:

If you want to count the positive and negative words instead, the scoring function needs a small change:

def score_text(sentence):
    score = [0, 0]
    for word in sentence.split():
        token = word.strip(",.:;!?()'/") # you probably want to do a more professional tokenization here
        if token in word2score:
            if word2score[token] > 0:
                score[0] += 1
            elif word2score[token] < 0:
                score[1] += 1
    return score

df['score'] = df['text'].apply(score_text)
print(df)

#                                          text  number   score
# 0           very nice house, and great garden       3  [1, 1]
# 1                     the book is very boring       4  [1, 2]
# 2           it was very interesting final end       5  [1, 1]
# 3  I have no idea which    book do you prefer       4  [1, 0]
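
The inline comment in `score_text` notes that a more careful tokenization is probably wanted. As one possible sketch (not part of the original answer), the version below lowercases the sentence and extracts words with a regular expression, so that "Very" or "boring!" still match. The function name `count_pos_neg` and the token pattern are assumptions chosen for illustration.

```python
import re

# Scores copied from words.csv in the question
word2score = {"boring": -1.0, "very": -1.0, "interesting": 1.0, "great": 1.0, "book": 0.5}

def count_pos_neg(sentence):
    """Return [positive_matches, negative_matches] for one sentence."""
    counts = [0, 0]
    # Lowercase first, then treat each run of letters/apostrophes as a token
    for token in re.findall(r"[a-z']+", sentence.lower()):
        value = word2score.get(token, 0)  # unknown words count as neutral
        if value > 0:
            counts[0] += 1
        elif value < 0:
            counts[1] += 1
    return counts

print(count_pos_neg("The book is VERY boring!"))  # [1, 2]
```

It can be applied the same way as before, with `df['text'].apply(count_pos_neg)`. The pattern only covers ASCII letters; text in other scripts would need a different tokenizer.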

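【Discussion】:

Thanks for the clarification; I have updated the answer accordingly.
I don't understand what you mean. I am adding a column named score to your data frame that contains the results. Isn't that the expected output format?
df['score'].values.tolist() gives you the results as a list of lists, if that is what you are asking for.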
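
To illustrate the last comment, here is a small sketch of pulling the counts out of the score column. The frame is a hand-built stand-in holding two of the results shown in the answer, and the column names `positive` and `negative` are assumptions, not something the question asked for.

```python
import pandas as pd

# Stand-in frame holding [positive, negative] lists as produced by score_text
df = pd.DataFrame({
    "text": ["the book is very boring", "I have no idea which    book do you prefer"],
    "score": [[1, 2], [1, 0]],
})

# Plain Python list of lists, as suggested in the comment
pairs = df["score"].values.tolist()
print(pairs)  # [[1, 2], [1, 0]]

# Optional: one column per count (hypothetical column names)
df[["positive", "negative"]] = pd.DataFrame(df["score"].tolist(), index=df.index)
```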