【Question title】: Find matches and scores between 2 csv files
【Posted】: 2021-03-11 12:11:27
【Question description】:
text                                                    number
0   very nice house, and great garden                       3
1   the book is very boring                                 4
2   it was very interesting final end                       5
3   I have no idea which    book do you prefer              4

I have 2 csv files: text.csv (shown above) and words.csv (shown below):

       word              score
0      boring           -1.0
1      very             -1.0
2      interesting       1.0
3      great             1.0
4      book              0.5

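I want to count how many positive and negative words match each text.

For example, "the book is very boring" has 1 positive match (0.5) and 2 negative matches (-1 each). My output for that row should therefore be [1, 2] (positive matches, negative matches), obtained by matching text.csv against the scores in words.csv.

I am new to pandas and don't know how to do this.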

【Question comments】:

    Tags: python-3.x pandas csv


    【Solution 1】:

    Create a function that scores a sentence, then apply it to the text column:

    import pandas as pd
    
    # import data
    text, number = zip(
        ("very nice house, and great garden",                       3),
        ("the book is very boring",                                 4),
        ("it was very interesting final end",                       5),
        ("I have no idea which    book do you prefer",              4),
    )
    df = pd.DataFrame(dict(text=text, number=number))
    
    word, score = zip(
        ("boring",           -1.0),
        ("very",             -1.0),
        ("interesting",       1.0),
        ("great",             1.0),
        ("book",              0.5),
    )
    df2 = pd.DataFrame(dict(word=word, score=score))
    
    # convert score data frame to a dictionary for faster indexing
    word2score = dict(zip(df2['word'], df2['score']))
    
    def score_text(sentence):
        score = 0
        for word in sentence.split():
            token = word.strip(",.:;!?()'/") # you probably want to do a more professional tokenization here
            if token in word2score:
                score += word2score[token]
        return score
    
    df['score'] = df['text'].apply(score_text)
    
    print(df)
    
    #                                          text  number  score
    # 0           very nice house, and great garden       3    0.0
    # 1                     the book is very boring       4   -1.5
    # 2           it was very interesting final end       5    0.0
    # 3  I have no idea which    book do you prefer       4    0.5
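
The answer above builds both frames inline. If the data really lives in text.csv and words.csv, loading it might look like the minimal sketch below. The column names (`text`, `number`, `word`, `score`), the comma delimiter, and the quoting of the first row (its text contains a comma) are assumptions about the files, not something the question shows. `io.StringIO` stands in for the two files so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-ins for text.csv and words.csv. Layout and column names are assumed.
text_csv = io.StringIO(
    "text,number\n"
    '"very nice house, and great garden",3\n'  # quoted: the text contains a comma
    "the book is very boring,4\n"
    "it was very interesting final end,5\n"
    "I have no idea which    book do you prefer,4\n"
)
words_csv = io.StringIO(
    "word,score\n"
    "boring,-1.0\n"
    "very,-1.0\n"
    "interesting,1.0\n"
    "great,1.0\n"
    "book,0.5\n"
)

df = pd.read_csv(text_csv)
df2 = pd.read_csv(words_csv)

# Same lookup table as in the answer: word -> score
word2score = dict(zip(df2["word"], df2["score"]))
```

With real files, the `StringIO` objects would be replaced by paths, for example `pd.read_csv("text.csv")`.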
    

    Edit:

    If you want to count the number of positive and negative words instead, the scoring function needs a small change:

    def score_text(sentence):
        score = [0, 0]
        for word in sentence.split():
            token = word.strip(",.:;!?()'/") # you probably want to do a more professional tokenization here
            if token in word2score:
                if word2score[token] > 0:
                    score[0] += 1
                elif word2score[token] < 0:
                    score[1] += 1
        return score
    
    df['score'] = df['text'].apply(score_text)
    print(df)
    
    #                                          text  number   score
    # 0           very nice house, and great garden       3  [1, 1]
    # 1                     the book is very boring       4  [1, 2]
    # 2           it was very interesting final end       5  [1, 1]
    # 3  I have no idea which    book do you prefer       4  [1, 0]
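
The code comment above notes that the tokenization could be more robust. One possible variant, sketched here under assumptions, lowercases the sentence and extracts runs of letters with a regular expression, so that "Boring!" or "VERY" still match. The names `count_matches` and `TOKEN_RE` are hypothetical, and the pattern only covers unaccented ASCII letters.

```python
import re

# Word scores from the question's words.csv
word2score = {
    "boring": -1.0,
    "very": -1.0,
    "interesting": 1.0,
    "great": 1.0,
    "book": 0.5,
}

# Hypothetical tokenizer pattern: runs of ASCII letters only
TOKEN_RE = re.compile(r"[a-z]+")


def count_matches(sentence, scores):
    """Return [positive_count, negative_count] for one sentence."""
    counts = [0, 0]
    for token in TOKEN_RE.findall(sentence.lower()):
        value = scores.get(token)
        if value is None:
            continue  # word is not in the score table
        if value > 0:
            counts[0] += 1
        elif value < 0:
            counts[1] += 1
    return counts


result = count_matches("The book is VERY boring!", word2score)
```

It can be applied the same way as `score_text`, for example `df['text'].apply(lambda s: count_matches(s, word2score))`.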
    

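    【Comments】:

    • Thanks for the clarification; I have updated the answer accordingly.
    • I don't understand what you mean. I am adding a column called score, which contains the results, to your data frame. Isn't that the expected output format?
    • df['score'].values.tolist() gives you the results as a list of lists, if that is what you are asking for.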
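
The suggestion in the last comment can be sketched as follows, assuming the `score` column already holds the [positive, negative] pairs from the edited answer. Splitting the pairs into separate columns is an optional extra, and the column names `positive` and `negative` are hypothetical.

```python
import pandas as pd

# Frame as it looks after applying the edited score_text function
df = pd.DataFrame({
    "text": [
        "very nice house, and great garden",
        "the book is very boring",
        "it was very interesting final end",
        "I have no idea which    book do you prefer",
    ],
    "score": [[1, 1], [1, 2], [1, 1], [1, 0]],
})

# Plain Python list of [positive, negative] pairs, as in the comment
pairs = df["score"].values.tolist()

# Optional: expand the pairs into two integer columns (names are assumptions)
counts = pd.DataFrame(pairs, columns=["positive", "negative"], index=df.index)
df = df.join(counts)
```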