How to extract duplicate data in a CSV file: answers

【Question title】: How to extract duplicate data in a CSV file
【Posted】: 2021-06-22 11:59:31
【Question description】:

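I am modeling user opinions on YouTube, so I extracted a large amount of data (comments and videos). I have a CSV file with 5 columns (channelId, videoId, userId, date of comment, and polarity) and nearly 80k rows. Now I need to collect each user's comments separately from the CSV file. How can I extract all the comments for each userId? I tried extracting duplicates, but it did not work. Can anyone help me with a small Python script?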

【Question discussion】:

Tags: python export-to-csv youtube-data-api data-extraction

【Solution 1】:

You can collect the rows into a dictionary in which each key is a user ID.

import csv

users = {}
with open('data.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        # columns: channelId, videoId, userId, date of comment, popularity, comment
        # (if the file starts with a header row, skip it first with next(csv_reader))
        user = row[2]
        # fetch this user's list, creating an empty one the first time the user is seen
        user_list = users.setdefault(user, [])
        # collect whatever data you want from the row, e.g.
        user_list.append({"channelId": row[0], "videoId": row[1], "date": row[3], "popularity": row[4], "comment": row[5]})

Now you can get all the comment dates of a user like this:

thisUser = users['thisUserID']
for comment in thisUser:
    print(comment['date'])

To write each user to a separate CSV file, you can use the csv.DictWriter class:

for user_id, comments in users.items():
    with open(user_id + '.csv', 'w', newline='') as csvfile:
        fieldnames = ['channelId', 'videoId', 'date', 'popularity', 'comment']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(comments)
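
The grouping step above can also be sketched more compactly with csv.DictReader and collections.defaultdict. This is only an illustration under assumptions: the in-memory `sample`, its header names, and the helper `group_by_user` are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real file; the header names are assumed.
sample = io.StringIO(
    "channelId,videoId,userId,date,polarity,comment\n"
    "c1,v1,u1,2021-05-02,0.5,first\n"
    "c1,v2,u2,2021-05-03,-0.2,second\n"
    "c2,v3,u1,2021-06-02,0.1,third\n"
)


def group_by_user(csv_file):
    """Return a dict mapping each userId to the list of its rows (as dicts)."""
    grouped = defaultdict(list)
    # DictReader consumes the header row and yields one dict per data row.
    for row in csv.DictReader(csv_file):
        grouped[row["userId"]].append(row)
    return grouped


users = group_by_user(sample)
for user_id, rows in users.items():
    print(user_id, [row["comment"] for row in rows])
```

For a real file, something like `group_by_user(open('data.csv', newline=''))` should work, provided the file has a header row with a `userId` column.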

【Discussion】:

【Solution 2】:

You can do this with the following snippet:

import numpy

# mixed ints and strings: NumPy stores every value in this array as a string
datas = numpy.array([
    # channelId, videoId, userId, date, popularity, comment
    [0, 0, 1, "02052021", 3044, "blobi"],
    [1, 2, 1, "01052021", 4234, "uygukih"],
    [2, 1, 1, "02062021", 2452, "bla"],
    [0, 0, 2, "09052021", 2345, "arghh"],
    [1, 0, 5, "02042021", 234, "haha"]
])

i_user = 2     # column index of userId
i_comment = 5  # column index of comment

for user in numpy.unique(datas[:, i_user]):
    print("_" * 50)
    print("userId {0}".format(user))
    # select the comment column of the rows that belong to this user
    user_comments = datas[datas[:, i_user] == user, i_comment]
    for i, comment in enumerate(user_comments):
        print("comments {0}: {1}".format(i + 1, comment))

It prints:

__________________________________________________
userId 1
comments 1: blobi
comments 2: uygukih
comments 3: bla
__________________________________________________
userId 2
comments 1: arghh
__________________________________________________
userId 5
comments 1: haha
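
If only the users who commented more than once are of interest, `numpy.unique` can also return how often each value occurs. The following is a minimal sketch, assuming the userId column has already been pulled out into a one-dimensional array; the `user_ids` values are hypothetical.

```python
import numpy

# Hypothetical userId values, one entry per comment row.
user_ids = numpy.array(["1", "1", "1", "2", "5"])

# unique() returns the sorted distinct values; return_counts adds their frequencies.
values, counts = numpy.unique(user_ids, return_counts=True)

# Keep only the users that appear more than once.
repeated_users = values[counts > 1]
print(repeated_users.tolist())
```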

【Discussion】:

【Solution 3】:

If I understand correctly, your CSV also has a comment column (you forgot to mention it in the list of columns).

import pandas

df = pandas.read_csv(r'youtube.csv')
# select the 'comment' column for the rows of one user ('h' is a placeholder userId)
print(df.loc[df['userId'] == 'h', 'comment'])
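
To get the comments of every user at once rather than one userId at a time, pandas' `groupby` is one option. This is a sketch under assumptions: the in-memory `sample` and its column names are hypothetical and stand in for the real CSV.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file; the header names are assumed.
sample = io.StringIO(
    "channelId,videoId,userId,date,polarity,comment\n"
    "c1,v1,u1,2021-05-02,0.5,first\n"
    "c1,v2,u2,2021-05-03,-0.2,second\n"
    "c2,v3,u1,2021-06-02,0.1,third\n"
)
df = pd.read_csv(sample)

# Group the rows by userId and collect each group's comments into a list.
comments_by_user = df.groupby("userId")["comment"].apply(list).to_dict()
print(comments_by_user)
```

With the real data, replacing `sample` by the path to the CSV file should give one list of comments per userId.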

【Discussion】:

【Solution 4】:

Use pandas.

import pandas as pd

df = pd.read_csv('data.csv')
# keep only the userId and comment columns
new_df = df[['userId', 'comment']]
new_df.to_csv('user_comment.csv', index=False)
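
Since the question mentions extracting duplicates, here is a hedged sketch of how the rows whose userId occurs more than once could be selected with `DataFrame.duplicated`. The in-memory `sample` and its column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file; the header names are assumed.
sample = io.StringIO(
    "channelId,videoId,userId,date,polarity,comment\n"
    "c1,v1,u1,2021-05-02,0.5,first\n"
    "c1,v2,u2,2021-05-03,-0.2,second\n"
    "c2,v3,u1,2021-06-02,0.1,third\n"
)
df = pd.read_csv(sample)

# keep=False marks every row whose userId occurs more than once,
# including the first occurrence, not only the later repeats.
repeated = df[df.duplicated(subset="userId", keep=False)]
print(repeated[["userId", "comment"]])
```

To write one file per user instead, iterating over `df.groupby("userId")` and calling `to_csv` on each group is a possible approach.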

【Discussion】: