Comparing two CSV files in Python that have different data sets: answers

【Question title】: comparing two csv files in python that have different data sets
【Posted】: 2020-12-11 23:10:25
【Question】:

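Using Python, I want to compare two CSV files, but only the field at index 2 of each row in the first CSV (row[2]) against the field at index 0 of each row in the second CSV (row[0]), and write to a new CSV file only the rows from the first file whose compared value has no match.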

Example:

currentstudents.csv contains the following information:

Susan,Smith,susan.smith@mydomain.com,8
John,Doe,john.doe@mydomain.com,9
Cool,Guy,cool.guy@mydomain.com,3
Test,User,test.user@mydomain.com,5

previousstudents.csv contains the following information:

susan.smith@mydomain.com
john.doe@mydomain.com
test.user@mydomain.com

After comparing the two CSV files, a new CSV named NewStudents.csv should be written with the following information:

Cool,Guy,cool.guy@mydomain.com,3

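This is what I have, but it doesn't produce what I need. If I strip everything except the email addresses from the original currentstudents.csv file, the code below works, but then the final CSV file doesn't contain the data I need.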

import fileinput

def newusers():
    # Lowercase every line of currentstudents.csv in place.
    for line in fileinput.input(r'C:\work\currentstudents.csv', inplace=1):
        print(line.lower(), end='')

    with open(r'C:\work\previousstudents.csv', 'r') as t1, open(r'C:\work\currentstudents.csv', 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()

    # Attempt to compare only one field of each line; this is the part that does not work.
    with open(r'C:\work\NewStudents.csv', 'w') as outFile:
        for (line[0]) in filetwo:
            if (line[0]) not in fileone:
                outFile.write(line)

Thanks in advance!

【Comments on the question】:

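FYI: Answering questions thoroughly is time-consuming. If your question is solved, say thank you by accepting the solution that best fits your needs. The accept check mark is below the up/down arrows at the top left of the answer. If a better solution comes along, the new one can be accepted instead. If your reputation is over 15, you can also vote on the quality/helpfulness of an answer with the up or down arrows. If a solution doesn't answer the question, leave a comment. See "What should I do when someone answers my question?". Thank you.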

Tags: python-3.x csv

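【Solution 1】:

An option with pandas
- For small files this doesn't matter, but for larger files the vectorized operations of pandas will be much faster than iterating over the emails (multiple times) with csv.
Read the data with pd.read_csv
Merge the data with pandas.DataFrame.merge
- The columns in the question have no names, so the columns are selected by column index.
Select the desired new students with Boolean indexing and [all_students._merge == 'left_only']
- .iloc[:, :-2] selects all rows, and all columns except the last two.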
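
The steps above can be checked on a small in-memory example. The following is a minimal sketch, assuming the sample rows from the question are loaded through io.StringIO instead of from disk (the variable names current_csv, previous_csv and merged are only illustrative). It shows that the merged frame carries two extra trailing columns, the key copied from the previous-students file and the _merge indicator, which is why the last two columns are dropped afterwards.

```python
import io
import pandas as pd

# Sample data copied from the question, held in memory instead of on disk.
current_csv = """Susan,Smith,susan.smith@mydomain.com,8
John,Doe,john.doe@mydomain.com,9
Cool,Guy,cool.guy@mydomain.com,3
Test,User,test.user@mydomain.com,5
"""
previous_csv = """susan.smith@mydomain.com
john.doe@mydomain.com
test.user@mydomain.com
"""

cs = pd.read_csv(io.StringIO(current_csv), header=None)
ps = pd.read_csv(io.StringIO(previous_csv), header=None)

# Left merge: every current student is kept, matched on email where possible.
merged = cs.merge(ps, left_on=2, right_on=0, how='left', indicator=True)

# Four columns from cs, then the key column from ps, then the _merge indicator.
print(merged.columns.tolist())

# 'both' means the email was found in ps; 'left_only' means it was not.
print(merged['_merge'].tolist())
```

Only the row for Cool Guy is marked 'left_only', because that email does not appear in the previous-students data.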

import pandas as pd

# read the two csv files
cs = pd.read_csv('currentstudents.csv', header=None)
ps = pd.read_csv('previousstudents.csv', header=None)

# merge the data
all_students = cs.merge(ps, left_on=2, right_on=0, how='left', indicator=True)

# select only data from left_only
new_students = all_students.iloc[:, :-2][all_students._merge == 'left_only']

# save the data without the index or header
new_students.to_csv('NewStudents.csv', header=False, index=False)

# NewStudents.csv
Cool,Guy,cool.guy@mydomain.com,3
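
As a possible alternative to the merge, the same filter can be written with Series.isin. This is not part of the answer above, only a minimal sketch under the same assumptions (no header row, email at column index 2 in the current file and index 0 in the previous file), with the question's sample data loaded from memory for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question, held in memory instead of on disk.
current_csv = """Susan,Smith,susan.smith@mydomain.com,8
John,Doe,john.doe@mydomain.com,9
Cool,Guy,cool.guy@mydomain.com,3
Test,User,test.user@mydomain.com,5
"""
previous_csv = """susan.smith@mydomain.com
john.doe@mydomain.com
test.user@mydomain.com
"""

cs = pd.read_csv(io.StringIO(current_csv), header=None)
ps = pd.read_csv(io.StringIO(previous_csv), header=None)

# Keep the rows of cs whose email (column 2) is NOT among the emails in ps (column 0).
new_students = cs[~cs[2].isin(ps[0])]

# With no path given, to_csv returns the CSV text instead of writing a file.
output = new_students.to_csv(header=False, index=False)
print(output)
```

Because no extra columns are created, nothing has to be dropped before saving; passing a file name such as 'NewStudents.csv' to to_csv writes the result to disk instead.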

【Comments】:

【Solution 2】:

This script writes NewStudents.csv:

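import csv

# Here sample.csv holds the current students (email at index 2)
# and sample2.csv holds the previous students (email at index 0).
with open('sample.csv', newline='') as csvfile1, \
     open('sample2.csv', newline='') as csvfile2, \
     open('NewStudents.csv', 'w', newline='') as csvfile3:

    reader1 = csv.reader(csvfile1)
    reader2 = csv.reader(csvfile2)
    csvwriter = csv.writer(csvfile3)

    # collect the previous students' emails into a set for fast membership tests
    emails = set(row[0] for row in reader2)

    # write only the rows whose email is not among the previous students
    for row in reader1:
        if row[2] not in emails:
            csvwriter.writerow(row)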

Contents of NewStudents.csv:

Cool,Guy,cool.guy@mydomain.com,3
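
The code in the question lowercases currentstudents.csv before comparing, which suggests the emails may differ in case between the two files. The following is a minimal sketch of a case-insensitive variant of the csv approach. The function name write_new_students and its parameters are hypothetical, and skipping rows with fewer than three fields is an assumption rather than something the question specifies.

```python
import csv


def write_new_students(current_path, previous_path, output_path):
    """Write the rows of current_path whose email (index 2) is not in previous_path."""
    # Normalize the known emails: trim surrounding whitespace and lowercase.
    with open(previous_path, newline='') as previous_file:
        known_emails = {
            row[0].strip().lower()
            for row in csv.reader(previous_file)
            if row  # ignore blank lines
        }

    with open(current_path, newline='') as current_file, \
         open(output_path, 'w', newline='') as output_file:
        writer = csv.writer(output_file)
        for row in csv.reader(current_file):
            # Assumption: rows without an email field are skipped, not copied.
            if len(row) > 2 and row[2].strip().lower() not in known_emails:
                writer.writerow(row)  # the original row is written unchanged
```

It could be called as write_new_students('currentstudents.csv', 'previousstudents.csv', 'NewStudents.csv'). Only the comparison is normalized, so the output keeps the original capitalization and the input file is not modified in place.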

【Comments】: