Merge CSV files using Python without repeating the header

【Question title】: Merge csv files using python without repeating header
【Posted】: 2017-12-26 00:32:27
【Question description】:

I am trying to do this:

import glob

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv") 

header_saved = False
with open('/home/tcs/PYTHONMAP/output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header =  next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)

and I get:

File "/home/tcs/.config/spyder-py3/temp.py", line 11, in <module>
    fout.write(header)

TypeError: a bytes-like object is required, not 'str'

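I don't know Python very well, so please help. I would also like to know how to split one large csv into multiple csv files that all have the same header.

【Comments on the question】:

Take a look at pandas: pandas.pydata.org and stackoverflow.com/questions/2512386/…
You are opening the fout file in binary mode by specifying 'wb'. I think it should work if you specify 'w' to write strings. You may also want to look at the csv module.
Thanks a lot, I got it working with a system command. Could you describe how to split a large csv file into smaller files, where every piece has the header? Thanks in advance.
sed 2d *.csv > /a2.csv . This command can do the same concatenation without Python.
@ShubhamChauhan That does not work; it repeats the header.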
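
The fix suggested in the comments (open the output in text mode with 'w' instead of 'wb') can be sketched roughly as follows. The function name merge_csv_keep_one_header is made up for illustration, and the sketch assumes that every input file starts with the same single-line header and that no quoted field contains a line break.

```python
def _with_newline(line):
    """Make sure a line ends with a newline so files cannot run together."""
    return line if line.endswith("\n") else line + "\n"


def merge_csv_keep_one_header(input_paths, output_path):
    """Concatenate CSV files line by line, writing the header only once.

    Hypothetical helper: assumes all inputs share the same one-line header.
    """
    header_saved = False
    # 'w' (text mode) instead of 'wb': lines read from fin are str, not bytes.
    with open(output_path, "w") as fout:
        for filename in input_paths:
            with open(filename) as fin:
                header = next(fin, None)  # None means the file is empty
                if header is None:
                    continue
                if not header_saved:
                    fout.write(_with_newline(header))
                    header_saved = True
                for line in fin:
                    fout.write(_with_newline(line))


# Usage with the paths from the question (sorted for a stable file order):
# import glob
# merge_csv_keep_one_header(
#     sorted(glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")),
#     "/home/tcs/PYTHONMAP/output.csv",
# )
```

As in the question, the output file is kept outside the folder being globbed, so it is not picked up as an input on a second run.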

Tags: python csv split

【Solution 1】:

Using pandas:

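import glob
import pandas as pd

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")
# header=0: the first line of every file is its header row
df = pd.concat((pd.read_csv(f, header=0) for f in interesting_files))
# index=False keeps the pandas row index out of the merged file
df.to_csv("output.csv", index=False)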

To also remove duplicate rows:

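import glob
import pandas as pd

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")
df = pd.concat((pd.read_csv(f, header=0) for f in interesting_files))
# drop rows that are exact duplicates across all columns
df_deduplicated = df.drop_duplicates()
# index=False keeps the pandas row index out of the merged file
df_deduplicated.to_csv("output.csv", index=False)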

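This does not remove duplicates while the data frame is being built, only afterwards. The data frame is first created by concatenating all the files, then it is deduplicated, and the final data frame can then be saved to csv.

【Discussion】:

Is there any way to remove duplicates at the same time?
@Rahul Do you mean duplicate rows? I have updated my answer to include a way to remove duplicate rows, hope this helps! :)
What if your data does not fit in memory? With this approach, df can grow larger than the machine's RAM can hold.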
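
On the memory concern raised in the last comment, one possible workaround is to stream the files through pandas in chunks and append each chunk to the output, so that only one chunk is in memory at a time. This is only a sketch: the function name merge_csv_in_chunks and the default chunksize are assumptions, and it presumes that every file has the same columns in the same order (unlike pd.concat, appending does not align columns by name). It also does not deduplicate rows across chunks.

```python
import pandas as pd


def merge_csv_in_chunks(input_paths, output_path, chunksize=100_000):
    """Append every input file to output_path, one chunk at a time.

    Hypothetical helper: assumes identical column order in all inputs.
    """
    first_chunk = True
    for path in input_paths:
        # With chunksize set, read_csv yields DataFrames of at most
        # `chunksize` rows instead of loading the whole file.
        for chunk in pd.read_csv(path, header=0, chunksize=chunksize):
            chunk.to_csv(
                output_path,
                mode="w" if first_chunk else "a",  # overwrite once, then append
                header=first_chunk,  # header only with the very first chunk
                index=False,
            )
            first_chunk = False


# Usage with the paths from the question:
# import glob
# merge_csv_in_chunks(
#     sorted(glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")),
#     "/home/tcs/PYTHONMAP/output.csv",
# )
```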

【Solution 2】:

import glob
import csv

interesting_files = glob.glob("/home/tcs/PYTHONMAP/test1/*.csv")

header_saved = False
# newline='' lets the csv module handle line endings itself
with open('/home/tcs/PYTHONMAP/output.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in interesting_files:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            header = next(reader)  # the first row of every file is its header
            if not header_saved:
                writer.writerow(header)  # write the header only once
                header_saved = True
            writer.writerows(reader)  # copy the remaining data rows
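
The question also asks about the reverse operation, splitting one large csv into several files that each carry the header. Neither answer covers it, so the following is only a rough sketch using the same csv module. The function name split_csv_with_header, the rows_per_file parameter and the part_0001.csv naming scheme are all made up for illustration.

```python
import csv
import os


def split_csv_with_header(input_path, output_dir, rows_per_file=1000):
    """Split input_path into part files that each start with the header.

    Hypothetical helper; returns the list of part file paths it wrote.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    with open(input_path, newline="") as fin:
        reader = csv.reader(fin)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return written
        fout = None
        writer = None
        try:
            for i, row in enumerate(reader):
                if i % rows_per_file == 0:
                    # Start a new part file and repeat the header in it.
                    if fout is not None:
                        fout.close()
                    # part_0001.csv, part_0002.csv, ... (illustrative names)
                    name = f"part_{len(written) + 1:04d}.csv"
                    part_path = os.path.join(output_dir, name)
                    fout = open(part_path, "w", newline="")
                    writer = csv.writer(fout)
                    writer.writerow(header)
                    written.append(part_path)
                writer.writerow(row)
        finally:
            if fout is not None:
                fout.close()
    return written
```

Because the rows are read one at a time, the large file never has to fit in memory, and going through csv.reader rather than splitting on raw lines keeps quoted fields that contain line breaks intact.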

【Discussion】: