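【Posted】: 2017-09-07 04:54:06
【Question】:
I have a set of csv files that I want to concatenate, and I wrote a function to do it. However, the final csv (the one that combines all the others) has the header duplicated in its first two rows, and the header is repeated again every time a new csv is concatenated.
It looks like this: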
from_line all_chars_in_the_same_row page_number words char left top right bottom
from_line all_chars_in_same_row page_number words char left top right bottom
0 0 ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l'] 1841729699_001 [[mi, il, mu, il, il]] m 38 104 2456 2492
1 0 ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l'] 1841729699_001 [[mi, il, mu, il, il]] i 40 102 2442 2448
Then, at the point where the next csv file is concatenated:
2048 49 ['L', 'A', 'C', 'H', 'E', 'T', 'E', 'U', 'R', 'D', 'É', 'C', 'L', 'A', 'R', 'E', 'A', 'V', 'O', 'I', 'R', 'P', 'R', 'I', 'S', 'C', 'O', 'N', 'N', 'A', 'I', 'S', 'S', 'A', 'N', 'C', 'E', 'D', 'E', 'S', 'C', 'O', 'N', 'D', 'I', 'T', 'I', 'O', 'N', 'S', 'G', 'É', 'N', 'É', 'R', 'A', 'L', 'E', 'S', 'D', 'E', 'V', 'E', 'N', 'T', 'E', 'S', 'T', 'I', 'P', 'U', 'L', 'É', 'E', 'S', 'A', 'U', 'V', 'E', 'R', 'S', 'O', '.'] 1841729699_001 [[lacheteur, declare, avoir, pris, connaissance, des, conditions, generales, de, vente, stipulees, au, verso.]] 0 2364 2366 3426 3429
from_line all_chars_in_same_row page_number words char left top right bottom
0 0 ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l'] 1841729699_001 [[mi, il, mu, il, il]] m 38 104 2456 2492
1 0 ['m', 'i', 'i', 'l', 'm', 'u', 'i', 'l', 'i', 'l'] 1841729699_001 [[mi, il, mu, il, il]] i 40 102 2442 2448
And so on. My function is as follows:
import os
import glob
import pandas

def concatenate(indir="files", outfile="concatenated.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = [" ", "from_line", "all_chars_in_the_same_row", "page_number", "words", "char", "left", "top", "right", "bottom"]
    for filename in fileList:
        print(filename)
        df = pandas.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)
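A likely cause of the repeated header is the `header=None` argument: it tells `read_csv` that the file has no header line, so each file's header is kept as an ordinary data row. Assigning `colnames` afterwards then writes one more header on top, which would explain the two header rows at the start. The sketch below illustrates this with a hypothetical two-column sample held in memory rather than with the real files.

```python
import io
import pandas as pd

# Hypothetical two-column sample standing in for one of the real files.
sample = "from_line,char\n0,m\n0,i\n"

# header=None: the header line is read as an ordinary data row.
as_data = pd.read_csv(io.StringIO(sample), header=None)

# header=0: the first line supplies the column names and is not data.
as_header = pd.read_csv(io.StringIO(sample), header=0)

print(as_data.shape)    # 3 rows, the first one holding the header text
print(as_header.shape)  # 2 rows
```

With `header=0`, each file contributes only its data rows, so the concatenated result carries a single header.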
To avoid the duplicated header, both in the first two rows and each time a new file is concatenated, I tried adding:
header = next(filename)
as follows:
import os
import glob
import pandas

def concatenate(indir="files", outfile="concatenated.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = [" ", "from_line", "all_chars_in_the_same_row", "page_number", "words", "char", "left", "top", "right", "bottom"]
    for filename in fileList:
        print(filename)
        header = next(filename)  # I got an error on this line
        df = pandas.read_csv(header, header=None)
        dfList.append(df)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)
I get the following error:
File "<input>", line 13, in concatenate
TypeError: 'str' object is not an iterator
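The `TypeError` arises because `filename` is a plain string: a `str` is iterable, but it is not itself an iterator, so `next()` rejects it. The short sketch below reproduces the message and shows what `next()` returns for an iterator over the string and for a file-like object. The file name and the in-memory file contents are hypothetical.

```python
import io

filename = "page_001.csv"  # hypothetical file name

# next() needs an iterator; a str is only iterable, so this raises TypeError.
try:
    next(filename)
except TypeError as exc:
    message = str(exc)

# Wrapping the string in iter() works, but yields its first character,
# not the first line of the file it names.
first_char = next(iter(filename))

# A file object is an iterator over lines, so next() returns the header line.
handle = io.StringIO("from_line,char\n0,m\n")  # stands in for open(filename)
header_line = next(handle)

print(message)
print(repr(first_char), repr(header_line))
```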
EDIT1: After making these changes:
import os
import glob
import pandas

def concatenate(indir="files", outfile="concatenated.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = [" ", "from_line", "all_chars_in_the_same_row", "page_number", "words", "char", "left", "top", "right", "bottom"]
    for filename in fileList:
        print(filename)
        with open(filename) as f:
            header = next(f)
        df = pandas.read_csv(header, header=None)
        dfList.append(df)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)
I get the following error:
Traceback (most recent call last):
File "/usr/lib/python3.5/code.py", line 91, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "<input>", line 15, in concatenate
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1390, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4184)
File "pandas/parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:8449)
FileNotFoundError: File b',from_line,all_chars_in_same_row,page_number,words,char,left,top,right,bottom\n' does not exist
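The `FileNotFoundError` suggests that `read_csv` treated the header text as a file path, since a string argument is interpreted as a path or URL. One way to drop the header line without reading it by hand is to pass the path itself and let pandas skip the first row. This is a minimal sketch using a temporary file and a shortened, hypothetical column list:

```python
import os
import tempfile
import pandas as pd

col_names = ["from_line", "char"]  # shortened, hypothetical column list

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as f:
        f.write("from_line,char\n0,m\n0,i\n")

    # Pass the path (not the header text) and let pandas drop the first line,
    # then supply the column names explicitly.
    df = pd.read_csv(path, header=None, skiprows=1, names=col_names)

print(df)
```

Here `skiprows=1` discards the file's own header and `names=` supplies the replacement, so no header text ends up among the data rows.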
**EDIT2**
After running this code, the first column appears twice:
import os
import pandas as pd
import glob

fileList = glob.glob("file*.csv")
colNames = [" ", "from_line", "all_chars_in_the_same_row", "page_number", "words", "char", "left", "top", "right", "bottom"]
final_df = pd.DataFrame(columns=colNames)
for fileName in fileList:
    df = pd.read_csv(fileName, skiprows=0)  # skip first row w/ headers since you want to set column names yourself
    df.columns = colNames
    final_df = pd.concat([final_df, df], axis=0)
print(final_df)
from_line
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 1
11 11 1
12 12 2
But in the original csv file I have this:
from_line
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
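The extra column is probably the row index that was saved into each csv as an unnamed first column. By default `read_csv` reads it as data and adds a fresh index beside it. The sketch below assumes that is the case and reads that column as the index with `index_col=0`. The directory layout, file names, and two-column sample data are hypothetical, and the output is written outside the input directory so that a second run does not pick it up as an input.

```python
import glob
import os
import tempfile
import pandas as pd

def concatenate(indir, outfile):
    """Concatenate every csv in indir into outfile with a single header row."""
    paths = sorted(glob.glob(os.path.join(indir, "*.csv")))
    # index_col=0 reads the unnamed first column as the index, not as data.
    frames = [pd.read_csv(p, index_col=0) for p in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    combined = pd.concat(frames, axis=0, ignore_index=True)
    combined.to_csv(outfile)
    return combined

# Small demonstration with two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    indir = os.path.join(tmp, "files")
    os.mkdir(indir)
    samples = {
        "a.csv": ",from_line,char\n0,0,m\n1,0,i\n",
        "b.csv": ",from_line,char\n0,1,l\n",
    }
    for name, text in samples.items():
        with open(os.path.join(indir, name), "w") as f:
            f.write(text)

    out_path = os.path.join(tmp, "concatenated.csv")
    result = concatenate(indir, out_path)
    with open(out_path) as f:
        written = f.read()

print(result)
```

Because every file is read with its own header, no column names need to be assigned by hand. This does rely on all files sharing the same header, since `pd.concat` aligns columns by name.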
【Comments】:
Tags: python csv pandas concatenation glob