Skip multiple rows using pandas.read_csv

【Question title】: Skip multiple rows using pandas.read_csv
【Posted】: 2019-02-19 11:34:16
【Question】:

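I am reading a large CSV file in chunks because I don't have enough memory to hold it all. I want to read its first 10 rows (rows 0 to 9), skip the next 10 rows (10 to 19), read the next 10 rows (20 to 29), skip the next 10 (30 to 39), then read rows 40 to 49, and so on. Here is the code I am using: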

# initializing the n1 and n2 variables
n1 = 1
n2 = 2
# reading data in chunks
for chunk in pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=list(range((n1 * 10) + 1, (n2 * 10) + 1))):
    sample_chunk = chunk
    # displaying the sample_chunk
    print(sample_chunk)
    # incrementing n1
    n1 = n1 + 2
    # incrementing n2
    n2 = n2 + 2

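However, the code I wrote does not work. It only skips rows 10 to 19 (that is, it reads rows 0 to 9, skips 10 to 19, reads 20 to 29, then also reads 30 to 39, then 40 to 49, and keeps reading every remaining row). Please help me find what I am doing wrong.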

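【Comments】:

That's because when you initialize pd.read_csv you are passing skiprows=[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Check this answer that uses chunksize: stackoverflow.com/questions/25962114/…
@Noor How many rows do you have in total?
@Nihal More than 2 million rows.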
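
The first comment can be checked without pandas at all: the skiprows argument is an ordinary Python expression, so it is evaluated once, before the loop body ever runs. A minimal sketch using only the n1 and n2 values from the question (the variable name skiprows is introduced here for illustration):

```python
# Values from the question at the moment pd.read_csv is called.
n1 = 1
n2 = 2

# The argument expression is evaluated once, before iteration starts.
skiprows = list(range((n1 * 10) + 1, (n2 * 10) + 1))
print(skiprows)

# Changing n1 and n2 afterwards does not alter the list already built.
n1 += 2
n2 += 2
print(skiprows)
```

Both print calls show the same list, which is why incrementing n1 and n2 inside the loop has no effect on which rows are skipped.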

Tags: python python-3.x pandas csv

【Solution 1】:

Code:

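import pandas as pd

# lengthOfFile is the number of data rows in the file (must be known in advance).
ro = list(range(0, lengthOfFile + 10, 10))
# Data rows in every second block of 10; j + 1 accounts for the header on line 0.
# Stopping at len(ro) - 1 keeps ro[i + 1] within bounds.
d = [j + 1 for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
print(d)

pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=d)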

For example:

lengthOfFile = 100
ro = list(range(0, lengthOfFile + 10, 10))
d = [j for i in range(1, len(ro), 2) for j in range(ro[i], ro[i + 1])]
print(d)

Output: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

【Comments】:

Thanks @Nihal. I implemented your code, but it reads rows 0,1,2,3,4,5,6,7,8,19 instead of 0-9, then rows 20,21,22,23,24,25,26,27,28,39 instead of 20-29, then 40,41,42,43,44,45,46,47,48,59 instead of 40-49, and so on.
Updated my answer; just use j + 1 when building d.
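
The off-by-one reported above comes from the header: skiprows counts file lines starting at 0, and line 0 is the header, so data row k sits on file line k + 1. The sketch below reproduces this on a small in-memory file. The single column name value, the 40-row size, and the helper rows_read are assumptions made for illustration, not part of the original data.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: a header line plus 40 data rows.
csv_text = "value\n" + "\n".join(str(i) for i in range(40)) + "\n"

def rows_read(skip):
    """Return the data values that survive a given skiprows list."""
    chunks = pd.read_csv(io.StringIO(csv_text), chunksize=10, skiprows=skip)
    return [int(v) for chunk in chunks for v in chunk["value"]]

# Data-row indices to drop: rows 10-19 and 30-39.
data_rows = [j for start in (10, 30) for j in range(start, start + 10)]

without_offset = rows_read(data_rows)                # treats indices as file lines
with_offset = rows_read([j + 1 for j in data_rows])  # shifts past the header

print(without_offset[:10])
print(with_offset[:10])
```

Without the offset the first chunk ends with row 19 instead of row 9, matching the behaviour described in the comment; with j + 1 the first chunk is rows 0-9.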

【Solution 2】:

With your approach, you need to define all of the skiprows when you initialize pd.read_csv. You can do it like this:

# i + 1 because line 0 of the file is the header row
rowskips = [i + 1 for x in range(1, lengthOfFile // 10, 2) for i in range(x * 10, (x + 1) * 10)]

lengthOfFile is the number of data rows in the file.

Then, for pd.read_csv:

pd.read_csv('../input/train.csv',chunksize=10, dtype=dtypes,skiprows=rowskips)

From the documentation:

skiprows : list-like, int or callable, optional

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

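So you can pass a list, an int, or a callable:

int -> skips that many lines at the start of the file
list -> skips the line numbers given in the list
callable -> is evaluated against each line number to decide whether that line is skipped

You are passing a list, which fixes the lines to skip at the moment read_csv is called; you cannot update it afterwards. Another option is to pass a callable such as lambda x: x in rowskips, which is evaluated for each line to decide whether it should be skipped.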
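
A callable also makes it possible to avoid building the list, and therefore to avoid knowing the file length in advance. The following is only a sketch under stated assumptions: the file has a single header line, and the names BLOCK and skip_line, the column name value, and the in-memory test data are hypothetical.

```python
import io
import pandas as pd

BLOCK = 10  # rows per block, as in the question

def skip_line(line_number):
    """Return True for file lines that fall in every second block of data rows."""
    if line_number == 0:          # line 0 is assumed to be the header
        return False
    data_row = line_number - 1    # convert file line to data-row index
    return (data_row // BLOCK) % 2 == 1

# Hypothetical stand-in for train.csv: a header line plus 45 data rows.
csv_text = "value\n" + "\n".join(str(i) for i in range(45)) + "\n"

kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=BLOCK, skiprows=skip_line):
    kept.extend(int(v) for v in chunk["value"])

print(kept)
```

Here the kept rows are 0-9, 20-29 and 40-44. For a very large file, testing membership with x in rowskips against a list is slow, so an arithmetic test like this one (or a set instead of a list) is likely the better choice.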

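【Comments】:

Your program keeps only rows 0-9 and skips all the others
@Nihal Yes, I missed the 2 in range
Still wrong: say I have length=400, then it will go all the way to 4000
@Nihal Thank you, yes, it should be fine now. I overlooked the *10 in the second for
@Noor Added an explanation.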