Pandas read_csv() conflict between sep and quotechar causing unexpected number of columns

【Question title】: Pandas read_csv() conflict with sep and quotechar causing unexpected number of columns
【Posted】: 2017-05-01 17:17:20
【Question description】:

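I have already looked at two related questions (linked in the original post). They come close to what I think is the core problem, but they were resolved in other ways.

I am trying to parse a CSV that has one field which now needs to contain a comma, so we have to wrap that field in quotes. It is the only quoted field.

Our separator (sep) is a comma, and we are now adding the double quote as the string delimiter (quotechar).

I have boiled it down to this. It looks to me as if the order in which sep and quotechar are applied is the key problem, so that a line containing sep inside a quotechar-delimited field can never work.

Data file with the last line commented out: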

$ cat simple.csv
column1,column2, column3
one,    two,                three
one,    two,               "three"
#one,    "two, two_again",   three
$

Code:

import pandas as pd
simple_file = "simple.csv"  # the data file shown above
df = pd.read_csv(simple_file, sep=',', header=0, comment='#', quotechar='"')
print(df)

Output:

column1  column2                  column3
0     one      two                    three
1     one      two                 "three"

Now enable the last line, which has the sep character inside the quoted string.

Data file:

$ cat simple.csv
column1,column2, column3
one,    two,                three
one,    two,               "three"
one,    "two, two_again",   three
$

The output is a failure:

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4

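I believe I want to force pandas to apply the quote character to each line first and the separator second, because it seems to be doing the opposite. I cannot figure out how. Is there a way to tell pandas to do this that I am not finding?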

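【Question comments】:

Remove the spaces, or define the spaces as part of the separator. Then just read the file with header=None and the defaults take care of the rest.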
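One way to treat the padding as part of the separator, without editing the file, is the `skipinitialspace` parameter of `read_csv`. The sketch below is not from the original thread; it inlines the question's rows through `io.StringIO` (the name `CSV_TEXT` is only illustrative) so that it runs without a file on disk.

```python
import io

import pandas as pd

# Same rows as simple.csv in the question, inlined so the sketch is self-contained.
CSV_TEXT = (
    'column1,column2, column3\n'
    'one,    two,                three\n'
    'one,    two,               "three"\n'
    'one,    "two, two_again",   three\n'
)

# skipinitialspace=True skips the spaces that follow each comma, so the
# opening quote becomes the first character of its field and is honoured.
df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    sep=',',
    quotechar='"',
    skipinitialspace=True,
)
print(df)
```

With this option the header names also lose their leading spaces, so the third column is addressed as `column3` rather than ` column3`.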

Tags: python string csv parsing pandas

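【Solution 1】:

The pandas CSV reader gets confused because you told it the separator is strictly ',', but you are also using spaces as a separator in the data file. Either change the separator or fix the data. With the data as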

column1,column2, column3
one,two,three
one,two,"three"
one,"two, two_again",three

you get the following:

import pandas as pd
print(pd.read_csv("data.csv", header=None))

         0               1         2
0  column1         column2   column3
1      one             two     three
2      one             two     three
3      one  two, two_again     three

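【Comments】:

When I created this small test data file, it looks like I made things worse. I assumed pandas was smart enough to ignore whitespace around the separator, but I only told it what the separator was. I do not have the original to hand at the moment, but removing the spaces from my test case makes it work as I expected.
I cannot edit my own comment. I cannot edit my own comment on my own question?
@KevinM pandas is smart enough to do exactly what you tell it to. If the separator is just a comma, the whitespace after the comma is part of the next item. Then, suddenly, you have a quote character in the middle of an item, which makes no sense, followed by an extra separator. The parser, sensibly, barfs. You can easily give pandas a regular expression as the separator, although that means it will use the Python parser instead of the C parser, which is slower but probably does not matter in your case.
Exactly what I said :-)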
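The mechanism described in the comments can be checked directly. The following is a minimal sketch, not code from the thread: it assumes a pandas version that exposes `pandas.errors.ParserError`, and the names `GOOD_ROWS`, `BAD_ROW`, `padded_value` and `error_message` are only illustrative.

```python
import io

import pandas as pd
from pandas.errors import ParserError

GOOD_ROWS = (
    'column1,column2, column3\n'
    'one,    two,               "three"\n'
)
BAD_ROW = 'one,    "two, two_again",   three\n'

# With sep=',' only, the padding stays inside the field, so the quote is
# not at the start of the field and is kept as literal text.
ok = pd.read_csv(io.StringIO(GOOD_ROWS))
padded_value = ok.iloc[0, 2]
print(repr(padded_value))

# On the bad row the quote is literal text as well, so the comma inside
# the "quoted" part counts as a real separator and the row has 4 fields.
try:
    pd.read_csv(io.StringIO(GOOD_ROWS + BAD_ROW))
    error_message = None
except ParserError as exc:
    error_message = str(exc)
print(error_message)
```

The first print shows the padding and the quote characters surviving in the value, which matches the `"three"` seen in the question's output. As for the regular-expression separator mentioned above, the pandas documentation cautions that regex delimiters are prone to ignoring quoted data, so that route may not help when a quoted field contains the separator.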