【Question】: In Python, Pandas is loading CSV file incorrectly (Python for Data Analysis book example)
【Posted】: 2015-10-01 18:43:39
【Description】:

I am following the book Python for Data Analysis. It tells me to download the full file from http://www.fec.gov/disclosurep/PDownload.do and load it with pandas:

import pandas as pd

fec = pd.read_csv('P00000001-ALL.csv')

But the actual file has changed since the book was written. The old file (available at https://github.com/pydata/pydata-book/blob/master/ch09/P00000001-ALL.csv) loads fine:

fec = pd.read_csv('../pydata-book/ch09/P00000001-ALL.csv')

But the new one loads incorrectly, because the columns appear to be shifted by one (the first column's value is dropped):

cmte_id                           P60008059
cand_id                           Bush, Jeb
cand_nm              EASTON, AMY KELLY MRS.
contbr_nm                      KEY BISCAYNE
contbr_city                              FL
contbr_st                         331491716
contbr_zip                        HOMEMAKER
contbr_employer                   HOMEMAKER
contbr_occupation                      2700
contb_receipt_amt                 26-JUN-15
contb_receipt_dt                        NaN
receipt_desc                            NaN
memo_cd                                 NaN
memo_text                             SA17A
form_tp                             1024106
file_num                        SA17.114991
tran_id                               P2016
election_tp                             NaN

The actual row is:

C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","","SA17A","1024106","SA17.114991","P2016",

So C00579458 is getting lost somewhere.

The header looks like this:

cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp

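【Comments】:

  • Could you add a few lines, including the CSV header that causes the problem, and the exact output you get for those lines?
  • Hi Anand, the header and one line are shown above. Do you need me to add a few more lines?
  • When you inspect the DataFrame, is the first element being treated as the index?

Tags: python csv pandas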


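【Answer 1】:

As the other answer already points out, your CSV is malformed: every data line ends with a trailing comma. Each data row therefore has one more field than the header, and in that case pandas treats the first column as the index column.

To fix this, pass the index_col=False argument to pandas.read_csv(). Example: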

In [22]: import io

In [23]: import pandas as pd

In [24]: s = io.StringIO("""cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
   ....: C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","","SA17A","1024106","SA17.114991","P2016",""")

In [25]: df = pd.read_csv(s)  # issue: the first column is used as the index

In [26]: df
Out[26]:
             cmte_id    cand_id                 cand_nm     contbr_nm  \
C00579458  P60008059  Bush, Jeb  EASTON, AMY KELLY MRS.  KEY BISCAYNE

          contbr_city  contbr_st contbr_zip contbr_employer  \
C00579458          FL  331491716  HOMEMAKER       HOMEMAKER

           contbr_occupation contb_receipt_amt  contb_receipt_dt  \
C00579458               2700         26-JUN-15               NaN

           receipt_desc  memo_cd memo_text  form_tp     file_num tran_id  \
C00579458           NaN      NaN     SA17A  1024106  SA17.114991   P2016

           election_tp
C00579458          NaN

In [28]: _ = s.seek(0)  # rewind the buffer; the previous read_csv call consumed it

In [29]: df = pd.read_csv(s, index_col=False)  # no issue

In [30]: df
Out[30]:
     cmte_id    cand_id    cand_nm               contbr_nm   contbr_city  \
0  C00579458  P60008059  Bush, Jeb  EASTON, AMY KELLY MRS.  KEY BISCAYNE

  contbr_st  contbr_zip contbr_employer contbr_occupation  contb_receipt_amt  \
0        FL   331491716       HOMEMAKER         HOMEMAKER               2700

  contb_receipt_dt  receipt_desc  memo_cd  memo_text form_tp  file_num  \
0        26-JUN-15           NaN      NaN        NaN   SA17A   1024106

       tran_id election_tp
0  SA17.114991       P2016

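This is explained in the documentation:

index_col : int or sequence or False, default None

Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names)

(Emphasis mine)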
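The behaviour described in that docstring can also be checked in a plain script rather than an IPython session. The following is a minimal sketch, assuming a reasonably recent pandas; the names HEADER, ROW, raw, shifted and aligned are made up for illustration, and the data is just the header and the single row quoted in the question.

```python
import io

import pandas as pd

# Header from the question: 18 field names.
HEADER = (
    "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,"
    "contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,"
    "receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"
)
# Data row from the question: 19 fields because of the trailing comma.
ROW = (
    'C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.",'
    '"KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,'
    '"","","","SA17A","1024106","SA17.114991","P2016",'
)
raw = HEADER + "\n" + ROW + "\n"

# Default (index_col=None): the extra field makes pandas use the first
# column as the index, so every value lands one column to the left.
shifted = pd.read_csv(io.StringIO(raw))

# index_col=False: pandas does not use the first column as the index.
aligned = pd.read_csv(io.StringIO(raw), index_col=False)

print(shifted.index[0], "|", shifted["cmte_id"].iloc[0])
print(aligned.loc[0, "cmte_id"], "|", aligned.loc[0, "election_tp"])

# A cheap sanity check after loading: a default integer index suggests
# that no data column was silently promoted to the index.
print(isinstance(aligned.index, pd.RangeIndex))
```

A fresh io.StringIO is built for each call because read_csv consumes the buffer it is given.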

【Comments】:

    【Answer 2】:

    Every line of the raw data has an extra comma at the end.

    C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",
    

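    If there were 2 trailing commas, each row would be shifted by 2 columns.

    【Comments】:

    • Aha! So the source file is corrupted! Gayatri, is there a way to work around this in Pandas (by telling it about the columns or something)? Thanks.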
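One way to confirm the mismatch without pandas is to count the fields in the header and in a data row with the standard library csv module. This is only a sketch: count_fields is a hypothetical helper, and the two strings are the header from the question and the sample row above.

```python
import csv
import io

# Header from the question and the sample data row from this answer.
header = (
    "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,"
    "contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,"
    "receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"
)
row = (
    'C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE",'
    '"090960009","INFORMATION REQUESTED PER BEST EFFORTS",'
    '"INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","",'
    '"SA17A","1015697","SA17.796904","P2016",'
)


def count_fields(line):
    """Parse one CSV line (honouring quoted commas) and return its fields."""
    return next(csv.reader(io.StringIO(line)))


header_fields = count_fields(header)
row_fields = count_fields(row)

print(len(header_fields))    # 18
print(len(row_fields))       # 19
print(repr(row_fields[-1]))  # '' (the empty field created by the trailing comma)
```

The csv module is used instead of a plain split(",") because names such as "Rubio, Marco" contain commas inside quotes.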