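【Question title】: Pandas read_csv, reading a csv file with a missing header element
【Posted】: 2016-04-24 20:08:58
【Question description】:

I am trying to import a csv file with pandas.read_csv. The header row has three names, but every data row has four fields. The file looks like this: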

    "COL_A","COL_B","COL_C"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"

On my first attempt I ran:

    data = pd.read_csv('broken.csv')

I got:

                 COL_A     COL_B     COL_C
    ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
    ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
    ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
    ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
    ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
    ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
    ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD

Setting index_col=False:

    data = pd.read_csv('broken.csv',index_col=False)

I get:

          COL_A     COL_B     COL_C
    0  ROW1COLA  ROW1COLB  ROW1COLC
    1  ROW2COLA  ROW2COLB  ROW2COLC
    2  ROW3COLA  ROW3COLB  ROW3COLC
    3  ROW4COLA  ROW4COLB  ROW4COLC
    4  ROW5COLA  ROW5COLB  ROW5COLC
    5  ROW6COLA  ROW6COLB  ROW6COLC
    6  ROW7COLA  ROW7COLB  ROW7COLC

If I add prefix='X':

    data = pd.read_csv('broken.csv',index_col=False,prefix='X')

I get:

          COL_A     COL_B     COL_C
    0  ROW1COLA  ROW1COLB  ROW1COLC
    1  ROW2COLA  ROW2COLB  ROW2COLC
    2  ROW3COLA  ROW3COLB  ROW3COLC
    3  ROW4COLA  ROW4COLB  ROW4COLC
    4  ROW5COLA  ROW5COLB  ROW5COLC
    5  ROW6COLA  ROW6COLB  ROW6COLC
    6  ROW7COLA  ROW7COLB  ROW7COLC

The same happens with read_table:

    data = pd.read_table('broken.csv',index_col=True,sep=',')

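I would like to know whether pandas has any way to automatically assign a header and pick up the values of the column whose header is missing.

【Comments】:

    Tags: python csv pandas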


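    【Solution 1】:

    I think you can use read_csv with the parameter header=0, so that the first row is treated as the column names, and then override those names with the custom column names passed in names. The parameter sep=',' is omitted because it is the default: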

    import pandas as pd
    import io
    
    temp=u'''"COL_A","COL_B","COL_C"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), header=0, names=['a','b','c','d'])
    
    print(df)
              a         b         c         d
    0  ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
    1  ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
    2  ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
    3  ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
    4  ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
    5  ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
    6  ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD
    

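    A more general solution: header=None tells pandas that the header contains no column names, and skiprows=[0] skips the first row, which is missing the name of the last column: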

    import pandas as pd
    import io
    
    temp=u'''"COL_A","COL_B","COL_C"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
    # after testing, replace io.StringIO(temp) with the filename
    df = pd.read_csv(io.StringIO(temp), header=None, skiprows=[0])
    
    print(df)
              0         1         2         3
    0  ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
    1  ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
    2  ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
    3  ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
    4  ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
    5  ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
    6  ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD
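
The question asks for the missing header to be assigned automatically. One possible sketch, building on the header=None / skiprows=[0] approach above: read the header line and the first data line with the standard csv module, pad the header with generated names, and pass the result through names. The helper name read_with_padded_header and the filler pattern UNNAMED_{} are assumptions made for illustration, not part of pandas, and the sketch assumes every data row has the same number of fields as the first one.

```python
import csv
import io

import pandas as pd

temp = '''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"'''


def read_with_padded_header(buf, filler="UNNAMED_{}"):
    """Hypothetical helper: pad a too-short header, then load with pandas.

    `buf` must be a seekable text buffer (an open file or io.StringIO).
    """
    rows = csv.reader(buf)
    header = next(rows)       # the incomplete header line
    first_row = next(rows)    # first data line, used to count the fields
    # Generate one placeholder name per field that has no header name.
    missing = range(len(header), len(first_row))
    names = header + [filler.format(i) for i in missing]
    buf.seek(0)               # rewind so pandas reads from the start
    # Skip the original header line and supply the padded names instead.
    return pd.read_csv(buf, header=None, skiprows=[0], names=names)


df = read_with_padded_header(io.StringIO(temp))
print(df)
```

With a real file, open it first (for example with open('broken.csv', newline='')) and pass the file object to the helper.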
    

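    【Comments】:

      【Solution 2】:

      Because the header has one fewer name than the data rows have fields, the first column, which has no name/header, is treated as the index column.

      You should also use the index_col parameter correctly: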

      data = pd.read_table('broken.csv',index_col=[0],sep=',')
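
A small sketch of that implicit-index behaviour, using an in-memory copy of the first rows of the file. It also shows one way to turn the index back into an ordinary column afterwards; the column name COL_D is made up for the illustration.

```python
import io

import pandas as pd

temp = '''"COL_A","COL_B","COL_C"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"'''

# Three header names, four fields per row: the first field of each
# row becomes the index and the header names label the other three.
df = pd.read_csv(io.StringIO(temp))
print(df.index.tolist())

# Move the index back into a regular column, then relabel all four
# columns. 'COL_D' is a hypothetical name for the unnamed column.
fixed = df.reset_index()
fixed.columns = ["COL_A", "COL_B", "COL_C", "COL_D"]
print(fixed)
```

Relabelling after reset_index shifts the original header names back over the fields they sit above in the file, on the assumption that it is the last column whose name is missing.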
      

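      If your first column contains data rather than an index, you can skip the first row, give your columns names, and tell read_csv that there is no header to read: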

      cols = ['col1','col2','col3','col4']
      data = pd.read_table('broken.csv',sep=',', skiprows=[0], header=None, names=cols)
      

      【Comments】:
