【Question title】: Pandas DataFrame (from CSV) with multiple header rows throughout the data
【Posted】: 2018-04-25 23:15:34
【Question description】:

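Test data file

I am working with a data frame created from a CSV file. The data has header rows scattered throughout it; each header row identifies the rows below it, up to the next header row.

The data looks like this.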

2001|     |colour |Price | Quantity sold
Shoes|
Blank  | High heal Shoes| red |£22|44
Blank  | Low heal Shoes|red |£22|44
Slippers|
Blank  | High heal Slippers| red |£22|44
Blank  | High heal Slippers| blue |£22|44
Blank  | Low heal Slippers| red |£22|44
2002|   |colour |Price | Quantity sold
Shoes|
Blank  | High heal Shoes| red |£22|44
Blank  | Low heal Shoes|red |£22|44
Slippers|
Blank  | High heal Slippers| red |£22|44
Blank  | High heal Slippers| blue |£22|44
Blank  | Low heal Slippers| red |£22|44

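What type of structure is this?

I need to read through this data frame and, for a specific item (e.g. Slippers), get all of its data for each year, where the year comes from the header rows (2001, 2002, etc.). Even just adding the corresponding year alongside each data row would help.

I would appreciate some help on how to do this.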

【Question comments】:

    Tags: python pandas csv dataframe


    【Solution 1】:

    Use:

    import pandas as pd

    df = pd.read_csv('test.csv')
    
    # name of the first column (here the first year, 2001)
    col = df.columns[0]
    
    # forward fill the first column so each row carries the last label seen above it
    df[col] = df[col].ffill()
    # convert the first column to numeric: only the year rows stay non-NaN
    num = pd.to_numeric(df[col], errors='coerce')
    # forward fill the years; the first group has no year row, so use the first column name
    df['Year'] = num.ffill().fillna(col)
    # rename the columns
    df = df.rename(columns={col: 'Shoes', 'Unnamed: 1': 'Names'})
    # keep only the data rows: not a year row, and 'colour' is filled in
    df = df[num.isnull() & df['colour'].notnull()].reset_index(drop=True)
    
    print(df)
               Shoes       Names  colour price Quantity sold  Year
    0   Type A shoes  Sub type A     red    22             5  2001
    1   Type A shoes  Sub type A   green    11             5  2001
    2   Type A shoes  Sub type A  yellow    44             5  2001
    3   Type A shoes  Sub type B     red    33             5  2001
    4   Type A shoes  Sub type B   green    66             5  2001
    5   Type A shoes  Sub type B  yellow    22             5  2001
    6   Type B shoes  Sub type A     red    11             5  2001
    7   Type B shoes  Sub type A   green    44             5  2001
    8   Type B shoes  Sub type A  yellow    33             5  2001
    9   Type B shoes  Sub type B     red    66             5  2001
    10  Type B shoes  Sub type B   green    21             5  2001
    11  Type B shoes  Sub type B  yellow    22             5  2001
    12  Type A shoes  Sub type A     red    22             5  2002
    13  Type A shoes  Sub type A   green    11             5  2002
    14  Type A shoes  Sub type A  yellow    44             5  2002
    15  Type A shoes  Sub type B     red    33             5  2002
    16  Type A shoes  Sub type B   green    66             5  2002
    17  Type A shoes  Sub type B  yellow    22             5  2002
    18  Type B shoes  Sub type A     red    11             5  2002
    19  Type B shoes  Sub type A   green    44             5  2002
    20  Type B shoes  Sub type A  yellow    33             5  2002
    21  Type B shoes  Sub type B     red    66             5  2002
    22  Type B shoes  Sub type B   green    21             5  2002
    23  Type B shoes  Sub type B  yellow    22             5  2002
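For readers without `test.csv`, the same forward-fill idea can be tried on the sample shown in the question. This is only a sketch under assumptions that may not match the real file: the separator is taken to be `|`, empty cells are the literal text `Blank`, the padding spaces are removed, and the column names in `cols` and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical inline copy of the question's sample (padding removed, '|' assumed as separator)
raw = """2001||colour|Price|Quantity sold
Shoes||||
Blank|High heal Shoes|red|£22|44
Blank|Low heal Shoes|red|£22|44
Slippers||||
Blank|High heal Slippers|red|£22|44
Blank|High heal Slippers|blue|£22|44
Blank|Low heal Slippers|red|£22|44
2002||colour|Price|Quantity sold
Shoes||||
Blank|High heal Shoes|red|£22|44
Blank|Low heal Shoes|red|£22|44
Slippers||||
Blank|High heal Slippers|red|£22|44
Blank|High heal Slippers|blue|£22|44
Blank|Low heal Slippers|red|£22|44
"""

# Illustrative column names; read everything as text and treat no line as the header
cols = ['Category', 'Names', 'colour', 'Price', 'Quantity sold']
df = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=cols, dtype=str)

first = df['Category'].str.strip()
# Rows whose first field is a four-digit year are the repeated header rows
is_header = first.str.fullmatch(r'\d{4}', na=False)

# Carry the year and the item label down to the rows below them
df['Year'] = first.where(is_header).ffill()
df['Category'] = first.where(~is_header & (first != 'Blank')).ffill()

# Only the rows that started with 'Blank' hold actual data
tidy = df[first == 'Blank'].reset_index(drop=True)

# All rows for one item, across every year
slippers = tidy[tidy['Category'] == 'Slippers']
print(slippers[['Year', 'Names', 'colour', 'Price', 'Quantity sold']])
```

Reading with `header=None` and explicit names avoids treating the first year row as column headers, so all year rows can be handled the same way.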
    

    Edit:

    df = pd.read_csv('testV2.csv', sep='\t')
    # print(df)
    
    # name of the first column (here the first year, 2001)
    col = df.columns[0]
    
    # forward fill the first column so each row carries the last label seen above it
    df[col] = df[col].ffill()
    # convert the first column to numeric: only the year rows stay non-NaN
    num = pd.to_numeric(df[col], errors='coerce')
    # forward fill the years; the first group has no year row, so use the first column name
    df['Year'] = num.ffill().fillna(col)
    # rename the columns
    df = df.rename(columns={col: 'Top Category', 'Unnamed: 1': 'Names'})
    # keep only the data rows: not a year row, and not a repeated 'Top Category' label row
    df = df[num.isnull() & (df['Top Category'] != 'Top Category')].reset_index(drop=True)
    

    print(df)
    
       Top Category   Names Colour Price Sold  Year
    0        Item 1  Type 1      -     2  NaN  2001
    1        Item 2  Type 1      -     2  NaN  2001
    2        Item 3  Type 1    red     2    5  2001
    3        Item 3  Type 2   blue     2    5  2001
    4        Item 3  Type 3  green     2    5  2001
    5        item 4  Type 1    red     2    5  2001
    6        item 4  Type 2   blue     3  NaN  2001
    7        item 4  Type 3  green     3  NaN  2001
    8        Item 1  Type 1      -     3  NaN  2002
    9        Item 2  Type 1      -     3  NaN  2002
    10       Item 3  Type 1    red     3    5  2002
    11       Item 3  Type 2   blue     3    5  2002
    12       Item 3  Type 3  green     3    5  2002
    13        Item4  Type 1    red     3  NaN  2002
    14        Item4  Type 2   blue     3  NaN  2002
    15        Item4  Type 3  green     3  NaN  2002
    16       Item 1  Type 1      -     3  NaN  2003
    17       Item 2  Type 1      -     3  NaN  2003
    18       Item 3  Type 1    red     3    5  2003
    19       Item 3  Type 2   blue     3    5  2003
    20       Item 3  Type 3  green     3    5  2003
    21        Item4  Type 1    red     3  NaN  2003
    22        Item4  Type 2   blue     3  NaN  2003
    23        Item4  Type 3  green     3  NaN  2003
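Once the frame is in this shape, pulling out one item per year is ordinary filtering and grouping. The sketch below uses a small hand-built frame that only imitates the output above (it is not the real file). It also assumes that `Year` may end up holding a mix of a string (the first column name) and floats, which can depend on the pandas version, so it normalises the column first.

```python
import pandas as pd

# Hypothetical frame imitating a few rows of the output above
tidy = pd.DataFrame({
    'Top Category': ['Item 3', 'Item 3', 'item 4', 'Item 3', 'Item 3', 'Item4'],
    'Names': ['Type 1', 'Type 2', 'Type 1', 'Type 1', 'Type 2', 'Type 1'],
    'Price': ['2', '2', '2', '3', '3', '3'],
    'Sold': ['5', '5', '5', '5', '5', None],
    # assumed mix of a string year (from the column name) and float years
    'Year': ['2001', '2001', '2001', 2002.0, 2002.0, 2002.0],
})

# Normalise the types so the columns can be compared and summed
tidy['Year'] = pd.to_numeric(tidy['Year']).astype(int)
tidy['Price'] = pd.to_numeric(tidy['Price'], errors='coerce')
tidy['Sold'] = pd.to_numeric(tidy['Sold'], errors='coerce')

# All rows for one item, then the quantity sold per year
item3 = tidy[tidy['Top Category'] == 'Item 3']
sold_per_year = item3.groupby('Year')['Sold'].sum()
print(sold_per_year)
```

Note that the labels in the first column are not spelled consistently in the output above (`item 4` versus `Item4`), so an exact comparison such as `== 'Item 3'` only matches rows written exactly that way.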
    

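    【Comments】:

    • Thanks for the reply. I don't understand what is happening on one of the lines. I hope you don't mind me asking some questions. What does this line do? df[col] = df[col].str.strip().replace('Blank', np.nan).ffill() What exactly does forward fill do?
    • No problem. But if my solution does not work, the problem is probably the real format of the file, so is it possible to share your sample file with the real separator and the real empty values?
    • ffill() replaces each NaN with the last known non-NaN value, so for 1, 2, NaN, NaN, 4, 7, NaN it returns 1, 2, 2, 2, 4, 7, 7
    • Thank you. Here is a link to a demo file with the formatted data.
    • Thanks for all your help. I will check the code later.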
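The forward-fill behaviour described in the comments can be checked on a tiny example. This is a minimal sketch; the series `s` and `labels` are made up for illustration, and the second part mirrors the line quoted in the first comment.

```python
import numpy as np
import pandas as pd

# The numeric example from the comment above
s = pd.Series([1, 2, np.nan, np.nan, 4, 7, np.nan])
filled = s.ffill()
print(filled.tolist())

# The line asked about: strip spaces, turn the 'Blank' placeholder into NaN,
# then fill each gap with the last label seen above it
labels = pd.Series(['Shoes', 'Blank  ', ' Blank', 'Slippers', 'Blank'])
carried = labels.str.strip().replace('Blank', np.nan).ffill()
print(carried.tolist())
```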