【Question title】: File with multiple headers to DataFrame with melt
【Posted】: 2020-01-22 02:41:53
【Question】:
+------+------+------+------+-------+------+------+-------+
|      |      |      |      | USD   | EUR  | JPY  | RUP   |
+------+------+------+------+-------+------+------+-------+
|      |      |      |      | Case  | Cons | Case | Case  |
+------+------+------+------+-------+------+------+-------+
|      |      |      |      | High  | Low  | CWM  | AEP   |
+------+------+------+------+-------+------+------+-------+
| Col1 | Col2 | Col3 | Col4 | Owner | OPS  | VH   | Delta |
+------+------+------+------+-------+------+------+-------+
| V1   | V2   | V3   | V4   | V5    | V6   | V7   | V8    |
| V1a  | V2a  | V3a  | V4a  | V5a   | V6a  | V7a  | V8a   |
+------+------+------+------+-------+------+------+-------+

As requested, here is sample data as output by df.to_dict():

{('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2', 'Year'): {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020, 10: 2020, 11: 2020, 12: 2020, 13: 2020, 14: 2020, 15: 2020, 16: 2020, 17: 2020, 18: 2020, 19: 2020, 20: 2020, 21: 2020, 22: 2020, 23: 2020, 24: 2020, 25: 2020, 26: 2020, 27: 2020, 28: 2020, 29: 2020, 30: 2020, 31: 2020, 32: 2020, 33: 2020, 34: 2020, 35: 2020, 36: 2020, 37: 2020, 38: 2020, 39: 2020, 40: 2020, 41: 2020, 42: 2020, 43: 2020, 44: 2020, 45: 2020, 46: 2020, 47: 2020}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Unnamed: 1_level_2', 'Month'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Unnamed: 2_level_2', 'Day'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 2, 33: 2, 34: 2, 35: 2, 36: 2, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 2, 43: 2, 44: 2, 45: 2, 46: 2, 47: 2}, ('Unnamed: 3_level_0', 'Unnamed: 3_level_1', 'Unnamed: 3_level_2', 'Hour'): {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 0, 25: 1, 26: 2, 27: 3, 28: 4, 29: 5, 30: 6, 31: 7, 32: 8, 33: 9, 34: 10, 35: 11, 36: 12, 37: 13, 38: 14, 39: 15, 40: 16, 41: 17, 42: 18, 43: 19, 44: 20, 45: 21, 46: 22, 47: 23}, ('USD', 'Cons', 'very high', 'Hub1'): {0: 23.06, 1: 21.49, 2: 21.73, 3: 21.58, 4: 21.67, 5: 22.78, 6: 27.15, 7: 26.09, 8: 26.23, 9: 28.21, 10: 29.21, 11: 31.97, 12: 30.45, 13: 30.45, 14: 30.45, 15: 29.14, 16: 
28.28, 17: 26.35, 18: 26.32, 19: 27.01, 20: 26.34, 21: 28.22, 22: 27.77, 23: 26.94, 24: 24.16, 25: 22.74, 26: 22.67, 27: 22.67, 28: 22.74, 29: 23.14, 30: 27.81, 31: 27.87, 32: 28.05, 33: 27.91, 34: 32.66, 35: 35.14, 36: 33.32, 37: 36.17, 38: 38.33, 39: 31.75, 40: 30.9, 41: 26.36, 42: 27.17, 43: 28.17, 44: 26.17, 45: 26.5, 46: 28.95, 47: 26.94}, ('EUR', 'Case', 'CWM', 'Hub2'): {0: 18.59, 1: 18.32, 2: 18.32, 3: 18.32, 4: 18.32, 5: 19.19, 6: 22.57, 7: 25.38, 8: 25.53, 9: 25.9, 10: 26.47, 11: 26.47, 12: 26.09, 13: 25.59, 14: 25.35, 15: 24.97, 16: 24.22, 17: 25.22, 18: 25.49, 19: 26.19, 20: 25.63, 21: 25.1, 22: 21.93, 23: 19.61, 24: 19.4, 25: 18.75, 26: 18.85, 27: 18.75, 28: 18.88, 29: 19.41, 30: 23.97, 31: 27.07, 32: 27.23, 33: 29.21, 34: 30.49, 35: 28.52, 36: 27.49, 37: 26.93, 38: 26.71, 39: 25.76, 40: 25.24, 41: 25.67, 42: 26.72, 43: 27.98, 44: 26.73, 45: 25.97, 46: 22.34, 47: 19.47}, ('USD', 'Cons', 'Ventyx', 'Hub3'): {0: 19.78, 1: 20.96, 2: 21.58, 3: 21.5, 4: 21.27, 5: 22.59, 6: 26.22, 7: 26.78, 8: 26.78, 9: 26.97, 10: 26.97, 11: 26.97, 12: 26.53, 13: 26.34, 14: 26.5, 15: 26.22, 16: 25.6, 17: 26.5, 18: 26.74, 19: 27.44, 20: 26.87, 21: 26.5, 22: 23.2, 23: 23.58, 24: 22.74, 25: 22.31, 26: 22.27, 27: 22.27, 28: 22.74, 29: 22.84, 30: 27.79, 31: 31.63, 32: 29.6, 33: 29.25, 34: 30.53, 35: 28.51, 36: 27.48, 37: 26.97, 38: 26.74, 39: 26.53, 40: 26.5, 41: 26.92, 42: 28.89, 43: 30.24, 44: 28.38, 45: 27.38, 46: 24.39, 47: 23.2}}

This is the best representation of the file I can give.

Columns 1-4 have a single header row. Columns 5-N (N because we don't know how many there are) have 4 header rows.
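To make the layout concrete, here is a minimal sketch of how a file shaped like this loads when four header rows are requested. The CSV text is a made-up stand-in based on the table above (two id columns and two value columns only), not the real file. Header cells that are empty in the file come back as `Unnamed: ...` placeholders, which matches the keys in the dict output shown earlier.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the first three header rows are
# blank above the id columns, and only the fourth row names them.
csv_text = (
    ",,USD,EUR\n"
    ",,Case,Cons\n"
    ",,High,Low\n"
    "Col1,Col2,Owner,OPS\n"
    "V1,V2,V5,V6\n"
    "V1a,V2a,V5a,V6a\n"
)

# header=[0, 1, 2, 3] turns the four header rows into a 4-level MultiIndex.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1, 2, 3])

# Every column label is now a 4-tuple, including the id columns.
print(df.columns.tolist())
```

Because the id columns are also labelled by 4-tuples, passing plain names such as `'Col1'` to `id_vars` will not match them unless a single level is selected or the columns are flattened first.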

The dataframe needs to look like this:

+------+------+------+------+-------+-------+-------+-------+------+
| Col1 | Col2 | Col3 | Col4 | NCol1 | NCol2 | NCol3 | NCol4 | Col9 |
+------+------+------+------+-------+-------+-------+-------+------+
| V1   | V2   | V3   | V4   | USD   | Case  | High  | Owner | V5   |
| V1a  | V2a  | V3a  | V4a  | USD   | Case  | High  | Owner | V5a  |
| V1   | V2   | V3   | V4   | EUR   | Cons  | Low   | OPS   | V6   |
| V1a  | V2a  | V3a  | V4a  | EUR   | Cons  | Low   | OPS   | V6a  |
| V1   | V2   | V3   | V4   | JPY   | Case  | CWM   | VH    | V7   |
| V1a  | V2a  | V3a  | V4a  | JPY   | Case  | CWM   | VH    | V7a  |
| V1   | V2   | V3   | V4   | RUP   | Case  | AEP   | Delta | V8   |
| V1a  | V2a  | V3a  | V4a  | RUP   | Case  | AEP   | Delta | V8a  |
+------+------+------+------+-------+-------+-------+-------+------+

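So basically, the headers of columns 5 through N become new columns, and each data value is lined up with the first 4 columns of its row and with the headers it originally sat under.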

I have tried:

df = pd.read_csv(file,header=[0,1,2,3])
df.melt(var_name=['a','b','c','d'], value_name='e')

And:

df2 = df.melt(id_vars=['Year','Month','Day','Hour'], col_level=3)

And:

df2 = df.stack().stack().stack().stack()

The last one is very close, but it also stacks the first 4 columns.

But this doesn't work, because it only gives me col1 and col2.

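【Comments】:

  • Can you run df.to_dict() and paste the result? That is, read the CSV, output it as a dict, and share it. It should be easier to work with than what you've provided so far.
  • Let me see how I can create something that matches more closely and post it.
  • My only concern with providing that dict is that it's a small sample, and the dataframe may have an unknown number of columns.
  • Did you not read the part about what I tried?
  • Here, I'll add another 10 things I tried that also didn't work.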

Tags: python pandas


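【Solution 1】:

I feel like I'm shooting in the dark, but this is what I could come up with. Let me know if it isn't what you want. If it isn't, please post sample output based on the dict you posted so that others can chime in. I'm happy to delete this hack.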

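import pandas as pd

# `sample` is the dict posted in the question (the output of df.to_dict()).
df = pd.DataFrame(sample)

# Flatten the 4-level column MultiIndex into tuples, then join each with '_'.
df.columns = df.columns.to_flat_index()
df.columns = ['_'.join(i) for i in df.columns]

# Keep the four date/time columns as identifiers; melt everything else.
# Each name must stay on one line: a line break inside the string changes it.
df = df.melt(id_vars=[
    'Unnamed: 0_level_0_Unnamed: 0_level_1_Unnamed: 0_level_2_Year',
    'Unnamed: 1_level_0_Unnamed: 1_level_1_Unnamed: 1_level_2_Month',
    'Unnamed: 2_level_0_Unnamed: 2_level_1_Unnamed: 2_level_2_Day',
    'Unnamed: 3_level_0_Unnamed: 3_level_1_Unnamed: 3_level_2_Hour'])

# Shorten every column name to its last '_'-separated part (Year, Month, ...).
df.columns = [i.split('_')[-1] for i in df.columns]

# Split the flattened header back into one column per header level.
pd.concat([df, df.variable.str.split('_', expand=True)], axis=1)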
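Since the number of value columns is unknown, a variant that avoids hard-coding the flattened names may be easier to reuse. The sketch below is one possible approach, not the method from this answer: it passes the first `n_id` column tuples to `melt` as `id_vars` and names the header levels through `var_name`. The helper name `melt_multi_header`, its parameters, and the output names `NCol1`..`NCol4` and `Col9` are assumptions taken from the question's desired table, and the demo frame is a made-up miniature of the real data.

```python
import pandas as pd


def melt_multi_header(df, n_id=4,
                      level_names=("NCol1", "NCol2", "NCol3", "NCol4"),
                      value_name="Col9"):
    """Melt every column after the first `n_id` into long format.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    # With MultiIndex columns, id_vars must be the full column tuples.
    id_cols = list(df.columns[:n_id])
    long = df.melt(id_vars=id_cols,
                   var_name=list(level_names),
                   value_name=value_name)
    # The id columns come back labelled by their tuples; keep the last level.
    long.columns = [c[-1] if isinstance(c, tuple) else c for c in long.columns]
    return long


# Miniature stand-in: two id columns, two value columns, three header rows.
cols = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Unnamed: 0_level_1", "Col1"),
    ("Unnamed: 1_level_0", "Unnamed: 1_level_1", "Col2"),
    ("USD", "Case", "Owner"),
    ("EUR", "Cons", "OPS"),
])
demo = pd.DataFrame([["V1", "V2", "V5", "V6"],
                     ["V1a", "V2a", "V5a", "V6a"]], columns=cols)

long = melt_multi_header(demo, n_id=2,
                         level_names=("NCol1", "NCol2", "NCol3"))
print(long)
```

For the frame in the question, the call would presumably be `melt_multi_header(df)` with the defaults, since it has four id columns and four header levels.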

【Comments】:

  • I tried something similar. Your code results in "The following 'id_vars' are not present in the DataFrame: ['Unnamed: 0_level_0_Unnamed: 0_level_1_Unnamed:0_level_2_Year']"
  • Your code works without df2.columns = [i.split('_')[3] for i in df.columns]
  • Glad I could help. Cheers.
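One plausible reason the `[3]` variant mentioned in the comments fails, assuming it was run on the melted frame: the `variable` and `value` columns contain no underscore, so splitting them yields a single-element list and index 3 does not exist. The sketch below uses plain strings to show the difference between `[3]` and `[-1]`; the list `melted_columns` is an illustrative stand-in for the real column index.

```python
# One flattened id column name plus the two columns that melt adds.
flat_id = 'Unnamed: 0_level_0_Unnamed: 0_level_1_Unnamed: 0_level_2_Year'
melted_columns = [flat_id, 'variable', 'value']

# [-1] always exists, even when the name has no underscore at all.
last_parts = [name.split('_')[-1] for name in melted_columns]

# [3] needs at least four parts, which 'variable' and 'value' do not have.
try:
    fourth_parts = [name.split('_')[3] for name in melted_columns]
except IndexError:
    fourth_parts = None

print(last_parts, fourth_parts)
```

Even on the id column alone, index 3 picks `'Unnamed: 0'` and not `'Year'`, because the placeholder labels themselves contain underscores.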