Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)? Answers

【Question title】: Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)?
【Posted】: 2021-01-18 14:35:00
【Question】:

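My working environment mostly uses PySpark, but from some Googling, transposing in PySpark looks very complex. I would like to keep it in PySpark, but if it is much easier to do in Pandas, I will convert the Spark dataframe to a Pandas dataframe. The dataset is not so big that I would expect performance to be an issue.

I would like to transform a dataframe with multiple columns into rows:

Input: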

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Record          Hospital       Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4  
     1         Red Cross        1234 Street 429  Effective  Effective     Normal  Effective    
     2  Alberta Hospital   553 Alberta Road 441   Effecive     Normal     Normal  Effective
     3  General Hospital  994 Random Street 923     Normal  Effective     Normal  Effective

Output:

    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

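Looking at the PySpark examples, it is complicated: PySpark Dataframe melt columns into rows

Looking at the Pandas examples, it seems much easier. But there are many different Stack Overflow answers, with some saying to use pivot, melt, stack, or unstack, and in the end it gets confusing.

So, if anyone has an easy way to do this in PySpark, I am all ears. If not, I will happily take a Pandas answer.

Thank you very much for your help!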

【Question comments】:

IIUC, you can use explode in pyspark; see stackoverflow.com/questions/55378047/…
Hi, I have removed the images and edited your question, but in future please see: stackoverflow.com/questions/20109391/…

Tags: pandas pyspark pivot transform melt

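【Solution 1】:

You can also use .melt and specify id_vars. Every other column is then treated as value_vars. The number of rows in the dataframe is multiplied by the number of value_vars columns: the information from the four medicine columns is stacked into a single column, and the id_vars columns are repeated alongside it, which gives the format you want:

Dataframe setup: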

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Code:

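df = (df.melt(id_vars=['Record','Hospital', 'Hospital Address'],
              var_name='Name',
              value_name='Value')
     # a stable sort keeps Medicine_1..Medicine_4 in order within each Record
     .sort_values('Record', kind='mergesort')
     .reset_index(drop=True))
# renumber Record as 1..12 to match the requested output
df['Record'] = df.index+1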
df
Out[1]: 
    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective
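
The row-count rule described in this answer can be checked directly. The sketch below is not part of the original answer. It assumes that every column to unpivot starts with the prefix Medicine_ (true for the sample data, but an assumption for any real dataframe) and passes those columns explicitly as value_vars, which may be convenient when there are many such columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Record': [1, 2, 3],
    'Hospital': ['Red Cross', 'Alberta Hospital', 'General Hospital'],
    'Hospital Address': ['1234 Street 429', '553 Alberta Road 441',
                         '994 Random Street 923'],
    'Medicine_1': ['Effective', 'Effecive', 'Normal'],
    'Medicine_2': ['Effective', 'Normal', 'Effective'],
    'Medicine_3': ['Normal', 'Normal', 'Normal'],
    'Medicine_4': ['Effective', 'Effective', 'Effective'],
})

id_cols = ['Record', 'Hospital', 'Hospital Address']
# Assumption: every column to unpivot starts with 'Medicine_'.
value_cols = [c for c in df.columns if c.startswith('Medicine_')]

long_df = df.melt(id_vars=id_cols, value_vars=value_cols,
                  var_name='Name', value_name='Value')

# 3 input rows x 4 value columns -> 12 output rows
print(len(df), len(value_cols), len(long_df))
```

Listing value_vars explicitly also means that any column which is neither an id column nor a medicine column is left out of the result instead of being melted by accident.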

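【Discussion】:

Hi David, could you explain why you chose melt over stack/pivot in this scenario? Thanks.
@Anonymous You can reach the same solution in several ways, as you can see in Andy's stack answer. I just think melt can be a little cleaner, though Andy's stack answer works well too. The melt version is 3 operations instead of 5, so it is a bit more concise and perhaps more efficient.

【Solution 2】:

Here is a pandas solution using stack: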

df_final =  (df.set_index(['Record', 'Hospital', 'Hospital Address'])
               .stack(dropna=False)
               .rename('Value')
               .reset_index()
               .rename({'level_3': 'Name'},axis=1)
               .assign(Record=lambda x: x.index+1))

Out[120]:
    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

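【Discussion】:

Hi Andy, could you explain why you used stack rather than pivot/melt in this case?
@Anonymous: pivot turns values into the index and columns. You are turning columns into values, so pivot cannot be used here. melt is a possible candidate. However, melt unpivots the columns in this order: Medicine_1, Medicine_1, Medicine_1, Medicine_2, Medicine_2, Medicine_2, ... You would need a sort_values to get Medicine_1, Medicine_2, Medicine_3, Medicine_4, Medicine_1, Medicine_2, Medicine_3, Medicine_4, ... whereas stack returns Medicine_1, Medicine_2, Medicine_3, Medicine_4, Medicine_1, Medicine_2, ... right away. That is why I chose stack.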
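
The ordering difference described in this comment can be seen on a cut-down frame. This is only an illustrative sketch, not code from either answer: the two-hospital, two-medicine frame is made up for brevity, and rename_axis is used here merely so that the stacked level is already called Name.

```python
import pandas as pd

df = pd.DataFrame({
    'Hospital': ['Red Cross', 'Alberta Hospital'],
    'Medicine_1': ['Effective', 'Effecive'],
    'Medicine_2': ['Effective', 'Normal'],
})

# melt walks column by column: all Medicine_1 rows, then all Medicine_2 rows
melted = df.melt(id_vars='Hospital', var_name='Name', value_name='Value')
print(melted['Name'].tolist())
# ['Medicine_1', 'Medicine_1', 'Medicine_2', 'Medicine_2']

# stack walks row by row: every medicine of the first hospital, then the next
stacked = (df.set_index('Hospital')
             .rename_axis(columns='Name')
             .stack()
             .rename('Value')
             .reset_index())
print(stacked['Name'].tolist())
# ['Medicine_1', 'Medicine_2', 'Medicine_1', 'Medicine_2']

# pivot goes the other way: the values in 'Name' become columns again
wide = stacked.pivot(index='Hospital', columns='Name', values='Value')
print(sorted(wide.columns))
# ['Medicine_1', 'Medicine_2']
```

Both long results contain the same rows; only their order differs, which is why the melt answer needs a sort step and the stack answer does not.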

【Solution 3】:

Using stack in PySpark is also fairly simple:

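# create sample data
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# reuse the existing session (pyspark shell / notebook) or create one
spark = SparkSession.builder.getOrCreate()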
panda_df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
df = spark.createDataFrame(panda_df)

# calculate
df.select("Hospital","Hospital Address", 
          expr("stack(4, 'Medicine_1', Medicine_1, 'Medicine_2', Medicine_2, \
          'Medicine_3', Medicine_3,'Medicine_4',Medicine_4) as (MedicinName, Effectiveness)")
         ).where("Effectiveness is not null").show()

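Generating the query dynamically when there are many columns:

The main idea here is to build the stack(x, a, b, c) expression dynamically. We can use Python string formatting to build the dynamic string.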

from pyspark.sql.functions import expr

index_cols = ["Hospital", "Hospital Address"]
drop_cols = ['Record']
# Select all columns that need to be unpivoted into rows
pivot_cols = [c for c in df.columns if c not in index_cols + drop_cols]
# Build the stack expression dynamically; for this data it becomes
# stack(4, 'Medicine_1', Medicine_1, 'Medicine_2', Medicine_2, ...) as (MedicinName, Effectiveness),
# i.e. the same expression that was written by hand above
stackexpr = "stack({}, {}) as (MedicinName, Effectiveness)".format(
    len(pivot_cols), ", ".join("'{0}', {0}".format(c) for c in pivot_cols))
# Use select + expr rather than selectExpr: selectExpr would parse
# "Hospital Address" as SQL, i.e. as column Hospital aliased to Address
df.select(*index_cols, expr(stackexpr)).show()

Output:

+----------------+--------------------+-----------+-------------+
|        Hospital|    Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
|       Red Cross|     1234 Street 429| Medicine_1|    Effective|
|       Red Cross|     1234 Street 429| Medicine_2|    Effective|
|       Red Cross|     1234 Street 429| Medicine_3|       Normal|
|       Red Cross|     1234 Street 429| Medicine_4|    Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1|     Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4|    Effective|
|General Hospital|994 Random Street...| Medicine_1|       Normal|
|General Hospital|994 Random Street...| Medicine_2|    Effective|
|General Hospital|994 Random Street...| Medicine_3|       Normal|
|General Hospital|994 Random Street...| Medicine_4|    Effective|
+----------------+--------------------+-----------+-------------+

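【Discussion】:

Hey Venky, thanks for introducing me to the stack function in PySpark. In my actual dataframe there are 20+ medicines. Writing out 'Medicine_1', Medicine_1 for each of them would take a long time. Is it possible to skip the aliases and pass a list instead?
@Anonymous I have updated my answer to handle a large number of columns by generating the expression dynamically. It would be cleaner/easier if stack were a PySpark function rather than a SQL function.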
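
To pass "just a list", as asked in the comment above, the string building can be wrapped in a small function. This is a sketch only: build_stack_expr is a hypothetical helper, not part of PySpark or of the original answer. It only produces the SQL text, so it can be tried without a Spark session. It wraps column names in backticks so that names containing spaces still parse, and it refuses names containing quotes or backticks rather than trying to escape them.

```python
def build_stack_expr(value_cols, name_col="MedicinName", value_col="Effectiveness"):
    """Return a Spark SQL stack(...) expression string for the given columns.

    Hypothetical helper: not part of PySpark or of the original answer.
    """
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    pairs = []
    for c in value_cols:
        # Assumption: names with quotes or backticks are out of scope here.
        if "'" in c or "`" in c:
            raise ValueError("unsupported column name: {!r}".format(c))
        # 'label' is the string literal, `column` is the column reference
        pairs.append("'{0}', `{0}`".format(c))
    return "stack({}, {}) as ({}, {})".format(
        len(value_cols), ", ".join(pairs), name_col, value_col)


cols = ["Medicine_{}".format(i) for i in range(1, 5)]
print(build_stack_expr(cols))
```

With a live session the result would be used the same way as the hand-written expression, for example df.select("Hospital", "Hospital Address", expr(build_stack_expr(cols))). If your Spark version is 3.4 or later, DataFrame.unpivot (with melt as an alias) may make the SQL string unnecessary; check the documentation for your version before relying on it.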