【Question title】: Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)?
【Posted】: 2021-01-18 14:35:00
【Question description】:

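My work environment mainly uses PySpark, but from some googling, transposing in PySpark looks very complex. I would like to keep this in PySpark, but if it is much easier to do in Pandas, I will convert the Spark dataframe to a Pandas dataframe. The dataset is not so big that I think performance would be an issue.

I want to convert a dataframe with multiple columns into rows:

Input: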

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Record          Hospital       Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4  
     1         Red Cross        1234 Street 429  Effective  Effective     Normal  Effective    
     2  Alberta Hospital   553 Alberta Road 441   Effecive     Normal     Normal  Effective
     3  General Hospital  994 Random Street 923     Normal  Effective     Normal  Effective

Output:

    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

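The PySpark examples I found look complicated: PySpark Dataframe melt columns into rows

The Pandas examples look much easier. But there are many different Stack Overflow answers, some saying to use pivot, others melt, stack, or unstack, and it ends up being confusing.

So, if anyone has a simple way to do this in PySpark, I am all ears. If not, I will happily take a Pandas answer.

Thank you very much for your help!

【Question discussion】:

Tags: pandas pyspark pivot transform melt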


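【Solution 1】:

You can also use .melt and specify id_vars; every other column is then treated as value_vars. The row count of the result is the original row count multiplied by the number of value_vars columns: the four medicine columns are stacked into a single column, and the id_vars columns are repeated alongside them, which gives the format you want:

Dataframe setup: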

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

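Code:

# Unpivot the Medicine_* columns into Name/Value pairs
df = (df.melt(id_vars=['Record','Hospital', 'Hospital Address'],
              var_name='Name',
              value_name='Value')
     # melt emits rows column by column; a stable sort regroups them by record
     # while keeping Medicine_1..Medicine_4 in order within each record
     .sort_values('Record', kind='mergesort')
     .reset_index(drop=True))
# Renumber Record as 1..n to match the desired output
df['Record'] = df.index+1
df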
Out[1]: 
    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

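【Discussion】:

  • Hi David, can you explain why you chose melt over stack/pivot in this scenario? Thanks.
  • @Anonymous You can reach the same solution in several ways, as you can see in Andy's stack answer. I just think melt can be a little cleaner, but Andy's stack answer works well too. It is 3 operations instead of 5, so a bit more concise and perhaps more efficient.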
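
As a rough check of the claim that both routes reach the same result, the sketch below builds a cut-down version of the question's frame and compares the two outputs. The shortened data and the names `id_cols`, `melted`, `stacked`, and `same_rows` are assumptions made for illustration, not part of either answer. It names the stacked level with `rename_axis` instead of renaming `level_3`, and uses a stable sort so the medicine order within each record is preserved.

```python
import pandas as pd

# Cut-down, illustrative version of the question's data
df = pd.DataFrame({
    "Record": [1, 2],
    "Hospital": ["Red Cross", "Alberta Hospital"],
    "Medicine_1": ["Effective", "Effecive"],
    "Medicine_2": ["Effective", "Normal"],
})
id_cols = ["Record", "Hospital"]

# Route 1: melt, then a stable sort to regroup the rows by record
melted = (df.melt(id_vars=id_cols, var_name="Name", value_name="Value")
            .sort_values("Record", kind="mergesort")
            .reset_index(drop=True))

# Route 2: stack; naming the column axis first gives the new index level a name
stacked = (df.set_index(id_cols)
             .rename_axis(columns="Name")
             .stack()
             .rename("Value")
             .reset_index())

# Compare the cell values row by row
same_rows = melted.values.tolist() == stacked.values.tolist()
print(same_rows)
```

If the two frames agree, `same_rows` is `True`; which route reads better is then mostly a matter of taste.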
【Solution 2】:

Here is a pandas approach using stack

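# Move the id columns into the index so that only the Medicine_* columns are stacked
df_final =  (df.set_index(['Record', 'Hospital', 'Hospital Address'])
               .stack(dropna=False)                  # one row per (record, medicine) pair
               .rename('Value')                      # name the resulting Series
               .reset_index()
               .rename({'level_3': 'Name'},axis=1)   # the unnamed stacked level comes back as 'level_3'
               .assign(Record=lambda x: x.index+1))  # renumber Record as 1..n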

Out[120]:
    Record          Hospital       Hospital Address       Name       Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

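【Discussion】:

  • Hi Andy, can you explain why you used stack rather than pivot/melt in this case?
  • @Anonymous: pivot turns values into index and columns. You are turning columns into values, so we cannot use pivot. Melt is a possible candidate. However, melt returns the columns in this order: Medicine_1, Medicine_1, Medicine_1, Medicine_2, Medicine_2, Medicine_2,... You need a sort_values to turn that into Medicine_1, Medicine_2, Medicine_3, Medicine_4, Medicine_1, Medicine_2, Medicine_3, Medicine_4... stack returns Medicine_1, Medicine_2, Medicine_3, Medicine_4, Medicine_1, Medicine_2, Medicine_3, Medicine_4,... right away. That is why I chose stack.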
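
The ordering difference described in this comment can be seen on a tiny frame. This is a minimal sketch; the two-record data and the names `long_df`, `melt_order`, `stack_order`, and `wide_df` are made up for illustration. It also shows pivot going in the opposite direction, from the long form back to one column per medicine.

```python
import pandas as pd

# Tiny illustrative frame: two records, two medicine columns
df = pd.DataFrame({
    "Record": [1, 2],
    "Medicine_1": ["Effective", "Normal"],
    "Medicine_2": ["Normal", "Effective"],
})

# melt walks the frame column by column
long_df = df.melt(id_vars=["Record"], var_name="Name", value_name="Value")
melt_order = long_df["Name"].tolist()

# stack walks the frame row by row
stack_order = df.set_index("Record").stack().index.get_level_values(1).tolist()

# pivot goes the other way: the values of "Name" become columns again
wide_df = long_df.pivot(index="Record", columns="Name", values="Value")

print(melt_order)
print(stack_order)
print(wide_df)
```

Here `melt_order` groups all Medicine_1 rows before the Medicine_2 rows, while `stack_order` alternates per record, which is why the melt route needs a sort and the stack route does not.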
【Solution 3】:

This is also fairly simple in PySpark using stack.

# create sample data
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
panda_df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
df = spark.createDataFrame(panda_df)

# calculate: stack(n, label1, col1, label2, col2, ...) emits one row per (label, column) pair
df.select("Hospital","Hospital Address",
          expr("stack(4, 'Medicine_1', Medicine_1, 'Medicine_2', Medicine_2, "
               "'Medicine_3', Medicine_3, 'Medicine_4', Medicine_4) as (MedicinName, Effectiveness)")
         ).where("Effectiveness is not null").show()

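Generating the query dynamically for a large number of columns

The main idea here is to build stack(x, a, b, c) dynamically. We can use Python string formatting to build the dynamic string.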

index_cols = ["Hospital", "Hospital Address"]
drop_cols = ['Record']
# Select all columns that need to be unpivoted into rows
pivot_cols = [c for c in df.columns if c not in index_cols + drop_cols]
# Build stack(n, 'name1', `name1`, 'name2', `name2`, ...) dynamically, which is
# the same expression as the hand-written one above.
# Backticks matter in selectExpr: an unquoted "Hospital Address" is parsed as
# column Hospital aliased to Address.
pairs = ", ".join("'{0}', `{0}`".format(c) for c in pivot_cols)
stackexpr = "stack({0}, {1}) as (MedicinName, Effectiveness)".format(len(pivot_cols), pairs)
df.selectExpr(*["`{0}`".format(c) for c in index_cols], stackexpr).show()

Output:

+----------------+--------------------+-----------+-------------+
|        Hospital|    Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
|       Red Cross|     1234 Street 429| Medicine_1|    Effective|
|       Red Cross|     1234 Street 429| Medicine_2|    Effective|
|       Red Cross|     1234 Street 429| Medicine_3|       Normal|
|       Red Cross|     1234 Street 429| Medicine_4|    Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1|     Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4|    Effective|
|General Hospital|994 Random Street...| Medicine_1|       Normal|
|General Hospital|994 Random Street...| Medicine_2|    Effective|
|General Hospital|994 Random Street...| Medicine_3|       Normal|
|General Hospital|994 Random Street...| Medicine_4|    Effective|
+----------------+--------------------+-----------+-------------+

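【Discussion】:

  • Hey Venky, thanks for showing me the stack function in PySpark. In my actual dataframe there are about 20+ medicines. Writing 'Medicine_1',Medicine_1 for each of them would take a long time. Is it possible to skip the alias and pass a list?
  • @Anonymous I have updated my answer to handle a large number of columns and generate the expression dynamically. It would be cleaner/easier if stack were a PySpark function rather than a SQL function.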
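
One way to hide the string building from the dynamic query is a small helper. The sketch below is plain Python and needs no Spark session. `build_stack_expr` is a hypothetical name, not a PySpark function, and it assumes column names contain no single quotes or backticks, since it does no escaping.

```python
def build_stack_expr(columns, name_col="MedicinName", value_col="Effectiveness"):
    """Return a Spark SQL stack(...) expression that unpivots `columns`."""
    if not columns:
        raise ValueError("need at least one column to unpivot")
    # Each column contributes a pair: its name as a string literal, then a
    # backtick-quoted reference to the column itself.
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in columns)
    return "stack({0}, {1}) as ({2}, {3})".format(
        len(columns), pairs, name_col, value_col)


# Works the same for 4 or 20+ medicine columns
medicine_cols = ["Medicine_{0}".format(i) for i in range(1, 5)]
stack_sql = build_stack_expr(medicine_cols)
print(stack_sql)

# With a live Spark session the string would be passed on, for example:
# df.selectExpr("Hospital", "`Hospital Address`", stack_sql)
```

Keeping the helper free of Spark imports means the generated SQL can be inspected or unit-tested before it is handed to selectExpr.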