How to find the max value of all columns in a Spark DataFrame [duplicate]: answers

【Question title】: How to find the max value of all columns in a Spark DataFrame [duplicate]
【Posted】: 2019-07-13 17:20:25
【Question】:

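I have a Spark DataFrame with about 60M rows. I want to create a single-row DataFrame that holds the maximum value of every individual column.

I tried the following options, but each has its own drawback: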

df.select(col_list).describe().filter("summary = 'max'").show()

-- This query does not return the string columns, so the result has fewer columns than my original DataFrame.
df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show()

-- This query works, but it is impractical when I have around 700 columns.

Can anyone suggest a better syntax?

【Comments on the question】:

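See stackoverflow.com/questions/33224740/…
How do you want to aggregate the string columns? What is your logic? What is the max of a string column? Please clarify.
Use selectExpr with a map over df.columns.
@cph_sto My idea is to return non-null values for the whole DataFrame. For numeric columns, max() is straightforward; for string columns there is no particular logic, so any non-null value will do.
@sramalingam24 Could you share the exact syntax? I would like to try selectExpr, but I don't know how to write it.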

Tags: python apache-spark dataframe pyspark

【Solution 1】:

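The code below works regardless of how many columns there are or how their data types are mixed.

Note: the OP suggested in her comments that, for string columns, the first non-null value should be taken when aggregating.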

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  10|   5| null|  50|
|  Bob|  15|  15|Simon|  10|
| Jack|   5|   1| Timo|   3|
+-----+----+----+-----+----+

# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
    ['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
    ['col1', 'col4']

# Keep every non-string column, preserving the original column order
# (a set difference would not guarantee the order shown below).
non_string_columns = [c for c in seq_of_columns if c not in string_columns]
print(non_string_columns)
    ['col2', 'col3', 'col5']

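For details on first() and its ignorenulls parameter, see the PySpark documentation. With ignorenulls=True, first() returns the first non-null value it sees.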

# Aggregating both string and non-string columns in a single select
df = df.select(
    *[max(col(c)).alias(c) for c in non_string_columns],
    *[first(col(c), ignorenulls=True).alias(c) for c in string_columns]
)
# Restore the original column order
df = df.select(seq_of_columns)
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  15|  15|Simon|  50|
+-----+----+----+-----+----+
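
The selectExpr suggestion from the question comments was never spelled out, so here is a minimal sketch of how it might look. It only builds the SQL expression strings, so it runs without a Spark session. The helper name `build_agg_exprs` and the `example_dtypes` list are hypothetical, introduced for illustration; the `bigint` type names are an assumption about how the example DataFrame's integer columns would be reported by `df.dtypes`. It relies on the Spark SQL forms `max(expr)` and `first(expr, isIgnoreNull)`.

```python
from typing import List, Sequence, Tuple


def build_agg_exprs(dtypes: Sequence[Tuple[str, str]]) -> List[str]:
    """Build one Spark SQL aggregate expression per column, in column order.

    `dtypes` has the shape of `df.dtypes`: (column_name, type_name) pairs.
    Hypothetical helper; assumes column names contain no backticks.
    """
    exprs = []
    for name, dtype in dtypes:
        quoted = "`{}`".format(name)  # backticks guard names with spaces or dots
        if dtype == "string":
            # first(expr, true) ignores nulls, like first(..., ignorenulls=True)
            exprs.append("first({0}, true) AS {0}".format(quoted))
        else:
            exprs.append("max({0}) AS {0}".format(quoted))
    return exprs


# Same shape as df.dtypes for the example DataFrame (type names assumed).
example_dtypes = [
    ("col1", "string"),
    ("col2", "bigint"),
    ("col3", "bigint"),
    ("col4", "string"),
    ("col5", "bigint"),
]
exprs = build_agg_exprs(example_dtypes)
for e in exprs:
    print(e)

# With a live DataFrame, the assumed usage would be:
# df.selectExpr(*build_agg_exprs(df.dtypes)).show()
```

Because the expressions are generated in the order of `df.dtypes`, the result keeps the original column order and no reordering step is needed afterwards.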

【Comments】:

Thanks, the list comprehension works! It saves us from repeating this 700 times: df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show()