PySpark in Jupyter Notebook: 'Column' object is not callable (answers)

[Question title]: PySpark in Jupyter Notebook: 'Column' object is not callable
[Posted]: 2020-10-09 06:11:17
[Question]:

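I am analyzing a dataset of Olympic results and want an overview of which athletes won the most medals. First, I created additional columns, because in the original dataset a medal is represented either by a string ("Gold", "Silver", etc.) or by NA.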

from pyspark.sql.functions import col, when

totalDF = olympicDF.count()
# The "1" is just a placeholder for now; rows that do not match get null.
medalswonDF = olympicDF\
    .where(col("Medal") != "NA")\
    .withColumn("Gold", when(col("Medal") == "Gold", "1"))\
    .withColumn("Silver", when(col("Medal") == "Silver", "1"))\
    .withColumn("Bronze", when(col("Medal") == "Bronze", "1"))\
    .withColumn("Total", when(col("Medal") != "NA", "1"))

In the next step, I want to show a table of the 25 most successful athletes (in terms of medals won):

medalswonDF.cache() # optimization to make the processing faster

medalswonDF.where(col("Medal")!="NA")\
                     .select("Name", "Gold", "Silver", "Bronze")\
                     .groupBy("Name")\
                     .agg(count("Gold")),\
                          (count("Silver")),\
                            (count("Bronze"))\
.orderBy("Gold").desc()\
.select("Name", "Gold", "Silver", "Bronze").show(25,True)

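However, I keep getting the error "TypeError: 'Column' object is not callable". I know this happens, among other reasons, when you try to apply a function that cannot be applied to a column, but as far as I can tell that should not be the cause here.

Schema for reference: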

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)
 |-- Gold: string (nullable = true)
 |-- Silver: string (nullable = true)
 |-- Bronze: string (nullable = true)
 |-- Total: string (nullable = true)

What am I doing wrong?

[Question comments]:

Tags: python pyspark jupyter-notebook pyspark-dataframes

[Solution 1]:

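The extra parenthesis after count("Gold") closes agg too early, so the Silver and Bronze counts end up outside the agg call. Because of the backslash continuations, .orderBy(...) is then chained onto the count("Bronze") column instead of onto the DataFrame, which is what raises the TypeError.

Modify the code as shown below: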

from pyspark.sql.functions import col, count

# desc() is a Column method, so it goes inside orderBy rather than after it.
medalswonDF.where(col("Medal")!="NA")\
                 .select("Name", "Gold", "Silver", "Bronze")\
                 .groupBy("Name")\
                 .agg(count("Gold").alias("Gold_count"),
                      count("Silver").alias("Silver_count"),
                      count("Bronze").alias("Bronze_count")) \
                 .orderBy(col("Gold_count").desc())\
                 .select("Name", "Gold_count", "Silver_count", "Bronze_count").show(25,True)
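
To see why the original chain produces this particular error, here is a minimal sketch that needs no Spark at all. It parses the asker's expression with Python's standard ast module and shows that it is a three-element tuple whose last element begins at count("Bronze"). FakeColumn is a hypothetical stand-in, not the real pyspark.sql.Column; it only imitates the behaviour, as far as I understand it, that an unknown attribute on a Column is treated as a field access and returns another Column, which cannot be called.

```python
import ast

# The asker's chain, joined the way the backslash continuations join it.
source = (
    'medalswonDF.where(col("Medal") != "NA")'
    '.select("Name", "Gold", "Silver", "Bronze")'
    '.groupBy("Name")'
    '.agg(count("Gold")), (count("Silver")), (count("Bronze"))'
    '.orderBy("Gold").desc()'
)

# The top-level commas make the whole statement a tuple of three expressions.
parsed = ast.parse(source, mode="eval").body
print(type(parsed).__name__, len(parsed.elts))  # Tuple 3


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (an assumption, not the real class)."""

    def __init__(self, expr):
        self.expr = expr

    def __getattr__(self, name):
        # Only reached for attributes that are not found the normal way.
        if name.startswith("__"):
            raise AttributeError(name)
        # Assumed behaviour: unknown names resolve to a nested-field column.
        return FakeColumn(f"{self.expr}[{name}]")


try:
    # Mirrors (count("Bronze")).orderBy("Gold") from the third tuple element.
    FakeColumn("count(Bronze)").orderBy("Gold")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # 'FakeColumn' object is not callable
```

The attribute lookup itself succeeds and hands back another column-like object; the failure only happens when that object is called, which is why the message mentions a callable rather than a missing attribute.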

[Comments]:

Thanks, but this leads to the exception "cannot resolve 'Gold' given input columns: [Name, count(Bronze), count(Gold), count(Silver)];; 'Sort ['Gold ASC NULLS FIRST], true". Does this have something to do with the schema or the type of the values?
Edited the code. Can you check whether this works, @JackSomeone?
Thanks, I finally went with:

medalswonDF.where(col("Medal")!="NA")\
    .select("Name", "Gold", "Silver", "Bronze")\
    .groupBy("Name")\
    .agg(count("Gold").alias("Gold_count"),
         count("Silver").alias("Silver_count"),
         count("Bronze").alias("Bronze_count")) \
    .orderBy(col("Gold_count").desc()).show(25,True)