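【Posted】: 2020-10-09 06:11:17
【Question】:
I am analyzing a dataset of Olympic results and want an overview of which athletes have won the most medals. As a first step I created additional columns, because in the original dataset a medal is represented either by a string ("Gold", "Silver", etc.) or by NA.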
from pyspark.sql.functions import col, count, when

totalDF = olympicDF.count()
medalswonDF = olympicDF\
.where(col("Medal")!="NA")\
.withColumn("Gold", when(col("Medal")== "Gold",("1")))\
.withColumn("Silver", when(col("Medal")== "Silver",("1")))\
.withColumn("Bronze", when(col("Medal")== "Bronze",("1")))\
.withColumn("Total", when(col("Medal")!= "NA", ("1")))  # the "1" is just a placeholder for now
In the next step I want to show a table of the 25 most successful athletes (in terms of medals won):
medalswonDF.cache() # optimization to make the processing faster
medalswonDF.where(col("Medal")!="NA")\
.select("Name", "Gold", "Silver", "Bronze")\
.groupBy("Name")\
.agg(count("Gold")),\
(count("Silver")),\
(count("Bronze"))\
.orderBy("Gold").desc()\
.select("Name", "Gold", "Silver", "Bronze").show(25,True)
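However, I keep getting the error "TypeError: 'Column' object is not callable". I know this can happen, among other causes, when you try to apply a function that cannot be applied to a column, but as far as I understand that should not be the reason here.
Schema for reference: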
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Age: integer (nullable = true)
|-- Height: integer (nullable = true)
|-- Weight: integer (nullable = true)
|-- Team: string (nullable = true)
|-- NOC: string (nullable = true)
|-- Games: string (nullable = true)
|-- Year: string (nullable = true)
|-- Season: string (nullable = true)
|-- City: string (nullable = true)
|-- Sport: string (nullable = true)
|-- Event: string (nullable = true)
|-- Medal: string (nullable = true)
|-- Gold: string (nullable = true)
|-- Silver: string (nullable = true)
|-- Bronze: string (nullable = true)
|-- Total: string (nullable = true)
What am I doing wrong?
【Comments】:
Tags: python pyspark jupyter-notebook pyspark-dataframes
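One plausible reading of the error is that the closing parenthesis after `count("Gold")` ends the `agg(...)` call early, so the rest of the statement is parsed as a tuple and `.orderBy` ends up attached to `count("Bronze")`, which is a `Column`. The sketch below checks that reading with Python's standard `ast` module only, so it needs no Spark installation. The `snippet` string is a shortened stand-in for the statement in the question, not the original code, and it assumes Python 3.8 or later.

```python
import ast

# Shortened stand-in for the failing statement; `df` is a placeholder name.
# Nothing here imports or runs Spark: the string is only parsed, never executed.
snippet = (
    'df.groupBy("Name")'
    '.agg(count("Gold")), '
    '(count("Silver")), '
    '(count("Bronze")).orderBy("Gold").desc()'
)

top = ast.parse(snippet, mode="eval").body

# The statement is a 3-element tuple, not one chained method call.
print(type(top).__name__, len(top.elts))  # Tuple 3

# Unwrap the last tuple element: <receiver>.orderBy("Gold").desc()
desc_call = top.elts[-1]
order_by_call = desc_call.func.value
receiver = order_by_call.func.value  # the expression .orderBy is looked up on

# .orderBy is attached to count("Bronze"), i.e. to a Column, not a DataFrame.
print(order_by_call.func.attr, receiver.func.id, receiver.args[0].value)  # orderBy count Bronze
```

If that reading is right, PySpark treats the unknown attribute `orderBy` on a `Column` as field access and returns another `Column`, and calling that result raises the TypeError. A probable fix is to pass all three counts inside a single `agg(...)` call, give each one an alias such as `count("Gold").alias("Gold")` so the later `select` can find it, and sort with `.orderBy(col("Gold").desc())`, since `desc()` is a `Column` method rather than a `DataFrame` method.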