【Question title】: PySpark in Jupyter Notebook: 'Column' object is not callable
【Posted】: 2020-10-09 06:11:17
【Question description】:

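I am analyzing a dataset of Olympic results and want an overview of which athletes won the most medals. First, I created additional columns, because in the original dataset a medal is recorded as a string ("Gold", "Silver", etc.) or NA.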

from pyspark.sql.functions import col, count, when

totalDF = olympicDF.count()
medalswonDF = olympicDF\
    .where(col("Medal") != "NA")\
    .withColumn("Gold", when(col("Medal") == "Gold", "1"))\
    .withColumn("Silver", when(col("Medal") == "Silver", "1"))\
    .withColumn("Bronze", when(col("Medal") == "Bronze", "1"))\
    .withColumn("Total", when(col("Medal") != "NA", "1"))  # the "1" is just a placeholder for now

Next, I want to show a table of the 25 most successful athletes (in terms of medals won):

medalswonDF.cache() # optimization to make the processing faster

medalswonDF.where(col("Medal")!="NA")\
                     .select("Name", "Gold", "Silver", "Bronze")\
                     .groupBy("Name")\
                     .agg(count("Gold")),\
                          (count("Silver")),\
                            (count("Bronze"))\
.orderBy("Gold").desc()\
.select("Name", "Gold", "Silver", "Bronze").show(25,True)

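However, I keep getting the error "TypeError: 'Column' object is not callable". I know this happens, among other reasons, when you try to apply a function that cannot be applied to a column, but as far as I can tell that should not be the cause here.

Schema for reference: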

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)
 |-- Gold: string (nullable = true)
 |-- Silver: string (nullable = true)
 |-- Bronze: string (nullable = true)
 |-- Total: string (nullable = true)

What am I doing wrong?

【Question discussion】:

    Tags: python pyspark jupyter-notebook pyspark-dataframes


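    【Solution 1】:

    You are closing agg with an extra parenthesis before it should be closed: the call ends right after count("Gold"), so the other two counts fall outside agg and .orderBy ends up being called on the count("Bronze") column rather than on the DataFrame.

    Modify the code as shown below: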

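    from pyspark.sql.functions import col, count

    # agg(...) now wraps all three counts, each with an alias to sort and select by.
    # desc() is applied to the column expression, not to the DataFrame.
    medalswonDF.where(col("Medal") != "NA")\
        .select("Name", "Gold", "Silver", "Bronze")\
        .groupBy("Name")\
        .agg(count("Gold").alias("Gold_count"),
             count("Silver").alias("Silver_count"),
             count("Bronze").alias("Bronze_count"))\
        .orderBy(col("Gold_count").desc())\
        .select("Name", "Gold_count", "Silver_count", "Bronze_count").show(25, True)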
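
To see why a misplaced parenthesis surfaces as "'Column' object is not callable", here is a minimal sketch that runs without Spark. It assumes that a PySpark Column treats an unknown attribute name as a field lookup and returns another Column instead of raising AttributeError. FakeColumn and the count function below are hypothetical stand-ins written for this illustration, not PySpark's own classes.

```python
class FakeColumn:
    """Hypothetical stand-in for a PySpark Column (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only reached for attributes that do not exist on the object.
        # Assumption: like Column, return a new column for the "field"
        # instead of raising AttributeError.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def count(name):
    # Stand-in for pyspark.sql.functions.count: just builds a column.
    return FakeColumn(f"count({name})")


def run_broken_chain():
    try:
        # Same shape as the failing statement once agg(...) is closed early:
        # the line becomes a tuple, and .orderBy lands on the last element.
        count("Gold"), (count("Silver")), (count("Bronze")).orderBy("Gold")
    except TypeError as exc:
        return str(exc)
    return None


message = run_broken_chain()
print(message)
```

The attribute lookup succeeds and yields a column-like object, so the failure only appears when that object is called, which is why the message mentions a callable rather than a missing attribute.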
    

    【Discussion】:

    • Thanks, but this leads to the exception "cannot resolve 'Gold' given input columns: [Name, count(Bronze), count(Gold), count(Silver)];; 'Sort ['Gold ASC NULLS FIRST], true". Does this have to do with the schema or the type of the values?
    • Edited the code. Can you check whether this works, @JackSomeone?
    • Thanks, I finally went with medalswonDF.where(col("Medal")!="NA")\ .select("Name", "Gold", "Silver", "Bronze")\ .groupBy("Name")\ .agg(count("Gold").alias("Gold_count"), count("Silver").alias("Silver_count"), count("Bronze").alias("Bronze_count")) \ .orderBy(col("Gold_count").desc()).show(25,True)