【Question title】: How to count the frequency of elements in a column of lists in a PySpark DataFrame?
【Posted】: 2021-12-27 03:14:12
【Question】:

I have a PySpark DataFrame that looks like this:

data2 = [("James",["A x","B z","C q","D", "E"]),
    ("Michael",["A x","C","E","K", "D"]),
    ("Robert",["A y","R","B z","B","D"]),
    ("Maria",["X","A y","B z","F","B"]),
    ("Jen",["A","B","C q","F","R"])
  ]

 
df2 = spark.createDataFrame(data2, ["Name", "My_list" ])

df2
    Name    My_list
0   James   [A x, B z, C q, D, E]
1   Michael     [A x, C, E, K, D]
2   Robert  [A y, R, B z, B, D]
3   Maria   [X, A y, B z, F, B]
4   Jen     [A, B, C q, F, R]

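I would like to count how often each element occurs in the "My_list" column and sort the result in descending order of count. For example,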

'A x' appeared -> P times, 
'B z' appeared -> Q times, and so on. 

Could someone shed some light on this? Thank you very much.

【Comments】:

    Tags: list pyspark apache-spark-sql frequency-analysis


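    【Answer 1】:

    The following command explodes the array so that each element gets its own row, then counts the rows for each distinct element: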

    import pyspark.sql.functions as F

    # Explode the array into one row per element, then count rows per element
    df_ans = (df2
               .withColumn("explode", F.explode("My_list"))
               .groupBy("explode")
               .count()
               .orderBy(F.desc("count")))

    df_ans.show()

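    The result, df_ans, has one row per distinct element (column "explode") together with its count, sorted by count in descending order.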
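
As a cross-check on the small sample, the same explode-and-count logic can be sketched in plain Python with collections.Counter. This is only an illustration that runs locally on the sample rows, not a replacement for the Spark job. The function name element_frequencies and the alphabetical tie-break among equal counts are assumptions added here; they are not part of the original answer, and Spark itself does not guarantee any order among ties.

```python
from collections import Counter
from itertools import chain

# Sample rows copied from the question: (Name, My_list)
data2 = [
    ("James", ["A x", "B z", "C q", "D", "E"]),
    ("Michael", ["A x", "C", "E", "K", "D"]),
    ("Robert", ["A y", "R", "B z", "B", "D"]),
    ("Maria", ["X", "A y", "B z", "F", "B"]),
    ("Jen", ["A", "B", "C q", "F", "R"]),
]


def element_frequencies(rows):
    """Hypothetical helper: flatten every list, count each element,
    and sort by count (descending), then by element (ascending)."""
    # chain.from_iterable plays the role of F.explode: one item per element
    counts = Counter(chain.from_iterable(items for _, items in rows))
    # Counter does the groupBy + count; sorted does the orderBy
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


frequencies = element_frequencies(data2)

for element, count in frequencies:
    print(f"{element!r} appeared -> {count} times")
```

On this sample, "B", "B z" and "D" each appear three times, which is what the top rows of df_ans should also report (possibly in a different order among themselves).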

    【Comments】:
