【问题标题】:PySpark Groupby and Filter based on RegexPySpark Groupby 和基于 Regex 的过滤器
【发布时间】:2020-06-24 01:46:55
【问题描述】:

我有一个带有

的 PySpark df
from pyspark.sql import functions as F
print(df.groupBy(['issue_month', 'loan_status']).count().show())

+-----------+------------------+-----+
|issue_month|       loan_status|count|
+-----------+------------------+-----+
|         06|        Fully Paid|12632|
|         03|        Fully Paid|16243|
|         07|           Default|    1|
|         02|        Fully Paid|16467|
|         06|           Default|    1|
|         07|   In Grace Period|  289|
|         01|       Charged Off| 5975|
|         05|       Charged Off| 5209|
|         02|Late (31-120 days)|  184|
|         11|           Current|17525|
|         12|   In Grace Period|  369|
|         10|        Fully Paid|19222|
|         04|        Fully Paid|16802|
|         07|       Charged Off| 7072|
|         06|       Charged Off| 4589|
|         04| Late (16-30 days)|   98|
|       null|              null|    2|
|         10|Late (31-120 days)|  621|
|         07| Late (16-30 days)|  125|
|         10|           Default|    2|
+-----------+------------------+-----+

我想只过滤loan_status is late,它可以是值“Late(16-30 天)”或“Late(31-120 天)”。所以我尝试了:

print(df.groupBy(['issue_month', 'loan_status']).count().filter((F.col('loan_status')=='Late (31-120 days)')|F.col('loan_status')=='Late (16-30 days)').show())

这失败了,但无论如何,它很脏。我想在熊猫中做类似的事情,我可以简单地过滤正则表达式。在我的情况下,它会是这样的:

F.col('loan_status').contains("Late")

【问题讨论】:

  • 试试:df.groupby(..).count().filter("loan_status rlike 'Late'")

标签: python dataframe apache-spark pyspark filtering


【解决方案1】:

Pyspark 也有 contains()(或)like 函数,我们可以在 .filter()

中使用

Example:

#sample data
df.show()
#+-----------+------------------+
#|issue_month|       loan_status|
#+-----------+------------------+
#|         10|        Fully Paid|
#|         10|           Default|
#|         10|Late (31-120 days)|
#+-----------+------------------+

#in filter query convert loan_status to lower case and look for substring late.
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).contains("late")).\
show()

#by using like function
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).like("late%")).\
show()

#i would suggest filtering rows before groupby will significantly increases the performance in bigdata!!
df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
show()

#+-----------+------------------+-----+
#|issue_month|       loan_status|count|
#+-----------+------------------+-----+
#|         10|Late (31-120 days)|    1|
#+-----------+------------------+-----+

我们可以使用 .agg(sum("count")) 来获取 count 的总和,而不考虑 issue_month。

Example:

from pyspark.sql.functions import sum as _sum
df.show()
#+-----------+------------------+
#|issue_month|       loan_status|
#+-----------+------------------+
#|         10|        Fully Paid|
#|         10|           Default|
#|         11|Late (31-120 days)|
#|         11|Late (31-120 days)|
#|         10| Late (16-30 days)|
#+-----------+------------------+

df.filter(lower(col("loan_status")).contains("late")).\
groupBy("issue_month","loan_status").\
count().\
agg(_sum("count").alias("sum")).\
show()

#+---+
#|sum|
#+---+
#|  3|
#+---+

df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
groupBy("loan_status").\
agg(_sum("count").alias("sum_count")).\
show()

#same result will get by using one group too
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("loan_status").\
agg(count("*").alias("sum_count")).\
show()

#+------------------+---------+
#|       loan_status|sum_count|
#+------------------+---------+
#|Late (31-120 days)|        2|
#| Late (16-30 days)|        1|
#+------------------+---------+

更新:

df.filter(lower(col("loan_status")).contains("late")).\
groupBy("issue_month").\
agg(count("*").alias("sum_count")).\
show()

#+-----------+---------+
#|issue_month|sum_count|
#+-----------+---------+
#|         10|        1|
#|         11|        2|
#+-----------+---------+

【讨论】:

  • 这些很好用,但我如何按每个月获得一笔款项,无论是晚 31-120 天还是晚 16-30 天?有没有办法在不做另一个 groupby 的情况下做到这一点?
  • @Odisseo,请在答案中查看我的更新
  • mhmhmh 我收到“TypeError: unsupported operand type(s) for +: 'int' and 'str'”
  • @Odisseo,我认为你没有使用 spark sum 函数尝试导入 from pyspark.sql.functions import sum as _sum 然后使用 _sum 而不是 sum .. 我已经更新了答案太..`
  • 确实有效!但这不是我要找的……对不起。我正在寻找的是一个 df,就像你在第一个 group by 中的那个一样,除了在 30 到 120 之间的行和 16 到 30 之间的行之间没有单独的计数。
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2017-01-09
  • 2020-10-10
  • 1970-01-01
相关资源
最近更新 更多