[Posted]: 2020-06-24 01:46:55
[Question]:
I have a PySpark df:
from pyspark.sql import functions as F
df.groupBy(['issue_month', 'loan_status']).count().show()
+-----------+------------------+-----+
|issue_month|       loan_status|count|
+-----------+------------------+-----+
|         06|        Fully Paid|12632|
|         03|        Fully Paid|16243|
|         07|           Default|    1|
|         02|        Fully Paid|16467|
|         06|           Default|    1|
|         07|   In Grace Period|  289|
|         01|       Charged Off| 5975|
|         05|       Charged Off| 5209|
|         02|Late (31-120 days)|  184|
|         11|           Current|17525|
|         12|   In Grace Period|  369|
|         10|        Fully Paid|19222|
|         04|        Fully Paid|16802|
|         07|       Charged Off| 7072|
|         06|       Charged Off| 4589|
|         04| Late (16-30 days)|   98|
|       null|              null|    2|
|         10|Late (31-120 days)|  621|
|         07| Late (16-30 days)|  125|
|         10|           Default|    2|
+-----------+------------------+-----+
I want to keep only the rows where loan_status is late, which can be either "Late (16-30 days)" or "Late (31-120 days)". So I tried:
print(df.groupBy(['issue_month', 'loan_status']).count().filter((F.col('loan_status')=='Late (31-120 days)')|F.col('loan_status')=='Late (16-30 days)').show())
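This fails, and in any case it is ugly. I'd like to do something like what I would do in pandas, where I can simply filter with a regular expression. In my case it would be something like: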
F.col('loan_status').contains("Late")
[Comments]:
- Try:
df.groupby(..).count().filter("loan_status rlike 'Late'")
Tags: python dataframe apache-spark pyspark filtering