Filter rows by distinct values in one column in PySpark

【Question title】: Filter rows by distinct values in one column in PySpark
【Posted】: 2017-01-10 07:03:46
【Question】:

Suppose I have the following table:

+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|    tia1.eskimo.com |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|      ras38.srv.net |/elv/DELTA/uncons...|   404|           0|1995-08-01 01:05:...|
| cs1-06.leh.ptd.net |                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+

How can I filter this table in PySpark so that it contains only distinct paths? The result should still include all columns.

【Comments】:

Tags: apache-spark dataframe pyspark apache-spark-sql spark-dataframe

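【Solution 1】:

If you want to keep only rows whose values in a particular column are all distinct, call the dropDuplicates method on the DataFrame. It returns a new DataFrame that still has every column. In my example it looks like this: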

dataFrame = ... 
dataFrame.dropDuplicates(['path'])

Here path is the column name.

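【Discussion】:

Among duplicate records, how does dropDuplicates decide which record to drop?
@prudhviIndana You cannot adjust this behaviour. If you need that, you should probably use a different query, for example one built with filter / groupby.
That is not correct. For an example of how to keep only the first occurrence in an ordered DataFrame, see: stackoverflow.com/a/54738843/4166885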

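【Solution 2】:

As for controlling which records are kept and which are discarded: if you can express your condition as a window expression, you can use something like the following. This is (more or less) Scala, but I would expect the same to be possible in PySpark.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// assumes spark.implicits._ is in scope for the 'symbol column syntax
val window = Window.partitionBy('columns, 'to, 'make, 'unique).orderBy('conditionToPutRowToKeepFirst)

dataframe.withColumn("row_number", row_number().over(window)).where('row_number === 1).drop("row_number")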

【Discussion】: