【Posted】: 2017-09-07 17:45:00
【Question】:
I have a PySpark DataFrame with starttime and stoptime columns, along with other columns whose values get updated:
|startime |stoptime |hour |minute |sec |sip |dip |sport|dport|proto|pkt |byt |
|1504766585|1504801216|16 |20 |16 |192.168.0.11 |23.204.108.58 |51249|80 |6 |0 |0 |
|1504766585|1504801216|16 |20 |16 |192.168.0.11 |23.204.108.58 |51249|80 |6 |0 |0 |
|1504781751|1504801216|16 |20 |16 |192.168.0.11 |23.72.38.96 |51252|80 |6 |0 |0 |
|1504781751|1504801216|16 |20 |16 |192.168.0.11 |23.72.38.96 |51252|80 |6 |0 |0 |
|1504766585|1504801336|16 |22 |16 |192.168.0.11 |23.204.108.58 |51249|80 |6 |0 |0 |
|1504766585|1504801336|16 |22 |16 |192.168.0.11 |23.204.108.58 |51249|80 |6 |0 |0 |
|1504781751|1504801336|16 |22 |16 |192.168.0.11 |23.72.38.96 |51252|80 |6 |0 |0 |
|1504781751|1504801336|16 |22 |16 |192.168.0.11 |23.72.38.96 |51252|80 |6 |0 |0 |
In this example, I want to select all rows that have the most recent stoptime; the values in all the other columns are duplicates.
【Discussion】:
Tags: pyspark spark-dataframe pyspark-sql