[Spark][Python] sortByKey example:
[training@localhost ~]$ hdfs dfs -cat test02.txt
00002 sku010
00001 sku933
00001 sku022
00003 sku888
00004 sku411
00001 sku912
00001 sku331
[training@localhost ~]$
mydata001=sc.textFile("test02.txt")
mydata002=mydata001.map(lambda line: line.split(' '))
mydata002.take(3)
Out[4]: [[u'00002', u'sku010'], [u'00001', u'sku933'], [u'00001', u'sku022']]
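Note that `line.split(' ')` returns a two-element list, not a tuple; PySpark's pair-RDD operations such as sortByKey only need each record to be indexable, and `kv[0]` is taken as the key. A quick local check (plain Python, no cluster needed):

```python
# One line from test02.txt, split the same way the map() above does.
line = "00002 sku010"
kv = line.split(' ')
print(kv)     # ['00002', 'sku010']
print(kv[0])  # '00002' -- the element sortByKey will sort on
```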
mydata003=mydata002.sortByKey()
In [9]: mydata003.take(5)
Out[9]:
[[u'00001', u'sku933'],
[u'00001', u'sku022'],
[u'00001', u'sku912'],
[u'00001', u'sku331'],
[u'00002', u'sku010']]
In [10]:
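The transcript above can be reproduced without a cluster: sortByKey() orders records by the first element of each pair, and (like Python's stable sorted()) keeps values under the same key in their original relative order, which is why sku933 still comes before sku022 in Out[9]. A local sketch (assumption: the same seven records as test02.txt, sorted with plain Python rather than Spark):

```python
# The records from test02.txt after the map(split) step.
pairs = [
    ["00002", "sku010"],
    ["00001", "sku933"],
    ["00001", "sku022"],
    ["00003", "sku888"],
    ["00004", "sku411"],
    ["00001", "sku912"],
    ["00001", "sku331"],
]

# Sort by the key (first element); sorted() is stable, so the four
# "00001" values keep their original order, matching Out[9] above.
result = sorted(pairs, key=lambda kv: kv[0])
print(result[:5])
```

In PySpark itself the same call accepts an `ascending` flag: `mydata002.sortByKey(ascending=False)` would return the keys in descending order instead.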
API reference:
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD