【发布时间】:2019-04-07 13:11:27
【问题描述】:
我希望能够在对其中一个值应用转换时选择 RDD 的多个列。我能够 - 选择特定的列 - 在其中一列上应用转换
我无法同时应用它们
1) 选择特定列
from pyspark import SparkContext
logFile = "/FileStore/tables/tendulkar.csv"
rdd = sc.textFile(logFile)
rdd.map(lambda line: (line.split(",")[0],line.split(",")[1],line.split(",")
[2])).take(4)
[('Runs', 'Mins', 'BF'),
('15', '28', '24'),
('DNB', '-', '-'),
('59', '254', '172')]
2) 对第一列应用转换
df=(rdd.map(lambda line: line.split(",")[0])
.filter(lambda x: x !="DNB")
.filter(lambda x: x!= "TDNB")
.filter(lambda x: x!="absent")
.map(lambda x: x.replace("*","")))
df.take(4)
['Runs', '15', '59', '8']
我尝试将它们一起做如下
rdd.map(lambda line: ( (line.split(",")[0]).filter(lambda
x:x!="DNB"),line.split(",")[1],line.split(",")[2])).count()
我收到一个错误
Py4JJavaError Traceback (most recent call last)
<command-2766458519992264> in <module>()
10 .map(lambda x: x.replace("*","")))
11
---> 12 rdd.map(lambda line: ( (line.split(",")[0]).filter(lambda x:x!="DNB"),line.split(",")[1],line.split(",")[2])).count()
/databricks/spark/python/pyspark/rdd.py in count(self)
1067 3
1068 """
-> 1069 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1070
1071 def stats(self):
请帮忙
【问题讨论】: