[Posted]: 2020-01-11 12:53:51
[Question]:
I have two questions that I think should be basic for anyone with PySpark experience, but I can't seem to solve them.
Sample entries from my CSV file:
"dfg.AAIXpWU4Q","1"
"cvbc.AAU3aXfQ","1"
"T-L5aw0L1uT_OfFyzbk","1"
"D9TOXY7rA_LsnvwQa-awVk","2"
"JWg8_0lGDA7OCwWcH_9aDc","2"
"ewrq.AAbRaACr2tVh5wA","1"
"ewrq.AALJWAAC-Qku3heg","1"
"ewrq.AADStQqmhJ7A","2"
"ewrq.AAEAABh36oHUNA","1"
"ewrq.AALJABfV5u-7Yg","1"
I created the following DataFrame:
>>> df2.show(3)
+-------+----+
|user_id|hits|
+-------+----+
|"aYk...| "7"|
|"yDQ...| "1"|
|"qUU...|"13"|
+-------+----+
only showing top 3 rows
First, is this the right way to convert the hits column to IntegerType()? And why do all the values become null?
>>> df2 = df2.withColumn("hits", df2["hits"].cast(IntegerType()))
>>> df2.show(3)
+-------+----+
|user_id|hits|
+-------+----+
|"aYk...|null|
|"yDQ...|null|
|"qUU...|null|
+-------+----+
only showing top 3 rows
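One plausible reading of the output above: the values in hits still contain literal double-quote characters (show() prints "7", quotes included), and a string like that cannot be parsed as an integer, so the cast gives null. The plain-Python sketch below only illustrates that idea; to_int_or_none is a hypothetical helper, not part of PySpark.

```python
from typing import Optional


def to_int_or_none(raw: str) -> Optional[int]:
    """Strip whitespace and surrounding double quotes, then parse as int.

    Returns None when the cleaned text is not an integer, loosely mirroring
    the null that appears after a failed cast in the output above.
    (Hypothetical helper, for illustration only.)
    """
    cleaned = raw.strip().strip('"')
    try:
        return int(cleaned)
    except ValueError:
        return None


# The raw field still carries its quotes, so a direct int() call would fail;
# removing the quotes first makes the value parseable.
print(to_int_or_none('"7"'))   # 7
print(to_int_or_none('"13"'))  # 13
print(to_int_or_none('abc'))   # None
```

On the DataFrame itself, the same idea could probably be expressed with regexp_replace from pyspark.sql.functions (the post is tagged regexp-replace), for example df2.withColumn("hits", regexp_replace("hits", '"', "").cast("int")). That line is untested against this data.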
Second, I need to sort this list in descending order of the hits column. So I tried this:
>>> df1 = df2.sort(col('hits').desc())
>>> df1.show(20)
But I get the following error:
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 18 values are provided.
I suspect this is because of how I created the DataFrame:
>>> rdd = sc.textFile("/path/to/file/*")
>>> rdd.take(2)
['"7wAfdgdfgd","7"', '"1x3Qdfgdf","1"']
>>> my_df = rdd.map(lambda x: (x.split(","))).toDF()
>>> df2 = my_df.selectExpr("_1 as user_id", "_2 as hits")
>>> df2.show(3)
+-------+----+
|user_id|hits|
+-------+----+
|"aYk...| "7"|
|"yDQ...| "1"|
|"qUU...|"13"|
+-------+----+
only showing top 3 rows
I suspect some rows contain extra commas. How can I avoid this, or what is the best way to read this file?
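If extra commas really are the cause, one option is to parse each line with a CSV-aware parser instead of x.split(","), and to drop lines that do not have exactly two fields. The sketch below uses Python's standard csv module. parse_line and EXPECTED_FIELDS are hypothetical names, and the last two sample lines are made up to show a quoted comma and a surplus field; they are not from the real file.

```python
import csv

EXPECTED_FIELDS = 2  # user_id and hits (assumed from the DataFrame above)


def parse_line(line):
    """Parse one CSV line; return (user_id, hits) or None if malformed."""
    # csv.reader removes the surrounding quotes and keeps commas that
    # appear inside a quoted field as part of that field.
    fields = next(csv.reader([line]))
    if len(fields) != EXPECTED_FIELDS:
        return None
    user_id, hits = fields
    try:
        return (user_id, int(hits))
    except ValueError:
        return None


sample = [
    '"7wAfdgdfgd","7"',   # from rdd.take(2) in the question
    '"1x3Qdfgdf","1"',    # from rdd.take(2) in the question
    '"a,b","3"',          # made-up line: comma inside a quoted field
    '"x","1","extra"',    # made-up line: too many fields, gets dropped
]
rows = [r for r in map(parse_line, sample) if r is not None]
print(rows)  # [('7wAfdgdfgd', 7), ('1x3Qdfgdf', 1), ('a,b', 3)]
```

With the RDD from the question, this might be applied as rdd.map(parse_line).filter(lambda r: r is not None).toDF(["user_id", "hits"]). Alternatively, spark.read.csv with an explicit schema treats double quotes as the quote character by default and may make the manual parsing unnecessary. Neither was run against the actual file.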
[Discussion]:
Tags: python apache-spark pyspark regexp-replace