【Posted】: 2021-11-27 19:20:30
【Question】:
I am getting this error:
line 23, in parseRating
IndexError: list index out of range
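...on any attempt to call .collect(), .count(), and so on. The last line, df3.collect(), throws this error, yet every .show() call works. I don't think the data is the problem, but I could be wrong.
I'm new to this and really don't know what is going on. Any help is much appreciated.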
import os
from os import remove, removedirs
from os.path import join, isfile, dirname
from pyspark.sql.functions import col, explode
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession

def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    return int(fields[3]), int(fields[0]), int(fields[1]), float(fields[2])
    #return int(fields[0]), int(fields[1]), float(fields[2])

if __name__ == "__main__":
    # set up environment
    spark = SparkSession.builder \
        .master("local") \
        .appName("Movie Recommendation Engine") \
        .config("spark.driver.memory", "16g") \
        .getOrCreate()
    sc = spark.sparkContext

    # load personal ratings
    #myRatings = loadRatings(os.path.abspath('personalRatings.txt'))
    myRatingsRDD = sc.textFile("personalRatings.txt").map(parseRating)
    ratings = sc.textFile("ratings.dat").map(parseRating)

    df1 = spark.createDataFrame(myRatingsRDD,["timestamp","userID","movieID","rating"])
    df1.show()
    df2 = spark.createDataFrame(ratings,["timestamp","userID","movieID","rating"])
    df2.show()

    df3 = df1.union(df2)
    df3.show()
    df3.printSchema()

    df3 = df3.\
        withColumn('userID', col('userID').cast('integer')).\
        withColumn('movieID', col('movieID').cast('integer')).\
        withColumn('rating', col('rating').cast('float')).\
        drop('timestamp')
    df3.show()

    ratings = df3
    df3.collect()
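【Comments】:
-
Why use an RDD? Use spark.read.text('personalRatings.txt') to get a DataFrame, then apply a function to each row.
-
My guess is that the index into the "fields" list is out of range because, after the split, the list has no element at, for example, fields[3].
-
show() prints 20 rows. collect() or count() will materialize the entire dataset. The error means that at least one line further down, beyond the first 20 rows, is malformed and cannot be parsed the way you expect.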
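The point made in the last comment can be illustrated without Spark. The following is a minimal sketch in plain Python, not the asker's actual data: the sample records and the helper find_bad_lines are hypothetical, and a lazy map stands in for the RDD, which is only a rough analogy for how show() and collect() behave. It shows that parsing the first 20 records succeeds while parsing everything hits a malformed line, and then lists the offending lines.

```python
from itertools import islice


def parse_rating(line):
    """Same parsing logic as parseRating in the question."""
    fields = line.strip().split("::")
    return int(fields[3]), int(fields[0]), int(fields[1]), float(fields[2])


def find_bad_lines(lines, expected_fields=4):
    """Hypothetical helper: return (line_number, raw_line) pairs that do not
    split into exactly expected_fields parts."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if len(line.strip().split("::")) != expected_fields
    ]


# Made-up sample data: 25 well-formed records, then a blank line at position 26.
lines = [f"{user}::{user + 100}::4.0::978300760" for user in range(1, 26)]
lines.append("")

# Lazy parsing, loosely like an RDD map: nothing is parsed until requested.
lazy = map(parse_rating, lines)
first_20 = list(islice(lazy, 20))  # roughly what show() needs: 20 rows only

# Parsing every record, as collect() or count() must, reaches the bad line.
try:
    list(map(parse_rating, lines))
    failed = False
except IndexError:
    failed = True

bad = find_bad_lines(lines)
print(len(first_20), failed, bad)
```

With the RDD from the question, a similar check could be run as sc.textFile("ratings.dat").filter(lambda l: len(l.strip().split("::")) != 4).take(5) to display a few offending lines before deciding whether to skip or repair them.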
Tags: python dataframe apache-spark pyspark rdd