【发布时间】:2021-10-08 01:15:23
【问题描述】:
我在 AWS EMR 上使用 pyspark(4 个 r5.xlarge 作为 4 个工作人员,每个工作人员有一个执行程序和 4 个核心),我得到了AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'。以下是引发此错误的代码的 sn-p:
search = SearchEngine(db_file_dir = "/tmp/db")
conn = sqlite3.connect("/tmp/db/simple_db.sqlite")
pdf_ = pd.read_sql_query('''select zipcode, lat, lng,
bounds_west, bounds_east, bounds_north, bounds_south from
simple_zipcode''',conn)
brd_pdf = spark.sparkContext.broadcast(pdf_)
conn.close()
@udf('string')
def get_zip_b(lat, lng):
pdf = brd_pdf.value
out = pdf[(np.array(pdf["bounds_north"]) >= lat) &
(np.array(pdf["bounds_south"]) <= lat) &
(np.array(pdf['bounds_west']) <= lng) &
(np.array(pdf['bounds_east']) >= lng) ]
if len(out):
min_index = np.argmin( (np.array(out["lat"]) - lat)**2 + (np.array(out["lng"]) - lng)**2)
zip_ = str(out["zipcode"].iloc[min_index])
else:
zip_ = 'bad'
return zip_
df = df.withColumn('zipcode', get_zip_b(col("latitude"),col("longitude")))
下面是回溯,其中第 102 行,在 get_zip_b 中指的是pdf = brd_pdf.value:
21/08/02 06:18:19 WARN TaskSetManager: Lost task 12.0 in stage 7.0 (TID 1814, ip-10-22-17-94.pclc0.merkle.local, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/util.py", line 121, in wrapper
return f(*args, **kwargs)
File "/mnt/var/lib/hadoop/steps/s-1IBFS0SYWA19Z/Mobile_ID_process_center.py", line 102, in get_zip_b
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 146, in value
self._value = self.load_from_path(self._path)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 123, in load_from_path
return self.load(f)
File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 129, in load
return pickle.load(file)
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/mnt/miniconda/lib/python3.9/site-packages/pandas/core/internals/blocks.py'>
一些观察和思考过程:
1、网上搜索了一下,pyspark的AttributeError好像是driver和worker的pandas版本不匹配造成的?
2,但是我在两个不同的数据集上运行了相同的代码,一个工作没有任何错误,另一个没有,这看起来很奇怪和不确定,而且看起来错误可能不是由不匹配的 pandas 版本引起的。否则,两个数据集都不会成功。
3,然后我再次在成功的数据集上运行相同的代码,但这次使用不同的 spark 配置:将 spark.driver.memory 从 2048M 设置为 4192m,并抛出 AttributeError。
4,总而言之,我认为AttributeError与驱动程序有关。但我无法判断它们与错误消息的关系以及如何修复它:AttributeError: Can't get attribute 'new_block' on
【问题讨论】:
-
当您使用协议 = 4 在 pandas 1.3.2 上保存 pickle 文件并尝试在 pandas 1.2 上打开相同的 pickle 文件时,会出现相同的错误。截至今天,我没有找到其他地方,但这里说的是这个问题。
标签: python pandas apache-spark pyspark attributeerror