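I think the best fix is to correct the process that generates this dataset: it should not be saved as Row objects.
In PySpark, you could parse the string into multiple columns with string functions (split, regexp_extract, ...), but that can be very tedious, especially since the row contains complex objects such as Ambience.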
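To give a feel for why the string-function route gets tedious, here is a minimal sketch that only handles simple single-quoted fields such as BikeParking='True'. The helper names `quoted_field_pattern` and `extract_quoted_fields` are hypothetical, and the column name "row" is an assumption. Fields that are None, double-quoted (Alcohol), or dict-like (Ambience) would each need their own pattern.

```python
import re


def quoted_field_pattern(name: str) -> str:
    """Regex that captures the value of a single-quoted field, e.g. BikeParking='True'."""
    return re.escape(name) + r"='([^']*)'"


def extract_quoted_fields(sdf, names, source_col="row"):
    """Return one string column per requested field, extracted from the Row-like string."""
    # Imported lazily so the pattern helper can be used without Spark installed.
    from pyspark.sql import functions as F

    return sdf.select(
        *[
            # regexp_extract yields an empty string when the pattern does not match.
            F.regexp_extract(F.col(source_col), quoted_field_pattern(n), 1).alias(n)
            for n in names
        ]
    )
```

For example, `extract_quoted_fields(sdf, ["BikeParking", "BusinessAcceptsBitcoin"])` would produce two string columns, while a field stored as None would come back as an empty string rather than a null.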
Another possibility you might consider is converting the Spark DataFrame to pandas and using Python's eval (not recommended, though) to evaluate the string as a PySpark Row object:
import pandas as pd
from pyspark.sql import Row  # eval() below needs Row in scope to rebuild each object
sdf = spark.createDataFrame([
('Row(AcceptsInsurance=None, AgesAllowed=None, Alcohol="\'beer_and_wine\'", Ambience="{\'touristy\': False, \'hipster\': False, \'romantic\': False, \'divey\': False, \'intimate\': False, \'trendy\': False, \'upscale\': False, \'classy\': False, \'casual\': True}", BYOB=None, BYOBCorkage=None, BestNights=None, BikeParking=\'True\', BusinessAcceptsBitcoin=\'False\', BusinessAcceptsCreditCards=\'True\', BusinessParking="{\'garage\': False, \'street\': True, \'validated\': False, \'lot\': False, \'valet\': False}")',)
], ["row"])
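# Collect to pandas, eval each string back into a Row, then expand its fields into string columns.
# Caution: eval executes arbitrary code, so only use it on data you trust.
df = sdf.toPandas()["row"].apply(lambda x: eval(x).asDict()).apply(pd.Series).astype(str)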
sdf = spark.createDataFrame(df)
sdf.show()
#+----------------+-----------+---------------+--------------------+----+-----------+----------+-----------+----------------------+--------------------------+--------------------+
#|AcceptsInsurance|AgesAllowed|        Alcohol|            Ambience|BYOB|BYOBCorkage|BestNights|BikeParking|BusinessAcceptsBitcoin|BusinessAcceptsCreditCards|     BusinessParking|
#+----------------+-----------+---------------+--------------------+----+-----------+----------+-----------+----------------------+--------------------------+--------------------+
#|            None|       None|'beer_and_wine'|{'touristy': Fals...|None|       None|      None|       True|                 False|                      True|{'garage': False,...|
#+----------------+-----------+---------------+--------------------+----+-----------+----------+-----------+----------------------+--------------------------+--------------------+
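Since eval is not recommended, one possible alternative is to parse the string with Python's standard `ast` module and accept only literal keyword values. This is a sketch under the assumption that every string has the shape Row(name=literal, ...); the function name `parse_row_repr` is hypothetical.

```python
import ast


def parse_row_repr(text: str) -> dict:
    """Turn a string like "Row(a=None, b='x')" into {'a': None, 'b': 'x'} without eval()."""
    call = ast.parse(text, mode="eval").body
    # Accept only a call of the form Row(name=literal, ...); reject anything else.
    if not (
        isinstance(call, ast.Call)
        and isinstance(call.func, ast.Name)
        and call.func.id == "Row"
        and not call.args
    ):
        raise ValueError("expected a string of the form Row(name=value, ...)")
    result = {}
    for kw in call.keywords:
        if kw.arg is None:  # a **mapping expansion, not a plain name=value pair
            raise ValueError("unexpected ** expansion in Row(...)")
        # literal_eval only accepts literals (None, strings, numbers, dicts, ...),
        # so function calls or attribute access raise ValueError instead of running.
        result[kw.arg] = ast.literal_eval(kw.value)
    return result
```

In the pandas step above, `parse_row_repr(x)` could stand in for `eval(x).asDict()`. Nested values that are themselves stored as strings, such as Ambience, could then be decoded with a second `ast.literal_eval` call if real dictionaries are needed.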