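[Posted]: 2021-09-24 00:23:53
[Question]:
I am trying to remove special characters from the column names of a DataFrame created from a CSV. There are 100 columns, and the names are very long. I have tried several approaches, and each one returns an error on at least one of the columns.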
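import re
from pyspark.sql import functions as F

# `spark` and `getArgument` are assumed to come from the Databricks notebook environment.
df = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",").load(getArgument('sourceCSVpath') + getArgument('sourceCSV'))
# Attempt 1: strip every character except letters, digits, and '$'.
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])
# Attempt 2: remove only the "- " separators.
temp_df1 = df.select([F.col(col).alias(col.replace('- ', '')) for col in df.columns])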
Error:
Cannot resolve "Organization - No. Of Employees - Employee Figures Date" given input columns
Domestic Ultimate Employee Information Scope Code,NACE Revision 2 Description - Priority 4,NACE Revision 2 Description - Priority 5,NACE Revision 2 Description - Priority 6,Organization - No. Of Employees - Employee Figures Date,Number of Employees Scope Text,Organization Founded Date,NACE Revision 2 Description - Priority 1
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
9067,,,,,Headquarters Only (Employs Here),1997,Hospital activities
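A likely cause, though not confirmed in this thread, is the dot in "No. Of": when a name is passed to `F.col`, Spark reads an unquoted dot as access to a nested field, so the column cannot be resolved. The sketch below computes the cleaned names in plain Python, reusing the regex from the question; `clean_names` is a hypothetical helper, and the Spark calls are left as comments because they need a live session.

```python
import re


def clean_names(columns):
    """Strip every character that is not a letter, a digit, or '$'."""
    return [re.sub(r"[^0-9a-zA-Z$]+", "", name) for name in columns]


# Two header names taken from the sample data above.
header = [
    "Organization - No. Of Employees - Employee Figures Date",
    "NACE Revision 2 Description - Priority 4",
]
new_names = clean_names(header)
print(new_names)

# With a live DataFrame, a positional rename avoids resolving the old names:
#   df = df.toDF(*clean_names(df.columns))
# Alternatively, keep the select but quote each source name with backticks:
#   df = df.select([F.col(f"`{c}`").alias(n)
#                   for c, n in zip(df.columns, clean_names(df.columns))])
```

`toDF` renames by position, so it does not matter which characters the original names contain. If two cleaned names collide, later references to them would be ambiguous, so it may be worth checking for duplicates first.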
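[Comments]:
-
A reproducible example would be very helpful: see stackoverflow.com/help/minimal-reproducible-example
-
Added sample data to the post above.
-
If the CSV is not that large, you can always rename these columns with Pandas and then process the data with Spark.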
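A minimal sketch of the Pandas suggestion, assuming the file fits in driver memory. The CSV text is inlined from two columns of the sample data so the example is self-contained; the regex is the one from the question.

```python
import io
import re

import pandas as pd

# Two columns from the sample data, inlined instead of read from disk.
csv_text = (
    "Organization - No. Of Employees - Employee Figures Date,"
    "Organization Founded Date\n"
    ",1997\n"
)

pdf = pd.read_csv(io.StringIO(csv_text))
# Pandas treats column labels as plain strings, so dots need no quoting here.
pdf.columns = [re.sub(r"[^0-9a-zA-Z$]+", "", name) for name in pdf.columns]
print(list(pdf.columns))
```

The renamed frame could then be handed to Spark with `spark.createDataFrame(pdf)`, assuming a `SparkSession` named `spark` is available; that step is not shown here.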