Reading Excel files in PySpark with the 3rd row as header: answer

【Question title】: Reading excel files in pyspark with 3rd row as header
【Posted】: 2021-04-07 15:50:12
【Question description】:

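I want to read an Excel file into a Spark DataFrame with the 3rd row as the header. The syntax for reading an Excel file into a Spark DataFrame with the 1st row as the header is: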

s_df = spark.read.format("com.crealytics.spark.excel") \
           .option("header", "true") \
           .option("inferSchema", "true") \
           .load(path + 'Sales.xlsx')

The equivalent syntax for reading it into a pandas DataFrame with the 3rd row as the header is:

p_df = pd.read_excel(path + 'Sales.xlsx', header=2)  # header is 0-indexed, so 2 selects the 3rd row

I want to do the same thing in PySpark, that is, read the Excel file as a Spark DataFrame with the 3rd row as the header.

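【Comments】:

Can you read it with pandas and convert it to a Spark DataFrame? Excel files are usually not large, so pandas should be able to handle it.
Yes, I can do that, but is there a way to read the file directly as a Spark DataFrame?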
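
The pandas route suggested in the comment above could look roughly like the sketch below. The function names `pandas_header_index` and `read_excel_via_pandas` are hypothetical, introduced only for illustration. The sketch assumes the file is small enough to fit in driver memory, that an Excel engine such as openpyxl is installed for pandas, and that the column types pandas infers can be converted by `spark.createDataFrame`.

```python
def pandas_header_index(header_row: int) -> int:
    """Convert a 1-based spreadsheet row number to pandas' 0-based `header` value."""
    if header_row < 1:
        raise ValueError("header_row is 1-based and must be >= 1")
    return header_row - 1


def read_excel_via_pandas(spark, file_path, header_row=3, sheet_name=0):
    """Read an Excel sheet with pandas, then hand the result to Spark.

    `spark` is an existing SparkSession; everything is loaded on the driver,
    so this is only suitable for small files.
    """
    # Imported here so the index helper above can be used without pandas installed.
    import pandas as pd

    pdf = pd.read_excel(
        file_path,
        sheet_name=sheet_name,
        header=pandas_header_index(header_row),  # 3rd row -> header=2
    )
    return spark.createDataFrame(pdf)
```

A call such as `read_excel_via_pandas(spark, path + 'Sales.xlsx', header_row=3)` would then return a Spark DataFrame whose column names come from the 3rd row of the sheet.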

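Tags: excel pyspark azure-databricks

【Solution 1】:

Use the dataAddress option to specify the cell where your data starts. Since you need to skip two rows, your data (including the header) starts at cell A3.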

s_df = spark.read.format("com.crealytics.spark.excel") \
           .option("header", "true") \
           .option("inferSchema","true") \
           .option("dataAddress", "'Sheet1'!A3") \
           .load("yourfilepath")

Also note that if your first two rows are empty, you do not have to specify dataAddress. Leading empty rows are skipped by default.

See the spark-excel documentation for details.
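
If the header row varies between files, the dataAddress string from the answer can be built from a row number instead of being hard-coded. This is a minimal sketch: `build_data_address` and `read_excel_with_header_row` are hypothetical helper names, not part of spark-excel, and the sketch assumes the data starts in column A of a sheet called Sheet1 unless told otherwise, and that the sheet name contains no apostrophes.

```python
def build_data_address(header_row: int, sheet_name: str = "Sheet1",
                       first_column: str = "A") -> str:
    """Build a dataAddress such as 'Sheet1'!A3 from a 1-based header row number."""
    if header_row < 1:
        raise ValueError("header_row is 1-based and must be >= 1")
    return f"'{sheet_name}'!{first_column}{header_row}"


def read_excel_with_header_row(spark, file_path, header_row, sheet_name="Sheet1"):
    """Read an Excel sheet into a Spark DataFrame, treating `header_row` as the header.

    `spark` is an existing SparkSession with the spark-excel package available.
    """
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", build_data_address(header_row, sheet_name))
        .load(file_path)
    )
```

For the question's case, `build_data_address(3)` produces the same `'Sheet1'!A3` value used in the answer, so `read_excel_with_header_row(spark, "yourfilepath", 3)` is equivalent to the snippet above.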
