【Question title】: How can I ingest an Excel spreadsheet with multiple tabs?
【Posted】: 2020-09-22 23:37:39
【Question】:

I want to ingest Excel files from a remote folder or an SFTP server. Ingestion works for CSV files, but not for XLS or XLSX files.

【Comments on the question】:

    Tags: palantir-foundry foundry-data-connection


    【Answer 1】:

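    The code below provides functions that convert xls/xlsx files into Spark dataframes.

    To use these functions, you need to:

    1. Copy and paste the functions below into your repository (for example, in a utils.py file)
    2. Create a new transform script
    3. In the transform script, copy and paste the example transform and adapt the parameters.

    Example transform that uses the functions: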

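    from transforms.api import transform, Input, Output
    # Assumed module path: adjust it to wherever you saved the functions (e.g. utils.py)
    from myproject.utils import write_datasets_from_excel_sheets
    
    # Parameters for ingesting an Excel file with multiple tabs
    SHEETS_PARAMETERS = {
        # Each block maps one tab of the Excel file ("Artists" here) to the dataset at
        # "output_dataset_path", using row "header" (0-indexed) for the column names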
        "Artists": {
            "output_dataset_path": "/Studio/studio_datasource/artists",
            "header": 7
        },
        "Records": {
            "output_dataset_path": "/Studio/studio_datasource/records",
            "header": 0
        },
        "Albums": {
            "output_dataset_path": "/Studio/studio_datasource/albums",
            "header": 1
        }
    }
    
    # Define the dictionary of outputs needed by the transform's decorator
    outputs = {
        sheet_parameter["output_dataset_path"]: Output(sheet_parameter["output_dataset_path"])
        for sheet_parameter in SHEETS_PARAMETERS.values()
    }
    @transform(
        my_input=Input("/Studio/studio_datasource/excel_file"),
        **outputs
    )
    def my_compute_function(my_input, ctx, **outputs):
        # Attach each output object to the parameters of the sheet that writes to it
        for parameters in SHEETS_PARAMETERS.values():
            parameters["output_dataset"] = outputs[parameters["output_dataset_path"]]
    
        # Transform the sheets to datasets
        write_datasets_from_excel_sheets(my_input, SHEETS_PARAMETERS, ctx)
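
The transform above keys its outputs by dataset path and passes them with `**`. The following is a minimal plain-Python sketch of that pattern only, so it can be run outside Foundry. `FakeOutput` and `compute` are hypothetical stand-ins, not part of `transforms.api`, and the sketch says nothing about how the real `@transform` decorator handles these keys.

```python
# Hypothetical stand-in for transforms.api.Output, for illustration only
class FakeOutput:
    def __init__(self, path):
        self.path = path


SHEETS = {
    "Artists": {"output_dataset_path": "/Studio/studio_datasource/artists", "header": 7},
    "Records": {"output_dataset_path": "/Studio/studio_datasource/records", "header": 0},
}

# One output object per sheet, keyed by its dataset path
outputs = {
    p["output_dataset_path"]: FakeOutput(p["output_dataset_path"])
    for p in SHEETS.values()
}


def compute(my_input, **outputs):
    # Look up the output registered under each sheet's path
    return {
        sheet: outputs[p["output_dataset_path"]]
        for sheet, p in SHEETS.items()
    }


# Any string is accepted as a key when unpacking into **kwargs
resolved = compute("excel_file", **outputs)
for sheet, out in resolved.items():
    print(sheet, "->", out.path)
```

Keying by path means the sheet name never has to be a valid Python identifier; the path is the only link between a tab and its output.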
    

    Functions:

    import pandas as pd
    import tempfile
    import shutil
    
    def normalize_column_name(cn):
        """
        Replace characters that are forbidden in column names with underscores
        """
        invalid_chars = " ,;{}()\n\t="
        for c in invalid_chars:
            cn = cn.replace(c, "_")
        return cn
    
    def get_dataframe_from_excel_sheet(fp, ctx, sheet_name, header):
        """
        Generate a Spark dataframe from a sheet in an excel file available in Foundry
        Arguments:
            fp:
                File-like object (e.g. a TemporaryFile) from which the Excel file can be read
            ctx:
                Context object available in a transform
            sheet_name:
                Name of the sheet
            header:
                Row (0-indexed) to use for the column labels of the parsed DataFrame.
                If a list of integers is passed those row positions will be combined into a MultiIndex.
                Use None if there is no header.
        """
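        # No encoding is passed: recent pandas versions no longer accept an
        # `encoding` keyword in read_excel and raise a TypeError if it is given
        dataframe = pd.read_excel(
            fp,
            sheet_name=sheet_name,
            header=header
        )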
    
        # Cast all the dataframes as string
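
The `normalize_column_name` helper can be tried on its own. The sketch below repeats it with one added assumption that is not in the original: labels are converted with `str()` first, because pandas uses integer column labels when `header` is `None`, and `.replace` would fail on an integer. The sample headers are made up for illustration.

```python
def normalize_column_name(cn):
    """Replace characters that are forbidden in column names with underscores."""
    invalid_chars = " ,;{}()\n\t="
    cn = str(cn)  # assumption: stringify non-string labels such as integers
    for c in invalid_chars:
        cn = cn.replace(c, "_")
    return cn


# Made-up header values, including one integer label
headers = ["Artist name", "Sales (USD)", "a=b;c", 2020]
normalized = [normalize_column_name(h) for h in headers]
print(normalized)
# ['Artist_name', 'Sales__USD_', 'a_b_c', '2020']
```

Each forbidden character is replaced one for one, so consecutive forbidden characters produce consecutive underscores.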
    

    【Comments】:
