How to stream all files from an HDFS location and write them to another HDFS location simultaneously: answers

【Question title】: How to stream an hdfs location for all files and write to another hdfs location simultaneously
【Posted】: 2019-10-08 06:36:48
【Question description】:

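I have about 20K JSON files stored in Parquet format in a single HDFS location. My task is to stream that location, read all the files into a DataFrame, and then write them to another HDFS location.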

Can anyone suggest how I should do this? I am using the Azure Databricks platform and PySpark for this task.

【Question discussion】:

Tags: pyspark hdfs azure-data-lake azure-databricks

【Solution 1】:

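I'm not sure whether you want to do this in a "streaming" fashion or a "batch" fashion. Either way, you can use the Structured Streaming API and trigger the job once.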

    (spark
        .readStream  # Read the data as a stream
        .schema(USER_SCHEMA)  # By default, streaming file sources need the input schema up front
        .format("parquet")
        .load(PARQUET_ORIGIN_LOCATION)
        .writeStream
        .format("delta")
        .option("path", PARQUET_DESTINATION_LOCATION + 'data/')  # Where to store the data
        .option("checkpointLocation", PARQUET_DESTINATION_LOCATION + 'checkpoint/')  # The checkpoint location
        .option("overwriteSchema", True)  # Allows the schema to be overwritten
        .queryName(QUERY_NAME)  # Name of the query
        .trigger(once=True)  # Process the available files once, then stop (batch-like)
        .start()
    )
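
If writing `USER_SCHEMA` by hand is inconvenient, one possible approach is to take it from a one-off batch read of the files that already exist and pass that to the streaming reader. The sketch below assumes the existing files are representative of what will arrive later; `infer_schema`, `build_stream_reader`, and the `merge_schema` flag are hypothetical helper names, not part of the original answer.

```python
def infer_schema(spark, source_path, merge_schema=False):
    """Return the schema of the Parquet files under source_path using a batch read.

    Assumption: the files already present are representative of future files.
    """
    reader = spark.read.format("parquet")
    if merge_schema:
        # Parquet data source option: reconcile differing file schemas into one.
        reader = reader.option("mergeSchema", "true")
    # .schema on a batch DataFrame is resolved from file metadata; no action is run.
    return reader.load(source_path).schema


def build_stream_reader(spark, source_path, merge_schema=False):
    """Build a streaming DataFrame that reuses the schema inferred above."""
    schema = infer_schema(spark, source_path, merge_schema)
    return (
        spark.readStream
        .schema(schema)  # takes the place of a hand-written USER_SCHEMA
        .format("parquet")
        .load(source_path)
    )
```

With a live session this would be called as `build_stream_reader(spark, PARQUET_ORIGIN_LOCATION)`. For ad-hoc use, Spark also offers the `spark.sql.streaming.schemaInference` setting, which lets a file-based streaming source infer its schema; whether that is suitable depends on how stable the incoming files are.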

【Discussion】:

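Thanks a lot. I want to do it in a streaming fashion. I would like to know whether this is possible without defining the schema up front. All of my source Parquet files sit in the same HDFS location, but after streaming I want to store them at the destination using five different schemas. That is the problem I'm facing. Is there a way to solve it?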
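
One way the "several schemas at the destination" requirement might be approached is `foreachBatch`, which hands each micro-batch to a function as an ordinary DataFrame that can then be written to more than one place. This is only a sketch under assumptions: the thread does not say how the five schemas differ, so the code below assumes each target is a subset of the source columns. `TARGET_COLUMNS`, its column names, `make_batch_writer`, and `start_split_stream` are all hypothetical.

```python
# Hypothetical mapping: target name -> columns kept for that target.
TARGET_COLUMNS = {
    "orders": ["order_id", "customer_id", "amount"],
    "customers": ["customer_id", "name"],
}


def make_batch_writer(destination_root, target_columns):
    """Return a foreachBatch function that writes one Parquet dataset per target."""
    root = destination_root.rstrip("/")

    def write_batch(batch_df, batch_id):
        batch_df.persist()  # the same micro-batch is written several times
        available = set(batch_df.columns)
        for name, columns in target_columns.items():
            selected = [c for c in columns if c in available]
            if not selected:
                continue  # none of this target's columns exist in the batch
            (batch_df.select(*selected)
                .write.format("parquet")
                .mode("append")
                .save(f"{root}/{name}/"))
        batch_df.unpersist()

    return write_batch


def start_split_stream(stream_df, destination_root, target_columns):
    """Start a streaming query that routes every micro-batch through write_batch."""
    root = destination_root.rstrip("/")
    return (
        stream_df.writeStream
        .foreachBatch(make_batch_writer(root, target_columns))
        .option("checkpointLocation", f"{root}/checkpoint/")
        .start()
    )
```

Given a streaming DataFrame read from the source location, `start_split_stream(stream_df, PARQUET_DESTINATION_LOCATION, TARGET_COLUMNS)` would start the query. Adding `.trigger(once=True)` before `.start()`, as in the answer, would make it process the available files once and stop. If the five schemas are distinguished by something other than column subsets (for example a type field in each record), the selection inside `write_batch` would need a filter instead.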