【Question title】:Read a Parquet file from Azure Blob Storage without downloading it locally (C# / .NET)
【Posted】:2020-05-15 19:21:52
【Question】:

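We have a Parquet file (500 MB) stored in Azure Blob Storage. How can we read it directly from the blob into memory in C#, for example into a DataTable?

The following code reads a Parquet file that is physically present in a local folder.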

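using System;
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

// The enclosing class was omitted in the original snippet; this name is illustrative.
public class ParquetFileReader
{
    // Reads every row group of a local Parquet file, column by column.
    public void ReadParquetFile()
    {
        using (Stream fileStream = File.OpenRead("D:/../userdata1.parquet"))
        using (var parquetReader = new ParquetReader(fileStream))
        {
            DataField[] dataFields = parquetReader.Schema.GetDataFields();

            for (int i = 0; i < parquetReader.RowGroupCount; i++)
            {
                using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
                {
                    // Read all columns of this row group.
                    DataColumn[] columns = dataFields.Select(groupReader.ReadColumn).ToArray();

                    DataColumn firstColumn = columns[0];

                    // Data is a typed array; cast it to the column's element type to use it.
                    Array data = firstColumn.Data;
                    //int[] ids = (int[])data;
                }
            }
        }
    }
}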

(I can already read a CSV file directly from the blob using a source stream.) Please suggest the fastest way to read a Parquet file directly from the blob.

【Comments】:

    Tags: c# azure blob parquet


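    【Solution 1】:

    In my experience, the way to read a Parquet file directly from a blob is to first generate a blob URL with a SAS token, then use HttpClient to get a stream from that URL, and finally read the HTTP response stream with ParquetReader.

    First, refer to the sample code under the "Create a service SAS for a blob" section of the official document "Create a service SAS for a container or blob with .NET", which uses the Azure Blob Storage SDK for .NET Core.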

    private static string GetBlobSasUri(CloudBlobContainer container, string blobName, string policyName = null)
    {
        string sasBlobToken;
    
        // Get a reference to a blob within the container.
        // Note that the blob may not exist yet, but a SAS can still be created for it.
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
    
        if (policyName == null)
        {
            // Create a new access policy and define its constraints.
            // Note that the SharedAccessBlobPolicy class is used both to define the parameters of an ad hoc SAS, and
            // to construct a shared access policy that is saved to the container's shared access policies.
            SharedAccessBlobPolicy adHocSAS = new SharedAccessBlobPolicy()
            {
                // When the start time for the SAS is omitted, the start time is assumed to be the time when the storage service receives the request.
                // Omitting the start time for a SAS that is effective immediately helps to avoid clock skew.
                SharedAccessExpiryTime = DateTime.UtcNow.AddHours(24),
                Permissions = SharedAccessBlobPermissions.Read | SharedAccessBlobPermissions.Write | SharedAccessBlobPermissions.Create
            };
    
            // Generate the shared access signature on the blob, setting the constraints directly on the signature.
            sasBlobToken = blob.GetSharedAccessSignature(adHocSAS);
    
            Console.WriteLine("SAS for blob (ad hoc): {0}", sasBlobToken);
            Console.WriteLine();
        }
        else
        {
            // Generate the shared access signature on the blob. In this case, all of the constraints for the
            // shared access signature are specified on the container's stored access policy.
            sasBlobToken = blob.GetSharedAccessSignature(null, policyName);
    
            Console.WriteLine("SAS for blob (stored access policy): {0}", sasBlobToken);
            Console.WriteLine();
        }
    
        // Return the URI string for the blob, including the SAS token.
        return blob.Uri + sasBlobToken;
    }
    

    Then get the HTTP response stream from the URL with the SAS token using HttpClient.

    var blobUrlWithSAS = GetBlobSasUri(container, blobName);
    var client = new HttpClient();
    var stream = await client.GetStreamAsync(blobUrlWithSAS);
    

    Finally, read it with ParquetReader; the code comes from the "Reading Data" section of the GitHub repo aloneguid/parquet-dotnet.

    var options = new ParquetOptions { TreatByteArrayAsString = true };
    var reader = new ParquetReader(stream, options);
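
A note on the step above: as the comments below report, passing the raw HTTP response stream to ParquetReader can fail with "not a Parquet file". A likely cause is that Parquet keeps its metadata in a footer at the end of the file, so the reader needs a seekable stream, and an HTTP response stream is forward-only. The following is a minimal sketch of one workaround, assuming the whole file fits in memory: copy the response into a MemoryStream first. The class and method names (`SeekableStreams`, `BufferAsync`) are hypothetical and not part of Parquet.Net or the Azure SDK.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class SeekableStreams
{
    // Copies any readable stream fully into memory and rewinds it,
    // so the returned stream supports Seek (CanSeek == true).
    // Assumption: the payload fits in RAM (a MemoryStream holds at most about 2 GB).
    public static async Task<MemoryStream> BufferAsync(Stream source)
    {
        var buffer = new MemoryStream();
        await source.CopyToAsync(buffer);
        buffer.Position = 0; // rewind so the consumer starts at the first byte
        return buffer;
    }
}

// Usage with the snippets above (Parquet.Net API as used in this answer):
//   using (var http = await client.GetStreamAsync(blobUrlWithSAS))
//   using (var seekable = await SeekableStreams.BufferAsync(http))
//   using (var reader = new ParquetReader(seekable, options))
//   {
//       // read row groups as in the question's code
//   }
```

This trades memory for simplicity: a 500 MB file occupies at least 500 MB of RAM while it is being read. If that is too much, a seekable stream opened through the storage SDK (for example `OpenReadAsync` on the blob, which to my knowledge supports seeking) may be worth trying instead, though it is not verified here.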
    

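    【Discussion】:

    • Thanks, Peter. I tried var stream = await client.GetStreamAsync(blobUrl); but it ran into a timeout. With this approach I can read small CSV files directly from the blob without downloading them locally. I actually need to read a 1.7 GB CSV file, or the corresponding Parquet file of about 500 MB, directly from the blob.
    • Hi, I get the following error when using the same logic, although I can print the file using a StreamReader. Any ideas? Thanks. "not a Parquet file (head is '')"
    • @Subba I ran into the same error. Did you solve it?
    • We get the same error. Has anyone found a working version?
    • The Parquet.Net documentation mentions that files cannot be read from a network stream. See github.com/aloneguid/parquet-dotnet#reading-files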