CSV to .GZ using Cloud Function, Python

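【Question title】: CSV to .GZ using Cloud function, Python
【Posted】: 2020-04-08 22:39:27
【Question】:

I have been trying to compress my CSV file to .gz before uploading it to GCS from a Cloud Function (Python 3.7). All my code does is add the .gz extension without actually compressing the file, so the resulting file is corrupt. Can you tell me how to fix this? Thanks.

Here is part of my code: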

import gzip


def to_gcs(request):    
    job_config = bigquery.QueryJobConfig()
    gcs_filename = 'filename_{}.csv'
    bucket_name = 'bucket_gcs_name'
    subfolder = 'subfolder_name'
    client = bigquery.Client()


    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

    QUERY = "SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` session, UNNEST(hits) AS hits"
    query_job = client.query(
        QUERY,
        location='US',
        job_config=job_config)

    while not query_job.done():
        time.sleep(1)

    rows_df = query_job.result().to_dataframe()
    storage_client = storage.Client()

    storage_client.get_bucket(bucket_name).blob(subfolder+'/'+gcs_filename+'.gz').upload_from_string(rows_df.to_csv(sep='|',index=False,encoding='utf-8',compression='gzip'), content_type='application/octet-stream')

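【Comments】:

Does this answer your question? Write pandas dataframe as compressed CSV directly to Amazon s3 bucket?
You should check the warnings you get from Pandas; see stackoverflow.com/a/44168817/1358308 and github.com/pandas-dev/pandas/issues/22555
The most-upvoted answer in the thread from @SamMason's first comment did work for me. @Justine, did it work for you?
@Jose V, it did!
@JoseV I had a fiddle and added a note about using the tempfile module. The upload_from_string method also creates a BytesIO object right away, so it is better to pass a file object if possible, which is now easy.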

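Tags: python google-cloud-functions google-cloud-storage

【Solution 1】:

As suggested in the thread that @Sam Mason mentioned in the comments, once you have the Pandas dataframe you should use TextIOWrapper() and BytesIO(), as in the following example:

The following example is inspired by @ramhiser's answer in this SO thread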

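import gzip
from io import BytesIO, TextIOWrapper

# query_job, bucket, subfolder and gcs_filename come from the question's code.
df = query_job.result().to_dataframe()
blob = bucket.blob(f'{subfolder}/{gcs_filename}.gz')

with BytesIO() as gz_buffer:
    # Write the CSV text through gzip into the in-memory buffer.
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        with TextIOWrapper(gz_file, encoding='utf8') as text_file:
            df.to_csv(text_file, index=False)

    # Rewind so the upload starts at the first byte of the buffer.
    gz_buffer.seek(0)
    blob.upload_from_file(gz_buffer,
        content_type='application/octet-stream')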

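Also note that if you expect this file to be larger than a few MB, you are probably better off using something from the tempfile module instead of BytesIO. SpooledTemporaryFile is basically designed for this use case: it uses an in-memory buffer up to a given size and only goes to disk if the file gets really big.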
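A minimal sketch of the SpooledTemporaryFile idea, using only the standard library so it runs on its own. The function name gzip_rows_to_spooled_file, the rows argument, and the 1 MB max_size threshold are assumptions for illustration; the csv module stands in for df.to_csv here. The returned file object is rewound and could then be handed to blob.upload_from_file in place of the BytesIO buffer.

```python
import csv
import gzip
import io
import tempfile


def gzip_rows_to_spooled_file(rows, max_size=1024 * 1024):
    """Write rows as gzip-compressed, pipe-delimited CSV into a spooled file.

    The data stays in memory until it exceeds max_size bytes (a hypothetical
    threshold), after which tempfile moves it to disk transparently.
    """
    spooled = tempfile.SpooledTemporaryFile(max_size=max_size)
    with gzip.GzipFile(mode='wb', fileobj=spooled) as gz_file:
        # TextIOWrapper turns the binary gzip stream into a text stream;
        # closing it also closes gz_file, which writes the gzip trailer.
        with io.TextIOWrapper(gz_file, encoding='utf-8', newline='') as text_file:
            csv.writer(text_file, delimiter='|').writerows(rows)
    # GzipFile does not close a fileobj it was given, so spooled is still open.
    spooled.seek(0)  # rewind so a reader or uploader starts at byte 0
    return spooled


spooled_file = gzip_rows_to_spooled_file([['url', 'view_count'], ['a', 1]])
compressed = spooled_file.read()
spooled_file.close()
```

Closing the spooled file discards the buffer (and the on-disk copy, if one was created), so nothing is left behind in the function's temporary storage.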

【Comments】:

For anyone facing ValueError: Stream must be at beginning, insert gz_buffer.seek(0) before the blob.upload_from_file(...) line.
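Why the rewind matters can be seen with a plain BytesIO, independent of any Google Cloud library. This is only an illustration of stream positions; the payload bytes are made up.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b'compressed bytes')

# After writing, the position sits at the end of the data.
position_after_write = buffer.tell()
read_without_rewind = buffer.read()  # nothing left to read from here

# Rewinding moves the position back to the first byte.
buffer.seek(0)
read_after_rewind = buffer.read()
```

A consumer that reads from the current position, as an uploader does, therefore sees no data unless the stream is rewound first.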

【Solution 2】:

Hi, I tried to reproduce your use case:

I created a Cloud Function using this quickstart link:

def hello_world(request):

  from google.cloud import bigquery
  from google.cloud import storage 
  import pandas as pd 


  client = bigquery.Client() 
  storage_client = storage.Client() 

  path = '/tmp/file.gz'


  query_job = client.query("""
  SELECT
  CONCAT(
    'https://stackoverflow.com/questions/',
     CAST(id as STRING)) as url,
  view_count
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE tags like '%google-bigquery%'
  ORDER BY view_count DESC
  LIMIT 10""")  

  results = query_job.result().to_dataframe()
  results.to_csv(path,sep='|',index=False,encoding='utf-8',compression='gzip')

  bucket = storage_client.get_bucket('mybucket')  
  blob = bucket.blob('file.gz')
  blob.upload_from_filename(path)

Here is the requirements.txt:

# Function dependencies, for example:

google-cloud-bigquery
google-cloud-storage
pandas

I deployed the function.

I checked the output.

gsutil cp gs://mybucket/file.gz file.gz
gzip -d file.gz
cat file


#url|view_count
https://stackoverflow.com/questions/22879669|52306
https://stackoverflow.com/questions/13530967|46073
https://stackoverflow.com/questions/35159967|45991
https://stackoverflow.com/questions/10604135|45238
https://stackoverflow.com/questions/16609219|37758
https://stackoverflow.com/questions/11647201|32963
https://stackoverflow.com/questions/13221978|32507
https://stackoverflow.com/questions/27060396|31630
https://stackoverflow.com/questions/6607552|31487
https://stackoverflow.com/questions/11057219|29069
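The same kind of check can be sketched in Python. A gzip stream starts with the two bytes 0x1f 0x8b, so looking at the first two bytes tells a really compressed file from a plain CSV that was merely given a .gz name, which is the symptom described in the question. The helper name is_gzip is an assumption for illustration, and the sample text reuses one row of the output above.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # the first two bytes of every gzip stream


def is_gzip(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


csv_text = 'url|view_count\nhttps://stackoverflow.com/questions/22879669|52306\n'

# Plain CSV bytes: what ends up in the bucket when only the name says .gz.
plain_bytes = csv_text.encode('utf-8')

# Actually compressed bytes, which decompress back to the same text.
compressed_bytes = gzip.compress(plain_bytes)
round_trip = gzip.decompress(compressed_bytes).decode('utf-8')
```

Here is_gzip(plain_bytes) is False and is_gzip(compressed_bytes) is True, so the check separates the two cases before the file is downloaded and unpacked by hand.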
