Loading multiple CSVs into Google Datalab from Cloud Storage

【Question title】: Loading multiple CSVs into Google Datalab from Cloud Storage
【Posted】: 2017-11-29 16:03:47
【Question description】:

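Some variants of this question are answered here and here, and I have used those answers successfully.

My problem is slightly different, though. I used BigQuery to export 1 GB of data to Google Cloud Storage. The export was split into 5 CSV files, each of which includes the column names (I think this is what breaks things).

My code is: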

# Run import
import pandas as pd
import numpy as np
from io import BytesIO

# Grab the file from the cloud storage
variable_list = ['part1', 'part2','part3','part4','part5']
for variable in variable_list:
  file_path = "gs://[Bucket-name]/" + variable + ".csv"
  %gcs read --object {file_path} --variable byte_data

# Read the dataset
data = pd.read_csv(BytesIO(byte_data), low_memory=False)

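However, when I call len(data), I don't get the full number of rows. The code above seems to load only 1 file.

I could load 5 separate dataframes and then simply combine them in pandas via data=[df1, df2, df3, df4, df5], but that seems ugly.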

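【Comments】:

My first thought is that byte_data gets overwritten on every iteration. Could you create another Python variable to hold the whole content (you could append byte_data to it after each iteration)?
@AnthoniosPartheniou type(byte_data) reports that it is a bytes object. But if I create an empty bytes object, full_data = bytes(), nothing gets appended to it. I tried changing full_data to a list, but then I get: 'NoneType' object has no attribute 'append'
Try using bytearray, or search for "bytes concatenation".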
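
As a rough sketch of the byte-concatenation idea from the comments: a bytes object is immutable, so it has to be rebound with `+=` rather than appended to in place. The `'NoneType' object has no attribute 'append'` error is what you would typically see if the list version was written as `full_data = full_data.append(byte_data)`, because `list.append` returns None. The `parts` list below is a hypothetical stand-in for what `%gcs read` stores in byte_data on each iteration, and the sketch assumes every file ends with a newline and starts with the same header row.

```python
from io import BytesIO

import pandas as pd

# Hypothetical stand-ins for the byte_data produced by `%gcs read` per file.
parts = [
    b"id,value\n1,a\n2,b\n",
    b"id,value\n3,c\n",
    b"id,value\n4,d\n5,e\n",
]

full_data = b""
for i, byte_data in enumerate(parts):
    if i > 0:
        # Drop the repeated header line from every part after the first,
        # otherwise the header text would be parsed as a data row.
        byte_data = byte_data.split(b"\n", 1)[1]
    # bytes is immutable: += builds a new object and rebinds the name.
    full_data += byte_data

# Parse once, after the loop has accumulated all parts.
data = pd.read_csv(BytesIO(full_data))
print(len(data))
```

With the sample payloads above, `data` holds the rows from all three parts under a single header.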

Tags: google-cloud-datalab

【Solution 1】:

I found this and adapted it to my case. I loop over all the files in the bucket (folder):

from google.datalab import Context
import google.datalab.storage as storage
import pandas as pd

try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO

bucket_folder = 'ls_w'                    # name of the bucket
bucket = storage.Bucket(bucket_folder)    # bucket handle used by the loop below

df = pd.DataFrame()                # Final dataframe
for obj in bucket.objects():       # loop in all objects of the bucket
    if '/' not in obj.key:           # add other options to exclude other files
                                     # in this case it looks only at bucket level
                                     # not into subfolders!
        fn = obj.key                   # created file name variable (optional)
        print(obj.key)

        bites = 'gs://%s/%s' % (bucket_folder, fn)
        %gcs read --object $bites --variable data

        tdf = pd.read_csv(StringIO(data))  # read 

        df = pd.concat([df, tdf])          # concatenate results
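
The loop above calls pd.concat on every iteration, which copies the growing dataframe each time and keeps each file's own row index. A common alternative, sketched here under assumptions, is to parse each file into its own dataframe, collect them in a list, and concatenate once at the end. `concat_csv_parts` is a hypothetical helper name, and the sample payloads stand in for the bytes that `%gcs read` would store in `data` for each object.

```python
from io import BytesIO

import pandas as pd


def concat_csv_parts(parts):
    """Parse each CSV payload (bytes) separately, then concatenate once.

    Each payload keeps its own header row, so repeated column names in
    the exported files are handled by read_csv itself.
    """
    frames = [pd.read_csv(BytesIO(raw)) for raw in parts]
    # ignore_index=True renumbers rows 0..n-1 across all parts.
    return pd.concat(frames, ignore_index=True)


# Hypothetical payloads; in Datalab these would come from `%gcs read`.
parts = [b"id,value\n1,a\n2,b\n", b"id,value\n3,c\n"]
df = concat_csv_parts(parts)
print(df.shape)
```

In the notebook, the same pattern would mean appending each `tdf` to a list inside the loop and calling pd.concat once after it.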

【Comments】: