How should I deal with `*_id` columns when reading a CSV with Pandas? (Answers)

【Question title】: How should I deal with `*_id` columns when reading a CSV with Pandas?
【Posted】: 2018-08-17 06:33:21
【Question】:

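I have several columns in a CSV file whose names end in _id. Most of their values are integers, but sometimes the value is missing (NaN, i.e. an empty field in the CSV file). Some example names: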

user_id
device_id
company_id
country_id
...

When I use pd.read_csv, it detects some of these columns as int64 and others as float64. I am not sure, but I think the "correct" type for such identifiers is object.

I know that read_csv has a dtype parameter, so what I can do is:

dtypes = {'user_id': str,
          'device_id': str,
          'company_id': str,
          'country_id': str}
df = pd.read_csv('file.csv', dtype=dtypes)

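The drawback is that I have to list every column. Is there a way to apply this to all "*_id" columns? And if I call df['user_id'].astype(object) afterwards, does it make a difference that pandas has already "detected" int64 or float64?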

【Comments on the question】:

Tags: python pandas csv

【Answer 1】:

You can preprocess the file to extract the column names containing _id, turn them into a dict, and then pass that dict to read_csv:

import pandas as pd

path_csv = 'file.csv'
with open(path_csv) as f:
    # read only the header line and keep the names that end with '_id'
    names = [x.strip() for x in f.readline().split(',')]
    header = [x for x in names if x.endswith('_id')]

print (header)
['user_id', 'device_id', 'company_id', 'country_id']

d = dict.fromkeys(header, 'str')
print (d)
{'user_id': 'str', 'device_id': 'str', 'company_id': 'str', 'country_id': 'str'}

df = pd.read_csv('file.csv', dtype=d)
print (df)
  user_id device_id company_id country_id  country
0       1         2          3          4        5

print (df.dtypes)
user_id       object
device_id     object
company_id    object
country_id    object
country        int64
dtype: object
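
As a variation on the same idea, here is a minimal sketch that lets pandas parse the header itself instead of splitting the first line on commas by hand, which also copes with quoted column names. It assumes the identifier columns are exactly those whose names end in `_id`. The helper name `read_csv_with_id_strings` and the sample data are made up for illustration; `nrows=0` is a regular `read_csv` argument that returns an empty frame carrying only the column names.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_id_strings(path, suffix="_id"):
    """Read a CSV, forcing every column whose name ends in `suffix` to str.

    Hypothetical helper for illustration, not part of pandas.
    """
    # First pass: nrows=0 parses only the header row, so it is cheap.
    columns = pd.read_csv(path, nrows=0).columns
    id_dtypes = {col: str for col in columns if col.endswith(suffix)}
    # Second pass: read the data with the id columns kept as strings.
    return pd.read_csv(path, dtype=id_dtypes)


# Made-up sample file: device_id is empty in the second row.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "file.csv")
    with open(sample_path, "w") as f:
        f.write("user_id,device_id,country\n1,2,5\n3,,6\n")
    df = read_csv_with_id_strings(sample_path)

print(df)
print(df.dtypes)
```

The file is opened twice, but the first pass stops after the header, so the extra cost should be negligible even for large files. Empty fields still come back as NaN; only the non-missing values are kept as strings.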

If I call df['user_id'].astype(object) afterwards, does it make a difference that pandas has already "detected" int64 or float64?

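That depends on the data. If a column contains NaNs alongside the integers, pandas has already parsed the integers as floats, so converting afterwards gives you values such as 1.0 rather than 1. Very large ids may also have lost precision in the conversion to float.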
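
A small sketch of that difference, using made-up data in which device_id is empty in the second row. It compares converting after the fact with asking for strings at read time:

```python
import io

import pandas as pd

# Made-up data: device_id is missing in the second row.
csv_text = "user_id,device_id\n1,2\n3,\n"

# Let pandas infer the types first.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)  # user_id is int64, device_id is float64 because of the NaN

# Converting afterwards keeps the float representation.
as_object = inferred["device_id"].astype(object)
print(type(as_object.iloc[0]))  # still a float, just stored in an object column
late = inferred["device_id"].astype(str)
print(late.iloc[0])  # '2.0', not '2'

# Asking for strings at read time keeps the text as written in the file.
early = pd.read_csv(io.StringIO(csv_text), dtype={"device_id": str})
print(early["device_id"].iloc[0])  # '2'
```

So astype(object) or astype(str) after reading cannot recover the original text of a column that was already parsed as float64; passing dtype to read_csv avoids the float round trip altogether.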

【Comments】: