Why do Categories columns take up more space than the Object columns?

【Question title】: Why do Categories columns take up more space than the Object columns?
【Posted】: 2023-01-03 02:47:26
【Question】:

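When I run this code and look at the output of info(), the DataFrame that uses category types seems to take up more space (932 bytes) than the DataFrame that uses object types (624 bytes).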

import pandas as pd

def initData():
    myPets = {"animal":         ["cat",    "alligator", "snake",     "dog",    "gerbil",  "lion",      "gecko",  "hippopotamus", "parrot",   "crocodile", "falcon",   "hamster", "guinea pig"],
              "feel"  :         ["furry",  "rough",     "scaly",     "furry",  "furry",   "furry",     "rough",  "rough",        "feathery", "rough",     "feathery", "furry",   "furry"     ],
              "where lives":    ["indoor", "outdoor",   "indoor",    "indoor", "indoor",  "outdoor",   "indoor", "outdoor",      "indoor",   "outdoor",   "outdoor",  "indoor",  "indoor"    ],
              "risk":           ["safe",   "dangerous", "dangerous", "safe",   "safe",    "dangerous", "safe",   "dangerous",    "safe",     "dangerous", "safe",     "safe",    "safe"      ],
              "favorite food":  ["treats", "fish",      "bugs",      "treats", "grain",   "antelope",  "bugs",   "antelope",     "grain",    "fish",      "rabbit",   "grain",   "grain"     ],
              "want to own":    [1,        0,           0,           1,        1,         0,           1,        0,              1,          0,           1,          1,         1           ] }
    petDF = pd.DataFrame(myPets)
    petDF = petDF.set_index("animal")
    #print(petDF.info())
    #petDF.head(100)
    return petDF

def addCategoryColumns(myDF):
    myDF["cat_feel"]          = myDF["feel"].astype("category")
    myDF["cat_where_lives"]   = myDF["where lives"].astype("category")
    myDF["cat_risk"]          = myDF["risk"].astype("category")
    myDF["cat_favorite_food"] = myDF["favorite food"].astype("category")
    return myDF

objectsDF = initData()
categoriesDF = initData()
categoriesDF = addCategoryColumns(categoriesDF)
categoriesDF = categoriesDF.drop(["feel", "where lives", "risk", "favorite food"], axis = 1)
print(objectsDF.info())
print(categoriesDF.info())
categoriesDF.head()


<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, cat to guinea pig
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   feel           13 non-null     object
 1   where lives    13 non-null     object
 2   risk           13 non-null     object
 3   favorite food  13 non-null     object
 4   want to own    13 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 624.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, cat to guinea pig
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   want to own        13 non-null     int64   
 1   cat_feel           13 non-null     category
 2   cat_where_lives    13 non-null     category
 3   cat_risk           13 non-null     category
 4   cat_favorite_food  13 non-null     category
dtypes: category(4), int64(1)
memory usage: 932.0+ bytes
None

【Comments】:

Tags: python pandas

【Solution 1】:

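Numeric data, such as int / float / category (for category, the integer codes), is stored in a numpy array. Put a million or two rows in there, so that bookkeeping overhead is negligible, and you will see that memory use is exactly 8 × num_elements, or a smaller multiple for data types narrower than 64 bits.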
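As a rough illustration of that point, the sketch below builds a one-million-row column three ways and compares the reported sizes. The variable names and the random data are made up for this example; only the "feel" labels come from the question.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # large enough that fixed bookkeeping overhead is negligible

# Illustrative data: random draws from four labels used in the question.
labels = np.array(["furry", "rough", "scaly", "feathery"], dtype=object)
rng = np.random.default_rng(0)
values = labels[rng.integers(0, len(labels), size=n)]

obj_col = pd.Series(values, dtype=object)          # one pointer per row
cat_col = obj_col.astype("category")               # small integer code per row
int_col = pd.Series(np.zeros(n, dtype=np.int64))   # 8 bytes per row

int_bytes = int_col.memory_usage(index=False)
cat_bytes = cat_col.memory_usage(index=False, deep=True)
obj_bytes = obj_col.memory_usage(index=False, deep=True)

print("category codes dtype:", cat_col.cat.codes.dtype)
print("int64 column:   ", int_bytes)
print("category column:", cat_bytes)
print("object column:  ", obj_bytes)
```

With only four distinct labels the codes fit in a one-byte integer type, so at this scale the category column should come out far smaller than the object column. The 13-row frame in the question is too small for that saving to outweigh the per-column overhead.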

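In contrast, the "object" dtype is a pointer to some externally allocated memory region, usually a str. So numpy / pandas reports only the size of the pointer array, 8 × num_elements when addresses are 64 bits, and leaves it to you to sum up all those external allocations.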
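This also accounts for the 624 bytes in the question: five columns plus the index, each with 13 elements of 8 bytes, gives 6 × 13 × 8 = 624, and the trailing "+" in the info() output marks that the strings themselves were not counted. A minimal sketch of the difference, assuming a 64-bit Python build and using the "feel" column from the question:

```python
import sys
import pandas as pd

feel = pd.Series(
    ["furry", "rough", "scaly", "furry", "furry", "furry", "rough",
     "rough", "feathery", "rough", "feathery", "furry", "furry"],
    dtype=object,
)

# Shallow: only the array of pointers (8 bytes each on a 64-bit build).
shallow = feel.memory_usage(index=False)

# Deep: pointers plus the str objects they refer to.
deep = feel.memory_usage(index=False, deep=True)

# The same idea by hand: add up the size of every referenced string.
manual = shallow + sum(sys.getsizeof(s) for s in feel)

print(shallow, deep, manual)
```

To see the deep figure directly for a whole frame, df.info(memory_usage="deep") can be used in place of the plain info() call.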

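Use getsizeof recursively, or use pympler, to get a better idea of total memory consumption. Or use psutil to ask the OS about memory usage before and after you make a big allocation.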
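One way to apply getsizeof recursively is sketched below. The helper total_size is hypothetical (it is not part of the standard library or pandas) and handles only the built-in containers. It counts each object once, so a string shared by many rows is not counted again.

```python
import sys

def total_size(obj, seen=None):
    """Hypothetical helper: size of obj plus everything it references.

    Each object is counted once, tracked by id(), so shared strings
    are not double-counted. Only built-in containers are traversed.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

feel = ["furry", "rough", "scaly", "furry", "furry"]
container_only = sys.getsizeof(feel)   # the list and its pointers
everything = total_size(feel)          # the list plus the distinct strings

print(container_only, everything)
```

The gap between the two numbers is the "external allocation" that a shallow report leaves out.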

【Comments】: