Memory usage accumulated in while loop: answers

【Question title】: Memory usage accumulated in while loop
【Posted】: 2015-11-18 07:19:49
【Question】:

My code contains this while loop:

while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

A is a pandas DataFrame that looks like this:

   score  coordinate
0  0.158           1
1  0.167           2
2  0.175           3
3  0.183           4
4  0.190           5

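I'm trying to find the highest score (a peak) in A, then exclude some coordinates around the peak just found (a couple hundred in this case), then find the next peak, and so on.

A is a very large pandas DataFrame. Before running this while loop, the IPython session used 20% of the machine's memory. I expected the loop to only reduce memory consumption, since I'm excluding data from the DataFrame. However, I observe that memory usage keeps increasing, and at some point the machine runs out of memory.

Is there anything I'm missing here? Do I need to explicitly free memory somewhere?

Here is a short script that reproduces the behavior with random data: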

import numpy as np
import pandas as pd

A = pd.DataFrame({'score':np.random.random(132346018), 'coordinate':np.arange(1, 132346019)})
peak_centers = []
peak_scores = []
exclusion = 147
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

# terminated the loop after memory consumption gets to 90% of machine memory
# but peak_centers and peak_scores are still short lists
print(len(peak_centers))
# output is 16

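【Question comments】:

Did you print the lengths of the lists? Maybe you're missing something, even though you assume they keep getting bigger!?
My first guess is that A.shape[0] never reaches 0. while loops are a great way to accidentally create an infinite loop, and you're appending to peak_centers and peak_scores on every iteration. If you've messed something up, they'll keep growing until you run out of space. If you must use a while loop, I strongly recommend double-checking that the while test moves closer to False on every iteration.
So far I don't think that's the cause. See the updated code to reproduce this.

Tags: python memory pandas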

【Solution 1】:

Your DataFrame is too large to handle this way. The memory load doubles while this line executes:

A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

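This is because you assign a new value to A, so memory is allocated for the new DataFrame while the old one is still being filtered. The new one is almost the same size as the old one because you're selecting nearly all of the data points. That consumes enough memory for two copies of A, and that's before counting the extra bookkeeping memory used by the loc implementation.

Apparently loc causes pandas to allocate enough memory for an additional copy of the data. I'm not sure why; I assume it's some kind of performance optimization. It means that at peak memory usage you end up consuming four times the size of the DataFrame. Once loc finishes and the memory that is no longer needed is freed (which you can force by calling gc.collect()), the memory load drops to twice the size of the DataFrame. On the next call to loc, everything doubles and you're back to four times the load. Collect garbage again and you're back down to double. This continues for as long as you like.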
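As a rough illustration of the first point (the filtered result is a separate object of almost the same size), here is a minimal Python 3 sketch on a much smaller frame. The size n and the names keep, B, size_a, size_b and shares are assumptions made for this example; it only compares the data buffers and does not measure the extra bookkeeping memory mentioned above.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's DataFrame (the size n is an assumption).
n = 1000000
A = pd.DataFrame({'score': np.random.random(n),
                  'coordinate': np.arange(1, n + 1)})
exclusion = 147
one_center = A.coordinate.iloc[A.score.values.argmax()]

# Same boolean filter as in the question, but bound to a new name so that
# both the old and the new DataFrame can be inspected side by side.
keep = (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)
B = A.loc[keep]

size_a = A.memory_usage(index=False).sum()  # bytes held by A's columns
size_b = B.memory_usage(index=False).sum()  # bytes held by B's columns
shares = np.shares_memory(A.score.values, B.score.values)

print('rows removed: %d' % (len(A) - len(B)))
print('bytes in A: %d, bytes in B: %d' % (size_a, size_b))
print('B shares memory with A: %s' % shares)
```

At most 2 * exclusion - 1 rows are removed per pass, so B needs nearly as many bytes as A, and until the name A is rebound both buffers exist at the same time.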

To verify what's going on, run this modified version of your code:

import numpy as np
import pandas as pd
import gc

A = pd.DataFrame({'score':np.random.random(32346018), 'coordinate':np.arange(1, 32346019)})
peak_centers = []
peak_scores = []
exclusion = 147
count = 0
while A.shape[0] > 0:
    gc.collect()  # Force garbage collection.
    count += 1    # Increment the iteration count.
    print('iteration %d, shape %s' % (count, A.shape))
    raw_input()   # Wait for the user to press Enter (Python 2; use input() on Python 3).
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    print(len(peak_centers), len(peak_scores))
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

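Press Enter between iterations and keep an eye on memory usage with top or a similar tool.

At the start of the first iteration, you'll see memory usage of x percent. In the second iteration, after loc has been called for the first time, memory usage doubles to 2x. After that, you'll see it go up to 4x during each call to loc, then drop back to 2x after garbage collection.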
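If watching top by hand is inconvenient, a similar check can be sketched with the standard-library tracemalloc module (Python 3). This is only an approximation under stated assumptions: tracemalloc counts allocations reported to Python's tracing hooks, which I believe recent NumPy versions do for array data, and its figures are not the process-level numbers that top shows. The frame size n and the three-iteration limit are arbitrary choices for the example.

```python
import tracemalloc

import numpy as np
import pandas as pd

# Small stand-in for the question's DataFrame (the size n is an assumption).
n = 1000000
A = pd.DataFrame({'score': np.random.random(n),
                  'coordinate': np.arange(1, n + 1)})
exclusion = 147

tracemalloc.start()  # only allocations made after this call are traced
for count in range(1, 4):
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # Same filter as in the question: rebinding A frees the previous frame
    # only after the new one has been built.
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]
    current, peak = tracemalloc.get_traced_memory()
    print('iteration %d, shape %s, traced now %d bytes, traced peak %d bytes'
          % (count, A.shape, current, peak))
tracemalloc.stop()
```

Comparing the "traced now" and "traced peak" figures across iterations gives a rough picture of how far usage rises above its resting level during each filter.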

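【Comments】:

I somewhat understand what's happening after running the code. Thanks for the explanation!
@qkhhly Please take a look at my answer as well. Using the same little check with top, I can confirm that if you use drop with inplace=True, memory does not double from a copy on each pass. In my opinion, dropping in place is more idiomatic pandas than making a copy only to immediately garbage-collect the now-defunct original.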

【Solution 2】:

If you want to destructively mutate A without copying a large subset of A's data, use DataFrame.drop with inplace=True.

# Rows strictly inside the exclusion window around the peak are the ones to drop.
places_to_drop = (A.coordinate - one_center).abs() < exclusion
A.drop(A.index[places_to_drop.values], inplace=True)
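A minimal Python 3 sketch on toy data, checking that dropping the rows strictly inside the exclusion window in place leaves the same rows as the question's loc filter. The six-row frame, exclusion = 2, and the names expected and inside are assumptions for illustration; the sketch checks the result only and does not measure memory.

```python
import pandas as pd

# Toy data (an assumption for this example); the peak is at coordinate 2.
A = pd.DataFrame({'score': [0.1, 0.9, 0.3, 0.2, 0.8, 0.4],
                  'coordinate': [1, 2, 3, 4, 5, 6]})
exclusion = 2
one_center = A.coordinate.iloc[A.score.values.argmax()]

# Result of the question's filter, kept only for comparison.
expected = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

# Rows strictly inside the window around the peak are the ones to remove.
inside = ((A.coordinate - one_center).abs() < exclusion).values
A.drop(A.index[inside], inplace=True)

print(A)
```

Here the rows with coordinates 1, 2 and 3 are removed, and the frame left in A matches expected row for row.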

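The place where the original usage of loc ultimately bottoms out is the _NDFrameIndexer method _getitem_iterable. _LocIndexer is a subclass of _NDFrameIndexer, and an instance of _LocIndexer is created to populate the loc attribute of a DataFrame.

In particular, _getitem_iterable checks for a boolean index, which is what you have in your case. It then creates a new array of the positions where the boolean key is True (which wastes memory when key is already in boolean format):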

inds, = key.nonzero()

and then finally returns the "true" locations in a copy:

return self.obj.take(inds, axis=axis, convert=False)

In that code, key is your boolean index (i.e. the result of the expression (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)), and self.obj is the parent DataFrame instance on which loc was invoked, so obj here is just A.
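The two quoted steps can be mimicked by hand through the public API. This is a minimal Python 3 sketch on toy data, not the pandas internals themselves: the frame, the hard-coded bounds, and the names via_take and via_loc are assumptions for illustration, and the internal keyword arguments from the quoted line are left out.

```python
import pandas as pd

# Toy data (an assumption for this example).
A = pd.DataFrame({'score': [0.1, 0.9, 0.3, 0.2, 0.8, 0.4],
                  'coordinate': [1, 2, 3, 4, 5, 6]})

# Boolean key of the same form as the question's filter, as a NumPy array.
key = ((A.coordinate <= 0) | (A.coordinate >= 4)).values

inds, = key.nonzero()    # integer positions where key is True
via_take = A.take(inds)  # new object holding only those rows
via_loc = A.loc[key]     # the same selection through loc

print(inds)
print(via_take.equals(via_loc))
```

Both paths select the same rows, which is consistent with the description above: the boolean key is first converted to integer positions, and those positions are then used to build a new object.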

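The DataFrame.take documentation explains that the default behavior is to make a copy. In the current implementation of the indexers, there is no mechanism for passing a keyword argument through that would eventually make take run without copying.

On any reasonably modern machine, using the drop method should be no problem for the data size you describe, so the size of A is not to blame.

【Comments】:

Thanks for explaining the internals. It didn't seem to stop memory usage from spiking above 90%, although it didn't grow as fast as with the previous version. In the end I decided not to use the approach in this post. It ran reasonably fast for me with a smaller A, but it was too slow for the larger A.
@qkhhly I suspect something else is going on, because when I profile memory this way (using drop) with the snippet Michael Laszlo posted, I don't see memory growth. Unfortunately, the pandas indexer code is so obtuse and hard to understand that unexpected problems like this seem to come up often.
I didn't force garbage collection, which may be the reason.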