Turn Pandas DataFrame of strings into histogram: answers

【Question title】: Turn Pandas DataFrame of strings into histogram
【Posted】: 2013-02-06 05:01:32
【Question】:

Suppose I have a DataFrame created like this:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

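The strings in the real data are quite sparse. I would like to create histograms of the string occurrences that look like what d.hist() would produce for s1 and s2, with one subplot per column.

Just calling d.hist() gives this error: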

/Library/Python/2.7/site-packages/pandas/tools/plotting.pyc in hist_frame(data, column, by, grid, xlabelsize, xrot, ylabelsize, yrot, ax, sharex, sharey, **kwds)
   1725         ax.xaxis.set_visible(True)
   1726         ax.yaxis.set_visible(True)
-> 1727         ax.hist(data[col].dropna().values, **kwds)
   1728         ax.set_title(col)
   1729         ax.grid(grid)

/Library/Python/2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8099             # this will automatically overwrite bins,
   8100             # so that each histogram uses the same bins
-> 8101             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
   8102             if mlast is None:
   8103                 mlast = np.zeros(len(bins)-1, m.dtype)

/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    167             else:
    168                 range = (a.min(), a.max())
--> 169         mn, mx = [mi+0.0 for mi in range]
    170         if mn == mx:
    171             mn -= 0.5

TypeError: cannot concatenate 'str' and 'float' objects

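I suppose I could go through each Series manually, call value_counts(), plot the result as a bar chart, and create the subplots by hand. I wanted to see whether there is a simpler way.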
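For reference, the manual route described above might look roughly like the sketch below. It assumes matplotlib is installed; the `counts` dictionary and the `Agg` backend line are illustrative choices, not part of the original question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# One subplot per column, created by hand.
fig, axes = plt.subplots(nrows=1, ncols=len(d.columns), figsize=(8, 3))

counts = {}  # hypothetical helper dict: column name -> value counts
for ax, col in zip(axes, d.columns):
    # value_counts() drops NaN by default, so the padding row in s1 is ignored.
    counts[col] = d[col].value_counts().sort_index()
    counts[col].plot(kind='bar', ax=ax, title=col)

fig.tight_layout()
```

This works, but the loop and the subplot bookkeeping are exactly the boilerplate the question hopes to avoid.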

【Comments】:

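All the answers based on value_counts are wrong, because the question is about producing a histogram, not merely counting values. A histogram over a collection of strings is better captured by treating the strings as categorical, sortable data with a minimum and maximum, bins, and a total ordering.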
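One way to read this comment is to map the strings onto an ordered categorical and histogram the integer codes. The sketch below is only an interpretation of that idea; the alphabetical ordering and the half-integer bin edges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])

# Assumption: alphabetical order defines the total ordering of the categories.
cat = pd.Categorical(s2, categories=sorted(s2.unique()), ordered=True)

# Each string is now an integer code, so numeric binning is well defined.
codes = cat.codes

# One bin per category, centred on the integer codes.
edges = np.arange(len(cat.categories) + 1) - 0.5
hist, _ = np.histogram(codes, bins=edges)

lowest, highest = cat.min(), cat.max()  # defined because the categorical is ordered
```

With one bin per category the result coincides with the value counts; the two only differ once several categories are merged into a single bin.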

Tags: python pandas matplotlib dataframe

【Solution 1】:

Recreate the DataFrame:

import pandas as pd
s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

Get the histogram with subplots, as requested:

d.apply(pd.value_counts).plot(kind='bar', subplots=True)

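The OP mentioned value_counts in the question. I think the missing piece is simply that there is no reason to create the desired bar charts "manually".

The output of d.apply(pd.value_counts) is a pandas DataFrame. We can plot its values like those of any other DataFrame, and the option subplots=True gives us what we want.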
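As far as I know, the top-level pd.value_counts function has been deprecated in recent pandas releases, so on a newer installation the same idea may need the Series method instead. A minimal sketch under that assumption (the `fillna(0)` step and the `Agg` backend line are additions for illustration, not part of the original answer):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import pandas as pd

s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])
d = pd.DataFrame({'s1': s1, 's2': s2})

# Count each column with the Series method rather than pd.value_counts.
counts = d.apply(lambda col: col.value_counts())

# Strings missing from a column come back as NaN; show them as zero-height bars.
counts = counts.fillna(0).astype(int).sort_index()

# One bar chart per column, stacked as subplots.
axes = counts.plot(kind='bar', subplots=True)
```

Whether to fill the NaN entries with zeros is a matter of taste; leaving them in place simply omits those bars.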

【Comments】:

This works great! Any idea why matplotlib's hist cannot plot the same thing (it just takes forever), whereas value_counts plus a bar chart, as used here, works?

【Solution 2】:

You can use pd.value_counts (value_counts is also a Series method):

In [20]: d.apply(pd.value_counts)
Out[20]: 
   s1  s2
a   3   3
b   2 NaN
c   1 NaN
d NaN   1
f NaN   3

Then plot the resulting DataFrame.

【Comments】:

【Solution 3】:

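I would push the Series into a collections.Counter (documentation); you might need to convert it to a list first. I am not a pandas expert, but I think you should be able to fold the Counter object back into a Series indexed by the strings and use that to make your plots.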
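A rough sketch of the Counter round trip described above, using one of the example Series from the question; the variable names are illustrative only.

```python
from collections import Counter

import pandas as pd

s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f'])

# Convert to a plain list first, dropping any missing values.
counter = Counter(s2.dropna().tolist())

# Counter is a dict subclass, so it can be folded straight back into a Series
# indexed by the strings.
counts = pd.Series(counter).sort_index()
```

From here, counts.plot(kind='bar') would draw the bars, much as with the result of value_counts.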

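Calling hist directly does not work because, when it tries to guess where the bin edges should be, it (rightly) raises an error: numeric bin edges make no sense for strings.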
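The failure can be reproduced without pandas or matplotlib by handing the raw strings to np.histogram. This is a sketch: the exact exception type and message depend on the NumPy version, so the code only records whatever is raised.

```python
import numpy as np

# The same kind of object array that d['s1'].dropna().values would hand to hist.
values = np.array(['a', 'b', 'a', 'c', 'a', 'b'], dtype=object)

error = None
try:
    # np.histogram has to derive numeric bin edges from the data range,
    # which is not possible for strings.
    np.histogram(values)
except Exception as exc:  # exception type varies across NumPy versions
    error = exc

print(type(error).__name__, error)
```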

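【Comments】:

Argh, you beat me to it! Yes, Counter is the tool for this job!
Thanks for the reply. value_counts does the same thing, and it is a Series -> Series transformation, so there is no need to coerce the result back into a Series. I guess I was wondering whether there is some option that would do the counting and plotting automatically for this specific string case, since there is one for integers.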