Pandas, Plotly heatmaps and matrix: answers

【Question title】: Pandas, plotly heatmaps and matrix
【Posted】: 2021-01-30 11:28:06
【Question description】:

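I am using Python 3.8, Plotly 4.14.1 and pandas 1.2.0.

I can't work out how to separate my data in pandas and assign it to counters so that I can update a heatmap.

I want to create a risk matrix of impact x likelihood and display the counts on a Plotly heatmap.

Hard-coding the data into a data frame works as expected.

With Figure Factory: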

import plotly.figure_factory as ff

gross_data=[[0,1,2,6,3], [0,7,18,12,6], [6,10,43,44,7], [3,15,29,46,18], [5,14,26,22,21]]

x=['Minor [Very Low]<br>1', 'Important [Low]<br>2', 'Significant [Moderate]<br>3', 'Major [High]<br>4', 'Critical [Very High]<br>5']
y=['Very Remote<br>1', 'Remote<br>2', 'Unlikely<br>3', 'Possible<br>4', 'Highly Possible<br>5']

fig = ff.create_annotated_heatmap(gross_data, x=x, y=y, colorscale='Magma')
fig['layout']['xaxis']['side'] = 'bottom'
fig.show()

Plotly Figure Factory annotated heatmap

Or doing it with Plotly Express:

import plotly.express as px

gross_data=[[0,1,2,6,3], [0,7,18,12,6], [6,10,43,44,7], [3,15,29,46,18], [5,14,26,22,21]]

fig = px.imshow(gross_data,
                labels=dict(x="Impact", y="Probability", color="Number of Risks"),
                x=['Minor [Very Low]', 'Important [Low]', 'Significant [Moderate]', 'Major [High]', 'Critical [Very High]'],
                y=['Very Remote', 'Remote', 'Unlikely', 'Possible', 'Highly Possible']
               )
fig.update_xaxes(side="bottom")

fig.show()

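Plotly heatmap

Both of these work for a one-off presentation, but I want to be able to slice and dice my data and display it by business unit and so on.

I am using pandas to read in around 1,000 rows of Excel data that were filled in by hand as a form, with all the errors that brings. I have tidied up and cleaned the data.

What I can't work out is how to get the counts needed for each cell of the risk matrix without a huge block of if statements like the one below.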

for x in df['gross_impact']:
    for y in df['gross_likelihood']:
        if (x == 1) & (y == 1):
            increment a counter associated with cell x1y1
        elif (x == 2) & (y == 1):
            increment a counter associated with cell x2y1
        elif (x == 3) & (y == 1):
            increment a counter associated with cell x3y1
        elif (x == 4) & (y == 1):
            increment a counter associated with cell x4y1
        elif (x == 5) & (y == 1):
            increment a counter associated with cell x5y1
        elif (x == 1) & (y == 2):
            increment a counter associated with cell x1y2
        elif (x == 2) & (y == 2):
            increment a counter associated with cell x2y2
        .....
        .....
        elif (x == 5) & (y == 5):
            increment a counter associated with cell x5y5

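This is obviously completely inefficient. It ran around 233,000 loop iterations to get my result, and although compute is cheap, it just isn't right.

I just don't know how to do it. I have read several Stack Exchange questions, but they did not solve my problem. I searched online for risk matrices, but what came back was in C++ or dealt with financial data and didn't look applicable.

My data frame has around 30 fields, but the only unique reference is risk_id. The fields I multiply against each other are risk_impact and risk_likelihood. I can sort on risk_id and business_unit.

I have to differentiate between 1 x 3 and 3 x 1 because they go into two different bins. Counting should run from bottom left to top right, so 1 x 3 is column 1, row 3, and 3 x 1 is column 3, row 1.

I hope this is clear. If I knew exactly what to ask for, I would probably have done it already.

Any help with solving this would be greatly appreciated.

Here is a sample of the data in my data frame: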

import pandas as pd

# Our data to be loaded into the data frame
data = {'risk_id': ['AP-P01-R01', 'AP-P02-R02', 'AP-P03-R03', 'AP-P01-R04', 'BP-P01-R01', 'BP-P01-R02', 'BP-P01-R03', 'BP-P01-R04', 'BP-P01-R05', 'BP-P01-R06', 'BP-P01-R07', 'CP-P01-R01', 'CP-P01-R02', 'CP-P01-R03', 'CP-P01-R04', 'CP-P01-R05', 'CP-P01-R06', 'CP-P01-R07', 'CP-P01-R08'],
        'gross_impact': [4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 3, 3, 4, 4, 2, 3, 5, 3, 2],
        'gross_likelihood': [3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 2, 3, 3, 3, 3, 4, 4, 5, 3],
        'business_unit': ['Accounts Payable', 'Accounts Payable', 'Accounts Payable', 'Accounts Payable', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile']
        }

# Build the data frame from the dictionary above
df = pd.DataFrame(data)

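Everything works

The solution provided works perfectly on my limited dataset. When I create a plain pandas data frame from my master file, I get the following error:

IndexError: arrays used as indices must be of integer (or boolean) type

The error is raised by the following line when I run the code: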

heatmap[counts[:,1]-1, counts[:,0]-1] = counts[:,2]

The full code I am running is:

import numpy as np

df = raca_df[['risk_id', 'gross_impact','gross_likelihood', 'business_unit']].dropna()
counts = df.groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values

heatmap = np.zeros((np.max(5), np.max(5)))
#heatmap = np.zeros((np.max(df['gross_impact']), np.max(df['gross_likelihood'])))
heatmap[counts[:,1]-1, counts[:,0]-1] = counts[:,2]

import plotly.figure_factory as ff
fig = ff.create_annotated_heatmap(heatmap)
fig['layout']['xaxis']['side'] = 'bottom'
fig.show()

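After being asked how I created raca_df, a quick raca_df.info() pointed out my problem. It turned out to be an issue with my original raca_df data frame, whose columns were of type float64. I also had some blank entries in those columns, so it would not let me change the column types there.

I had to create a new data frame called df from raca_df, and I changed the column types there using a loop and astype('int'):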

df = raca_df[['risk_id', 'gross_impact','gross_likelihood', 'business_unit']].dropna()

# Cast on df (blank rows already dropped), not on raca_df, which still has blanks
for item in ['gross_impact','gross_likelihood']:
    df[item] = df[item].astype('int')
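The float-to-int fix described above can be sketched as follows. This is a minimal illustration, not the asker's actual file: the small `raca_df` and its values are made up, and `score_cols` is a hypothetical helper name. NumPy rejects float arrays as indices, so the scores have to be integers before they are used to fill the heatmap.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raca_df: float64 score columns with blank entries.
raca_df = pd.DataFrame({
    "risk_id": ["R1", "R2", "R3", "R4"],
    "gross_impact": [4.0, 3.0, np.nan, 5.0],
    "gross_likelihood": [3.0, 5.0, 2.0, np.nan],
    "business_unit": ["A", "A", "B", "B"],
})

score_cols = ["gross_impact", "gross_likelihood"]

# Drop rows with a missing score first; a column holding NaN cannot be cast to int.
df = raca_df[["risk_id", *score_cols, "business_unit"]].dropna(subset=score_cols)

# Cast only the score columns; the other columns keep their dtypes.
df = df.astype({col: int for col in score_cols})
```

Dropping before casting matters here: calling `astype(int)` on a column that still contains blanks raises an error, which matches what the asker ran into on `raca_df`.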

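【Question comments】:

Hi. Please provide a sample of your data.
Sample data added.
How did you define raca_df?
Hey, thanks. I've just updated the question. I've now solved the problem. My original columns were float64, but I couldn't change them to int because there were blank values in the columns. When I created the df data frame I dropped the blanks with .dropna, and then I was able to change the column types, and now everything works. Thanks again for taking the time to point me in the right direction.

Tags: python-3.x pandas dataframe plotly plotly-python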

【Solution 1】:

You can group by impact and likelihood and use the group sizes to get your heatmap intensities:

import numpy as np

counts = df.groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values

# then, you can initialise an empty heatmap and use fancy indexing to populate it:    
heatmap = np.zeros((np.max(df['gross_impact']), np.max(df['gross_likelihood'])))
heatmap[counts[:,0]-1, counts[:,1]-1] = counts[:,2]

# and finally plot it
import plotly.figure_factory as ff
fig = ff.create_annotated_heatmap(heatmap)
fig.show()
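As an alternative sketch (not the method used in this answer), `pd.crosstab` can build the count table directly, and `reindex` can pad it to a fixed 5 x 5 grid so the matrix does not shrink when a score never occurs. The data below is the impact and likelihood sample from the question; the names `levels` and `matrix` are chosen here for illustration. Rows are likelihood and columns are impact, so a 1 x 3 risk lands in column 1, row 3.

```python
import pandas as pd

# Impact and likelihood columns from the question's sample data.
df = pd.DataFrame({
    "gross_impact": [4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 3, 3, 4, 4, 2, 3, 5, 3, 2],
    "gross_likelihood": [3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 2, 3, 3, 3, 3, 4, 4, 5, 3],
})

levels = list(range(1, 6))  # scores 1..5 on both axes

# Count each (likelihood, impact) pair, then pad missing scores with zeros
# so the result is always 5 x 5.
matrix = (
    pd.crosstab(df["gross_likelihood"], df["gross_impact"])
    .reindex(index=levels, columns=levels, fill_value=0)
)

heatmap = matrix.to_numpy()  # row 0 = likelihood 1, column 0 = impact 1
```

`heatmap` can then be passed to `ff.create_annotated_heatmap` together with the `x` and `y` labels from the question. With the Figure Factory version the first row should be drawn at the bottom, which matches the bottom-left to top-right counting; `px.imshow` puts the first row at the top unless its `origin` argument is changed.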

Edit:

For individual business units, you can run the same code with either of the following adjustments:

# for a single business unit
counts = df[df['business_unit']=='Accounts Payable'].groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values

# and then the remaining code


# for all units, you can loop:
for business_unit in set(df['business_unit']):
    counts = df[df['business_unit']==business_unit].groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values
    
    # and then the remaining code
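The per-unit loop above could also be written by grouping on `business_unit` once and building one matrix per group. This is a sketch under assumptions: `risk_matrix` is a hypothetical helper, not part of pandas or Plotly, and the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "gross_impact": [4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 3, 3, 4, 4, 2, 3, 5, 3, 2],
    "gross_likelihood": [3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 2, 3, 3, 3, 3, 4, 4, 5, 3],
    "business_unit": ["Accounts Payable"] * 4
                     + ["British Petroleum"] * 7
                     + ["Client Profile"] * 8,
})


def risk_matrix(frame, size=5):
    """Hypothetical helper: size x size counts, rows = likelihood, columns = impact."""
    levels = list(range(1, size + 1))
    return (
        frame.groupby(["gross_likelihood", "gross_impact"])
        .size()                      # number of risks in each (likelihood, impact) cell
        .unstack(fill_value=0)       # likelihood down the rows, impact across the columns
        .reindex(index=levels, columns=levels, fill_value=0)
    )


# One matrix per business unit, keyed by the unit name.
matrices = {unit: risk_matrix(group) for unit, group in df.groupby("business_unit")}
```

Each value in `matrices` is a 5 x 5 DataFrame, so `matrices["Accounts Payable"].to_numpy()` can be plotted the same way as the overall heatmap.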

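【Comments】:

Great! Let me know if anything is still unclear.
Hey warped. Thanks, it works well. I've played around with it and I think I understand most of it. I changed the second line to "heatmap = np.zeros((np.max(5), np.max(5)))" because there may be no impact or likelihood of 5, in which case the matrix could shrink. Could you explain the indexing? I've changed the data to see the effect, but looking at Plotly FigureFactory create_annotated_heatmap I can't see how the matrix is mapped. Are we just mapping the values, with the heatmap handling x and y?
Could you point me in the right direction for extracting a subset for a single business unit heatmap?
If you look at counts, you will see that it has 3 columns. These come from gross_impact, gross_likelihood and the group size respectively. gross_impact is interpreted as the row index, gross_likelihood as the column index, and the assigned value is the group size. The -1 corrects for the fact that Python indexing starts at 0.
Thank you very much for the guidance. I learned a lot from it. It definitely beats my loop idea :)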
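The indexing explained in the comments can be seen on a small example. This is an illustrative sketch with three made-up rows in the same (gross_impact, gross_likelihood, group size) layout as `counts`; the names `by_impact` and `by_likelihood` are chosen here. It also shows the swapped form used in the question, where likelihood is the row index.

```python
import numpy as np

# Each row is (gross_impact, gross_likelihood, group size).
counts = np.array([
    [2, 3, 2],
    [3, 2, 1],
    [4, 3, 9],
])
impact, likelihood, size = counts[:, 0], counts[:, 1], counts[:, 2]

# As in the answer: impact picks the row, likelihood picks the column.
# Subtracting 1 maps scores 1..5 onto array positions 0..4.
by_impact = np.zeros((5, 5), dtype=int)
by_impact[impact - 1, likelihood - 1] = size

# As in the question's version: likelihood picks the row, impact the column.
by_likelihood = np.zeros((5, 5), dtype=int)
by_likelihood[likelihood - 1, impact - 1] = size
```

The two arrays are transposes of each other, so which one to plot depends only on which score should run along the y axis.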