Allocate scatter plot into specific bins

【Question title】: Allocate scatter plot into specific bins
【Posted】: 2019-03-07 03:53:44
【Question description】:

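I have a scatter plot that is sorted into 4 bins. The bins are separated by two arcs and a line (see figure below).

There is a slight problem with the two arcs. If the X-coordinate of a point is greater than that of ang2, the point is not attributed to the correct bin (see figure below).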

import math
import matplotlib.pyplot as plt
import matplotlib as mpl

X = [24,15,71,72,6,13,77,52,52,62,46,43,31,35,41]  
Y = [94,61,76,83,69,86,78,57,45,94,82,74,56,70,94]      

fig, ax = plt.subplots()
ax.set_xlim(-100,100)
ax.set_ylim(-40,140)
ax.grid(False)

plt.scatter(X,Y)

#middle line
BIN_23_X = 0 
#two arcs
ang1 = -60, 60
ang2 = 60, 60
angle = math.degrees(math.acos(2/9.15))
E_xy = 0,60

Halfway = mpl.lines.Line2D((BIN_23_X,BIN_23_X), (0,125), color = 'white', lw = 1.5, alpha = 0.8, zorder = 1)
arc1 = mpl.patches.Arc(ang1, 70, 110, angle = 0, theta2 = angle, theta1 = 360-angle, color = 'white', lw = 2)
arc2 = mpl.patches.Arc(ang2, 70, 110, angle = 0, theta2 = 180+angle, theta1 = 180-angle, color = 'white', lw = 2)
Oval = mpl.patches.Ellipse(E_xy, 160, 130, lw = 3, edgecolor = 'black', color = 'white', alpha = 0.2)

ax.add_line(Halfway)
ax.add_patch(arc1)
ax.add_patch(arc2)
ax.add_patch(Oval)

#Sorting the coordinates into bins   
def get_nearest_arc_vert(x, y, arc_vertices):
    err = (arc_vertices[:,0] - x)**2 + (arc_vertices[:,1] - y)**2
    nearest = (arc_vertices[err == min(err)])[0]
    return nearest

arc1v = ax.transData.inverted().transform(arc1.get_verts())
arc2v = ax.transData.inverted().transform(arc2.get_verts())

def classify_pointset(vx, vy):
    bins = {(k+1):[] for k in range(4)}
    for (x,y) in zip(vx, vy):
        nx1, ny1 = get_nearest_arc_vert(x, y, arc1v)
        nx2, ny2 = get_nearest_arc_vert(x, y, arc2v)

        if x < nx1:                         
            bins[1].append((x,y))
        elif x > nx2:                      
            bins[4].append((x,y))
        else:
            if x < BIN_23_X:               
                bins[2].append((x,y))
            else:                          
                bins[3].append((x,y))
    return bins

#Bins Output
bins_red  = classify_pointset(X,Y)

all_points = [None] * 5
for bin_key in [1,2,3,4]:
    all_points[bin_key] = bins_red[bin_key]

Output:

[[], [], [(24, 94), (15, 61), (71, 76), (72, 83), (6, 69), (13, 86), (77, 78), (62, 94)], [(52, 57), (52, 45), (46, 82), (43, 74), (31, 56), (35, 70), (41, 94)]]

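This isn't quite right. Looking at the figure output below, 4 coordinates are in Bin 3 and 11 are in Bin 4. But 8 are attributed to Bin 3 and 7 to Bin 4.

I think the problem lies with the blue coordinates, specifically those whose X-coordinate is greater than that of ang2, which is 60. If I change these to less than 60, they will be corrected to Bin 3.

I'm not sure whether I should extend the arcs beyond 60, or whether the code can be improved?

Note that this example only covers Bin 4 and ang2. The same issue will occur for Bin 1 and ang1: if the X-coordinate is less than -60, the point will not be attributed to Bin 1.

Expected output: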

[[], [], [(24, 94), (15, 61), (6, 69), (13, 86)], [(71, 76), (72, 83), (52, 57), (52, 45), (46, 82), (43, 74), (31, 56), (35, 70), (41, 94), (77, 78), (62, 94)]]

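Note: the expected output format is preferred. The example uses a single row of input data, but my dataset is much larger. With numerous rows, the output should be produced row by row. For example: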

#Numerous rows
X = np.random.randint(50, size=(100, 10))
Y = np.random.randint(80, size=(100, 10))

Output:

Row 0 = [(x,y)],[(x,y)],[(x,y)],[(x,y)]
Row 1 = [(x,y)],[(x,y)],[(x,y)],[(x,y)]
Row 2 = [(x,y)],[(x,y)],[(x,y)],[(x,y)]
etc

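【Question discussion】:

Tags: python pandas numpy matplotlib plot

【Solution 1】:

Patches have a test for whether they contain a point: contains_point, and even one for arrays of points: contains_points

I've prepared a code snippet for you, which you can add between the part where you add the patches and the #Sorting the coordinates into bins code block.

It adds two additional (transparent) ellipses, which are used to calculate whether the arcs would contain a point if they were fully closed ellipses. Your bin calculation is then just a boolean combination of tests: whether a point belongs to the big oval, to the left or right ellipse, and whether it has a positive or negative x-coordinate.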

ov1 = mpl.patches.Ellipse(ang1, 70, 110, alpha=0)
ov2 = mpl.patches.Ellipse(ang2, 70, 110, alpha=0)
ax.add_patch(ov1)
ax.add_patch(ov2)

for px, py in zip(X, Y):
    in_oval = Oval.contains_point(ax.transData.transform(([px, py])), 0)
    in_left = ov1.contains_point(ax.transData.transform(([px, py])), 0)
    in_right = ov2.contains_point(ax.transData.transform(([px, py])), 0)
    on_left = px < 0
    on_right = px > 0
    if in_oval:
        if in_left:
            n_bin = 1
        elif in_right:
            n_bin = 4
        elif on_left:
            n_bin = 2
        elif on_right:
            n_bin = 3
        else:
            n_bin = -1
    else:
        n_bin = -1
    print('({:>2}/{:>2}) is {}'.format(px, py, 'in Bin ' +str(n_bin) if n_bin>0 else 'outside'))

The output is:

(24/94) is in Bin 3
(15/61) is in Bin 3
(71/76) is in Bin 4
(72/83) is in Bin 4
( 6/69) is in Bin 3
(13/86) is in Bin 3
(77/78) is outside
(52/57) is in Bin 4
(52/45) is in Bin 4
(62/94) is in Bin 4
(46/82) is in Bin 4
(43/74) is in Bin 4
(31/56) is in Bin 4
(35/70) is in Bin 4
(41/94) is in Bin 4

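Note that you should still decide how to define the bin for points with x-coord=0 - at the moment they count as outside, since neither on_left nor on_right feels responsible for them...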
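One way to settle the x = 0 case is to pick a side explicitly. The sketch below is not the snippet above: it swaps the patch test for the closed-form ellipse inequality (so no figure or transform is needed), reuses the centers and sizes from the question, and assumes that x = 0 should count as Bin 3. The names inside_ellipse and assign_bin are made up for this illustration, and points lying exactly on an outline may be classified differently than by contains_point.

```python
def inside_ellipse(px, py, cx, cy, width, height):
    """True if (px, py) lies strictly inside the axis-aligned ellipse."""
    return ((px - cx) / (width / 2)) ** 2 + ((py - cy) / (height / 2)) ** 2 < 1


def assign_bin(px, py):
    """Return 1-4 for a point inside the big oval, or -1 if it is outside."""
    if not inside_ellipse(px, py, 0, 60, 160, 130):   # big oval (E_xy)
        return -1
    if inside_ellipse(px, py, -60, 60, 70, 110):      # closed left ellipse (ang1)
        return 1
    if inside_ellipse(px, py, 60, 60, 70, 110):       # closed right ellipse (ang2)
        return 4
    # Assumed convention: x == 0 counts as Bin 3 (use `px > 0` to send it to Bin 2).
    return 3 if px >= 0 else 2


for point in [(0, 60), (24, 94), (71, 76), (77, 78), (-10, 60)]:
    print(point, assign_bin(*point))
```

Whichever side you choose, making the comparison inclusive on exactly one side guarantees that every point inside the oval lands in exactly one bin.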

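PS: Thanks to @ImportanceOfBeingErnest for the hint about the necessary transformation: https://stackoverflow.com/a/49112347/8300135

Note: for all of the following edits you need import numpy as np
EDIT: A function that calculates the bin distribution for each X, Y array input: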

def bin_counts(X, Y):
    bc = dict()
    E = Oval.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_l = ov1.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_r = ov2.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    L = np.array(X) < 0
    R = np.array(X) > 0
    bc[1] = np.sum(E & E_l)
    bc[2] = np.sum(E & L & ~E_l)
    bc[3] = np.sum(E & R & ~E_r)
    bc[4] = np.sum(E & E_r)
    return bc

This leads to the following result:

bin_counts(X, Y)
Out: {1: 0, 2: 0, 3: 4, 4: 10}

EDIT2: Many rows in two 2D arrays for X and Y:

np.random.seed(42)
X = np.random.randint(-80, 80, size=(100, 10))
Y = np.random.randint(0, 120, size=(100, 10))

Loop over all rows:

for xr, yr in zip(X, Y):
    print(bin_counts(xr, yr))

Result:

{1: 1, 2: 2, 3: 6, 4: 0}
{1: 1, 2: 0, 3: 4, 4: 2}
{1: 5, 2: 2, 3: 1, 4: 1}
...
{1: 3, 2: 2, 3: 2, 4: 0}
{1: 2, 2: 4, 3: 1, 4: 1}
{1: 1, 2: 1, 3: 6, 4: 2}
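If the Python-level loop over rows becomes a bottleneck, the counts can also be computed for all rows at once. This is only a sketch under a few assumptions: it replaces contains_points with the closed-form ellipse inequality (same centers and sizes as in the question), so it needs no axes or transform, and the helper names inside and bin_counts_rows are hypothetical. Points exactly on an outline may be treated differently than by the patch test.

```python
import numpy as np


def inside(X, Y, center, width, height):
    """Boolean mask: points strictly inside an axis-aligned ellipse."""
    cx, cy = center
    return ((X - cx) / (width / 2)) ** 2 + ((Y - cy) / (height / 2)) ** 2 < 1


def bin_counts_rows(X, Y):
    """Return an (n_rows, 4) array with the number of points per bin and row."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    E = inside(X, Y, (0, 60), 160, 130)       # big oval
    E_l = inside(X, Y, (-60, 60), 70, 110)    # closed left ellipse
    E_r = inside(X, Y, (60, 60), 70, 110)     # closed right ellipse
    masks = [E & E_l, E & (X < 0) & ~E_l, E & (X > 0) & ~E_r, E & E_r]
    # Sum each mask along the row axis, then put the four bins side by side.
    return np.stack([m.sum(axis=1) for m in masks], axis=1)


X_row = [24, 15, 71, 72, 6, 13, 77, 52, 52, 62, 46, 43, 31, 35, 41]
Y_row = [94, 61, 76, 83, 69, 86, 78, 57, 45, 94, 82, 74, 56, 70, 94]
counts = bin_counts_rows([X_row], [Y_row])
print(counts)
```

Column k of the result corresponds to Bin k+1, so each row of the output carries the same information as one of the dictionaries printed by the loop.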

EDIT3: To return not the number of points in each bin, but a list of four arrays containing the x,y coordinates of the points in each bin, use the following:

X = [24,15,71,72,6,13,77,52,52,62,46,43,31,35,41]  
Y = [94,61,76,83,69,86,78,57,45,94,82,74,56,70,94]      

def bin_points(X, Y):
    X = np.array(X)
    Y = np.array(Y)
    E = Oval.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_l = ov1.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_r = ov2.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    L = X < 0
    R = X > 0
    bp1 = np.array([X[E & E_l], Y[E & E_l]]).T
    bp2 = np.array([X[E & L & ~E_l], Y[E & L & ~E_l]]).T
    bp3 = np.array([X[E & R & ~E_r], Y[E & R & ~E_r]]).T
    bp4 = np.array([X[E & E_r], Y[E & E_r]]).T
    return [bp1, bp2, bp3, bp4]

print(bin_points(X, Y))
[array([], shape=(0, 2), dtype=int32), array([], shape=(0, 2), dtype=int32), array([[24, 94],
       [15, 61],
       [ 6, 69],
       [13, 86]]), array([[71, 76],
       [72, 83],
       [52, 57],
       [52, 45],
       [62, 94],
       [46, 82],
       [43, 74],
       [31, 56],
       [35, 70],
       [41, 94]])]

...and again, to apply this to large 2D arrays, just iterate over them:

np.random.seed(42)
X = np.random.randint(-100, 100, size=(100, 10))
Y = np.random.randint(-40, 140, size=(100, 10))

bincol = ['r', 'g', 'b', 'y', 'k']

for xr, yr in zip(X, Y):
    for i, binned_points in enumerate(bin_points(xr, yr)):
        ax.scatter(*binned_points.T, c=bincol[i], marker='o' if i<4 else 'x')
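If the row-by-row list format requested in the question is wanted instead of a plot, the grouping can be sketched as follows. This is an illustration under assumptions, not the code above: it uses the closed-form ellipse inequality rather than the patches, leaves x = 0 unassigned as the functions above do, and the names point_bin and rows_to_bins are made up here.

```python
def point_bin(px, py):
    """Bin 1-4 via the closed-form ellipse test, or None if unassigned."""
    def inside(cx, cy, width, height):
        return ((px - cx) / (width / 2)) ** 2 + ((py - cy) / (height / 2)) ** 2 < 1

    if not inside(0, 60, 160, 130):    # outside the big oval
        return None
    if inside(-60, 60, 70, 110):       # closed left ellipse
        return 1
    if inside(60, 60, 70, 110):        # closed right ellipse
        return 4
    if px < 0:
        return 2
    if px > 0:
        return 3
    return None                        # x == 0 stays unassigned


def rows_to_bins(X_rows, Y_rows):
    """For every row, return four lists of (x, y) tuples, one per bin."""
    result = []
    for xr, yr in zip(X_rows, Y_rows):
        bins = [[], [], [], []]
        for px, py in zip(xr, yr):
            b = point_bin(px, py)
            if b is not None:
                bins[b - 1].append((px, py))
        result.append(bins)
    return result


X_rows = [[24, 15, 71, 72, 6, 13, 77, 52, 52, 62, 46, 43, 31, 35, 41]]
Y_rows = [[94, 61, 76, 83, 69, 86, 78, 57, 45, 94, 82, 74, 56, 70, 94]]
binned = rows_to_bins(X_rows, Y_rows)
for i, bins in enumerate(binned):
    print('Row {} = {}'.format(i, bins))
```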

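【Discussion】:

Thanks @SpghttCd. I was hoping to produce the same output as in my question. The input data I used is just one row embedded in a larger dataset. I iterate over each row to count the number of points in each bin.
IIUC you need a function that takes X and Y and just returns the number of points in each bin. Please have a look at my edit.
Thanks for the update, although you had already answered the first question sufficiently. I'm happy to accept, but I would still like to produce the expected output. If I have a larger dataset with many rows, your code returns TypeError: only length-1 arrays can be converted to Python scalars. I'll accept and post another question with updated input data
1. I don't understand what you did to get this error. Applying my function to multiple rows of X and Y coordinates is just a matter of iterating over them - see my EDIT2. (BTW: for comparison purposes I added a seed before the random functions and let the random numbers cover the full area of your bins.) 2. I don't understand the desired result [(x,y)],[(x,y)],[(x,y)],[(x,y)]. What are x and y here? Is this a list of "Bin x contains y points"? Then the result of my function is already correct, but you want a list instead of a dict...? Please be more precise.
Maybe I now have what you want. I just read the first paragraph again and changed the function to return the actual coordinates for each bin, not just their counts. See EDIT3.

【Solution 2】:

Here is my version, which classifies the points via ellipses. Since the OP uses simple geometric shapes, this can be tested with a simple formula, i.e. without "asking" the patches. I generalized it to n arcs, with the small drawback that the bin numbering does not run from left to right, but that can be handled elsewhere. The output is of the type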

[ [ [x,y], [x,y],...], ... ]

i.e. a list of x,y pairs for each bin. Here, however, the numbering runs from -3 to 3, with 0 being outside.

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

def in_ellipse( xy, x0y0ab):
    x, y = xy
    x0, y0 = x0y0ab[0]
    a = x0y0ab[1]/2.  ## as the list of ellipses takes width and not semi axis
    b = x0y0ab[2]/2.
    return ( x - x0 )**2 / a**2+ ( y - y0 )**2 / b**2 < 1

def sort_into_bins( xy, mainE, eList ):
    binCntr = 0
    xyA = (np.abs(xy[0]),xy[1]) ## all positive
    if in_ellipse( xyA, mainE ):
        binCntr +=1
        for ell in eList:
            if in_ellipse( xyA, ell ):
                break
            binCntr +=1
    binCntr=np.copysign( binCntr, xy[0] )
    return int( binCntr )

X = 200 * np.random.random(150) - 100
Y = 140 * np.random.random(150) - 70 + 60

fig, ax = plt.subplots()
ax.set_xlim(-100,100)
ax.set_ylim(-40,140)
ax.grid(False)


BIN_23_X = 0 
mainEllipse = [ np.array([0, 60]), 160, 130 ]
allEllipses = [ [ np.array([60,60]), 70., 110. ], [ np.array([60,60]), 100, 160 ]  ]

Halfway = mpl.lines.Line2D((BIN_23_X,BIN_23_X), (0,125), color = '#808080', lw = 1.5, alpha = 0.8, zorder = 1)
Oval = mpl.patches.Ellipse( mainEllipse[0], mainEllipse[1], mainEllipse[2], lw = 3, edgecolor = '#808080', facecolor = '#808080', alpha = 0.2)
ax.add_patch(Oval)
ax.add_line(Halfway)

for ell in allEllipses:
    arc =  mpl.patches.Arc( ell[0] , ell[1], ell[2], angle = 0,  color = '#808080', lw = 2, linestyle=':')
    ax.add_patch( arc )
    arc =  mpl.patches.Arc( ell[0] * np.array([ -1, 1 ]), ell[1], ell[2], angle = 0,  color = '#808080', lw = 2, linestyle=':')
    ax.add_patch( arc )

binDict = dict()
for x,y in zip(X,Y):
    binDict[( x,y)]=sort_into_bins( (x,y), mainEllipse, allEllipses )

rowEval=[]
for s in range(-3,4):
    rowEval+=[[]]
for key, val in binDict.items():
    rowEval[ val + 3 ]+=[key]

for s in range(-3,4):
    plt.scatter( *zip( *rowEval[ s + 3 ] ) )

plt.show()

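This displays:

Note that I made use of the symmetry about x=0. If the ellipses are shifted with respect to x, the code has to be modified slightly. Also note that the order in which the ellipses are provided matters!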
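The signed numbering can be turned into a left-to-right index with a small lookup. The helper below is a hypothetical sketch (left_to_right and n_levels are not part of the answer's code). It assumes the ellipses are listed innermost first, as in allEllipses, so that bin ±1 is the region farthest from x=0 and ±n_levels is the one next to the middle line, with n_levels = len(allEllipses) + 1.

```python
def left_to_right(signed_bin, n_levels):
    """Map a signed bin (-n_levels..n_levels, 0 = outside) to 1..2*n_levels.

    0 is kept as 0, meaning "outside the main ellipse".
    """
    if signed_bin == 0:
        return 0
    if signed_bin < 0:
        # Left half: -1 is leftmost, -n_levels touches the middle line.
        return -signed_bin
    # Right half: n_levels touches the middle line, 1 is rightmost.
    return 2 * n_levels + 1 - signed_bin


# With two entries in allEllipses there are three levels per side.
order = [left_to_right(k, 3) for k in (-1, -2, -3, 3, 2, 1)]
print(order)
```

Applied to the values stored in binDict, this yields indices that increase from the left edge of the main ellipse to the right edge.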

【Discussion】: