如何选择部分密集数据集的均匀分布子集？答案

【问题标题】：How to select a uniformly distributed subset of a partially dense dataset?如何选择部分密集数据集的均匀分布子集？
【发布时间】：2016-04-15 13:22:30
【问题描述】：

P 是一个 n*d 矩阵，包含n d 维样本。 P 在某些区域的密度是其他区域的数倍。我想选择P 的一个子集，其中任何一对样本之间的距离都大于d0，并且我需要它遍布整个区域。所有样本都具有相同的优先级，无需优化任何内容（例如覆盖区域或成对距离之和）。

这是一个这样做的示例代码，但它真的很慢。我需要一个更高效的代码，因为我需要多次调用它。

%% generating sample data
n_4 = 1000; n_2 = n_4*2;n = n_4*4;
x1=[ randn(n_4, 1)*10+30; randn(n_4, 1)*3 + 60];
y1=[ randn(n_4, 1)*5 + 35; randn(n_4, 1)*20 + 80 ];
x2 = rand(n_2, 1)*(max(x1)-min(x1)) + min(x1);
y2 = rand(n_2, 1)*(max(y1)-min(y1)) + min(y1);
P = [x1,y1;x2, y2];
%% eliminating close ones
tic
d0 = 1.5;
D = pdist2(P, P);D(1:n+1:end) = inf;
E = zeros(n, 1); % eliminated ones
for i=1:n-1
    if ~E(i)
        CloseOnes = (D(i,:)<d0) & ((1:n)>i) & (~E');
        E(CloseOnes) = 1;
    end
end
P2 = P(~E, :);
toc
%% plotting samples
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;

编辑：子集应该有多大？

正如j_random_hacker 在 cmets 中指出的那样，如果我们不对所选样本的数量进行限制，可以说P(1, :) 是最快的答案。它巧妙地显示了标题的不连贯性！但我认为当前的标题更好地描述了目的。因此，让我们定义一个约束：“如果可能，请尝试选择 m 样本”。现在有了m=n 的隐含假设，我们可以获得最大可能的子集。正如我之前提到的，一种更快的方法优于找到最佳答案的方法。

【问题讨论】：

请注意，在pdist2(P, P) 中计算距离会占用一半以上的时间，因此即使完全优化循环，总执行时间也只会减少一半。不过很有趣的问题。
我没有看到样本中应该有（或至少）一些给定数量的点的任何约束，所以：为什么不随机选择任何一个点，并称之为你的样本? （解读为：考虑一个适当的约束或惩罚函数，并将其添加到您的问题中。）
@zelanix 高度依赖d0 的值。较小的d0s 会使循环变慢。
我的意思是，没有什么能阻止“选择一个点”成为您问题的有效答案正如目前所说的。由于这可能不是对您有用的答案，因此您需要通过添加一些禁止将其作为答案的约束来更新您的问题。
@j_random_hacker 感谢您的通知，我编辑了问题。

标签： algorithm matlab

【解决方案1】：

一遍又一遍地寻找最近的点表明了一种针对空间搜索进行了优化的不同数据结构。我建议使用 delaunay 三角剖分。

下面的解决方案是“近似的”，因为它可能会删除比严格必要的更多的点。我正在批处理所有计算并在每次迭代中删除导致距离过长的 all 点，并且在许多情况下，删除一个点可能会删除同一迭代中稍后出现的边缘。如果这很重要，可以进一步处理边缘列表以避免重复，甚至找到要删除的点，这会影响最大的距离。

这很快。

dt = delaunayTriangulation(P(:,1), P(:,2));
d0 = 1.5;

while 1
    edge = edges(dt);  % vertex ids in pairs

    % Lookup the actual locations of each point and reorganize
    pwise = reshape(dt.Points(edge.', :), 2, size(edge,1), 2);
    % Compute length of each edge
    difference = pwise(1,:,:) - pwise(2,:,:);
    edge_lengths = sqrt(difference(1,:,1).^2 + difference(1,:,2).^2);

    % Find edges less than minimum length
    idx = find(edge_lengths < d0);
    if(isempty(idx))
        break;
    end

    % pick first vertex of each too-short edge for deletion
    % This could be smarter to avoid overdeleting
    points_to_delete = unique(edge(idx, 1));

    % remove them.  triangulation auto-updates
    dt.Points(points_to_delete, :) = [];

    % repeat until no edge is too short
end

P2 = dt.Points;

【讨论】：

那很快 - 我喜欢 :) 我不知道 delaunay 三角测量
嗯，我刚刚意识到最初的请求是针对 N-d 数据集的。三角剖分应该适用于 3d 数据集（但我没有尝试过），但它肯定不适用于 N>3。

【解决方案2】：

您没有指定要选择的点数。这对问题至关重要。

我看不出有什么方法可以优化您的方法。

假设欧几里得距离作为距离度量是可接受的，以下实现在仅选择少量点时要快得多，即使在尝试使用“所有”有效点的子集时也更快（请注意，找到可能的最大值点数很难）。

%%
figure;
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;

d0 = 1.5;

m_range = linspace(1, 2000, 100);
m_time = NaN(size(m_range));

for m_i = 1:length(m_range);
    m = m_range(m_i)

    a = tic;
    % Test points in random order.
    r = randperm(n);
    r_i = 1;

    S = false(n, 1); % selected ones
    for i=1:m
        found = false;

        while ~found
            j = r(r_i);
            r_i = r_i + 1;
            if r_i > n
                % We have tried all points. Nothing else can be valid.
                break;
            end
            if sum(S) == 0
                % This is the first point.
                found = true;
            else
                % Get the points already selected
                P_selected = P(S, :);
                % Exclude points >= d0 along either axis - they cannot have
                % a Euclidean distance less than d0.
                P_valid = (abs(P_selected(:, 1) - P(j, 1)) < d0) & (abs(P_selected(:, 2) - P(j, 2)) < d0);
                if sum(P_valid) == 0
                    % There are no points that can be < d0.
                    found = true;
                else
                    % Implement Euclidean distance explicitly rather than
                    % using pdist - this makes a large difference to
                    % timing.
                    found = min(sqrt(sum((P_selected(P_valid, :) - repmat(P(j, :), sum(P_valid), 1)) .^ 2, 2))) >= d0;
                end
            end
        end
        if found
            % We found a valid point - select it.
            S(j) = true;
        else
            % Nothing found, so we must have exhausted all points.
            break;
        end
    end
    P2 = P(S, :);
    m_time(m_i) = toc(a);
    subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
    drawnow;
end
%%
figure
plot(m_range, m_time);
hold on;
plot(m_range([1 end]), ones(2, 1) * original_time);
hold off;

其中original_time 是您的方法所花费的时间。这给出了以下时间，其中红线是您的方法，蓝色是我的方法，沿 x 轴选择的点数。请注意，当已选择满足标准的“所有”点时，该行会变平。

正如您在评论中所说，性能高度依赖于d0 的值。事实上，随着d0的减少，上面的方法在性能上似乎有了更大的提升（这是针对d0=0.1的）：

但是请注意，这也取决于其他因素，例如您的数据分布。此方法利用数据集的特定属性，并通过过滤掉计算欧几里得距离毫无意义的点来减少昂贵的计算次数。这对于选择较少的点特别有效，并且对于较小的d0 实际上更快，因为数据集中符合标准的点较少（因此所需的欧几里得距离计算较少）。此类问题的最佳解决方案通常取决于所使用的确切数据集。

还要注意，在我上面的代码中，手动计算欧几里得距离比调用pdist 快得多。 Matlab 内置函数的灵活性和通用性在简单情况下往往不利于性能。

【讨论】：