在 MATLAB 中查找子集的索引答案

【问题标题】：find indices of subsets in MATLAB在 MATLAB 中查找子集的索引
【发布时间】：2018-05-11 22:00:12
【问题描述】：

这个问题是由非常具体的组合优化问题引起的，其中搜索空间被定义为向量未排序的具有多重性的离散值集的置换子集的空间。

我正在寻找能够以下列方式找到子集索引的有效（足够快、矢量化或任何其他更聪明的解决方案）函数：

t = [1 1 3 2 2 2 3 ]

是所有可能值的未排序向量，包括其多重性。

item = [2 3 1; 2 1 2; 3 1 1; 1 3 3]

是向量 t 的置换子集的列表。

我需要找到对应于向量 t 的子集项的对应索引列表。因此，对于上述示例，我们有：

item =

     2     3     1
     2     1     2
     3     1     1
     1     3     3

t =

     1     1     3     2     2     2     3

ind = item2ind(item,t)
ind =

     4     3     1
     4     1     5
     3     1     2
     1     3     7

所以，对于 item = [2 3 1]，我们得到 ind = [4 3 1]，这意味着：

item 处的第一个值“2”对应于位置“4”上 t 处的第一个值“2”， item 处的第二个值“3”对应于位置“3”上 t 处的第一个值“3”，并且 item 处的第三个值“1”对应于位置“1”上 t 处的第一个值“1”。

在 item =[ 2 1 2] 的情况下，我们得到 ind = [4 1 5]，这意味着：

item 处的第一个值“2”对应于位置“4”上 t 处的第一个值“2”， item 处的第二个值“1”对应于位置“1”上 t 处的第一个值“1”，并且 item 处的第三个值“2”对应于位置“5”上 t 处的第二个（!!!）值“1”。

对于

item = [1 1 1]

不存在任何解，因为向量t只包含两个“1”。

我当前版本的函数“item2ind”是非常简单的串行代码，可以通过将“for”循环更改为“parfor”循环来简单地并行化：

function ind = item2ind(item,t)
[nlp,N] = size(item);
ind = zeros(nlp,N);
for i = 1:nlp
    auxitem = item(i,:);
    auxt = t;
    for j = 1:N
        I = find(auxitem(j) == auxt,1,'first');
        if ~isempty(I)
            auxt(I) = 0;
            ind(i,j) = I;
        else
            error('Incompatible content of item and t.');
        end
    end
end
end

但我需要一些更聪明的东西……而且更快:)

较大输入数据的测试用例：

t = 1:10;  % 10 unique values at vector t
t = repmat(t,1,5); % unsorted vector t with multiplicity of all unique values 5
nlp = 100000; % number of item rows
[~,p] = sort(rand(nlp,length(t)),2); % 100000 random permutations
item = t(p);  % transform permutations to items
item = item(:,1:30); % transform item to shorter subset
tic;ind = item2ind(item,t);toc % runing and timing of the original function
tic;ind_ = item2ind_new(item,t);toc % runing and timing of the new function
isequal(ind,ind_) % comparison of solutions

【问题讨论】：

我还没有找到一个明确的解决方案，但是[~,~,ib]=intersect(item(ii,:),t,'stable'); 非常接近你想要的，但是它不明白重复的值不是同一个项目。
@AnderBiguri 为问题文本添加一些简要说明...

标签： matlab performance vectorization

【解决方案1】：

为了实现代码向量化，我假设错误情况不会出现。它应该首先被丢弃，我将在下面介绍一个简单的过程。

方法首先，我们计算t中所有元素的索引：

t = t(:);
mct = max(accumarray(t,1));
G = accumarray(t,1:length(t),[],@(x) {sort(x)});
G = cellfun(@(x) padarray(x.',[0 mct-length(x)],0,'post'), G, 'UniformOutput', false);
G = vertcat(G{:});

解释：将输入放入列向量形状后，我们使用accumarray 计算t 中每个可能值的最大出现次数。现在，我们形成所有数字的所有索引的数组。它形成一个元胞数组，因为每个值的出现次数可能不同。为了形成一个矩阵，我们将每个数组独立地填充到最大长度（命名mct）。然后我们可以将元胞数组转换为矩阵。在这一步，我们有：

G =
     1    11    21    31    41
     2    12    22    32    42
     3    13    23    33    43
     4    14    24    34    44
     5    15    25    35    45
     6    16    26    36    46
     7    17    27    37    47
     8    18    28    38    48
     9    19    29    39    49
    10    20    30    40    50

现在，我们处理item。为此，让我们弄清楚如何创建向量内值出现的累积和。例如，如果我有：

A = [1 1 3 2 2 2 3];

那我想得到：

B = [1 2 1 1 2 3 2];

由于隐式扩展，我们可以在一行中得到它：

B = diag(cumsum(A==A'));

就这么简单。语法A==A' 扩展为一个矩阵，其中每个元素都是A(i)==A(j)。只在一个维度上求累积和并取对角线给我们带来了很好的结果：每一列都在一个值上的累积出现和。

要在item 2-D 中使用这个技巧，我们应该使用 3D 数组。让我们打电话给m=size(item,1) 和n=size(item,2)。所以：

C = cumsum(reshape(item,m,1,n)==item,3);

是所有累积事件的（大）3D 矩阵。最后一件事是选择沿维度 2 和 3 对角线上的列：

ia = C(sub2ind(size(C),repelem((1:m).',1,n),repelem(1:n,m,1),repelem(1:n,m,1)));

现在，有了所有这些矩阵，索引很容易：

ind = G(sub2ind(size(G),item,ia));

最后，让我们回顾一下函数的代码：

function ind = item2ind_new(item,t)
t = t(:);
[m,n] = size(item);
mct = max(accumarray(t,1));
G = accumarray(t,1:length(t),[],@(x) {sort(x)});
G = cellfun(@(x) padarray(x.',[0 mct-length(x)],0,'post'), G, 'UniformOutput', false);
G = vertcat(G{:});
C = cumsum(reshape(item,m,1,n)==item,3);
ia = C(sub2ind(size(C),repelem((1:m).',1,n),repelem(1:n,m,1),repelem(1:n,m,1)));
ind = G(sub2ind(size(G),item,ia));

结果在旧的 4 核上运行提供的脚本，我得到：

Elapsed time is 4.317914 seconds.
Elapsed time is 0.556803 seconds.
ans =
  logical
   1

加速是实质性的（超过 8 倍）以及内存消耗（使用矩阵 C）。我想这部分可以做一些改进以节省更多内存。

编辑为了生成 ia，此过程可能会导致内存丢失。一种节省内存的方法是使用for循环直接生成这个数组：

ia = zeros(size(item));
for i=unique(t(:)).'
    ia = ia+cumsum(item==i, 2).*(item==i);
end

在所有情况下，当您拥有ia 时，与t 相比，很容易测试item 是否有错误：

any(ind(:)==0)

然后是一个简单的解决项目错误（作为掩码）的解决方案

min(ind,[],2)==0

【讨论】：

natura 中矢量化能力的一个很好的例子。谢谢！！！关于数组 C 的内存消耗。是的，在我最坏的情况下，C 数组的内存消耗 = prod(size(item))*size(item,2)*8 ~ 1GB，这几乎是不可接受的。您知道如何在不降低性能的情况下减少 C 的内存消耗吗？将 C 转换为 uint8 类呢？
这个怎么样？ C = cumsum(uint8(reshape(item,m,1,n)==item),3);
@michal 当然只使用 uint8 会减少内存消耗。目前我还没有找到另一种生成矩阵ia 而不转向 3D 矩阵的方法。一种方法是使用cumsum(unique(item)==reshape(item,m,1,n))，但我现在没有足够的时间来找到获取索引的正确方法。在您的情况下，内存减少了 5 倍。
如果您能帮助我找到更好的解决方案，我将非常高兴……现在我必须说，我没有看到任何正确的方法来使用您的提示。
@michal 放弃使用unique 的想法。 IMO，没有太多额外内存的唯一方法是使用 for 循环：ia = zeros(size(item)); for i=1:max(t); ia = ia+cumsum(item==i, 2).*(item==i); end。