How to get a kind of "maximum" in a matrix, efficiently - Answers

【Question title】: How to get a kind of "maximum" in a matrix, efficiently
【Posted】: 2019-07-28 18:03:53
【Question description】:

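I have the following problem: I have a matrix, opened with the pandas module, in which each cell holds a number between -1 and 1. What I want to find is the maximum "possible" value in each row that is not also the maximum value of another row.

If, for example, two rows have their maximum in the same column, I compare the two values and keep the larger one; for the row whose maximum is smaller, I take its second-largest value instead (and repeat the same analysis over and over).

To explain myself better, consider my code: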

import numpy as np
import pandas as pd

matrix = pd.read_csv("matrix.csv", index_col=0)
# this matrix has an id (or name) for each column
# ... and the first column has the id of each row (read here as the index)
results = pd.DataFrame(np.empty((len(matrix), 3), dtype=object), columns=['id1', 'id2', 'max_pos'])

cols = list(matrix.columns)  # the value columns (the row ids are in the index)

again = True  # set whenever a cell was discarded and another pass is needed
while again:
    again = False
    for i in range(len(matrix)):
        max_column = cols[0]
        for col in cols[1:]:
            if matrix[max_column].iloc[i] < matrix[col].iloc[i]:
                max_column = col
        results.loc[i, 'id1'] = matrix.index[i]
        results.loc[i, 'id2'] = max_column
        results.loc[i, 'max_pos'] = matrix[max_column].iloc[i]

    for i in range(len(results)):  # now I will check if two or more rows have the same max column
        for ii in range(len(results)):
            # if two id1 have their max in the same column, I keep the one with the biggest
            # ... max value and change the other to "-1" to iterate again
            if (results.loc[i, 'id2'] == results.loc[ii, 'id2']) and (results.loc[i, 'max_pos'] < results.loc[ii, 'max_pos']):
                matrix.iloc[i, matrix.columns.get_loc(results.loc[i, 'id2'])] = -1
                again = True

For example:

#consider
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[4, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})

   a  b  c  d
0  1  4  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1

#at the first iteration I will have the following result

0  b  4 # this means that the row 0 has its maximum at column 'b' and its value is 4
1  b  5
2  a  5
3  c  2

#the problem is that column b is the maximum of row 0 and 1, but I know that the maximum of row 1 is bigger than row 0, so I take the second maximum of row 0, then:

0  c  3
1  b  5
2  a  5
3  c  2

#now I solved the problem for row 0 and 1, but I have that the column c is the maximum of row 0 and 3, so I compare them and take the second maximum in row 3 

0  c  3
1  b  5
2  a  5
3  d  1

#now I'm done. If two rows have their maximum in the same column and the values are also equal, nothing happens and I keep those values.

#what if the matrix would be 
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[5, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})

   a  b  c  d
0  1  5  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1

#then, at the first iteration the result will be:

0  b  5
1  b  5
2  a  5
3  c  2

#then, given that the max value of row 0 and 1 is at the same column, I should compare the maximum values
# ... but in this case the values are the same (both are 5), this would be the end of iterating 
# ... because I can't choose between row 0 and 1 and the other rows have their maximum at different columns...
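The procedure walked through in these two examples can be sketched with NumPy array operations in place of the nested Python loops. This is only a minimal sketch under stated assumptions, not the poster's code: the function name `resolve_row_maxima` is hypothetical, discarded cells are marked with `-inf` on a float copy instead of `-1`, and the input is assumed to contain no NaN values.

```python
import numpy as np
import pandas as pd


def resolve_row_maxima(df):
    """Hypothetical helper: give each row its largest value whose column is
    not claimed by a row with a strictly larger value (ties are left alone)."""
    values = df.to_numpy(dtype=float, copy=True)  # work on a float copy
    rows = np.arange(values.shape[0])
    while True:
        col = values.argmax(axis=1)   # column of each row's current maximum
        val = values[rows, col]       # the current maximum of each row
        # best[c] = largest maximum among the rows currently claiming column c
        best = np.full(values.shape[1], -np.inf)
        np.maximum.at(best, col, val)
        # A row loses if another row claims the same column with a larger value.
        # Rows with nothing left (val == -inf) are skipped so the loop terminates.
        losers = (val < best[col]) & np.isfinite(val)
        if not losers.any():
            break
        values[rows[losers], col[losers]] = -np.inf  # discard the cell and retry
    return pd.DataFrame({
        'id1': df.index.to_numpy(),
        'id2': df.columns.to_numpy()[col],
        'max_pos': val,
    })


first = pd.DataFrame({'a': [1, 2, 5, 0], 'b': [4, 5, 1, 0],
                      'c': [3, 3, 4, 2], 'd': [1, 0, 0, 1]})
print(resolve_row_maxima(first))
```

Each pass still scans the whole matrix, and the float copy of a 50,000x50,000 matrix is itself very large, so this only removes the Python-level loops; it does not change how many passes the procedure needs.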

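For a 100x100 matrix, for example, this code works perfectly well for me. But if the matrix grows to 50,000x50,000, the code takes a very long time to finish. I know my code is probably the least efficient way to do this, but I don't know how else to approach the problem.

I have been reading about threads in Python, which might help, but starting 50,000 threads would not, because my computer does not use any more CPU. I also tried using functions such as .max(), but I could not get the column of the maximum and compare it with the other maxima...

If anyone can help me with some advice on making this more efficient, I would be very grateful.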

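【Comments】:

What I wanted to find is the maximum "possible" value in a row that is also not the maximum value in another row. - What happens when multiple rows do have the same maximum value?
For example, if column 3 holds the maximum of both row 2 and row 4, I compare the values of row 2 and row 4. Suppose the value in row 2 is larger than the one in row 4; in that case I leave the maximum with row 2 and take the second-largest value of row 4 (so a different column becomes its maximum). If the values in row 2 and row 4 are the same, I do nothing.
@hllspwn That is a very confusing comment. Could you edit your question and show us what you mean with something reproducible? Create a very basic table, such as pd.DataFrame({'a':[1, 2, 4], 'b':[4, 5, 1], 'c':[3, 3, 4]}), and tell us what you would like to see from it.
Done. Sorry if I could not explain myself before; I hope the example helps. Thanks for the suggestion, @Matt W.
No need to apologize! Thanks for the clarification, it makes much more sense now. I'll take a look.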

Tags: python-3.x pandas performance matrix iteration

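【Solution 1】:

I need more information about this. What are you trying to accomplish here?

This will help you make some progress, but to fully achieve what you are doing I need more context.

We'll import numpy, random, and Counter from collections: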

import numpy as np
import random 
from collections import Counter

We'll create a random 50k x 50k matrix of integers between -10M and +10M:

mat = np.random.randint(-10000000,10000000,(50000,50000))
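Before running that line it may be worth estimating how much memory such a matrix needs. The sketch below only multiplies the shape by the element size and allocates nothing; `matrix_bytes` is a hypothetical helper, and the assumption that `mat` uses 64-bit integers depends on the platform and the NumPy version.

```python
import numpy as np


def matrix_bytes(rows, cols, dtype):
    """Hypothetical helper: bytes needed for a dense rows x cols array."""
    return rows * cols * np.dtype(dtype).itemsize


for dtype in (np.int64, np.int32, np.float32):
    size = matrix_bytes(50_000, 50_000, dtype)
    print(f"{np.dtype(dtype).name}: {size / 1e9:.0f} GB")
# int64: 20 GB
# int32: 10 GB
# float32: 10 GB
```

Values between -10M and +10M fit in a 32-bit integer, and the values between -1 and 1 from the question could be stored as 32-bit floats if that precision is acceptable.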

Now, to get the maximum of each row, we can use the following list comprehension:

maximums = [max(mat[x,:]) for x in range(len(mat))]
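The same row maxima can also be computed with NumPy's own reductions, which additionally give the column position of each maximum (the part the question says `.max()` alone did not provide). This is a small sketch on a stand-in array; the names `small`, `row_max` and `row_argmax` are illustrative and not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)              # seeded only to make the sketch repeatable
small = rng.integers(-10, 10, size=(5, 4))  # small stand-in for the 50k x 50k `mat`

row_max = small.max(axis=1)        # maximum value of each row
row_argmax = small.argmax(axis=1)  # column position of that maximum (first one on ties)

# Sanity checks: same values as the list comprehension, and each position
# really points at the row maximum.
assert row_max.tolist() == [max(small[x, :]) for x in range(len(small))]
assert (small[np.arange(len(small)), row_argmax] == row_max).all()
```

For a pandas DataFrame, `DataFrame.max(axis=1)` and `DataFrame.idxmax(axis=1)` play the same roles and return the column labels instead of positions.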

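Now we want to find out which of these are not the maximum of any other row. We can use Counter on our list of maximums to find how many times each one occurs. Counter returns a dictionary-like counter object with the maximums as keys and their number of occurrences as values. We then use a dictionary comprehension that keeps only the entries whose value is == 1. This gives us the maximums that occur exactly once. We use the .keys() method to grab the numbers themselves and turn them into a list.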

c = Counter(maximums)
{9999117: 15,
9998584: 2,
9998352: 2,
9999226: 22,
9999697: 59,
9999534: 32,
9998775: 8,
9999288: 18,
9998956: 9,
9998119: 1,
...}

k = list( {x: c[x] for x in c if c[x] == 1}.keys() )

[9998253,
 9998139,
 9998091,
 9997788,
 9998166,
 9998552,
 9997711,
 9998230,
 9998000,
...]

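Finally, we can use the following list comprehension to iterate over the original list of maximums and get the indices of the rows where those values occur.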

indices = [i for i, x in enumerate(maximums) if x in k]
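Because `k` is a list, every `x in k` test scans it from the start. A possible variation, sketched here on a toy list, keeps the unique maximums in a set, or does the whole step with `np.unique` and `np.isin`. The toy data and the name `indices_np` are illustrative only.

```python
import numpy as np
from collections import Counter

maximums = [7, 9, 7, 4, 9, 5]   # toy row maxima, standing in for the real list

# Same steps as in this answer, but with a set for constant-time lookups.
c = Counter(maximums)
k = {x for x, n in c.items() if n == 1}
indices = [i for i, x in enumerate(maximums) if x in k]

# NumPy equivalent: count each distinct value, then test membership in bulk.
arr = np.asarray(maximums)
uniq, counts = np.unique(arr, return_counts=True)
indices_np = np.flatnonzero(np.isin(arr, uniq[counts == 1]))

print(indices)              # [3, 5]
print(indices_np.tolist())  # [3, 5]
```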

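Depending on what else you want to do, we can go from here.

It's not the fastest program, but finding the maximums, the Counter, and the indices took 182 seconds on an already loaded 50,000 x 50,000 matrix.

【Comments】: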