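【Posted】:2019-07-28 18:03:53
【Question】:
I have the following problem: I have a matrix, loaded with the pandas module, in which every cell holds a number between -1 and 1. What I want to find is the maximum "possible" value in each row that is not also the maximum value of another row.
If, for example, two rows have their maximum in the same column, I compare the two values and keep the larger one; for the row whose maximum is smaller, I take its second-largest value instead (and repeat the same analysis over and over).
To explain myself better, consider my code: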
import numpy as np
import pandas as pd

matrix = pd.read_csv("matrix.csv")
# this matrix has an id (or name) for each column
# ... and the first column has the id of each row
results = pd.DataFrame(np.empty((len(matrix), 3), dtype=object), columns=['id1', 'id2', 'max_pos'])
l = len(matrix.columns)  # number of columns
next = 1
while next == 1:
    next = 0
    for i in range(0, len(matrix)):
        max_column = str(1)  # start at the first data column; column '0' holds the row id
        for j in range(1, l):  # 1 because the first column is an id
            if matrix[max_column][i] < matrix[str(j)][i]:
                max_column = str(j)
        results.loc[i, 'id1'] = str(i)  # I could put here also matrix['0'][i]
        results.loc[i, 'id2'] = max_column
        results.loc[i, 'max_pos'] = matrix[max_column][i]
    for i in range(0, len(results)):  # now I will check if two or more rows have the same max column
        for ii in range(0, len(results)):
            # if two id1 have their max in the same column, I keep the one with the biggest
            # ... max value and change the other to "-1" to iterate again
            if (results['id2'][i] == results['id2'][ii]) and (results['max_pos'][i] < results['max_pos'][ii]):
                matrix.loc[i, results['id2'][i]] = -1
                next = 1
As an example:
#consider
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[4, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})
   a  b  c  d
0  1  4  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1
#at the first iteration I will have the following result
0 b 4 # this means that the row 0 has its maximum at column 'b' and its value is 4
1 b 5
2 a 5
3 c 2
#the problem is that column b holds the maximum of both row 0 and row 1, but I know that the maximum of row 1 is bigger than that of row 0, so I take the second maximum of row 0, then:
0 c 3
1 b 5
2 a 5
3 c 2
#now I solved the problem for rows 0 and 1, but column c now holds the maximum of both row 0 and row 3, so I compare them and take the second maximum of row 3
0 c 3
1 b 5
2 a 5
3 d 1
#now I'm done. If two rows have their maximum in the same column and the values are also equal, nothing happens and I keep those values.
#what if the matrix were
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[5, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})
   a  b  c  d
0  1  5  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1
#then, at the first iteration the result will be:
0 b 5
1 b 5
2 a 5
3 c 2
#then, given that the max value of row 0 and 1 is at the same column, I should compare the maximum values
# ... but in this case the values are the same (both are 5), this would be the end of iterating
# ... because I can't choose between row 0 and 1 and the other rows have their maximum at different columns...
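The procedure walked through in the two examples can be sketched with NumPy so that each pass handles all rows at once instead of comparing rows pair by pair. This is only a minimal sketch, not a tested solution for the 50,000x50,000 case: the function name `resolve_row_maxima` is made up for illustration, the frame is assumed to contain only numeric columns (no id column), and discarded cells are masked with `-inf` rather than `-1`. The number of passes still depends on the data.

```python
import numpy as np
import pandas as pd


def resolve_row_maxima(df):
    """Hypothetical helper: repeat the 'loser takes its next maximum' rule until stable."""
    values = df.to_numpy(dtype=float).copy()
    rows = np.arange(values.shape[0])
    while True:
        col = values.argmax(axis=1)      # column position of each row's current maximum
        val = values[rows, col]          # the current maximum of each row
        # Largest value claimed in each column by any row.
        best = np.full(values.shape[1], -np.inf)
        np.maximum.at(best, col, val)
        # A row loses when another row claims the same column with a larger value.
        # Rows with no candidates left (val == -inf) are skipped so the loop ends.
        losers = (val < best[col]) & np.isfinite(val)
        if not losers.any():
            break
        values[rows[losers], col[losers]] = -np.inf  # discard the contested cell
    return pd.DataFrame({'id1': df.index, 'id2': df.columns[col], 'max_pos': val})


df = pd.DataFrame({'a': [1, 2, 5, 0], 'b': [4, 5, 1, 0],
                   'c': [3, 3, 4, 2], 'd': [1, 0, 0, 1]})
result = resolve_row_maxima(df)
print(result)
```

For the first example matrix this ends with rows 0 to 3 assigned to columns c, b, a and d, matching the walkthrough. Ties are left alone, as in the second example, because only strictly smaller values are discarded.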
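If I have, say, a 100x100 matrix, this code works fine for me. But if the matrix grows to 50,000x50,000, the code takes a very long time to finish. I know my code is probably the least efficient way to do this, but I don't know how else to approach the problem.
I have been reading about threads in Python, which might help, but starting 50,000 threads does not help because my computer does not use any more CPU. I also tried using functions such as .max(), but I could not get the column of the maximum and compare it with the other maxima...
If anyone could help me or give me some advice on making this more efficient, I would be very grateful.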
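On the `.max()` point: pandas also provides `DataFrame.idxmax`, which returns the column label of each row's maximum, so the value and its column can be obtained together. The snippet below is a small sketch of one pass of the comparison on the example matrix; the names `summary`, `col_best` and `loses` are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 5, 0], 'b': [4, 5, 1, 0],
                   'c': [3, 3, 4, 2], 'd': [1, 0, 0, 1]})

# Column label and value of each row's maximum.
summary = pd.DataFrame({'id2': df.idxmax(axis=1), 'max_pos': df.max(axis=1)})

# Largest maximum among the rows that share the same column.
col_best = summary.groupby('id2')['max_pos'].transform('max')

# True for rows that must fall back to their next-largest value.
summary['loses'] = summary['max_pos'] < col_best
print(summary)
```

Here only row 0 is flagged, because rows 0 and 1 both peak in column b and row 1 has the larger value.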
【Comments】:
-
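What I wanted to find is the maximum "possible" value in a row that is also not the maximum value in another row. - What happens when multiple rows do have the same maximum value?
-
For example, if column 3 holds the maximum of both row 2 and row 4, I compare the values of row 2 and row 4. Suppose the value in row 2 is larger than the one in row 4: in that case I leave the maximum with row 2 and take the second maximum of row 4 (so another column becomes its maximum). If the values of row 2 and row 4 are equal, I do nothing.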
-
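@hllspwn That is a very confusing comment. Could you put it in your question, with something reproducible that shows us what you mean? Create a very basic table such as
pd.DataFrame({'a':[1, 2, 4], 'b':[4, 5, 1], 'c':[3, 3, 4]}) and tell us what you would like to see from it.
-
Done. I'm sorry if I couldn't explain myself before; I hope the example helps. Thanks for the suggestion, @Matt W.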
-
No need to apologize! Thanks for the clarification, it makes much more sense now. I'll take a look.
Tags: python-3.x pandas performance matrix iteration