【Posted】:2022-01-16 10:19:09
【Question】:
I have the following DataFrame:
Bacteria Year Feature_Vector
XYRT23 1968 [0 1 0 0 1 1 0 0 0 0 1 1]
XXQY12 1968 [0 1 0 0 0 1 1 0 0 0 1 1]
RTy11R 1968 [1 0 0 0 0 1 1 0 1 1 1 1]
XYRT23 1969 [0 1 0 0 1 1 0 0 0 0 1 1]
XXQY12 1969 [0 0 1 0 0 1 1 0 0 0 1 1]
RTy11R 1969 [1 0 0 0 0 1 1 1 1 1 1 1]
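I want to compute the pairwise Hamming distance for every pair within a given year and save the result to a new DataFrame. Example: (Note: the Hamming distance values below are made up, and I do not actually need the Pair column.)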
Pair Year HammingDistance
XYRT23 - XXQY12 1968 0.24
XYRT23 - RTy11R 1968 0.33
XXQY12 - RTy11R 1968 0.29
XYRT23 - XXQY12 1969 0.22
XYRT23 - RTy11R 1969 0.34
XXQY12 - RTy11R 1969 0.28
I tried something like this:
import itertools
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# One list of feature vectors per year
my_list = df.groupby('Year')['Feature_Vector'].apply(list)
total_list = []
for lists in my_list:
    i = 0
    results = []
    for x in itertools.combinations(lists, 2):
        vec1, vec2 = np.array(x[0]), np.array(x[1])
        # Keep only the positions where the two vectors are not both 0
        keepers = np.where(np.logical_not((np.vstack((vec1, vec2)) == 0).all(axis=0)))
        vecx = vec1[keepers].reshape(1, -1)
        vecy = vec2[keepers].reshape(1, -1)
        try:
            score = pairwise_distances(vecx, vecy, metric="hamming")
            print(score)
        except ValueError:
            # Raised, for example, when no positions are kept
            score = 0
        results.append(score)
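One possible way to build the table described above is sketched here. It is not taken from the thread: it assumes that every Feature_Vector cell is a NumPy array of equal length, it rebuilds the sample data from the question, and the function name pairwise_hamming_by_year is made up for illustration. It uses scipy.spatial.distance.pdist, which returns the distances in the same (0,1), (0,2), (1,2), ... order as itertools.combinations. Unlike the attempt above, it does not drop the positions where both vectors are 0, so it yields the plain Hamming distance over all positions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Sample data copied from the question; the cells are assumed to be NumPy arrays.
df = pd.DataFrame({
    "Bacteria": ["XYRT23", "XXQY12", "RTy11R"] * 2,
    "Year": [1968] * 3 + [1969] * 3,
    "Feature_Vector": [
        np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]),
        np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]),
        np.array([1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]),
        np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]),
        np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]),
        np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]),
    ],
})


def pairwise_hamming_by_year(frame):
    """Hypothetical helper: one row per unordered pair of bacteria per year."""
    rows = []
    for year, group in frame.groupby("Year"):
        names = group["Bacteria"].to_list()
        # Stack the per-row vectors into an (n_samples, n_features) matrix.
        matrix = np.vstack(group["Feature_Vector"].to_list())
        # Condensed distances, ordered like combinations(range(n), 2).
        distances = pdist(matrix, metric="hamming")
        for (first, second), distance in zip(combinations(names, 2), distances):
            rows.append({
                "Pair": f"{first} - {second}",
                "Year": year,
                "HammingDistance": distance,
            })
    return pd.DataFrame(rows, columns=["Pair", "Year", "HammingDistance"])


result = pairwise_hamming_by_year(df)
print(result)
```

If the Pair column is not needed, it can simply be left out of the row dictionary. To reproduce the "ignore positions where both are 0" behaviour of the attempt above, the vectors would have to be masked pair by pair before the distance is computed, which pdist does not do on its own.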
【Comments】:
-
Could you please post the result of print(df['Feature_Vector'][0], type(df['Feature_Vector'][0]))? It is hard to tell what it actually is (a string, a list, a NumPy array, etc.)
-
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
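The printed value alone does not settle the type, because a NumPy array and a string such as "[0 1 0]" look the same when printed. A minimal sketch of normalizing the column before computing distances follows; to_vector is a hypothetical helper, and the string format it handles (square brackets, whitespace-separated digits) is an assumption based on the output shown above.

```python
import numpy as np


def to_vector(value):
    """Hypothetical helper: coerce one Feature_Vector cell to a 1-D integer array."""
    if isinstance(value, str):
        # Assumed format "[0 1 0 1]": drop the brackets, split on whitespace.
        return np.array([int(token) for token in value.strip("[]").split()], dtype=int)
    # Lists, tuples and arrays are converted directly.
    return np.asarray(value, dtype=int)


as_string = "[0 1 0 0 1 1 0 0 0 0 1 1]"
as_list = [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]

print(type(to_vector(as_string)), to_vector(as_string))
print(type(to_vector(as_list)), to_vector(as_list))
```

Applied to the whole column, this would be something like df['Feature_Vector'].apply(to_vector).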
Tags: python pandas scikit-learn