【发布时间】:2020-02-12 20:24:22
【问题描述】:
我试图对以下数据应用一种热编码。但我对输出感到困惑。在应用一种热编码之前,数据的形状是 (5,10),在应用一种热编码之后,数据的形状是 (5,20)。但是每个字母都会被编码为一个 4 元素。因此,在应用一种热编码后,形状应该是 (5, 40) 而不是 (5,10)。我该如何解决这个问题?
X = [[‘A’, ‘G’, ‘T’, ‘G’, ‘T’, ‘C’, ‘T’, ‘A’, ‘A’, ‘C’],
[‘A’, ‘G’, ‘T’, ‘G’, ‘T’, ‘C’, ‘T’, ‘A’, ‘A’, ‘C’],
[‘G’, ‘C’, ‘C’, ‘A’, ‘C’, ‘T’, ‘C’, ‘G’, ‘G’, ‘T’],
[‘G’, ‘C’, ‘C’, ‘A’, ‘C’, ‘T’, ‘C’, ‘G’, ‘G’, ‘T’],
[‘G’, ‘C’, ‘C’, ‘A’, ‘C’, ‘T’, ‘C’, ‘G’, ‘G’, ‘T’]]
Y = np.array(X)
print('Shape of numpy array', Y.shape)
# one hot encoding
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(Y)
print(onehot_encoded)
print('Shape of one hot encoding', onehot_encoded.shape)
Output:
Shape of numpy array (5, 10)
[[1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0.]
[1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0.]
[0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]
[0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]
[0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]]
Shape of one hot encoding (5, 20)
【问题讨论】:
-
我不确定输出应该是 (5,40)。您有 5 个示例,每个示例由 10 个字母组成。如果每个字母变成 4 个值的列表,那么您将有 10 个 4 值列表的 5 个示例。那将是一个 (5, 10, 4) 数组(即三维)
标签: python scikit-learn numpy-ndarray one-hot-encoding