【Title】: Change in the dimension (shape) because of np.hstack on tf.keras.preprocessing.text.Tokenizer.texts_to_sequences
【Posted】: 2020-02-15 10:11:37
【Question】:

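I applied np.hstack to the output of tensorflow.keras.preprocessing.text.Tokenizer.texts_to_sequences for both the training labels and the validation (test) labels.

Surprisingly, for the training labels the number of elements after np.hstack differs from the number before it. For the validation labels, however, the shape is the same before and after applying np.hstack to the output of texts_to_sequences.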

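Here is the link to the Google Colab notebook, to make the error easy to reproduce.

The complete code to reproduce the error is given below (in case the link does not work):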

!pip install tensorflow==2.1

# For Preprocessing the Text => To Tokenize the Text
from tensorflow.keras.preprocessing.text import Tokenizer
# If the Two Articles are of different length, pad_sequences will make the length equal
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Package for performing Numerical Operations
import numpy as np

Unique_Labels_List = ['India', 'USA', 'Australia', 'Germany', 'Bhutan', 'Nepal', 'New Zealand', 'Israel', 'Canada', 'France', 'Ireland', 'Poland', 'Egypt', 'Greece', 'China', 'Spain', 'Mexico']


Train_Labels = Unique_Labels_List[0:14]
#print('Train Labels = {}'.format(Train_Labels))

Val_Labels =  Unique_Labels_List[14:]
#print('Val_Labels = {}'.format(Val_Labels))

No_Of_Train_Items = [248, 200, 200, 218, 248, 248, 249, 247, 220, 200, 200, 211, 224, 209]
No_Val_Items = [212, 200, 219]

T_L = []
for Each_Label, Item in zip(Train_Labels, No_Of_Train_Items):
    T_L.append([Each_Label] * Item)

T_L = [item for sublist in T_L for item in sublist]

V_L = []
for Each_Label, Item in zip(Val_Labels, No_Val_Items):
    V_L.append([Each_Label] * Item)

V_L = [item for sublist in V_L for item in sublist]


len(T_L)

len(V_L)

label_tokenizer = Tokenizer()

label_tokenizer.fit_on_texts(Unique_Labels_List)

# Since it should be a Numpy Array, we should Convert the Sequences to Numpy Array, for both Training and 
# Test Labels

training_label_list = label_tokenizer.texts_to_sequences(T_L)

validation_label_list = label_tokenizer.texts_to_sequences(V_L)

training_label_seq = np.hstack(training_label_list)

validation_label_seq = np.hstack(validation_label_list)

print('Actual Number of Train Labels before np.hstack are {}'.format(len(training_label_list)))
print('Change in the Number of Train Labels because of np.hstack are {}'.format(len(training_label_seq)))

print('-------------------------------------------------------------------------------------------------------')

print('Actual Number of Validation Labels before np.hstack are {}'.format(len(validation_label_list)))
print('However, there is no change in the Number of Validation Labels because of np.hstack {}'.format(len(validation_label_seq)))

Thank you in advance.

【Comments】:

    Tags: python numpy tensorflow keras tensorflow2.0


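    【Solution 1】:

    This happens because some of the lists in training_label_list contain more than one value. You can verify this with sorted(training_label_list, key=lambda x: len(x), reverse=True), which puts the longest sequences first.

    The cause is that label_tokenizer splits New Zealand into two separate words, as its vocabulary shows: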
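As an alternative to sorting, the lengths can simply be counted. The sketch below uses a small hand-written stand-in for `training_label_list` (the values are illustrative, not real tokenizer output) and tallies how many sequences have each length.

```python
from collections import Counter

# Hypothetical stand-in for the output of texts_to_sequences:
# four single-word labels and two two-word labels.
training_label_list = [[1], [1], [1], [7, 8], [7, 8], [9]]

# How many sequences have length 1, length 2, ...
length_counts = Counter(len(seq) for seq in training_label_list)
print(length_counts)

# Total number of integers, which is the length np.hstack would produce.
total_tokens = sum(len(seq) for seq in training_label_list)
print(len(training_label_list), total_tokens)
```

Any key other than 1 in `length_counts` means that some label was split into several tokens, so the flattened array will be longer than the list of labels.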

    >>> label_tokenizer.index_word
    {1: 'india',
     2: 'usa',
     3: 'australia',
     4: 'germany',
     5: 'bhutan',
     6: 'nepal',
     7: 'new',
     8: 'zealand',
     9: 'israel',
     10: 'canada',
     11: 'france',
     12: 'ireland',
     13: 'poland',
     14: 'egypt',
     15: 'greece',
     16: 'china',
     17: 'spain',
     18: 'mexico'}
    

    Look at indices 7 and 8.

    【Comments】:

    • Also, you could mention that not only a space but any special character causes this behavior. Thanks again.
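The counts from the question can be reproduced without TensorFlow. This is a minimal sketch under stated assumptions: `to_sequence` is a hypothetical helper that only approximates `texts_to_sequences` by lowercasing and splitting on whitespace, and the vocabulary is copied from the `index_word` output above.

```python
import numpy as np

# Vocabulary in the same order as the index_word output (indices start at 1).
words = ['india', 'usa', 'australia', 'germany', 'bhutan', 'nepal', 'new',
         'zealand', 'israel', 'canada', 'france', 'ireland', 'poland',
         'egypt', 'greece', 'china', 'spain', 'mexico']
word_index = {word: i for i, word in enumerate(words, start=1)}


def to_sequence(label):
    """Hypothetical stand-in for texts_to_sequences on a single label."""
    return [word_index[w] for w in label.lower().split()]


train_labels = ['India', 'USA', 'Australia', 'Germany', 'Bhutan', 'Nepal',
                'New Zealand', 'Israel', 'Canada', 'France', 'Ireland',
                'Poland', 'Egypt', 'Greece']
train_counts = [248, 200, 200, 218, 248, 248, 249, 247, 220, 200, 200,
                211, 224, 209]

# Repeat each label as many times as in the question.
T_L = [label for label, n in zip(train_labels, train_counts) for _ in range(n)]

training_label_list = [to_sequence(label) for label in T_L]
training_label_seq = np.hstack(training_label_list)

print(len(training_label_list))  # 3122 labels
print(len(training_label_seq))   # 3371 integers
```

The difference of 249 is exactly the number of 'New Zealand' entries, each of which contributes two integers (7 and 8) instead of one. The validation labels (China, Spain, Mexico) are all single words, which is why their count does not change.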
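To illustrate the comment above, here is a rough pure-Python approximation of how the tokenizer splits text. It assumes the `FILTERS` string below matches the default `filters` argument of `Tokenizer`; `approx_split` is a hypothetical helper, not part of Keras. The dictionary mapping at the end is one possible workaround that was not proposed in the thread: it assigns a single integer to each whole label, so the number of labels cannot change.

```python
# Assumed to match the default `filters` argument of Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
FILTER_TABLE = str.maketrans({ch: ' ' for ch in FILTERS})


def approx_split(text):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    return text.lower().translate(FILTER_TABLE).split()


# A space, a hyphen and an underscore all produce two tokens.
for label in ['New Zealand', 'New-Zealand', 'New_Zealand', 'NewZealand']:
    print(label, approx_split(label))

# Possible workaround: map each whole label to exactly one integer.
unique_labels = ['India', 'New Zealand', 'China']
label_to_id = {label: i for i, label in enumerate(unique_labels, start=1)}

labels = ['New Zealand', 'India', 'New Zealand', 'China']
label_ids = [label_to_id[label] for label in labels]
print(label_ids)
```

With this mapping, `len(label_ids)` always equals `len(labels)`, whatever characters the label names contain.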