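【Question title】: sparse matrix length is ambiguous
【Posted】: 2019-08-01 21:31:49
【Question】:

I'm very new to machine learning, so this question may sound silly. I'm following a tutorial on Text Classification, but I've run into an error I don't know how to solve.

Here is my code (basically what's found in the tutorial):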

import pandas as pd

filepath_dict = {'yelp':   'data/yelp_labelled.txt',
              'amazon': 'data/amazon_cells_labelled.txt',
              'imdb':   'data/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0:4])


from sklearn.feature_extraction.text import CountVectorizer

df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)


from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)

from keras.models import Sequential
from keras import layers

input_dim = X_train.shape[1] 

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
            optimizer='adam', 
            metrics=['accuracy'])
model.summary()

history = model.fit(X_train, y_train,
                    nb_epoch=100,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)

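When I reach the last line, I get an error:

"TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]"

I guess I have to perform some kind of transformation on the data I'm using, or I should try to load the data in a different way. I've already tried searching Stack Overflow but, being new to all of this, I couldn't find anything helpful.

How can I make this work? Ideally, I'd like not only the solution but also a brief explanation of why the error occurs and what the solution does to fix it.

Thanks!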

【Comments】:

  • Which line raises the error?
  • What is the output of type(X_train), type(y_train)?
  • @SergeyBushmanov type(X_train): ; type(y_train):
  • You could try converting the sparse matrix to a dense one with X_train.todense() and passing the result to model.fit()
  • @FrancoPiccolo The last one: history = model.fit(X_train, y_train, nb_epoch=100, verbose=False, validation_data=(X_test, y_test), batch_size=10)

Tags: python keras scikit-learn sklearn-pandas


【Answer 1】:

You're running into this because your X_train and X_test are of type <class 'scipy.sparse.csr.csr_matrix'>, whereas your model expects a numpy array.

Try casting them to dense and you're good to go:

X_train = X_train.todense()
X_test = X_test.todense()
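
To illustrate why the error appears, here is a minimal sketch that assumes only numpy and scipy are installed. The tiny matrix is made up and merely stands in for the CountVectorizer output. It shows that calling len() on a SciPy sparse matrix raises a TypeError (the exact wording of the message may differ between SciPy versions), that shape[0] gives the row count instead, and that toarray() returns a plain numpy.ndarray. Note that todense() returns a numpy.matrix, so toarray() may be the closer match when a regular array is expected.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the CountVectorizer output (values are made up).
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3]]))

try:
    len(X)  # sparse matrices deliberately refuse len()
    len_error = None
except TypeError as err:
    len_error = str(err)

n_rows = X.shape[0]    # the unambiguous way to get the number of samples
X_dense = X.toarray()  # plain numpy.ndarray holding the same values

print(len_error)
print(n_rows, type(X_dense).__name__, X_dense.shape)
```

Any library code that calls len() on its input will hit the same exception when handed a sparse matrix, which is presumably what happens inside model.fit() in the setup from the question.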

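【Comments】:

    【Answer 2】:

    Not sure why you're getting the error with this script.

    The following script runs fine, even with a sparse matrix. You could give it a try on your machine.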

    sentences = ['i want to test this','let us try this',
                 'would this work','how about this',
                 'even this','this should not work']
    y= [0,0,0,0,0,1]
    from sklearn.model_selection import train_test_split
    sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)
    
    
    from sklearn.feature_extraction.text import CountVectorizer
    
    
    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)
    
    from keras.models import Sequential
    from keras import layers
    
    input_dim = X_train.shape[1] 
    
    model = Sequential()
    model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy', 
                optimizer='adam', 
                metrics=['accuracy'])
    model.summary()
    
    model.fit(X_train, y_train,
                            epochs=2,
                            verbose=True,
                            validation_data=(X_test, y_test),
                            batch_size=2)
    
    # Output:
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_5 (Dense)              (None, 10)                110       
    _________________________________________________________________
    dense_6 (Dense)              (None, 1)                 11        
    =================================================================
    Total params: 121
    Trainable params: 121
    Non-trainable params: 0
    _________________________________________________________________
    Train on 4 samples, validate on 2 samples
    Epoch 1/2
    4/4 [==============================] - 1s 169ms/step - loss: 0.7570 - acc: 0.2500 - val_loss: 0.6358 - val_acc: 1.0000
    Epoch 2/2
    4/4 [==============================] - 0s 3ms/step - loss: 0.7509 - acc: 0.2500 - val_loss: 0.6328 - val_acc: 1.0000
    

    【Comments】:

    • Right, I think so too. todense is an expensive operation. Updating the packages might be a better solution.
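
If converting the whole matrix at once is too costly, one option is to densify only one batch at a time. The following is a minimal sketch under stated assumptions, not a definitive implementation: it uses only numpy and scipy, `dense_batches` is a hypothetical helper name (not part of Keras or scikit-learn), and the data is randomly generated for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def dense_batches(X_sparse, y, batch_size):
    """Yield (dense_X, y) batches, converting only one slice at a time."""
    n_samples = X_sparse.shape[0]
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        # Only this row slice is turned into a dense numpy.ndarray.
        yield X_sparse[start:stop].toarray(), y[start:stop]

# Made-up data: 25 samples, 50 features, roughly 10% non-zero entries.
rng = np.random.default_rng(0)
X = csr_matrix((rng.random((25, 50)) < 0.1).astype(np.float32))
y = np.arange(25) % 2

batches = list(dense_batches(X, y, batch_size=10))
print([xb.shape for xb, _ in batches])
```

How such a generator is best fed into training depends on the Keras version in use, so that wiring is left out here.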