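【Question title】: I have a data type problem in a text classification problem
【Posted】: 2023-04-05 00:47:01
【Question】:

I want to build a deep learning classifier that predicts the outcome of Kickstarter campaigns. I have a problem in the model part that I cannot solve.

My code: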

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras import layers


df = pd.read_csv('../input/kickstarter-campaigns-dataset/kickstarter_data_full.csv')

df_X = [] # for x class
df_y = [] # for labels

for i in range(len(df)):
    tmp = str(df['blurb'][i]) + " " + str(df['goal'][i]) + " " + str(df['pledged'][i]) + " " + str(df['country'][i]) + " " + str(df['currency'][i]) + " " + str(df['category'][i]) + " " + str(df['spotlight'][i])  
    df_X.append(tmp)
    df_y.append(str(df['SuccessfulBool'][i]))

X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.25, random_state=1000)
vectorizer = CountVectorizer()
vectorizer.fit(X_train)

X_train = vectorizer.transform(X_train)
X_test  = vectorizer.transform(X_test)

input_dim = X_train.shape[1]

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

history = model.fit(X_train, y_train,
                     epochs=100,
                     verbose=False,
                     validation_data=(X_test, y_test),
                     batch_size=10)

At this point I get ValueError: Failed to find data adapter that can handle input: , ( containing values of types {""})

I tried to fix it with np.asarray:

X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)

I get this ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type csr_matrix).

So I used this:

np.asarray(X_train).astype(np.float32)
np.asarray(y_train).astype(np.float32)
np.asarray(X_test).astype(np.float32)
np.asarray(y_test).astype(np.float32)

But I get ValueError: setting an array element with a sequence.

Then I tried this:

X_train = np.expand_dims(X_train, -1)
y_train   = np.expand_dims(y_train, -1)
X_test = np.expand_dims(X_test, -1)
y_test   = np.expand_dims(y_test, -1)

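But I keep getting the same error at the history = model.fit(...) step. ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type csr_matrix).

I am working with the Kickstarter campaigns dataset on Kaggle. https://www.kaggle.com/sripaadsrinivasan/kickstarter-campaigns-dataset

I do not know enough about NLP. I searched and tried to fix this, but I could not. This is for my homework. Can you help me solve this problem?

df_X and df_y are the same size, and the output looks like this: x y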

【Comments】:

    Tags: python numpy deep-learning nlp text-classification


    【Solution 1】:

    You need to add an embedding layer at the top of the NN (as its first layer) to vectorize the words. Like this:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from keras.preprocessing.text import one_hot
    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Sequential
    from keras import layers
    
    
    df = pd.read_csv('../input/kickstarter-campaigns-dataset/kickstarter_data_full.csv')
    
    df_X = [] # input text, one string per campaign
    df_y = [] # labels
    
    # Join the selected columns into a single text string per row
    for i in range(len(df)):
        tmp = str(df['blurb'][i]) + " " + str(df['goal'][i]) + " " + str(df['pledged'][i]) + " " + str(df['country'][i]) + " " + str(df['currency'][i]) + " " + str(df['category'][i]) + " " + str(df['spotlight'][i])  
        df_X.append(tmp)
        df_y.append(str(df['SuccessfulBool'][i]))
    
    # Hash every word to an integer id below vocab_size
    vocab_size = 1000
    encoded_docs = [one_hot(d, vocab_size) for d in df_X]
    # Pad (or truncate) every document to max_length ids so the input is a dense integer array
    max_length = 20
    padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
    # Labels become an integer column vector of shape (n_samples, 1)
    X_train, X_test, y_train, y_test = train_test_split(padded_docs, np.array(df_y)[:, None].astype(int), test_size=0.25, random_state=1000)
    model = Sequential()
    # The embedding layer maps each word id to a 100-dimensional vector
    model.add(layers.Embedding(vocab_size, 100, input_length=max_length))
    model.add(layers.Flatten())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    model.fit(X_train, y_train, 
              epochs=50, 
              verbose=1,
              validation_data=(X_test, y_test),
              batch_size=10)
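
To make the preprocessing step above more concrete, here is a minimal pure-Python/NumPy sketch of what hashing words to integer ids and post-padding does. It is not Keras's implementation: `encode` and `pad_post` are hypothetical helper names, MD5 is an assumed stand-in hash chosen only so that ids are stable across runs, and the tokenizer here just lowercases and splits on whitespace. The sample sentences are made up.

```python
import hashlib
import numpy as np

def encode(text, vocab_size):
    """Map each lowercased whitespace token to an id in [1, vocab_size - 1]."""
    ids = []
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        # Id 0 is left free so that it can be used for padding.
        ids.append(1 + int(digest, 16) % (vocab_size - 1))
    return ids

def pad_post(sequences, max_length):
    """Right-pad with zeros; over-long sequences keep their last max_length ids."""
    out = np.zeros((len(sequences), max_length), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-max_length:]
        out[row, :len(kept)] = kept
    return out

# Made-up sample documents: one longer than max_length, one shorter.
docs = ["A smart water bottle 5000 US USD", "Board game"]
padded = pad_post([encode(d, vocab_size=1000) for d in docs], max_length=5)
print(padded.shape)  # (2, 5)
```

Every row now has the same length and an integer dtype, which is the kind of input an embedding layer expects. Because ids come from a hash, different words can share the same id, and this becomes more likely the smaller `vocab_size` is.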
    
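
On the errors in the question itself: the following is a minimal sketch, assuming SciPy is installed, of why wrapping the CountVectorizer output in np.asarray did not help. A small hand-built `csr_matrix` stands in for the vectorizer output, and the label list is made-up sample data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for vectorizer.transform(...): a 2 x 3 sparse count matrix.
X_sparse = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))
y_list = ["1", "0"]  # labels stored as strings, as in the question

# np.asarray does not densify a sparse matrix; it wraps the whole
# matrix in a 0-dimensional array of dtype object.
wrapped = np.asarray(X_sparse)
print(wrapped.dtype, wrapped.shape)  # object ()

# toarray() returns a regular dense array that can be cast to float32,
# and the string labels can be cast to numbers the same way.
X_dense = X_sparse.toarray().astype(np.float32)
y = np.asarray(y_list).astype(np.float32)
print(X_dense.shape, y.dtype)
```

With dense float inputs and numeric labels, the arrays are in a form that can be converted to tensors. Note that densifying a bag-of-words matrix with a large vocabulary can use a lot of memory, so this is only practical when the matrix is small enough.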

    【Comments】:

    • Thank you very much. You are right. After adding the embedding layer as you showed, the problem was solved!