keras: could not convert string to float in model.fit

【Question title】: keras: could not convert string to float in model.fit
【Posted】: 2020-08-20 17:49:38
【Question】:

I have a data frame of DNA sequences like this:

Feature         Label
GCTAGATGACAGT   0
TTTTAAAACAG     1
TAGCTATACT      2    
TGGGGCAAAAAAAA  0
AATGTCG         3
AATGTCG         0
AATGTCG         1

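One column holds the DNA sequence, and the label can be 0, 1, 2, or 3 (the category of that DNA sequence). I want to develop a NN that predicts the probability of each sequence being classified into category 1, 2, or 3 (not 0; I don't care about 0). Each sequence can appear multiple times in the data frame, and a sequence may appear in several (or all) categories. So the output should look like this: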

GCTAGATGACAGT   (0.9,0.1,0.2)
TTTTAAAACAG     (0.7,0.6,0.3)
TAGCTATACT      (0.3,0.3,0.2)    
TGGGGCAAAAAAAA  (0.1,0.5,0.6)

The numbers in each tuple are the probabilities of finding the sequence in categories 1, 2, and 3.
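One possible way to represent such a target, sketched here as an assumption rather than as the asker's method, is a multi-label (multi-hot) vector per unique sequence: duplicate rows are merged, class 0 is dropped, and each of the classes 1, 2, and 3 gets its own 0/1 slot. The names `rows`, `CLASSES`, and `multi_hot_targets` are hypothetical, and the data simply mirrors the example table above.

```python
from collections import defaultdict

# Rows mirror the example table in the question: (sequence, label).
rows = [
    ("GCTAGATGACAGT", 0),
    ("TTTTAAAACAG", 1),
    ("TAGCTATACT", 2),
    ("TGGGGCAAAAAAAA", 0),
    ("AATGTCG", 3),
    ("AATGTCG", 0),
    ("AATGTCG", 1),
]

CLASSES = (1, 2, 3)  # class 0 is deliberately ignored


def multi_hot_targets(rows, classes=CLASSES):
    """Map each unique sequence to a 0/1 vector with one slot per class."""
    seen = defaultdict(set)
    for sequence, label in rows:
        seen[sequence].add(label)
    return {
        sequence: [1 if c in labels else 0 for c in classes]
        for sequence, labels in seen.items()
    }


targets = multi_hot_targets(rows)
print(targets["AATGTCG"])  # [1, 0, 1]
```

A target of this shape would pair naturally with three independent sigmoid outputs, which is what the last comment in the code below hints at.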

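I wrote this basic code to get started. You can see that I've commented out some of the trickier parts: I am trying to get a basic approach working and will then extend it gradually, but I've included everything so people can see the general idea I have in mind.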

# Split into input (X) and output (Y) variables
X = df.iloc[:,[0]].as_matrix() #as matrix due to this error: https://stackoverflow.com/questions/45479239/pandas-keyerror-not-in-index-when-training-a-keras-model
y = df.iloc[:,-1].as_matrix()
print(X[0:10])
print(y[0:10])


# Define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
kf = kfold.get_n_splits(X)
cvscores = []
for train, test in kfold.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]


# Pre-process the data
#    X_train = sequence.pad_sequences(X[train], maxlen=30) #based on 30 aa being max we're interested in
#    X_test = sequence.pad_sequences(X[test], maxlen=30) #based on 30 aa being max we're interested in




# Create model
    model = Sequential()
#   model.add(Embedding(3000, 32, input_length=30))
#   model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(n_timesteps, 1)))
    model.add(Dense(1, activation='sigmoid'))



# Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



# Monitor val accuracy and perform early stopping
#    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
#    mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)



# Fit the model
    model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)


# Evaluate the model
#    scores = model.evaluate(X_test, y_test, verbose=0)
#    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
#    cvscores.append(scores[1] * 100)
#print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))


#output a three sigmoid model, and plot accuracy and loss

The output first prints the sequences as expected (i.e. the print statements):

[['GCTAGATGACAGT']
 ['TTTTAAAACAG']
 ['TAGCTATACT']
 ['TGGGGCAAAAAAAA']
 ['AATGTCG']
 ['AATGTCG']
 ['AATGTCG']
 ['TTATATAAAAG']
 ['GCTGGGAG']
 ['TTTGCGTATAGATAGATAG']]
[0 1 2 0 3 0 1 2 2 0]

Then I get the error:

ValueError: could not convert string to float: 'XXX'

(where XXX is one of the sequences in the data set, but not one of the top 10 in the output above). Further up in the traceback, it points to the ValueError being raised on this line:

    model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)

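I did see this question, but I don't think my problem has the same root cause. Can someone explain why I am getting this? I wonder whether it is because I haven't told the model (or haven't told it correctly) that I am computing probabilities for sequences rather than handling categorical features?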

【Comments】:

Tags: python keras

【Solution 1】:

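As I can see from your print statements, you are feeding strings/text to your NN, which is not possible. You have to encode them as numbers. There are different ways to do this: you can one-hot encode the characters, or you can create a trainable embedding for each character.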
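The message itself can be reproduced without Keras. The snippet below is only an illustration: it assumes that a comparable string-to-float conversion is attempted somewhere inside `model.fit` when the input array holds raw strings, which is what the wording of the error suggests.

```python
# Python's float() cannot parse a DNA string, so it raises the same kind of
# ValueError that appears in the question's traceback.
sequence = "GCTAGATGACAGT"

try:
    float(sequence)
    message = None
except ValueError as err:
    message = str(err)

print(message)  # could not convert string to float: 'GCTAGATGACAGT'
```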

I suggest the Tokenizer from TF, which can help you with the numeric encoding of text sequences.
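As a rough, library-free sketch of the encoding step described above: each base is mapped to an integer, every sequence is truncated or left-padded with zeros to a fixed length, and the integers can optionally be expanded to one-hot rows. The names `ALPHABET`, `CHAR_TO_INDEX`, `encode`, and `one_hot` are hypothetical, the alphabet is assumed to be exactly A, C, G, T (any other character raises a `KeyError`), and `maxlen=30` is taken from the commented-out padding lines in the question. This is not the TF Tokenizer itself, only a plain-Python stand-in for the same idea.

```python
ALPHABET = "ACGT"
# Index 0 is reserved for padding, so the bases map to 1..4.
CHAR_TO_INDEX = {base: i + 1 for i, base in enumerate(ALPHABET)}


def encode(sequence, maxlen):
    """Integer-encode a sequence, then truncate or left-pad with 0 to maxlen."""
    ids = [CHAR_TO_INDEX[base] for base in sequence.upper()][-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


def one_hot(ids, depth=len(ALPHABET) + 1):
    """Turn each integer id into a one-hot row of length depth."""
    return [[1.0 if i == idx else 0.0 for i in range(depth)] for idx in ids]


sequences = ["GCTAGATGACAGT", "AATGTCG"]
encoded = [encode(s, maxlen=30) for s in sequences]
print(encoded[1][-7:])  # [1, 1, 4, 3, 4, 2, 3]
```

The integer form is what an embedding layer would consume, while the one-hot form could feed a dense or recurrent layer directly; in either case the nested lists would need to be converted to a numeric array before being passed to `model.fit`.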

【Comments】: