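【Posted】: 2018-02-08 14:56:45
【Question】:
I have a dataset with more than 17 million observations that I am trying to use to train a DNNRegressor model. However, training does not work at all. The loss is on the order of 10^15, which is terrible. I have been trying different things for several weeks, and no matter what I do, I cannot get the loss down.
For example, after training I ran a test prediction using one of the same observations that was used for training. The expected result is 140944.00, but the predicted result is -169532.5, which is absurd. There are no negative values in the training data at all, so I do not understand how it can be this far off.
Here is some sample training data: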
Amount     Contribution  ServiceType  Percentile  Time  Result
214871.00  3501.00       SM23         high        50    17807828.00
214871.00  3501.00       SM23         high        51    19216520.00
214871.00  3501.00       SM23         high        52    19676064.00
214871.00  3501.00       SM23         high        53    21038840.00
214871.00  3501.00       SM23         high        54    22248295.00
214871.00  3501.00       SM23         high        55    22412713.00
28006.00   83.00         SM0          i_low       0     28006.00
28006.00   83.00         SM0          i_low       1     28804.00
28006.00   83.00         SM0          i_low       2     30140.00
28006.00   83.00         SM0          i_low       3     31598.00
28006.00   83.00         SM0          i_low       4     33130.00
28006.00   83.00         SM0          i_low       5     34663.00
Here is my code:
# Imports were omitted in the original post; the ones below are assumed (TensorFlow 1.x).
import os

import tensorflow as tf
from tensorflow.contrib.learn import RunConfig
from tensorflow.python.framework import dtypes

# Build the checkpoint path portably instead of concatenating "\job".
MODEL_DIR = os.path.join(os.getcwd(), "job")

feature_columns = [
    tf.feature_column.numeric_column('Amount', dtype=dtypes.float32),
    tf.feature_column.numeric_column('Contribution', dtype=dtypes.float32),
    # Categorical features are mapped to 16-dimensional learned embeddings.
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'ServiceType',
            [
                'SM0', 'SM1', 'SM2', 'SM3',
                'SM4', 'SM5', 'SM6', 'SM7',
                'SM8', 'SM9', 'SM10', 'SM11',
                'SM12', 'SM13', 'SM14', 'SM15',
                'SM16', 'SM17', 'SM18', 'SM19',
                'SM20', 'SM21', 'SM22', 'SM23'
            ],
            dtype=dtypes.string
        ),
        dimension=16
    ),
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'Percentile',
            ['i_low', 'low', 'mid', 'high'],
            dtype=dtypes.string
        ),
        dimension=16
    ),
    tf.feature_column.numeric_column('Time', dtype=dtypes.int8)
]

model = tf.estimator.DNNRegressor(
    hidden_units=[64, 32],
    feature_columns=feature_columns,
    model_dir=MODEL_DIR,
    label_dimension=1,
    weight_column=None,
    optimizer='Adagrad',
    activation_fn=tf.nn.elu,
    dropout=None,
    input_layer_partitioner=None,
    config=RunConfig(
        master=None,
        num_cores=4,
        log_device_placement=False,
        gpu_memory_fraction=1,
        tf_random_seed=None,
        save_summary_steps=100,
        save_checkpoints_secs=0,
        save_checkpoints_steps=None,
        keep_checkpoint_max=5,
        keep_checkpoint_every_n_hours=10000,
        log_step_count_steps=100,
        evaluation_master='',
        model_dir=MODEL_DIR,
        session_config=None
    )
)

print('Training...')
model.train(input_fn=get_input_fn('train'), steps=100000)

print('Evaluating...')
model.evaluate(input_fn=get_input_fn('test'), steps=4000)

print('Predicting...')
prediction = model.predict(input_fn=get_input_fn('predict'))
print(list(prediction))
The input_fn is defined as follows:
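# Imports were omitted in the original post; tensorflow is assumed to be imported as tf.
import pandas as pd
from sklearn.model_selection import train_test_split


def split_input():
    # Read the tab-separated file and split it 80/20 into train and test sets.
    data = pd.read_csv('C:\\all_data.txt', sep='\t')
    x = data.drop('Result', axis=1)
    y = data.Result
    return train_test_split(x, y, test_size=0.2, random_state=123)


def get_input_fn(input_fn_type):
    # Note: the file is re-read and re-split on every call.
    train_x, test_x, train_y, test_y = split_input()
    if input_fn_type == 'train':
        # Repeat indefinitely; training length is controlled by `steps`.
        return tf.estimator.inputs.pandas_input_fn(
            x=train_x,
            y=train_y,
            num_epochs=None,
            shuffle=True
        )
    elif input_fn_type == 'test':
        return tf.estimator.inputs.pandas_input_fn(
            x=test_x,
            y=test_y,
            num_epochs=1,
            shuffle=False
        )
    elif input_fn_type == 'predict':
        # A single hand-written row used for the test prediction.
        return tf.estimator.inputs.pandas_input_fn(
            x=pd.DataFrame(
                {
                    'Amount': 52050.00,
                    'Contribution': 1394.00,
                    'ServiceType': 'SM0',
                    'Percentile': 'i_low',
                    'Time': 5
                },
                index=[0]
            ),
            num_epochs=1,
            shuffle=False
        )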
The output is as follows:
Training...
INFO:tensorflow:loss = 6.30944e+15, step = 1
INFO:tensorflow:global_step/sec: 457.091
INFO:tensorflow:loss = 3.28245e+15, step = 101 (0.219 sec)
INFO:tensorflow:global_step/sec: 533.271
INFO:tensorflow:loss = 2.65647e+15, step = 201 (0.188 sec)
INFO:tensorflow:global_step/sec: 533.274
...
INFO:tensorflow:loss = 1.06601e+15, step = 99701 (0.203 sec)
INFO:tensorflow:global_step/sec: 533.289
INFO:tensorflow:loss = 2.12652e+15, step = 99801 (0.188 sec)
INFO:tensorflow:global_step/sec: 533.273
INFO:tensorflow:loss = 1.31647e+15, step = 99901 (0.203 sec)
INFO:tensorflow:Saving checkpoints for 100000 into C:\projection_model\job\model.ckpt.
INFO:tensorflow:Loss for final step: 2.88956e+15.
Evaluating...
INFO:tensorflow:Evaluation [1/4000]
INFO:tensorflow:Evaluation [2/4000]
INFO:tensorflow:Evaluation [3/4000]
...
INFO:tensorflow:Evaluation [3998/4000]
INFO:tensorflow:Evaluation [3999/4000]
INFO:tensorflow:Evaluation [4000/4000]
INFO:tensorflow:Finished evaluation at 2017-08-30-19:04:03
INFO:tensorflow:Saving dict for global step 100000: average_loss = 1.37941e+13, global_step = 100000, loss = 1.76565e+15
Predicting...
[{'predictions': array([-169532.5], dtype=float32)}] # Should be somewhere around 140944.00
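For a sense of scale, the two numbers on the evaluation line can be converted into an error in the units of Result. This is a minimal sketch that assumes average_loss is the mean squared error per example and loss is the squared error summed over one batch; under that assumption their ratio should equal the batch size (128 is the pandas_input_fn default). Both points are assumptions about the estimator, not something stated in the post.

```python
import math

# Values copied from the evaluation line in the log.
average_loss = 1.37941e13   # assumed: mean squared error per example
batch_loss = 1.76565e15     # assumed: squared error summed over one batch

# If batch_loss is a per-batch sum, this ratio is the implied batch size.
implied_batch_size = batch_loss / average_loss

# Root-mean-square error, in the same units as Result.
rmse = math.sqrt(average_loss)

print(f"implied batch size: {implied_batch_size:.1f}")
print(f"RMSE: {rmse:,.0f}")
```

Under those assumptions the typical error is roughly 3.7 million per example. That is still large, but it is easier to judge against the sample Result values, which range from about 28 thousand to about 22 million.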
Why doesn't the model learn from the data? I have tried different regressors and input normalization, but nothing has worked.
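For readers wondering what "input normalization" could look like here, the following is a minimal sketch, not the code the asker ran. It z-scores the numeric columns and the label with pandas and shows how a value on the scaled label would be mapped back to the original units. The `standardize` helper and the small inline frame (a few rows from the sample data) are hypothetical, and whether scaling the label helps in this case is not established by the post.

```python
import pandas as pd

# A few rows taken from the sample data in the question (illustration only).
train = pd.DataFrame({
    "Amount": [214871.0, 214871.0, 28006.0, 28006.0],
    "Contribution": [3501.0, 3501.0, 83.0, 83.0],
    "Time": [50, 51, 0, 1],
    "Result": [17807828.0, 19216520.0, 28006.0, 28804.0],
})


def standardize(frame, columns):
    """Hypothetical helper: return a z-scored copy plus the stats used."""
    stats = {c: (frame[c].mean(), frame[c].std()) for c in columns}
    scaled = frame.copy()
    for c, (mean, std) in stats.items():
        scaled[c] = (frame[c] - mean) / std
    return scaled, stats


numeric = ["Amount", "Contribution", "Time", "Result"]
scaled_train, stats = standardize(train, numeric)

# A model trained on the scaled label predicts in scaled units;
# multiply by the label's std and add its mean to undo the scaling.
label_mean, label_std = stats["Result"]
scaled_prediction = scaled_train["Result"].iloc[0]  # stand-in for a model output
restored = scaled_prediction * label_std + label_mean
print(restored)
```

In practice the statistics would be computed on the training split only and reused unchanged for the test and prediction inputs.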
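【Comments】:
- A suggestion that should be quick to try: as a test, and only as a test, use every 10,000th data point, so that the dataset is smaller and troubleshooting is correspondingly faster.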
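The commenter's suggestion can be sketched with a strided pandas slice. This is one possible reading of "every 10,000th data point"; the `STEP` constant and the small synthetic frame standing in for the 17 million rows are assumptions made for illustration.

```python
import pandas as pd

# Synthetic stand-in for the full dataset (illustration only).
data = pd.DataFrame({"Time": range(100000), "Result": range(100000)})

STEP = 10000  # keep one row out of every 10,000

# Positional slice with a stride: rows 0, 10000, 20000, ...
subset = data.iloc[::STEP].reset_index(drop=True)

print(len(subset))
```

Because the sample rows look sorted by group, a strided slice still touches every part of the file; a random sample via `data.sample(frac=..., random_state=...)` would be an alternative.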
Tags: machine-learning tensorflow neural-network deep-learning regression