Multiple Linear Regression with TensorFlow

【Question title】: Multiple Linear Regression with TensorFlow
【Posted】: 2021-04-14 16:44:47
【Question】:

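I am trying to perform a multiple linear regression with TensorFlow and to compare the results with those of the statsmodels library.

I generated two random variables, X1 and X2 (so that anyone can reproduce this), to explain the Y variable. X2 is completely useless for this regression: it is just large-scale noise, so its coefficient should not be significant (p-value close to 1). In the end I should get a model that is basically y_data = alpha + (0.25)x1 + (0.00)x2 + error.

I tried to adapt this code to my randomly generated data, but unfortunately it does not work at all. Here is my attempt: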

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow import keras
import datetime


#generating variables:
np.random.seed(1)
lin_x = np.arange(0,200,2)
y_data = np.true_divide(lin_x,4)
n = np.shape(lin_x)
##adding noise:
rand1 = norm.rvs(loc=0,scale=5,size=n)
np.random.seed(2)
rand2 = norm.rvs(loc=0,scale=1000,size=n)
x1 = np.add(lin_x,rand1)
x2 = rand2

#creating the X matrix: beta = (X'X)^-1(X'y):
x_data = np.column_stack((x1,x2))
#adding ones vector for the intercept:
x_data = sm.add_constant(x_data)

#MLR with statsmodels:
mod = sm.OLS(y_data,x_data)
LinReg = mod.fit()
print(LinReg.summary())



#MLR with tensorflow:
normalizer = preprocessing.Normalization()
normalizer.adapt(x_data)

normalized_data = normalizer(x_data)

print(normalized_data)


model = tf.keras.Sequential([
  normalizer,
  layers.Dense(units=1)
])

model.compile(loss = tf.losses.MeanSquaredError(),
              optimizer = tf.keras.optimizers.SGD(
                learning_rate=0.06, momentum=0.0, nesterov=True, name="SGD",
              ))

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.summary()
print('--------------')
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
print('--------------')

x_data_tf = tf.convert_to_tensor(x_data)
y_data_tf = tf.convert_to_tensor(y_data)
model.fit(y_data_tf,x_data_tf, epochs=1000, callbacks=[tensorboard_callback])

weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]

print("TensorFlow results: ")
print("weigths: ", weights)
print("biases: ", biases)

print(LinReg.summary())

How can I obtain with TensorFlow the same coefficients that I get from the statsmodels library? Thanks.

【Comments】:

Tags: python tensorflow linear-regression

【Answer 1】:

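The key problems with your code are the following:

While it is necessary to add a column of ones to the feature matrix x_data before running the regression with statsmodels, this is not necessary with TensorFlow, because the Dense layer has its own bias term. This means you are passing 3 features to TensorFlow instead of 2, where the additional feature (the first column of x_data) is constant.
You normalize x_data after the constant column has already been added by x_data = sm.add_constant(x_data). Since a column of ones has zero variance, normalizing it gives a column of nan (because you divide by zero). This means the first of the 3 features you pass to TensorFlow is entirely missing (i.e. always nan).
statsmodels takes y first and then X as inputs, whereas TensorFlow takes X first and then y. This means you swapped the features and the target when running the regression in TensorFlow.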
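
The zero-variance point can be checked with plain NumPy. This is only a sketch of the arithmetic, assuming a textbook (x - mean) / std standardization; the data and variable names are made up for illustration, and depending on the TensorFlow version the Keras Normalization layer may guard against a zero variance differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two ordinary feature columns plus a leading column of ones,
# mimicking what sm.add_constant produces (illustrative data only)
features = rng.normal(size=(100, 2))
with_const = np.column_stack((np.ones(100), features))

mean = with_const.mean(axis=0)
std = with_const.std(axis=0)

# Textbook standardization: the constant column has std == 0, so it becomes 0 / 0
with np.errstate(divide="ignore", invalid="ignore"):
    standardized = (with_const - mean) / std

const_is_nan = bool(np.isnan(standardized[:, 0]).all())
others_finite = bool(np.isfinite(standardized[:, 1:]).all())
print("constant column is all nan:", const_is_nan)
print("other columns are finite:", others_finite)
```

Standardizing only the real features, and leaving the intercept to the model, avoids the problem.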

I have included a complete example below.

import numpy as np
import statsmodels.api as sm
import tensorflow as tf

np.random.seed(0)

N = 500    # number of samples
K = 2      # number of features

# generate the features
m = np.array([0, 0])     # means
s = np.array([5, 1000])  # variances
X = np.random.multivariate_normal(m * np.ones(K), s * np.eye(K), N)

# generate the target
b = 0.5                   # bias
w = np.array([0.107, 0])  # weights
y = b + np.dot(X, w) + np.random.normal(0, 1, N)

# normalize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)

# run a linear regression with statsmodels
reg = sm.OLS(y, sm.add_constant(X)).fit()

# run a linear regression with tensorflow
model = tf.keras.Sequential([
  tf.keras.layers.Dense(units=1)
])

model.compile(loss=tf.losses.MeanSquaredError(), optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
model.fit(X, y, epochs=1000, verbose=0)

bias = model.layers[0].get_weights()[1]
weights = model.layers[0].get_weights()[0].flatten()

# compare the parameters
print('Statsmodels parameters:')
print(np.round(reg.params, 3))
# [0.523 0.25 0.063]

print('Tensorflow parameters:')
print(np.round(np.append(bias, weights), 3))
# [0.528 0.25 0.066]
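
Note that the parameters printed above are for the standardized features, so they are expressed per standard deviation rather than per original unit. With a true weight of 0.107 and a feature variance of 5, the standardized slope is roughly 0.107 × √5 ≈ 0.24, which appears to be why the fitted value comes out near 0.25. The sketch below shows one way to map such estimates back to the original scale. It assumes plain NumPy least squares in place of statsmodels or TensorFlow, and the helper ols and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Illustrative data in the spirit of the example: x1 matters, x2 is large-scale noise
X = np.column_stack((rng.normal(0, np.sqrt(5), n), rng.normal(0, np.sqrt(1000), n)))
y = 0.5 + X @ np.array([0.107, 0.0]) + rng.normal(0, 1, n)

def ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack((np.ones(len(features)), features))
    params, *_ = np.linalg.lstsq(design, target, rcond=None)
    return params

mu = X.mean(axis=0)
sd = X.std(axis=0, ddof=1)

params_std = ols((X - mu) / sd, y)   # coefficients per standard deviation
params_raw = ols(X, y)               # coefficients per original unit

# y = a + sum_j b_j * (x_j - mu_j) / sd_j
#   = (a - sum_j b_j * mu_j / sd_j) + sum_j (b_j / sd_j) * x_j
slopes_back = params_std[1:] / sd
intercept_back = params_std[0] - np.sum(params_std[1:] * mu / sd)
params_back = np.append(intercept_back, slopes_back)

print(np.round(params_raw, 3))
print(np.round(params_back, 3))
```

The two printed vectors should agree, since standardization is only a linear change of variables.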

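【Comments】:

The code works, but the output should have "beta1" (the coefficient of x1) equal to 0.25. In your example, because of how you generate the multivariate Gaussian, the x1 variable is 10 times smaller than the y data, so I get "beta1" = 11 (approximately). But the slope should be y = 1/4 X1 ... I am trying to fix this now, without success so far. I think the bias is the slope, since its value is 0.25 (the value I want for beta1). Can you fix this small problem?
I don't understand how you generated the variables, so I assumed the data follow the relationship y = (0.25)x1. Why did you set the first weight to 0.107? With OLS you can also fit the model without normalizing the data, which is less confusing to me. If I use SGD, do I always need to normalize the features? Because that way I always get estimates for the standardized data. I am just wondering whether I can train the model without standardization. Thanks for your help, by the way.
Flavia, I changed the optimizer to Adam, and it also works properly without the normalization step.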
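
On the question of whether gradient descent needs normalized features: the sketch below is a rough illustration, assuming full-batch gradient descent written in NumPy rather than Keras's mini-batch SGD, with made-up data and a hypothetical helper gradient_descent. With a fixed step size of 0.01, the unscaled large-variance feature makes the updates blow up, while the standardized features converge. Adam rescales its steps per parameter, which is a plausible reason it is less sensitive to feature scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Illustrative data: x1 has variance 5, x2 has variance 1000
X = np.column_stack((rng.normal(0, np.sqrt(5), n), rng.normal(0, np.sqrt(1000), n)))
y = 0.5 + X @ np.array([0.107, 0.0]) + rng.normal(0, 1, n)

def gradient_descent(features, target, learning_rate=0.01, steps=1000):
    """Full-batch gradient descent on the mean squared error."""
    design = np.column_stack((np.ones(len(features)), features))
    params = np.zeros(design.shape[1])
    for _ in range(steps):
        residual = design @ params - target
        gradient = 2.0 / len(target) * (design.T @ residual)
        params = params - learning_rate * gradient
    return params

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Same step size in both runs; only the feature scaling differs
with np.errstate(over="ignore", invalid="ignore"):
    params_raw = gradient_descent(X, y)
params_std = gradient_descent(X_std, y)

print("raw features stayed finite:", bool(np.isfinite(params_raw).all()))
print("standardized fit:", np.round(params_std, 3))
```

A much smaller learning rate would also keep the unscaled run stable, at the cost of slower convergence for the small-variance feature.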