Standard errors from logistic regression in Keras for significance testing: answers

【Question title】: Standard Errors from logistic regression in keras significance test
【Posted】: 2020-03-30 10:15:56
【Question description】:

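I have two independent variables (x1, x2) that I use to predict y (binary). With a plain logistic regression, I can test the significance of each estimated coefficient using nothing more than its standard error.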
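For reference, the classical calculation alluded to here can be written out as follows. These are the standard maximum-likelihood (Wald) results for logistic regression; the symbols $X$, $V$ and $\hat\beta$ are notation chosen for this sketch, not taken from the question.

```latex
% Design matrix X has a leading column of ones, then x1 and x2.
\hat{p}_i = \frac{1}{1 + \exp\!\left(-(\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2})\right)},
\qquad
V = \operatorname{diag}\!\left(\hat{p}_i\,(1 - \hat{p}_i)\right)

\widehat{\operatorname{Cov}}(\hat\beta) = \left(X^{\top} V X\right)^{-1},
\qquad
\operatorname{SE}(\hat\beta_j) = \sqrt{\left[\left(X^{\top} V X\right)^{-1}\right]_{jj}}

z_j = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)}
\;\overset{\text{approx.}}{\sim}\; \mathcal{N}(0, 1)
\quad \text{under } H_0\colon \beta_j = 0
```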

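However, I have a deep network that is based on inputA (some text) and inputB (numeric data).

This means I have to extract the standard errors from the last layer in order to test the significance of the coefficients for inputB. Otherwise there is no way to check whether inputB really adds significantly to the model. How do I extract the standard errors from a logistic regression that runs inside a deep learning model (Keras)?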

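#### Network
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model

# define two sets of inputs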
inputA = Input(shape=(32,))
inputB = Input(shape=(128,))

# the first branch operates on the first input
x = Dense(8, activation="relu")(inputA)
x = Dense(4, activation="relu")(x)
x = Model(inputs=inputA, outputs=x)

# the second branch operates on the second input
y = Dense(64, activation="relu")(inputB)
y = Dense(32, activation="relu")(y)
y = Dense(4, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)

# combine the output of the two branches
combined = concatenate([x.output, y.output])

# our model will accept the inputs of the two branches and
# then output a single value
preds = Dense(1, activation="sigmoid", name="output")(combined)
model = Model(inputs=[x.input, y.input], outputs=[preds])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

model.fit([x_train,numeric], y_train, epochs=40, batch_size=50)

Edit:

I have now found this useful link:

https://stats.stackexchange.com/questions/89484/how-to-compute-the-standard-errors-of-a-logistic-regressions-coefficients

So I assume I can use:

y_preds = model.predict([x_train, numeric], verbose=1)  # gives probabilities

Then I have to feed combined into the following code... but how do I do that? Or is my approach wrong?

#### Standard Error testing
import numpy as np

# Design matrix -- add column of 1's at the beginning of your X_train matrix
X_design = np.hstack([np.ones((combined.shape[0], 1)), combined])

# Initiate matrix of 0's, fill diagonal with each predicted observation's variance
V = np.diagflat(np.prod(y_preds, axis=1))


# Covariance matrix
covLogit = np.linalg.inv(np.dot(np.dot(X_design.T, V), X_design))
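
One detail in the snippet above is worth checking. The linked answer appears to take the product over a two-column probability array of the form `[1 - p, p]` (the layout scikit-learn's `predict_proba` returns), whereas a Keras sigmoid output has a single column, so the product over `axis=1` simply gives back `p`. A minimal sketch with made-up probabilities (the array values are illustrative only):

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n, 1), the shape model.predict returns here
y_preds = np.array([[0.2], [0.5], [0.9]])

# Product over a single column leaves the probabilities unchanged
prod_single = np.prod(y_preds, axis=1)

# Two-column layout [1 - p, p]: here the product is the Bernoulli variance
two_col = np.hstack([1.0 - y_preds, y_preds])
prod_two = np.prod(two_col, axis=1)

# Bernoulli variance p * (1 - p) computed directly from the single column
p = y_preds.ravel()
variance = p * (1.0 - p)

# Diagonal weight matrix for the covariance formula
V = np.diagflat(variance)
```

With single-column predictions, the diagonal of `V` therefore needs `p * (1 - p)` written out explicitly.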

Can anyone add some suggestions or validate this?

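Edit 2

What confuses me is that I have two inputs: a numeric input, numeric, and a non-numeric input, x_train. To test the coefficients, I have to build a matrix with the shape of the combined input (and actually filled with the combined values).

Then I can use the model predictions to test the significance of the last layer's coefficients (as described in the reference link on coefficient testing).

But how do I feed in the combined input... or have I got something wrong somewhere?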
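
One possible way to obtain the combined activations is to build a second Keras model that shares the trained layers and outputs the tensor feeding the `output` layer, then apply the covariance formula to those activations. The sketch below is written under several assumptions: `penultimate_activations` and `wald_table` are hypothetical helper names, the hidden layers are treated as fixed (so the resulting standard errors ignore their uncertainty), and a network trained with Adam for a fixed number of epochs need not sit at the maximum-likelihood estimate, so the z-scores are approximate at best.

```python
import numpy as np


def penultimate_activations(model, inputs, layer_name="output"):
    """Return the activations that feed the named layer (the concatenated branches).

    Assumes a Keras functional model; TensorFlow is imported lazily so the
    NumPy part below can be used without it.
    """
    from tensorflow.keras.models import Model

    feature_model = Model(inputs=model.inputs,
                          outputs=model.get_layer(layer_name).input)
    return feature_model.predict(inputs)


def wald_table(features, probs, weights, bias):
    """Standard errors and Wald z-scores for a sigmoid output layer.

    features: (n, k) activations feeding the output layer
    probs:    (n,) or (n, 1) predicted probabilities
    weights:  (k,) or (k, 1) output-layer kernel
    bias:     scalar or (1,) output-layer bias
    """
    features = np.asarray(features, dtype=float)
    p = np.asarray(probs, dtype=float).ravel()
    # Design matrix with the intercept column first
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    v = p * (1.0 - p)

    # All-zero columns (dead ReLU units) carry no information; skip them
    alive = np.any(X != 0.0, axis=0)
    Xa = X[:, alive]
    fisher = Xa.T @ (Xa * v[:, None])  # X^T V X without an n x n matrix
    cov = np.linalg.pinv(fisher)

    se = np.full(X.shape[1], np.nan)
    se[alive] = np.sqrt(np.diag(cov))
    coef = np.concatenate([np.ravel(bias), np.ravel(weights)])
    return coef, se, coef / se  # z is NaN for skipped columns


# Usage with the names from the question:
# feats = penultimate_activations(model, [x_train, numeric])
# w, b = model.get_layer("output").get_weights()
# coef, se, z = wald_table(feats, model.predict([x_train, numeric]), w, b)
# coef[0] is the bias, coef[1:5] belong to the inputA branch, coef[5:] to inputB
```

Note that the coefficients tested this way belong to the four learned features of each branch, not to the raw columns of inputB, so the result says something about the branch outputs rather than about individual input variables.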

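【Question discussion】:

How do you define predProbs? What is the error message? At first glance it looks fine to me.
Oh, predProbs is y_preds, my mistake.
OK. In that case, according to the answer on StackExchange, V should be defined as V = np.diagflat(y_preds * (1 - y_preds)), shouldn't it? From my point of view, with that modification this should reproduce the behaviour of the StackExchange answer.
Have you looked at SHAP values? Something like that, or permutation tests, looks useful.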
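
The permutation idea from the last comment can be sketched roughly as follows: shuffle the rows of inputB, keep inputA fixed, and see how much the log loss gets worse. This is an informal check, not a formal significance test; `permutation_loss_increase`, `predict_fn` and the held-out arrays named in the usage note are hypothetical.

```python
import numpy as np


def log_loss(y_true, p, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    p = np.clip(np.asarray(p, dtype=float).ravel(), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float).ravel()
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def permutation_loss_increase(predict_fn, input_a, input_b, y_true,
                              n_repeats=100, seed=0):
    """Loss increase when the rows of input_b are shuffled.

    predict_fn takes a list [input_a, input_b] and returns probabilities,
    which matches the call signature of model.predict in the question.
    """
    rng = np.random.default_rng(seed)
    base = log_loss(y_true, predict_fn([input_a, input_b]))
    increases = np.empty(n_repeats)
    for i in range(n_repeats):
        # Break the link between input_b and the targets, leave input_a intact
        shuffled_b = input_b[rng.permutation(len(input_b))]
        increases[i] = log_loss(y_true, predict_fn([input_a, shuffled_b])) - base
    return base, increases
```

A call such as `permutation_loss_increase(model.predict, x_test, numeric_test, y_test)` on held-out data would return the baseline loss and one loss increase per shuffle; increases that are consistently well above zero suggest that the model relies on inputB.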

Tags: python keras logistic-regression

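【Solution 1】:

Not sure whether this answer suits you...

What would I do?

Test a network that uses only inputA.
Test a network that uses only inputB.
Test the combined network.

I would use whichever one wins.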

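To see how much of each input the network lets through:

If you get the weights of the last layer, you will have two tensors:

An (8, 1) weight matrix (Keras stores Dense kernels as (input_dim, units); the orientation does not matter here).
A (1,) bias value.

Get them: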

w, b = model.get_layer("output").get_weights()

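Flatten w (that is fine, because you only have 1 output unit) and look at how heavily the network weights each input, following the order in which x and y were concatenated: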

w = w.reshape((-1,))
weights_x = w[:4]  # the first 4 weights multiply `x.output`
weights_y = w[4:]  # the last 4 weights multiply `y.output`

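【Discussion】:

Considering one model with only inputA and another with both inputA and inputB is a good suggestion. I have already looked into this by comparing the models' F-scores. However, a significance test on the coefficients (only for the logistic regression in the last layer) should also lead to interesting conclusions. I am just not sure how to retrieve the combined input and run the significance test.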