Saved Random Forest model produces different results on the same dataset

【Question title】: Saved Random Forest model produces different results on the same dataset
【Posted】: 2020-11-28 06:09:41
【Question】:

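I cannot reproduce results when I predict with a Random Forest model saved on disk, even though I use exactly the same dataset. In other words, I train a model on dataset A and save it on my local machine, then I load it and use it to predict dataset B, and every time I predict dataset B I get different results.

I am aware of the randomness involved in the Random Forest classifier, but as far as I understand that randomness occurs during training. Once the model has been created, the predictions should not change if you predict on the same data.

The training script is structured as follows: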

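from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# Forward slashes: in "C:\2020_05.csv" Python would read "\202" as an octal escape
df_train = spark.read.format("csv") \
      .option('header', 'true') \
      .option('inferSchema', 'true') \
      .option('delimiter', ';') \
      .load("C:/2020_05.csv")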

#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = [name for name, dtype in df_train.dtypes if dtype == 'string']

indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]

# Fit each indexer on the training data and replace the raw string column
for indexer in indexers:
    df_train = indexer.fit(df_train).transform(df_train)
    df_train = df_train.drop(indexer.getInputCol())

indexed_cols = [name for name in df_train.columns if name.endswith("_indexed")]

# One-hot encode every indexed column, then drop the index column
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_train = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])

    encoder_model_train = one_hot_encoder_estimator_train.fit(df_train)
    df_train = encoder_model_train.transform(df_train)
    df_train = df_train.drop(inputCol)


inputCols = [x for x in df_train.columns if x != "id" and x != "churn"]

vector_assembler_train = VectorAssembler(
      inputCols=inputCols,
      outputCol='features',
      handleInvalid='keep'
)

df_train = vector_assembler_train.transform(df_train)

df_train = df_train.select('churn', 'features', 'id')

df_train_1 = df_train.filter(df_train['churn'] == 0).sample(withReplacement=False, fraction=0.3, seed=7)
df_train_2 = df_train.filter(df_train['churn'] == 1).sample(withReplacement=True, fraction=20.0, seed=7)
df_train = df_train_1.unionAll(df_train_2) 

rf = RandomForestClassifier(labelCol="churn", featuresCol="features")

# Single-point grid: every hyperparameter is pinned, including the seed
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [100]) \
    .addGrid(rf.maxDepth, [15]) \
    .addGrid(rf.maxBins, [32]) \
    .addGrid(rf.featureSubsetStrategy, ['onethird']) \
    .addGrid(rf.subsamplingRate, [1.0]) \
    .addGrid(rf.minInfoGain, [0.0]) \
    .addGrid(rf.impurity, ['gini']) \
    .addGrid(rf.minInstancesPerNode, [1]) \
    .addGrid(rf.seed, [10]) \
    .build()

evaluator = BinaryClassificationEvaluator(labelCol="churn")

crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
model = crossval.fit(df_train)
model.save("C:/myModel")

The test script is as follows:

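from pyspark.sql import SparkSession
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidatorModel

spark = SparkSession.builder.getOrCreate()

# Forward slashes: in "C:\2020_06.csv" Python would read "\202" as an octal escape
df_test = spark.read.format("csv") \
      .option('header', 'true') \
      .option('inferSchema', 'true') \
      .option('delimiter', ';') \
      .load("C:/2020_06.csv")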
  
#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = [name for name, dtype in df_test.dtypes if dtype == 'string']

indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]

# Note: the indexers are fit again here, this time on the test data
for indexer in indexers:
    df_test = indexer.fit(df_test).transform(df_test)
    df_test = df_test.drop(indexer.getInputCol())

indexed_cols = [name for name in df_test.columns if name.endswith("_indexed")]

# Note: the one-hot encoders are also fit again on the test data
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_test = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])

    encoder_model_test = one_hot_encoder_estimator_test.fit(df_test)
    df_test = encoder_model_test.transform(df_test)
    df_test = df_test.drop(inputCol)


inputCols = [x for x in df_test.columns if x != "id" and x != "churn"]

vector_assembler_test = VectorAssembler(
      inputCols=inputCols,
      outputCol='features',
      handleInvalid='keep'
)

df_test = vector_assembler_test.transform(df_test)

df_test = df_test.select('churn', 'features', 'id')


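model = CrossValidatorModel.load("C:/myModel")

result = model.transform(df_test)

# `evaluator` was only defined in the training script, so recreate it here
evaluator = BinaryClassificationEvaluator(labelCol="churn")
areaUnderROC = evaluator.evaluate(result)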

tp = result.filter("prediction == 1.0 AND churn == 1").count()
tn = result.filter("prediction == 0.0 AND churn == 0").count()
fp = result.filter("prediction == 1.0 AND churn == 0").count()
fn = result.filter("prediction == 0.0 AND churn == 1").count()

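Every time I run the test script, the AUC and the confusion matrix are different. I am using Spark 2.4.5 and Python 3.7 on a Windows 10 machine. Any suggestions or ideas are much appreciated.

EDIT: The problem is related to the StringIndexer/One-Hot Encoding step. When I use only numerical variables, I am able to reproduce the results. The question remains open, since I cannot explain why this happens.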

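【Comments】:

Have you checked that your train/test data stay the same each time you get different results?
@AhmetTavli The number of rows and columns is the same on every execution.

Tags: apache-spark pyspark random-forest apache-spark-ml one-hot-encoding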

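【Answer 1】:

From my experience, the problem is that you are re-fitting the OneHotEncoder in the test script.

Here is how one-hot encoding works, from the docs:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].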
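To make the quoted mapping concrete, here is a minimal pure-Python sketch of it. The `one_hot` function is a hypothetical stand-in written for illustration, not Spark's implementation, and it takes the category index as an integer where Spark uses a double.

```python
def one_hot(index, num_categories, drop_last=True):
    """Mimic the documented mapping for a single category index."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # With drop_last=True the last category has no slot and stays all zeros.
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4, 5))  # [0.0, 0.0, 0.0, 0.0]
# The vector length depends on how many categories the encoder saw when it was fit:
print(len(one_hot(2, 5)), len(one_hot(2, 4)))  # 4 3
```

The last line is the part that matters for this question: an encoder fit on data with a different number of categories produces vectors of a different length, so the features no longer line up with what the forest was trained on.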

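So every time the encoder is fit on different data (which is naturally the case for train vs. test), the mapping it learns, and therefore the values it puts in the vector, can differ. The same holds for the StringIndexer that feeds it, which the test script also fits again.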
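The effect is easiest to see one step earlier, in the indexing. By default StringIndexer assigns index 0 to the most frequent label, so the same string can get a different index when the indexer is fit on different data. The sketch below imitates that ordering in plain Python; `fit_string_index` and the `plan` values are made up for illustration, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Return {label: index}, most frequent label first (ties broken alphabetically here)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical values of one categorical column in the two files
train_plans = ["basic", "basic", "basic", "premium", "premium", "trial"]
test_plans = ["premium", "premium", "premium", "basic", "trial", "trial"]

train_index = fit_string_index(train_plans)
test_index = fit_string_index(test_plans)

print(train_index)  # {'basic': 0.0, 'premium': 1.0, 'trial': 2.0}
print(test_index)   # {'premium': 0.0, 'trial': 1.0, 'basic': 2.0}
```

Here "basic" is index 0.0 for the model that was trained, but 2.0 in the re-fit test features. As far as I know, Spark 2.4 also leaves the order of equally frequent labels undefined, which could explain why even repeated runs over the same test file disagree, but I have not verified that on this setup.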

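You should add the OneHotEncoder to a Pipeline together with the model you train, fit the pipeline, save it, and then load it again in the test script. That way the one-hot encoded values are guaranteed to map to the same values every time you run data through the pipeline.

For more details on saving and loading pipelines, see the documentation.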
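A rough sketch of what that could look like with the column names from the question follows. The helper names `feature_columns` and `build_pipeline`, the save path in the comments, and the choice of `handleInvalid="keep"` (so labels unseen during training do not raise an error) are my assumptions, not something from the original scripts. It also assumes there is at least one categorical column.

```python
def feature_columns(categorical_cols, numeric_cols):
    """Names of the columns the VectorAssembler should combine."""
    return [c + "_encoded" for c in categorical_cols] + list(numeric_cols)


def build_pipeline(categorical_cols, numeric_cols, label_col="churn"):
    # Imported inside the function so the helper above works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    try:
        from pyspark.ml.feature import OneHotEncoderEstimator  # Spark 2.3 / 2.4
    except ImportError:
        # Spark 3.x renamed the estimator to OneHotEncoder
        from pyspark.ml.feature import OneHotEncoder as OneHotEncoderEstimator

    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_indexed", handleInvalid="keep")
        for c in categorical_cols
    ]
    encoder = OneHotEncoderEstimator(
        inputCols=[c + "_indexed" for c in categorical_cols],
        outputCols=[c + "_encoded" for c in categorical_cols],
    )
    assembler = VectorAssembler(
        inputCols=feature_columns(categorical_cols, numeric_cols),
        outputCol="features",
        handleInvalid="keep",
    )
    rf = RandomForestClassifier(labelCol=label_col, featuresCol="features", seed=10)
    # Indexers, encoder, assembler and forest are fit together and saved together.
    return Pipeline(stages=indexers + [encoder, assembler, rf])


# Training script (raw, un-indexed DataFrame):
#   model = build_pipeline(cat_cols, num_cols).fit(df_train)
#   model.write().overwrite().save("C:/myPipelineModel")
#
# Test script (no fit calls at all):
#   from pyspark.ml import PipelineModel
#   model = PipelineModel.load("C:/myPipelineModel")
#   result = model.transform(df_test)
```

With this layout the test script only loads and transforms, so the label-to-index and index-to-vector mappings are the ones learned from the training data. The class re-sampling from the question would have to be applied to the raw training rows before calling `fit`, and the pipeline itself can be passed to CrossValidator as the estimator if the grid search is still wanted.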

【Comments】:

Thanks! I will give it a try. If it works, I will accept this answer as the solution.