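【Posted】: 2020-11-28 06:09:41
【Question】:
I cannot reproduce my results when I use a random forest model saved to disk to predict on exactly the same dataset. In other words, I train a model on dataset A and save it on my local machine; I then load it and use it to predict on dataset B, and every time I predict on dataset B I get different results.
I am aware of the randomness involved in a random forest classifier, but as far as I understand, that randomness applies during training. Once the model has been created, the predictions should not change if you predict on the same data.
The training script is structured as follows: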
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raw string: in a normal string literal "\202" would be parsed as an octal escape
df_train = spark.read.format("csv") \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .option('delimiter', ';') \
    .load(r"C:\2020_05.csv")

# The problem seems to be related to the StringIndexer/One-Hot Encoding
# If I remove all categorical variables the results can be reproduced
categorical_variables = [name for name, dtype in df_train.dtypes if dtype == 'string']

# Index every string column, then drop the original column
indexers = [StringIndexer(inputCol=col, outputCol=col + "_indexed") for col in categorical_variables]
for indexer in indexers:
    df_train = indexer.fit(df_train).transform(df_train)
    df_train = df_train.drop(indexer.getInputCol())

indexed_cols = [name for name in df_train.columns if name.endswith("_indexed")]

# One-hot encode every indexed column, then drop the indexed column
for inputCol in indexed_cols:
    outputCol = inputCol.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_train = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])
    encoder_model_train = one_hot_encoder_estimator_train.fit(df_train)
    df_train = encoder_model_train.transform(df_train)
    df_train = df_train.drop(inputCol)

# Assemble every column except the id and the label into one feature vector
inputCols = [x for x in df_train.columns if x != "id" and x != "churn"]
vector_assembler_train = VectorAssembler(
    inputCols=inputCols,
    outputCol='features',
    handleInvalid='keep'
)
df_train = vector_assembler_train.transform(df_train)
df_train = df_train.select('churn', 'features', 'id')

# Rebalance the classes: undersample churn == 0, oversample churn == 1
df_train_1 = df_train.filter(df_train['churn'] == 0).sample(withReplacement=False, fraction=0.3, seed=7)
df_train_2 = df_train.filter(df_train['churn'] == 1).sample(withReplacement=True, fraction=20.0, seed=7)
df_train = df_train_1.union(df_train_2)

rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [100]) \
    .addGrid(rf.maxDepth, [15]) \
    .addGrid(rf.maxBins, [32]) \
    .addGrid(rf.featureSubsetStrategy, ['onethird']) \
    .addGrid(rf.subsamplingRate, [1.0]) \
    .addGrid(rf.minInfoGain, [0.0]) \
    .addGrid(rf.impurity, ['gini']) \
    .addGrid(rf.minInstancesPerNode, [1]) \
    .addGrid(rf.seed, [10]) \
    .build()
evaluator = BinaryClassificationEvaluator(labelCol="churn")
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
model = crossval.fit(df_train)
model.save("C:/myModel")
The test script is as follows:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidatorModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raw string: in a normal string literal "\202" would be parsed as an octal escape
df_test = spark.read.format("csv") \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .option('delimiter', ';') \
    .load(r"C:\2020_06.csv")

# The problem seems to be related to the StringIndexer/One-Hot Encoding
# If I remove all categorical variables the results can be reproduced
categorical_variables = [name for name, dtype in df_test.dtypes if dtype == 'string']

# Note: the indexers and encoders are fitted again here, on the test data
indexers = [StringIndexer(inputCol=col, outputCol=col + "_indexed") for col in categorical_variables]
for indexer in indexers:
    df_test = indexer.fit(df_test).transform(df_test)
    df_test = df_test.drop(indexer.getInputCol())

indexed_cols = [name for name in df_test.columns if name.endswith("_indexed")]

for inputCol in indexed_cols:
    outputCol = inputCol.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_test = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])
    encoder_model_test = one_hot_encoder_estimator_test.fit(df_test)
    df_test = encoder_model_test.transform(df_test)
    df_test = df_test.drop(inputCol)

inputCols = [x for x in df_test.columns if x != "id" and x != "churn"]
vector_assembler_test = VectorAssembler(
    inputCols=inputCols,
    outputCol='features',
    handleInvalid='keep'
)
df_test = vector_assembler_test.transform(df_test)
df_test = df_test.select('churn', 'features', 'id')

model = CrossValidatorModel.load("C:/myModel")
result = model.transform(df_test)

# The evaluator has to be defined in this script as well
evaluator = BinaryClassificationEvaluator(labelCol="churn")
areaUnderROC = evaluator.evaluate(result)

# Confusion matrix counts
tp = result.filter("prediction == 1.0 AND churn == 1").count()
tn = result.filter("prediction == 0.0 AND churn == 0").count()
fp = result.filter("prediction == 1.0 AND churn == 0").count()
fn = result.filter("prediction == 0.0 AND churn == 1").count()
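For readers who want to check the four counts without Spark, here is a minimal pure-Python sketch of the same confusion-matrix logic. The function name `confusion_counts` and the sample labels and predictions are hypothetical, made up for illustration; precision and recall are added only to show how the counts are typically used.

```python
def confusion_counts(labels, predictions):
    """Return (tp, tn, fp, fn) for binary labels and predictions (0 or 1)."""
    tp = tn = fp = fn = 0
    for label, prediction in zip(labels, predictions):
        if prediction == 1 and label == 1:
            tp += 1
        elif prediction == 0 and label == 0:
            tn += 1
        elif prediction == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


# Hypothetical labels and predictions, made up for illustration only
churn = [1, 0, 1, 1, 0, 0]
prediction = [1, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(churn, prediction)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, tn, fp, fn, precision, recall)
```

If the inputs are identical between two runs, these counts are identical too, so differing counts point at differing predictions or differing features.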
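Every time I run the test script, the AUC and the confusion matrix are different. I am using Spark 2.4.5 and Python 3.7 on a Windows 10 machine. Any suggestions or ideas are much appreciated.
Edit: the problem is related to the StringIndexer/One-Hot Encoding step. When I use only numeric variables, I can reproduce the results. The question remains open, because I cannot explain why this happens.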
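One possible way the indexing step could cause this, offered as an assumption and not confirmed anywhere in the question: by default StringIndexer orders labels by descending frequency, so an indexer fitted on the test data may assign a category a different index than an indexer fitted on the training data would. The pure-Python sketch below only mimics that ordering. It is not Spark's implementation, and `fit_frequency_index` and the sample values are hypothetical.

```python
from collections import Counter


def fit_frequency_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically so that this sketch is deterministic;
    that tie-break is an assumption of the example.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Hypothetical categorical column in the training and test data
train_plan = ["basic", "basic", "basic", "premium", "premium", "gold"]
test_plan = ["premium", "premium", "premium", "gold", "gold", "basic"]

train_index = fit_frequency_index(train_plan)
test_index = fit_frequency_index(test_plan)

# Categories whose index differs between the two fitted mappings
changed = {label for label in train_index if train_index[label] != test_index.get(label)}

print(train_index)
print(test_index)
print(sorted(changed))
```

In this sketch every category ends up with a different index, so the one-hot vector positions the model saw during training would no longer mean the same thing at prediction time. As far as I know, Spark 2.4 also leaves the order of labels with equal frequency undefined, which might change the mapping from one run to the next; treat that as something to verify, not as an established cause.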
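【Comments】:
-
Each time you get different results, have you checked that your training/test data stay the same?
-
@AhmetTavli The number of rows and columns is the same on every run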
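Matching row and column counts do not show that the values themselves are identical. A minimal sketch of a stricter check, assuming the rows are small enough to hold in memory, is to compare an order-independent fingerprint of the contents. The helper `dataset_fingerprint` and the sample rows are hypothetical.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return an order-independent SHA-256 fingerprint of an iterable of rows."""
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()


# Hypothetical rows, made up for illustration only
run_1 = [(1, "basic", 0.0), (2, "gold", 1.0)]
run_2 = [(2, "gold", 1.0), (1, "basic", 0.0)]  # same rows, different order
run_3 = [(1, "basic", 1.0), (2, "gold", 0.0)]  # same shape, different values

same_content = dataset_fingerprint(run_1) == dataset_fingerprint(run_2)
same_shape_only = dataset_fingerprint(run_1) == dataset_fingerprint(run_3)
print(same_content, same_shape_only)
```

Applied to the assembled feature rows of two runs, a differing fingerprint would indicate that the features fed to the model changed even though the shape did not.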
Tags: apache-spark pyspark random-forest apache-spark-ml one-hot-encoding