Not able to pass StringIndexer as list to the model pipeline stage

[Question title]: Not able to pass StringIndexer as list to the model pipeline stage
[Posted]: 2023-03-25 08:24:01
[Question description]:

I'm new to PySpark pipelines. I'm trying to create the stages of a pipeline by passing the following list:

pipeline = Pipeline().setStages([indexer,assembler,dtc_model])

I'm applying feature indexing on multiple columns:

cat_col = ['Gender','Habit','Mode']

indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]

Running fit on the pipeline raises the error shown below:

model_pipeline = pipeline.fit(train_df)

How can we pass a list into the stages? Is there a workaround, or a better way to achieve this?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-3999694668013877> in <module>
----> 1 model_pipeline = pipeline.fit(train_df)

/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
    130                 return self.copy(params)._fit(dataset)
    131             else:
--> 132                 return self._fit(dataset)
    133         else:
    134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
     95             if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
     96                 raise TypeError(
---> 97                     "Cannot recognize a pipeline stage of type %s." % type(stage))
     98         indexOfLastEstimator = -1
     99         for i, stage in enumerate(stages):

TypeError: Cannot recognize a pipeline stage of type <class 'list'>.

[Question discussion]:

Tags: pyspark apache-spark-mllib apache-spark-ml

[Solution 1]:

Try the following:

cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]

assembler = VectorAssembler...
dtc_model = DecisionTreeClassifier...

# Create pipeline using transformers and estimators
stages = indexer
stages.append(assembler)
stages.append(dtc_model)
pipeline = Pipeline().setStages(stages)

model_pipeline = pipeline.fit(train_df)

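[Discussion]:

After applying the suggested change I get the same error: TypeError: Cannot recognize a pipeline stage of type <class 'list'>
Updated the answer. I haven't run it myself, but give it a try.
I fixed it with: pipeline = Pipeline().setStages(indexer + [assembler,dtc_model]) Thanks for your help!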
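
For anyone landing here later: the error comes from nesting. `[indexer, assembler, dtc_model]` puts the whole `indexer` list in as the first stage, and the traceback shows `Pipeline` rejecting any stage that is not an Estimator or Transformer. Below is a minimal sketch of flattening the stages before handing them to the pipeline. `flatten_stages` is a hypothetical helper, not part of PySpark, and the strings are stand-ins for the real stage objects so the snippet runs without Spark.

```python
from typing import Any, List


def flatten_stages(*items: Any) -> List[Any]:
    """Build one flat list from a mix of single stages and lists/tuples of stages.

    Hypothetical helper for illustration only; it is not a PySpark API.
    """
    flat: List[Any] = []
    for item in items:
        if isinstance(item, (list, tuple)):
            # A list of stages (e.g. one StringIndexer per column): splice its elements in.
            flat.extend(flatten_stages(*item))
        else:
            # A single stage: keep it as one element.
            flat.append(item)
    return flat


# Stand-in values (assumption): in the thread these would be the list of
# StringIndexer stages, the VectorAssembler and the DecisionTreeClassifier.
indexer = ["Gender_indexer", "Habit_indexer", "Mode_indexer"]
assembler = "assembler"
dtc_model = "dtc_model"

# What the question passed: the first element is itself a list.
nested = [indexer, assembler, dtc_model]
print(type(nested[0]).__name__)

# What the pipeline needs: every element is a single stage.
flat = flatten_stages(indexer, assembler, dtc_model)
print(flat)
```

With the real objects this would presumably be used as `Pipeline().setStages(flatten_stages(indexer, assembler, dtc_model))`, which gives the same list as `indexer + [assembler, dtc_model]`. As a possible alternative on Spark 3.0 or later, `StringIndexer` also accepts `inputCols` and `outputCols`, so a single indexer stage may be able to cover all three columns; I have not tried that against this dataset.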