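【Posted】: 2019-01-07 21:23:21
【Question】:
How do I print the decision path of a specific sample in a Spark DataFrame?
Spark Version: '2.3.1'
The code below prints the decision path of the whole model. How can I make it print the decision path for a specific sample instead, for example the row where tagvalue equals 2?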
import findspark
findspark.init()  # locate the Spark installation so that pyspark can be imported
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, col, row_number
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('demo')\
    .master('local[*]')\
    .getOrCreate()
data = pd.DataFrame({
    'ball': [0, 1, 2, 3],
    'keep': [4, 5, 6, 7],
    'hall': [8, 9, 10, 11],
    'fall': [12, 13, 14, 15],
    'mall': [16, 17, 18, 10],
    'label': [21, 31, 41, 51]
})
df = spark.createDataFrame(data)
df = df.withColumn("mono_ID", monotonically_increasing_id())
w = Window().orderBy("mono_ID")
df = df.select(row_number().over(w).alias("tagvalue"), col("*"))
assembler = VectorAssembler(
    inputCols=['ball', 'keep', 'hall', 'fall'], outputCol='features')
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label')
pipeline = Pipeline(stages=[assembler, dtc]).fit(df)
transformed_pipeline = pipeline.transform(df)
#ml_pipeline = pipeline.stages[1]
result = transformed_pipeline.filter(transformed_pipeline.tagvalue == 2)
result.select('tagvalue', 'prediction').show()
+--------+----------+
|tagvalue|prediction|
+--------+----------+
|       2|      31.0|
+--------+----------+
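The above prints the prediction for tagvalue 2. Now I want the decision path in the algorithm that led to the prediction for that row, not the answer for the whole model.
I am aware of the following, but it prints the decision path of the whole model rather than that of a specific sample.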
ml_pipeline = pipeline.stages[1]
print(ml_pipeline.toDebugString)
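An equivalent exists in scikit-learn. What is the equivalent in Spark?
Update 1:
If you run the following code in scikit-learn, it prints the decision path for that particular sample. The snippet is taken straight from the scikit-learn website.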
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold
# First let's retrieve the decision path of each sample. The decision_path
# method allows to retrieve the node indicator functions. A non zero element of
# indicator matrix at the position (i, j) indicates that the sample i goes
# through the node j.
node_indicator = estimator.decision_path(X_test)
# Similarly, we can also have the leaves ids reached by each sample.
leave_id = estimator.apply(X_test)
# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's make it for the sample.
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]
print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:
    # Note: the scikit-learn example tests `==` here to skip the leaf;
    # with `!=` only the leaf node is printed.
    if leave_id[sample_id] != node_id:
        continue
    if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
        threshold_sign = "<="
    else:
        threshold_sign = ">"
    print("decision id node %s : (X_test[%s, %s] (= %s) %s %s)" %
          (node_id,
           sample_id,
           feature[node_id],
           X_test[sample_id, feature[node_id]],
           threshold_sign,
           threshold[node_id]))
The output looks like this:
Rules used to predict sample 0:
decision id node 4 : (X_test[0, -2] (= 5.1) > -2.0)
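For reference, the `-2` in that output is the placeholder scikit-learn stores in `feature` and `threshold` for a leaf node, so the line describes the leaf rather than a split. A minimal sketch of how the full path can be read from the same parallel-array layout follows. The array values and the sample are invented for illustration and are not taken from the fitted iris model, and `path_rules` is a hypothetical helper name.

```python
TREE_LEAF = -1  # scikit-learn marks "no child" with -1

# Invented 5-node tree: node 0 splits on feature 3, node 2 splits on feature 2.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
feature = [3, -2, 2, -2, -2]            # -2 means "leaf, no feature"
threshold = [0.8, -2.0, 4.95, -2.0, -2.0]


def path_rules(sample):
    """Collect the split tests a sample passes through, then the leaf id."""
    rules, node = [], 0
    while children_left[node] != TREE_LEAF:
        value = sample[feature[node]]
        go_left = value <= threshold[node]
        rules.append((node, feature[node], "<=" if go_left else ">", threshold[node]))
        node = children_left[node] if go_left else children_right[node]
    return rules, node


# Illustrative sample with four features.
rules, leaf = path_rules([5.8, 2.8, 5.1, 2.4])
for node_id, feat, sign, thr in rules:
    print("decision id node %s : feature %s %s %s" % (node_id, feat, sign, thr))
print("leaf node: %s" % leaf)
```

Walking from the root and stopping at the leaf lists only real split tests, which is the kind of per-sample output the question is asking Spark to produce.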
【Comments】:
- PySpark does not have this feature. Do you need help implementing it using the information from .toDebugString()?
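The approach suggested in that comment could be sketched roughly as follows: parse the text of the debug string and keep only the branches whose condition holds for one feature vector. This is a sketch under assumptions, not a PySpark API. `decision_path` is a hypothetical helper, the `If (feature N <= T)` / `Else (feature N > T)` / `Predict: X` line shapes are assumed to match what Spark 2.x prints for continuous features, and `SAMPLE` is a hand-written stand-in rather than output from the model in the question.

```python
import re

# Assumed line shapes in the debug string (continuous splits only):
#   "If (feature 0 <= 1.5)", "Else (feature 0 > 1.5)", "Predict: 31.0"
_SPLIT = re.compile(r"^(If|Else) \(feature (\d+) (<=|>) (\S+)\)$")
_LEAF = re.compile(r"^Predict: (\S+)$")


def decision_path(debug_string, features):
    """Return (rules, prediction) for one feature vector.

    Hypothetical helper, not part of PySpark: it walks the indented text and
    keeps only the branches whose condition holds for `features`.
    """
    rules = []
    skip_deeper_than = None  # indent of a branch that was not taken
    for line in debug_string.splitlines()[1:]:  # first line is the model header
        text = line.strip()
        if not text:
            continue
        indent = len(line) - len(line.lstrip())
        if skip_deeper_than is not None:
            if indent > skip_deeper_than:
                continue  # still inside the branch that was not taken
            skip_deeper_than = None
        split = _SPLIT.match(text)
        if split:
            _, index, op, thr = split.groups()
            value = features[int(index)]
            taken = value <= float(thr) if op == "<=" else value > float(thr)
            if taken:
                rules.append("feature %s (= %s) %s %s" % (index, value, op, thr))
            else:
                skip_deeper_than = indent
            continue
        leaf = _LEAF.match(text)
        if leaf:
            return rules, float(leaf.group(1))
    raise ValueError("no leaf reached; unexpected debug string layout")


# Hand-written stand-in for a debug string (not real model output).
SAMPLE = """DecisionTreeClassificationModel (uid=dtc_demo) of depth 2 with 5 nodes
  If (feature 0 <= 0.5)
   Predict: 21.0
  Else (feature 0 > 0.5)
   If (feature 0 <= 1.5)
    Predict: 31.0
   Else (feature 0 > 1.5)
    Predict: 41.0
"""

rules, prediction = decision_path(SAMPLE, [1.0, 5.0, 9.0, 13.0])
for rule in rules:
    print(rule)
print("prediction:", prediction)
```

With the pipeline from the question, the call would presumably be `decision_path(pipeline.stages[1].toDebugString, result.first()['features'])`. Splits on categorical features (printed as `in {...}`) are not handled by this sketch, and the text layout should be checked against the Spark version in use.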
Tags: apache-spark pyspark apache-spark-ml