Answers to: Issue in denoting negative numbers as a category for Random Forest algorithm in PySpark

【Question title】: Issue in denoting negative numbers as a category for Random Forest algorithm in PySpark
【Posted】: 2015-11-30 22:07:28
【Question description】:

This question is a continuation of another question of mine at this link.

I am using PySpark to work with the Random Forest algorithm for classification in Spark MLlib. My sample dataset looks like this:

Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000

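As you can see, the fields are in non-numeric format, so they need encoding before being passed to the model. The last value in each row is a numeric field in string (unicode) format, and some of the values have a - sign in front. Here, whenever the features are Level1,Male,New York,New York, the prediction should be 352.888890. So 352.888890 becomes a category rather than just a numeric value. I wrote this code, in which I read the data and form the training_set RDD. I then encode the non-numeric fields, form an RDD of LabeledPoint, and pass it to the model for classification. Here is my current code: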

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd
import sqlite3

from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler 

from pyspark.sql.functions import col
from pyspark.mllib.tree import RandomForest, RandomForestModel

def extract(line):

    return (line[0],line[1],line[2],line[3],line[4].lstrip('-'))

# Drop the first row (index 0) and keep only the line text
input_file = sc.textFile('file1.csv').zipWithIndex().filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0])

input_data = (input_file
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) >1 )
    .map(extract)) # Map to tuples

# Divide the input data in training and test set with 80%-20% ratio
(training_data, test_data) = input_data.randomSplit([0.8, 0.2])

# the column in training_data which is label - a numeric field in string format
label_col = "x4"

# converting RDD to dataframe
training_data_df = training_data.toDF(("x0","x1","x2","x3","x4"))


# Indexers encode strings with doubles
string_indexers = [
   StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
   for x in training_data_df.columns if x != label_col
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in training_data_df.columns if x != label_col],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(training_data_df)
indexed = model.transform(training_data_df)

label_points = (indexed
    .select(col(label_col).cast("double").alias("label"), col("features"))
    .map(lambda row: LabeledPoint(row.label, row.features)))

feature1 = training_data.map(lambda x: x[0]).distinct().collect()
feature2 = training_data.map(lambda x: x[1]).distinct().collect()
feature3 = training_data.map(lambda x: x[2]).distinct().collect()
feature4 = training_data.map(lambda x: x[3]).distinct().collect()
label_set = training_data.map(lambda x: x[4]).distinct().collect()

model_classifier = RandomForest.trainClassifier(label_points,numClasses=len(label_set),categoricalFeaturesInfo={0: len(feature1), 1: len(feature2), 2: len(feature3),3: len(feature4)},
                                 numTrees=50, featureSubsetStrategy="auto",
                                 impurity='gini', maxDepth=10, maxBins=max([len(feature1),len(feature2),len(feature3),len(feature4)]))

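When I run this code, I get the following error: java.lang.IllegalArgumentException: GiniAggregator given label -495.8001345 but requires label is non-negative.

The problem is that some of the label values are negative numbers. How can I use negative numeric values to denote categories rather than numbers?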

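【Question comments】:

Tags: python encoding apache-spark machine-learning pyspark

【Solution 1】:

In the Spark source, the Gini impurity logic checks that each label lies in the required range from 0 to numClasses, so a negative label such as -495.8001345 is rejected. See the source below: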

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala
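
To make the constraint concrete, here is a rough plain-Python sketch of the rule (not the Scala code from Gini.scala). It assumes, as the MLlib documentation states for classification, that labels must be the class indices 0, 1, ..., numClasses-1. The helper name find_invalid_labels is hypothetical.

```python
def find_invalid_labels(labels, num_classes):
    """Return the labels that are not integer-valued class indices
    in the range [0, num_classes)."""
    invalid = []
    for label in labels:
        value = float(label)
        # A usable class label is a whole number between 0 and num_classes - 1.
        is_class_index = value.is_integer() and 0 <= value < num_classes
        if not is_class_index:
            invalid.append(label)
    return invalid


# A few label strings taken from the sample dataset in the question.
sample_labels = ["352.888890", "-495.8001345", "495.8", "7000"]
print(find_invalid_labels(sample_labels, num_classes=4))
```

Under this rule none of the raw values from the dataset would qualify as a class label as-is, not only the negative ones, so casting the column to double is unlikely to be enough on its own.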

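After some research, I found someone pointing out that the labels causing the problem need to be converted into a range that Gini impurity can handle correctly: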

http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-Error-td23847.html
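
One way that conversion might look, sketched in plain Python so it runs without Spark: give every distinct raw label a class index starting at 0, train on the indices, and keep the reverse mapping to turn predictions back into the original values. The names build_label_index, label_to_index and index_to_label are hypothetical, and this is only a sketch of the idea, not something taken from the linked thread.

```python
def build_label_index(raw_labels):
    """Map each distinct raw label to a class index 0.0, 1.0, 2.0, ..."""
    distinct = sorted(set(raw_labels))
    return {label: float(i) for i, label in enumerate(distinct)}


# Label strings taken from the sample dataset, negatives included.
raw_labels = ["352.888890", "495.8001345", "-495.8001345", "495.8", "-552.8234"]

label_to_index = build_label_index(raw_labels)
# Reverse mapping, used to translate a predicted class back to its raw label.
index_to_label = {index: label for label, index in label_to_index.items()}

encoded = [label_to_index[label] for label in raw_labels]
decoded = [index_to_label[index] for index in encoded]

print(encoded)
print(decoded == raw_labels)
```

In the question's code, the same idea would mean building the mapping from label_set and passing the mapped index, rather than the label cast to double, as the first argument of LabeledPoint. Applying a StringIndexer to the label column as well as the feature columns is another option.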

Hope this helps.

【Comments】: