[Question title]: Spark java.lang.StackOverflowError in logistic regression fit with large dataset
[Posted]: 2017-11-23 04:23:27
[Question description]:

I am trying to fit a logistic regression model to a dataset with 470 features and 10 million training instances. Here is a snippet of my code.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula

formula = RFormula(formula="label ~ .-classWeight")

bestregLambdaVal = 0.005
bestregAlphaVal = 0.01

lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal,
                        elasticNetParam=bestregAlphaVal, weightCol="classWeight")
pipeLineLr = Pipeline(stages=[formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight', 'label']])

I also set a checkpoint directory,

sc.setCheckpointDir('checkpoint/')

as suggested here: Spark gives a StackOverflowError when training using ALS

But I still get an error. Here is part of the trace:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
    at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)

I would also like to point out that the 470 feature columns were added to the Spark DataFrame iteratively using withColumn().
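As an illustration of the pattern described above, here is a minimal sketch that contrasts adding columns in a loop with adding them in one `select()`. The helper names and the `new_columns` dict are hypothetical and not taken from the original code. The Spark documentation for `withColumn()` notes that calling it repeatedly in a loop can produce very large plans and even a StackOverflowException, so the loop form may contribute to the error shown here.

```python
def add_columns_iteratively(df, new_columns):
    """Add columns one withColumn() call at a time (the pattern described above).

    new_columns is a hypothetical dict mapping column name -> Column expression.
    """
    for name, expr in new_columns.items():
        # Each call returns a new DataFrame whose logical plan wraps the
        # previous one in one more projection.
        df = df.withColumn(name, expr)
    return df


def add_columns_at_once(df, new_columns):
    """Add the same columns with a single select(), i.e. a single projection."""
    aliased = [expr.alias(name) for name, expr in new_columns.items()]
    return df.select("*", *aliased)
```

Both helpers only rely on `withColumn()`, `select()` and `Column.alias()`, so with a real DataFrame they should yield the same columns. The difference is how deeply the resulting plan is nested.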

[Comments]:

    Tags: apache-spark pyspark


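    [Solution 1]:

    The mistake I made was that, when checkpointing the DataFrame, I only did this, which discards the DataFrame that checkpoint() returns: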

    mySparkDataFrame.checkpoint(eager=True)
    

    The correct way is to assign the result back, because checkpoint() returns a new DataFrame:

    mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
    

    This is based on another question I asked here (and got an answer to):

    pyspark rdd isCheckPointed() is false

    Also, it is recommended to persist() the DataFrame before checkpointing and to count() it after checkpointing.
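The whole sequence (persist, checkpoint with reassignment, then count) can be sketched as one small helper. The function name `checkpoint_and_materialize` is hypothetical, and the sketch assumes `sc.setCheckpointDir(...)` has already been called as in the question.

```python
def checkpoint_and_materialize(df, eager=True):
    """Persist, checkpoint, and count a DataFrame; return the checkpointed one.

    Assumes a checkpoint directory was already set via sc.setCheckpointDir().
    """
    # Cache first so writing the checkpoint does not recompute the full lineage.
    cached = df.persist()
    # checkpoint() returns a NEW DataFrame; the original is left unchanged.
    checkpointed = cached.checkpoint(eager=eager)
    # Run an action on the checkpointed DataFrame to force materialization.
    checkpointed.count()
    return checkpointed
```

It would be used with the same reassignment as above, for example `mySparkDataFrame = checkpoint_and_materialize(mySparkDataFrame)`, before passing the DataFrame to `pipeLineLr.fit(...)`.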

    [Discussion]:
