[Question title]: pyspark MlLib: exclude a column value in a row
[Posted]: 2017-06-21 19:54:14
[Question description]:

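I am trying to create an RDD of LabeledPoint objects from a DataFrame so that I can use it with MLlib later.

The code below works fine if the my_target column is the first column in sparkDF. If my_target is not the first column, however, how should I change the code so that my_target is excluded and the LabeledPoint is built correctly?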

import pyspark.mllib.classification as clf
labeledData = sparkDF.rdd.map(lambda row: clf.LabeledPoint(row['my_target'],row[1:]))

logRegr = clf.LogisticRegressionWithSGD.train(labeledData)

row[1:] currently excludes the value in the first column; how do I exclude the value in column N of the row instead? Thanks!

[Question comments]:

    Tags: pyspark spark-dataframe rdd apache-spark-mllib


    [Solution 1]:
    >>> a = [(1,21,31,41),(2,22,32,42),(3,23,33,43),(4,24,34,44),(5,25,35,45)]
    >>> df = spark.createDataFrame(a,["foo","bar","baz","bat"])
    >>> df.show()
    +---+---+---+---+
    |foo|bar|baz|bat|
    +---+---+---+---+
    |  1| 21| 31| 41|
    |  2| 22| 32| 42|
    |  3| 23| 33| 43|
    |  4| 24| 34| 44|
    |  5| 25| 35| 45|
    +---+---+---+---+
    
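    >>> from pyspark.mllib.regression import LabeledPoint
    >>> N = 2
    # N is the index of the column to exclude (here the third column, since indexing starts at 0)
    >>> labeledData = df.rdd.map(lambda row: LabeledPoint(row['foo'],row[:N]+row[N+1:]))
    # row[:N] + row[N+1:] concatenates the values before and after position N, so column N is left out
    # the label is still taken from 'foo' here, so 'foo' also remains among the features; to use column N as the target, pass row[N] as the label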
    
    >>> labeledData.collect()
    [LabeledPoint(1.0, [1.0,21.0,41.0]), LabeledPoint(2.0, [2.0,22.0,42.0]), LabeledPoint(3.0, [3.0,23.0,43.0]), LabeledPoint(4.0, [4.0,24.0,44.0]), LabeledPoint(5.0, [5.0,25.0,45.0])]
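
The slicing idea above can also be driven by the column name rather than a hard-coded N. The following is only a minimal sketch in plain Python (no Spark needed to run it): `split_label_features` is a hypothetical helper, not part of PySpark, and the sample rows simply reuse the values from the example DataFrame.

```python
def split_label_features(row, target_index):
    """Return (label, features) for a tuple-like row, leaving the target out of the features.

    Assumes target_index is a non-negative position, e.g. from list.index().
    """
    values = tuple(row)
    label = float(values[target_index])
    # Concatenate everything before and after the target position.
    features = [float(v) for v in values[:target_index] + values[target_index + 1:]]
    return label, features


columns = ["foo", "bar", "baz", "bat"]   # column order of the example DataFrame
target_index = columns.index("baz")      # look the position up by name instead of hard-coding N

rows = [(1, 21, 31, 41), (2, 22, 32, 42)]
pairs = [split_label_features(r, target_index) for r in rows]
print(pairs)  # [(31.0, [1.0, 21.0, 41.0]), (32.0, [2.0, 22.0, 42.0])]
```

With a real DataFrame, the same helper could presumably be used as `target_index = df.columns.index('my_target')` followed by `df.rdd.map(lambda row: LabeledPoint(*split_label_features(row, target_index)))`, which takes the label from the target column and keeps every other column as a feature.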
    

    [Comments]:
