【Question title】: Convert string list to binary list in pyspark
【Posted】: 2019-10-09 21:31:09
【Question】:

I have a DataFrame like this:

data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])), 
    (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)

+---+----------------------------+
|ID |MonthList                   |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May]         |
|ID3|[October, June]             |
+---+----------------------------+

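I want to compare each row against a default list and, for every entry of the default list, assign 1 if the value is present in the row and 0 otherwise.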

default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

So my expected output looks like this:

+---+----------------------------+------------------+
|ID |MonthList                   |Binary_MonthList  |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May]         |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June]             |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+

I can do this in plain Python, but I don't know how to do it in PySpark.

【Comments】:

    Tags: apache-spark pyspark apache-spark-sql pyspark-dataframes


    【Solution 1】:

    You can try a udf like this:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, IntegerType
    
    default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
    
    def_month_list_func = udf(lambda x: [1 if i in x else 0 for i in default_month_list], ArrayType(IntegerType()))
    
    df = df.withColumn("Binary_MonthList", def_month_list_func(col("MonthList")))
    
    df.show()
    # output
    +---+--------------------+------------------+
    | ID|           MonthList|  Binary_MonthList|
    +---+--------------------+------------------+
    |ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
    |ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
    |ID3|     [October, June]|[1, 0, 0, 0, 1, 0]|
    +---+--------------------+------------------+
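
A possible variation on the udf above, as a minimal sketch: the lambda is replaced by a named function so the logic can be checked without a Spark session, and a null `MonthList` is treated as an empty list. The names `to_binary_list` and `make_binary_udf` are hypothetical and not part of the original answer.

```python
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def to_binary_list(months, reference=default_month_list):
    """Return a 1/0 flag per reference entry; a None input counts as empty."""
    present = set(months or [])
    return [1 if m in present else 0 for m in reference]

def make_binary_udf():
    """Wrap to_binary_list as a Spark udf (assumes pyspark is installed)."""
    # Imported lazily so the pure function above works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType
    return udf(to_binary_list, ArrayType(IntegerType()))
```

It would be applied the same way as before, e.g. `df.withColumn("Binary_MonthList", make_binary_udf()(col("MonthList")))`.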
    

    【Comments】:

      【Solution 2】:

      How about array_contains():

      from pyspark.sql.functions import array, array_contains

      df.withColumn(
          'Binary_MonthList',
          array([array_contains('MonthList', c).astype('int') for c in default_month_list])
      ).show()
      +---+--------------------+------------------+
      | ID|           MonthList|  Binary_MonthList|
      +---+--------------------+------------------+
      |ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
      |ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
      |ID3|     [October, June]|[1, 0, 0, 0, 1, 0]|
      +---+--------------------+------------------+
      

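      【Comments】:

        【Solution 3】:

        pissall's answer is perfectly fine. I am just posting a more general solution that needs no udf and does not require you to know the possible values in advance.

        CountVectorizer does exactly what you want. The algorithm adds every distinct value to its vocabulary, as long as the value meets certain criteria (e.g. a minimum or maximum number of occurrences). You can apply the fitted model to a DataFrame, and it returns a sparse vector column (which can be converted to a dense vector column) that encodes, one-hot style, which vocabulary items are present in the given input column.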
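
To make the behaviour described above concrete, here is a plain-Python sketch of the idea, not Spark's implementation. The function names are hypothetical, and the vocabulary ordering (most rows first, ties alphabetical) is a choice made for this sketch; Spark may order its vocabulary differently.

```python
from collections import Counter

def fit_vocabulary(rows, vocab_size=12, min_df=1):
    """Collect distinct terms that occur in at least `min_df` rows."""
    doc_freq = Counter()
    for terms in rows:
        doc_freq.update(set(terms))  # count each term once per row
    kept = [t for t, n in doc_freq.items() if n >= min_df]
    # Ordering is an assumption of this sketch, not Spark's guarantee.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:vocab_size]

def transform_binary(terms, vocabulary):
    """Return (size, indices, values), mimicking a binary sparse vector."""
    index = {t: i for i, t in enumerate(vocabulary)}
    indices = sorted({index[t] for t in terms if t in index})
    return (len(vocabulary), indices, [1.0] * len(indices))

rows = [
    ["October", "September", "August"],
    ["August", "June", "May", "August"],
    ["October", "June"],
]
vocabulary = fit_vocabulary(rows)
encoded = [transform_binary(r, vocabulary) for r in rows]
```

Note that the repeated 'August' in the second row still yields a single 1.0, which is what `binary=True` is for.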

        from pyspark.ml.feature import CountVectorizer
        
        data = [(("ID1", ['October', 'September', 'August']))
                , (("ID2", ['August', 'June', 'May', 'August']))
                , (("ID3", ['October', 'June']))]
        df = spark.createDataFrame(data, ["ID", "MonthList"])
        
        df.show(truncate=False)
        
        # binary=True only checks whether a vocabulary item is present, not how often it occurs
        # vocabSize defines the maximum size of the vocabulary
        # minDF=1.0 defines in how many rows a value has to be present to be added to the vocabulary (1.0 means one row is enough)
        cv = CountVectorizer(inputCol="MonthList", outputCol="Binary_MonthList", vocabSize=12, minDF=1.0, binary=True)
        
        cvModel = cv.fit(df)
        
        df = cvModel.transform(df)
        
        df.show(truncate=False)
        
        cvModel.vocabulary
        

        Output:

        +---+----------------------------+
        |ID |MonthList                   |
        +---+----------------------------+
        |ID1|[October, September, August]|
        |ID2|[August, June, May, August] |
        |ID3|[October, June]             |
        +---+----------------------------+

        +---+----------------------------+-------------------------+
        |ID |MonthList                   |Binary_MonthList         |
        +---+----------------------------+-------------------------+
        |ID1|[October, September, August]|(5,[1,2,3],[1.0,1.0,1.0])|
        |ID2|[August, June, May, August] |(5,[0,1,4],[1.0,1.0,1.0])|
        |ID3|[October, June]             |(5,[0,2],[1.0,1.0])      |
        +---+----------------------------+-------------------------+
        
        ['June', 'August', 'October', 'September', 'May']
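
The sparse vectors above are indexed by the fitted vocabulary, not by `default_month_list`, and months that never occur (here 'July') get no slot at all. As a rough sketch, assuming you still want the list layout from the question, the indices can be mapped back through the vocabulary. The helper name `to_default_order` is hypothetical; in Spark, the indices would come from the `indices` attribute of each SparseVector.

```python
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def to_default_order(indices, vocabulary, default_list):
    """Map sparse-vector indices back to 1/0 flags in default_list order."""
    present = {vocabulary[i] for i in indices}
    return [1 if month in present else 0 for month in default_list]

# Vocabulary and indices copied from the output shown above.
vocabulary = ['June', 'August', 'October', 'September', 'May']
binary_lists = [
    to_default_order(idx, vocabulary, default_month_list)
    for idx in ([1, 2, 3], [0, 1, 4], [0, 2])
]
```

If the set of possible values is known after all, recent PySpark versions also provide `CountVectorizerModel.from_vocabulary`, which should let you fix the index order up front instead of remapping afterwards.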
        

        【Comments】:
