【Question title】: PySpark VectorAssembler on an NGram/Tokenizer-transformed DataFrame
【Posted】: 2018-02-18 01:49:36
【Question】:

Suppose I have a DataFrame with the fields ['did','doc'], such as

data = sc.parallelize(['This is a test',
                   'This is also a test',
                   'These sentence are tests',
                   'This tests these sentences'])\
         .zipWithIndex()\
         .map(lambda x: (x[1],x[0]))\
         .toDF(['did','doc'])
data.show()
+---+--------------------+
|did|                 doc|
+---+--------------------+
|  0|      This is a test|
|  1| This is also a test|
|  2|These sentence ar...|
|  3|This tests these ...|
+---+--------------------+

I apply some transformations to the documents, such as tokenizing and finding 2-grams:

from pyspark.ml.feature import Tokenizer, NGram, VectorAssembler
data = Tokenizer(inputCol='doc', outputCol='words').transform(data)
data = NGram(n=2, inputCol='words', outputCol='grams').transform(data)
data.show()
+---+--------------------+--------------------+--------------------+
|did|                 doc|               words|               grams|
+---+--------------------+--------------------+--------------------+
|  0|      This is a test| [this, is, a, test]|[this is, is a, a...|
|  1| This is also a test|[this, is, also, ...|[this is, is also...|
|  2|These sentence ar...|[these, sentence,...|[these sentence, ...|
|  3|This tests these ...|[this, tests, the...|[this tests, test...|
+---+--------------------+--------------------+--------------------+

Finally, I want to combine the 2-grams and the words into a single features column with VectorAssembler:

data = VectorAssembler(inputCols=['words','grams'],
                       outputCol='features').transform(data)

I then get the following error:

Py4JJavaError: An error occurred while calling o504.transform.
: java.lang.IllegalArgumentException: Data type ArrayType(StringType,true) is not supported.

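This happens because VectorAssembler does not accept columns holding lists of strings. To get around it, I can drop the DataFrame down to an RDD, map the RDD to the appropriate rows, and convert it back to a DataFrame, à la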

from pyspark.sql import Row
data = data.rdd.map(lambda x: Row(did=x['did'],
           features=x['words'] + x['grams'])).toDF(['did','features'])

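That is not a problem for this small dataset, but it is prohibitively expensive for a large one.

Is there a more efficient way to achieve this than the approach above?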

【Comments】:

    Tags: dataframe pyspark


    【Solution 1】:

    You can create the features column with a udf, like this:

    import pyspark.sql.functions as f
    import pyspark.sql.types as t
    
    
    udf_add = f.udf(lambda x,y: x+y, t.ArrayType(t.StringType()))
    data.withColumn('features', udf_add('words','grams')).select('features').collect()
    
    [Row(features=['this', 'is', 'a', 'test', 'this is', 'is a', 'a test']),
    Row(features=['this', 'is', 'also', 'a', 'test', 'this is', 'is also', 'also a', 'a test']),
    Row(features=['these', 'sentence', 'are', 'tests', 'these sentence', 'sentence are', 'are tests']),
    Row(features=['this', 'tests', 'these', 'sentences', 'this tests', 'tests these', 'these sentences'])]
    

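    【Discussion】:

    • This will do it, but udfs are slow too. Is it actually faster than dragging the data from the DataFrame to an RDD, mapping it, and converting it back to a DataFrame?
    • I have not checked the speed of this solution. As a general rule, DataFrame operations should be much faster than RDD operations.
    • Could I ask you to take a look at a similar question?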