【Question title】: How do I leverage Spark's pipelines to find phrases in strings then add feature category?
【Posted】: 2021-04-16 20:56:47
【Question】:

I want to search for phrases in a text column of a PySpark DataFrame. Here is an example to show what I mean.

sentenceData = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(4, "I wish Java could use case classes"),
(11, "Logistic regression models are neat")], 
["id", "sentence"])

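If a sentence contains "heard about spark", then categorySpark=1 and categoryHeard=1.

If a sentence contains "java OR regression", then categoryCool=1.

I have about 28 booleans to check (or it may be better to use regular expressions).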

sentenceData.withColumn('categoryCool',sentenceData['sentence'].rlike('Java | regression')).show()

Returns:

+---+--------------------+------------+
| id|            sentence|categoryCool|
+---+--------------------+------------+
|  0|Hi I heard about ...|       false|
|  4|I wish Java could...|        true|
| 11|Logistic regressi...|        true|
+---+--------------------+------------+

This is what I want, but I would like to add it to a pipeline as a transformation step.
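A note on the pattern used above: the spaces around `|` are literal, and the match is case-sensitive, so 'Java | regression' looks for "Java " (with a trailing space) or " regression" (with a leading space). The sketch below uses Python's `re` module as a stand-in for Spark's Java regex engine, which is an assumption that should hold for patterns this simple; `rlike` performs a partial match, much like `re.search`. The last two sentences are hypothetical rows added for illustration, and the relaxed pattern is only one possible alternative.

```python
import re

sentences = [
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic regression models are neat",
    "I write java.",           # hypothetical row: lowercase, followed by a period
    "Regression is neat",      # hypothetical row: capitalized, at start of string
]

# The pattern from the question: literal spaces, case-sensitive.
original = re.compile(r"Java | regression")

# One possible relaxed form: case-insensitive, whole words, no literal spaces.
relaxed = re.compile(r"(?i)\b(?:java|regression)\b")

def flags(pattern, rows):
    """Return one boolean per row, True when the pattern is found anywhere."""
    return [pattern.search(row) is not None for row in rows]

original_hits = flags(original, sentences)
relaxed_hits = flags(relaxed, sentences)

for row, a, b in zip(sentences, original_hits, relaxed_hits):
    print(f"{row!r}: original={a}, relaxed={b}")
```

Both `(?i)` and `\b` are also supported by Java regular expressions, so the relaxed pattern could be passed to `rlike` unchanged.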

【Comments】:

    Tags: apache-spark pyspark nlp feature-extraction


    【Solution 1】:

    I found this nice Medium article and this S.O. answer, and combined them to answer my own question! I hope someone finds this helpful someday.

        from pyspark.ml import Pipeline, Transformer
        from pyspark.sql.types import StringType

        sentenceData = spark.createDataFrame([
            (0, "Hi I heard about Spark"),
            (4, "I wish Java could use case classes"),
            (11, "Logistic regression models are neat")
        ], ["id", "sentence"])

        class OneSearchMultiLabelExtractor(Transformer):
            """Run one rlike search on inputCol and write the result to every column in outputCols."""
            def __init__(self, rlikeSearch, outputCols, inputCol='fullText'):
                # The base initializer assigns self.uid and sets up the Params state
                # that the inherited copy() relies on, so no copy() override is needed.
                super().__init__()
                self.inputCol = inputCol
                self.outputCols = outputCols
                self.rlikeSearch = rlikeSearch
            def check_input_type(self, schema):
                # The search only makes sense on a string column.
                field = schema[self.inputCol]
                if field.dataType != StringType():
                    raise Exception('OneSearchMultiLabelExtractor input type %s did not match input type StringType' % field.dataType)
            def check_output_type(self):
                if not isinstance(self.outputCols, list):
                    raise Exception('OneSearchMultiLabelExtractor output columns must be a list')
            def _transform(self, df):
                self.check_input_type(df.schema)
                self.check_output_type()
                # Build the match expression once and reuse it for every label column.
                searchResult = df[self.inputCol].rlike(self.rlikeSearch)
                for outputCol in self.outputCols:
                    df = df.withColumn(outputCol, searchResult)
                return df

        # Use the class defined above (the original snippet referred to an undefined CoolExtractor).
        dex = OneSearchMultiLabelExtractor(inputCol='sentence', rlikeSearch='Java | regression', outputCols=['coolCategory'])
        FeaturesPipeline = Pipeline(stages=[dex])
        Featpip = FeaturesPipeline.fit(sentenceData)
        Featpip.transform(sentenceData).show()
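For the roughly 28 checks mentioned in the question, one option is to keep the searches in a table and create one extractor stage per row. The following is a minimal sketch under stated assumptions, not a tested Spark job: `SEARCHES`, `stage_kwargs`, and `build_pipeline` are hypothetical names introduced here, and the regexes are illustrative guesses at the question's categories. The helpers are plain Python, so the Spark classes are passed in as arguments and the Spark call itself appears only as a comment.

```python
# Hypothetical table of searches: (regex, columns to set when the regex matches).
SEARCHES = [
    (r"heard about spark", ["categorySpark", "categoryHeard"]),
    (r"java|regression", ["categoryCool"]),
]

def stage_kwargs(searches, input_col="sentence", ignore_case=True):
    """Turn each (regex, columns) pair into keyword arguments for one extractor stage."""
    prefix = "(?i)" if ignore_case else ""
    return [
        {
            "inputCol": input_col,
            # Wrap in a non-capturing group so the flag prefix applies to every alternative.
            "rlikeSearch": prefix + "(?:" + regex + ")",
            "outputCols": list(columns),
        }
        for regex, columns in searches
    ]

def build_pipeline(searches, extractor_cls, pipeline_cls, input_col="sentence"):
    """Create one extractor stage per search and wrap the stages in a pipeline."""
    stages = [extractor_cls(**kwargs) for kwargs in stage_kwargs(searches, input_col)]
    return pipeline_cls(stages=stages)

# With Spark and the class above available, the call would look like:
# pipeline = build_pipeline(SEARCHES, OneSearchMultiLabelExtractor, Pipeline)
# pipeline.fit(sentenceData).transform(sentenceData).show()
```

Keeping the table separate from the transformer means a new category only needs a new row, and the same transformer class serves every search.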
    

    【Comments】:
