【Question title】: Function that operates on Spark column as argument
【Posted】: 2016-11-19 11:25:40
【Question】:

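EDIT: I finally figured it out myself. I kept calling select() on the column inside the function, which is why it didn't work. I've added my solution as comments in the original question in case it is useful to someone else.

I'm taking an online course in which I'm supposed to write the following function: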

# TODO: Replace <FILL IN> with appropriate code

# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task

from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """

    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')

    return <FILL IN>

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' *      Remove punctuation then spaces  * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))

I have already written code that produces the desired output when operating on the DataFrame itself:

# Lower case (named lowerDF so the imported lower() function is not shadowed)
lowerDF = sentenceDF.select(lower(col('sentence')).alias('lower'))
lowerDF.show()

# Remove Punctuation
cleaned = lowerDF.select(regexp_replace(col('lower'), r'([^a-z\d\s])+', r'').alias('cleaned'))
cleaned.show()

# Trim
sentenceDF = cleaned.select(trim(col('cleaned')).alias('sentence'))
sentenceDF.show(truncate=False)

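I just don't know how to implement this code in my function, since it cannot operate on the DataFrame but only on the given column. I tried different approaches. One was to create a new DataFrame from the column input, using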

[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]

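inside the function, but that doesn't work: TypeError: Column is not iterable. Other approaches, which tried to operate directly on the column inside the function, always resulted in TypeError: 'Column' object is not callable.

I started using (Py)Spark a few days ago and still have conceptual problems with how to work with only rows and columns. I'd really appreciate any kind of help with the current problem.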

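【Comments on the question】:

This is an assignment from edX CS105. You can check the discussion on Piazza.
And the regex should be r'([^a-zA-Z\d\s])+'
@offwhitelotus Actually no: I call lower() on the column before applying the regex, so A-Z isn't needed.
@turingcomplete Ah yes, you're right.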
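
The point settled in the comments above can be checked without Spark. The following is a minimal sketch that uses Python's re module as a stand-in for regexp_replace, on the assumption that both regex engines treat these simple patterns the same way. The helper clean and the names NARROW and WIDE are hypothetical and only serve the illustration.

```python
import re

NARROW = r'([^a-z\d\s])+'     # pattern used in the question
WIDE = r'([^a-zA-Z\d\s])+'    # pattern suggested in the comment

def clean(text, pattern, lowercase_first=True):
    """Plain-string stand-in for lower -> regexp_replace -> trim."""
    if lowercase_first:
        text = text.lower()
    return re.sub(pattern, '', text).strip()

samples = ['Hi, you!',
           ' No under_score!',
           ' *      Remove punctuation then spaces  * ']

for s in samples:
    # Once the text is lowercased, A-Z can no longer occur,
    # so both patterns remove exactly the same characters.
    assert clean(s, NARROW) == clean(s, WIDE)

print(clean('Hi, you!', NARROW))                         # hi you
print(clean('Hi, you!', NARROW, lowercase_first=False))  # i you
```

The last line shows why the order matters: if the narrow pattern ran before lowercasing, the capital H would be stripped along with the punctuation.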

Tags: python apache-spark pyspark

【Solution 1】:

You can do this in one line.

return re.sub(r'[^a-z0-9\s]','',text.lower().strip()).strip()
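
Note that this one-liner calls re.sub and str methods, so it applies to a plain Python string named text, not to a Spark Column. Using it on a DataFrame would need a UDF or an RDD map, which the exercise rules out. The Column-based version is the one in the EDIT comments of the question. As a sketch of what the one-liner does on plain strings, wrapped in a hypothetical helper remove_punctuation_str:

```python
import re

def remove_punctuation_str(text):
    """Hypothetical helper: the answer's one-liner applied to a Python str."""
    # First strip() trims the raw input; the second trims spaces that are
    # exposed once leading or trailing punctuation has been removed.
    return re.sub(r'[^a-z0-9\s]', '', text.lower().strip()).strip()

for s in ['Hi, you!',
          ' No under_score!',
          ' *      Remove punctuation then spaces  * ']:
    print(repr(remove_punctuation_str(s)))
# 'hi you'
# 'no underscore'
# 'remove punctuation then spaces'
```

The third sample shows why the second strip() is needed: after the asterisks are removed, the spaces that followed and preceded them are left at the ends of the string.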

【Comments】: