# string methods TypeError: Column is not iterable in pyspark

【Question title】: # string methods TypeError: Column is not iterable in pyspark
【Posted】: 2020-07-04 22:22:58
【Question description】:

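I am trying to reimplement a sentiment analysis written in Python in PySpark, because I am working with big data. I am new to PySpark syntax, and I get an error when I try to apply a lemmatization function from the nltk package.

Error: # string methods TypeError: Column is not iterable. The code and data are below.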

+-------+--------------------+--------------------+--------------------+--------------------+
|overall|       reviewsummary|     cleanreviewText|         reviewText1|  filteredreviewText|
+-------+--------------------+--------------------+--------------------+--------------------+
|    5.0|exactly what i ne...|exactly what i ne...|[exactly, what, i...|[exactly, needed,...|
|    2.0|i agree with the ...|i agree with the ...|[i, agree, with, ...|[agree, review, o...|
|    4.0|love these... i a...|love these... i a...|[love, these, i, ...|[love, going, ord...|
|    2.0|too tiny an openi...|too tiny an openi...|[too, tiny, an, o...|[tiny, opening, t...|
|    3.0|    okay three stars|    okay three stars|[okay, three, stars]|[okay, three, stars]|
|    5.0|exactly what i wa...|exactly what i wa...|[exactly, what, i...|[exactly, wanted,...|
|    4.0|these little plas...|these little plas...|[these, little, p...|[little, plastic,...|
|    3.0|mother - in - law...|mother - in - law...|[mother, in, law,...|[mother, law, wan...|
|    3.0|item is of good q...|item is of good q...|[item, is, of, go...|[item, good, qual...|
|    3.0|i had used my las...|i had used my las...|[i, had, used, my...|[used, last, el, ...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 10 rows


In [18]: dfStopwordRemoved.printSchema()
root
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- reviewsummary: string (nullable = true)
 |-- cleanreviewText: string (nullable = true)
 |-- reviewText1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filteredreviewText: array (nullable = true)
 |    |-- element: string (containsNull = true)

Lemmatization functions

from collections import Counter

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_part_of_speech(word):
    # Count the WordNet synsets of the word per part of speech
    probable_part_of_speech = wordnet.synsets(word)

    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])

    # Return the tag with the highest count
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

def Lemmatizing_Words(Words):
    # Lemmatize every word using its most likely part of speech
    Lm = WordNetLemmatizer()
    Lemmatized_Words = []
    for word in Words:
        Lemmatized_Words.append(Lm.lemmatize(word, get_part_of_speech(word)))
    return Lemmatized_Words

(Function call)

x2 = list()
for word in dfStopwordRemoved.select('filteredreviewText'):
    x_temp = Lemmatizing_Words(word)
    x2.append(x_temp)

Please see the following link for the error: Error

【Question comments】:

Tags: python pyspark nltk apache-spark-ml lemmatization

【Solution 1】:

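The error message is accurate: you cannot iterate over a DataFrame column the way you would over a standard Python iterator. To apply standard functions such as sum or mean, we need to use withColumn() or select(). In your case you have your own custom function, so you need to register it as a udf and use it with withColumn() or select().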

Below is the udf example from the Spark documentation - https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.functions.udf

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> slen = udf(lambda s: len(s), IntegerType())
>>> @udf
... def to_upper(s):
...     if s is not None:
...         return s.upper()
...
>>> @udf(returnType=IntegerType())
... def add_one(x):
...     if x is not None:
...         return x + 1
...
>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")).show()
+----------+--------------+------------+
|slen(name)|to_upper(name)|add_one(age)|
+----------+--------------+------------+
|         8|      JOHN DOE|          22|
+----------+--------------+------------+
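
Applied to the question's data, the same pattern might look like the sketch below. It is not taken from the answer: `null_safe`, `add_lemma_column`, and the output column name `lemmatizedText` are hypothetical names introduced here, and the sketch assumes the function being wrapped takes a list of strings and returns a list of strings, as `Lemmatizing_Words` does.

```python
def null_safe(fn):
    """Wrap fn so that a null (None) array cell is passed through untouched."""
    def wrapper(words):
        return None if words is None else fn(words)
    return wrapper


def add_lemma_column(df, lemmatize_words,
                     src="filteredreviewText", dst="lemmatizedText"):
    """Return df with an extra array<string> column dst = lemmatize_words(src).

    The function name and the default dst column name are hypothetical;
    src defaults to the array column shown in the question's schema.
    """
    # Imported here so the pure-Python helper above can be exercised
    # without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # The return type has to be declared explicitly, because udf()
    # defaults to StringType rather than an array of strings.
    lemma_udf = udf(null_safe(lemmatize_words), ArrayType(StringType()))

    # Spark calls the UDF once per row, so the Column itself is never iterated.
    return df.withColumn(dst, lemma_udf(df[src]))
```

With the question's names, a call would presumably be `dfLemmatized = add_lemma_column(dfStopwordRemoved, Lemmatizing_Words)`. This assumes nltk and its WordNet data are available on every worker node, since the UDF body runs on the executors rather than on the driver.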

【Comments】: