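[Posted]: 2020-07-04 22:22:58
[Question]:
I am trying to reimplement a sentiment analysis, originally written in plain Python, in PySpark because I am working with big data. I am new to PySpark syntax, and I get an error when I try to apply a lemmatization function built on the nltk package.
Error: TypeError: Column is not iterable. The code and data are below.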
+-------+--------------------+--------------------+--------------------+--------------------+
|overall|       reviewsummary|     cleanreviewText|         reviewText1|  filteredreviewText|
+-------+--------------------+--------------------+--------------------+--------------------+
|    5.0|exactly what i ne...|exactly what i ne...|[exactly, what, i...|[exactly, needed,...|
|    2.0|i agree with the ...|i agree with the ...|[i, agree, with, ...|[agree, review, o...|
|    4.0|love these... i a...|love these... i a...|[love, these, i, ...|[love, going, ord...|
|    2.0|too tiny an openi...|too tiny an openi...|[too, tiny, an, o...|[tiny, opening, t...|
|    3.0|    okay three stars|    okay three stars|[okay, three, stars]|[okay, three, stars]|
|    5.0|exactly what i wa...|exactly what i wa...|[exactly, what, i...|[exactly, wanted,...|
|    4.0|these little plas...|these little plas...|[these, little, p...|[little, plastic,...|
|    3.0|mother - in - law...|mother - in - law...|[mother, in, law,...|[mother, law, wan...|
|    3.0|item is of good q...|item is of good q...|[item, is, of, go...|[item, good, qual...|
|    3.0|i had used my las...|i had used my las...|[i, had, used, my...|[used, last, el, ...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 10 rows
In [18]: dfStopwordRemoved.printSchema()
root
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- reviewsummary: string (nullable = true)
 |-- cleanreviewText: string (nullable = true)
 |-- reviewText1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filteredreviewText: array (nullable = true)
 |    |-- element: string (containsNull = true)
Lemmatization functions
from collections import Counter

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_part_of_speech(word):
    # Count the word's WordNet synsets per part of speech and return the most common tag.
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

def Lemmatizing_Words(Words):
    # Lemmatize every word of a Python list using its most likely part of speech.
    Lm = WordNetLemmatizer()
    Lemmatized_Words = []
    for word in Words:
        Lemmatized_Words.append(Lm.lemmatize(word, get_part_of_speech(word)))
    return Lemmatized_Words
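The part-of-speech vote inside get_part_of_speech can be tried out without WordNet. The following is a minimal sketch, assuming the tags have already been collected into a plain list; most_likely_pos is a hypothetical helper name and is not part of the code above. It also shows what the vote does when a word has no synsets at all.

```python
from collections import Counter

def most_likely_pos(pos_tags):
    """Return the most frequent tag among 'n', 'v', 'a', 'r' (hypothetical helper)."""
    # Pre-seed the counter so that the four tags exist even with a count of zero,
    # mirroring the four explicit assignments in get_part_of_speech.
    pos_counts = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
    for tag in pos_tags:
        if tag in pos_counts:  # other tags, such as WordNet's 's', are ignored
            pos_counts[tag] += 1
    return pos_counts.most_common(1)[0][0]

print(most_likely_pos(["v", "v", "n"]))  # more verb senses than noun senses
print(most_likely_pos([]))               # no senses at all: falls back to "n"
```

Because ties are resolved in insertion order, a word with no synsets comes out as a noun, which matches the default part of speech of WordNetLemmatizer.lemmatize.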
(Function call)
x2 = list()
for word in dfStopwordRemoved.select('filteredreviewText'):
    x_temp = Lemmatizing_Words(word)
    x2.append(x_temp)
Please see the following link for the error: Error
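For context, the loop above most likely hands Column objects to Lemmatizing_Words instead of Python lists, and a Column cannot be iterated on the driver. One common way around this is to run the per-row logic inside a UDF. The following is a minimal sketch, not a definitive fix: lemmatize_row, add_lemmatized_column, and the output column name lemmatizedText are hypothetical names, and it assumes that nltk and its WordNet data are installed on every worker node.

```python
def lemmatize_row(tokens, lemmatize_word, pos_of=None):
    """Lemmatize one row's token list with plain Python (hypothetical helper).

    lemmatize_word is a callable such as WordNetLemmatizer().lemmatize;
    pos_of is an optional callable mapping a word to a part-of-speech tag.
    """
    if tokens is None:  # the schema marks the array column as nullable
        return None
    if pos_of is None:
        return [lemmatize_word(token) for token in tokens]
    return [lemmatize_word(token, pos_of(token)) for token in tokens]


def add_lemmatized_column(df, pos_of=None,
                          input_col="filteredreviewText",
                          output_col="lemmatizedText"):
    """Return df with an extra array<string> column of lemmas (sketch)."""
    # Imported here so that lemmatize_row stays usable without Spark or NLTK.
    from nltk.stem import WordNetLemmatizer
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    def _lemmatize(tokens):
        # Created inside the UDF so the object is built on the executor.
        lemmatizer = WordNetLemmatizer()
        return lemmatize_row(tokens, lemmatizer.lemmatize, pos_of)

    lemmatize_udf = udf(_lemmatize, ArrayType(StringType()))
    return df.withColumn(output_col, lemmatize_udf(col(input_col)))
```

Under these assumptions, a call such as add_lemmatized_column(dfStopwordRemoved, pos_of=get_part_of_speech) would keep the work distributed instead of collecting rows into the list x2. Python UDFs are slower than built-in column functions, so this trades speed for the ability to reuse the nltk code.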
[Discussion]:
Tags: python pyspark nltk apache-spark-ml lemmatization