Python, Pandas, and NLTK TypeError: 'int' object is not callable when calling a Series

【Question title】: Python, Pandas, and NLTK Type Error 'int' object is not callable when calling a series
【Posted】: 2020-05-22 01:16:29
【Question description】:

I am trying to get the word frequencies of the terms in each tweet contained in a dataframe. Here is my code:

import pandas as pd
import numpy as np
import nltk
import string
import collections
from collections import Counter
nltk.download('stopwords')
sw= set(nltk.corpus.stopwords.words ('english'))
punctuation = set (string.punctuation)
data= pd.read_csv('~/Desktop/tweets.csv.zip', compression='zip')

print (data.columns)
print(data.text)
data['text'] = [str.lower () for str in data.text if str.lower () not in sw and str.lower () not in punctuation] 
print(data.text)
data["text"] = data["text"].str.split()
data['text'] = data['text'].apply(lambda x: [item for item in x if item not in sw])
print(data.text)
data['text'] = data.text.astype(str)
print(type(data.text))
tweets=data.text

data['words']= tweets.apply(nltk.FreqDist(tweets))
print(data.words)

Here is my error and traceback:

Name: text, Length: 14640, dtype: object
Traceback (most recent call last):

File "", line 1, in runfile('C:/Users/leska/.spyder-py3/untitled1.py', wdir='C:/Users/leska/.spyder-py3')

File "C:\Users\leska\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile execfile(filename, namespace)

File "C:\Users\leska\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/leska/.spyder-py3/untitled1.py", line 30, in data['words']= tweets.apply(nltk.FreqDist(tweets))

File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\series.py", line 4018, in apply return self.aggregate(func, *args, **kwds)

File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\series.py", line 3883, in aggregate result, how = self._aggregate(func, *args, **kwargs)

File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\base.py", line 506, in _aggregate result = _agg(arg, _agg_1dim)

File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\base.py", line 456, in _agg result[fname] = func(fname, agg_how)

File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\base.py", line 440, in _agg_1dim return colg.aggregate(how, _level=(_level or 0) + 1)

File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\series.py", line 3902, in aggregate result = func(self, *args, **kwargs)

TypeError: 'int' object is not callable

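I have verified that data.text is a pandas Series.

I previously tried a solution that I thought worked: it tokenizes the tweets and builds a single word list to get the counts, but that produced one frequency distribution over all the tweets rather than one per tweet. Here is the code I tried, based on my earlier question: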

import pandas as pd
import numpy as np
import nltk
import string
import collections
from collections import Counter
nltk.download('stopwords')
sw= set(nltk.corpus.stopwords.words ('english'))
punctuation = set (string.punctuation)
data= pd.read_csv('~/Desktop/tweets.csv.zip', compression='zip')

print (data.columns)
print (len(data.tweet_id))
tweets = data.text
test = pd.DataFrame(data)
test.column = ["text"]
# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
test['tweet_without_stopwords'] = test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (sw) and word for word in x.split() if word not in punctuation]))
print(test)
chirps = test.text
splitwords = [ nltk.word_tokenize( str(c) ) for c in chirps ]
allWords = []
for wordList in splitwords:
    allWords += wordList
allWords_clean = [w.lower () for w in allWords if w.lower () not in sw and w.lower () not in punctuation]   
tweets2 = pd.Series(allWords)

words = nltk.FreqDist(tweets2)

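I really need the terms and counts for each tweet, but I can't work out what I'm doing wrong.

【Question discussion】:

Tags: python python-3.x pandas dataframe nltk

【Solution 1】:

In the first code snippet, the way you apply the function to the column is the root of the problem.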

# this line caused the problem
data['words']= tweets.apply(nltk.FreqDist(tweets))
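
To illustrate why that line fails, here is a minimal sketch using only the standard library. It assumes `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses `Counter`), and the two sample tweets and the helper name `count_words` are made up for illustration. The point is that `nltk.FreqDist(tweets)` is evaluated before `apply` runs, so `apply` receives a finished count rather than a function. Judging from the traceback, pandas then seems to treat that dict-like object as a mapping of aggregation functions and ends up trying to call one of its integer counts.

```python
from collections import Counter

tweets = ["hello world hello", "goodbye world"]

# Counter stands in for nltk.FreqDist here (an assumption for illustration).
# This is ONE object covering ALL tweets, built before any per-row work happens.
already_counted = Counter(" ".join(tweets).split())

print(callable(Counter))          # True: the class can be called on each tweet
print(callable(already_counted))  # False: a finished count is data, not a function


def count_words(tweet):
    """Return a word -> count mapping for a single tweet."""
    return Counter(tweet.split())


# Plain-Python equivalent of applying a function to every row.
per_tweet = [count_words(t) for t in tweets]
print(per_tweet[0]["hello"])  # 2
```

The same distinction carries over to pandas: `apply` wants something like `count_words` itself, without parentheses and arguments.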

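Suppose you have this simple dataframe after cleaning the tweets and want to apply nltk.FreqDist to compute the word frequencies in each tweet. Series.apply accepts any callable, meaning the function itself rather than the result of calling it.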

import pandas as pd

df = pd.DataFrame(
    {
        "tweets": [
            "Hello world",
            "I am the abominable snowman",
            "I did not copy this text",
        ]
    }
)

The dataframe looks like this:

|    | tweets                      |
|---:|:----------------------------|
|  0 | Hello world                 |
|  1 | I am the abominable snowman |
|  2 | I did not copy this text    |

Now let's find the word frequencies in each of these three sentences.

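import nltk

# tokenizer data needed by word_tokenize (one-time download);
# newer NLTK releases may ask for "punkt_tab" instead
nltk.download("punkt")

# define the fdist function
def find_fdist(sentence):
    # split the sentence into tokens, then count each token
    tokens = nltk.tokenize.word_tokenize(sentence)
    fdist = nltk.FreqDist(tokens)

    return dict(fdist)

# apply the function on `tweets` column (pass the function itself, do not call it)
df["words"] = df["tweets"].apply(find_fdist)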

The resulting dataframe should look like this:

|    | tweets                      | words                                                         |
|---:|:----------------------------|:--------------------------------------------------------------|
|  0 | Hello world                 | {'Hello': 1, 'world': 1}                                      |
|  1 | I am the abominable snowman | {'I': 1, 'am': 1, 'the': 1, 'abominable': 1, 'snowman': 1}    |
|  2 | I did not copy this text    | {'I': 1, 'did': 1, 'not': 1, 'copy': 1, 'this': 1, 'text': 1} |
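
If the counts should also exclude stopwords and punctuation, as in the question, the cleaning can live inside the per-tweet function. The following is only a sketch under some assumptions: `clean_and_count` is a hypothetical helper, `STOPWORDS` is a deliberately tiny made-up list standing in for `nltk.corpus.stopwords.words('english')`, and `collections.Counter` with `str.split` replaces `nltk.FreqDist` and `word_tokenize` so that no NLTK data download is needed.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; for real use swap in
# set(nltk.corpus.stopwords.words("english")).
STOPWORDS = {"i", "am", "the", "did", "not", "this"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_and_count(tweet):
    """Lowercase a tweet, drop punctuation and stopwords, and count the rest."""
    stripped = tweet.lower().translate(PUNCT_TABLE)
    tokens = [word for word in stripped.split() if word not in STOPWORDS]
    return dict(Counter(tokens))


tweets = [
    "Hello world",
    "I am the abominable snowman",
    "I did not copy this text!",
]

for tweet in tweets:
    print(clean_and_count(tweet))
```

With the dataframe above, the function could be passed in the same way as `find_fdist`, for example `df["words"] = df["tweets"].apply(clean_and_count)`. Note that deleting all punctuation also removes `#`, `@`, and apostrophes, which may or may not be what you want for tweets.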

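【Discussion】:

Check what your data["text"] column looks like. If every row holds a valid sentence, this should not happen. In that case, maybe you can share the dataset with me and I can try to reproduce the issue.
It worked! Thank you so much. I'd like to clean the results up further, but this is what I needed, and I think I even understand it. Thanks a lot!
And the error I was getting was because I forgot the return statement... just a silly mistake!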