Extract keywords from a dataframe column to another column: answers

【Question title】: Extract keywords from a dataframe column to another column
【Posted】: 2020-06-25 07:57:19
【Question description】:

I have a dataframe in the following format (link to the csv file):

    image_name      caption_number  caption
0   1000092795.jpg  0               Two young guys with shaggy hair look at their...
1   1000092795.jpg  1               Two young , White males are outside near many...
2   1000092795.jpg  2               Two men in green shirts are standing in a yard .
3   1000092795.jpg  3               A man in a blue shirt standing in a garden .
4   1000092795.jpg  4               Two friends enjoy time spent together .

I want to add another column, keywords, filled by an NLP keyword-extraction method applied to each caption.

Here is what I tried:

df = pd.read_csv('results.csv', delimiter='|')
df.columns = ['image_name', 'caption_number', 'caption']
stop_words = stopwords.words('english')

def get_keywords(row):
    some_text = row['caption']
    lowered = some_text.lower()
    tokens = nltk.tokenize.word_tokenize(some_text)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and not keyword in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string


df['Keywords'] = df['caption'].apply(get_keywords, axis=1)

The code above fails with the error: get_keywords() got an unexpected keyword argument 'axis'

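【Comments on the question】:

What is the result? What is wrong with it? What exactly is your question?
I get the error get_keywords() got an unexpected keyword argument 'axis'
What happens when you write df[['caption']].apply(get_keywords, axis=1) with double brackets, or when you omit the axis keyword? With single brackets you are implicitly collapsing the DataFrame into a Series.
With double square brackets I get ("'float' object has no attribute 'lower'", 'occurred at index 19999'); when I remove the axis keyword I get string indices must be integers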
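
The two errors in the comments above come from what apply hands to the function. A minimal sketch on a toy frame (the `describe` helper and the sample data are made up for illustration, not taken from the asker's file): `Series.apply` passes each cell value and forwards unknown keyword arguments such as `axis` to the function, while `DataFrame.apply(..., axis=1)` passes each row as a Series.

```python
import pandas as pd

# Toy data standing in for the real CSV (illustrative only).
df = pd.DataFrame({
    "image_name": ["a.jpg", "a.jpg"],
    "caption": ["Two men in green shirts", "A man in a blue shirt"],
})

def describe(value):
    # Report the type of whatever apply() hands to the function.
    return type(value).__name__

# Series.apply: the function receives each cell value (a str here).
cell_types = df["caption"].apply(describe).tolist()

# DataFrame.apply with axis=1: the function receives each row as a Series.
row_types = df[["caption"]].apply(describe, axis=1).tolist()

# Series.apply forwards extra keyword arguments to the function itself,
# which is where "unexpected keyword argument 'axis'" comes from.
try:
    df["caption"].apply(describe, axis=1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(cell_types, row_types, error_message)
```

This also suggests why dropping `axis` gives "string indices must be integers": the function then receives a plain string, and `row['caption']` tries to index that string with another string.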

Tags: python pandas keyword

【Solution 1】:

The cause is that the caption column contains NaN values, so those rows need to be dropped before the function is applied.

# strip every run of digits from the captions
# (raw string for the pattern; regex=True must be explicit in pandas >= 2.0)
df['caption'] = df['caption'].str.replace(r'\d+', '', regex=True)
# drop all rows that contain NaN values
df = df.dropna()
# pass the whole row to the function, since get_keywords reads row['caption']
df['Keywords'] = df.apply(get_keywords, axis=1)
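
An alternative is to let the function take the caption string directly and tolerate missing values, so it can be used with `df['caption'].apply(...)` and no `axis` argument. The sketch below is only an illustration under stated assumptions: `STOP_WORDS` is a small hypothetical stand-in for `stopwords.words('english')`, and a regular expression replaces `nltk.tokenize.word_tokenize` so the example runs without downloading NLTK data.

```python
import re
import pandas as pd

# Assumption: a tiny stand-in stop-word set so the sketch runs without NLTK
# data; in practice this would be stopwords.words('english').
STOP_WORDS = {"a", "an", "the", "in", "at", "are", "is", "with", "their"}

def get_keywords(caption):
    # Missing captions arrive as float NaN; return an empty string for them.
    if not isinstance(caption, str):
        return ""
    # Lowercase first, then keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", caption.lower())
    return ",".join(t for t in tokens if t not in STOP_WORDS)

df = pd.DataFrame({
    "caption": ["Two men in green shirts are standing in a yard .", float("nan")],
})

# Series.apply passes one caption at a time, so no axis argument is needed.
df["Keywords"] = df["caption"].apply(get_keywords)
print(df["Keywords"].tolist())
```

Note that this variant lowercases before tokenizing. The function in the question computes `lowered` but then tokenizes the original `some_text`, so capitalized stop words such as "A" would slip through there.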

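【Comments】:

I get the error ('string indices must be integers', 'occurred at index 0')
Oh, OK. Could you also update the question with your error? That would help.
OK, I will do that.
I think your caption column contains some numbers, so the captions need to be cleaned before they are passed to the function.
I still get the error ("'float' object has no attribute 'lower'", 'occurred at index 19999'). I have included a link to the dataset so you can take a look. Use only the .csv file.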
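
Since the error names a specific index, one way to narrow it down is to look at the rows whose caption is not a string before deciding whether to drop or fill them. This is a hedged sketch on a toy frame (the data and the names `not_text`, `bad_rows`, `cleaned`, `filled` are hypothetical); it makes no claim about what the linked file actually contains at index 19999.

```python
import pandas as pd

# Toy frame standing in for the real CSV (illustrative only).
df = pd.DataFrame({
    "image_name": ["a.jpg", "b.jpg", "c.jpg"],
    "caption_number": [0, 1, 2],
    "caption": ["A man in a garden .", "Two friends .", None],
})

# Flag rows whose caption is not a string (a missing value is read as float NaN).
not_text = ~df["caption"].apply(lambda v: isinstance(v, str))
bad_rows = df[not_text]
print(bad_rows)

# Option 1: drop only the rows with a missing caption.
cleaned = df.dropna(subset=["caption"])

# Option 2: keep every row and substitute an empty caption.
filled = df.assign(caption=df["caption"].fillna(""))
```

Inspecting `bad_rows` first shows whether the missing captions are genuinely empty or the result of a line that was split differently by `delimiter='|'`.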