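[Posted]: 2018-05-26 07:27:09
[Question]:
I am trying to extract keywords row by row from a CSV file and store them in a new keyword field. Right now I can only get the keywords for the whole file at once. How do I get the keywords for each row/field?
Data: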
id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
Code: this processes the entire text at once rather than row by row. Do I need to add anything besides replace(r'\|', ' ')?
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
df = pd.read_csv('test-data.csv')
# print(df.head(5))
text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ') # not put lower case?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
print(keywords)
Desired final output:
id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"
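For reference, the code above produces a single keyword list because `.str.cat(sep=' ')` joins every row into one string before tokenizing. One possible way to get the per-row field is to apply the filtering to each row separately. The sketch below is only an illustration under stated assumptions: it uses the standard library so that it runs without downloading NLTK data, `STOP_WORDS` is a small hypothetical stand-in for `stopwords.words('english')` chosen to cover the sample rows, a regular expression stands in for `word_tokenize`, and `extract_keywords` and `add_keyword_field` are made-up helper names.

```python
import csv
import io
import re

# Assumption: a tiny stand-in for stopwords.words('english'),
# just large enough for the two sample rows.
STOP_WORDS = {"what", "is", "the", "of", "a", "or", "that", "same", "as"}

# The sample data from the question, inlined so the sketch is self-contained.
SAMPLE = '''id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
'''


def extract_keywords(text):
    """Return the comma-joined keywords of a single row of text."""
    # Lowercase, then keep runs of letters, digits and apostrophes;
    # this drops punctuation without needing an explicit punctuation list.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return ",".join(t for t in tokens if t not in STOP_WORDS)


def add_keyword_field(csv_text):
    """Parse CSV text and add a new_keyword_field value to every row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["new_keyword_field"] = extract_keywords(row["some_text"])
    return rows


rows = add_keyword_field(SAMPLE)
for row in rows:
    print(row["id"], row["new_keyword_field"])
```

With pandas, the same per-row function could presumably be used as `df['new_keyword_field'] = df['some_text'].apply(extract_keywords)`, swapping the stand-in tokenizer and stop-word set for the NLTK ones.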
[Comments]: