【Question title】: Python: extract keywords row by row from csv
【Posted】: 2018-05-26 07:27:09
【Question】:

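I am trying to extract keywords row by row from a csv file and create a keyword field. Right now I can only get the keywords for the whole text at once. How do I get the keywords for each row/field?

Data: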

id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"

Code: this searches the whole text, not row by row. Do I need to put something else besides replace(r'\|', ' ')?

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df = pd.read_csv('test-data.csv')
# print(df.head(5))

text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ') # not put lower case?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
print(keywords)

Desired output:

id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"

【Comments】:

    Tags: python nlp nltk


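    【Solution 1】:

    Here is a clean way to add a new keywords column to your dataframe using pandas apply. Apply works by first defining a function (get_keywords in our case) that we can then apply to each row or each column.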

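    import pandas as pd
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    
    # Note: word_tokenize and stopwords need NLTK data, e.g. nltk.download('punkt')
    # and nltk.download('stopwords') (newer NLTK releases may ask for 'punkt_tab').
    # I define the stop_words here so I don't do it every time in the function below.
    # A set makes the membership test in get_keywords faster than a list.
    stop_words = set(stopwords.words('english'))
    # I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
    df = pd.read_csv('test-data.csv', index_col='id')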
    

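    Here we define the function that will be applied to each row with df.apply in the next cell. The function get_keywords takes row as its argument and returns a string of comma-separated keywords, as in your desired output above ("meaning,word,himalaya"). Inside the function we lowercase the text, tokenize it, filter out punctuation with isalpha(), filter out our stop_words, and join the remaining keywords together to form the desired output.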

    # This function will be applied to each row in our Pandas Dataframe
    # See the docs for df.apply at: 
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
    def get_keywords(row):
        some_text = row['some_text']
        lowered = some_text.lower()
        tokens = word_tokenize(lowered)
        keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
        keywords_string = ','.join(keywords)
        return keywords_string
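
To make the two filtering steps concrete, here is a small sketch that runs them on a hand-written token list. The tokens are an assumption about how word_tokenize would split the lowered first sentence, and the stop-word set is a hypothetical four-word subset, so the snippet runs without any NLTK data. It also shows one side effect of isalpha(): tokens that mix letters with digits or punctuation are dropped as well.

```python
# Hand-written tokens (assumed to mirror what word_tokenize would produce
# for "what is the meaning of the word himalaya?").
tokens = ["what", "is", "the", "meaning", "of", "the", "word", "himalaya", "?"]

# Hypothetical mini stop-word set; NLTK's English list is much longer.
stop_words = {"what", "is", "the", "of"}

# Step 1: keep purely alphabetic tokens, which removes the "?".
alpha_only = [t for t in tokens if t.isalpha()]

# Step 2: remove stop words from what is left.
keywords = [t for t in alpha_only if t not in stop_words]
print(",".join(keywords))

# Caveat: isalpha() also rejects mixed tokens, not only punctuation.
leftovers = [t for t in ["covid-19", "3d", "n't", "word"] if t.isalpha()]
print(leftovers)
```

If tokens such as "covid-19" matter for your data, filtering against an explicit punctuation list (as in the question's code) may be a better fit than isalpha().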
    

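    Now that we have defined the function to apply, we call df.apply(get_keywords, axis=1). This returns a pandas Series (similar to a list). Since we want this Series to be part of our dataframe, we add it as a new column with df['keywords'] = df.apply(get_keywords, axis=1).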

    # applying the get_keywords function to our dataframe and saving the results
    # as a new column in our dataframe called 'keywords'
    # axis=1 means that we will apply get_keywords to each row and not each column
    df['keywords'] = df.apply(get_keywords, axis=1)
    

    Output: (image: DataFrame after adding the 'keywords' column)
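
Since only one column is read, a possible variation is to apply a text-level function to df['some_text'] directly and then write the result with to_csv, which gets close to the CSV layout asked for in the question. The sketch below is not the answer's code: it swaps NLTK for a simple regex tokenizer and a hypothetical mini stop-word set so that it runs without NLTK data, and it reads the sample rows from an in-memory string instead of test-data.csv. The names keywords_from_text and STOP_WORDS are made up for illustration.

```python
import io
import re

import pandas as pd

# The sample rows from the question, kept in memory instead of test-data.csv.
CSV_TEXT = '''id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
'''

# Hypothetical mini stop-word set (assumption); in real use you would take
# set(stopwords.words('english')) from NLTK instead.
STOP_WORDS = {"what", "is", "the", "of", "a", "or", "that", "same", "as"}


def keywords_from_text(text):
    """Return comma-separated keywords for one cell of text."""
    # Regex stand-in for word_tokenize: runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ",".join(t for t in tokens if t not in STOP_WORDS)


df = pd.read_csv(io.StringIO(CSV_TEXT), index_col="id")

# Series.apply passes each cell value (a string) to the function,
# so no axis argument and no row['some_text'] lookup are needed.
df["new_keyword_field"] = df["some_text"].apply(keywords_from_text)

# With no path argument, to_csv returns the CSV as a string.
csv_out = df.to_csv()
print(csv_out)
```

Passing a filename to to_csv would write the file to disk instead. Note that the regex tokenizer splits contractions and hyphenated words differently from word_tokenize, so the two approaches can disagree on messier text.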

    【Discussion】:
