Extracting rows from a CSV file based on specific keywords: answers

【Question title】: Extracting rows from a CSV file based on specific keywords
【Posted】: 2017-08-27 09:50:52
【Question】:

I wrote some code to help me retrieve data from a CSV file:

import re

keywords = {"metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"}  # all your keywords


# so far only "energy" is searched (case-insensitively), anywhere in the line
keyre = re.compile("energy", re.IGNORECASE)
with open("2006-data-8-8-2016.csv") as infile:
    with open("new_data.csv", "w") as outfile:
        outfile.write(infile.readline())  # Save the header
        for line in infile:
            if keyre.search(line):
                outfile.write(line)

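I need it to look for each keyword in the two main columns, "position" and "Job description", then take every full row that contains one of these words and write it to a new file. Any ideas on the simplest way to do this?

【Comments】:

I need it to go through all the keywords. For example, it should look under "position" and "Job description" for rows containing the word "metal", extract those whole rows and write them to the file, then look for the second word and do the same, and so on until the last word.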
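
The requirement described above (search only two columns, but copy the whole row) can be sketched with the standard library alone. This is a rough sketch, not a tested solution for the asker's file: `filter_rows`, `KEYWORD_RE` and the sample rows are made up for illustration, and it assumes the header really contains columns named "position" and "Job description".

```python
import csv
import io
import re

KEYWORDS = ["metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists", "electronic", "workers"]
SEARCH_COLUMNS = ["position", "Job description"]

# One case-insensitive pattern that matches any keyword as a substring.
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def filter_rows(infile, outfile, columns=SEARCH_COLUMNS, pattern=KEYWORD_RE):
    """Copy the header and every row whose `columns` mention a keyword."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        # Only the chosen columns are searched; the whole row is written.
        if any(pattern.search(row.get(col) or "") for col in columns):
            writer.writerow(row)
            kept += 1
    return kept


# Small in-memory sample (hypothetical data, not the asker's file).
sample = io.StringIO(
    "position,Job description,Salary\n"
    "Solar Technician,Installs panels,100\n"
    "Clerk,Files paperwork,50\n"
    "Analyst,Makes financial decisions,80\n"
)
result = io.StringIO()
kept_count = filter_rows(sample, result)
```

With real files, the two `StringIO` objects would be replaced by `open("2006-data-8-8-2016.csv", newline="")` and `open("new_data.csv", "w", newline="")`, since the `csv` module documentation recommends `newline=""` for files passed to readers and writers.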

Tags: python csv extract operator-keyword

【Solution 1】:

If you are looking for rows where the cell value is exactly one of the keywords, you can do this with pandas as follows:

import pandas as pd

keywords = ["metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"]

# read the csv data into a dataframe 
# change "," to the data separator in your csv file 
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows whose cell is exactly one of the keywords 
# in the position or the Job description columns
df = df[df["position"].isin(keywords) | df["Job description"].isin(keywords)] 
# write the data back to a csv file 
df.to_csv("new_data.csv",sep=",", index=False)

If you are looking for substrings within the rows (for example, finding financial in financial engineering), you can do the following:

import pandas as pd

keywords = ["metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"]
# build one regex alternation: "metal|energy|team|..."
searched_keywords = '|'.join(keywords)

# read the csv data into a dataframe 
# change "," to the data separator in your csv file 
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows that contain one of the keywords 
# in the position or the Job description columns
# case=False mirrors re.IGNORECASE in the question; na=False treats empty cells as non-matches
df = df[df["position"].str.contains(searched_keywords, case=False, na=False)
        | df["Job description"].str.contains(searched_keywords, case=False, na=False)]
# write the data back to a csv file 
df.to_csv("new_data.csv",sep=",", index=False)
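
One caveat with joining the keywords by `|`: the result is treated as a regular expression, and a plain substring search also finds "team" inside "steam". A possible refinement, sketched here under the assumption that whole-word matches are wanted, is to escape each keyword and wrap the alternation in word boundaries. `build_pattern` is a hypothetical helper, not part of pandas.

```python
import re

keywords = ["metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists", "electronic", "workers"]


def build_pattern(words, whole_word=True):
    """Return a regex string that matches any of `words`."""
    # re.escape keeps characters such as "+" or "." literal.
    escaped = "|".join(re.escape(w) for w in words)
    # A non-capturing group keeps the alternation inside the \b anchors.
    return rf"\b(?:{escaped})\b" if whole_word else f"(?:{escaped})"


pattern = build_pattern(keywords)
matcher = re.compile(pattern, re.IGNORECASE)
```

The string in `pattern` could then be passed to `str.contains` in place of `searched_keywords`. Whether whole-word matching is desirable depends on the data: it would no longer find "financial" inside "financially", for instance.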

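【Comments】:

This is simple and looks good, and I got the code running. But it doesn't save any data, only the header :( even though I'm sure the file contains many of the keywords, especially in position and Job description @MedAli
@Eng.Reem can you share a sample of your data?
This won't work because the "Job description" column isn't just a single word.
@VincentK, the "Job description" column is just a selector label; it has nothing to do with whether it works.
@MedAli I mean that each row in the "Job description" column won't contain only one word. If it says "This job includes making financial decisions", it won't match any keyword even though "financial" is in the sentence, because DataFrame.isin compares the cell value as a whole.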
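
The difference raised in this last comment can be checked on a tiny frame. The two rows below are made up for illustration; only the first sentence comes from the comment itself.

```python
import pandas as pd

df = pd.DataFrame({
    "Job description": [
        "This job includes making financial decisions",
        "Reads meters",
    ],
})
keywords = ["energy", "financial"]

# isin: True only when the whole cell equals one of the keywords.
exact = df["Job description"].isin(keywords)

# str.contains: True when any keyword occurs somewhere inside the cell.
partial = df["Job description"].str.contains("|".join(keywords), case=False, na=False)
```

Here `exact` is False for both rows, while `partial` is True for the first row, which is consistent with an output file that contains only the header when `isin` is applied to free-text columns.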

【Solution 2】:

Try this: loop over the dataframe and write a new dataframe back to a csv file.

import pandas as pd

keywords = {"metal", "energy", "team", "sheet", "solar", "financial", 
        "transportation", "electrical", "scientists",
        "electronic", "workers"}  # all your keywords

df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")

listMatchPosition = []
listMatchDescription = []

for i in range(len(df.index)):
    # str() guards against empty cells, which pandas reads as float NaN
    if any(x in str(df['position'][i]) or x in str(df['Job description'][i]) for x in keywords):
        listMatchPosition.append(df['position'][i])
        listMatchDescription.append(df['Job description'][i])


output = pd.DataFrame({'position':listMatchPosition, 'Job description':listMatchDescription})
output.to_csv("new_data.csv", index=False)

Edit: if you have many columns to include, the modified code below will do.

df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")

output = pd.DataFrame(columns=df.columns)

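for i in range(len(df.index)):
    # str() guards against empty cells, which pandas reads as float NaN
    if any(x in str(df['position'][i]) or x in str(df['Job description'][i]) for x in keywords):
        # copy every column of the matching row into the output frame
        output.loc[len(output)] = [df[j][i] for j in df.columns]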

output.to_csv("new_data.csv", index=False)
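
The row-by-row loop can probably be replaced by a boolean mask, which keeps every column without listing them. This is a sketch rather than a drop-in replacement: `filter_by_keywords` is a hypothetical helper, the sample frame is invented, and it assumes the searched columns hold text.

```python
import re

import pandas as pd


def filter_by_keywords(df, keywords, columns=("position", "Job description")):
    """Return the rows of `df` where any of `columns` contains a keyword."""
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # na=False makes empty cells count as "no match".
        mask |= df[col].str.contains(pattern, case=False, na=False)
    return df[mask]


# Invented sample with an extra column and one empty description.
sample = pd.DataFrame({
    "position": ["Solar Technician", "Clerk", "Analyst"],
    "Job description": ["Installs panels", None, "Makes financial decisions"],
    "Salary": [100, 50, 80],
})
matched = filter_by_keywords(sample, ["solar", "financial"])
```

On the real data this would be `filter_by_keywords(df, keywords).to_csv("new_data.csv", index=False)`. Unlike the loop, the match here is case-insensitive, which may or may not be what is wanted.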

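【Comments】:

Note that this works when "Job description" is more than a single word, which I think is the case; that is the opposite of the DataFrame.isin approach.
The csv file also includes other columns that I need to extract and put into the new file. Any ideas on how to do that? @Vincent K
Do you mean that columns like "Salary" and "Location" need to be extracted as well? If so, and it's only a few more columns, just add more listMatchxxx lists.
Yes, there are 18 more columns that need to be extracted, such as salary, education, etc. I'll give it a try and see what I get! Thanks @Vincent K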