【Title】: Extracting rows from a CSV file based on specific keywords
【Posted】: 2017-08-27 09:50:52
【Question】:

I wrote some code to help me retrieve data from a CSV file:

import re
keywords = {"metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"}  # all your keywords


keyre=re.compile("energy",re.IGNORECASE)
with open("2006-data-8-8-2016.csv") as infile:
    with open("new_data.csv", "w") as outfile:
        outfile.write(infile.readline())  # Save the header
        for line in infile:
            if len(keyre.findall(line))>0:
                outfile.write(line)

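I need it to look for every keyword in the two main columns, "position" and "Job description", then take each whole row that contains one of these words and write it to a new file. Any ideas on how to do this in the simplest way?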

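【Comments】:

  • I need it to go through all the keywords. For example, it should find the rows that contain the word "metal" under "position" and "Job description", extract the whole rows and write them to the file, then look for the second word and do the same, and so on until the last word.

Tags: python csv extract operator-keyword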


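【Solution 1】:

If you are looking for rows where the cell consists of exactly one word from the keyword list (an exact match), you can do this with pandas as follows: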

import pandas as pd

keywords = ["metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"]

# read the csv data into a dataframe 
# change "," to the data separator in your csv file 
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows whose position or Job description
# cell is exactly equal to one of the keywords
df = df[df["position"].isin(keywords) | df["Job description"].isin(keywords)] 
# write the data back to a csv file 
df.to_csv("new_data.csv",sep=",", index=False) 

If you are looking for substrings within the rows (for example, finding financial in financial engineering), you can do the following:

import pandas as pd

keywords = ["metal", "energy", "team", "sheet", "solar", "financial", "transportation", "electrical", "scientists",
            "electronic", "workers"]
# join the keywords into one regex alternation: "metal|energy|..."
searched_keywords = '|'.join(keywords)

# read the csv data into a dataframe 
# change "," to the data separator in your csv file 
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows that contain one of the keywords 
# in the position or the Job description columns
# case=False ignores letter case; na=False treats empty cells as non-matches
df = df[df["position"].str.contains(searched_keywords, case=False, na=False)
        | df["Job description"].str.contains(searched_keywords, case=False, na=False)] 
# write the data back to a csv file 
df.to_csv("new_data.csv",sep=",", index=False) 
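
One caveat with a plain '|'.join pattern is that it matches inside longer words ("team" matches "Teamster") and treats regex metacharacters in a keyword as syntax. The following is a minimal sketch of one way to tighten the pattern, assuming whole-word matching is what you want. The sample DataFrame is hypothetical and only stands in for the real CSV.

```python
import re

import pandas as pd

keywords = ["metal", "energy", "team", "financial"]

# Escape each keyword so regex metacharacters are taken literally, and
# wrap the alternation in \b...\b so that only whole words match.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

# Hypothetical sample data standing in for the real CSV file.
df = pd.DataFrame({
    "position": ["Financial Analyst", "Teamster", "Chef"],
    "Job description": ["Builds budgets", "Drives trucks", "Works with sheet metal"],
})

# case=False ignores letter case; na=False treats empty cells as non-matches.
mask = (
    df["position"].str.contains(pattern, case=False, na=False)
    | df["Job description"].str.contains(pattern, case=False, na=False)
)
matches = df[mask]
```

Here "Teamster" is skipped because "team" is not a whole word in it. Drop the \b anchors if substring matches are actually desired.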

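【Comments】:

  • This is simple and looks good, and I got the code running. But it doesn't save any data, only the header :( even though I'm sure the file contains many of the keywords, especially in position and Job description @MedAli
  • @Eng.Reem Can you share a sample of your data?
  • This won't work, because the "Job description" column is not just a single word.
  • @VincentK, the "Job description" column is just a selector label; it has nothing to do with whether it works.
  • @MedAli I mean that each row in the "Job description" column won't contain just one word. If it says "The job involves making financial decisions", it won't match any keyword even though "financial" is in the sentence, because DataFrame.isin compares the cell as a whole.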
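
The difference raised in the last comment can be checked on a tiny example. This is only an illustrative sketch: the one-column DataFrame below is made up, and the real file has more columns.

```python
import pandas as pd

# Hypothetical sample: one cell is exactly a keyword, the other is a sentence.
df = pd.DataFrame({
    "Job description": [
        "financial",
        "The job involves making financial decisions",
    ]
})
keywords = ["financial", "energy"]

# isin compares each whole cell against the list of keywords.
exact = df["Job description"].isin(keywords)

# str.contains searches inside each cell for the regex "financial|energy".
partial = df["Job description"].str.contains("|".join(keywords), na=False)
```

Only the first row passes the isin test, while both rows pass the str.contains test, which would explain an output file that contains nothing but the header when every cell holds a full sentence.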
【Solution 2】:

Try this: loop over the DataFrame and write a new DataFrame back to a CSV file.

import pandas as pd

keywords = {"metal", "energy", "team", "sheet", "solar", "financial", 
        "transportation", "electrical", "scientists",
        "electronic", "workers"}  # all your keywords

df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")

listMatchPosition = []
listMatchDescription = []

for i in range(len(df.index)):
    if any(x in df['position'][i] or x in df['Job description'][i] for x in keywords):
        listMatchPosition.append(df['position'][i])
        listMatchDescription.append(df['Job description'][i])


output = pd.DataFrame({'position':listMatchPosition, 'Job description':listMatchDescription})
output.to_csv("new_data.csv", index=False)

Edit: if you have many columns to include, the modified code below will do the job.

df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")

output = pd.DataFrame(columns=df.columns)

for i in range(len(df.index)):
    if any(x in df['position'][i] or x in df['Job description'][i] for x in keywords):
        output.loc[len(output)] = [df[j][i] for j in df.columns]

output.to_csv("new_data.csv", index=False)
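
For comparison, the same row-by-row idea can be written without pandas, closer to the code in the question. This is a minimal sketch using the standard csv module, not part of the original answer. The helper names row_matches and filter_csv are hypothetical, and it assumes the file has a header row containing "position" and "Job description".

```python
import csv

KEYWORDS = ["metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists", "electronic", "workers"]
SEARCH_COLUMNS = ("position", "Job description")


def row_matches(row, keywords=KEYWORDS, columns=SEARCH_COLUMNS):
    """Return True if any keyword occurs (case-insensitively) in a searched column."""
    return any(
        keyword in (row.get(col) or "").lower()
        for col in columns
        for keyword in keywords
    )


def filter_csv(in_path, out_path):
    """Copy the header and every matching row to out_path; return the number kept."""
    with open(in_path, newline="") as infile, \
            open(out_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            if row_matches(row):
                writer.writerow(row)  # the whole row, all columns included
                kept += 1
    return kept
```

Calling filter_csv("2006-data-8-8-2016.csv", "new_data.csv") would then write every matching row with all of its columns. Unlike splitting lines by hand, csv.DictReader also handles quoted fields that contain commas.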

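【Comments】:

  • Note that this is for the case where "Job description" is not just a single word, which I think it is not; this is in contrast to the DataFrame.isin method.
  • The CSV file also includes other columns that I need to extract and put in the new file. Any ideas on how to do that? @Vincent K
  • Do you mean that columns such as "Salary" and "Location" need to be extracted along with them? If so, and it is only a few more columns, just add more listMatchxxx lists.
  • Yes, there are 18 more columns to extract, such as salary, education, etc. I'll give it a try and see what I get! Thanks @Vincent K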
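
When many extra columns have to come along, an alternative to maintaining one list per column is to build a boolean mask and index the DataFrame with it, which keeps every column of each matching row. The sketch below is one possible way to do that. The function name filter_rows and the sample data are hypothetical, and it assumes the searched columns hold text.

```python
import re

import pandas as pd


def filter_rows(df, keywords, columns):
    """Return the rows of df where any listed column contains any keyword."""
    # Escape the keywords so they are matched literally, then OR them together.
    pattern = "|".join(re.escape(k) for k in keywords)
    # One boolean column per searched column; empty cells count as non-matches.
    hits = pd.DataFrame({
        col: df[col].str.contains(pattern, case=False, na=False)
        for col in columns
    })
    # Keep a row if at least one searched column matched; all columns survive.
    return df[hits.any(axis=1)]


# Hypothetical sample with extra columns such as Salary and Education.
sample = pd.DataFrame({
    "position": ["Solar Technician", "Baker", "Clerk"],
    "Job description": ["Installs panels", "Bakes bread", None],
    "Salary": [50000, 30000, 28000],
    "Education": ["BSc", "High school", "Diploma"],
})
result = filter_rows(sample, ["solar", "energy"], ["position", "Job description"])
```

The result can be written out with result.to_csv("new_data.csv", index=False), and no per-column bookkeeping is needed however many columns the file has.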