【Question title】: Loop through CSV files to pull out specific columns
【Posted】: 2021-01-16 12:23:22
【Question】:

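I have multiple CSV files in the same directory containing different survey response data (different questions, different question order). What I'm hoping to achieve is to loop through all the CSVs, look for a specific column heading, and store the results in a pandas dataframe.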

What I have so far:

import pandas as pd
import csv
import os
import glob

path = "file/path/"
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of paths

for file in all_files:
    file_name = os.path.splitext(os.path.basename(file))[0]  # base name without the extension
    dfn = pd.read_csv(file, encoding='latin1')
    dfn.index.name = file_name

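So the code currently reads in all the CSVs from the directory, and I think I now need another loop to go through them and find the data in the column. The column I'm looking for contains the text "would recommend" (the column names may not all be worded identically, so it needs to be a "contains" match). I'm still very new to Python and really struggling, so any help is much appreciated.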

CSV1 example:

Programme,"Overall, I am satisfied with the quality of the programme",I would recommend the company to a friend or colleague,Please comment on any positive aspects of your experience of this programme
Nursing,4,4,[IMAGE]
Nursing,1,3,very good
Nursing,4,5,I enjoyed studying tis programme

CSV2 example:

Programme,I would recommend the company to a friend,The programme was well organised and running smoothly,It is clear how students' feedback on the programme has been acted on
IT,4,2,4
IT,5,5,5
IT,5,4,5

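【Comments】:

  • Check this answer -> stackoverflow.com/a/11531402/8150371
  • Could you give an example of two of the CSVs and of the data you want the final dataframe to hold?
  • @rcriii I've added simplified versions of two of the CSVs to the question. I want the dataframe to hold the Programme and the responses to the "would recommend" question.
  • It would be better if they were text that I could copy and paste.
  • Sorry, I wasn't sure how to add the actual CSVs. I thought it was just to give you an idea of the data!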

Tags: python pandas loops csv


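【Solution 1】:

I would rename the matching column to a common name in each file, then concat the frames together, using the join parameter to specify that you only want the columns they have in common.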

import pandas as pd
from io import StringIO

csv1 = StringIO("""Programme,"Overall, I am satisfied with the quality of the programme",I would recommend the company to a friend or colleague,Please comment on any positive aspects of your experience of this programme
Nursing,4,4,IMAGE
Nursing,1,3,very good
Nursing,4,5,I enjoyed studying tis programme""")

csv2 = StringIO("""Programme,I would recommend the company to a friend,The programme was well organised and running smoothly,It is clear how students' feedback on the programme has been acted on
IT,4,2,4
IT,5,5,5
IT,5,4,5""")

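# Empty frame that fixes the two columns we want to keep
dfout = pd.DataFrame(columns=['Programme', 'Recommends'])

for file in [csv1, csv2]:
    dfn = pd.read_csv(file)
    # Column names that contain the text we are looking for
    matching = [s for s in dfn.columns if "would recommend" in s]
    if matching:
        # Give the first match a common name so the frames line up
        dfn = dfn.rename(columns={matching[0]:'Recommends'})
        # join="inner" keeps only the columns shared with dfout
        dfout = pd.concat([dfout, dfn], join="inner")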

print(dfout)

Output:

  Programme Recommends
0   Nursing          4
1   Nursing          3
2   Nursing          5
0        IT          4
1        IT          5
2        IT          5
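
To apply the same idea to files on disk rather than StringIO objects, something like the following might work. It is only a sketch: the helper name collect_recommend_columns and its needle and key parameters are made up for illustration, "file/path/" is the placeholder directory from the question, and encoding='latin1' is carried over from the question's code. It collects the trimmed frames in a list and concatenates once at the end instead of growing dfout inside the loop.

```python
import glob
import os

import pandas as pd


def collect_recommend_columns(paths, needle="would recommend", key="Programme"):
    """Hypothetical helper: keep `key` plus the first column whose name contains `needle`."""
    frames = []
    for path in paths:
        dfn = pd.read_csv(path, encoding="latin1")
        matching = [s for s in dfn.columns if needle in s]
        if not matching or key not in dfn.columns:
            continue  # skip files that lack the question or the key column
        trimmed = dfn[[key, matching[0]]].rename(columns={matching[0]: "Recommends"})
        frames.append(trimmed)
    if not frames:
        # Nothing matched: return an empty frame with the expected columns
        return pd.DataFrame(columns=[key, "Recommends"])
    # One concat at the end; ignore_index gives a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# "file/path/" is the placeholder directory from the question
all_files = sorted(glob.glob(os.path.join("file/path/", "*.csv")))
dfout = collect_recommend_columns(all_files)
print(dfout)
```

If you also need to know which file each row came from, you could add a column holding os.path.basename(path) to each trimmed frame before appending it.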

【Comments】:

  • This works for the example you gave, but when I try to read in a CSV and run the loop over it, I get an error message: ValueError(msg.format(_type=type(filepath_or_buffer))) ValueError: Invalid file path or buffer object type:
  • Here is the code I tried:

        df = pd.read_csv('data.csv')
        dfout = pd.DataFrame(columns=['Subunit', 'Recommends'])
        for file in [df]:
            dfn = pd.read_csv(file)
            matching = [s for s in dfn.columns if "would recommend" in s]
            if matching:
                dfn = dfn.rename(columns={matching[0]:'Recommends'})
                dfout = pd.concat([dfout, dfn], join="inner")
        print(dfout)
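
The error in the comment above most likely comes from handing an already loaded DataFrame back to pd.read_csv, which expects a path or a file-like object. A small sketch of the difference, assuming a made-up two-row CSV (the 'Subunit' column name is taken from the comment; the data is invented for illustration):

```python
from io import StringIO

import pandas as pd

# Invented stand-in for 'data.csv' from the comment
df = pd.read_csv(StringIO("Subunit,I would recommend the company\nA,4\nB,5"))

# read_csv needs a path or buffer, so passing the DataFrame itself fails
error_message = None
try:
    pd.read_csv(df)
except (ValueError, TypeError) as exc:
    error_message = str(exc)
print(error_message)

# The frame is already loaded, so match on its columns directly
matching = [s for s in df.columns if "would recommend" in s]
result = df.rename(columns={matching[0]: "Recommends"})[["Subunit", "Recommends"]]
print(result)
```

For several files, the loop should iterate over the file paths (for example ['data.csv']) and call pd.read_csv once per path, rather than over a list containing a DataFrame.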

【Solution 2】:

You don't need another loop to find the column:

matching = [s for s in dfn.columns if "would recommend" in s]

matching will then hold the names of the columns that meet your condition.
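
If the wording or capitalisation varies between files, a case-insensitive variant of the same match may be more forgiving. This is a sketch using the string accessor on the column index; the sample header below is invented to show a capitalisation difference.

```python
from io import StringIO

import pandas as pd

# Invented sample with different capitalisation in the header
dfn = pd.read_csv(StringIO(
    "Programme,I Would Recommend the company to a friend,Other\n"
    "IT,4,2\n"
    "IT,5,5"
))

# Boolean mask over the column labels: case=False ignores capitalisation,
# regex=False treats the search text as a plain substring
mask = dfn.columns.str.contains("would recommend", case=False, regex=False)
matching = dfn.columns[mask].tolist()
print(matching)
```

As with the list comprehension, matching can be empty or hold more than one name, so it is worth checking its length before using matching[0].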
