Loop through CSV files to pull out specific columns

【Question title】: Loop through CSV files to pull out specific columns
【Posted】: 2021-01-16 12:23:22
【Question】:

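I have multiple CSV files in the same directory containing data from different survey responses (different questions, different question order). I want to loop through all of the CSVs, look for a specific column header in each, and store the results in a pandas data frame.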

What I have so far:

import pandas as pd
import csv
import os
import glob

path = "file/path/"
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of paths

for file in all_files:
    file_name = os.path.splitext(os.path.basename(file))[0]  # file name without extension
    dfn = pd.read_csv(file, encoding='latin1')
    dfn.index.name = file_name

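So the code currently reads in every CSV from the directory, and I think I now need another loop that goes through them to find the data in the column. The column I am looking for contains the text "would recommend" (the column names may not all be worded identically, hence the need for a "contains" match). I am still very new to Python and really struggling, so any help is much appreciated.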

CSV1 example:

Programme,"Overall, I am satisfied with the quality of the programme",I would recommend the company to a friend or colleague,Please comment on any positive aspects of your experience of this programme
Nursing,4,4,[IMAGE]
Nursing,1,3,very good
Nursing,4,5,I enjoyed studying tis programme

CSV2 example:

Programme,I would recommend the company to a friend,The programme was well organised and running smoothly,It is clear how students' feedback on the programme has been acted on
IT,4,2,4
IT,5,5,5
IT,5,4,5

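【Comments】:

Check this answer -> stackoverflow.com/a/11531402/8150371
Can you give an example of two CSVs and the data you want the final data frame to hold?
@rcriii I've added simplified versions of two CSVs to the question. I'd like the data frame to hold the programme and the responses to the "would recommend" question.
It would be better if they were text that I could copy and paste.
Sorry, I wasn't sure how to add the actual CSVs. I thought it was just to give you an idea of the data!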

Tags: python pandas loops csv

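【Solution 1】:

I would rename the matching column in each file to a common name, then concat the frames together, using the join parameter to specify that you only want the columns they have in common.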

import pandas as pd
from io import StringIO

csv1 = StringIO("""Programme,"Overall, I am satisfied with the quality of the programme",I would recommend the company to a friend or colleague,Please comment on any positive aspects of your experience of this programme
Nursing,4,4,IMAGE
Nursing,1,3,very good
Nursing,4,5,I enjoyed studying tis programme""")

csv2 = StringIO("""Programme,I would recommend the company to a friend,The programme was well organised and running smoothly,It is clear how students' feedback on the programme has been acted on
IT,4,2,4
IT,5,5,5
IT,5,4,5""")

dfout = pd.DataFrame(columns=['Programme', 'Recommends'])

for file in [csv1, csv2]:
    dfn = pd.read_csv(file)
    matching = [s for s in dfn.columns if "would recommend" in s]
    if matching:
        dfn = dfn.rename(columns={matching[0]:'Recommends'})
        dfout = pd.concat([dfout, dfn], join="inner")

print(dfout)

  Programme Recommends
0   Nursing          4
1   Nursing          3
2   Nursing          5
0        IT          4
1        IT          5
2        IT          5
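
A possible variation on the loop above, offered only as a sketch: collect the trimmed frames in a list and call concat once at the end, rather than growing dfout on every pass. The helper name collect_recommends and its needle parameter are made up for illustration, and the sample data is a shortened version of the CSVs from the question.

```python
import pandas as pd
from io import StringIO


def collect_recommends(sources, needle="would recommend"):
    """Return Programme plus the first column whose header contains `needle`."""
    frames = []
    for source in sources:
        dfn = pd.read_csv(source)
        matching = [c for c in dfn.columns if needle in c]
        if not matching:
            continue  # this file has no matching question, skip it
        dfn = dfn.rename(columns={matching[0]: "Recommends"})
        frames.append(dfn[["Programme", "Recommends"]])
    if not frames:
        # Nothing matched in any file: return an empty frame with the same columns.
        return pd.DataFrame(columns=["Programme", "Recommends"])
    return pd.concat(frames, ignore_index=True)


# Shortened, hypothetical versions of the two sample files.
csv1 = StringIO(
    "Programme,I would recommend the company to a friend or colleague,Comments\n"
    "Nursing,4,IMAGE\n"
    "Nursing,3,very good\n"
)
csv2 = StringIO(
    "Programme,I would recommend the company to a friend,The programme was well organised\n"
    "IT,4,2\n"
    "IT,5,5\n"
)

result = collect_recommends([csv1, csv2])
print(result)
```

Selecting the two columns explicitly does the same job as join="inner" here, and ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each file's row numbers.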

【Comments】:

This works for the example you gave, but when I try to read in a CSV and run the loop over it, I get an error message: ValueError(msg.format(_type=type(filepath_or_buffer))) ValueError: Invalid file path or buffer object type:
This is the code I tried: df = pd.read_csv('data.csv') dfout = pd.DataFrame(columns=['Subunit', 'Recommends']) for file in [df]: dfn = pd.read_csv(file) matching = [s for s in dfn.columns if "would recommend" in s] if matching: dfn = dfn.rename(columns={matching[0]:'Recommends'}) dfout = pd.concat([dfout, dfn], join="inner") print(dfout)
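
The code in the comment above passes an already-loaded DataFrame (df) to pd.read_csv, which expects a file path or a file-like object, so the ValueError is consistent with that. A minimal sketch of looping over paths instead follows. The temporary folder, the file names and the extra Source column are assumptions made for the demo, and with real data the 'Programme' column would need to be swapped for whatever the files actually use (for example 'Subunit').

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical demo data: two small survey files written to a temp folder.
tmp_dir = tempfile.mkdtemp()
samples = {
    "nursing.csv": "Programme,I would recommend the company to a friend or colleague\nNursing,4\nNursing,3\n",
    "it.csv": "Programme,I would recommend the company to a friend\nIT,5\nIT,4\n",
}
for name, text in samples.items():
    with open(os.path.join(tmp_dir, name), "w", encoding="latin1") as fh:
        fh.write(text)

frames = []
for path in sorted(glob.glob(os.path.join(tmp_dir, "*.csv"))):
    dfn = pd.read_csv(path, encoding="latin1")  # pass the path, not a DataFrame
    matching = [c for c in dfn.columns if "would recommend" in c]
    if matching:
        dfn = dfn.rename(columns={matching[0]: "Recommends"})
        # Record which file each row came from (file name without extension).
        dfn["Source"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(dfn[["Programme", "Recommends", "Source"]])

dfout = pd.concat(frames, ignore_index=True)
print(dfout)
```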

【Solution 2】:

You don't need a loop:

matching = [s for s in dfn.columns if "would recommend" in s]

In "matching" you will find the names of the columns that meet your criteria.
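
Since the question mentions that the headers may not all be worded the same way, one option is a case-insensitive match using the pandas string accessor on the column index. The snippet below is only a sketch with made-up sample data; it also shows how the matched names can be used to select the columns.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample with different capitalisation in the header.
dfn = pd.read_csv(StringIO(
    "Programme,I Would Recommend the company to a friend,Other question\n"
    "IT,4,2\n"
    "IT,5,5\n"
))

# Case-insensitive substring test on every column header.
mask = dfn.columns.str.contains("would recommend", case=False, regex=False)
matching = dfn.columns[mask].tolist()

# Use the matched names to pull out just the columns of interest.
subset = dfn[["Programme"] + matching]
print(matching)
print(subset)
```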

【Comments】: