Python: replace part of file path using pandas match

【Question title】: Python: replace part of file path using pandas match
【Posted】: 2019-04-22 09:31:25
【Question】:

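I have a DataFrame with two columns: old_path and new_path. It can contain hundreds of rows.

A script iterates over a list of files.

For each file in the list, it checks whether any part of the file's folder path matches a value in the old_path column. If there is a match, the matched old_path portion of the file's path is replaced with the corresponding new_path value.

I got this working with for index, row in df.iterrows(): or for row in df.itertuples():, but I think there should be a more efficient way to do it without the second for loop.

Any help is appreciated. The example below uses df.iterrows().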

import pandas as pd
import os

df = pd.read_csv('path_lookup.csv')
# df:
#                                         old_path                      new_path
# 0               F:\Business\Budget & Forecasting  M:\Business\Finance\Forecast
# 1                    F:\Business\Treasury Shared  M:\Business\Finance\Treasury
# 2                                        C:\Temp                    C:\NewTemp

excel_link_analysis_list = [
    {'excel_filename': 'C:\\Temp\\12345\\Distribution Adjusted Claim.xlsx',
     'file_read': 'OK'},
    {'excel_filename': 'C:\\Temp\\SubFolder\\cost estimates.xlsx',
     'file_read': 'OK'}
]

for i in excel_link_analysis_list:
    for index, row in df.iterrows():
        if row['old_path'].lower() in i['excel_filename'].lower():
            dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(), 
                                                                     row['new_path'].lower())
            print(dest_path_and_file)

Prints:

c:\newtemp\12345\distribution adjusted claim.xlsx

c:\newtemp\subfolder\cost estimates.xlsx

【Question discussion】:

Tags: python pandas loops for-loop filepath

【Solution 1】:

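Yes, pandas has good built-in string comparison functions; see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains

This is how you can use Series.str.contains to get the index of the matching values (i.e. from the old_path column). You can then use that index to go back and get the corresponding new_path values.

Edit: updated to handle the case where new_path_matches has a single value.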

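import pandas as pd

# Assumes `df` is the lookup DataFrame from the question and `filenames`
# is a list of path strings.
old_path = df['old_path']
new_path = df['new_path']

for filename in filenames:
    # Boolean mask: True where an old_path value contains `filename`
    # (the argument is treated as a regex by default)
    b = old_path.str.contains(filename)

    # Get the index of matches from the `old_path` column
    indices_of_matches = b[b].index.values

    # Use the index of matches to get the corresponding `new_path` values
    new_path_matches = new_path.loc[indices_of_matches]

    if new_path_matches.values.shape[0] > 0:
        print(new_path_matches.values[0])   # print the new_path value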
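
One possible way to drop the inner loop entirely, offered as a sketch rather than as this answer's method: compile all old_path values into a single case-insensitive regular expression and look up the replacement in a dictionary. The names path_pairs, lookup and replace_old_path are hypothetical, and the sample paths are copied from the question. It assumes the old paths are plain ASCII, so lowercasing the matched text reproduces the dictionary key.

```python
import re

# Hypothetical lookup pairs. With the question's DataFrame they could be
# built with: list(zip(df['old_path'], df['new_path']))
path_pairs = [
    (r"F:\Business\Budget & Forecasting", r"M:\Business\Finance\Forecast"),
    (r"F:\Business\Treasury Shared", r"M:\Business\Finance\Treasury"),
    (r"C:\Temp", r"C:\NewTemp"),
]

# Case-insensitive lookup: lowercased old path -> new path.
lookup = {old.lower(): new for old, new in path_pairs}

# One alternation over every old path. re.escape makes the backslashes
# literal; longer paths come first so they win over their own prefixes.
pattern = re.compile(
    "|".join(re.escape(old) for old in sorted(lookup, key=len, reverse=True)),
    flags=re.IGNORECASE,
)


def replace_old_path(filename):
    """Replace the first old path found in filename with its new path."""
    # A function is used as the replacement so that backslashes in the
    # new path are inserted literally instead of being read as escapes.
    return pattern.sub(lambda m: lookup[m.group(0).lower()], filename, count=1)


filenames = [
    r"C:\Temp\12345\Distribution Adjusted Claim.xlsx",
    r"C:\Temp\SubFolder\cost estimates.xlsx",
]
new_filenames = [replace_old_path(name) for name in filenames]
for name in new_filenames:
    print(name)
```

Unlike the loop in the question, this sketch keeps the original casing of the rest of the path and replaces only the first match. Like the question's substring test, it would also match C:\Temp inside a folder such as C:\Temporary.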

【Discussion】:

Thanks for the quick reply. Running the above with slight modifications raises an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape \T at position 2 on the line: b = old_path.str.contains(i['excel_filename']). I think it is related to the backslashes in the file path.
Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). It still does not match, though. If you run print(b) immediately afterwards, every iteration returns False.
It now matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])), then changing the next line to b = old_path.str.contains(file_path, case=False, regex=False).any(). Now I get a new error: indices_of_matches = b[b].index.values gives AttributeError: 'numpy.ndarray' object has no attribute 'index'
That is because the call to .any() returns a numpy array instead of a Series. What happens if you print b?
Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. The last line: print(new_path_matches[0]) gives KeyError: 0.
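
The three errors in this thread can be reproduced on a small scale. The following is a sketch under assumptions, not the posters' actual script: the DataFrame is rebuilt from the sample rows in the question, and file_path is a hypothetical folder path chosen so that exactly one row matches. It shows that a Windows path is not a valid regex unless escaped, that .any() reduces the mask to a single boolean, and that new_path_matches[0] looks up the label 0 rather than the first position.

```python
import re

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "old_path": [r"F:\Business\Budget & Forecasting",
                 r"F:\Business\Treasury Shared",
                 r"C:\Temp"],
    "new_path": [r"M:\Business\Finance\Forecast",
                 r"M:\Business\Finance\Treasury",
                 r"C:\NewTemp"],
})
old_path = df["old_path"]
file_path = r"C:\Temp"  # hypothetical folder path

# 1. Read as a regex, backslash-T is an invalid escape sequence.
try:
    re.compile(file_path)
    compiles_as_regex = True
except re.error:
    compiles_as_regex = False

# Two ways to match the path literally and case-insensitively.
b = old_path.str.contains(file_path, case=False, regex=False)
b_escaped = old_path.str.contains(re.escape(file_path), case=False)

# 2. .any() collapses the mask to one boolean, so it can answer
#    "is there a match?" but cannot be used to select rows.
has_match = b.any()

# 3. b[b] keeps the original row labels (label 2 here), so the label 0
#    does not exist in the selection and [0] raises KeyError.
new_path_matches = df.loc[b[b].index, "new_path"]
try:
    new_path_matches[0]
    label_lookup_ok = True
except KeyError:
    label_lookup_ok = False

# Positional access works regardless of the labels.
first_match = new_path_matches.iloc[0] if has_match else None
print(first_match)
```

Using .iloc[0] (or .values[0], as in the answer's code) reads the first element by position, which avoids the KeyError when the matching row is not row 0.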