Pythonic way to retrieve case sensitive path?

【Question title】: Pythonic way to retrieve case sensitive path?
【Posted】: 2013-01-23 20:05:54
【Question】:

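I was wondering whether there is a faster way to implement a function in Python that returns a path with its real, case-sensitive spelling. One of the solutions I came up with works on both Linux and Windows, but it requires me to iterate over os.listdir, which can be slow.

This solution is fine for applications and contexts where speed is not a concern: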

import os

def correctPath(start, path):
    'Returns a unix-type case-sensitive path, works in windows and linux'
    start = unicode(start);
    path = unicode(path);
    b = '';
    if path[-1] == '/':
        path = path[:-1];
    parts = path.split('\\');
    d = start;
    c = 0;
    for p in parts:
        listing = os.listdir(d);
        _ = None;
        for l in listing:
            if p.lower() == l.lower():
                if p != l:
                    c += 1;
                d = os.path.join(d, l);
                _ = os.path.join(b, l);
                break;
        if not _:
            return None;
        b = _;

    return b, c; #(corrected path, number of corrections)

>>> correctPath('C:\\Windows', 'SYSTEM32\\CmD.EXe')
(u'System32\\cmd.exe', 2)

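However, this will not be fast enough when the context is gathering filenames from a large database of more than 50,000 entries.

One approach would be to build a tree of dicts, one per directory. Match the dict tree against the directory parts of the path; on a key miss, call os.listdir for the new directory and create a dict entry for it. Unused parts could then be removed, or a counter kept per directory as a way of assigning each one a "lifetime".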
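
A minimal sketch of the per-directory dictionary idea, written for Python 3 rather than the Python 2 used in the question. The class name `ListingCache`, its `correct` method and the `listdir_calls` counter are made up for illustration, and the eviction or "lifetime" part is left out, so treat it as a starting point rather than a finished implementation:

```python
import os


class ListingCache:
    """Caches one lowercase-to-real-name mapping per directory visited."""

    def __init__(self):
        self._listings = {}      # directory path -> {lowercased name: real name}
        self.listdir_calls = 0   # how many times the filesystem was actually hit

    def _names(self, directory):
        # Only call os.listdir the first time a directory is seen.
        if directory not in self._listings:
            self.listdir_calls += 1
            try:
                entries = os.listdir(directory)
            except OSError:      # missing, unreadable, or not a directory
                entries = []
            # If two entries differ only by case, the last one listed wins.
            self._listings[directory] = {name.lower(): name for name in entries}
        return self._listings[directory]

    def correct(self, start, path):
        """Return the real-case relative path, or None if a part is missing."""
        current = start
        corrected = []
        for part in path.replace("\\", "/").strip("/").split("/"):
            real = self._names(current).get(part.lower())
            if real is None:
                return None
            corrected.append(real)
            current = os.path.join(current, real)
        return "/".join(corrected)
```

With something like this, each directory is listed at most once however many database paths pass through it. The trade-off is that the listings stay in memory and go stale if files are renamed while the cache is alive.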

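【Comments】:

As a side note, PEP 257 (which PEP 8 points to for docstring conventions) says you should always surround your docstrings with triple double quotes, like so: """Returns a unix-type case-sensitive path, works in windows and linux"""
Another note: in Python, forward slashes are valid in paths, including on Windows. So you can always use / internally and only render `\` if and when you need it outside Python.
Sorry, one more note. What's with all the semicolons? You don't actually need them unless you are putting multiple statements on a single line, and even then it is considered bad practice.
The database I'm working with uses Windows paths, so I don't think I need to worry too much about that. I'm converting a metadata database from Winamp to Rhythmbox. Thanks for the docstring tip too, nobody tells me these things. As for the semicolons, I can't help it. If I ever release my program, I promise I'll remove them.
Sounds like a fun project.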
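
On the forward-slash point raised in the comments above, here is a small sketch of converting the database's Windows-style paths to forward slashes. It assumes Python 3's pathlib, which postdates the code in the question, and the helper name `to_forward_slashes` is hypothetical:

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Render a Windows-style path string with forward slashes."""
    # PureWindowsPath only parses the string; it never touches the filesystem,
    # so this also works when the conversion runs on Linux.
    return PureWindowsPath(windows_path).as_posix()


print(to_forward_slashes('SYSTEM32\\CmD.EXe'))  # SYSTEM32/CmD.EXe
```

This only normalises separators. It does nothing about letter case, which still needs a directory listing.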

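Tags: python linux windows path directory

【Solution 1】:

The following is a slight rewrite of your own code with three modifications: it checks whether the filename is already correct before matching, lowercases the listing once before testing, and uses an index to find the corresponding "real case" file.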

import os

def corrected_path(start, path):
    '''Returns a unix-type case-sensitive path, works in windows and linux'''
    start = unicode(start)
    path = unicode(path)
    corrected_path = ''
    if path[-1] == '/':
        path = path[:-1]
    parts = path.split('\\')
    cd = start
    corrections_count = 0

    for p in parts:
        if not os.path.exists(os.path.join(cd,p)): # Check it's not correct already
            listing = os.listdir(cd)

            cip = p.lower()
            cilisting = [l.lower() for l in listing]

            if cip in cilisting:
                l = listing[ cilisting.index(cip) ] # Get our real folder name
                cd = os.path.join(cd, l)
                corrected_path = os.path.join(corrected_path, l)
                corrections_count += 1
            else:
                return False # Error, this path element isn't found
        else:
            cd = os.path.join(cd, p)
            corrected_path = os.path.join(corrected_path, p)

    return corrected_path, corrections_count

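I'm not sure whether this will be much faster, although there is slightly less testing going on, and the "already correct" check at the start may help.

【Comments】:

Not sure whether lowercasing them all at once is faster than lowercasing on every check, but it certainly looks nicer.
Yes, although I think it would be fairly quick even for large listings. If you are processing more than one path, the biggest speed-up will come from caching the corrections at each level. I would store each step of the tree (corrected_path, plus the equivalent uncorrected_path built from p) and do a lookup against it before starting the walk. I can write an example if you like?
Do you mean storing previously corrected paths in a cache to minimise os.listdir calls? If so, I already have that implemented somewhere, but it wouldn't hurt to post another.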
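
On the open question above of whether lowercasing the whole listing up front beats lowercasing per comparison, one rough way to find out is to time both on a synthetic listing. This is only a sketch in Python 3; the function names, the 5,000-entry listing and the repeat count are arbitrary choices, and the numbers will vary by machine:

```python
import timeit

listing = ["File%05d.TXT" % i for i in range(5000)]  # synthetic directory listing
target = "file04999.txt"  # worst case: matches the last entry


def per_check(listing, wanted):
    """Lowercase each entry as it is compared, stopping at the first match."""
    wanted = wanted.lower()
    for name in listing:
        if name.lower() == wanted:
            return name
    return None


def lower_once(listing, wanted):
    """Lowercase the whole listing first, then look the name up by index."""
    lowered = [name.lower() for name in listing]
    wanted = wanted.lower()
    if wanted in lowered:
        return listing[lowered.index(wanted)]
    return None


if __name__ == "__main__":
    for func in (per_check, lower_once):
        seconds = timeit.timeit(lambda: func(listing, target), number=200)
        print(func.__name__, round(seconds, 4))
```

Either way, both variants still need one os.listdir call per path component, which is why caching across paths is likely to matter more than this choice.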

【Solution 2】:

An extended version with a case-insensitive cache for pulling out the corrected paths:

import os
import re

def corrected_paths(start, pathlist):
    ''' This wrapper function takes a list of paths (rather than a single path) so that corrections can be cached across them '''

    start = unicode(start)
    pathlist = [unicode(path[:-1]) if path[-1] == '/' else unicode(path) for path in pathlist ]

    # Use a dict as a cache, storing oldpath > newpath for first-pass replacement
    # with path keys from incorrect to corrected paths
    cache = dict() 
    corrected_path_list = []
    corrections_count = 0
    path_split = re.compile(r'(?:/+|\\+)') # Runs of / or \ ; non-capturing so split() drops the separators

    for path in pathlist:
        cd = start
        corrected_path = ''
        parts = path_split.split(path)

        # Pre-process against the cache
        for n,p in enumerate(parts):
            # We pass *parts to send through the contents of the list as a series of strings
            uncorrected_path= os.path.join( cd, *parts[0:len(parts)-n] ).lower() # Walk backwards
            if uncorrected_path in cache:
                # Move up the basepath to the latest matched position
                cd = os.path.join(cd, cache[uncorrected_path])
                parts = parts[len(parts)-n:] # Retrieve the unmatched segment
                break # First hit, we exit since we're going backwards

        # Fallback to walking, from the base path cd point
        for n,p in enumerate(parts):

            if not os.path.exists(os.path.join(cd,p)): # Check it's not correct already
            #if p not in os.listdir(cd): # Alternative: The above does not work on Mac Os, returns case-insensitive path test

                listing = os.listdir(cd)

                cip = p.lower()
                cilisting = [l.lower() for l in listing]

                if cip in cilisting:

                    l = listing[ cilisting.index(cip) ] # Get our real folder name
                    # Store the path correction in the cache for next iteration
                    cache[ os.path.join(cd,p).lower() ] = os.path.join(cd, l)
                    cd = os.path.join(cd, l)
                    corrections_count += 1

                else:
                    print "Error %s not in folder %s" % (cip, cilisting)
                    return False # Error, this path element isn't found

            else:
                cd = os.path.join(cd, p)

        corrected_path_list.append(cd)

    return corrected_path_list, corrections_count

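In an example run over a set of paths, this considerably reduces the number of directory listings (how much obviously depends on how similar your paths are):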

corrected_paths('/Users/', ['mxF793/ScRiPtS/meTApaTH','mxF793/ScRiPtS/meTApaTH/metapAth/html','mxF793/ScRiPtS/meTApaTH/metapAth/html/css','mxF793/ScRiPts/PuBfig'])
([u'/Users/mxf793/Scripts/metapath', u'/Users/mxf793/Scripts/metapath/metapath/html', u'/Users/mxf793/Scripts/metapath/metapath/html/css', u'/Users/mxf793/Scripts/pubfig'], 14)
([u'/Users/mxf793/Scripts/metapath', u'/Users/mxf793/Scripts/metapath/metapath/html', u'/Users/mxf793/Scripts/metapath/metapath/html/css', u'/Users/mxf793/Scripts/pubfig'], 5)

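While implementing this I realised that on Mac OS X, Python returns path matches as if they were case-insensitive, so the existence test always succeeds. In that case the listdir can be moved up to replace it.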
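
A small sketch of what replacing the existence test with a listing check might look like on a case-insensitive filesystem. It is written for Python 3, and the helper name `exists_exact_case` is made up for illustration:

```python
import os


def exists_exact_case(directory, name):
    """True only if `directory` holds an entry spelled exactly `name`."""
    try:
        # os.listdir reports names as stored on disk, so this comparison is
        # case-sensitive even where os.path.exists would ignore case.
        return name in os.listdir(directory)
    except OSError:  # directory missing, unreadable, or not a directory
        return False
```

The cost is one listing per component even when the spelling is already right, which is the work the existence check was meant to avoid.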

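【Comments】:

Looks good. I really like the regex object for detecting redundant slashes. I think it should be r'(/+|\\+)', though; using [] would inadvertently match the | character.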
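
To illustrate the regex point in the comment above, here is a short comparison in Python 3. The three pattern variable names are only for this example. It also shows a side effect worth knowing about: when the group is a capturing one, re.split keeps the separators in its result.

```python
import re

char_class = re.compile(r'[/+|\\+]')     # character class: any ONE of / + | \
alternation = re.compile(r'(?:/+|\\+)')  # a run of / OR a run of \
capturing = re.compile(r'(/+|\\+)')      # same runs, but the group is captured

# The class treats | as a separator and splits a double slash twice.
print(char_class.split('a|b//c'))          # ['a', 'b', '', 'c']

# The alternation leaves | alone and collapses runs of either slash.
print(alternation.split('a|b//c\\\\d'))    # ['a|b', 'c', 'd']

# A capturing group makes split() return the separators as well.
print(capturing.split('a//b'))             # ['a', '//', 'b']
```

If the split parts are later passed to os.path.join, stray separator entries such as '//' would be treated as path components, so a non-capturing group is the safer form for splitting.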