在范围字典中查找值 - python答案

【问题标题】：Finding a value within a dictionary of ranges - python在范围字典中查找值 - python
【发布时间】：2013-03-13 17:44:04
【问题描述】：

我将 2 个文件与初始标识符列、起始值和结束值进行比较。第二个文件包含相应的标识符和另一个值列。

例如。

文件 1：

A     200     900
A     1000    1200
B     100     700
B     900     1000

文件 2：

我想从第二个文件中查找包含在第一个文件中找到的范围内的所有值，以便我的输出如下所示：

现在我已经从第一个文件中创建了一个包含范围列表的字典：例如。

if Identifier in Dictionary:
    Dictionary[Identifier].extend(range(Start, (End+1)))
else:
    Dictionary[Identifier] = range(Start, (End+1))

然后我浏览第二个文件并在范围字典中搜索值：例如。

if Identifier in Dictionary:
    if Value in Dictionary[Identifier]:
    OutFile.write(Line + "\n")

虽然这不适用于相对较小的文件，但我有几个大文件，而且这个程序被证明是非常低效的。我需要优化我的程序，让它运行得更快。

【问题讨论】：

标识符在两个文件中出现的顺序是否相同？值和范围是否始终按排序顺序排列？

标签： python dictionary range

【解决方案1】：

from collections import defaultdict
ident_ranges = defaultdict(list)
with open('file1.txt', 'r') as f1
    for row in f1:
        ident, start, end = row.split()
        start, end = int(start), int(end)
        ident_ranges[ident].append((start, end))
with open('file2.txt', 'r') as f2, open('out.txt', 'w') as output:  
    for line in f2:
        ident, value = line.split()
        value = int(value)
        if any(start <= value <= end for start, end in ident_ranges[ident]):
            output.write(line)

注意事项：使用defaultdict 允许您在字典中添加范围，而无需先检查键是否存在。使用any 允许短路范围检查。使用链式比较是一个不错的 Python 语法快捷方式 (start <= value <= end)。

【讨论】：

【解决方案2】：

你需要构造range(START, END)吗？当你可以这样做时，这似乎很浪费：

if START <= x <= END:
    # process

检查值是否在范围内很慢，因为 a) 您必须构建列表 b) 对列表执行线性搜索才能找到它。

【讨论】：

【解决方案3】：

你可以试试这样的：

In [27]: ranges=defaultdict(list)

In [28]: with open("file1") as f:
    for line in f:
        name,st,end=line.split()
        st,end=int(st),int(end)
        ranges[name].append([st,end])
   ....:         

In [30]: ranges
Out[30]: defaultdict(<type 'list'>, {'A': [[200, 900], [1000, 1200]], 'B': [[100, 700], [900, 1000]]})

In [29]: with open("file2") as f:
    for line in f:
        name,val=line.split()
        val=int(val)
        if any(y[0]<=val<=y[1] for y in ranges[name]):
            print name,val
   ....:             
A 200
A 250
B 100
B 150

【讨论】：

【解决方案4】：

巧妙的技巧：Python 允许您将 in 与 xrange 对象进行比较，这比使用 range 进行 in 快得多，并且内存效率更高。

所以，你可以这样做

from collections import defaultdict

rangedict = defaultdict(list)
...
rangedict[ident].append(xrange(start, end+1))
...

for i in rangedict:
    for r in rangedict[i]:
        if v in r:
            print >>outfile, line

【讨论】：

【解决方案5】：

由于您的范围很大，而您的问题本质上只是一堆比较，因此存储开始/结束元组几乎肯定比整个范围更快（尤其是因为您现在拥有的内容将复制大部分如果两个碰巧重叠，则范围内的数字）。

# Building the dict
if not ident in d:
    d[ident] = (lo, hi)
else:
    old_lo, old_hi = d[ident]
    d[ident] = (min(lo, old_lo), max(hi, old_hi))

那么你的比较看起来像：

# comparing...
if ident in d:
    if d[ident][0] <= val <= d[ident][1]:
        outfile.write(line+'\n')

如果您不对if ident in d 进行单独检查，这两个部分会更快。 Python 字典既好又快，所以首先调用它。你有能力为字典提供默认值，所以使用它。我没有对这个或任何东西进行基准测试以了解加速是什么，但你肯定会得到一些，而且它确实有效：

# These both make use of the following somewhat silly hack:
# In Python, None is treated as less than everything (even -float('inf))
# and empty containers (e.g. (), [], {}) are treated as greater than everything.
# So we use the tuple ((), None) as if it was (float('inf'), float('-inf))

for line in file1:
    ident, lo, hi = line.split()
    lo = int(lo)
    hi = int(hi)
    old_lo, old_hi = d.get(ident, ((), None))
    d[ident] = (min(lo, old_lo), max(hi, old_hi))

# comparing:
for line in file2:
    ident, val = line.split()
    val = int(val)
    lo, hi = d.get(ident, ((), None))
    if lo <= val <= hi:
        outfile.write(line) # unless you stripped it off, this still has a \n

上面的代码是我用来测试的；它在几秒钟内运行在一百万行的file2。

【讨论】：