Faulty conditional check in nearest neighbor calculation? (Answers)

【Question title】: Faulty conditional check in nearest neighbor calculation?
【Posted】: 2021-04-16 03:07:40
【Question description】:

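I am trying to write a function that uses the nearest neighbor algorithm to compute an approximate traveling salesman route through a list of cities, starting from the first city listed. However, every time I run it I get IndexError: list index out of range.

While debugging the error, I found that the value of index stays the same from one loop iteration to the next instead of changing. When it is time to append, the code checks the if not in condition; since it is False, it adds 1 to i and moves on to the next iteration of the loop. Once i reaches a number higher than what exists in the array, I get the error.

So my question is: why does execution not enter the first if not in block? It seems the code ignores it.

For my actual problem, I am reading a file with 317 cities, each with an index and two coordinates. Here is a short sample list of cities for testing: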

Nodelist = [
    (1, 63, 71),
    (2, 94, 71),
    (3, 142, 370),
    (4, 173, 1276),
    (5, 205, 1213),
    (6, 213, 69),
    (7, 244, 69),
    (8, 276, 630),
    (9, 283, 732),
    (10, 362, 69),
]

Here is the code for the function:

def Nearest(Nodelist,Distance,index):
    Time_Calculation = time.time()
    Newarray=[]
    Newarray.append(Nodelist[0])
    for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2))
                if Distance==0:
                    Distance=DisEquation
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])
        Distance=0
    print (time.time() - Time_Calculation)
    return Newarray

The code that calls it:

NearestMethodArr=Nearest(Cities,b,index)
print(NearestMethodArr)
print(len(NearestMethodArr))

The print statements should produce:

[(1, 63, 71), (2, 94, 71), (6, 213, 69), (7, 244, 69), (10, 362, 69), (3, 142, 370), (8, 276, 630), (9, 283, 732), (5, 205, 1213), (4, 173, 1276)]
10

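【Question comments】:

Have you read the help page on creating a minimal, reproducible example? Context is good, but the details you include seem irrelevant (mainly the file-reading code). You also don't include everything needed to test this, namely a (small) sample list of cities. (Also, clearer variable names would help everyone, possibly including yourself.)
This is the best way I have found so far to describe my problem, but I will work on it some more. I will edit it and add a small list of the cities I have and where my error occurs. Thanks for the feedback.
@CrazyChucky I hope this edited version is clearer.

Tags: python traveling-salesman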

【Answer 1】:

Newarray.append(Nodelist[0])  # adding Nodelist[0] to Newarray
for i in range(0,len(Nodelist)):
    for j in range(1,len(Nodelist)):
        if (Nodelist[j] not in Newarray):
            DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2))  # you access Newarray at i
            if Distance==0:
                Distance=DisEquation
            if Distance > DisEquation:
                index=j
                Distance=DisEquation
    if(Nodelist[index] not in Newarray):
        Newarray.append(Nodelist[index])  # you conditionally add new elements to Newarray

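If you look at the comments I added to your code, the problem should be clear. You iterate over all elements of Nodelist, using i as the index. You have already added one element to Newarray, so on the first pass index 0 exists. Then you reach the Nodelist[index] not in Newarray check. If it is true, Newarray grows by 1 and Newarray[1] works on the next pass. If for any reason it is not true, Newarray stays the same size and the next Newarray[i] raises an index out of range error.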
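The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the asker's code: grow_conditionally and its should_append flags are hypothetical names, standing in for a loop that reads path[i] on every pass but appends only some of the time.

```python
def grow_conditionally(should_append):
    """Read path[i] on every pass, but append only when the flag is true."""
    path = ["start"]
    for i, flag in enumerate(should_append):
        current = path[i]  # raises IndexError as soon as len(path) <= i
        if flag:
            path.append(f"after {current}")
    return path


# Every pass appends, so path[i] always exists.
print(grow_conditionally([True, True, True]))

# The second pass skips the append, so the third pass reads path[2],
# which does not exist.
try:
    grow_conditionally([True, False, True])
except IndexError as error:
    print("IndexError:", error)
```

Reading path[-1] instead of path[i] would remove the coupling between the loop counter and the length of the list.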

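Edit: Thanks to CrazyChucky for setting me straight in the comments. I have adjusted things below.

My comment about the failure was correct, though I had not identified the root cause, which was that index was not being set, as the author determined. I did not parse the code correctly in my head. The code in the updated version works, but it would be faster and easier to read if you did the following: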

import time

def new_nearest(Nodelist):
    start_time = time.time()
    shortest_path = [Nodelist[0]]
    Nodelist = Nodelist[1:]
    while len(Nodelist) > 0:
        shortest_dist_sqr = -1
        next_node = None
        for potential_dest in Nodelist:
            dist_sqr = (shortest_path[-1][1] - potential_dest[1])**2 + (shortest_path[-1][2] - potential_dest[2])**2 #you don't keep the distances so there is no need to sqrt as if a > b then a**2 > b**2
            if shortest_dist_sqr < 0 or dist_sqr < shortest_dist_sqr:
                next_node = potential_dest
                shortest_dist_sqr = dist_sqr
        shortest_path.append(next_node)
        Nodelist.remove(next_node)
    print(time.time() - start_time)
    return shortest_path

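This returns the same results but executes faster. Switching to removing nodes from the list that the inner loop walks makes it clearer what is happening. It might have made the code a bit slower (it would in C, but Python has a lot of overhead in various places that can make this a net gain). And because you do not store the actual distance, there is no need to compute it: you can compare squared distances and skip the square root entirely. If you do want the distance, you can take the square root for the nearest node once you have determined which one it is.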
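The claim that squared distances pick the same nearest node can be checked with a short sketch. The helper names distance and distance_squared are illustrative, and the cities are a subset of the sample list from the question.

```python
import math

# (id, x, y) tuples taken from the question's sample list.
cities = [(1, 63, 71), (2, 94, 71), (3, 142, 370), (6, 213, 69)]


def distance(a, b):
    """Euclidean distance between two (id, x, y) tuples."""
    return math.hypot(a[1] - b[1], a[2] - b[2])


def distance_squared(a, b):
    """Squared Euclidean distance; no square root needed."""
    return (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2


current, candidates = cities[0], cities[1:]
nearest_by_distance = min(candidates, key=lambda c: distance(current, c))
nearest_by_square = min(candidates, key=lambda c: distance_squared(current, c))

# Squaring is monotonic for non-negative values, so both keys agree.
print(nearest_by_distance == nearest_by_square)

# The real distance is recovered with one square root at the end.
print(math.sqrt(distance_squared(current, nearest_by_square)))
```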

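Edit: I couldn't help but check. Removing the nodes from Nodelist actually accounts for most of the time saved, while dropping the sqrt does reliably speed things up (I used timeit and varied the code). In a lower-level language, tiny operations are extremely fast, so it would likely be faster to leave the array alone and skip over elements that are already used. (This might not actually be true, because it would interfere with branch prediction; performance analysis is really hard and depends on the processor architecture you are using.) In Python, though, even small things are very expensive (to add two variables: figure out what types they are, decode variable-byte-length integers, add, create a new object for the result...). So even though removing a value from a list may cost more than skipping values and leaving the list alone, skipping leads to more small operations, and those are very slow in Python. In a low-level language you could also exploit the fact that the order of the nodes is arbitrary (apart from which one is first): keep a single array with all of the nodes and, instead of creating a new, smaller array, track the length of the part of the array still in use and copy the last value in that part over the value selected as the next node in the path.

Edit yet again :P : I couldn't help but be curious again. The approach of overwriting part of the node list instead of removing entries made me wonder whether it would be faster in Python, since it creates more of the kind of work that is slow in Python but reduces the work involved in removing node elements. It turns out that this approach gives a noticeable speedup even in Python (not a large one, a little under 2%, but consistent in micro benchmarks), so the code below is faster still: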

def newer_nearest(Nodelist):
    shortest_path = [Nodelist[0]]
    Nodelist = Nodelist[1:]
    while len(Nodelist) > 0:
        shortest_dist_sqr = -1
        next_node = None
        for index, potential_dest in enumerate(Nodelist):
            dist_sqr = (shortest_path[-1][1] - potential_dest[1])**2 + (shortest_path[-1][2] - potential_dest[2])**2 #you don't keep the distances so there is no need to sqrt as if a > b then a**2 > b**2
            if shortest_dist_sqr < 0 or dist_sqr < shortest_dist_sqr:
                next_node = index
                shortest_dist_sqr = dist_sqr
        shortest_path.append(Nodelist[next_node])
        Nodelist[next_node] = Nodelist[-1]
        Nodelist = Nodelist[:-1]
    return shortest_path
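One variation worth noting: Nodelist = Nodelist[:-1] builds a new list on every pass, whereas list.pop() removes the last element in place. The sketch below applies the same overwrite-with-last idea using pop(). The name swap_pop_nearest is hypothetical, and whether it is measurably faster is not something this sketch claims; that would need to be timed.

```python
def swap_pop_nearest(nodes):
    """Nearest neighbor path; nodes are (id, x, y) tuples."""
    path = [nodes[0]]
    remaining = list(nodes[1:])  # private copy, the caller's list is untouched
    while remaining:
        _, x1, y1 = path[-1]
        best_index, best_dist_sqr = 0, None
        for index, (_, x2, y2) in enumerate(remaining):
            dist_sqr = (x2 - x1) ** 2 + (y2 - y1) ** 2
            if best_dist_sqr is None or dist_sqr < best_dist_sqr:
                best_index, best_dist_sqr = index, dist_sqr
        path.append(remaining[best_index])
        # Overwrite the chosen slot with the last node, then drop the last slot.
        remaining[best_index] = remaining[-1]
        remaining.pop()
    return path


sample = [
    (1, 63, 71), (2, 94, 71), (3, 142, 370), (4, 173, 1276), (5, 205, 1213),
    (6, 213, 69), (7, 244, 69), (8, 276, 630), (9, 283, 732), (10, 362, 69),
]
print([city_id for city_id, _, _ in swap_pop_nearest(sample)])
```

Because the overwrite reorders the remaining nodes, ties between equally distant nodes may be broken differently than in a version that preserves list order.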

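【Comments】:

Hey! Thanks for the feedback on my code. I solved my problem, but I want to tell you why I added the condition on the append and why it actually works. First, I append Nodelist[0] as the first element of my array so that I can compare the distance between this city and the others and pick the shortest. As for my condition and why I am sure the integer i is the best approach: I know that each time the loop runs it appends one element to my new array, and in the end it reaches 318 elements, the same as the old array (Nodelist).
The condition is there because, after reaching the maximum, the loop runs 3 more times and adds the element at the last index to the end of the new array.
If you are sure the code path that appends a value to Newarray will always be taken, then there is no reason for the if; you can append every time without checking whether Nodelist[index] is already in Newarray. Also, if you use the loop structure I showed and there are no duplicate nodes, the distance can never be 0, so you can remove everything related to that.
@DavidOldford That looping approach is indeed more idiomatic, but in this case the algorithm will often need to compare a node against nodes that come before it in the list, since the traveling salesman is of course not guaranteed to traverse the node list in order.
Oh, you are right, sorry. I also got confused and thought you were the author; I have been working 11+ hour days for a while and am still a bit dazed. So they skip the first node because they know they have already visited it; it looked to me as if they were just generating a lookup table of distances between nodes. So NewArray is actually the path. If they checked NewArray at the start of the loop rather than at the end, or simply maintained a list or set of unvisited nodes, removing values from it as they go and looping over that, it would be easier to read and more efficient.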

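【Answer 2】:

David Oldford's answer beat me to most of the functional improvements, but I want to address some specific things that would make your code cleaner and more Pythonic. Readability counts.

The main improvements are as follows:

There is rarely, if ever, a need to use for i in range(len(sequence)) in Python. Doing so goes against the spirit in which Python's for loops were designed. If you need the index, use for i, element in enumerate(sequence). (When you don't need the index, just use for element in sequence.)
Give variables clear names (names that say what they are) and format them consistently (preferably snake_case, but if you prefer camelCase or something else, pick one and stick with it). This doesn't only help others read your code; it helps you when you come back a month later and can't remember why you wrote what you did.
Break up long lines; put spaces around assignments (=) and comparisons (e.g. ==), and after commas; use blank lines (sparingly) to separate blocks of code for readability.
In Python, you can create infinity with float('inf') (or negative infinity with -float('inf')). This is the idiomatic way to find a minimum or maximum by comparison.
Your function operates on a list of nodes/cities. It makes no sense to supply a distance or an index as arguments to the function.
Multiple assignment is often a good way to make code clearer by giving descriptive names to indexed values. It is much easier to read (x2 - x1)**2 + (y2 - y1)**2 and understand what it does than pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2).
I wouldn't include this many comments in my own code if it weren't for a post like this. But some comments explaining control flow can be helpful.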

import time

def nearest_neighbor_path(cities):
    """Find traveling salesman path via nearest neighbor algorithm."""
    start_time = time.time()

    # Start at the first city, and maintain a list of unvisited cities.
    path = [cities[0]]
    remaining_cities = cities[1:]

    # Loop until every city has been visited. (An empty list evaluates
    # to False.)
    while remaining_cities:
        # In each loop, set starting coordinates to those of the current
        # city, and initialize shortest distance to infinity.
        _, x1, y1 = path[-1]
        shortest_so_far = float('inf')
        nearest_city = None

        # Investigate each possible city to visit next.
        for index, other_city in enumerate(remaining_cities):
            # Since we don't need the *actual* distance, only a
            # comparison, there's no need to take the square root.
            # (Credit to David Oldford, good catch!)
            _, x2, y2 = other_city
            distance_squared = (x2 - x1)**2 + (y2 - y1)**2

            # If it's the closest one we've seen so far, record it.
            if distance_squared < shortest_so_far:
                shortest_so_far = distance_squared
                index_of_nearest, nearest_city = index, other_city

        # After checking all possible destinations, add the nearest one
        # to the path...
        path.append(nearest_city)
        # ...and remove it from the list of remaining cities. This could
        # simply be remaining_cities.remove(nearest_city), in which case
        # we wouldn't need the index or enumerate() at all. But doing it
        # this way saves an extra iteration to find the city again in
        # the list.
        remaining_cities.pop(index_of_nearest)
    
    print(f'Elapsed time: {time.time() - start_time :f} seconds')
    return path

If you don't mind importing NumPy for its argmin function, you can simplify the while loop further, like this:

import numpy as np
import time

def nearest_neighbor_path(cities):
    """Find traveling salesman path via nearest neighbor algorithm."""
    start_time = time.time()
    path = [cities[0]]
    remaining_cities = cities[1:]

    while remaining_cities:
        _, x1, y1 = path[-1]
        distances = [(x2 - x1)**2 + (y2 - y1)**2
                     for _, x2, y2 in remaining_cities]
        index_of_nearest = np.argmin(distances)
        
        nearest_city = remaining_cities.pop(index_of_nearest)
        path.append(nearest_city)
    
    print(f'Elapsed time: {time.time() - start_time :f} seconds')
    return path
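If adding a NumPy dependency for a single call is not desirable, the same step can be sketched with the standard library only. The argmin helper below is a hypothetical stand-in written for this example, not a built-in; like np.argmin, it returns the first index when several values tie for the minimum. Timing is left out to keep the sketch short.

```python
def argmin(values):
    """Index of the smallest value; the first one wins on ties."""
    return min(range(len(values)), key=values.__getitem__)


def nearest_neighbor_path(cities):
    """Find traveling salesman path via nearest neighbor algorithm."""
    path = [cities[0]]
    remaining_cities = list(cities[1:])

    while remaining_cities:
        _, x1, y1 = path[-1]
        distances = [(x2 - x1)**2 + (y2 - y1)**2
                     for _, x2, y2 in remaining_cities]
        path.append(remaining_cities.pop(argmin(distances)))

    return path


sample = [
    (1, 63, 71), (2, 94, 71), (3, 142, 370), (4, 173, 1276), (5, 205, 1213),
    (6, 213, 69), (7, 244, 69), (8, 276, 630), (9, 283, 732), (10, 362, 69),
]
print(nearest_neighbor_path(sample))
```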

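I also recommend taking a look at the official Python style guide, PEP 8. It is a very good set of guidelines for keeping Python code clear and readable. The easier your code is to read, the easier it is to spot problems and solutions.

【Comments】:

Conceptually, you would expect removing an item from a set to be faster than removing it from a list, but modern processors are really good at doing lots of very simple things and not as good at more complicated things. Sets are more complicated but reduce the amount of work, while lists involve a lot of copying and nothing complicated. Even for fairly large numbers of values, lists are indeed faster.
import timeit
def rem_list(a):
    a.remove(0)
def rem_set(a):
    a.remove(0)
def run_bench(n):
    a = [i for i in range(n)]
    print(f"list:{timeit.timeit(lambda: rem_list(a.copy()), number=1000)} set:{timeit.timeit(lambda: rem_set(set(a.copy())), number=1000)}")
for n in [10, 100, 1000, 1000000]:
    run_bench(n)
Lists come out ahead even with a million elements; in fact, unexpectedly, the larger the set/list, the further ahead the list seems to be. Lists may be optimized for removing values from the start, but I get the same result if I pull from the middle.
@DavidOldford Fascinating... I just tested with up to 10,000 cities, and I got identical timings for both. I really did not expect that. Everything I have read in the past has consistently warned against adding or removing elements in the middle of a list. I have edited my answer, thanks!
Oh, I hadn't considered that we are also iterating over the set or list, and iteration over sets is generally slower. That is surely part of it too.
I need to walk back my earlier results. I made a mistake and was converting to a set inside the timeit lambda. Once I fixed that, sets were indeed faster for removing something from the middle of the data when you remove by value rather than by index, which is another thing I forgot to consider. If you already know where a value is, even if it is in the middle of the data set, lists are faster at removing early values and faster when removing by index. So I was correct that removing elements is faster; I just forgot the lookup time for finding them by value, and that lists are better for iteration.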
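The correction in the last comment, building the set outside the timed statement, can be sketched with timeit.repeat, whose setup code runs before each repetition and is not included in the measurement. This is an illustrative harness under stated assumptions, not the commenter's actual benchmark: time_removal and run_bench are hypothetical names, and no particular outcome is claimed, since results depend on the machine and the Python version.

```python
import timeit


def time_removal(make_container, value, repeat=200):
    """Best observed time to remove `value` by value from a fresh container."""
    timings = timeit.repeat(
        stmt="container.remove(value)",
        setup="container = make_container()",  # built outside the timed code
        repeat=repeat,
        number=1,  # one removal per fresh container
        globals={"make_container": make_container, "value": value},
    )
    return min(timings)


def run_bench(n, repeat=200):
    """Time removing the middle value from a list and from a set of size n."""
    data = list(range(n))
    middle = n // 2
    list_time = time_removal(lambda: data.copy(), middle, repeat)
    set_time = time_removal(lambda: set(data), middle, repeat)
    return list_time, set_time


if __name__ == "__main__":
    for n in (10, 100, 1000, 100000):
        list_time, set_time = run_bench(n)
        print(f"n={n}: list {list_time:.3e} s, set {set_time:.3e} s")
```

Using number=1 with a fresh container per repetition matters here, because the value can only be removed once from each container.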

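【Answer 3】:

I found what was wrong with my code. When I reassign the distance in the Distance==0 case, I forgot that I also need to reassign index along with it, because the first city tested may be the one with the shortest distance. In my first version, I reassigned the variable index only inside the if Distance > DisEquation branch.

New code: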

import math
import time

def Nearest(Nodelist,Distance,index):
    Time_Calculation = time.time()
    Newarray=[]
    Newarray.append(Nodelist[0])
    for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2))
                if Distance==0:
                    Distance=DisEquation
                    index=j
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])
        Distance=0
    print (time.time() - Time_Calculation)
    return Newarray

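【Comments】:

Why are Distance and index parameters, and what do you supply for them?
Also, if you only add nodes to the final list when they aren't already in it, are you really finding the nearest neighbor? Two different nodes can each have the same nearest neighbor. (Picture three nodes in a row. The nearest neighbor of the left node is the middle node. The nearest neighbor of the right node is also the middle node.)
When I thought about creating this function I had a lot of trouble with variables, especially since it is Python, so I usually like to send variables as arguments to avoid any problems or errors with Python.
That is true, but my goal is not to create a complete graph. Here I just want to start from city 0 and pass through every city, so I pick the city closest to each city and find the shortest path. I forgot to explain further that this is a shortest path algorithm.
So here I will compute the distances and find the shortest path that lets me visit every city.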
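The question raised in the comments about the Distance and index parameters can be illustrated with a sketch that keeps both values as local variables. This is one possible restructuring, not the asker's code: the name nearest and the use of None as the "nothing found yet" marker are choices made for this example, and it assumes the list contains no duplicate cities.

```python
import math


def nearest(node_list):
    """Nearest neighbor path over (id, x, y) tuples, starting at node_list[0]."""
    path = [node_list[0]]
    for _step in range(len(node_list) - 1):
        _, x1, y1 = path[-1]
        best_index = None      # local, replaces the `index` parameter
        best_distance = None   # local, replaces the `Distance` parameter
        for j in range(1, len(node_list)):
            if node_list[j] in path:
                continue  # already visited
            _, x2, y2 = node_list[j]
            distance = math.hypot(x2 - x1, y2 - y1)
            if best_distance is None or distance < best_distance:
                best_index, best_distance = j, distance
        path.append(node_list[best_index])
    return path


sample = [
    (1, 63, 71), (2, 94, 71), (3, 142, 370), (4, 173, 1276), (5, 205, 1213),
    (6, 213, 69), (7, 244, 69), (8, 276, 630), (9, 283, 732), (10, 362, 69),
]
print(nearest(sample))
```

Because None rather than 0 marks "no candidate yet", the index and the distance are always updated together, and a genuine distance of 0 would not be mistaken for the unset state.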