As a Python beginner, I don't understand why I get an infinite loop with while?

【Question title】: As a beginning Pythoner I don't understand why I get an infinity loop with while?
【Posted】: 2014-03-05 22:50:54
【Question description】:

This code always ends up in an infinite loop in the while:

pos1 = 0
pos2 = 0
url_string = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
i = int(len(url_string))
#print i  # debug
while i > 0:
    pos1 = int(url_string.find('>'))
    #print pos1 # debug
    pos2 = int(url_string.find('<', pos1))
    #print pos2  # debug
    url_string = url_string[pos2:]
    #print url_string  # debug
    print int(len(url_string))  # debug
    i =  int(len(url_string))

I have tried everything, with no result.

More information:

Python 2.7.5+ (default, Sep 19 2013, 13:48:49)
[GCC 4.8.1] on linux2
Ubuntu 13.10
Running in GNOME Terminal 3.6.1 (also tried in Emacs and PyCharm, but that did not fix the infinite loop)

【Question comments】:

What is your debugging output? It should be a big hint. print url_string
Note: the int conversions are unnecessary, and HTML code is not a "url".

Tags: python-2.7 while-loop infinite-loop

【Solution 1】:

pos1 = int(url_string.find('>'))
pos2 = int(url_string.find('<', pos1))

You are finding the first < that occurs after the first >. There is not always a < after the first >. When find cannot find a <, it returns -1, so that:

url_string = url_string[pos2:]

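uses url_string[-1:], a slice consisting of the last character of url_string. From that point on, Python keeps looping, failing to find a < and taking the last character of url_string, until you get bored and hit Ctrl+C.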
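To make the failure concrete, here is a small sketch that replays the last two passes of the loop by hand, starting from the final '</p>' of the question's string. The variable names `tail` and `stuck` are only for illustration.

```python
# The remainder of the question's string just before the loop gets stuck.
tail = '</p>'

pos1 = tail.find('>')         # 3: the only '>' is the last character
pos2 = tail.find('<', pos1)   # -1: there is no '<' at or after index 3
tail = tail[pos2:]            # tail[-1:] keeps just the last character

# Every later pass repeats the same thing on the one-character string.
pos1 = tail.find('>')         # 0
pos2 = tail.find('<', pos1)   # -1 again
stuck = tail[pos2:]           # still one character long, so i never reaches 0
```

Because the string never shrinks below one character, the condition i > 0 stays true forever.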

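It is not clear what the fix is, because it is not even clear what you are trying to do. You could use while i > 1; or you could swap > and < in the computations of pos1 and pos2 and use url_string = url_string[pos2+1:]; or you might do something else entirely. It depends on what you want to achieve.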
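As a rough sketch of the second suggestion (swapping < and > and slicing from pos2+1), assuming the aim is simply to consume the string one tag at a time: the function name `consume_tags` and the explicit check for -1 are additions for illustration, not part of the answer. The check keeps the loop from hanging on input that has no complete tag left.

```python
def consume_tags(url_string):
    """Drop everything up to and including the next tag, one tag per pass.

    Returns the remaining length after each pass.
    """
    lengths = []
    while len(url_string) > 0:
        pos1 = url_string.find('<')
        pos2 = url_string.find('>', pos1) if pos1 != -1 else -1
        if pos2 == -1:
            # No complete tag left: stop instead of looping forever.
            break
        url_string = url_string[pos2 + 1:]
        lengths.append(len(url_string))
    return lengths


html = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
remaining = consume_tags(html)
```

For the question's string the remaining length shrinks from 56 down to 0, so the loop ends on its own.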

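【Discussion】:

【Solution 2】:

As @user2357112 pointed out above, you never get past the end of the string.

There are several solutions, but a simple one (given that I don't really know what you are trying to achieve) is to make the loop condition aware of pos1 and pos2.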

while i > 0 and pos1 >= 0 and pos2 >= 0:

The loop then stops as soon as either of the characters you are looking for is not found.
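A minimal runnable sketch of that loop condition, keeping the body of the question's loop unchanged. The wrapper function `trim_until_not_found` and the `passes` counter are hypothetical, added here only so the result can be inspected.

```python
def trim_until_not_found(url_string):
    """Run the question's loop, but also stop once find() has returned -1."""
    pos1 = 0
    pos2 = 0
    i = len(url_string)
    passes = 0
    while i > 0 and pos1 >= 0 and pos2 >= 0:
        pos1 = url_string.find('>')
        pos2 = url_string.find('<', pos1)
        url_string = url_string[pos2:]
        i = len(url_string)
        passes += 1
    return passes, url_string


html = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
passes, rest = trim_until_not_found(html)
```

Note that the slice with pos2 == -1 still runs once before the condition is re-checked, so a single trailing character is left over when the loop exits.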

【Discussion】:

【Solution 3】:

It is easier to split the string and count the number of characters in each piece, like this:

map(len, url_string.split('<')) # This equals [0, 14, 4, 25, 3, 5, 3]

This is not quite what you want. You want the cumulative sum of this list. Get it like this:

import numpy as np
lens = np.cumsum( map(len, url_string.split('<')) )

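We are not quite there yet. You also need to add back the missing '<' characters that were dropped from the string when splitting on them:

lens = lens + np.arange(len(lens))

This should work for single-character splits.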
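The same bookkeeping can be sketched with a plain loop in place of NumPy's cumsum, which makes it easy to check that the numbers really are the positions of each '<', with the total length as the final entry. The names `pieces`, `positions` and `expected` are illustrative.

```python
url_string = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''

pieces = url_string.split('<')
positions = []
running_total = 0
for k, piece in enumerate(pieces):
    running_total += len(piece)
    # k separators were removed before the end of piece k, so add them back.
    positions.append(running_total + k)

# Direct search for comparison: index of every '<' in the original string.
expected = [index for index, char in enumerate(url_string) if char == '<']
```

All entries of `positions` except the last should match `expected`; the last entry is the length of the whole string.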

Edit

As was pointed out, the requirement is just to extract the content that is not part of a tag. In that case a one-liner...

''.join( map(lambda x: x.split('>')[-1] ,  url_string.split('<')) )

should do the job. Thanks for pointing that out!
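Written out as a small function with a generator expression, the same idea might look like this. The name `strip_tags` is hypothetical, and like the one-liner it assumes there are no stray '<' or '>' characters outside of tags.

```python
def strip_tags(markup):
    """Keep only the text that follows the last '>' in each '<'-separated piece."""
    return ''.join(piece.split('>')[-1] for piece in markup.split('<'))


html = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
text = strip_tags(html)
```

The pieces that are pure tags (such as '/h1>') contribute an empty string, so only the text between tags survives the join.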

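【Discussion】:

For multiple characters, e.g. splitting at '<p>', the last line needs to be changed to lens = lens + len('<p>') * np.arange(len(lens))
Doesn't this include all the tags as well? I do believe the idea is to output everything that is not a tag.
These are the positions of the tags. I think I misunderstood the question. Give me a second. I'll look at the code again...
In that case, the one-liner ''.join( map(lambda x: x.split('>')[-1] , url_string.split('<')) ) should do it.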

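【Solution 4】:

It looks like you are trying to parse HTML to get data out of the elements (for example, I want the data inside the h1 tag, such as "Daily News"). If that is the case, I would recommend using another library called BeautifulSoup4, documented at this link: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start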
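To illustrate the parser-based approach without a third-party dependency, here is a sketch that uses the standard library's html.parser module rather than BeautifulSoup4 itself, so it is a stand-in for the recommendation above and not the BeautifulSoup API. The class name `TextCollector` is hypothetical. The import is written for Python 3; on Python 2.7 the module is named HTMLParser.

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


collector = TextCollector()
collector.feed('<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>')
collector.close()
text_chunks = collector.chunks
```

A real parser copes with attributes, comments and nested tags that hand-written find() loops tend to trip over.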

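That said, since I am not sure what the program is for, I broke your code down so that you can more easily see what is happening to the variables (with the loop taken out for now). This lets you see exactly what your code does without getting stuck in an infinite loop.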

# Setup Variables
pos1 = 0
pos2 = 0
url_string = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
i = int(len(url_string)) # the url_string length is 60 characters
print "Setting up Variables with string at ", i, " characters"
print "String is: ", url_string

"""string.find(s, sub[, start[, end]])
Return the lowest index in s where the substring sub is found such that sub is 
wholly contained in s[start:end]. Return -1 on failure. Defaults for start and 
end and interpretation of negative values is the same as for slices.

Source: http://docs.python.org/2/library/string.html
"""

print "Running through program first time"
pos1 = int(url_string.find('>'))
# This finds the first occurrence of '>', which is at position 3

pos2 = int(url_string.find('<', pos1))
# This finds the first occurrence of '<' after position 3 ('>'),
# which is at position 15
print "Pos1 is at:", pos1, " and pos2 is at:", pos2

url_string = url_string[pos2:] # trimming string down?
print "The string is now: ", url_string
# </h1><p>This is the daily news.</p><p>end</p>

print "The string length is now: ", int(len(url_string)) # string length now 45
i = int(len(url_string)) # updating the length var to the new length

Here is what it looks like in the terminal:

【Discussion】: