Getting "TypeError: argument should be integer or bytes-like object, not 'str'" when searching for a string on a web page: answers

【Question title】: Getting "TypeError: argument should be integer or bytes-like object, not 'str'" when searching for string on web page
【Published】: 2019-07-09 16:16:09
【Question description】:

I'm using Python 3.7 and Django. I want to search an HTML page for a string. I tried this...

req = urllib2.Request(article.path, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
is_present = html.find(token_str) >= 0

But this results in the error

TypeError: argument should be integer or bytes-like object, not 'str'

complaining about the last line, where I do the "find". What is the correct way to search for a string in the HTML?

【Question discussion】:

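Can you print the contents of html? You could use html.decode('utf8').find(token_str)..., but there may be a better way depending on what the output looks like. (Note that you would ideally read the response headers to find out which encoding to decode with.)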
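The suggestion above can be sketched roughly as follows. In Python 3, `read()` returns `bytes`, and `bytes.find` does not accept a `str`, so the page is decoded first. The sample body and the helper names `charset_from_headers` and `contains_token` are made up for illustration, and a hand-built `Message` stands in for the `headers` attribute of a real response, so the sketch runs without a network call.

```python
from email.message import Message


def charset_from_headers(headers: Message, default: str = "utf-8") -> str:
    """Return the charset declared in Content-Type, or a fallback (assumed UTF-8)."""
    return headers.get_content_charset() or default


def contains_token(raw_html: bytes, token: str, charset: str = "utf-8") -> bool:
    """Decode the raw response body to str, then do a plain substring test."""
    text = raw_html.decode(charset, errors="replace")
    return token in text


# Stand-in for response.headers; a real urllib response exposes a Message subclass.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# Stand-in for response.read(); note that this is bytes, not str.
raw_html = "<p>caf\xe9 ... there was a my_token here</p>".encode("iso-8859-1")

charset = charset_from_headers(headers)
print(contains_token(raw_html, "my_token", charset))
print(contains_token(raw_html, "my_tokenx", charset))
```

Going the other way, encoding the token and testing `token.encode(charset) in raw_html`, avoids decoding the whole page, at the cost of having to pick the same encoding the page uses.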

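Tags: python django python-3.x urllib2

【Solution 1】:

Dave!

For extracting data from HTML files, I highly recommend the Beautiful Soup library. Right now you may only be searching for the token across all the tags of the HTML file, but at some point you may be looking for something more complex, such as a string that appears only inside a particular paragraph tag. Install it with pip: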

pip install beautifulsoup4

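For your case, here is a tested snippet that solves your problem. It uses a simple regular-expression pattern for the token you are looking for. If any text inside the HTML tags matches the token, it returns True; otherwise it returns False.

I hope this function helps you get started with Beautiful Soup. It is a very powerful library.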

import re

from bs4 import BeautifulSoup

html_doc = """
<html>
 <head>
  <title>
   Here goes some title
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    Hello World!
   </b>
  </p>
  <p class="class1">
   Once upon a time..... there was a my_token here....
   <a class="token" href="http://example.com/token" id="link1">
  </p>

  <p class="class2">
   Nope....
  </p>
 </body>
</html>
"""


def search_inside_whole_html_tags_for(html_doc, str_lookup):
    """
    Looks for a str_lookup using a simple regexp pattern. Returns
    True if the str_lookup was found in the whole HTML text. Otherwise,
    returns False.
    """
    html_soup = BeautifulSoup(html_doc, "html.parser")

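    # simple regexp pattern: the fixed lookup string, escaped so that it is
    # matched literally (string= is the current name of the older text= argument)
    rslt = html_soup.find_all(string=re.compile(re.escape(str_lookup)))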

    return bool(rslt)


print(search_inside_whole_html_tags_for(html_doc, str_lookup="my_tokenx"))
print(search_inside_whole_html_tags_for(html_doc, str_lookup="my_token"))  # this is the token

>>> False
>>> True
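
Since the token is compiled as a regular expression, characters such as `.` or `(` inside it would be read as pattern syntax rather than as literal text. A small standard-library sketch of the difference, assuming the token is meant to be matched literally; `literal_pattern` is a hypothetical helper name, not part of Beautiful Soup:

```python
import re


def literal_pattern(token: str) -> "re.Pattern[str]":
    """Compile a pattern that matches the token as literal text."""
    return re.compile(re.escape(token))


token = "price (USD)"
page_text = "Total price (USD): 42"

# Unescaped, the parentheses form a group, so the literal "(" is never matched.
print(re.compile(token).search(page_text) is not None)  # False

# Escaped, the token is matched character for character.
print(literal_pattern(token).search(page_text) is not None)  # True
```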

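【Discussion】:

Thanks. BeautifulSoup is great for parsing HTML files, but I want to know whether this string appears anywhere on the page, regardless of which element it is in and whether it is part of the element's text or of an attribute.
To just check whether the string is anywhere in html_doc (assuming it is a string), perhaps the expression "my_token" in html_doc would do the trick. But it will not tell you whether the match was found in text or in an attribute.
This answer has nothing to do with the question! The error mentioned is a typical Python 3 error, and you can't solve it by suggesting a special package! By the way, the package has nothing to do with the error!!!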
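
On the point raised in this discussion about finding the string anywhere on the page and also knowing whether it sits in text or in an attribute, one possible sketch uses only the standard library's `html.parser`. It assumes the page has already been decoded to `str`; the class name `TokenLocator`, the function `locate_token`, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TokenLocator(HTMLParser):
    """Record whether a token shows up in element text or in attribute values."""

    def __init__(self, token: str):
        super().__init__()
        self.token = token
        self.in_text = False
        self.in_attribute = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for _name, value in attrs:
            if value is not None and self.token in value:
                self.in_attribute = True

    def handle_data(self, data):
        if self.token in data:
            self.in_text = True


def locate_token(html_text: str, token: str) -> dict:
    parser = TokenLocator(token)
    parser.feed(html_text)
    parser.close()
    return {
        "anywhere": token in html_text,  # plain substring test on the raw markup
        "in_text": parser.in_text,
        "in_attribute": parser.in_attribute,
    }


sample = '<p class="class1">there was a my_token here <a href="http://example.com/token">x</a></p>'
print(locate_token(sample, "my_token"))
print(locate_token(sample, "example.com"))
```

The plain `in` test is enough when only presence matters; the parser pass is only needed for the text-versus-attribute distinction.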

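【Solution 2】:

You are comparing a string with an integer, which is why you get a type error. You need to convert the string to an integer, or test whether it is not None.

Test: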

>>> token_str = 'test'
>>> token_str >= 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>=' not supported between instances of 'str' and 'int'
>>> token_str != None
True

Recommended solution:

is_present = html.find(int(token_str)) >= 0

or

is_present = html.find(token_str) != None
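
For reference, a minimal sketch of how `bytes.find` behaves in Python 3, using a made-up sample body rather than a real response. It accepts a bytes-like needle (or a single integer byte value), and it reports "not found" as -1 rather than None, so the `>= 0` comparison is on the integer it returns. The wrapper below is a hypothetical variant that encodes the token first, assuming the page is UTF-8.

```python
def is_present(html: bytes, token_str: str, encoding: str = "utf-8") -> bool:
    """Encode the str token to bytes so it can be searched for in a bytes body."""
    needle = token_str.encode(encoding)
    return html.find(needle) >= 0


# Stand-in for the bytes returned by urlopen(...).read()
html = b"<p>Once upon a time there was a my_token here</p>"

print(is_present(html, "my_token"))       # True
print(is_present(html, "missing"))        # False

# find() returns -1 when nothing matches, so a None check is always true.
print(html.find(b"missing"))              # -1
print(html.find(b"missing") is not None)  # True
```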

【Discussion】: