【Question Title】: cannot use a string pattern on a bytes-like object (Python)
【Posted】: 2021-02-08 18:08:13
【Question】:

I am writing a crawler in Python to list all the links on a website, but I get an error and cannot see what is causing it. The error is:

Traceback (most recent call last):
  File "vul_scanner.py", line 8, in <module>
    vuln_scanner.crawl(target_url)
  File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 18, in crawl
    href_links= self.extract_links_from(url)
  File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 15, in extract_links_from
    return re.findall('(?:href=")(.*?)"', response.content)
  File "C:\Users\Lenovo x240\AppData\Local\Programs\Python\Python38\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object

My code is as follows. In scanner.py:

# To ignore numpy errors:
#     pylint: disable=E1101
import urllib
import requests
import re
from urllib.parse import urljoin

class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('(?:href=")(.*?)"', response.content)

    def crawl(self, url):
        href_links= self.extract_links_from(url)
        for link in href_links:
            link = urljoin(url, link)   

            if "#" in link:
                link = link.split("#")[0]

            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)     

In vul_scanner.py:

import scanner
# To ignore numpy errors:
#     pylint: disable=E1101


target_url = "https://www.amazon.com"
vuln_scanner = scanner.Scanner(target_url)
vuln_scanner.crawl(target_url)

The command I run is: python vul_scanner.py

【Comments】:

  • Sharing the full error message might help people answer your question.

Tags: python python-3.x web-crawler re


【Solution 1】:
return re.findall('(?:href=")(.*?)"', response.content)

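Here response.content is a bytes object, while the pattern passed to re.findall is a str; the re module requires the pattern and the searched data to be of the same type. So either use response.text, which gives you decoded text that you can process the way you currently plan, or take a look at this:

Regular expression parsing a binary file?

if you want to stay on the binary route.

Cheers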
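To illustrate the two options above, here is a minimal sketch that runs without a network connection. The helper names `extract_links_from_text` and `extract_links_from_bytes` and the sample HTML are hypothetical and only for illustration; in the question's Scanner class the first option would amount to passing `response.text` instead of `response.content` to `re.findall`.

```python
import re

# A str pattern can only search str data; a bytes pattern (rb'...') can
# only search bytes data. Mixing the two raises the TypeError from the
# question.
HREF_TEXT = re.compile(r'(?:href=")(.*?)"')
HREF_BYTES = re.compile(rb'(?:href=")(.*?)"')


def extract_links_from_text(html):
    """Option 1: search decoded text, e.g. response.text from requests."""
    return HREF_TEXT.findall(html)


def extract_links_from_bytes(raw, encoding="utf-8"):
    """Option 2: search raw bytes, e.g. response.content, then decode.

    The utf-8 default is an assumption; a real page may use another
    encoding.
    """
    return [m.decode(encoding, errors="replace") for m in HREF_BYTES.findall(raw)]


# Hypothetical sample page standing in for a downloaded response body.
sample_html = '<a href="/about">About</a> <a href="https://example.com/x#top">X</a>'

print(extract_links_from_text(sample_html))
print(extract_links_from_bytes(sample_html.encode("utf-8")))

# Reproducing the original error: str pattern against bytes data.
try:
    re.findall('(?:href=")(.*?)"', sample_html.encode("utf-8"))
except TypeError as exc:
    print("TypeError:", exc)
```

Both helpers return a list of str, so the rest of the crawl method (urljoin, the "#" split, the membership checks) should work unchanged with either one. Using response.text is usually the simpler choice, since requests handles the decoding for you.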

【Comments】:
