Weird character that does not exist in the HTML source appears with Python BeautifulSoup: answers

【Question title】: Weird character not exists in html source python BeautifulSoup
【Posted】: 2020-11-26 18:11:36
【Question description】:

I watched a video that teaches how to scrape a website with BeautifulSoup and requests. Here is the code:

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    #print(soup.prettify())
for j in soup.findAll('p', class_='price_color'):
    price=j.getText()
    print(price)

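My code runs fine. In the results, though, I noticed a weird character before the Euro sign, and when I inspected the HTML source I could not find that character. Any idea why this character appears, and how to fix it? Is using replace enough, or is there a better approach?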

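【Question comments】:

It sounds like you are not aware of character sets and are viewing UTF-8 with some legacy character set enabled, such as Windows code page 1251.
Possible duplicate of stackoverflow.com/questions/10611455/…; see also meta.stackoverflow.com/questions/379403/…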

Tags: python beautifulsoup python-requests

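【Solution 1】:

It seems to me that you explained your issue incorrectly. I assume you are on Windows, where your terminal or IDLE uses the default encoding cp1252,

but you are dealing with UTF-8, so you have to configure your terminal/IDLE to use UTF-8.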

import requests
from bs4 import BeautifulSoup


def main(url):
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)


main("http://books.toscrape.com/catalogue/page-{}.html")

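Try to always follow the DRY principle, i.e. "Don't Repeat Yourself".
Since you are dealing with the same host, you should keep the same session instead of opening a TCP socket stream, closing it, and then opening it again. Doing that can get your requests blocked and treated as a DDoS attack, because the back end captures the TCP flags. Imagine opening your browser, opening a website, closing it, and repeating that cycle!
Python functions usually look clean and are easy to read, instead of letting the code look like journal text.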

Things to note: the use of range() and the {} format string, and CSS selectors.
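As a small illustration of the encoding mismatch discussed at the top of this answer, the sketch below reproduces the stray character without touching the network. The helper names `misdecode` and `repair` are hypothetical, and cp1252 is only the encoding assumed in this answer; the same bytes would look identical if read as ISO-8859-1.

```python
import sys

def misdecode(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then read the bytes back with another encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

def repair(garbled, wrong_encoding="cp1252"):
    """Undo misdecode(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")

price = "£51.77"             # the pound sign is the two bytes C2 A3 in UTF-8
garbled = misdecode(price)   # byte C2 becomes 'Â', byte A3 becomes '£'
restored = repair(garbled)   # round trip back to the original string

# The encoding Python uses when printing to this terminal (ASCII-safe output).
print("stdout encoding:", sys.stdout.encoding)
print(ascii(garbled), ascii(restored))
```

Here `garbled` is the string 'Â£51.77', which matches the extra character in front of the currency sign, and `restored` equals the original price. Note that the code in this answer passes `r.content` (bytes) to BeautifulSoup, so the parser detects the encoding itself instead of receiving an already decoded string.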

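【Comments】:

How can I extract the star count from the attribute? I tried to modify your posted code like (x.h3.a.text, x.select_one("p.price_color").text, x.select_one("p.star-rating").attrs.items()), but I didn't get it. I know this is wrong, but how can I get the attribute value?
For the stars I can get the result dict_items([('class', ['star-rating', 'Three'])]). How can I get just Three?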
@YasserKhalil (x.h3.a.text, x.select_one("p.star-rating")['class'][-1], x.select_one("p.price_color").text)
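The reply above can be tried offline as a minimal sketch. The `SAMPLE` markup is a hand-written stand-in for one listing item, not a copy of the real page, and the names `WORD_TO_STARS` and `summarize` are hypothetical. It assumes the rating word is always the last entry of the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one product entry (an assumption, not the real page).
SAMPLE = """
<li class="col-xs-6">
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
    <p class="price_color">£51.77</p>
  </article>
</li>
"""

# Hypothetical lookup table turning the class word into a number.
WORD_TO_STARS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def summarize(item):
    """Return (title text, star count, price text) for one listing item."""
    # 'class' is a multi-valued attribute, so bs4 returns it as a list of strings.
    rating_word = item.select_one("p.star-rating")["class"][-1]
    return (
        item.h3.a.text,
        WORD_TO_STARS.get(rating_word),  # None if the word is unexpected
        item.select_one("p.price_color").text,
    )

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = [summarize(x) for x in soup.select("li.col-xs-6")]
```

With this sample, `rows` holds a single tuple whose middle element is the integer 3 rather than the string Three.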

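【Solution 2】:

You can use page.content.decode('utf-8') instead of page.text. As people said in the comments, this is an encoding problem: .content returns the HTML as bytes, which you can then turn into a correctly decoded string with .decode('utf-8'), whereas .text returns a string decoded with the wrong encoding (possibly cp1252). The final code could look like this: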

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1
pages = [] # You forgot this line

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.content.decode('utf-8'), 'html.parser') # Replace .text with .content.decode('utf-8')
    #print(soup.prettify())
for j in soup.findAll('p', class_='price_color'):
    price=j.getText()
    print(price)

This should work.

P.S.: Sorry for writing this as an answer; I don't have enough reputation to write in the comments :D

【Comments】:
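To see why decoding the bytes explicitly helps, here is a small offline sketch that feeds the same markup to BeautifulSoup three ways. The `RAW` document and the `first_price` helper are made up for illustration, and ISO-8859-1 stands in for whatever wrong encoding `.text` ended up using.

```python
from bs4 import BeautifulSoup

# Made-up UTF-8 page body; the meta tag declares the encoding to the parser.
RAW = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p class="price_color">£51.77</p></body></html>'
).encode("utf-8")

def first_price(markup):
    """Parse markup (str or bytes) and return the first price text."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.select_one("p.price_color").get_text()

# Bytes in: the parser picks the encoding itself from the declared charset.
from_bytes = first_price(RAW)
# Explicit decode, as this answer suggests with page.content.decode('utf-8').
from_decoded = first_price(RAW.decode("utf-8"))
# A string decoded with the wrong encoding keeps the stray character.
from_misdecoded = first_price(RAW.decode("iso-8859-1"))
```

The first two results are the clean price, while `from_misdecoded` carries the extra 'Â'. With requests, another option that should have the same effect is assigning `page.encoding = 'utf-8'` before reading `page.text`.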