CSS Selector not following ':nth-child' logic for bs4: answers

[Question title]: CSS Selector not following ':nth-child' logic for bs4
[Posted]: 2021-04-06 15:38:11
[Question description]:

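I am scraping the following webcomic site with requests and bs4 to download the comic images: www.qwantz.com

In the browser inspector, when I select the webcomic element and copy its CSS selector, I get the following: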

comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(2) > img')

Looking at the site's HTML, this makes sense. The elements in that section are laid out like this:

...
  <tr>
    <td>...</td>
    <td>...</td>
    <td>...</td>
  </tr>
...

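However, this selector returns an empty list. When I back the selector up to ('... > td'), I get the three sibling elements in the result.

The following also all produce empty lists, for each numeric argument I tried (1 and 2):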

comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(1)')
comicElem = soup.select('body > center > table > tbody > tr > td')[1]

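Using comicElem = soup.select('body > center > table > tbody > tr > td > td > img') gets me the result I want. But I would like to know what is happening here that makes the CSS selector copied from the web inspector fail. In short, I want my code to work with the CSS selector copied from the browser inspector, e.g. td:nth-child(2).

For reference, here is the relevant code: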

#! python3
# scheduledWebComicDL.py - Downloads comics but first checks if there is
# an update before

import requests, os, bs4, threading

folderName = 'Web Comics'
os.makedirs(folderName, exist_ok=True) # store comics in folderName

def downloadQwantz():
    # Web comic site to parse.
    site = 'http://www.qwantz.com'

    # Make the soup with requests & bs4.
    res = requests.get(site)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Get the image element(s).
    comicElem = soup.select('body > center > table > tbody > tr > td > td > img')

    # Confirm that an img element was found before indexing into the list
    # (select() returns a list, so compare against emptiness, not 0).
    if not comicElem:
        print('Could not find comic element at %s.' % (site))

    # Begin download of img with url.
    else:
        comicUrlShort = comicElem[0].get('src')
        comicUrl = site + '/' + comicUrlShort
        checkAndDownload(comicUrl)  # helper from the rest of the script (not shown)

downloadQwantz()

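[Question comments]:

Can you provide the complete code needed to reproduce this error?
Complete code provided, along with more detail on what I am trying to achieve.
I have voted to reopen. I think you only need the code related to downloadQwantz(), and you could reduce it to just a few lines that demonstrate the problem. It may be malformed HTML tripping up the parser.
Thanks for the feedback. I am relatively new to this.

Tags: python css beautifulsoup css-selectors

[Solution 1]: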

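The problem is that the tbody tag you see only appears once a browser has parsed and rendered the page. I believe that, at least in some browsers, a missing tbody tag is implicitly added when the table is built. Both the page source view and the response you get back from requests lack it. I also don't see any custom JS adding it. So the path copied from the browser (where tbody is present) fails when applied to the soup object, where it is absent.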
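To make this concrete, here is a minimal sketch using only the standard library. bs4's 'html.parser' backend is built on Python's html.parser module, which reports only the tags that literally occur in the markup. The sample_html string and the TagCollector class are illustrative assumptions, not the actual markup of the site.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag that actually appears in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical stand-in for the table on the page: no <tbody> in the source.
sample_html = """
<table>
  <tr>
    <td>previous</td>
    <td><img class="comic" src="comics/example.png"></td>
    <td>next</td>
  </tr>
</table>
"""

collector = TagCollector()
collector.feed(sample_html)

print(collector.tags)             # ['table', 'tr', 'td', 'td', 'img', 'td']
print('tbody' in collector.tags)  # False: the parser never invents a tbody
```

A browser given the same markup would show table > tbody > tr in its inspector, which is why a selector copied from there can contain a step that the parsed soup does not have.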

Checking one level up:

soup.select('body > center > table')

works, but there is no tbody in the HTML it returns.

With tbody in the selector:

soup.select('body > center > table > tbody')

returns [], i.e. an empty list.

The copied path without tbody:

soup.select('body > center > table > tr > td:nth-child(2) > img')

Ta-da! One matching node.

See also:
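If the goal is to keep pasting selectors from the browser inspector, one option is to drop the tbody steps before handing the selector to soup.select(). The sketch below is a deliberately naive illustration: strip_tbody is a hypothetical helper, not part of bs4, and it only handles a bare tbody followed by the child combinator.

```python
import re


def strip_tbody(selector: str) -> str:
    """Remove bare 'tbody >' steps from a child-combinator selector path.

    Naive by design: a tbody carrying a class, id, or pseudo-class
    (e.g. 'tbody:nth-child(1) >') is left untouched.
    """
    # The lookbehind keeps names such as '.my-tbody' or '#tbody' intact.
    return re.sub(r'(?<![\w.#-])tbody\s*>\s*', '', selector)


copied = 'body > center > table > tbody > tr > td:nth-child(2) > img'
print(strip_tbody(copied))
# body > center > table > tr > td:nth-child(2) > img
```

The cleaned string can then be passed to soup.select() as in the line above. Another route, if installing an extra package is acceptable, is to build the soup with a parser that follows the HTML parsing rules browsers use, such as html5lib, which adds the implied tbody itself.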

Why do browsers still inject <tbody> in HTML5?

[Comments]:

[Solution 2]:

All you need is this:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://qwantz.com/").text, "html.parser")
comic_src = f"https://qwantz.com/{soup.select_one('.comic')['src']}"

print(comic_src)

with open(comic_src.rsplit("/")[-1], "wb") as f:
    f.write(requests.get(comic_src).content)

Output:

https://qwantz.com/comics/comic2-2162.png

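Plus the comic image saved in your local folder.

[Comments]:

I know that selecting by class would get me there. However, I would like to get the same result using the CSS selector copied from the web inspector.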
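The f-string above works because the src attribute is a path relative to the site root. As a hedged variation, urllib.parse.urljoin from the standard library also copes with root-relative or absolute src values. The src value below is taken from the comic path shown in the answer's output; the variable names are illustrative.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

base = "https://qwantz.com/"
src = "comics/comic2-2162.png"  # value of the img tag's src attribute

# Resolve the src against the page URL instead of concatenating strings.
comic_url = urljoin(base, src)

# Derive the local file name from the path component of the URL.
filename = posixpath.basename(urlsplit(comic_url).path)

print(comic_url)  # https://qwantz.com/comics/comic2-2162.png
print(filename)   # comic2-2162.png
```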