requests.get() and/or BeautifulSoup() behaviour inconsistent: answers

【Question title】: requests.get() and/or BeautifulSoup() behaviour inconsistent
【Posted】: 2018-06-15 15:27:03
【Question description】:

I have the following code:

__PARENT_TAG = "article"

def _navigate_to_xxx(self):
    """acquire html from xxx and beautify the raw html"""
    html = requests.get(xxx.__BASE_URL + xxx.__EXTENDED_URL)
    self.beautified_html = BeautifulSoup(html.content, "html.parser")

def _extract(self):
    """Helper that extracts the __PARENT_TAG elements from beautified_html and returns them."""
    # find_all() is the current bs4 name; findAll() is a legacy alias for the same method.
    element_list = self.beautified_html.find_all(self.__PARENT_TAG)
    logging.debug("The number of __PARENT_TAG is: {0}".format(len(element_list)))
    return element_list

The problem is that, for the same web page, the debug line sometimes reports 18 and sometimes 20 (I expect 20).

Does anyone know why this happens?

【Question discussion】:

Tags: python beautifulsoup request

【Solution 1】:

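I think we would need to see how you are using __PARENT_TAG to diagnose this exactly, but in my experience a BeautifulSoup tree built from a typical website's HTML has plenty of gaps and extraneous parts. Look closely at the exact behaviour of find_all() in the bs4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all and make sure you are in the right part of the HTML tree. Some parts may have an extra <div> class or something else you were not expecting.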
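One way to narrow this down is to check whether the two runs received the same HTML at all. The sketch below is only an illustration, not the asker's code: `TagCounter` and `summarize` are hypothetical helpers, the two byte strings are made-up stand-ins for consecutive `html.content` values, and the count uses the standard library's `html.parser` as a cross-check that does not go through BeautifulSoup.

```python
import hashlib
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags with a given name (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == self.tag:
            self.count += 1


def summarize(raw_html, tag="article"):
    """Return (sha256 hex digest, number of <tag> start tags) for one response body."""
    counter = TagCounter(tag)
    counter.feed(raw_html.decode("utf-8", errors="replace"))
    counter.close()
    return hashlib.sha256(raw_html).hexdigest(), counter.count


# Made-up response bodies standing in for two consecutive html.content values.
first = b"<html><body><article>a</article><article>b</article></body></html>"
second = b"<html><body><article>a</article></body></html>"

digest_1, count_1 = summarize(first)
digest_2, count_2 = summarize(second)
print(count_1, count_2, digest_1 == digest_2)
```

If the digests and counts differ between two real requests, the server is probably returning different markup each time, and the inconsistency lies upstream of BeautifulSoup. If the raw count is stable but find_all() still disagrees with it, the parsing step is the more likely place to look.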

【Discussion】:

I added it to my original question.