BeautifulSoup Parsing: Answers

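【Question title】: BeautifulSoup Parsing
【Posted】: 2016-09-23 07:50:37
【Question】:

I have been struggling to parse this tree with BeautifulSoup to get the text I am looking for. After prettifying the HTML, I end up with a table I am interested in.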

    <td>
       <a href="/inventoryCheck/16783169/?zip=93817">
        <h3>
         Product A
        </h3>
       </a>
       <a class="show_hide" href="/inventoryCheck/16783169/?zip=93817" style="color:red">
        Not Available
       </a>
       <br/>
       Available at roughly
       <a style="color:red">
        0%
       </a>
       of Stores Nationwide
      </td>
     </tr>
     <tr>
      <td style="padding:10px">
       <a href="/inventoryCheck/32201303/?zip=93817">
        <img src="/prod_image/32201303.jpg"/>
       </a>
      </td>
      <td>
       <a href="/inventoryCheck/32201303/?zip=93817">
        <h3>
         Product B
        </h3>
       </a>
       <a class="show_hide" href="/inventoryCheck/32201303/?zip=93817" style="color:red">
        Not Available
       </a>
       <br/>
       Available at roughly
       <a style="color:red">
        0%
       </a>
       of Stores Nationwide
      </td>
     </tr>
     <tr>
      <td style="padding:10px">
       <a href="/inventoryCheck/29236000/?zip=93817">
        <img src="/prod_image/29236000.jpg"/>
       </a>
      </td>
      <td>
       <a href="/inventoryCheck/29236000/?zip=93817">
        <h3>
         Product C
        </h3>
       </a>
       <a class="show_hide" href="/inventoryCheck/29236000/?zip=93817" style="color:red">
        Not Available
       </a>
       <br/>
       Available at roughly
       <a style="color:red">
        0%
       </a>
       of Stores Nationwide
      </td>
     </tr>
     <tr>
      <td style="padding:10px">
       <a href="/inventoryCheck/35536199/?zip=93817">
        <img src="/prod_image/35536199.jpg"/>
       </a>
      </td>
      <td>
       <a href="/inventoryCheck/35536199/?zip=93817">
        <h3>
         Product D
        </h3>
       </a>
       <a class="show_hide" href="/inventoryCheck/35536199/?zip=93817" style="color:red">
        Not Available
       </a>
       <br/>
       Available at roughly
       <a style="color:red">
        0%
       </a>
       of Stores Nationwide
      </td>

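The "h3" tag marks a product, so I want to get the text in that tag, and whenever there is an h3 I also want to look at the next "a" tag to see whether that product is available.

Ultimately, in Python, I just want one line per product containing the product name and its availability.

I have tried .children, .descendants and so on, but to no avail.

Can anyone offer a pointer?

【Comments】:

Tags: python beautifulsoup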

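【Solution 1】:

At minimum, you want to find all td elements that have an h3 element inside; these are your products. You can then get the availability from the a element that has the show_hide class and inventoryCheck inside its href. Working code: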

from bs4 import BeautifulSoup, Tag

data = """
your HTML
"""

soup = BeautifulSoup(data, "html.parser")
# Product cells are the td elements that contain an h3.
for product in soup.find_all(lambda tag: tag.name == "td" and tag.h3):
    name = product.h3
    # The availability link has class "show_hide" and an inventoryCheck href.
    availability = product.find("a", class_="show_hide", href=lambda href: href and "inventoryCheck" in href)
    # Everything after the link (text nodes and tags) is the nationwide statistic.
    availability_stats = " ".join([item.get_text(strip=True) if isinstance(item, Tag) else item.strip()
                                   for item in availability.next_siblings])

    print(name.get_text(strip=True), availability.get_text(strip=True), availability_stats.strip())

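For the sample HTML provided, it prints the following (Python 2 output shown; on Python 3, print emits the three fields separated by spaces instead of a tuple):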

(u'Product A', u'Not Available', u'Available at roughly 0% of Stores Nationwide')
(u'Product B', u'Not Available', u'Available at roughly 0% of Stores Nationwide')
(u'Product C', u'Not Available', u'Available at roughly 0% of Stores Nationwide')
(u'Product D', u'Not Available', u'Available at roughly 0% of Stores Nationwide')
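
As a self-contained check of the approach above, here is a minimal sketch that runs against one product cell copied from the question. The `SAMPLE` constant and the `parse_products` helper are hypothetical names introduced for illustration, and the `None` guard for a cell without a `show_hide` link is an added assumption, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag

# One product cell copied from the question's HTML (SAMPLE is an illustrative name).
SAMPLE = """
<table><tr>
<td>
 <a href="/inventoryCheck/16783169/?zip=93817">
  <h3>
   Product A
  </h3>
 </a>
 <a class="show_hide" href="/inventoryCheck/16783169/?zip=93817" style="color:red">
  Not Available
 </a>
 <br/>
 Available at roughly
 <a style="color:red">
  0%
 </a>
 of Stores Nationwide
</td>
</tr></table>
"""


def parse_products(html):
    """Return (name, availability, stats) tuples for every td that contains an h3."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for cell in soup.find_all(lambda tag: tag.name == "td" and tag.h3 is not None):
        name = cell.h3.get_text(strip=True)
        link = cell.find("a", class_="show_hide")
        if link is None:
            # Assumption: a cell without the link is reported with empty fields.
            rows.append((name, "", ""))
            continue
        # Siblings after the link are a mix of tags and bare strings.
        parts = [
            item.get_text(strip=True) if isinstance(item, Tag) else item.strip()
            for item in link.next_siblings
        ]
        stats = " ".join(part for part in parts if part)
        rows.append((name, link.get_text(strip=True), stats))
    return rows


for row in parse_products(SAMPLE):
    print(" | ".join(row))
```

Returning tuples from a helper rather than printing inside the loop keeps the parsing separate from the output format, so the same rows can be printed, written to CSV, or compared in a test.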

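【Comments】:

【Solution 2】:

What you are looking for are the .parent and .next_sibling attributes (spelled .nextSibling in BeautifulSoup 3). They let you navigate the tree relative to the h3 tag. The important thing to remember about BeautifulSoup (and HTML, XML and similar formats in general) is that it is tree-based. The rough structure of your HTML is: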

td
├─ a
│  └─ h3
├─ a
├─ br
└─ a

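So your h3 is a child of the first a, and a "niece/nephew" of the a you want. You therefore need the next sibling of the h3's parent. Note that in pretty-printed HTML the immediate next sibling is usually a whitespace text node, so it is safer to ask for the next a sibling explicitly. The BeautifulSoup documentation has a good section on navigating the tree.

Try this: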

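from bs4 import BeautifulSoup

testdata = """
Your data here
"""

soup = BeautifulSoup(testdata, "html.parser")

items = []

for item in soup.find_all('h3'):
    name = item.get_text(strip=True)
    # Skip whitespace text nodes: take the next <a> sibling of the h3's parent.
    availability = item.parent.find_next_sibling('a').get_text(strip=True)

    items.append({'name': name, 'availability': availability})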

You get an items list containing one dict per product:

 [{'name': u'Product A', 'availability': u'Not Available'},
  {'name': u'Product B', 'availability': u'Not Available'},
  {'name': u'Product C', 'availability': u'Not Available'},
  {'name': u'Product D', 'availability': u'Not Available'}]
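
The "niece/nephew" relationship described above can be checked on a small fragment. The sketch below is illustrative only: `FRAGMENT` is a hypothetical, pretty-printed cell modelled on the question's markup. It shows why the immediate sibling of the h3's parent is a whitespace string in prettified HTML, and how `find_next_sibling("a")` skips past it.

```python
from bs4 import BeautifulSoup, NavigableString

# FRAGMENT is an illustrative, pretty-printed cell like the ones in the question.
FRAGMENT = """<td>
 <a href="/inventoryCheck/16783169/?zip=93817">
  <h3>Product A</h3>
 </a>
 <a class="show_hide" href="/inventoryCheck/16783169/?zip=93817">
  Not Available
 </a>
</td>"""

soup = BeautifulSoup(FRAGMENT, "html.parser")
h3 = soup.find("h3")
first_link = h3.parent  # the <a> that wraps the <h3>

# The immediate sibling is the whitespace between the two <a> tags, not a tag.
raw_sibling = first_link.next_sibling
print(isinstance(raw_sibling, NavigableString), repr(str(raw_sibling)))

# find_next_sibling("a") skips text nodes and returns the next <a> element.
availability = first_link.find_next_sibling("a")
print(availability.get_text(strip=True))
```

If the HTML were minified, with no whitespace between tags, `.next_sibling` would return the second `a` directly; `find_next_sibling("a")` works in both cases.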

【Comments】:

【Solution 3】:

If you only want the product and its availability, you can use a CSS selector to pull the h3 tags inside td tags, then use find_next to get the anchor:

from bs4 import BeautifulSoup

# h holds the HTML string from the question
soup = BeautifulSoup(h, "html.parser")
h3s = soup.select("td h3")
print([(h3.text.strip(), h3.find_next("a").text.strip()) for h3 in h3s])

Output:

[(u'Product A', u'Not Available'), (u'Product B', u'Not Available'), (u'Product C', u'Not Available'), (u'Product D', u'Not Available')]
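
Since the question asked for one line per product, the selector approach can be adapted to build those lines directly. This is a minimal sketch under stated assumptions: `HTML` is a hypothetical two-product table modelled on the question's markup, and the `"unknown"` fallback for an h3 with no following anchor is an addition for illustration.

```python
from bs4 import BeautifulSoup

# HTML is an illustrative two-product table modelled on the question's markup.
HTML = """
<table>
 <tr>
  <td>
   <a href="/inventoryCheck/16783169/?zip=93817"><h3>Product A</h3></a>
   <a class="show_hide" href="/inventoryCheck/16783169/?zip=93817">Not Available</a>
  </td>
 </tr>
 <tr>
  <td>
   <a href="/inventoryCheck/32201303/?zip=93817"><h3>Product B</h3></a>
   <a class="show_hide" href="/inventoryCheck/32201303/?zip=93817">Not Available</a>
  </td>
 </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

lines = []
for h3 in soup.select("td h3"):
    # find_next walks forward in document order, so it leaves the enclosing <a>
    # and lands on the next <a>, which here is the availability link.
    link = h3.find_next("a")
    status = link.get_text(strip=True) if link is not None else "unknown"
    lines.append(f"{h3.get_text(strip=True)}: {status}")

print("\n".join(lines))
```

Because `find_next` searches the rest of the document rather than only siblings, it relies on the availability link being the first anchor after each h3, which holds for the markup shown in the question.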

【Comments】: