Multi-level tag existence check in web scraping - improving readability in Python: answers

[Question title]: Multi-level tag existence check in web scraping - improving readability in Python
[Posted]: 2018-10-15 13:47:39
[Question description]:

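I am developing a scraper that goes through many pages built from the same template. Each page contains some information about a particular item. In the optimistic case I want to get all the available data; for simplicity, assume that means the name, the price and the description.

The page structure is as follows: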

<div id="content">
  <h1>Product name</h1>
  <table id="properties">
    <tbody>
      <tr id="manufacturer-row">
        <th>Manufacturer</th>
        <td>Some-Mark</td>
      </tr>
    </tbody>
  </table>
  <p>Full description of the product</p>
</div>

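Conditions that apply in this case:

tags are nested, so I need to test for existence at every level,
some pages are missing some data: an empty column in the table as well as a missing table altogether,
some pages have no content at all,
empty text in a tag is a valid value, but a missing tag must be logged,
missing data is not an exceptional case.

Currently I check each piece of information for existence, which leads to code that is hard to read: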

content = soup.select_one("#content")
if content:
    product_name_tag = content.select_one("h1")
    if product_name_tag:
        name = product_name_tag.text
    else:
        log("Product name tag not found")

    table = content.select_one("table")
    if table:
        manufacturer_tag = table.select_one("#manufacturer-row > td")
        if manufacturer_tag:
            manufacturer = manufacturer_tag.text
        else:
            log("Manufacturer tag not found")
    else:
        log("Table not found")
else:
    log("Tag '#content' not found")

return (
    name if "name" in locals() else None,
    manufacturer if "manufacturer" in locals() else None
)

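In the real application the code is even harder to read, because the attributes I am looking for are usually nested more deeply and I need to check that each tag exists before extracting its text. I wonder whether there is a neat way to handle this in terms of code readability and conciseness. My ideas:

Create a function that extracts a tag's text if the tag exists. This would save a few lines, but in the real application I have to use regular expressions to extract certain phrases from the text, so a single function would not be enough.

Create a wrapper that logs the missing part when None is returned, instead of doing it in the "else" branches, to improve readability.

Put the extraction of each piece of data into a separate function, such as _get_content_if_available and _get_name_if_available.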
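
For illustration, the first two ideas could be combined roughly as follows. This is only a sketch: `select_tag` and `extract` are hypothetical names, and `soup` is assumed to be a BeautifulSoup object (or any object with a `select_one` method). The helper returns the tag rather than its text, so any regular-expression post-processing can stay with the caller.

```python
import logging

logger = logging.getLogger(__name__)


def select_tag(parent, selector, label):
    """Return the first match for `selector` under `parent`, or None.

    A miss is logged once. If `parent` is already None, None is passed
    along silently, because the missing parent was logged when it was
    looked up.
    """
    if parent is None:
        return None
    tag = parent.select_one(selector)
    if tag is None:
        logger.warning("%s not found (selector: %s)", label, selector)
    return tag


def extract(soup):
    # Each lookup is one line; missing levels simply propagate as None.
    content = select_tag(soup, "#content", "Content")
    name_tag = select_tag(content, "h1", "Product name")
    table = select_tag(content, "table", "Table")
    manufacturer_tag = select_tag(table, "#manufacturer-row > td", "Manufacturer")

    # Compare against None explicitly so that an existing tag with
    # empty text still yields "" instead of None.
    name = name_tag.text if name_tag is not None else None
    manufacturer = manufacturer_tag.text if manufacturer_tag is not None else None
    return name, manufacturer
```

With this shape a missing `#content` produces a single log entry instead of one per dependent field, which may or may not be the desired logging behaviour.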

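None of these solutions seems good or concise enough, so I would like to ask for your ideas.

I also wonder whether initializing variables only when certain conditions are met, and then checking whether they exist in the current context, is a good idea.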
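
To make the two alternatives concrete, here is a small side-by-side sketch. The function names and the `tag_text` parameter are hypothetical stand-ins for the real extraction logic. Note that `locals()` is keyed by variable names, so the test has to use the string `"name"`.

```python
def conditional_binding(tag_text):
    # `name` is bound only when the tag was found.
    if tag_text is not None:
        name = tag_text
    # Look up the *string* "name" among the local variable names.
    return name if "name" in locals() else None


def upfront_default(tag_text):
    # Bind the result first, then overwrite it when data is available.
    name = None
    if tag_text is not None:
        name = tag_text
    return name
```

Both functions return the same values, including `""` for a tag with empty text. The second one does not rely on introspection and keeps working if the variable is renamed.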

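[Question discussion]:

Tags: python error-handling web-scraping beautifulsoup

[Solution 1]:

It all depends on how you want to structure your code. My suggestion is to use ChainMap from collections. With ChainMap you can specify default values for your tags/keys and parse only the values that are not missing. That way you won't have an if/else mess in your codebase: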

data = """<div id="content">
  <h1>Product name</h1>
  <table id="properties">
    <tbody>
      <tr id="manufacturer-row">
        <th>Manufacturer</th>
        <td>Some-Mark</td>
      </tr>
    </tbody>
  </table>
  <p>Full description of the product</p>
</div>"""

from bs4 import BeautifulSoup
from collections import ChainMap

def my_parse(soup):
    def is_value_missing(k, v):
        if v is None:
            print(f'Value "{k}" is missing!') # or log it!
        return v is None

    d = {}
    d['product_name_tag'] = soup.select_one("h1")
    d['manufacturer_tag'] = soup.select_one("#manufacturer-row td")
    d['description'] = soup.select_one("p")
    d['other value'] = soup.select_one("nav")   # this is missing!
    return {k: v.text for k, v in d.items() if not is_value_missing(k, v)}

soup = BeautifulSoup(data, 'lxml')
c = ChainMap(my_parse(soup), {'product_name_tag': '-default name tag-',
             'manufacturer_tag': '-default manufacturer tag-',
             'description': '-default description-',
             'other value': '-default other value-',
             })

print("Product name = ", c['product_name_tag'])
print("Other value = ", c['other value'])

This prints:

Value "other value" is missing!
Product name =  Product name
Other value =  -default other value-
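
One property worth checking against the question's requirement that empty text is a valid value: ChainMap falls back to a later mapping only when a key is absent from the earlier ones, not when its value is empty. The sketch below uses only the standard library; the `parsed` dictionary is a made-up stand-in for what `my_parse` might return for a page with an empty `<h1>` and no manufacturer row.

```python
from collections import ChainMap

# Hypothetical parse result: the name tag exists but is empty,
# and the manufacturer tag is missing entirely.
parsed = {
    "product_name_tag": "",
    "description": "Full description of the product",
}
defaults = {
    "product_name_tag": "-default name tag-",
    "manufacturer_tag": "-default manufacturer tag-",
    "description": "-default description-",
}

record = ChainMap(parsed, defaults)

print(repr(record["product_name_tag"]))  # '' (empty text is kept)
print(repr(record["manufacturer_tag"]))  # '-default manufacturer tag-'
```

So a default is used only for tags that were actually missing, and those are the ones `is_value_missing` has already reported.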

[Discussion]: