How can I capture HTML, unmolested by the capturing library? Answers

【Question title】: How can I capture HTML, unmolested by the capturing library?
【Posted】: 2018-11-24 02:39:22
【Question】:

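Is there a Python library that will let me grab an arbitrary HTML snippet without altering the markup? As far as I can tell, lxml, BeautifulSoup and pyquery all make something like soup.find(".arbitrary-class") easy, but the HTML they return has been reformatted. I want the raw, original markup.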

For example, suppose I have this:

<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <div class="arbitrary-class">
      This is some<br />
      markup with <br>
      <p>some potentially problematic</p>
      stuff in it <input type="text" name="w00t">
    </div>
  </body>
</html>

I want to capture exactly:

"
      This is some<br />
      markup with <br>
      <p>some potentially problematic</p>
      stuff in it <input type="text" name="w00t">
    "

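...whitespace and all, and without the tags being rewritten into "properly formatted" form (e.g. <br> becoming <br />).

The problem is that all 3 libraries appear to build a DOM internally and simply return a Python object representing what the file should be rather than what it is, so I don't know where or how to get the original snippet I need.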

【Comments】:

Tags: python html web-scraping beautifulsoup lxml

【Solution 1】:

This code:

from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
    print(soup.select(".arbitrary-class")[0].contents)

prints the list:

[u'\n      This is some', <br/>, u'\n      markup with ', <br/>, u'\n', <p>some potentially problematic</p>, u'\n      stuff in it ', <input name="w00t" type="text"/>, u'\n']

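Edit:

As Daniel pointed out in the comments, this produces normalized tags.

The only alternative I could find is to use a parser generator such as pyparsing. The code below is a slight modification of some of their example code for the withAttribute function.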

from pyparsing import SkipTo, makeHTMLTags, withClass

html = """<html>
<head>
    <title>test</title>
</head>
<body>
    <div class="arbitrary-class">
    This is some<br />
    markup with <br>
    <p>some potentially problematic</p>
    stuff in it <input type="text" name="w00t">
    </div>
</body>
</html>"""

div,div_end = makeHTMLTags("div")

# only match div tag having a class attribute with value "arbitrary-class"
div_grid = div().setParseAction(withClass("arbitrary-class"))
grid_expr = div_grid + SkipTo(div | div_end)("body")
for grid_header in grid_expr.searchString(html):
    print(repr(grid_header.body))

The output of this code is:

'\n    This is some<br />\n    markup with <br>\n    <p>some potentially problematic</p>\n    stuff in it <input type="text" name="w00t">'

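Note that the first <br /> now keeps its space, and the <input> tag no longer has a / added before the closing >. The only difference from your specification is the missing trailing whitespace. You may be able to resolve this difference by refining this solution.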
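
One possible way to keep the trailing whitespace as well, sketched here as an assumption and not as part of the original answer, is to use the standard library's html.parser only to locate the element and then slice the untouched source string. HTMLParser.getpos() reports the line and column where the current tag starts, and get_starttag_text() returns the raw text of the most recent start tag; the class name RawInnerHTML and the helper raw_inner_html are hypothetical names introduced for this illustration, and the sketch only handles div elements.

```python
from html.parser import HTMLParser


class RawInnerHTML(HTMLParser):
    """Find the first <div> carrying a given class and record the raw
    source text between its start tag and its matching </div>."""

    def __init__(self, source, class_name):
        super().__init__()
        self.source = source
        self.class_name = class_name
        # Absolute index at which each line begins. HTMLParser counts
        # lines on "\n" only, so we do the same here.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.depth = 0      # <div> nesting depth inside the target element
        self.start = None   # index just past the target's start tag
        self.result = None  # raw inner markup once the match is closed

    def _offset(self):
        # Convert getpos() (1-based line, 0-based column) to a string index.
        lineno, column = self.getpos()
        return self.line_starts[lineno - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.start is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                # getpos() points at the "<" of the start tag.
                self.start = self._offset() + len(self.get_starttag_text())
                self.depth = 1
        elif self.result is None:
            self.depth += 1  # a nested <div> inside the target

    def handle_endtag(self, tag):
        if tag != "div" or self.start is None or self.result is not None:
            return
        self.depth -= 1
        if self.depth == 0:
            # getpos() points at the "<" of the closing tag.
            self.result = self.source[self.start:self._offset()]


def raw_inner_html(source, class_name):
    """Return the unmodified inner markup, or None if nothing matches.
    The whole document must be passed in as a single string."""
    parser = RawInnerHTML(source, class_name)
    parser.feed(source)
    parser.close()
    return parser.result
```

Calling raw_inner_html(html, "arbitrary-class") on the html string from the pyparsing example should give back the slice of the original text, so nothing inside it is re-serialized. The parser is used only for offsets, which means badly broken markup (for example an unclosed div) can still yield None or an unexpected span.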

【Comments】:

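This modifies the markup. Note that the first <br/> is missing its space, and the <input> tag has had a / added before the closing >.
@Daniel If you are happy with the edited version, please mark my answer as accepted. Thanks.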