Cannot prettify HTML code with Sublime Text or Beautiful Soup

【Question title】: Cannot prettify html code with sublime text nor beautiful soup
【Posted】: 2019-09-24 07:19:13
【Question description】:

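I am trying to scrape some websites for information. I saved the page I want to scrape as an .html file and opened it in Sublime Text, but some parts are not displayed in a prettified way. I run into the same problem when I try BeautifulSoup; see the image below (I cannot really share the full code because it would reveal private information).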

【Question comments】:

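Could you please provide some code?
That is exactly what I would rather not do; it is actually the HTML of a public Facebook page...
BeautifulSoup does not need prettified code to work.
@furas True, but I need the code prettified to spot the keys for the information I am looking for...
Open the page in a web browser and go to DevTools (Chrome/Firefox), where you can see well-formatted HTML. I always use DevTools to inspect the HTML and work out the path to scrape. DevTools can even give you an XPath or CSS selector for the selected element. Alternatively, I can use JavaScript document.getElementByXXX to check it.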

Tags: html web-scraping beautifulsoup

【Solution 1】:

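Just pass the HTML to a BeautifulSoup object as a multi-line string and call soup.prettify(). That should work. Note, however, that prettify() indents by one space per nesting level by default, so if you want a custom indent you can write a small wrapper like this: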

def indentPrettify(soup, indent=4):
    # indent is the desired number of spaces per nesting level (an int)
    pretty_lines = []
    # nesting level of the most recent tag line
    previous_indent = 0
    # iterate over each line of the prettified soup
    for line in soup.prettify().split("\n"):
        # index of the first '<', which equals the nesting level because
        # prettify() indents by one space per level
        current_indent = line.find("<")
        if current_indent == -1 or current_indent > previous_indent + 1:
            # str.find() returns -1 when no '<' is found, meaning the line is text or
            # script rather than an HTML element; treat it as a child of the last tag.
            # A real tag is never more than one level deeper than the previous one,
            # so a larger index means the '<' sits inside text.
            current_indent = previous_indent + 1
        else:
            # only tag lines update the level, so runs of text lines do not drift deeper
            previous_indent = current_indent
        pretty_lines.append(writeOut(line, current_indent, indent))
    return "".join(pretty_lines)

def writeOut(line, current_indent, desired_indent):
    # the line already carries one space per level, so only the missing
    # (desired_indent - 1) spaces per level are prepended
    spaces_to_add = (current_indent * desired_indent) - current_indent
    return " " * max(spaces_to_add, 0) + line + "\n"
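
As a shorter alternative to the wrapper above, the same effect can be sketched with a single regular expression over the prettified string. This is only a sketch, not the answer's method: the name reindent and the sample string are made up for illustration, and it assumes the input uses one space per nesting level, as prettify() output does by default.

```python
import re


def reindent(pretty_html: str, indent: int = 4) -> str:
    """Widen the leading run of spaces on every line by a factor of `indent`.

    Assumes one space per nesting level in the input (hypothetical helper).
    """
    # With re.MULTILINE, '^' matches at the start of every line, so each
    # line's leading spaces are repeated `indent` times.
    return re.sub(
        r"^( +)",
        lambda m: m.group(1) * indent,
        pretty_html,
        flags=re.MULTILINE,
    )


# Hand-written sample in the one-space-per-level shape (made-up input)
sample = "<html>\n <body>\n  <p>\n   Hello\n  </p>\n </body>\n</html>\n"
print(reindent(sample, indent=4))
```

With a real page, the string passed in would be soup.prettify(). Unlike the wrapper, this version scales whatever leading spaces a line already has, so continuation lines of multi-line text inside script or pre elements are scaled from their original whitespace rather than re-nested.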

【Comments】: