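Beautiful Soup 4 Custom Attribute Order Output

【Question title】: Beautiful Soup 4 Custom Attribute Order Output
【Posted】: 2020-01-28 17:20:03
【Question description】:

I want to create a custom output formatter in BS4 that rearranges the attributes of XML tags into a specific order that is not alphabetical.

For example, I want to output the following tag: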

<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>

as:

<word id="2357" head="2610" postag="p-s----n-" form="συ" lemma="συ" relation="ExD_AP"/>

The BS4 documentation gives a hint about where to start. It includes the following example:

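from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        # Yield attributes in their original order, skipping the one named 'm'.
        for k, v in tag.attrs.items():
            if k == 'm':
                continue
            yield k, v

attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
print(attr_soup.p.encode(formatter=UnsortedAttributes()))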

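This creates a custom HTML output formatter that keeps attributes in the order they were entered and skips certain attributes, but I don't know how to change it so that it outputs them in whatever order I want. Can anyone help me?

【Question discussion】:

Tags: python-3.x xml dom beautifulsoup attributes

【Solution 1】: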

How about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>
'''
def toString(ele):
  # Attribute names in the desired output order
  order = ['id','head','postag','form','lemma','relation']
  result = '<'+ele.tag
  for p in order:
    result+=' {}="{}"'.format(p,ele[p])
  return result+'/>'
doc = SimplifiedDoc(html)
ele = doc.word
print (toString(ele))

Result:

<word id="2357" head="2610" postag="p-s----n-" form="συ" lemma="συ" relation="ExD_AP"/>

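【Discussion】:

I may have to grab simplified_scrapy for this. Any idea whether it can handle XML? I know that BS4, unless you explicitly tell it you are working with XML, will alter your document to conform to HTML standards, which is bad for what I'm working with.
It basically doesn't change your document, apart from some line breaks. It is lighter than bs4, has no other dependencies, and is faster. Here is an example: github.com/yiyedata/simplified-scrapy-demo/tree/master/…

【Solution 2】:

Strictly speaking, I have an answer to my own question, but actually implementing it the way I'd like will take more work. Here is how to do it.

Create a subclass of XMLFormatter (or HTMLFormatter if you work with HTML) and name it whatever you like. I chose "SortAttributes". Write its "attributes" method so that it returns a list of tuples in the order you want: [(attribute1, value1), (attribute2, value2), etc.]. Mine may look verbose, but I wrote it that way because I work with very inconsistent XML.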

from bs4 import BeautifulSoup
from bs4.formatter import XMLFormatter


class SortAttributes(XMLFormatter):
    def attributes(self, tag):
        """Reorder a tag's attributes however you want."""
        attrib_order = ['id', 'head', 'postag', 'relation', 'form', 'lemma']
        new_order = []
        for element in attrib_order:
            if element in tag.attrs:
                new_order.append((element, tag[element]))
        for pair in tag.attrs.items():
            if pair not in new_order:
                new_order.append(pair)
        return new_order


xml_string = '''
<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>
'''
soup = BeautifulSoup(xml_string, 'xml')
print(soup.encode(formatter=SortAttributes()))

This outputs what I want:

<word id="2357" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>

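Conveniently, I can use the same encode method on the whole document. However, if I write the result to a file as a string, all the tags end up run together. For example: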

<sentence id="783"><word id="2357" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/><word id="2358" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/><word id="2359" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/></sentence>

rather than what I would prefer:

<sentence id="783">
  <word id="2357" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
  <word id="2358" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
  <word id="2359" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
</sentence>

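To fix this, I can't just call .prettify on it, because prettify with its default formatter rearranges the attributes back into alphabetical order. I will have to dig further into the XMLFormatter subclass. I hope someone finds this helpful in the future!

【Discussion】:

If you use the SimplifiedDoc approach, you can add a line break :) For example: return result + '/>\n'
You can also use a regular expression to handle the line breaks uniformly after reordering. re.sub(re.compile('/>\n
Just adding this as a reference for anyone looking for it: you can simply pass this XMLFormatter subclass to the prettify() method, and it works.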
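
A minimal sketch of what the last comment describes, assuming bs4 is installed along with lxml (needed for the 'xml' parser). It reuses the SortAttributes idea from Solution 2, rewritten here with sorted() for brevity (that rewrite is my variation, not the answerer's code), and passes the formatter to prettify() instead of encode(). The sample input is a shortened version of the sentence shown in the answer.

```python
from bs4 import BeautifulSoup
from bs4.formatter import XMLFormatter


class SortAttributes(XMLFormatter):
    # Preferred attribute order; names not listed here are emitted afterwards.
    ORDER = ['id', 'head', 'postag', 'relation', 'form', 'lemma']

    def attributes(self, tag):
        rank = {name: i for i, name in enumerate(self.ORDER)}
        # sorted() is stable, so unlisted attributes keep their original
        # relative order after the listed ones.
        return sorted(tag.attrs.items(),
                      key=lambda kv: rank.get(kv[0], len(rank)))


xml_string = (
    '<sentence id="783">'
    '<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>'
    '<word form="συ" head="2610" id="2358" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>'
    '</sentence>'
)

soup = BeautifulSoup(xml_string, 'xml')
# prettify() accepts the same formatter argument as encode()/decode().
pretty = soup.prettify(formatter=SortAttributes())
print(pretty)
```

Each word element should end up on its own line, indented under sentence, with the attributes in the custom order. Since prettify() returns a str, the result can be written to a file opened with encoding='utf-8'.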