【Question title】: Node.toprettyxml() adds newlines to DOCTYPE in Python
【Posted】: 2012-01-30 21:14:31
【Question】:

When I use prettify, my DOCTYPE gets split across three lines. How can I keep it on a single line?

The "broken" output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE smil
  PUBLIC '-//W3C//DTD SMIL 2.0//EN'
  'http://www.w3.org/2001/SMIL20/SMIL20.dtd'>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta base="rtmp://cp23636.edgefcs.net/ondemand"/>
  </head>
  <body>
    <switch>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4" system-bitrate="336000"/>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_512.mp4" system-bitrate="592000"/>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_768.mp4" system-bitrate="848000"/>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_1128.mp4" system-bitrate="1208000"/>
    </switch>
  </body>
</smil>

Script:

import csv
import sys
import os.path

from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement, Comment, tostring

from xml.dom import minidom

def prettify(doctype, elem):
    """Return a pretty-printed, UTF-8 encoded XML byte string for the Element.
    """
    # encoding='unicode' yields text, so it can be joined with the doctype string
    rough_string = doctype + ElementTree.tostring(elem, encoding='unicode')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="  ", encoding='utf-8')

doctype = '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">'

video_data = ((256, 336000),
              (512, 592000),
              (768, 848000),
              (1128, 1208000))


with open(sys.argv[1], newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        root = Element('smil')
        root.set('xmlns', 'http://www.w3.org/2001/SMIL20/Language')
        head = SubElement(root, 'head')
        # Pass "base" as an attribute instead of embedding it in the tag name
        SubElement(head, 'meta', {'base': 'rtmp://cp23636.edgefcs.net/ondemand'})
        body = SubElement(root, 'body')

        switch_tag = ElementTree.SubElement(body, 'switch')

        for suffix, bitrate in video_data:
            attrs = {'src': ("mp4:soundcheck/{year}/{id}/{file_root_name}_{suffix}.mp4"
                             .format(suffix=str(suffix), **row)),
                     'system-bitrate': str(bitrate),
                     }
            ElementTree.SubElement(switch_tag, 'video', attrs)

        path = row['year'] + '-' + row['id']
        file_name = row['file_root_name'] + '.smil'
        full_path = os.path.join(path, file_name)
        # prettify() returns bytes, so open the file in binary mode;
        # the with block also makes sure the file is closed
        with open(full_path, 'wb') as output:
            output.write(prettify(doctype, root))

【Discussion】:

    Tags: python xml pretty-print elementtree minidom


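    【Answer 1】:

    Having looked at your current script, and at the other questions you have asked on this topic, I think you could make your life much simpler by building your smil files with plain string operations.

    Almost all of the xml in the file is static. The only data you need to worry about handling correctly is the attribute values of the video tags. For that, the standard library has a handy function that does exactly what you need: xml.sax.saxutils.quoteattr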
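As a quick illustration of what quoteattr does (a minimal sketch; the sample strings other than the file path are made up for this example), it escapes the characters that are unsafe in an attribute value and wraps the result in whichever quote character fits:

```python
from xml.sax.saxutils import quoteattr

# A plain value is simply wrapped in double quotes.
plain = quoteattr('mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4')

# An ampersand is escaped as an entity.
ampersand = quoteattr('rock & roll')

# A value containing double quotes is wrapped in single quotes instead.
double_quote = quoteattr('say "hi"')

print(plain)         # "mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4"
print(ampersand)     # "rock &amp; roll"
print(double_quote)  # 'say "hi"'
```

Because the returned string already includes its quotes, it can be dropped straight after the `=` of an attribute in a template.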

    So, with these points in mind, here is a script that should be easier to work with:

    import sys, os, csv
    from xml.sax.saxutils import quoteattr
    
    smil_header = '''\
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <head>
        <meta base="rtmp://cp23636.edgefcs.net/ondemand"/>
      </head>
      <body>
        <switch>
    '''
    smil_video = '''\
          <video src=%s system-bitrate=%s/>
    '''
    smil_footer = '''\
        </switch>
      </body>
    </smil>
    '''
    
    src_format = 'mp4:soundcheck/%(year)s/%(id)s/%(file_root_name)s_%(suffix)s.mp4'
    
    video_data = (
        ('256', '336000'), ('512', '592000'),
        ('768', '848000'), ('1128', '1208000'),
        )
    
    root = os.getcwd()
    if len(sys.argv) > 2:
        root = sys.argv[2]
    
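    with open(sys.argv[1], newline='') as stream:
    
        for row in csv.DictReader(stream):
            smil = [smil_header]
            for suffix, bitrate in video_data:
                row['suffix'] = suffix
                # Substitute first, then quote, so the CSV values get escaped too
                smil.append(smil_video % (
                    quoteattr(src_format % row), quoteattr(bitrate)
                    ))
            smil.append(smil_footer)
    
            directory = os.path.join(root, '%(year)s-%(id)s' % row)
            # Create the output directory if it does not exist yet
            os.makedirs(directory, exist_ok=True)
            path = os.path.join(directory, '%(file_root_name)s.smil' % row)
            print(':: writing file:', path)
            # Write UTF-8 to match the encoding declared in the XML header;
            # a separate name avoids shadowing the CSV input stream
            with open(path, 'w', encoding='utf-8', newline='') as output:
                output.write(''.join(smil))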
    


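      【Answer 2】:

      I don't think it is possible to remove the newlines that Node.toprettyxml generates in the DOCTYPE, at least not in a Pythonic way.

      It is the writexml method of the DocumentType class, starting at line 1284 of the minidom module, that inserts the offending newlines. The newline string it inserts originally comes from the Node.toprettyxml method and is passed along through the writexml method of the Document class. The same newline string is also passed to the writexml methods of the various other Node subclasses. Changing the newline string in the call to Node.toprettyxml would therefore change the newline string used throughout the whole output XML.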
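The effect described above can be seen with a small experiment. This is only a sketch for Python 3: the shortened smil document is made up for the illustration, and newl is the keyword that toprettyxml uses for the newline string.

```python
from xml.dom import minidom

# Shortened sample document (assumption: any document with a PUBLIC DOCTYPE behaves alike)
SOURCE = (
    '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" '
    '"http://www.w3.org/2001/SMIL20/SMIL20.dtd">'
    '<smil><head/></smil>'
)

doc = minidom.parseString(SOURCE)

# Default newl="\n": the DOCTYPE is written across three lines.
default_output = doc.toprettyxml(indent="  ")

# newl="" joins the DOCTYPE, but every other line break disappears as well.
flat_output = doc.toprettyxml(indent="  ", newl="")

print(default_output)
print(flat_output)
```

The second call shows why changing the newline string is not a usable fix: the same string is reused for every node in the document.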

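      There are various workarounds: modifying a local copy of the minidom module, monkey-patching the writexml method of the DocumentType class, or post-processing the XML string to remove the unwanted newlines. However, none of these approaches appeals to me.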
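For reference, the monkey-patching workaround might look roughly like the following. This is a hedged sketch for Python 3, not code from the minidom module: the replacement function name is made up, it assumes the identifiers contain no double quotes, and the patch affects every minidom document in the process.

```python
from xml.dom import minidom

def _writexml_one_line(self, writer, indent="", addindent="", newl=""):
    """Hypothetical replacement that writes the DOCTYPE on a single line."""
    writer.write("<!DOCTYPE " + self.name)
    if self.publicId:
        # Assumes neither identifier contains a double quote
        writer.write(' PUBLIC "%s" "%s"' % (self.publicId, self.systemId))
    elif self.systemId:
        writer.write(' SYSTEM "%s"' % self.systemId)
    if self.internalSubset is not None:
        writer.write(" [" + self.internalSubset + "]")
    writer.write(">" + newl)

# Global side effect: all DocumentType nodes now serialize this way.
minidom.DocumentType.writexml = _writexml_one_line

SOURCE = (
    '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" '
    '"http://www.w3.org/2001/SMIL20/SMIL20.dtd">'
    '<smil><head/></smil>'
)
patched_output = minidom.parseString(SOURCE).toprettyxml(indent="  ")
print(patched_output)
```

The rest of the document is still indented as usual, because only the DocumentType node's serialization is replaced.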

      To me, the best approach seems to be to leave things as they are. Is having the DOCTYPE split over several lines really such a serious problem?


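        【Answer 3】:

        I think you have at least three options:

        1. Simply accept the newlines. They may be unwanted and ugly, but they are perfectly legal.

        2. Add a step that replaces the bad DOCTYPE with a better one. Perhaps something like this: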

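          import re
          
          # prettify() returns UTF-8 bytes; decode so the text can be searched
          pretty_xml = prettify(doctype, elem).decode('utf-8')
          m = re.search("(<!.*dtd'>)", pretty_xml, re.DOTALL)
          ugly_doctype = m.group()
          # fixed_xml is text; encode it as UTF-8 again when writing it out
          fixed_xml = pretty_xml.replace(ugly_doctype, doctype)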
          
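        3. Use a more full-featured XML package. lxml comes to mind; it is largely compatible with ElementTree. With lxml's tostring function you won't need the prettify function, and the DOCTYPE comes out the way you want it. Example: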

          from lxml import etree 
          
          doctype = '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">'
          
          XML = '<smil xmlns="http://www.w3.org/2001/SMIL20/Language"><head><meta base="rtmp://cp23636.edgefcs.net/ondemand"/></head><body><switch><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4" system-bitrate="336000"/><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_512.mp4" system-bitrate="592000"/><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_768.mp4" system-bitrate="848000"/><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_1128.mp4" system-bitrate="1208000"/></switch></body></smil>'
          
          elem = etree.fromstring(XML)
          # tostring() returns bytes when an encoding is given; decode for printing
          print(etree.tostring(elem, doctype=doctype, pretty_print=True,
                               xml_declaration=True, encoding="utf-8").decode("utf-8"))
          

          Output:

          <?xml version='1.0' encoding='utf-8'?>
          <!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">
          <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
            <head>
              <meta base="rtmp://cp23636.edgefcs.net/ondemand"/>
            </head>
            <body>
              <switch>
                <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4" system-bitrate="336000"/>
                <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_512.mp4" system-bitrate="592000"/>
                <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_768.mp4" system-bitrate="848000"/>
                <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_1128.mp4" system-bitrate="1208000"/>
              </switch>
            </body>
          </smil>
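Option 2 can also be tried without the rest of the script. The following is a minimal, self-contained sketch for Python 3 using only the standard library; the helper name prettify_keep_doctype and the shortened sample document are assumptions made for this illustration.

```python
import re
from xml.dom import minidom

DOCTYPE = ('<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" '
           '"http://www.w3.org/2001/SMIL20/SMIL20.dtd">')

def prettify_keep_doctype(xml_text, doctype=DOCTYPE):
    """Pretty-print xml_text, then swap the multi-line DOCTYPE for a one-line one."""
    pretty = minidom.parseString(xml_text).toprettyxml(indent="  ")
    # Non-greedy match from "<!DOCTYPE" to the first ">"; this assumes the
    # declaration has no internal subset. A function is used as the
    # replacement so the doctype text is inserted literally.
    return re.sub(r"<!DOCTYPE.*?>", lambda m: doctype, pretty,
                  count=1, flags=re.DOTALL)

SOURCE = DOCTYPE + '<smil><head/></smil>'
pretty = prettify_keep_doctype(SOURCE)
print(pretty)
```

Since toprettyxml is called without an encoding here, the result is text and should be encoded when it is written to a file.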
          

