Python encoding problems in XML: answers

【Question title】: Python encoding problems in xml
【Posted】: 2013-02-02 00:13:57
【Question】:

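I have a media player and I want to send what I am currently playing to trakt.tv. Everything works except for foreign letters in the title/path. The system runs Python 2.7.3.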

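from urllib2 import urlopen  # Python 2.7

def getStatus(self, ip, timeout=10.0):
    # Ask the player what it is currently playing and parse the XML reply.
    oPchStatus = PchStatus()
    try:
        oResponse = urlopen("http://" + ip + ":8008/playback?arg0=get_current_vod_info", None, timeout)
        oPchStatus = self.parseResponse(oResponse.readlines()[0])
    except Exception as e:
        # Assumed handler: the excerpt omits the except clause.
        oPchStatus.error = e
    return oPchStatus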

This returns something like the following.

<?xml version="1.0"?>
<theDavidBox>
  <request>
    <arg0>get_current_vod_info</arg0>
    <module>playback</module>
  </request>
  <response>
    <currentStatus>pause</currentStatus>
    <currentTime>3190</currentTime>
    <downloadSpeed>0</downloadSpeed>
    <fullPath>/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/FILMS/A.Haunted.House.(2013)/A Haunted House.avi</fullPath>
    <lastPacketTime>0</lastPacketTime>
    <mediatype>OTHERS</mediatype>
    <seekEnable>true</seekEnable>
    <title/>
    <totalTime>4860</totalTime>
  </response>
  <returnValue>0</returnValue>
</theDavidBox>

The next step takes the response above and assigns each item to a variable.

class PchStatus:
    def __init__(self):
        self.status=EnumStatus.NOPLAY
        self.fullPath = u""
        self.fileName = u""
        self.currentTime = 0
        self.totalTime = 0
        self.percent = 0
        self.mediaType = ""
        self.currentChapter = 0 # For Blu-ray Disc only
        self.totalChapter = 0 # For Blu-ray Disc only
        self.error = None

class PchRequestor:

    def parseResponse(self, response):
        oPchStatus = PchStatus()
        try:
            response = unescape(response)
            oXml = ElementTree.XML(response)
            if oXml.tag == "theDavidBox": # theDavidBox should be the root
                if oXml.find("returnValue").text == '0' and int(oXml.find("response/totalTime").text) > 90:#Added total time check to avoid scrobble while playing adverts/trailers
                    oPchStatus.totalTime = int(oXml.find("response/totalTime").text)
                    oPchStatus.status = oXml.find("response/currentStatus").text
                    oPchStatus.fullPath = oXml.find("response/fullPath").text
                    oPchStatus.currentTime = int(oXml.find("response/currentTime").text)

And so on. With the XML returned above:

oPchStatus.totalTime would be 4860
oPchStatus.status would be "pause"
oPchStatus.fullPath would be "/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/FILMS/A.Haunted.House.(2013)/A Haunted House.avi"
oPchStatus.currentTime would be 3190

As I said, this works fine until foreign letters appear in the title. A title like Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi makes oPchStatus.fullPath contain the string "/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Am\xe9lie.Poulain.(2001).avi"

instead of

"/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi"

which is what I want.

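Other routines in the script scan the XML for the file name and create FILENAME.watched, so I need the file name to match the actual file name exactly, with no letters replaced.

What is the best way to make sure these kinds of file names are encoded correctly? I have tried to give as much information as possible, but if you need more, please ask.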

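【Comments】:

The d&apos;Am\xe9lie value looks correct if this is Python showing you the element. If this is what gets written to the XML file (so \xe9 as 4 literal characters), then something else is wrong.
What is oResponse.info() (specifically the Content-Type header)? Is there an XML declaration in the response, e.g. <?xml version="1.0" encoding="UTF-8" ?>? Why do you use response = unescape(response)?
Last but not least, is this Python 2 or 3?
@MartijnPieters: if the input is valid XML, there is no need to use unescape() at all (hence the "why" question). "&apos;" is a predefined XML entity and ElementTree understands it.
@J.F.Sebastian: Interesting indeed. It looks like we need more context.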

Tags: python xml utf-8 decode encode

【Solution 1】:

Python is merely keeping your string value printable in ASCII by showing you the escape code \xe9 for the é character.

A few notes on the linked source code:

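You should not convert the response you are about to parse to unicode. Parse the raw bytes instead. The parser expects to decode the content itself. In fact, the ElementTree parser will encode the data again just to be able to parse it.
When you have XML in a byte string, I would use the ElementTree.fromstring() function instead; yes, it uses ElementTree.XML() underneath, just like you do, but fromstring() is the documented API.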
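
To illustrate the first note, here is a minimal sketch (not the pchtrakt code) that hands undecoded bytes to ElementTree.fromstring() and lets the parser apply the encoding named in the XML declaration. The sample documents, the shortened path, and the helper name parse_full_path are assumptions made up for this example; it is written to run on Python 3 as well as 2.7.

```python
# -*- coding: utf-8 -*-
from xml.etree import ElementTree

# Hypothetical responses: the same document in two encodings, each
# naming its encoding in the XML declaration. The path is shortened.
TEMPLATE = (
    u'<?xml version="1.0" encoding="{enc}"?>'
    u"<theDavidBox><response>"
    u"<fullPath>/Videos/Le.Fabuleux.Destin.d&apos;Am\u00e9lie.Poulain.(2001).avi</fullPath>"
    u"</response><returnValue>0</returnValue></theDavidBox>"
)


def parse_full_path(raw_bytes):
    """Parse undecoded response bytes and return the fullPath text."""
    root = ElementTree.fromstring(raw_bytes)
    return root.find("response/fullPath").text


latin1_bytes = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_bytes = TEMPLATE.format(enc="UTF-8").encode("utf-8")

# The parser decodes each byte string according to its own declaration.
path_from_latin1 = parse_full_path(latin1_bytes)
path_from_utf8 = parse_full_path(utf8_bytes)
```

Both calls should yield the same text, with &apos; already resolved to an apostrophe and é stored as a single character, which suggests the separate unescape() step questioned in the comments is unnecessary for valid XML.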

Otherwise, your sample input works as it should. If I create an XML document from your sample with a non-ASCII character in the file path, I get the following:

>>> tree = ElementTree.fromstring(response)
>>> print tree.find("response/fullPath").text
/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi
>>> tree.find("response/fullPath").text
u"/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Am\xe9lie.Poulain.(2001).avi"

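As you can see, the unicode() value from .text contains an é character (Unicode code point U+00E9, LATIN SMALL LETTER E WITH ACUTE). When the value is echoed as a Python literal, Python makes sure it is printable in an ASCII context by giving me the Python escape code \xe9 for that code point. This is normal; nothing is broken.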
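
The difference between the stored value and its escaped display can be checked directly. The following is a small sketch with a shortened, hypothetical path. On Python 3, repr() no longer escapes é, so the sketch uses ascii(), which produces the same kind of escaped literal that repr() gave for unicode strings on Python 2.

```python
# -*- coding: utf-8 -*-
# Hypothetical, shortened path holding one non-ASCII character.
path = u"/Videos/Le.Fabuleux.Destin.d'Am\u00e9lie.Poulain.(2001).avi"

# ascii() exists on Python 3 only; on Python 2 the NameError sends us
# to repr(), which escapes non-ASCII characters in the same way.
try:
    escaped = ascii(path)
except NameError:
    escaped = repr(path)

# In the string itself, é is one character with code point U+00E9.
index = path.index(u"\u00e9")
code_point = ord(path[index])

# Encoding happens only when bytes are needed, e.g. UTF-8 for output.
utf8_bytes = path.encode("utf-8")
```

Here escaped is only a display form containing the four characters \xe9, while path itself still holds a real é, so nothing needs to be "replaced" before the string is used.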

【Comments】:

As it stands, oPchStatus.fullPath = oXml.find("response/fullPath").text gets the string with the ASCII escape codes, so when I later use this string to create the FILENAME.watched file, the names do not match: path = u'{0}.watched'.format(path) if not isfile(path): f = open(path, 'w') f.close(). Here path = oPchStatus.fullPath, which is "/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi", so the output file should be "/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi.watched"
The response starts with <?xml version="1.0"?> and this is Python 2.7.3.
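You may want to edit your question to add that information. It is still unclear what oPchStatus is, and what the value of oXml.find('response/fullPath').text is (use repr(...) and type(...) if you can).
When oPchStatus.fullPath = oXml.find("response/fullPath").text is called, it copies everything including the \xe9, but I need that replaced with é. You can see the whole program here: github.com/cptjhmiller/pchtrakt/tree/dvp/pchtrakt
Martijn, thank you very much for your help, but I am still having problems. For normal files I have no trouble getting oXml, but with foreign letters the oXml = command fails. I tried this: oXml = ElementTree.fromstring(response) except: response = '' + response) oXml = ElementTree.fromstring(response) and then it works, but the oXml data is unusable; when I try to split the file name from the path I get: 'Element' object has no attribute 'split'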
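
The last error above is what Python reports when a string method is called on the Element returned by find() rather than on its .text. The sketch below shows that distinction and then creates a .watched marker from the resulting unicode file name. The response bytes, the ISO-8859-1 declaration, the temporary directory, and the helper name touch_watched are all assumptions for illustration, not the actual pchtrakt code, and the sketch assumes a filesystem that accepts non-ASCII names.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical response bytes; the declared encoding is an assumption.
raw = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?>'
    u"<theDavidBox><response>"
    u"<fullPath>/NAS/Videos/Am\u00e9lie.(2001).avi</fullPath>"
    u"</response></theDavidBox>"
).encode("iso-8859-1")

root = ElementTree.fromstring(raw)
element = root.find("response/fullPath")  # an Element: it has no .split()
full_path = element.text                  # the text: a unicode string

# String methods work on the text, not on the Element.
file_name = full_path.split(u"/")[-1]


def touch_watched(directory, name):
    """Create an empty '<name>.watched' marker in directory if missing."""
    marker_path = os.path.join(directory, u"{0}.watched".format(name))
    if not os.path.isfile(marker_path):
        io.open(marker_path, "w").close()
    return marker_path


demo_dir = tempfile.mkdtemp()  # stand-in for the real media directory
marker = touch_watched(demo_dir, file_name)
```

Because the file name stays a unicode string from parsing through to the open call, the marker name keeps the é as it appears in the original title.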