【Posted】: 2012-03-29 07:34:09
【Question】:
I want to ignore the Unicode characters in my XML. I'm open to changing them in some way when I process the output.
My Python:
import urllib2, os, zipfile
from lxml import etree
doc = etree.XML(item)
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
target = doc.xpath('//references-cited/citation/nplcit/*/text()')
#target = '-'.join(target).replace('\n-','')
print "docID: {0}\nCitation: {1}\n".format(docID,target)
outFile.write(str(docID) +"|"+ str(target) +"\n")
This produces the output:
docID: US-D0607176-S1-20100105
Citation: [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009. Jazzholic. Apr. 22, 2009 <http://www
However, if I try to add '-'.join(target).replace('\n-','') back in, both print and outFile.write fail with this error:
Traceback (most recent call last):
File "C:\Documents and Settings\mine\Desktop\test_lxml.py", line 77, in <module>
print "docID: {0}\nCitation: {1}\n".format(docID,target)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
How can I ignore the Unicode so that I can join target into a single string and pass it to outFile.write?
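For illustration, here is a minimal sketch of two ways this is commonly handled. It is not taken from the original post. The names doc_id, target, out_path and out_file are hypothetical stand-ins for the values in the code above, and the sample strings are shortened from the output shown earlier. The first option drops every non-ASCII character with the "ignore" error handler; the second keeps the characters and writes the file as UTF-8 through io.open, which should behave the same way on Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stand-ins for the results of doc.xpath(...) in the question.
doc_id = u"US-D0607176-S1-20100105"
target = [u"\u201cThe birth of Lee Min Ho's donuts.\u201d", u"Feb. 25, 2009."]

# Join the text nodes into one unicode string, as the commented-out line does.
joined = u"-".join(target).replace(u"\n-", u"")

# Option 1: discard anything that is not ASCII (the curly quotes are lost).
ascii_only = joined.encode("ascii", "ignore").decode("ascii")
print(ascii_only)

# Option 2: keep the characters and let the file object encode them as UTF-8.
# Both the template and the arguments are unicode, so no implicit ASCII
# encoding takes place while formatting.
line = u"{0}|{1}\n".format(doc_id, joined)
out_path = os.path.join(tempfile.gettempdir(), "citations_example.txt")
with io.open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(line)
```

Option 1 matches the literal request to ignore the Unicode, at the cost of losing characters. Option 2 avoids str() on the joined text entirely, which is where the implicit ASCII encoding happens on Python 2.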
【Comments】:
- What happens when you use from __future__ import unicode_literals?
Tags: python xml unicode lxml python-unicode