【Question title】: get_text() has UnicodeEncodeError
【Posted】: 2012-05-03 04:13:42
【Question】:

I have the following HTML:

<div class="dialog">
<div class="title title-with-sort-row">
    <h2>Description</h2>
    <div class="dialog-search-sort-bar">
    </div>
</div>
<div class="content"><div style="margin-right: 20px; margin-left: 30px;">
    <span class="description2">
        With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
        She is made available under a Creative Commons License that gives endless opportunities for further development. 
        This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
        The result is a figure that has very good bending and morphing behavior.
        <br />
    </span>
</div>
</div>

I need to find this div among several divs with class="dialog", and then pull out the text inside span class="description2".

When I use this code:

description = soup.find(text = re.compile('Description'))
if description != None:
    someEl = description.parent
    parent1 = someEl.parent
    parent2 = parent1.parent
    description = parent2.find('span', {'class' : 'description2'})
    print 'Description: ' + str(description)

I get:

<span class="description2">
    With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior.
    <br/>
</span>

If I try to get only the text, without the HTML and without non-ASCII characters, using

description = description.get_text()

I get an error (UnicodeEncodeError): 'ascii' codec can't encode character u'\x93'

How can I convert this HTML to plain ASCII?

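【Comments】:

  • That character is not an ASCII character. Is your goal to identify the most similar ASCII character (which is hard), or simply to remove all non-ASCII characters? Or do you actually want to output correct Unicode, e.g. UTF-8, rather than ASCII?
  • Just to remove all non-ASCII characters.
  • Required reading: bit.ly/unipain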
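
The three options raised in the first comment can be compared side by side. The following is only a rough sketch in Python 3 syntax: the sample string is shortened from the question, and the LOOKALIKES table is a hand-picked, hypothetical mapping rather than anything provided by a library.

```python
# -*- coding: utf-8 -*-
text = u'\u201cAntonia Polygon \u2013 Standard\u201d'

# Option 1: map a few typographic characters to ASCII look-alikes first.
# LOOKALIKES is a hypothetical, non-exhaustive table made up for this sketch.
LOOKALIKES = {
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
    0x2013: u'-',  # en dash
    0x2014: u'-',  # em dash
}
approximated = text.translate(LOOKALIKES).encode('ascii', 'ignore')

# Option 2: simply drop everything that is not ASCII.
stripped = text.encode('ascii', 'ignore')

# Option 3: keep every character and emit UTF-8 bytes instead of ASCII.
utf8_bytes = text.encode('utf-8')

print(approximated)
print(stripped)
print(utf8_bytes)
```

Option 1 keeps the text readable but only covers the characters listed in the table; anything else still falls through to 'ignore' and is dropped.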

Tags: python unicode ascii beautifulsoup


【Solution 1】:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

foo = u'With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.'

print foo.encode('ascii', 'ignore')

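Three things to note.

The first is the 'ignore' argument to the encode method. It tells the method to drop any character that falls outside the chosen encoding (ASCII in this case, to be on the safe side).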
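
To make the effect of that argument concrete, here is a small sketch comparing 'ignore' with two other standard error handlers. It uses Python 3 syntax (where encode returns bytes) and a shortened sample string, both of which are assumptions of this example rather than part of the answer above.

```python
# -*- coding: utf-8 -*-
# Shortened sample: curly quotes and an en dash written as escapes.
text = u'With \u201cAntonia Polygon \u2013 Standard\u201d'

# 'ignore' silently drops every character ASCII cannot represent.
ignored = text.encode('ascii', 'ignore')

# 'replace' substitutes a '?' for each such character instead.
replaced = text.encode('ascii', 'replace')

# The default handler, 'strict', raises UnicodeEncodeError.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored)
print(replaced)
print(strict_failed)
```

Note that 'ignore' leaves no trace of the removed characters, so the en dash between "Polygon" and "Standard" collapses into a double space.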

Second, we explicitly make foo a unicode string by prefixing the literal with u.

Third is the explicit file encoding directive: # -*- coding: utf-8 -*-

Also, the very good point raised in the comments attached to this answer is well worth reading: if the output is going to be used in HTML/XML, xmlcharrefreplace can be used in place of ignore above.
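
As a rough illustration of the xmlcharrefreplace handler, the sketch below encodes a shortened sample string and then decodes it again. It assumes Python 3 (html.unescape is only available there), and the sample text is made up for the example.

```python
# -*- coding: utf-8 -*-
import html

text = u'With \u201cAntonia Polygon \u2013 Standard\u201d'

# Each non-ASCII character becomes a numeric character reference,
# so the result is pure ASCII but loses no information.
escaped = text.encode('ascii', 'xmlcharrefreplace')
print(escaped)

# A browser or HTML parser turns the references back into the
# original characters; html.unescape does the same thing here.
restored = html.unescape(escaped.decode('ascii'))
print(restored == text)
```

Unlike 'ignore', this keeps the quotes and the dash recoverable, which is why it suits output that will end up inside HTML or XML.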

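【Comments】:

  • Using xmlcharrefreplace as the second argument would be much better in this case, since he is working with HTML.
  • Yes, I agree. I was just being lazy, since the OP said in a comment that he only wants to remove all the misbehaving characters. :)
  • Still, it is worth mentioning in case someone else with a similar problem comes across this.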