[Posted]: 2014-06-30 00:28:22
[Question]:
I have an HTML page:
<div class="theater">
<div class="desc" id="theater_16109207495969942346">
<h2 class="name"><a href="/movies?near=pune&tid=df8f66de0a592b4a" id="link_1_theater_16109207495969942346">Esquare Victory Camp</a></h2>
<div class="info">site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975
<a class="fl" href="" target="_top"></a>
</div>
</div>
<div class="showtimes">
<div class="show_left">
<div class="movie">
<div class="name"><a href="/movies?near=pune&mid=1cdcf90092189400">Hawaa Hawaai</a>
</div><span class="info">Drama - Hindi</span>
<div class="times"><span style="color:#666"><span style="padding:0 "></span>
<!-- -->10:30am</span><span style="color:#666"><span style="padding:0 "> &nbsp</span>
<!-- -->3:45</span><span style="color:#666"><span style="padding:0 "> &nbsp</span>
<!-- -->6:00</span><span style="color:"><span style="padding:0 "> &nbsp</span>
<!-- -->8:30pm</span>
</div>
</div>
</div>
<div class="show_right">
<div class="movie">
<div class="name"><a href="/movies?near=pune&mid=6b59ad39004d895b">The Amazing Spider Man 2</a>
</div><span class="info">Action/Adventure/Thriller - English - <a class="fl" href="/url?q=http://www.youtube.com/watch%3Fv%3DSCjCk59PIzw&sa=X&oi=movies&ii=0&usg=AFQjCNGpVM5U04h0acABA7eApb6EIO4Ejw">Trailer</a></span>
<div class="times"><span style="color:#666"><span style="padding:0 "></span>
<!-- -->1:00</span><span style="color:"><span style="padding:0 "> &nbsp</span>
<!-- -->10:45pm</span>
</div>
</div>
</div>
<p class="clear"></p>
</div>
</div>
As you can see, &nbsp appears in many places. There are also many other Unicode characters. I want to extract the content of this page.
What I am doing is:
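def removeNonAscii(s):
    # Keep only characters whose code point is below 128 (plain ASCII).
    return "".join(i for i in s if ord(i) < 128)

myName = soup.findAll("div", {"class": "theater"})
for x in myName:
    xt = str(x)  # serialize the matched tag to a string
    print removeNonAscii(xt)
    print "<br>"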
Result:
Esquare Victory Camp
site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975
Hawaa Hawaai
Drama - Hindi
10:30am  3:45  6:00  8:30pm
The Amazing Spider Man 2
Action/Adventure/Thriller - English - Trailer
1:00  10:45pm
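Everything looks fine except for the &nbsp. I tried replacing it and searched for other solutions, but nothing has worked so far. I think the &nbsp without the trailing ; is what is causing the problem. How can I remove the &nbsp?
[Comments]: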
-
Are your characters already double-escaped like this in the source? If you can, the best option is to start with good data.
-
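Yes, the characters come exactly like that. I have no other option but to work with this data. Is there any way to get rid of those Unicode characters, such as &nbsp?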
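A possible approach, offered as a sketch rather than a definitive fix: the literal text &nbsp (with or without the semicolon, or double-escaped as &amp;nbsp) consists only of ASCII characters, so a filter that drops characters with ord(i) >= 128 cannot remove it. One option is to replace the entity text explicitly and also treat a decoded non-breaking space (U+00A0) as an ordinary space. The names NBSP_ENTITY and strip_nbsp below are hypothetical, and the code assumes Python 3 and only the standard library; it would be applied to the string produced by str(x) or to the extracted text.

```python
import re

# Hypothetical helper pattern: matches "&nbsp" with or without the trailing
# semicolon, plus the double-escaped form "&amp;nbsp" a serializer may emit.
NBSP_ENTITY = re.compile(r"&(?:amp;)?nbsp;?")


def strip_nbsp(text):
    """Replace nbsp entities and literal U+00A0 with spaces, then tidy up."""
    text = NBSP_ENTITY.sub(" ", text)        # entity written out as text
    text = text.replace("\u00a0", " ")       # already-decoded non-breaking space
    return re.sub(r"[ \t]+", " ", text).strip()  # collapse runs of spaces/tabs


sample = "10:30am &nbsp3:45 &nbsp;6:00 &amp;nbsp8:30pm"
print(strip_nbsp(sample))
```

With this in place, the loop in the question could call strip_nbsp(xt) before (or instead of) removeNonAscii(xt). Replacing with a space rather than an empty string keeps adjacent showtimes from running together.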
Tags: python html string unicode beautifulsoup