[Posted]: 2015-06-03 20:28:41
[Question]:
I'm parsing an HTML page with BS4:
import re
import codecs
import MySQLdb
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("sprt.htm"), from_encoding='utf-8')
sprt = [[0 for x in range(3)] for x in range(300)]
i = 0
for para in soup.find_all('p'):
    if para.strong is not None:
        sprt[i][0] = para.strong.get_text()
        sprt[i][1] = para.get_text()
        sprt[i][1] = re.sub(re.escape(sprt[i][0]), "", sprt[i][1], re.UNICODE)
        sprt[i][2] = sprt[i][1]
        sprt[i][2] = re.sub(r".+[\.\?][\s\S\n]", "", sprt[i][1], re.S)
        sprt[i][2] = re.sub(r".+Panel", "Panel", sprt[i][2], re.S)
        sprt[i][1] = re.sub(re.escape(sprt[i][2]), "", sprt[i][1])
        i += 1
x = 0
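A side note on the `re.sub` calls above: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag such as `re.S` passed as the fourth positional argument is read as `count`, not as a flag. The minimal sketch below illustrates the difference on a short sample string (the sample string and variable names are made up for illustration, and the code assumes Python 3):

```python
import re

# Hypothetical sample with an embedded newline, similar to the parsed text.
text = "Waves\n of Technology"

# Mimics passing re.S as the fourth positional argument: the value is used
# as `count`, no flag is set, and "." does not match the newline.
as_count = re.sub(r"Waves.", "", text, count=int(re.S))

# Passing the flag by keyword makes "." match the newline as intended.
as_flag = re.sub(r"Waves.", "", text, flags=re.S)

print(repr(as_count))
print(repr(as_flag))
```

Writing `flags=re.S` (or `flags=re.UNICODE`) explicitly avoids this kind of silent mix-up.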
The page I'm parsing is full of paragraphs like these three:
<p><strong>Name name. </strong>The Visual Politics of Play: On The Signifying Practices of Digital Games. Panel Proposal (2p)</p>
<p><strong>Name name and Name name. </strong>Pain, Art and Communication. Panel Proposal (2p)</p>
<p><strong>Name name, Name name and Name name. </strong>Waves of Technology: The Hidden Ideologies of Cognitive Neuroscience and the future production of the Iconic. Panel Proposal (2p)</p>
Parsing works fine until the last paragraph:
<p><strong>Name name, Name name and Name name. </strong>Waves of Technology: The Hidden Ideologies of Cognitive Neuroscience and the future production of the Iconic. Panel Proposal (2p)</p>
What I find in the last slot of the array is this:
[u'Name name, Name name\xa0and Name name.\xa0', u'Waves\n of Technology: The Hidden Ideologies of Cognitive Neuroscience and the \nfuture production of the Iconic.\xa0Panel Proposal (2p)', u'Waves\n of Technology: The Hidden Ideologies of Cognitive Neuroscience and the \nfuture production of the Iconic.\xa0Panel Proposal (2p)']
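Two newline characters (\n) show up in odd places (after "Waves" and before "future"). They always appear in the same positions, not randomly.
I thought it was because the paragraph is too long, but some longer paragraphs don't contain any \n.
I tried to remove them: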
sprt[i][2] = re.sub("\n", "", sprt[i][1], re.U, re.S)
But it didn't work.
Are the line breaks there because I made a mistake somewhere? Is there a way to get rid of them?
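One possible way to strip them, offered as a sketch rather than a confirmed fix: the newlines most likely come from line wraps in the HTML source file itself, which `get_text()` returns unchanged. Splitting on whitespace and re-joining with single spaces collapses them, and in Python 3 it also replaces the non-breaking space (`\xa0`). The helper name `normalize_whitespace` is hypothetical, and the input is the string from the output shown above:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including \\n and \\xa0) to one space."""
    # str.split() with no arguments splits on any Unicode whitespace
    # and drops leading/trailing whitespace.
    return " ".join(text.split())


raw = (u'Waves\n of Technology: The Hidden Ideologies of Cognitive '
       u'Neuroscience and the \nfuture production of the Iconic.'
       u'\xa0Panel Proposal (2p)')

clean = normalize_whitespace(raw)
print(repr(clean))
```

In the loop, this would be applied to the result of `para.get_text()` before any regular expressions run, so the later patterns see a single-line string.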
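[Comments]:
- Are they literal \n characters?
- No. When I copy the output from the terminal and paste it into Notepad++, searching for "\n" finds nothing, so I guess it's a special "\". If I print sprt[][thelastline], I get something like: Waves [NEWLINE] of Technology: The Hidden Ideologies of Cognitive Neuroscience and the [NEWLINE] the Hidden Ideologies of the Iconic.\xa0Panel Proposal (2)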