【Posted】: 2013-04-21 08:28:15
【Question】:
How can I find all spans that have the class 'blue' and contain text in the format:
04/18/13 7:29pm
So the span's content could be:
04/18/13 7:29pm
or:
Posted on 04/18/13 7:29pm
In terms of building the logic to do this, here is what I have so far:
new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result
I have been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try to work out a way to do this, but the above is all I have so far.
Edit:
To clarify the scenario, there are spans like:
<span class="blue">here is a lot of text that i don't need</span>
and
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
Note that I only need 04/18/13 7:29pm, not the rest of the span's content.
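To make "only the timestamp" concrete, here is a minimal sketch that uses just the `re` module on the text of the two example spans. The pattern and the helper name `extract_timestamp` are assumptions made for illustration, as is the guess that the hour has one or two digits and the suffix is always am or pm. It is written for Python 3, so it uses `print()` as a function.

```python
import re

# Assumed format: MM/DD/YY H:MMam or HH:MMpm, e.g. "04/18/13 7:29pm".
TIMESTAMP_RE = re.compile(r'\d{2}/\d{2}/\d{2} \d{1,2}:\d{2}[ap]m')

def extract_timestamp(text):
    """Return the first timestamp found in text, or None if there is none."""
    match = TIMESTAMP_RE.search(text)
    return match.group(0) if match else None

# The text content of the two example spans from the question.
span_texts = [
    "here is a lot of text that i don't need",
    "this is the span i need because it contains 04/18/13 7:29pm",
]

timestamps = [extract_timestamp(t) for t in span_texts]
print(timestamps)
```

Using `search` rather than matching the whole string means surrounding words such as "Posted on" are ignored and only the timestamp itself is returned.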
Edit 2:
I also tried:
pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result
and got the error:
'TypeError: expected string or buffer'
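The error most likely comes from the fact that `find_all` returns BeautifulSoup `Tag` objects, while `re.findall` expects a string. A minimal sketch of one possible way around it, assuming Python 3 and the `bs4` package, is to search each span's text (via `get_text()`) instead of the tag itself. The sample `html` string is hypothetical, the third span with class `red` is added only to show that other classes are skipped, and the pattern is the date portion of the one tried above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup based on the spans described in the question.
html = '''
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="red">04/18/13 7:29pm</span>
'''

# Date portion of the pattern from the question, written as a raw string.
pattern = re.compile(r'\d\d/\d\d/\d\d \d\d?:\d\d\w\w')

soup = BeautifulSoup(html, 'html.parser')

results = []
for span in soup.find_all('span', {'class': 'blue'}):
    # get_text() returns the span's text as a plain string, which re can search;
    # passing the Tag object itself is what raises the TypeError.
    results.extend(pattern.findall(span.get_text()))

print(results)
```

Because the regex runs on the text inside each span, the `<span class="blue">` wrapper no longer needs to appear in the pattern, and spans without a timestamp simply contribute nothing to `results`.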
【Comments】:
Tags: python regex beautifulsoup