【发布时间】:2016-07-08 19:04:46
【问题描述】:
我的第一个爬虫快写完了!
但是,我遇到了一个障碍:我似乎无法获取包含表格的帖子的内容(换句话说,帖子引用了另一个帖子)。
这是从汤对象中提取帖子内容的代码。它工作得很好:
def getPost_contents(soup0bj):
try:
soup0bj = (soup0bj)
post_contents = []
for content in soup0bj.findAll('', {'class' : 'post_content'}, recursive = 'True'):
post_contents.append(content.text.strip())
...#Error management
return (post_contents)
以下是我需要抓取的示例(以黄色突出显示):
(网址,以防万一:http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906)
如何获取我突出显示的内容?为什么我当前的 getPostcontents 函数在这个特定实例中不起作用?据我所见,字符串仍在 div class=post_contents 下。
编辑编辑编辑
这就是我获得 BeautifulSoup 的方式:
from bs4 import BeautifulSoup as Soup
def getHTMLsoup(url):
try:
html = urlopen(url)
...#Error management
try:
soup0bj = Soup(html.read().decode('utf-8', 'replace'))
time.sleep(5)
...#Error management
return (soup0bj)
编辑2 编辑2 编辑2
这些是刮板的相关位:(对不起转储!)
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError, URLError
import time, re
def getHTMLsoup(url):
try:
html = urlopen(url)
except HTTPError as e:
return None
print('The server hosting{} is unavailable.'.format(url), '\n')
print('Trying again in 10 minutes...','\n')
time.sleep(600)
getHTMLsoup(url)
except URLError as e:
return None
print('The webpage found at {} is unavailable.'.format(url),'\n')
print('Trying again in 10 minutes...','\n')
time.sleep(600)
getHTMLsoup(url)
try:
soup0bj = Soup(html.read().decode('utf-8', 'replace'))
time.sleep(5)
except AttributeError as e:
return None
print("Ooops, {}'s HTML structure wasn't detected.".format(url),'\n')
return soup0bj
def getMessagetable(soup0bj):
try:
soup0bj = (soup0bj)
messagetable = []
for data in soup0bj.findAll('tr', {'class' : re.compile('message.*')}, recursive = 'True'):
except AttributeError as e:
print(' ')
return (messagetable)
def getTime_stamps(soup0bj):
try:
soup0bj = (soup0bj)
time_stamps = []
for stamp in soup0bj.findAll('span', {'class' : 'topic_posted'}):
time_stamps.append(re.search('..\/..\/20..', stamp.text).group(0))
except AttributeError as e:
print('No time-stamps found. Moving on.','\n')
return (time_stamps)
def getHandles(soup0bj):
try:
soup0bj = (soup0bj)
handles = []
for handle in soup0bj.findAll('span', {'data-id_user' : re.compile('.*')}, limit = 1):
handles.append(handle.text)
except AttributeError as e:
print("")
return (handles)
def getPost_contents(soup0bj):
try:
soup0bj = (soup0bj)
post_contents = []
for content in soup0bj.findAll('div', {'class' : 'post_content'}, recursive = 'True'):
post_contents.append(content.text.strip())
except AttributeError as e:
print('Ooops, something has gone wrong!')
return (post_contents)
html = ('http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm')
for soup in getHTMLsoup(html):
for messagetable in getMessagetable(soup):
print(getTime_stamps(messagetable),'\n')
print(getHandles(messagetable),'\n')
print(getPost_contents(messagetable),'\n')
【问题讨论】:
-
您当前的输出是多少?我看到这个帖子和其他帖子一样被提取。
-
对我也有用,虽然它停留在前面的文本中
-
嘿,谢谢你的提问!目前,它输出所有帖子,保存那些在 div class=post_contents 中有表格的帖子
-
我的问题可能是什么?有什么猜测吗?
-
@Gabriel,Beautifulsoup 是什么版本,你是如何获得 html 的?
标签: python-3.x web-scraping beautifulsoup