Purpose:
A while ago I set up a forum (http://www.yyjun.net). Too lazy to find and post threads by hand, I wrote a small program (still a work in progress) that scrapes thread content from a given URL.
What it does so far:
Since we need to grab the links under a specific DOM element of a page — here, the headline links on
http://news.sina.com.cn/society/ — we can use the function getlink, which returns a list. For example:
url="http://news.sina.com.cn/society/"
sinaurls=getlink(url, 'h1', {})
The empty dict is where extra attribute constraints can be passed to narrow the match.
The result looks like [url1, url2, ...].
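The same idea — collect every `<a href>` nested inside a given tag — can be sketched with nothing but the standard library's HTML parser. This is only an illustration (Python 3 syntax, and the sample markup below is made up, not Sina's):

```python
from html.parser import HTMLParser

class LinkGrabber(HTMLParser):
    """Collect href values of <a> tags nested inside a given tag (e.g. h1)."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0   # how many open copies of the target tag we are inside
        self.links = []

    def handle_starttag(self, name, attrs):
        if name == self.tag:
            self.depth += 1
        elif name == 'a' and self.depth > 0:
            href = dict(attrs).get('href')
            if href and href not in self.links:
                self.links.append(href)

    def handle_endtag(self, name):
        if name == self.tag and self.depth > 0:
            self.depth -= 1

html = '<h1><a href="http://example.com/1">one</a></h1><p><a href="/skip">x</a></p>'
g = LinkGrabber('h1')
g.feed(html)
print(g.links)  # ['http://example.com/1']
```

The `depth` counter is what scopes the search to links *inside* the target tag; the `<a>` in the `<p>` is ignored.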
Next we pull the content out of each page, e.g. the news title and body (the main function is getcontent):
getcontent(u, 'h1', {'id':'artibodyTitle'}, subtag=None).encode('utf-8')
getcontent(u, 'div', {'id':'artibody'}, subtag=None).encode('utf-8')
subtag=None means the preceding arguments are already enough to locate the result; otherwise, pass the tag name (tagname) and attributes (tagattr) of a child DOM node.
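As a rough illustration of what locating a node by tag and id amounts to (the real code uses BeautifulSoup; this regex version is only a sketch and assumes well-formed, non-nested markup):

```python
import re

def extract(html, tag, idvalue):
    """Return the inner HTML of the first <tag id="idvalue"> ... </tag>."""
    pattern = re.compile(
        r'<%s[^>]*\bid="%s"[^>]*>(.*?)</%s>' % (tag, idvalue, tag),
        re.S)  # re.S so the body may span multiple lines
    m = pattern.search(html)
    return m.group(1) if m else None

page = '<h1 id="artibodyTitle">Headline</h1><div id="artibody"><p>Body text</p></div>'
print(extract(page, 'h1', 'artibodyTitle'))  # Headline
print(extract(page, 'div', 'artibody'))      # <p>Body text</p>
```

A real parser is still the right tool — regexes break on nested tags of the same name — which is why the program relies on BeautifulSoup.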
With that we have the data we want. The program is still not very flexible, though; what remains to be done:
1. Configurable data sources (url, tagname, tagattr, etc.)
2. Scheduled crawling
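One possible shape for those two to-dos — data sources as plain dicts, and periodic runs via the stdlib sched module. The source entries and the interval here are illustrative only (Python 3 syntax):

```python
import sched
import time

# 1. Data sources as plain dicts, so adding a site needs no code changes.
SOURCES = [
    {'url': 'http://news.sina.com.cn/society/', 'tagname': 'h1', 'tagattr': {}},
]

def crawl(sources):
    for s in sources:
        # here we would call getlink(s['url'], s['tagname'], s['tagattr'])
        print('crawling', s['url'])

# 2. Scheduled crawling: run `job` every `interval` seconds, `runs` times,
#    by re-arming the scheduler after each run.
def run_every(interval, job, runs):
    scheduler = sched.scheduler(time.time, time.sleep)
    def step(remaining):
        job()
        if remaining > 1:
            scheduler.enter(interval, 1, step, (remaining - 1,))
    scheduler.enter(0, 1, step, (runs,))
    scheduler.run()

count = []
run_every(0.01, lambda: count.append(1), runs=3)
print(len(count))  # 3
```

In production a cron job (or the Windows Task Scheduler, given the XP environment) would be the simpler choice; the sched version just keeps everything in one process.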
Learning materials:
A Byte of Python, the Python manual
Development environment:
Windows XP + Python 2.6 + BeautifulSoup + MySQL 5
#coding=utf-8
#spider
#2011-9-8 23:06
from datetime import datetime
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen, Request
import re
import MySQLdb
import sys
class mysql:
    '''Thin wrapper around a MySQLdb connection.'''
    def __init__(self, host='localhost', user='root', pwd='', db=''):
        try:
            self.db = MySQLdb.connect(host=host, user=user, passwd=pwd, db=db,
                                      charset="utf8", use_unicode=True)
            self.cursor = self.db.cursor()
            self.cursor.execute("set names utf8")
            self.db.commit()
        except Exception, e:
            print e

    def execute(self, sql):
        try:
            self.cursor.execute("set names utf8")
            self.cursor.execute(sql)
        except Exception, e:
            print e

    def fetchone(self):
        try:
            row = self.cursor.fetchone()
            return row[0]
        except Exception, e:
            print e

    def commit(self):
        self.db.commit()

    def rollback(self):
        self.db.rollback()
def getsoup(url):
    req = Request(url)
    try:
        rep = urlopen(req)
    except IOError:
        print 'Cannot connect to the given address; please check that %s is reachable' % url
        return None
    soup = BeautifulSoup(rep.read())
    return soup
def getlink(url, tag, attr={}, subtagattr={}):
    '''
    Collect all links under the matching DOM elements of a page.
    url: page address
    tag: tag name of the top-level DOM node
    attr: extra attributes of the top-level DOM node
    subtagattr: extra attributes of the descendant DOM nodes
    '''
    soup = getsoup(url)
    mainnode = soup.findAll(tag, attrs=attr)
    linknodes = []
    for mn in mainnode:
        linknodes = linknodes + [n for n in mn.findAll('a', attrs=subtagattr)
                                 if n not in linknodes]
    urls = []
    for l in linknodes:
        urls.append(l['href'])
    return urls
def getcontent(url, tag, attr={}, subtag='a', subattr={}):
    soup = getsoup(url)
    node = soup.find(tag, attr)
    if subtag != None:
        node = node.find(subtag, subattr)
    contents = [unicode(n) for n in node.contents]
    return filtercontent(u''.join(contents))
def getcontentlist(url, tag, attr={}, subtag='a', subattr={}):
    soup = getsoup(url)
    node = soup.findAll(tag, attr)
    if subtag != None:
        node = node[0].findAll(subtag, subattr)
    l = []
    for n in node:
        l.append(u''.join([unicode(c) for c in n.contents]))
    return l
def filtercontent(content):
    # Strip newlines first so the non-greedy patterns below can match
    # comment/script/style blocks that used to span multiple lines.
    r0 = re.compile('\n')
    r1 = re.compile('<!--.*?-->')
    r2 = re.compile('<script.*?>.*?</script>')
    r3 = re.compile('<style.*?>.*?</style>')
    content = r0.sub('', content)
    content = r1.sub('', content)
    content = r2.sub('', content)
    content = r3.sub('', content)
    return content
if __name__ == '__main__':
    INSERT_FORUM_THREAD_SQL = '''
    INSERT INTO forum_thread(fid, author, authorid, subject, dateline, lastpost, lastposter)
    VALUES (38, '小俊', 11, '%s', UNIX_TIMESTAMP(now()), UNIX_TIMESTAMP(now()), '小俊')
    '''
    LAST_INSERT_ID_SQL = 'SELECT LAST_INSERT_ID()'
    INSERT_FORUM_POST_SQL = '''
    INSERT INTO forum_post
    (fid, tid, first, author, authorid, subject, dateline, message, useip, usesig, htmlon, bbcodeoff)
    VALUES
    (38, %s, 0, '小俊', 11, '%s', UNIX_TIMESTAMP(now()), '%s', '125.39.155.30', 1, 1, 1)
    '''
    UPDATE_FORUM_FORUM_SQL = '''
    UPDATE forum_forum
    SET
        threads = threads + %s,
        posts = posts + %s,
        lastpost = concat('%s %s ', CAST(UNIX_TIMESTAMP(now()) AS CHAR(50)), ' 小俊'),
        todayposts = %s
    WHERE fid = 38
    '''
    db = mysql(pwd='abc', db='test')
    url = "http://news.sina.com.cn/society/"
    sinaurls = getlink(url, 'h1', {})
    tid = ''
    title = ''
    i = 0
    try:
        for u in sinaurls:
            title = MySQLdb.escape_string(
                getcontent(u, 'h1', {'id': 'artibodyTitle'}, subtag=None).encode('utf-8'))
            body = MySQLdb.escape_string(
                getcontent(u, 'div', {'id': 'artibody'}, subtag=None).encode('utf-8'))
            db.execute(INSERT_FORUM_THREAD_SQL % title)
            db.execute(LAST_INSERT_ID_SQL)
            insertid = db.fetchone()
            db.execute(INSERT_FORUM_POST_SQL % (insertid, title, body))
            tid = insertid  # remember the last thread id/title for the forum update below
            i = i + 1
        db.execute(UPDATE_FORUM_FORUM_SQL % (i, i, tid, title, i))
        db.commit()
    except:
        db.rollback()
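A note on the SQL above: building statements with % and escape_string works, but the Python DB-API also supports parameterized queries, where the driver escapes values for you. A minimal sketch of the pattern, shown here with the stdlib's sqlite3 (which shares the DB-API; sqlite3 uses ? placeholders where MySQLdb uses %s, but the call shape is the same):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE forum_thread (subject TEXT)')

# The value is passed separately from the SQL; the driver quotes/escapes it,
# so an embedded quote no longer breaks (or injects into) the statement.
title = "O'Reilly headline"
cur.execute('INSERT INTO forum_thread (subject) VALUES (?)', (title,))
con.commit()

cur.execute('SELECT subject FROM forum_thread')
fetched = cur.fetchone()[0]
print(fetched)  # O'Reilly headline
```

With MySQLdb the equivalent call would be cursor.execute(sql, params) using %s placeholders, which would also let the spider drop the manual escape_string calls.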