【Posted】: 2009-11-27 15:20:38
【Question】:
I have the following script that scrapes data from my university's website and inserts it into the GAE datastore:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import datetime

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url

mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)
    raise SystemExit(1)  # without the page there is nothing to parse
#print html

soup = BeautifulSoup(html)
soup.prettify()
# The first <select> on the page holds one entry per intake.
tables = soup.find('select')
for options in tables:
    intake = options.string
    #print intake
    try:
        # First request: ask the app whether this intake is already stored.
        #print viewurl+intake
        page = mech.open(viewurl+intake)
        html = page.read()
        print html
        if html=="Exist in database":
            print intake, " Exist in the database skipping"
        else:
            # Second request: insert the missing intake.
            page = mech.open(inserturl+intake)
            html = page.read()
            print html
            if html=="Ok":
                print intake, "added to the database"
            else:
                print "Error adding ", intake, " to database"
    except Exception, err:
        print str(err)
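To make the scraping step concrete, here is a minimal, self-contained sketch of pulling the option texts out of the first `<select>` on a page. It is not the script above: it assumes Python 3 and uses the standard library's `html.parser` in place of BeautifulSoup, and the sample markup, the intake codes, and the names `OptionCollector` and `extract_intakes` are all made up for illustration.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the text of every <option> inside the first <select>."""

    def __init__(self):
        super().__init__()
        self.in_select = False
        self.in_option = False
        self.done = False      # set once the first <select> has been closed
        self.options = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "select":
            self.in_select = True
        elif tag == "option" and self.in_select:
            # A new <option> starts a new entry, even if the previous
            # one was never explicitly closed.
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False
        elif tag == "select" and self.in_select:
            self.in_select = False
            self.in_option = False
            self.done = True

    def handle_data(self, data):
        if self.in_option:
            self.options[-1] += data


def extract_intakes(html):
    """Return the stripped, non-empty option texts of the first <select>."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.options if text.strip()]


# Hypothetical markup; the real timetable page may be structured differently.
SAMPLE_HTML = """
<form>
  <select name="intake">
    <option>INTAKE-A</option>
    <option>INTAKE-B</option>
  </select>
</form>
"""

print(extract_intakes(SAMPLE_HTML))
```

Collecting only the `<option>` texts, rather than iterating over every child of the `<select>`, also skips the whitespace nodes between options.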
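I would like to know the best way to optimize this script so that I can run it on the App Engine server. As it stands, it scrapes more than 300 entries and takes more than 10 minutes to insert all of the data on my local machine.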
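One way to read those numbers: every intake costs one check request, plus an insert request when it is missing, so 300 intakes mean between 300 and 600 HTTP round trips, or roughly 2 seconds per entry at 10 minutes overall. The sketch below only counts round trips and is an illustration under assumptions, not a measured result. It assumes Python 3; the function names and intake codes are hypothetical, and a plain `set` stands in for the datastore. On App Engine, the bulk write in the second function could plausibly correspond to a single `db.put()` call with a list of entities.

```python
def sync_one_by_one(scraped, stored):
    """Mirror the script: one check per intake, plus one insert when missing."""
    round_trips = 0
    for intake in scraped:
        round_trips += 1            # the "checkintake" request
        if intake not in stored:
            stored.add(intake)
            round_trips += 1        # the "addintake" request
    return round_trips


def sync_in_batch(scraped, stored):
    """Hypothetical alternative: read the known intakes once, write the rest at once."""
    round_trips = 1                 # one read of everything already stored
    # dict.fromkeys drops duplicates while keeping the scraped order.
    missing = [i for i in dict.fromkeys(scraped) if i not in stored]
    if missing:
        stored.update(missing)
        round_trips += 1            # one bulk write
    return round_trips


# Made-up data: 300 scraped intakes, 100 of them already stored.
scraped = ["INTAKE-%03d" % n for n in range(300)]
already_stored = set(scraped[:100])

print(sync_one_by_one(scraped, set(already_stored)))
print(sync_in_batch(scraped, set(already_stored)))
```

Both functions leave the same set of intakes stored; only the number of round trips differs.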
The model used to store the data is:
from google.appengine.ext import db

class Intake(db.Model):
    intake = db.StringProperty(multiline=False, required=True)

    #@permalink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake

    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']
【Discussion】: