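【Posted】: 2016-05-07 07:57:15
【Question】:
I have a database that stores the home pages of some websites. I want this script to fetch one link from that database, find the other href links present on that page, and insert them into another table in MySQL. Here is the script: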
import requests
from bs4 import BeautifulSoup
import MySQLdb
import os
import urllib2
conn = MySQLdb.connect(host="localhost",
                       user="user",
                       passwd="password",
                       db="crw")
n = "no"
cat1 = "MOVIES"
cat2 = "NEWS"
loc = "SL"
act = "YES"
cursor = conn.cursor()
ext1 = ("SELECT LINK FROM LINK_MASTER WHERE ACT = %s and CAT1 = %s AND CAT2 = %s AND LOC = %s")
cursor.execute(ext1, (act, cat1, cat2, loc))
urlq = cursor.fetchone()
url = urlq
print url
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all(attrs={"class": "post-title"}):
    for link in item.find_all('a'):
        p = (link.get('href'))
        print p
        cursor.execute("INSERT IGNORE INTO URL(URL,FD,CAT1,CAT2) VALUES (%s,%s,%s,%s)", (p,n,cat1,cat2))
conn.commit()
I get the following error. Please help me fix it; I am new to Python and trying to learn new things.
Traceback (most recent call last):
  File "news.py", line 25, in <module>
    response = requests.get(url)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 428, in request
    env_proxies = get_environ_proxies(url) or {}
  File "/usr/lib/python2.7/dist-packages/requests/utils.py", line 516, in get_environ_proxies
    if should_bypass_proxies(url):
  File "/usr/lib/python2.7/dist-packages/requests/utils.py", line 478, in should_bypass_proxies
    netloc = urlparse(url).netloc
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'tuple' object has no attribute 'find'
【Comments】:
- Find out (by debugging?) the value of url at the moment the exception occurs... it will probably give you a hint.
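Following up on that hint, here is a minimal sketch of what inspecting the value would likely show. In DB-API drivers, fetchone() returns a whole row as a sequence (a tuple by default in MySQLdb) even when only one column is selected, or None when no row matches, which fits the 'tuple' object in the traceback. The sketch uses the standard-library sqlite3 module as a stand-in so it runs without a MySQL server; that substitution, the cut-down LINK_MASTER table, and the example.com URL are assumptions for illustration only. Note that sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3  # stand-in for MySQLdb (assumption), so no server is needed

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical, simplified version of the LINK_MASTER table from the question.
cursor.execute("CREATE TABLE LINK_MASTER (LINK TEXT, ACT TEXT)")
cursor.execute("INSERT INTO LINK_MASTER VALUES (?, ?)",
               ("http://example.com/", "YES"))

cursor.execute("SELECT LINK FROM LINK_MASTER WHERE ACT = ?", ("YES",))
row = cursor.fetchone()  # one-element tuple, or None if nothing matched
print(repr(row))         # shows a tuple, not a plain string

# Take the first column out of the row before using it as a URL.
url = row[0] if row is not None else None
print(repr(url))         # now a plain string that requests.get() can accept
```

Applied to the script in the question, this would presumably mean using urlq[0] instead of urlq as the URL, after checking that urlq is not None.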
Tags: python mysql beautifulsoup web-crawler