【Question Title】: python beautifulsoup crawler error while picking URL from mysql
【Posted】: 2016-05-07 07:57:15
【Question】:

I have a database that stores the home pages of several websites. I want this script to fetch one link from that database, find the other href links on that page, and insert them into another MySQL table. Here is the script:

import requests
from bs4 import BeautifulSoup
import MySQLdb
import os
import urllib2
conn = MySQLdb.connect(host= "localhost",
                 user="user",
                 passwd="password",
                 db="crw")
n = "no"
cat1 = "MOVIES"
cat2 = "NEWS"
loc  = "SL"
act = "YES"
cursor = conn.cursor()
ext1 = ("SELECT LINK FROM LINK_MASTER WHERE ACT = %s and CAT1 = %s AND CAT2 = %s AND LOC = %s")
cursor.execute(ext1, (act, cat1, cat2, loc))
urlq = cursor.fetchone()
url = urlq
print url
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all(attrs={"class": "post-title"}):
    for link in item.find_all('a'):
        p = (link.get('href'))
        print p
        cursor.execute("INSERT IGNORE INTO URL(URL,FD,CAT1,CAT2) VALUES (%s,%s,%s,%s)", (p,n,cat1,cat2))
        conn.commit()

I get the following error. Please help me resolve it; I am new to Python and trying to learn new things.

Traceback (most recent call last):
  File "news.py", line 25, in <module>
    response = requests.get(url)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 428, in request
    env_proxies = get_environ_proxies(url) or {}
  File "/usr/lib/python2.7/dist-packages/requests/utils.py", line 516, in get_environ_proxies
    if should_bypass_proxies(url):
  File "/usr/lib/python2.7/dist-packages/requests/utils.py", line 478, in should_bypass_proxies
    netloc = urlparse(url).netloc
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'tuple' object has no attribute 'find'

【Comments】:

  • Find out (debug?) the value of url at the moment the exception occurs... it may give you a hint
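
To illustrate the debugging suggestion above, here is a minimal sketch of what inspecting the value would probably reveal. The tuple literal `urlq` and the address `http://example.com/` are hypothetical stand-ins for whatever `cursor.fetchone()` returned in the original script; no database is involved here.

```python
# Hypothetical stand-in for the row that cursor.fetchone() returned.
# DB-API cursors return each row as a sequence, even for a single column.
urlq = ("http://example.com/",)

# Printing the type and repr makes the problem visible immediately.
print(type(urlq).__name__)  # tuple
print(repr(urlq))

# urlparse calls url.find(':') internally, which a tuple does not support.
try:
    urlq.find(":")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The printed type is a tuple rather than a string, which matches the last line of the traceback.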

Tags: python mysql beautifulsoup web-crawler


【Solution 1】:

fetchone returns the row as a tuple, and the URL is its first element. Pass that element to requests.get, not the tuple itself:

url = cursor.fetchone()[0]
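
One caveat with the one-liner: `fetchone()` returns `None` when no row matches, and indexing `None` raises a `TypeError`. Below is a minimal sketch of a guarded version. It makes several assumptions that are not in the original post: an in-memory `sqlite3` database stands in for the MySQL server so the example runs anywhere, the helper name `fetch_first_link` is hypothetical, and the placeholders are `?` because that is what `sqlite3` uses (MySQLdb uses `%s`, as in the question).

```python
import sqlite3

def fetch_first_link(cursor, query, params):
    """Return the first column of the first matching row, or None."""
    cursor.execute(query, params)
    row = cursor.fetchone()  # a tuple such as ('http://...',), or None
    if row is None:
        return None
    return row[0]            # unpack the single column from the tuple

# Assumption: an in-memory SQLite table mimics LINK_MASTER from the question.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE LINK_MASTER (LINK TEXT, ACT TEXT, CAT1 TEXT, CAT2 TEXT, LOC TEXT)"
)
cursor.execute(
    "INSERT INTO LINK_MASTER VALUES (?, ?, ?, ?, ?)",
    ("http://example.com/", "YES", "MOVIES", "NEWS", "SL"),
)
conn.commit()

query = (
    "SELECT LINK FROM LINK_MASTER "
    "WHERE ACT = ? AND CAT1 = ? AND CAT2 = ? AND LOC = ?"
)
url = fetch_first_link(cursor, query, ("YES", "MOVIES", "NEWS", "SL"))
missing = fetch_first_link(cursor, query, ("NO", "MOVIES", "NEWS", "SL"))

print(url)      # a plain string, safe to pass to requests.get
print(missing)  # None, because no row matched
```

With MySQLdb the same helper should work unchanged apart from the `%s` placeholders, since both modules follow the Python DB-API.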
