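【Question title】: Sometimes the code runs and sometimes it gives an error
【Posted】: 2017-05-16 14:19:37
【Question】:

Below is my code for scraping a website with Beautiful Soup. The code runs fine on Windows but has problems on Ubuntu. On Ubuntu, it sometimes runs and sometimes raises an error.

The error is as follows: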

Traceback (most recent call last):
  File "Craftsvilla.py", line 22, in <module>
    source =  requests.get(new_url)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.craftsvilla.com', port=80): Max retries exceeded with url: /shop/01-princess-ayesha-cotton-salwar-suit-for-rudra-house/5601472 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f6685fc3310>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Here is my code:

import requests
import lxml
from bs4 import BeautifulSoup
import xlrd
import xlwt

file_location = "/home/nitink/Python Linux/BeautifulSoup/Craftsvilla/Craftsvilla.xlsx"

workbook = xlrd.open_workbook(file_location)

sheet = workbook.sheet_by_index(0)

products = []
for r in range(sheet.nrows):
    products.append(sheet.cell_value(r,0))

book = xlwt.Workbook(encoding= "utf-8", style_compression = 0)
sheet = book.add_sheet("Sheet11", cell_overwrite_ok=True)

for index, url in enumerate(products):
    new_url = "http://www." + url
    source =  requests.get(new_url)
    data = source.content
    soup = BeautifulSoup(data, "lxml")

    sheet.write(index, 0, url)

    try:
        Product_Name = soup.select(".product-title")[0].text.strip()
        sheet.write(index, 1, Product_Name)

    except Exception:
        sheet.write(index, 1, "")

book.save("Craftsvilla Output.xls")

Save the following links as Craftsvilla.xlsx:

craftsvilla.com/shop/01-princess-ayesha-cotton-salwar-suit-for-rudra-house/5601472
craftsvilla.com/shop/3031-pista-prachi/3715170
craftsvilla.com/shop/795-peach-colored-stright-salwar-suit/5608295
craftsvilla.com/catalog/product/view/id/5083511/s/dharm-fashion-villa-embroidery-navy-blue-slawar-suit-gown

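Note: For some people the code will run, but keep trying for a while and the same code gives an error. I don't know why. The same code never gives an error on Windows.

【Comments on the question】: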

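  • I think you are sending too many requests from the same IP address in a short time, so the server may be refusing your connection.
  • But why does the same code never give an error on Windows?
  • Add print(new_url) after the new_url line; I suspect the data you read from the xlsx file is incomplete.
  • pip install pyopenssl. Sometimes it is just an SSL error, where your requests line keeps retrying and failing.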
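
The print(new_url) check suggested in the comments could be sketched roughly as follows. The helper name build_urls and its prefix parameter are hypothetical and not part of the original code; the idea is only to print every URL before it is requested and to skip cells that are blank or not text.

```python
def build_urls(cells, prefix="http://www."):
    """Return (row_index, url) pairs, skipping blank or non-text cells."""
    urls = []
    for row, cell in enumerate(cells):
        # Spreadsheet cells can hold numbers or empty values; keep text only.
        text = cell.strip() if hasattr(cell, "strip") else ""
        if not text:
            print("row %d: skipping empty or non-text cell %r" % (row, cell))
            continue
        url = prefix + text
        print("row %d: %s" % (row, url))  # shows exactly what will be requested
        urls.append((row, url))
    return urls
```

With the question's code, this would be called as build_urls(products) before the request loop, so a malformed row shows up in the output instead of surfacing later as a connection error.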

Tags: python beautifulsoup


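【Solution 1】:

It seems you are hitting the site too frequently and the server is refusing your requests. Be a good web-scraping citizen and add a time delay between subsequent requests: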

import time

for index, url in enumerate(products):
    new_url = "http://www." + url
    source =  requests.get(new_url)
    data = source.content
    soup = BeautifulSoup(data, "lxml")

    # ...

    time.sleep(1)  # one second delay
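
If a fixed delay alone is not enough, one possible extension is to retry a failed request after a pause, since the traceback in the question reports a failed name lookup ("Name or service not known"), which may be temporary. The sketch below is only an illustration under that assumption: fetch_with_retry and its parameters (retries, delay, retry_on, sleep) are hypothetical names, not part of requests.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0,
                     retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(url); on a listed exception, pause and try again."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            sleep(delay * attempt)  # wait a little longer after each failure
```

In the loop above, source = requests.get(new_url) could then become source = fetch_with_retry(requests.get, new_url, retry_on=(requests.exceptions.ConnectionError,)), while the time.sleep(1) between successful requests stays in place.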

【Comments】:
