【Question title】: Blocked while scraping goodreads.com
【Posted】: 2018-10-13 14:49:15
【Question description】:

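I am trying to scrape a large sample (100k+) of the books available at "https://www.goodreads.com/book/show/", but I keep getting blocked. So far, I have tried implementing the following solutions in my code:

  • Checking robots.txt to find which sites/elements are not accessible

  • Specifying one or several randomly varying headers

  • Using multiple working proxies to avoid being blocked

  • Setting a delay of up to 20 seconds between scraping iterations when using 10 concurrent threads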
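
As an illustration of the robots.txt check in the first bullet, here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content and the user-agent name below are hypothetical placeholders so the example runs offline; they are not the real Goodreads rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so this sketch runs offline.
# It is NOT the actual robots.txt served by goodreads.com.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-book-crawler"  # hypothetical crawler name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch() answers whether the given user agent may request the URL
allowed = parser.can_fetch(USER_AGENT, "https://www.goodreads.com/book/show/1")
blocked = parser.can_fetch(USER_AGENT, "https://www.goodreads.com/private/page")

# crawl_delay() returns the Crawl-delay value for the agent, or None if absent
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

Against a live site, `parser.set_url(".../robots.txt")` followed by `parser.read()` would replace the `parse()` call on sample text.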

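Here is a simplified version of the code, which gets blocked when trying to scrape only the book title and author, without using multiple simultaneous threads: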

import random
import time  # needed for time.sleep() in the loop below

import requests
from lxml import html

proxies_list = ["http://89.71.193.86:8080", "http://178.77.206.21:59298", "http://79.106.37.70:48550",
                "http://41.190.128.82:47131", "http://159.224.109.140:38543", "http://94.28.90.214:37641",
                "http://46.10.241.140:53281", "http://82.147.120.30:56281", "http://41.215.32.86:55561"]
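# requests picks a proxy by URL scheme, so an "https" entry is needed for https:// URLs
proxy = random.choice(proxies_list)
proxies = {"http": proxy, "https": proxy}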

# real header
# headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# multiple headers
headers_list = ['Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.38 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1623.0 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36']
headers = {"user-agent": random.choice(headers_list)}

first_url = 1
last_url = 10000     # Last book is 8,630,000
sleep_time = 20

for book_reference_number in range(first_url, last_url):
    try:
        goodreads_html = requests.get("https://www.goodreads.com/book/show/" + str(book_reference_number), timeout=5, headers=headers, proxies=proxies)
        doc = html.fromstring(goodreads_html.text)
        book_title = doc.xpath('//div[@id="topcol"]//h1[@id="bookTitle"]')[0].text.strip(", \t\n\r")
        try:
            author_name = doc.xpath('//div[@id="topcol"]//a[@class="authorName"]//span')[0].text.strip(", \t\n\r")
        except (IndexError, AttributeError):
            # No author element (or no text in it) on this page
            author_name = ""
        time.sleep(sleep_time)
        print(str(book_reference_number), book_title, author_name)
    except Exception as exc:
        # Report the reason instead of silently swallowing every error
        print(str(book_reference_number) + " cannot be scraped:", repr(exc))
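
Note that the simplified code above calls `random.choice` once, before the loop, so every request in a run reuses the same proxy and the same user-agent. If the intent is to vary them per request, one possible sketch is shown here. The helper name `pick_request_settings` and the example pool values are hypothetical; since `requests` selects a proxy by URL scheme, the helper fills in both the `http` and `https` keys.

```python
import random


def pick_request_settings(proxy_pool, user_agent_pool, rng=random):
    """Return keyword arguments for one request, chosen at random.

    Hypothetical helper: not part of the original code.
    """
    proxy = rng.choice(proxy_pool)
    return {
        "headers": {"user-agent": rng.choice(user_agent_pool)},
        # Same proxy for both schemes, so https:// URLs are proxied too
        "proxies": {"http": proxy, "https": proxy},
    }


# Placeholder pools (203.0.113.0/24 is a documentation-only address range)
example_proxies = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
example_agents = ["example-agent/1.0", "example-agent/2.0"]

# A seeded generator keeps this demonstration repeatable
settings = pick_request_settings(example_proxies, example_agents, random.Random(0))
print(settings)
```

Inside the loop, the result could be passed on as `requests.get(url, timeout=5, **pick_request_settings(proxies_list, headers_list))`.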

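【Question discussion】:

  • I support Goodreads
  • Did you manage to scrape Goodreads successfully? I always get blocked, even when I try using selenium..
  • I mean that I am on Goodreads' side in this conflict. I do build crawlers, but they (a) never pretend to be human; (b) announce themselves; (c) obey robots.txt

Tags: python web-scraping proxy http-headers robots.txt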


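【Solution 1】:

If you really want to scrape a large database, I would recommend selenium: the chance of being blocked is low, and it is stable. There is no need for time.sleep() (a time delay, though you can add one to make it more stable). Check the code below...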

import time
from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver must be discoverable (for example, copied into the Python folder)
driver = webdriver.Chrome()
# driver.set_window_position(-2000, 0)  # moves the browser window off-screen
first_url = 1
last_url = 10000     # Last book is 8,630,000

for book_reference_number in range(first_url, last_url):
    driver.get("https://www.goodreads.com/book/show/" + str(book_reference_number))
    # time.sleep(2)  # optional delay
    soup = BeautifulSoup(driver.page_source, 'lxml')
    try:
        book_title = soup.select('.gr-h1.gr-h1--serif')[0].text.strip()
    except IndexError:
        # No title element matched the selector
        book_title = ''
    try:
        author_name = soup.select('.authorName')[0].text.strip()
    except IndexError:
        # No author element matched the selector
        author_name = ''

    print('NO.', book_reference_number, 'TITLE: ', book_title, 'AUTHOR: ', author_name)

driver.quit()  # close the browser when finished

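【Discussion】:

  • Thanks for the comment. I was considering selenium as a last resort. Hopefully scraping 100k+ won't take too long. I am still wondering why goodreads.com cannot be scraped in any way with the usual requests library
  • I didn't check your code, but I think you are sending too many requests in a short time... Also, some websites have anti-scraping mechanisms to keep out bots/scrapers. That is why I suggested selenium (you can run selenium in headless mode). And I don't see any difference between requests and selenium. Unless you use multithreading!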
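
To illustrate the "too many requests in a short time" point, here is a minimal sketch of retrying with exponential backoff and jitter when a server signals overload. It makes several assumptions: that the site answers with HTTP 429 or 503 when throttling (not confirmed for Goodreads), and that `fetch` is a caller-supplied function returning `(status_code, body)`. The name `fetch_with_backoff` is hypothetical.

```python
import random
import time

# Assumption: these status codes indicate throttling or temporary overload
RETRY_STATUSES = {429, 503}


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) and retry with growing delays on throttling responses.

    fetch must return a (status_code, body) tuple. The last response is
    returned if every attempt is throttled.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRY_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            # Delay doubles on each retry; up to 1 s of jitter avoids lockstep retries
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return status, body


# Offline demonstration with a fake fetch that is throttled twice, then succeeds
calls = []
waits = []


def fake_fetch(url):
    calls.append(url)
    return (429, "") if len(calls) < 3 else (200, "ok")


result = fetch_with_backoff(fake_fetch, "https://example.invalid/book/1", sleep=waits.append)
print(result, len(calls), waits)
```

With `requests`, `fetch` could be a small wrapper that calls `requests.get(url, timeout=5)` and returns `(response.status_code, response.text)`.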