【Question title】: How to detect captchas when scraping Google?
【Posted】: 2017-01-18 05:47:19
【Question】:

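I'm using the requests package with BeautifulSoup to scrape the number of search results for a query from Google News. I get two kinds of IndexError, and I'd like to tell them apart: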

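  1. The search returns no results. Here soup.select('#resultStats') returns an empty list, []. What seems to happen is that when the query string is too long, Google doesn't even say "0 search results"; it just says nothing.
  2. Google serves me a captcha instead of a results page.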

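I need to distinguish these cases because I want my scraper to wait five minutes when Google sends me a captcha, but not when the result string is merely empty.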

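I currently have a jury-rigged workaround: I send a second query that is known to have a nonzero number of search results, which lets me tell the two IndexErrors apart. I'm wondering whether there is a more elegant and direct way to do this with BeautifulSoup.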

Here is my code:

import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np

URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

def tester(): # test for captcha
    test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
    dump = bs4.BeautifulSoup(test.text,"lxml")
    result = dump.select('#resultStats')
    num = result[0].getText()
    num = re.search(r"\b\d[\d,.]*\b",num).group() # regex
    num = int(num.replace(',',''))
    num = (num > 0)
    return num

def search(**params):
    response = requests.get(URL.format(**params),headers=headers)
    print(response.content, response.status_code) # check this for google requiring Captcha
    soup = bs4.BeautifulSoup(response.text,"lxml")
    elems = soup.select('#resultStats')

    try: # want code to flag if I get a Captcha
        hits = elems[0].getText()
        hits = re.search(r"\b\d[\d,.]*\b",hits).group() # regex
        hits = int(hits.replace(',',''))
        print(hits)    
        return hits
    except IndexError:
        try:
            tester() > 0 # if captcha, this will throw up another IndexError
            print("Empty results!")
            hits = 0
            return hits
        except IndexError:
            print("Captcha'd!")
            time.sleep(120) # should make it rotate IP when captcha'd
            hits = 0
            return hits

for qry in queries:  # queries: the asker's list of query strings (not shown in the post)
    hits = search(query=qry, year=2016)

【Comments】:

    Tags: beautifulsoup python-requests screen-scraping captcha google-search


    【Solution 1】:

    I would simply search the page for a "captcha" element. For example, if this is Google reCAPTCHA, you can look for the hidden input that holds the token:

    is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
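
Applied to the question, that check could replace the second "known-good" query. The following is a minimal sketch, not a tested recipe against live Google pages: `classify_page` is a hypothetical helper name, the `recaptcha-token` id comes from this answer, `#resultStats` and the regex come from the question's code, and Google's actual markup may differ or change.

```python
import re
from bs4 import BeautifulSoup

def classify_page(html):
    """Return ("captcha", None), ("hits", n) or ("empty", 0) for one response body."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: a captcha page contains the hidden reCAPTCHA token input.
    if soup.find("input", id="recaptcha-token") is not None:
        return "captcha", None

    # No captcha marker: a missing #resultStats now means "no results".
    stats = soup.find(id="resultStats")
    if stats is None:
        return "empty", 0

    # Same number-extraction regex as in the question's code.
    match = re.search(r"\b\d[\d,.]*\b", stats.get_text())
    if match is None:
        return "empty", 0
    return "hits", int(match.group().replace(",", ""))
```

Because the captcha check runs before `#resultStats` is touched, no IndexError is involved and the two cases never need to be told apart after the fact.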
    

    【Comments】:

    • If you created the soup with soup = BeautifulSoup(pageSource, 'html.parser'), you can try this instead: is_captcha_on_page = soup.find("div", id="recaptcha") is not None
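
With either marker check in hand, the "wait five minutes only on a captcha" policy from the question can be sketched as a small retry loop. This is an illustration under assumptions: `is_captcha` and `fetch_until_clear` are hypothetical names, the two element ids are the ones suggested in this thread (real captcha pages may use neither), and `fetch` stands for whatever callable returns the page HTML.

```python
import time
from bs4 import BeautifulSoup

def is_captcha(html):
    """True if the page contains either captcha marker suggested in this thread."""
    soup = BeautifulSoup(html, "html.parser")
    return (soup.find("input", id="recaptcha-token") is not None
            or soup.find("div", id="recaptcha") is not None)

def fetch_until_clear(fetch, wait_seconds=300, max_attempts=3, sleep=time.sleep):
    """Call fetch() until the page is not a captcha; return its HTML, or None."""
    for attempt in range(max_attempts):
        html = fetch()
        if not is_captcha(html):
            return html
        # Captcha: pause before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(wait_seconds)
    return None
```

In the question's code, `fetch` could be something like `lambda: requests.get(URL.format(**params), headers=headers).text`. The `sleep` parameter is injectable only so the loop can be exercised without actually waiting.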