Can't make my script fetch desired content using proxies

[Question title]: Can't make my script fetch desired content using proxies
[Posted]: 2018-12-30 20:18:12
[Question]:

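I've written a script in Python, in combination with selenium and using proxies, to fetch the text of the different links that get populated on navigating to a url, as in this one. What I want to parse from there is the visible text connected to each link.

The script I've tried so far is able to produce a new proxy each time the function start_script() is called within it. The problem is that the url leads me to this redirected link. I can only get rid of this redirection if I keep trying until the url is satisfied with a proxy. My current script can only try twice, with two new proxies.

How can I use a loop within the get_texts() function so that it keeps trying with new proxies until it parses the required content?

My attempt so far: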

import requests
import random
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'http://www.google.com/search?q=python'

def get_proxies():   
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text,"lxml")
    proxies = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxies

def start_script():
    proxies = get_proxies()
    random.shuffle(proxies)
    proxy = next(cycle(proxies))
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={proxy}')
    driver = webdriver.Chrome(chrome_options=chrome_options)
    return driver

def get_texts(url):
    driver = start_script()
    driver.get(url)
    if "index?continue" not in driver.current_url:
        for item in [items.text for items in driver.find_elements_by_tag_name("h3")]:
            print(item)
    else:
        get_texts(url)

if __name__ == '__main__':
    get_texts(link)

[Comments]:

Tags: python python-3.x selenium selenium-webdriver web-scraping

[Answer 1]:

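The code below works well for me, but it can't help you with bad proxies. It loops through the list of proxies and tries one at a time until it succeeds or the list runs out. It prints which proxy it is using, so you can see that it tries more than once.

However, as https://www.us-proxy.org/ points out:

What is a Google proxy? Proxies that support searching on Google are called Google proxies. Some programs need them to make a large number of queries on Google. Since 2016, all Google proxies have been dead. Read that article for more information.

Article:

Google Blocks Proxy in 2016. Google shows a page to verify that you are a human instead of a robot if a proxy is detected. Before 2016, Google allowed using that proxy for some time if you could pass this human verification.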

from contextlib import contextmanager
import random

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By


def get_proxies():
    # Scrape ip:port pairs from the free proxy list, keeping only rows that contain "yes"
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text,"lxml")
    proxies = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    random.shuffle(proxies)
    return proxies


# Only need to fetch the proxies once
PROXIES = get_proxies()


@contextmanager
def proxy_driver():
    # Raises IndexError once the proxy list is exhausted
    proxy = PROXIES.pop()
    print(f'Running with proxy {proxy}')
    chrome_options = webdriver.ChromeOptions()
    # chrome_options.add_argument("--headless")
    chrome_options.add_argument(f'--proxy-server={proxy}')
    # Create the driver before the try block so that the finally clause
    # never refers to a driver that was not created
    driver = webdriver.Chrome(options=chrome_options)
    try:
        yield driver
    finally:
        # quit() ends the whole browser session, not just the current window
        driver.quit()

def get_texts(url):
    with proxy_driver() as driver:
        driver.get(url)
        if "index?continue" not in driver.current_url:
            return [items.text for items in driver.find_elements(By.TAG_NAME, "h3")]
        # Redirected to the verification page; returns None so the caller retries
        print('recaptcha')
if __name__ == '__main__':
    link = 'http://www.google.com/search?q=python'
    while True:
        links = get_texts(link)
        if links:
            break
    print(links)
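
The loop above ends with an exception once PROXIES is empty. As a rough sketch of the same "try one proxy after another until one works or the list runs out" idea, here is a version with Selenium factored out so that it runs on its own. The names fetch_with_proxies and fake_fetch, and the proxy addresses, are made up for illustration; in a real script the fetch argument would be a function that starts a driver with the given proxy and returns the h3 texts.

```python
from typing import Callable, Iterable, List, Optional


def fetch_with_proxies(
    proxies: Iterable[str],
    fetch: Callable[[str], Optional[List[str]]],
) -> Optional[List[str]]:
    """Try each proxy in turn; return the first non-empty result, else None."""
    for proxy in proxies:
        try:
            result = fetch(proxy)
        except Exception as exc:
            # A dead proxy should not stop the loop; move on to the next one
            print(f'{proxy} failed: {exc}')
            continue
        if result:
            return result
        print(f'{proxy} gave no usable content')
    # Every proxy was tried without success
    return None


def fake_fetch(proxy: str) -> Optional[List[str]]:
    """Hypothetical stand-in for 'open the page through this proxy and scrape it'."""
    if proxy == '10.0.0.1:80':
        raise ConnectionError('proxy unreachable')
    if proxy == '10.0.0.2:80':
        return None  # e.g. redirected to the verification page
    return ['Welcome to Python.org']


if __name__ == '__main__':
    candidates = ['10.0.0.1:80', '10.0.0.2:80', '10.0.0.3:80']
    print(fetch_with_proxies(candidates, fake_fetch))
```

Returning None when every proxy has been tried lets the caller decide what to do next (fetch a fresh proxy list, give up, and so on) instead of dying with a traceback.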

[Comments]:

[Answer 2]:

while True:
  driver = start_script()
  driver.get(url)
  if "index?continue" in driver.current_url:
    continue
  else:
    break

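This will loop until index?continue is not in the url, and then break out of the loop.

This answer only addresses your specific question - it doesn't address the fact that you may be creating a large number of web drivers while never destroying the unused / failed ones. Hint: you should.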
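
The hint about destroying failed drivers could look roughly like the sketch below. It is only an illustration under a few assumptions: make_driver is any callable that returns a fresh driver (for example start_script from the question), max_attempts is an arbitrary cap added here, and FakeDriver / FakeElement are hypothetical stand-ins so the example runs without a browser. The string "tag name" is the value of Selenium's By.TAG_NAME.

```python
from typing import Callable, List


def texts_from_first_good_driver(
    url: str,
    make_driver: Callable[[], object],
    max_attempts: int = 10,
) -> List[str]:
    """Open url with fresh drivers until one is not redirected, quitting every driver."""
    for _ in range(max_attempts):
        driver = make_driver()
        try:
            driver.get(url)
            if "index?continue" not in driver.current_url:
                # The texts are collected before the finally clause quits the driver
                return [el.text for el in driver.find_elements("tag name", "h3")]
        finally:
            # Runs for failed and successful drivers alike
            driver.quit()
    return []


class FakeElement:
    def __init__(self, text: str) -> None:
        self.text = text


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, for demonstration only."""
    instances: List["FakeDriver"] = []

    def __init__(self, redirected: bool) -> None:
        self.redirected = redirected
        self.current_url = ''
        self.closed = False
        FakeDriver.instances.append(self)

    def get(self, url: str) -> None:
        prefix = 'https://www.google.com/sorry/index?continue=' if self.redirected else ''
        self.current_url = prefix + url

    def find_elements(self, by: str, value: str) -> List[FakeElement]:
        return [FakeElement('Welcome to Python.org')]

    def quit(self) -> None:
        self.closed = True


if __name__ == '__main__':
    outcomes = iter([True, True, False])  # two redirected attempts, then a good one
    link = 'http://www.google.com/search?q=python'
    print(texts_from_first_good_driver(link, lambda: FakeDriver(next(outcomes))))
    print(all(d.closed for d in FakeDriver.instances))
```

Putting quit() in a finally clause means that a driver is shut down whether the attempt was redirected, succeeded, or raised an exception.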

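[Comments]:

Actually, you can just use if not "index?continue" in driver.current_url: break - there is no need for the else block
Although it is within a while loop, I highly doubt that your suggested change will make the script run more than once @Danielle M.
@asmitu What makes you doubt it?
Executing it, @Danielle M. I have already tried it, and that is the feedback.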