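【Question title】: Sending a GET request in Python so that it appears to come from a different country
【Posted】: 2020-10-04 05:47:03
【Question】:

I want to scrape details from https://bookdepository.com. The problem is that the site detects the visitor's country and changes the prices accordingly. I want it to see a different country. This is my code; I run it on real.it, and I need the Book Depository website to think I am in Israel.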

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
bookdepo_url = 'https://www.bookdepository.com/search?search=Find+book&searchTerm=' + "0671646788".replace(' ', "+")
search_result = requests.get(bookdepo_url, headers=headers)
soup = BeautifulSoup(search_result.text, 'html.parser')
result_divs = soup.find_all("div", class_="book-item")

【Comments】:

  • The server sees the client's IP address, and no amount of tinkering with request headers will change that.

Tags: python get request http-get


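【Solution 1】:

You need to route your request through a proxy server or a VPN, or else run your code on a machine located in Israel.

That said, the following works (as of this writing):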


import pprint

from bs4 import BeautifulSoup
import requests


def make_proxy_entry(proxy_ip_port):
    """Build a requests-style proxies mapping that sends HTTP and HTTPS through one proxy."""
    val = f"http://{proxy_ip_port}"
    return dict(http=val, https=val)

headers = {
  "User-Agent": (
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
}

bookdepo_url = (
    'https://www.bookdepository.com/search?search=Find+book&searchTerm='
    '0671646788'
)

ip_opts = ['82.166.105.66:44081', '82.81.32.165:3128', '82.81.169.142:80',
           '81.218.45.159:8080', '82.166.105.66:43926', '82.166.105.66:58774',
           '31.154.189.206:8080', '31.154.189.224:8080', '31.154.189.211:8080',
           '213.8.208.233:8080', '81.218.45.231:8888', '192.116.48.186:3128',
           '185.138.170.204:8080', '213.151.40.43:8080', '81.218.45.141:8080']

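search_result = None
# Try each proxy in turn and stop at the first one that returns a good response.
for ip_port in ip_opts:
    proxy_entry = make_proxy_entry(ip_port)
    try:
        # Free proxies are often dead or slow, so do not wait indefinitely
        # (the 10-second limit is an arbitrary choice).
        search_result = requests.get(bookdepo_url, headers=headers,
                                     proxies=proxy_entry, timeout=10)
        # Treat HTTP error statuses (e.g. from the proxy itself) as failures.
        search_result.raise_for_status()
        pprint.pprint('Successfully gathered results')
        break
    except requests.RequestException as e:
        pprint.pprint(f'Failed to connect to endpoint, with proxy {ip_port}.\n'
                      f'Details: {pprint.saferepr(e)}')
else:
    # The loop finished without a break: every proxy failed.
    pprint.pprint('Never made successful connection to end-point!')
    search_result = None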

# A Response is truthy only when its status code indicates success.
if search_result:
    soup = BeautifulSoup(search_result.text, 'html.parser')
    result_divs = soup.find_all("div", class_="book-item")
    pprint.pprint(result_divs)

This solution uses the proxies parameter of the requests library. I took the proxy list from one of the many free proxy list sites: http://spys.one/free-proxy-list/IL/
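
If several requests need to go through the same proxy, the mapping can also be set once on a requests.Session instead of being passed to every call. The following is only a minimal sketch: make_proxied_session is a hypothetical helper name, and the proxy address is taken from the list above, so it may well be dead by now.

```python
import requests


def make_proxied_session(proxy_ip_port, user_agent=None):
    """Return a Session whose requests are all routed through one HTTP proxy.

    Hypothetical helper for illustration; no network traffic happens here.
    """
    session = requests.Session()
    proxy_url = f"http://{proxy_ip_port}"
    # Session.proxies is a plain dict keyed by URL scheme.
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    if user_agent:
        session.headers["User-Agent"] = user_agent
    return session


# Assumed proxy address (copied from the list in the answer).
session = make_proxied_session("82.166.105.66:44081")
print(session.proxies)
```

A later call such as session.get(url, timeout=10) would then use the proxy without a proxies argument.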

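The list of proxy IP addresses and ports was created with the following JavaScript snippet, which scrapes the data off the page through the browser's developer tools: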

console.log(
    "['" +
    Array.from(document.querySelectorAll('td>font.spy14'))
    .map(e=>e.parentElement)
    .filter(e=>e.offsetParent !== null)
    .filter(e=>window.getComputedStyle(e).display !== 'none')
    .filter(e=>e.innerText.match(/\s*(\d{1,3}\.){3}\d{1,3}\s*:\s*\d+\s*/))
    .map(e=>e.innerText)
    .join("', '") +
    "']"
)

Note: yes, the JavaScript is ugly and crude, but it gets the job done.
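
The same extraction could be sketched in Python for text copied from such a page. This is not what the answer did; it is an illustration that assumes the page text has already been saved to a string, and extract_proxies and the sample text are made up for the example. Unlike the JavaScript, it also rejects out-of-range octets and ports.

```python
import re

# Matches "a.b.c.d:port", tolerating whitespace around the colon,
# similar to the regular expression in the JavaScript snippet.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})\s*:\s*(\d+)\b")


def extract_proxies(text):
    """Return unique 'ip:port' strings found in text, in order of appearance."""
    found = []
    for ip, port in PROXY_RE.findall(text):
        octets_ok = all(0 <= int(octet) <= 255 for octet in ip.split("."))
        port_ok = 0 < int(port) < 65536
        if octets_ok and port_ok:
            entry = f"{ip}:{port}"
            if entry not in found:
                found.append(entry)
    return found


# Made-up sample text: one spaced entry, one invalid IP, one duplicate.
sample = (
    "82.81.32.165 : 3128  HTTP  IL\n"
    "999.1.1.1:80\n"
    "82.81.32.165:3128\n"
    "213.8.208.233:8080"
)
print(extract_proxies(sample))
# → ['82.81.32.165:3128', '213.8.208.233:8080']
```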

At the end of the Python script's run, I do see the currency resolve to Israeli new shekels (ILS) as desired, based on the following element in the resulting HTML:

<a ... data-currency="ILS" data-isbn="9780671646783" data-price="57.26" ...>
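
That check can be automated rather than done by eye. The sketch below uses only the standard library's html.parser so it runs without the live page; PriceCollector and collect_prices are hypothetical names, and the sample markup keeps only the three attributes shown above (the elided ones are left out).

```python
from html.parser import HTMLParser


class PriceCollector(HTMLParser):
    """Collect (isbn, currency, price) from tags that carry data-currency."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-currency" not in attr_map:
            return
        raw_price = attr_map.get("data-price")
        price = float(raw_price) if raw_price is not None else None
        self.prices.append(
            (attr_map.get("data-isbn"), attr_map["data-currency"], price)
        )


def collect_prices(html):
    """Parse html and return the list of (isbn, currency, price) tuples."""
    parser = PriceCollector()
    parser.feed(html)
    return parser.prices


# Sample markup reduced to the attributes quoted in the answer.
sample_html = (
    '<a data-currency="ILS" data-isbn="9780671646783" data-price="57.26">'
    "Add to basket</a>"
)
prices = collect_prices(sample_html)
print(prices)
# → [('9780671646783', 'ILS', 57.26)]
```

In the script above, collect_prices(search_result.text) could be used to confirm that every returned price is in ILS before trusting the proxy.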

