Python Selenium go around detection? Answers

【Question title】: Python Selenium go around detection?
【Posted】: 2020-04-03 12:41:58
【Question description】:

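I'm trying to scrape this page - https://www.g2.com/products/dropbox/reviews - but I get detected as soon as I make the request. Is there any way around this?

Before this I tried using requests and was detected as well. *I can't use Scrapy in this project, and I couldn't find proper information online about how to get around it...

Maybe I need to add custom headers?

Right now the output of the code is (the page title that tells you you've been detected):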

Pardon Our Interruption

Code:

from selenium import webdriver

def fetch(URL):
    # Configure a headless Chrome session
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-extensions')
    options.add_argument('--profile-directory=Default')
    options.add_argument('--incognito')
    options.add_argument('--disable-plugins-discovery')
    options.add_argument('--start-maximized')
    # `options=` replaces the deprecated `chrome_options=` keyword
    driver = webdriver.Chrome('chromedriver', options=options)
    driver.get(URL)

    print(driver.title)


fetch('https://www.g2.com/products/dropbox/reviews')

Edit: I was able to get around it and fetch a single page, but on the second run I was detected. Code:

def fetch(URL):

    firefox_profile = webdriver.FirefoxProfile()
    firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
    browser = webdriver.Firefox(executable_path='geckodriver.exe', firefox_profile=firefox_profile)
    browser.get(URL)
    print(browser.title)

fetch('https://www.g2.com/products/dropbox/reviews')

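【Comments】:

You can take a look here
OK, I can get one page now, but on the second run I get detected. Maybe proxy rotation would help.
Sometimes editing the user agent string to something more "normal" helps. The Selenium user agent is a bit odd. Though clearly this website is trying to stop the exact activity you're attempting, hahaha

Tags: python selenium google-chrome selenium-webdriver selenium-chromedriver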
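
The two suggestions in the comments above (a more ordinary user agent string and proxy rotation) can be sketched as Chrome command-line switches. This is only a sketch under assumptions: the helper names `build_chrome_args` and `make_driver`, the example user agent string, and the proxy address in the comment are hypothetical, and nothing here is confirmed to get past the detection on this site.

```python
from typing import List, Optional

# Hypothetical example value; substitute the string your own desktop browser sends.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)


def build_chrome_args(user_agent: str, proxy: Optional[str] = None) -> List[str]:
    """Return Chrome switches that override the user agent and, optionally, the proxy."""
    args = [f"--user-agent={user_agent}"]
    if proxy:
        # e.g. "http://203.0.113.10:8080" (placeholder address, not a real proxy)
        args.append(f"--proxy-server={proxy}")
    return args


def make_driver(user_agent: str = EXAMPLE_USER_AGENT, proxy: Optional[str] = None):
    """Start Chrome with the switches above (needs selenium and a chromedriver)."""
    # Imported lazily so build_chrome_args can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(user_agent, proxy):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Rotating proxies would then amount to calling `make_driver(proxy=...)` with a different address per session; whether that is enough against this particular site is not established here.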

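【Solution 1】:

I took your code, made a few adjustments, and ran the script with the ChromeDriver / Chrome combination. I ran into a similar problem: a page titled Pardon Our Interruption, as follows:

Code block: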

from selenium import webdriver

options = webdriver.ChromeOptions() 
options.add_argument('window-size=1200x600')
options.add_argument('--headless')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.g2.com/products/dropbox/reviews")
print(driver.page_source)
driver.quit()

Console output:

<html lang="zxx"><head>
    <title>Pardon Our Interruption</title>
    <link rel="stylesheet" type="text/css" href="//cdn.distilnetworks.com/css/distil.css" media="all">
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <meta name="viewport" content="width=1000">
    <meta name="robots" content="noindex, nofollow">
    <meta http-equiv="cache-control" content="max-age=0">
    <meta http-equiv="cache-control" content="no-cache">
    <meta http-equiv="expires" content="0">
    <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT">
    <meta http-equiv="pragma" content="no-cache">

    <script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/PRkVene3wKrZUWATSylf69ja/recaptcha__en.js"></script><script>
        function showBlockPage() {
        document.getElementsByClassName("container")[0].style.display = "";
        }
        setTimeout(showBlockPage, 10000);
    </script>
    <script type="text/javascript" src="/g2-meta-data" async="" defer=""></script>
    <script>if (window.sessionStorage) { sessionStorage.setItem('distil_referrer', document.referrer); }</script>

            <script src="https://www.google.com/recaptcha/api.js" async="" defer=""></script>
            <script>
            function solvedCaptcha(payload) {
                const timeoutMs = 10000;
                protectionSubmitCaptcha("recaptcha", payload, timeoutMs).then(function() {
                window.location.reload(true);
                });
            }
            </script>

    </head>
    <body class="block-page">



    <div class="container" style="">
        <script>document.getElementsByClassName("container")[0].style.display = "none";</script>
        <noscript>This page requires JavaScript!</noscript>

        <div class="row">
        <div class="sidebar col-lg-4 col-sm-5">
            <img src="//cdn.distilnetworks.com/images/anomaly-detected.png" alt="0">
        </div>
        <div class="content col-lg-8 col-sm-7">
            <h1>Pardon Our Interruption...</h1>
            <p>
            As you were browsing something about your browser made us think you were a bot. There are a few reasons this might happen:
            </p>
            <ul>
            <li>You're a power user moving through this website with super-human speed.</li>
            <li>You've disabled JavaScript and/or cookies in your web browser.</li>
            <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title="Third party browser plugins that block javascript" href="http://ds.tl/help-third-party-plugins" target="_blank">support article</a>.</li>
            </ul>


            <script>showBlockPage()</script>

            <p>After completing the CAPTCHA below, you will immediately regain access to the site again.</p>

        <div class="g-recaptcha" data-sitekey="6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh" data-callback="solvedCaptcha"><div style="width: 304px; height: 78px;"><div><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&amp;k=6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh&amp;co=aHR0cHM6Ly93d3cuZzIuY29tOjQ0Mw..&amp;hl=en&amp;v=PRkVene3wKrZUWATSylf69ja&amp;size=normal&amp;cb=m8amuk5fpfe" width="304" height="78" role="presentation" name="a-x8exk2gk39a9" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div></div>
        </div>
        </div>
    </div>



<div id="d__fFH" style="position: absolute !important; top: -5000px !important; left: -5000px !important;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: ZWAdobeF, serif !important; font-size: 72px !important; visibility: hidden;">mmmmmmmmlli</span></div><div style="background-color: rgb(255, 255, 255); border: 1px solid rgb(204, 204, 204); box-shadow: rgba(0, 0, 0, 0.2) 2px 2px 3px; position: absolute; transition: visibility 0s linear 0.3s, opacity 0.3s linear 0s; opacity: 0; visibility: hidden; z-index: 2000000000; left: 0px; top: -10000px;"><div style="width: 100%; height: 100%; position: fixed; top: 0px; left: 0px; z-index: 2000000000; background-color: rgb(255, 255, 255); opacity: 0.05;"></div><div class="g-recaptcha-bubble-arrow" style="border: 11px solid transparent; width: 0px; height: 0px; position: absolute; pointer-events: none; margin-top: -11px; z-index: 2000000000;"></div><div class="g-recaptcha-bubble-arrow" style="border: 10px solid transparent; width: 0px; height: 0px; position: absolute; pointer-events: none; margin-top: -10px; z-index: 2000000000;"></div><div style="z-index: 2000000000; position: relative;"><iframe title="recaptcha challenge" src="https://www.google.com/recaptcha/api2/bframe?hl=en&amp;v=PRkVene3wKrZUWATSylf69ja&amp;k=6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh&amp;cb=yl5twmy9lj55" name="c-x8exk2gk39a9" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox" style="width: 100%; height: 100%;"></iframe></div></div></body></html>

Analysis

While inspecting the page, you will find that the <body> tag contains:

<script>window.distilReferrerValue = function() {
  var value;

  try {
    if (window.sessionStorage) {
      value = sessionStorage.getItem('distil_referrer');
      sessionStorage.removeItem('distil_referrer');
    }
  } catch(e) {}

  window.distilReferrerValue = function() {
    return value;
  };
  return value;
};</script>

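This clearly indicates that the website https://www.g2.com/products/dropbox/reviews is protected by the Bot Management service provider Distil Networks, and that navigation by ChromeDriver gets detected and subsequently blocked.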
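
A script can check for this block page itself instead of relying on a printed title. The following is a minimal sketch, assuming the two markers visible in the console output above (the "Pardon Our Interruption" title and assets served from cdn.distilnetworks.com) remain stable; the function name `looks_like_distil_block` is hypothetical, and the markers could change at any time.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_distil_block(page_source: str) -> bool:
    """Heuristic: True if the HTML resembles the block page shown above."""
    parser = _TitleParser()
    parser.feed(page_source)
    title_hit = "pardon our interruption" in parser.title.lower()
    # Assumption: block-page assets are served from this CDN host.
    asset_hit = "cdn.distilnetworks.com" in page_source
    return title_hit or asset_hit
```

With the script above, this could be called as `looks_like_distil_block(driver.page_source)` right before `driver.quit()`.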

Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".

Reference

You can find a related discussion in Chrome browser initiated through ChromeDriver gets detected

【Discussion】: