How to load lazy content on a LinkedIn search page using Selenium: answers

【Question title】: How to load lazy content on Linkedin search page using selenium
【Posted】: 2019-06-01 12:34:20
【Question description】:

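Summary

I am trying to scrape the profile links of all first-degree connections of an account from the LinkedIn search page. Because the page loads the rest of its content dynamically (as you scroll down), I cannot get at the "Next" button, which sits at the end of the page.

Problem description

https://linkedin.com/search/results/people/?facetGeoRegion=["tr%3A0"]&facetNetwork=["F"]&origin=FACETED_SEARCH&page=YOUR_PAGE_NUMBER

I can navigate to the search page using Selenium and the link above. I want to know how many pages there are so that I can go through them just by changing the page= variable in the link above.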
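As a minimal sketch of the "change only page=" idea, the helper below builds the search URL for a given page number. The class and method names (SearchPageUrls, forPage) are hypothetical, and it assumes page numbers start at 1, as in the code further down; the URL prefix is copied from the link above.

```java
// Hypothetical helper, not part of Selenium or LinkedIn: builds the search URL per page.
final class SearchPageUrls {
    // Fixed part of the URL from the question; only the page number varies.
    private static final String BASE =
            "https://linkedin.com/search/results/people/"
            + "?facetGeoRegion=[\"tr%3A0\"]&facetNetwork=[\"F\"]"
            + "&origin=FACETED_SEARCH&page=";

    private SearchPageUrls() {}

    // Assumption: pages are numbered from 1.
    static String forPage(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1, got " + page);
        }
        return BASE + page;
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages.
        for (int page = 1; page <= 3; page++) {
            System.out.println(forPage(page));
        }
    }
}
```

With a driver at hand, each result of forPage(n) could be passed to driver.get(...) in turn.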

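To do that, I wanted to check for the existence of the Next button. As long as there is a Next button, I would request the next page for scraping. But if you do not scroll down to the bottom of the page, which is where the Next button is, you can find neither the Next button nor the information about the other profiles, because they have not been loaded yet.

This is how the page looks when you do not scroll down and take a screenshot of the whole page with the Firefox screenshot tool.

How I implemented it

I can work around this by hard-coding a scroll-down action into my code and making the driver wait for visibilityOfElementLocated. But I wonder whether there is a better way than mine. With this approach, if the driver cannot find the Next button, the program exits with exit code 1.

When I inspect the requests while scrolling down the page, they are just requests for images and the like, as shown below. I could not figure out how the page loads more information about the profiles as I scroll down.

Source code

Here is how I implemented it in my code. This app is just a simple implementation that tries to find the Next button on the page.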

package com.andreyuhai;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class App 
{
    WebDriver driver;

    public static void main( String[] args )
    {
        Bot bot = new Bot("firefox", false, false, 0, 0, null, null, null);

        int pagination = 1;

        bot.get("https://linkedin.com");
        if(bot.attemptLogin("username", "pw")){
            bot.get("https://www.linkedin.com/" +
                    "search/results/people/?facetGeoRegion=" +
                    "[\"tr%3A0\"]&origin=FACETED_SEARCH&page=" + pagination);


            JavascriptExecutor js = (JavascriptExecutor) bot.driver;

            js.executeScript("scrollBy(0, 2500)");

            WebDriverWait wait = new WebDriverWait(bot.driver, 10);
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//button[@class='next']/div[@class='next-text']")));

            WebElement nextButton = bot.driver.findElement(By.xpath("//button[@class='next']/div[@class='next-text']"));


            if(nextButton != null ) {
                System.out.println("Next Button found");
                nextButton.click();
            }else {
                System.out.println("Next Button not found");
            }
        }
    }
}

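Another tool I am curious about: LinkedIn Spider

There is a Chrome extension called LinkedIn Spider.

It does exactly what I am trying to achieve, presumably using JavaScript, though I am not sure. But when I run this extension on the same search page, it does not scroll down or load the other pages one by one to extract the data.

So my questions are:

Could you please explain how LinkedIn does this? I mean, how does it load the profile information as I scroll down if it is not making any requests? I really do not know. I would appreciate any source links or explanations.
Do you have a better (I mean faster) idea for implementing what I am trying to achieve?
Could you please explain how LinkedIn Spider can work without scrolling down and so on?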

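【Question discussion】:

This is called lazy loading, not dynamic content.
I did not even know what it was called, so I wrote it as dynamic content(?). I will edit the title now. Thanks.

Tags: java selenium selenium-webdriver web-scraping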

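【Solution 1】:

I have checked the div structure and the way LinkedIn shows the results. If you hit the URL directly and check with the following XPath: //li[contains(@class,'search-result')], you will find that all the results are already loaded on the page, but LinkedIn shows only 5 results at a time and shows the next 5 as you scroll. All the results are nevertheless on the page and can be found with the XPath mentioned.

See this image, which highlights the div structure and the results found by the XPath right after hitting the URL: https://imgur.com/Owu4NPh
and this image, which highlights the div structure and the results found by the same XPath after scrolling to the bottom of the page: https://imgur.com/7WNR830

You can see that the result set is the same, but there is an additional search-result__occlusion-hint part in the <li> tag of the last 5 results. Through this, LinkedIn hides the next 5 results and shows only the first 5 at first.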
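To make the distinction concrete, here is a small sketch that uses only the standard library and tells an occluded result from a populated one by looking at its class attribute. The helper name hasOcclusionHint and the sample attribute values are made up for illustration; with Selenium, the string would typically come from element.getAttribute("class") on each matched li element.

```java
import java.util.Arrays;

// Hypothetical helper: checks a class attribute for the occlusion-hint class.
final class OcclusionHints {
    // Class name taken from the answer above.
    static final String HINT_CLASS = "search-result__occlusion-hint";

    private OcclusionHints() {}

    // True when the whitespace-separated class list contains the hint class as a whole token.
    static boolean hasOcclusionHint(String classAttribute) {
        if (classAttribute == null) {
            return false;
        }
        return Arrays.asList(classAttribute.trim().split("\\s+")).contains(HINT_CLASS);
    }

    public static void main(String[] args) {
        // Sample attribute values are invented for this demo.
        System.out.println(hasOcclusionHint("search-result search-result__occlusion-hint"));
        System.out.println(hasOcclusionHint("search-result"));
    }
}
```

Matching whole tokens rather than using a substring test avoids false positives on longer class names that merely start with the same text.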

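Now the implementation. I have checked that the Next button appears only once you have scrolled through all the results on the page. So instead of scrolling to fixed coordinates, which can change with screen size and window, you can collect the results in a list of WebElements, get its size, and scroll to the last element of that list. If there are 10 results in total, the page scrolls to the 10th result; if there are only 4, it scrolls to the 4th. After scrolling, you can check whether the Next button is present on the page. To do this, fetch the Next button as a list of WebElements and check the list size: if it is greater than 0, the Next button is present on the page; if not, there is no Next button and you can stop the execution there.

To implement this, I used a boolean with an initial value of true. The code runs in a loop until the boolean becomes false, which happens when the size of the Next button list equals 0.

Please refer to the code below: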

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class App
{
    // Static so that the static methods below can use it
    static WebDriver driver;

    // Runs a script through the driver's JavascriptExecutor
    public static Object executeScript(String script, Object... args) {
        JavascriptExecutor exe = (JavascriptExecutor) driver;
        return exe.executeScript(script, args);
    }

    // Scrolls the window to the element's location
    public static void scrollToElement(WebElement element) {
        executeScript("window.scrollTo(arguments[0],arguments[1])", element.getLocation().x, element.getLocation().y);
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder: put your direct search URL here
        String url = "YOUR_SEARCH_URL";
        // You can change the driver to bot according to your use case
        driver = new FirefoxDriver();
        // Open the URL and perform the login after that, if necessary
        driver.get(url);
        // Wait for the page to load completely
        Thread.sleep(10000);
        // Initialising the boolean
        boolean nextButtonPresent = true;
        while (nextButtonPresent) {
            // Fetching the results on the page by the xpath
            List<WebElement> results = driver.findElements(By.xpath("//li[contains(@class,'search-result')]"));
            // Scrolling to the last element in the list (skipped when there are no results)
            if (!results.isEmpty()) {
                scrollToElement(results.get(results.size() - 1));
            }
            Thread.sleep(2000);

            // Checking if next button is present on the page
            List<WebElement> nextButton = driver.findElements(By.xpath("//button[@class='next']"));
            if (nextButton.size() > 0) {
                // If yes then clicking on it
                nextButton.get(0).click();
                Thread.sleep(10000);
            } else {
                // Else setting the boolean as false
                nextButtonPresent = false;
                System.out.println("Next button is not present, so ending the script");
            }
        }
    }
}

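【Discussion】:

Thanks for your answer. One last question: is there a way to load that content without scrolling down? For example, the last 5 search results.
The results are already loaded when the page opens, but they are not visible. If you want to click the Next button, you need to scroll to the last result.
But how? I mean the list elements are already loaded, but if you inspect them, their content is blank. See this image.
Yes, I have checked it, and you do need to scroll the page. What LinkedIn does is keep all the data in JS, detect the scroll, and inject the data into the HTML on scroll, which then populates all the HTML divs. If you inspect the last 5 elements, they have an additional value in the class of the <li>, namely search-result__occlusion-hint, and through this LinkedIn populates the data only on scroll.
However, LinkedIn maintains all the result rows from the very start, so you can get the row count, scroll to the last row, and then fetch all the data, because all of it is injected into the HTML once LinkedIn detects the scroll.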

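【Solution 2】:

What I observed is that the content is already loaded in the page and is shown to us as we scroll down.

But if we load the page manually and inspect the 'Next >' button by its class name 'next', for example as below,

//button[@class='next']

we cannot locate it until we scroll down, because it is not visible to us. However, with the XPath below we can count all the profile link entries, whether or not they are displayed:

//h3[contains(@class, 'search-results__total')]/parent::div/ul/li

Since you want to fetch all the profile links from the page, we can do that with the help of the XPath above. We get the link count using that XPath, then scroll each element into view one at a time, and then fetch the profile links as below: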

// Identifying all the profile link entries
List<WebElement> totalProfileLinks = driver.findElements(By.xpath("//h3[contains(@class, 'search-results__total')]/parent::div/ul/li"));
// Looping over every entry to get its profile link (list indices are 0-based)
for(int i=0;i<totalProfileLinks.size();i++) {
    // Scrolling so that it will be visible
    ((JavascriptExecutor) driver).executeScript("arguments[0].scrollIntoView(true);", totalProfileLinks.get(i));
    // Fetching the anchor node (XPath positions are 1-based, hence i+1)
    final WebElement link = driver.findElement(By.xpath("(//h3[contains(@class, 'search-results__total')]/parent::div/ul/li//div[contains(@class, 'search-result__info')]//a)["+(i+1)+"]"));
    // Avoiding the StaleElementReferenceException
    new FluentWait<WebDriver>(driver).withTimeout(1, TimeUnit.MINUTES).pollingEvery(1, TimeUnit.SECONDS).ignoring(StaleElementReferenceException.class).until(new Function<WebDriver, WebElement>() {
        public WebElement apply(WebDriver arg0) {
            return link;
        }
    });
    // Fetching and printing the link from anchor node
    System.out.println(link.getAttribute("href").trim());
}
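The loop above touches two index conventions: Java's List.get is 0-based, while an XPath positional predicate such as (...)[n] is 1-based. The sketch below keeps the conversion in one place. The helper name ProfileLinkXPath.forListIndex is hypothetical; the XPath itself is the one used in this answer.

```java
// Hypothetical helper: maps a 0-based list index to the 1-based XPath for that profile link.
final class ProfileLinkXPath {
    // Anchor XPath from the answer, without the positional predicate.
    private static final String ALL_LINKS =
            "(//h3[contains(@class, 'search-results__total')]/parent::div/ul/li"
            + "//div[contains(@class, 'search-result__info')]//a)";

    private ProfileLinkXPath() {}

    static String forListIndex(int listIndex) {
        if (listIndex < 0) {
            throw new IllegalArgumentException("listIndex must be >= 0, got " + listIndex);
        }
        // XPath positions start at 1, so shift the Java index by one.
        return ALL_LINKS + "[" + (listIndex + 1) + "]";
    }

    public static void main(String[] args) {
        // The first list entry (index 0) maps to XPath position 1.
        System.out.println(forListIndex(0));
    }
}
```

Inside a loop that runs i from 0 to size - 1, By.xpath(ProfileLinkXPath.forListIndex(i)) would then address the same entry as totalProfileLinks.get(i), assuming each result contributes one matching anchor.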

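So, if we want to click the 'Next >' button, we first need to check whether it is present (as we scroll while fetching the profile links, the 'Next >' button gets displayed as well). We can use the `driver.findElements();` method to get the matches for that element and store them in a list (since it returns a list of WebElements), as below:

List<WebElement> nextButton = driver.findElements(By.className("next"));

The benefit of this technique is that the script does not fail when no element matches; in that case we simply get an empty list.

We can then use the size() method of the List interface to get the number of matches, as below:

int size = nextButton.size();

If the size is greater than 0, the element is present; otherwise it is not. We can check that condition as below:

if(size > 0) {
    nextButton.get(0).click(); // Do some operation like clicking on it
    System.out.println("=> 'Next >' button is there and clicked on it...");
} else {
    System.out.println("=> 'Next >' button is NOT there...");
}

Once the content is loaded and the element is visible, we will use the JavaScriptExecutor to click on it.

Wrap the above code in a while loop and check for the 'Next >' button again after each click on the previous 'Next >' button, as below: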

boolean next = true;
while(next) {
    // Checking 'Next >' button is there or not in the page
    List<WebElement> nextButton = driver.findElements(By.className("next"));
    // If the 'Next >' button is there then clicking on it otherwise stopping the execution
    if(nextButton.size() > 0) {
        doClickUsingJSE(nextButton.get(0));
        System.out.println("=> 'Next >' button is there and clicked on it...");
    } else {
        next = false;
        System.out.println("=> 'Next >' button is NOT there so stopping the execution...");
    }
    Thread.sleep(1000);
}

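If the 'if' condition in the above code fails, the loop ends, because 'next' becomes 'false'. Using a Fluent Wait helps us avoid some exceptions such as 'WebDriverException' and 'StaleElementReferenceException'. So I have written a separate method that waits for an element while ignoring those exceptions and clicks on it once the conditions are satisfied.

Check the code below: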

private static void doClickUsingJSE(final WebElement element) {
    // Using the Fluent Wait to avoid some exceptions like WebDriverException and StaleElementReferenceException
    Wait<WebDriver> wait = new FluentWait<WebDriver>(driver).withTimeout(1, TimeUnit.MINUTES).pollingEvery(1, TimeUnit.SECONDS).ignoring(WebDriverException.class, StaleElementReferenceException.class);
    WebElement waitedElement = wait.until(new Function<WebDriver, WebElement>() {
        public WebElement apply(WebDriver driver) {
            return element;
        }
    });
    wait.until(ExpectedConditions.visibilityOf(waitedElement));
    wait.until(ExpectedConditions.elementToBeClickable(waitedElement));
    // Clicking on the particular element using the JavaScriptExcecutor
    ((JavascriptExecutor) driver).executeScript("arguments[0].click();", waitedElement);
    }
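The ignore-and-retry behaviour that a fluent wait provides can be illustrated without Selenium. The following is a simplified stand-in using only the standard library, not Selenium's implementation: the name pollUntil, its parameters, and the choice of IllegalStateException for the timeout are all assumptions made for this sketch.

```java
import java.util.function.Supplier;

// Simplified, hypothetical stand-in for a fluent wait: poll until a non-null value or a timeout.
final class PollingWait {
    private PollingWait() {}

    static <T> T pollUntil(Supplier<T> attempt, long timeoutMillis, long intervalMillis,
                           Class<? extends RuntimeException> ignored) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        RuntimeException lastIgnored = null;
        while (true) {
            try {
                T value = attempt.get();
                if (value != null) {
                    return value; // condition met
                }
            } catch (RuntimeException e) {
                if (!ignored.isInstance(e)) {
                    throw e; // only the ignored type is swallowed and retried
                }
                lastIgnored = e;
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException(
                        "condition not met within " + timeoutMillis + " ms", lastIgnored);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The supplier fails twice with the ignored exception type, then succeeds.
        String result = pollUntil(() -> {
            if (++calls[0] < 3) {
                throw new UnsupportedOperationException("not ready yet");
            }
            return "ready";
        }, 2000, 10, UnsupportedOperationException.class);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The retry only helps when the supplier redoes the lookup on every attempt; returning a reference captured before the wait started gives the wait nothing new to check.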

As I mentioned the JavaScriptExecutor earlier, the method above is the only place where I use it.

Try the end-to-end working code below:

import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.Keys;
import org.openqa.selenium.StaleElementReferenceException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebDriverException;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.FluentWait;
import org.openqa.selenium.support.ui.Wait;

import com.google.common.base.Function;

public class BasePage 
{
    // Declaring WebDriver
    private static WebDriver driver;

    private static void doClickUsingJSE(final WebElement element) {
        // Using the Fluent Wait to avoid some exceptions like WebDriverException and StaleElementReferenceException
        Wait<WebDriver> wait = new FluentWait<WebDriver>(driver).withTimeout(1, TimeUnit.MINUTES).pollingEvery(1, TimeUnit.SECONDS).ignoring(WebDriverException.class, StaleElementReferenceException.class);
        WebElement waitedElement = wait.until(new Function<WebDriver, WebElement>() {
            public WebElement apply(WebDriver driver) {
                return element;
            }
        });
        wait.until(ExpectedConditions.visibilityOf(waitedElement));
        wait.until(ExpectedConditions.elementToBeClickable(waitedElement));
        // Clicking on the particular element using the JavaScriptExcecutor
        ((JavascriptExecutor) driver).executeScript("arguments[0].click();", waitedElement);
    }

    public static void main( String[] args ) throws Exception
    {
        System.setProperty("webdriver.chrome.driver", "C:\\NotBackedUp\\chromedriver.exe");

        // Initializing the Chrome Driver
        driver = new ChromeDriver();

        // Launching the LinkedIn site
        driver.get("https://linkedin.com/search/results/people/?facetGeoRegion=[\"tr%3A0\"]&facetNetwork=[\"F\"]&origin=FACETED_SEARCH&page=YOUR_PAGE_NUMBER");

        // You can skip this part or replace it with whatever is convenient for you
        // As there are no connections in my account, I have done it like this
        //------------------------------------------------------------------------------------
        // Switching to the login form - iframe involved
        driver.switchTo().frame(driver.findElement(By.className("authentication-iframe")));

        // Clicking on the Sign In button
        doClickUsingJSE(driver.findElement(By.xpath("//a[text()='Sign in']")));

        // Entering the User Name
        WebElement element = driver.findElement(By.id("username"));
        doClickUsingJSE(element);
        element.sendKeys("something@gmail.com");

        // Entering the Password
        element = driver.findElement(By.id("password"));
        doClickUsingJSE(element);
        element.sendKeys("anything"+Keys.ENTER);

        // Clicking on the People drop down
        Thread.sleep(8000);
        element = driver.findElement(By.xpath("//span[text()='People']"));
        doClickUsingJSE(element);

        // Selecting the All option
        Thread.sleep(2000);
        element = driver.findElement(By.xpath("//ul[@class='list-style-none']/li[1]"));
        element.click();

        // Searching something in the LinkedIn search box
        Thread.sleep(3000);
        element = driver.findElement(By.xpath("//input[@role='combobox']"));
        doClickUsingJSE(element);
        element.sendKeys("a"+Keys.ENTER);
        Thread.sleep(8000);
        //------------------------------------------------------------------------------------

        boolean next = true;
        while(next) {
            // Identifying all the profile link entries
            List<WebElement> totalProfileLinks = driver.findElements(By.xpath("//h3[contains(@class, 'search-results__total')]/parent::div/ul/li"));
            // Looping over every entry to get its profile link (list indices are 0-based)
            for(int i=0;i<totalProfileLinks.size();i++) {
                // Scrolling so that it will be visible
                ((JavascriptExecutor) driver).executeScript("arguments[0].scrollIntoView(true);", totalProfileLinks.get(i));
                // Fetching the anchor node (XPath positions are 1-based, hence i+1)
                final WebElement link = driver.findElement(By.xpath("(//h3[contains(@class, 'search-results__total')]/parent::div/ul/li//div[contains(@class, 'search-result__info')]//a)["+(i+1)+"]"));
                // Avoiding the StaleElementReferenceException
                new FluentWait<WebDriver>(driver).withTimeout(1, TimeUnit.MINUTES).pollingEvery(1, TimeUnit.SECONDS).ignoring(StaleElementReferenceException.class).until(new Function<WebDriver, WebElement>() {
                    public WebElement apply(WebDriver arg0) {
                        return link;
                    }
                });
                // Fetching and printing the link from anchor node
                System.out.println(link.getAttribute("href").trim());
            }

            // Checking 'Next >' button is there or not in the page
            List<WebElement> nextButton = driver.findElements(By.className("next"));
            // If the 'Next >' button is there then clicking on it otherwise stopping the execution
            if(nextButton.size() > 0) {
                doClickUsingJSE(nextButton.get(0));
                System.out.println("=> 'Next >' button is there and clicked on it...");
            } else {
                next = false;
                System.out.println("=> 'Next >' button is NOT there so stopping the execution...");
            }
            Thread.sleep(1000);
        }

    }
}

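Hope this helps... Happy coding...

【Discussion】:

Hi, thanks for your answer. It works fine, but Samee's answer is probably the better approach.
You're welcome @AndréYuhai. Yes, I have seen his answer, but with this approach you will not miss any data, and you avoid clicking each page number. You can remove the Thread.sleep calls to make it run faster... I am still looking for an answer that will give you all the profile links without scrolling down. If I find a solution, I will update the answer...
That would be great, but I could not find a solution. Someone said I would have to find that JS function and call it manually, but I am not sure.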