Is there a way to scrape a website whose URL does not change? Answers

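【Question title】: Is there a way to web scrape a website with unchanging URLs?
【Posted】: 2020-12-14 16:37:54
【Question description】:

I am trying to scrape a dynamic page with Selenium, BeautifulSoup and Python, and I am able to scrape the first page. But when I go to the next page, the URL does not change, and when I inspect the request I cannot see any form data either. Can someone help me?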

import time
from selenium import webdriver
from parsel import Selector
from bs4 import BeautifulSoup
import random
import re
import csv
import requests
import pandas as pd

companies = []
overview = []
people = []

driver = webdriver.Chrome(executable_path=r'C:\\Users\\rahul\Downloads\\chromedriver_win32 (1)\\chromedriver.exe')

driver.get('https://coverager.com/data/companies/')
driver.maximize_window()
src = driver.page_source
soup = BeautifulSoup(src, 'lxml')

table = soup.find('tbody')
descrip = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    #print(td)
    row = [i.text.strip() for i in td]
    descrip.append(row)
    #print(row)
    
#file = open('gag.csv','w')
#with file:
#        write = csv.writer(file)
#        write.writerows(descrip)


url = 'https://coverager.com'
a_tags = table.find_all('a', href=True)
for link in a_tags:
    ol = link.get('href')
    pl = link.string.strip()
    #companies.append(row)
    #print(pl)
    #print(ol)
    # Open the company's detail page
    driver.get(url + ol)
    driver.implicitly_wait(1000)
    # Text of the details tab that is shown by default
    data1 = driver.find_element_by_class_name('tab-details').text
    overview.append(data1.strip())
    # Switch to the "People" tab; click() returns None, so nothing is stored
    driver.find_element_by_link_text('People').click()
    p_tags = driver.find_element_by_class_name('tab-details').text
    people.append(p_tags)

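【Question comments】:

Tags: python selenium web web-scraping beautifulsoup

【Solution 1】:

For https://coverager.com/data/companies/, it is much easier to scrape the API calls than the HTML on the page.

Open the developer tools (in Chrome, right-click and choose Inspect) and go to the Network tab. When you click the "Next" button, a new row should appear in the Network tab. Click that row, then open Preview. You should see the companies in that tab.

The API is hitting a link that looks like this: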

https://coverager.com/wp-json/ath/v1/coverager-data/companies?per_page=20&page=2&draw=4&column=3&dir=desc&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],%22company_type%22:[],%22company_category%22:[],%22region%22:[],%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D
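To make the structure of that link easier to read, here is a small sketch that splits it into its query parameters using only the standard library. The variable names are my own; the URL is the one captured above.

```python
import json
from urllib.parse import parse_qs, urlsplit

# The request captured in the Network tab, split across lines for readability.
captured_url = (
    "https://coverager.com/wp-json/ath/v1/coverager-data/companies"
    "?per_page=20&page=2&draw=4&column=3&dir=desc"
    "&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],"
    "%22company_type%22:[],%22company_category%22:[],%22region%22:[],"
    "%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D"
)

parts = urlsplit(captured_url)

# parse_qs percent-decodes the values and returns a list per key;
# every key occurs once here, so keep only the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

# The "filters" value is itself a JSON object with one empty list per filter.
filters = json.loads(params.pop("filters"))

print(parts.path)       # the endpoint path
print(params)           # per_page, page, draw, column, dir
print(sorted(filters))  # names of the available filters
```

Seen this way, the link is one fixed endpoint plus a handful of parameters, and `page` is the obvious one to vary.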

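All pages seem to call the same API URL, changing only page= and draw=, whose values are 2 apart (page=2 goes with draw=4 in the URL above).

So just use requests to call links like this and loop over as many pages as you need! You can also change the number of companies returned per page (per_page=) if you like. You will have to test that, though.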
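A minimal sketch of such a loop, under a few assumptions: `build_params` and `iter_pages` are names I made up, `draw` is set to `page + 2` only because that matches the one captured request, and the shape of the JSON that comes back is not known here, so the payload is returned undecoded beyond `response.json()`. The function takes a session object so it can be tried without touching the network.

```python
import json

API_URL = "https://coverager.com/wp-json/ath/v1/coverager-data/companies"

# Filter keys copied from the captured request; all left empty (no filtering).
EMPTY_FILTERS = {
    "companies": [], "company_lob": [], "industry": [], "company_type": [],
    "company_category": [], "region": [], "founded": [], "company_stage": [],
    "company_business_model": [],
}


def build_params(page, per_page=20):
    """Return the query parameters for one page of the companies API."""
    return {
        "per_page": per_page,
        "page": page,
        # Assumption: mirrors the captured request, where page=2 came with draw=4.
        "draw": page + 2,
        "column": 3,
        "dir": "desc",
        "filters": json.dumps(EMPTY_FILTERS, separators=(",", ":")),
    }


def iter_pages(session, first_page, last_page, per_page=20):
    """Yield (page, decoded JSON payload) for every page in the range.

    `session` is anything with a requests-style `get` method,
    for example a `requests.Session()`.
    """
    for page in range(first_page, last_page + 1):
        response = session.get(
            API_URL, params=build_params(page, per_page), timeout=30
        )
        response.raise_for_status()  # stop on HTTP errors instead of parsing them
        yield page, response.json()


# Example usage (performs real HTTP requests, so it is left commented out):
# import requests
# with requests.Session() as session:
#     for page, payload in iter_pages(session, first_page=1, last_page=3):
#         print(page, type(payload))
```

Whether the server accepts a larger `per_page` or ignores `draw` altogether is something to check against the real responses.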

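【Comments】:

Thanks Andrew, this was really helpful.
I also need to scrape the details of every company in that table, as I do in the code above, but since this approach does not give me any links, I am stuck. Any suggestions on how I could do that?
It seems the best way to get the company details is a more traditional web scraper. I suggest either requests + Beautiful Soup or Scrapy to collect the URLs of the company pages from the main page (note: company URLs appear to have the form coverager.com/company/company-name). Once the URLs are collected, you can loop through the pagination by following the "Next" link.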
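A rough sketch of the link-collecting step. To keep it dependency-free it uses the standard library's `html.parser` as a stand-in for Beautiful Soup or Scrapy, so it is not the exact tooling suggested above. The `/company/` prefix comes from the note in the comment; the class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://coverager.com"


class CompanyLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose path starts with /company/."""

    def __init__(self, base_url=BASE_URL):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolves relative hrefs
        # Assumption: company pages live under /company/<company-name>.
        is_company = urlsplit(absolute).path.startswith("/company/")
        if is_company and absolute not in self.links:
            self.links.append(absolute)


def extract_company_links(html, base_url=BASE_URL):
    """Return the de-duplicated company page URLs found in `html`."""
    parser = CompanyLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup, only to exercise the function; the real table may differ.
sample_html = """
<table><tbody>
  <tr><td><a href="/company/example-one/">Example One</a></td></tr>
  <tr><td><a href="https://coverager.com/company/example-two/">Example Two</a></td></tr>
  <tr><td><a href="/data/companies/">Back to list</a></td></tr>
</tbody></table>
"""

for link in extract_company_links(sample_html):
    print(link)
```

The `html` argument could be the text of a requests response or, since the table is rendered dynamically, `driver.page_source` from the Selenium code in the question. Each collected URL can then be visited the same way the question's loop already does.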