Unable to scrape multiple pages with an unchanging URL with Python

【Question title】: Unable to scrape multiple pages with an unchanging URL with Python
【Posted】: 2018-12-13 10:25:49
【Question description】:

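After searching for hints, I found that my problem is closely related to this question, and based on this answer I thought I was close to solving it, but I was not.

I need to extract all the URLs from the site http://elempleo.com/cr/ofertas-empleo, and this is what I did: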

import requests
from bs4 import BeautifulSoup

page_no = 1  # page to request
payload = {"jobOfferId": 0,
           "salaryInfo": [],
           "city": 0,
           "publishDate": 0,
           "area": 40,
           "countryId": 0,
           "departmentId": 0,
           "companyId": 0,
           "pageIndex": page_no,
           "pageSize": "20",
           "sortExpression": "PublishDate_Desc"}

page = requests.get('http://elempleo.com/cr/ofertas-empleo/get', params=payload)
soup = BeautifulSoup(page.content, 'html.parser')

href_list=soup.select(".text-ellipsis")

for urls in href_list:
    print("http://elempleo.com"+urls.get("href"))

http://elempleo.com/cr/ofertas-trabajo/ap-representative/757190
http://elempleo.com/cr/ofertas-trabajo/ingeniero-de-procesos-sap/757189
http://elempleo.com/cr/ofertas-trabajo/sr-program-analyst-months/757188
http://elempleo.com/cr/ofertas-trabajo/executive-asistant/757187
http://elempleo.com/cr/ofertas-trabajo/asistente-comercial-bilingue/757186
http://elempleo.com/cr/ofertas-trabajo/accounting-assistant/757185
http://elempleo.com/cr/ofertas-trabajo/asistente-contable/757184
http://elempleo.com/cr/ofertas-trabajo/personal-para-cajas-alajuela-con-experiencia-en-farmacia/757183
http://elempleo.com/cr/ofertas-trabajo/oficial-de-seguridad/743703
http://elempleo.com/cr/ofertas-trabajo/tecnico-de-mantenimiento-en-extrusion/757182
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-al-cliente-y-ventas/757181
http://elempleo.com/cr/ofertas-trabajo/encargadoa-departamento-de-recursos-humanos-ingles-intermedio/757180
http://elempleo.com/cr/ofertas-trabajo/director-of-development/757177
http://elempleo.com/cr/ofertas-trabajo/generalista-de-recursos-humanos-ingles-intermedio/757178
http://elempleo.com/cr/ofertas-trabajo/accounts-payable-specialist-seasonal-contract/757176
http://elempleo.com/cr/ofertas-trabajo/electricista-industrial/757175
http://elempleo.com/cr/ofertas-trabajo/payroll-analyst-months-contract/757174
http://elempleo.com/cr/ofertas-trabajo/gerente-servicio-post-venta/757172
http://elempleo.com/cr/ofertas-trabajo/operario-de-proceso/757171
http://elempleo.com/cr/ofertas-trabajo/cajero-de-kiosco-ubicacion-area-metropolitana-fines-de-semana-disponibilidad-de-horarios/757170

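As you can see, it prints 20 URLs, which is fine, but if I change to page_no=2, page_no=3, ... page_no=100 and run the code above again, it returns the same results as before. I need all the URLs from every page of this site. Can anyone help me?

Also, I set "area":40, which corresponds to the sistemas category in the Área de trabajo field. It has no effect: the results are not filtered to the sistemas category.

I am using beautifulsoup with Python3 on Ubuntu 18.04.

Answers in R using the rvest package are also welcome!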

【Question discussion】:

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

To set up selenium, visit link

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
url = "http://elempleo.com/cr/ofertas-empleo/"

Note: you need to download the appropriate browser driver from link and add its path to the system environment variables.

# here I am using chrome webdriver
# setting up selenium
driver = webdriver.Chrome(executable_path=r"F:\Projects\sms_automation\chromedriver.exe")  # initialize webdriver instance
driver.get(url)  # open URL in browser
driver.find_element_by_id("ResultsByPage").send_keys('100')  # set items per page to 100
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
url_set = ["http://elempleo.com"+i.get("href") for i in soup.select(".text-ellipsis")]
while True:
    try:
        driver.find_element_by_class_name("js-btn-next").click()  # go to next page
        time.sleep(3)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        current_page_url = ["http://elempleo.com"+i.get("href") for i in soup.select(".text-ellipsis")]
        if url_set[-1] == current_page_url[-1]:
            break
        url_set += current_page_url
    except WebDriverException:
        time.sleep(5)

Result:

print(len(url_set))   # outputs 2641
print(url_set)  # outputs ['http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/analista-de-sistemas-financieros/753845', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/balance-sheet-and-cash-flow-specialist/755211', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/coordinador-de-compensacion/757369', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/responsable-de-capacitacion-y-desempeno/757367', 'http://elempleo.comhttp://www.elempleo.com/cr/ofertas-trabajo/pmp-gestor-de-proyectos/757366', ....]
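
The URLs in this output carry a doubled prefix (http://elempleo.comhttp://www.elempleo.com/...), which suggests that the href values in the rendered page are already absolute. One way to handle both relative and absolute hrefs is urljoin from the standard library. This is a minimal sketch; the helper name absolute_url is hypothetical and not part of the original answer.

```python
from urllib.parse import urljoin

BASE_URL = "http://elempleo.com"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; an already absolute href is returned unchanged."""
    return urljoin(base, href)


# A relative href gets the base prepended.
print(absolute_url("/cr/ofertas-trabajo/oficial-de-seguridad/743703"))
# An absolute href (as in the output above) is left as it is.
print(absolute_url("http://www.elempleo.com/cr/ofertas-trabajo/gerente-de-agronomia/757368"))
```

In the list comprehensions above, "http://elempleo.com"+i.get("href") could then be swapped for absolute_url(i.get("href")).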

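【Discussion】:

【Solution 2】:

If you scroll through the page with the web console open, you will notice that pagination is done through a findByFilter JavaScript query. Python's requests library cannot handle this kind of page modification, because it does not execute JavaScript.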

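You have two options:

1. Use selenium to drive a browser, which gives you a scraper that supports JavaScript.
2. Try to emulate the headers and request payload of the POST request to http://elempleo.com/cr/api/joboffers/findbyfilter and fetch the data directly from the API (this also conveniently gives you a JSON response that you can load straight into a Python dictionary).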
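
The second option might be sketched as follows. The payload field names are taken from the question. Everything else is an assumption to be checked against the request shown in the browser's network tab: that the endpoint accepts a JSON body, that pageSize can be sent as an integer, that pageIndex starts at 1, and that no extra headers or cookies are required. The structure of the JSON response is not known here, so the sketch returns it undecoded beyond response.json(). The function names build_payload, fetch_page and iter_pages are hypothetical.

```python
import requests

API_URL = "http://elempleo.com/cr/api/joboffers/findbyfilter"


def build_payload(page_index, page_size=20, area=0):
    """Build the filter payload for one results page (field names from the question)."""
    return {
        "jobOfferId": 0,
        "salaryInfo": [],
        "city": 0,
        "publishDate": 0,
        "area": area,  # 40 is the value the question uses for "sistemas"
        "countryId": 0,
        "departmentId": 0,
        "companyId": 0,
        "pageIndex": page_index,
        "pageSize": page_size,  # assumption: an integer is accepted
        "sortExpression": "PublishDate_Desc",
    }


def fetch_page(page_index, session=None, **filters):
    """POST one page of filters and return the decoded JSON response."""
    http = session or requests
    response = http.post(API_URL, json=build_payload(page_index, **filters), timeout=30)
    response.raise_for_status()
    return response.json()


def iter_pages(fetch, first_page=1, max_pages=200):
    """Yield pages from fetch(page_index) until one is empty or repeats the previous one."""
    previous = None
    for page_index in range(first_page, first_page + max_pages):
        page = fetch(page_index)
        if not page or page == previous:
            break
        yield page
        previous = page
```

Calling list(iter_pages(lambda n: fetch_page(n, area=40))) would then request page after page. The stop condition mirrors the one in Solution 1 (stop when a page repeats), with max_pages as a safety limit.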

【Discussion】: