[Posted]: 2020-04-30 16:12:48
[Question]:
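I am currently trying to pull CMS historical data from this site. I have some working code that extracts the download links from a page. My problem is that the links are split across multiple pages, so I need to go through every available page and extract the download links from each one. The obvious choice would be to use Selenium to click "next page" and grab the data, but company policy does not allow me to run Selenium in our environment. Is there a way to step through the pages and extract the links without it? The site does not expose a POST link when you go to the next page, and I have run out of ideas for reaching the next page without a POST link or Selenium.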
Current working code that pulls the links from the first page:
import pandas as pd
from datetime import datetime
#from selenium import webdriver
from lxml import html
import requests


def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}

    # Note: verify=False disables TLS certificate verification.
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type":"text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type":"text"})

    content.raise_for_status() # Raise HTTPError for bad requests (4xx or 5xx)

    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url


def get_html(link):
    """
    Returns the parsed HTML tree of the page.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed


cmslink = "https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report"
content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
# '//a/@href' is an absolute XPath, so it collects every link on the page,
# not only those inside the table cell; the filter below narrows them down.
headers = linkTable[0].xpath('//a/@href')
df1 = pd.DataFrame(headers,columns= ['links'])
df1SubSet = df1[df1['links'].str.contains('contract-summary', case=False)]
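For illustration only, here is a rough sketch of how the remaining pages might be walked without Selenium. It rests on an assumption that is not confirmed anywhere above: that the listing accepts a zero-based `page` query parameter on a plain GET request. Whether this site does so would need to be checked by looking at the pager links in the page source. The names `HrefCollector`, `extract_links`, `fetch_page`, and `collect_all_links` are hypothetical helpers, not part of the code above, and the parsing uses the standard library's `html.parser` instead of lxml so the loop can be tried without any network access.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(page_html, keyword="contract-summary"):
    """Return the hrefs in page_html that contain keyword (case-insensitive)."""
    collector = HrefCollector()
    collector.feed(page_html)
    return [h for h in collector.hrefs if keyword.lower() in h.lower()]


def fetch_page(url, page_number):
    """Fetch one listing page as text.

    ASSUMPTION: the site understands a zero-based ?page=N parameter.
    """
    import requests  # imported here so the parsing helpers work without it

    response = requests.get(url, params={"page": page_number}, timeout=30)
    response.raise_for_status()
    return response.text


def collect_all_links(url, fetch=fetch_page, max_pages=50):
    """Request page 0, 1, 2, ... and stop when a page adds no new links."""
    seen = []
    for page_number in range(max_pages):
        links = extract_links(fetch(url, page_number))
        new_links = [link for link in links if link not in seen]
        if not new_links:
            break  # empty page, or the site repeated an earlier page
        seen.extend(new_links)
    return seen
```

If the assumption holds, `collect_all_links(cmslink)` would return the links in page order, and the result could be passed to `pd.DataFrame(..., columns=['links'])` as before. The fetcher is passed in as an argument so it can be replaced by a stub, or by a wrapper around `http_request_get`, and `max_pages` is only a safety cap against looping forever.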
[Discussion]:
Tags: python-3.x selenium web-scraping