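【Question title】:Finding the data corresponding to a specific selection in an html form [closed]
【Posted】:2021-02-14 01:14:15
【Question description】:
I am trying to scrape data from the form at http://appl101.lsu.edu/booklet2.nsf/Selector2?OpenForm
The form's action is "/booklet2.nsf/f5e6e50d1d1d05c4862584410071cd2e?CreateDocument". For each (semester, department) pair selected, we get a corresponding page containing data.
My goal is to write some Python code that finds the URL of the page for each (semester, department) pair. To start, I am trying to find the URL for one specific selection, for example (Fall 2020, Mathematics).
I am new to web scraping and only know some basic HTML. I would appreciate it if someone could point me in the right direction. Also, please explain in some detail what this form does.
【Discussion】:
Tags: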
python
html
forms
web-scraping
【Solution 1】:
You can use this example to get the URL for each pair (though most of them return NoCourseDept):
import requests
from bs4 import BeautifulSoup

base_url = 'http://appl101.lsu.edu/booklet2.nsf/Selector2?OpenForm'
post_url = 'http://appl101.lsu.edu/booklet2.nsf/f5e6e50d1d1d05c4862584410071cd2e?CreateDocument'

# Load the selector page and parse it (the 'lxml' parser must be installed).
soup = BeautifulSoup(requests.get(base_url).content, 'lxml')

# Collect the value of every semester option.
semesters = []
for s in soup.select('[name="SemesterDesc"] [value]'):
    semesters.append(s['value'])

# Collect the visible text of every department option.
departments = []
for d in soup.select('[name="Department"] option'):
    departments.append(d.get_text(strip=True))

# Submit the form once per (semester, department) pair and print the final URL.
for s in semesters:
    for d in departments:
        data = {
            '%%Surrogate_SemesterDesc': 1,
            'SemesterDesc': s,
            '%%Surrogate_Department': 1,
            'Department': d,
        }
        r = requests.post(post_url, data=data)
        print('{:<30} {:<30} {}'.format(s, d, r.url))
Prints:
...
Second Summer Module 2021 BUSINESS ADMINISTRATION https://appl101.lsu.edu/booklet2.nsf/All/FFAC316D00E5F3D5862585C7002EF1AA?OpenDocument
Second Summer Module 2021 BUSINESS EDUCATION https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 BUSINESS LAW https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CHEMICAL ENGINEERING https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CHEMISTRY https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CHILD AND FAMILY STUDIES https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CHINESE https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CIVIL ENGINEERING https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CIVIL & ENVIRONMENTAL ENGINEER https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CLASSICAL STUDIES https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 COMMUNICATION DISORDERS https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 COMMUNICATION STUDIES https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 COMPARATIVE BIOMEDICAL SCIENCE https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 COMPARATIVE LITERATURE https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 COMPUTER SCIENCE https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 CONSTRUCTION MANAGEMENT https://appl101.lsu.edu/booklet2.nsf/All/637EAD668A213EDC862585F200296FAE?OpenDocument
Second Summer Module 2021 CURRICULUM & INSTRUCTION https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 DAIRY SCIENCE https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 DIGITAL MEDIA ARTS & ENGINEERI https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 DISASTER SCIENCE MANAGEMENT https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 DOCTOR OF DESIGN https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 ECONOMICS https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021 EDUC LEADERSHIP RESEARCH COUNS https://appl101.lsu.edu/booklet2.nsf/All/B0D27015A5F630CF86258602002C263E?OpenDocument
Second Summer Module 2021 EDUCATION https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
...
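Since most pairs in the output above end up at the NoCourseDept page, it may help to drop them before doing any further scraping. The following is a minimal sketch, not part of the original answer: the helper names `has_courses` and `keep_pairs_with_data` are hypothetical, and it assumes that a final URL containing "NoCourseDept" always means there is no data for that pair, which the output suggests but does not prove.

```python
# Assumption: the placeholder page for "no data" always has this text in its URL.
NO_DATA_MARKER = 'NoCourseDept'


def has_courses(url):
    """Return True unless the URL points at the NoCourseDept placeholder page."""
    return NO_DATA_MARKER.lower() not in url.lower()


def keep_pairs_with_data(results):
    """Keep only the (semester, department, url) tuples whose URL has data."""
    return [row for row in results if has_courses(row[2])]


# Two rows copied from the printed output above.
sample = [
    ('Second Summer Module 2021', 'BUSINESS ADMINISTRATION',
     'https://appl101.lsu.edu/booklet2.nsf/All/FFAC316D00E5F3D5862585C7002EF1AA?OpenDocument'),
    ('Second Summer Module 2021', 'BUSINESS EDUCATION',
     'https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform'),
]

for semester, department, url in keep_pairs_with_data(sample):
    print(semester, department, url)
```

In the loop from the answer, the same check could be applied to `r.url` before printing, so that only pairs with a real document are reported.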
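【Solution 2】:
The form's action: it is just another page on the same server, only with a strange name.
For the scraping part:
There is no unique URL for the page corresponding to each (semester, department) pair. What happens is that the semester and department you select are submitted to the server in a POST request, and the content of the page at the form's action URL varies depending on the submitted data (a dynamic web page).
Solution: one approach is to store the strange URL together with the (semester, department) pairs.
When you need the page for a given pair, send a POST request to that URL and supply the (semester, department) pair as key-value data, where the keys are the names of the semester and department select inputs in the form. You should then receive the page as HTML, which you can scrape to extract the information you need.
For example [key-value data of the POST request]:
SemesterDesc=Fall 2020, because "SemesterDesc" is the name of the semester/year select input in the form; the same applies to the department input.
You can search for how to make such a request.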
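As a rough illustration of the POST request described above, here is a minimal sketch using only the Python standard library. The function names `build_payload` and `fetch_page` are hypothetical. The two `%%Surrogate_...` fields are taken from the other answer in this thread and are assumed to be required by the server. The department value passed in must match an option in the form exactly; 'MATHEMATICS' is used below only as an assumed example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The form's action URL, as given in the question.
POST_URL = ('http://appl101.lsu.edu/booklet2.nsf/'
            'f5e6e50d1d1d05c4862584410071cd2e?CreateDocument')


def build_payload(semester, department):
    """Return the form fields for one (semester, department) selection."""
    return {
        # Assumption: these surrogate fields are needed, as in the other answer.
        '%%Surrogate_SemesterDesc': 1,
        'SemesterDesc': semester,
        '%%Surrogate_Department': 1,
        'Department': department,
    }


def fetch_page(semester, department, timeout=30):
    """POST one selection and return (final URL, HTML text of the response)."""
    body = urlencode(build_payload(semester, department)).encode('ascii')
    request = Request(POST_URL, data=body)  # passing data= makes this a POST
    with urlopen(request, timeout=timeout) as response:
        html = response.read().decode('utf-8', errors='replace')
        return response.geturl(), html
```

A call such as `fetch_page('Fall 2020', 'MATHEMATICS')` would then return the URL the server finally answered from, along with the HTML to scrape, assuming the site is reachable and still accepts these fields.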