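When you request this page, your request is missing some data, specifically the authenticity token. To grab it, we have to parse the HTML of the previous page and find it there. Take a look at this simple example: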
# Imports
from bs4 import BeautifulSoup
from requests import Session
# Session Object
session = Session()
# Add a user agent, so the request looks more human-like.
session.headers.update({
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
})
# Initial session: you need to fetch the URL first, so the authenticity
# token can be parsed out of the HTML
init_session = session.get(url="https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?def=false")
# Beautiful soup object, used for HTML parsing
soup = BeautifulSoup(init_session.content, "html.parser")
# Get all of the input tags
inputs = soup.find_all("input")
# Upon running, we see that the authenticity token is the first element in the list.
authenticity_token = inputs[0]["value"]
# Now we can make our request!
# Request data
data = {
    "authenticity_token": authenticity_token,
    "coname": "",
    "coName_ADAdefault": "",
    "coName_verify_char[0|50]": "The value you have supplied for Company Name is too long.",
    "city": "",
    "city_ADAdefault": "",
    "city_verify_char[0|45]": "The value you have supplied for City is too long.",
    "zip": "",
    "zip_ADAdefault": "",
    "zip_verify_char[0|10]": "The value you have supplied for Zip/Postal Code is too long.",
    "sda": "",
    "startdate": "01/01/2020",
    "startDate_ADAdefault": "mm/dd/yyyy",
    "startDate_verify_date4": "",
    "startDate_verify_char[0|45]": "The value you have supplied for Start Date is too long.",
    "enddate": "mm/dd/yyyy",
    "endDate_ADAdefault": "mm/dd/yyyy",
    "endDate_verify_date4": "",
    "endDate_verify_char[0|45]": "The value you have supplied for End Date is too long.",
    "layoffType": "y",
    "search": "Search",
    "old_choice": 1,
    "ZIP_prev": "",
    "def_prev": "false",
    "CITY_prev": "",
    "SDA_prev": "",
    "STARTDATE_prev": "",
    "CONAME_prev": "",
    "ENDDATE_prev": "",
    "FormName": "Form0",
}
# Get the data
get_warn_data = session.post("https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?securitysys=on&FormID=0", data=data)
# Print the data. The raw content looks messy, so let's prettify it with bs4!
#print(get_warn_data.content)
soup = BeautifulSoup(get_warn_data.content, "html.parser")
print(soup.prettify())
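This will give you the HTML you're looking for. Within this HTML, you now need to parse the a href tags to get the links you need. For example, they will look like this: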
<tr class="cfOutputTableRow cfAlternate">
<td align="left" class="cfPadLeft cfAlternate" colspan="1" valign="top">
<span class="blTransparent">
<a href="mn_warn_dsp.cfm?id=399&callingfile=mn_warn_dsp.cfm&hash=0C2428869560C6832A1D929070C0278F">
Aecom
</a>
</span>
</td>
<td align="left" class="cfAlternate" colspan="1" valign="top">
<span class="blTransparent">
Glendale
</span>
</td>
<td align="left" class="cfAlternate" colspan="1" valign="top">
<span class="blTransparent">
85310
</span>
</td>
<td align="left" class="cfAlternate" colspan="1" valign="top">
<span class="blTransparent">
7
</span>
</td>
<td align="left" class="cfAlternate cfPadRight" colspan="1" valign="top">
<span class="blTransparent">
01/17/2020
</span>
</td>
</tr>
Specifically:
<a href="mn_warn_dsp.cfm?id=399&callingfile=mn_warn_dsp.cfm&hash=0C2428869560C6832A1D929070C0278F">
Once you have this link, be sure to prefix it with https://www.azjobconnection.gov/ada/:
https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?id=399&callingfile=mn_warn_dsp.cfm&hash=0C2428869560C6832A1D929070C0278F
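
The link extraction and prefixing described above could be sketched roughly like this. It is only a minimal sketch, not a tested scraper for the live site: the function name extract_detail_links and the sample_html string are made up for illustration, and the filter assumes that detail links are the ones whose href starts with mn_warn_dsp.cfm? and carries an id= parameter, as in the row shown earlier.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL taken from the answer above.
BASE_URL = "https://www.azjobconnection.gov/ada/"


def extract_detail_links(html, base_url=BASE_URL):
    """Return absolute URLs for the detail links found in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Only consider <a> tags that actually have an href attribute.
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumption: detail links point at mn_warn_dsp.cfm and carry an id= parameter.
        if href.startswith("mn_warn_dsp.cfm?") and "id=" in href:
            # urljoin resolves the relative href against the base URL.
            links.append(urljoin(base_url, href))
    return links


# Trimmed-down copy of the table row shown above, used as offline sample input.
sample_html = """
<tr class="cfOutputTableRow cfAlternate">
<td><span class="blTransparent">
<a href="mn_warn_dsp.cfm?id=399&callingfile=mn_warn_dsp.cfm&hash=0C2428869560C6832A1D929070C0278F">
Aecom
</a>
</span></td>
</tr>
"""

for link in extract_detail_links(sample_html):
    print(link)
```

In the real script you would pass get_warn_data.content instead of sample_html. Using urljoin rather than plain string concatenation keeps the result correct even if the site ever returns an href that is already absolute.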