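[Posted]: 2015-11-20 03:35:53
[Question]:
I'm trying to write a Python script that uses the Requests module for HTTP requests to fetch data from the Bureau of Justice Statistics. The page I'm requesting data from has "multiple select" fields that let the user choose one or more options from a list.
The page I'm trying to download data from is: http://www.ucrdatatool.gov/Search/Crime/Local/OneYearofData.cfm
This is the form I want to submit (it is the second step of the download process, reached after you submit the "state" selection form at the link above):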
<form name="CFForm_1" id="CFForm_1" action="RunCrimeOneYearofData.cfm" method="post" onsubmit="return _CF_checkCFForm_1(this)">
<INPUT TYPE="Hidden" Name="StateId" Value="1">
<INPUT TYPE="Hidden" Name="BJSPopulationGroupId" Value="">
<table width="94%" border="0" height="151">
<tr>
<td width="27%" valign="top"><font size="2" class="text"><b>
<LABEL FOR="agencies">a. Choose one or more agencies:</LABEL>
</b></font><BR> <BR> <font size="2" class="text">
<select name="CrimeCrossId" size="4" MULTIPLE ID="agencies">
<option value="102" >Alabaster Police Dept</option>
<option value="104" >Albertville Police Dept</option>
<option value="105" >Alexander City Police Dept</option>
<option value="110" >Anniston Police Dept</option>
<option value="119" >Athens Police Dept</option>
<option value="120" >Atmore Police Dept</option>
<option value="122" >Auburn Police Dept</option>
<option value="127" >Baldwin County Sheriff Deptartment</option>
<option value="134" >Bessemer Police Dept</option>
<option value="136" >Birmingham Police Dept</option>
<option value="138" >Blount County Sheriff Department</option>
<option value="156" >Calera Police Dept</option>
<option value="157" >Calhoun County Sheriff Department</option>
<option value="174" >Chilton County Sheriff Department</option>
<option value="204" >Cullman County Sheriff Department</option>
<option value="205" >Cullman Police Dept</option>
<option value="210" >Daphne Police Dept</option>
<option value="213" >Decatur Police Dept</option>
<option value="214" >Dekalb County Sheriff Department</option>
<option value="218" >Dothan Police Dept</option>
<option value="228" >Elmore County Sheriff Department</option>
<option value="229" >Enterprise Police Dept</option>
<option value="232" >Etowah County Sheriff Department</option>
<option value="233" >Eufaula Police Dept</option>
<option value="237" >Fairfield Police Dept</option>
<option value="238" >Fairhope Police Dept</option>
<option value="247" >Florence Police Dept</option>
<option value="248" >Foley Police Dept</option>
<option value="251" >Fort Payne Police Dept</option>
<option value="259" >Gadsden Police Dept</option>
<option value="262" >Gardendale Police Dept</option>
<option value="281" >Gulf Shores Police Dept</option>
<option value="292" >Hartselle Police Dept</option>
<option value="296" >Helena Police Dept</option>
<option value="305" >Homewood Police Dept</option>
<option value="306" >Hoover Police Dept</option>
<option value="307" >Houston County Sheriff Department</option>
<option value="308" >Hueytown Police Dept</option>
<option value="310" >Huntsville Police Dept</option>
<option value="314" >Irondale Police Dept</option>
<option value="315" >Jackson County Sheriff Department</option>
<option value="318" >Jacksonville Police Dept</option>
<option value="320" >Jasper Police Dept</option>
<option value="321" >Jefferson County Sheriff Department</option>
<option value="334" >Lauderdale County Sheriff Department</option>
<option value="335" >Lawrence County Sheriff Department</option>
<option value="337" >Lee County Sheriff Department</option>
<option value="338" >Leeds Police Dept</option>
<option value="343" >Limestone County Sheriff Department</option>
<option value="358" >Madison County Sheriff Department</option>
<option value="359" >Madison Police Dept</option>
<option value="365" >Marshall County Sheriff Department</option>
<option value="371" >Millbrook Police Dept</option>
<option value="374" >Mobile County Sheriff Department</option>
<option value="375" >Mobile Police Dept</option>
<option value="381" >Montgomery Police Dept</option>
<option value="382" >Moody Police Dept</option>
<option value="383" >Morgan County Sheriff Department</option>
<option value="388" >Mountain Brook Police Dept</option>
<option value="391" >Muscle Shoals Police Dept</option>
<option value="400" >Northport Police Dept</option>
<option value="406" >Opelika Police Dept</option>
<option value="410" >Oxford Police Dept</option>
<option value="411" >Ozark Police Dept</option>
<option value="413" >Pelham Police Dept</option>
<option value="414" >Pell City Police Dept</option>
<option value="417" >Phenix Police Dept</option>
<option value="426" >Pleasant Grove Police Dept</option>
<option value="429" >Prattville Police Dept</option>
<option value="431" >Prichard Police Dept</option>
<option value="451" >Saraland Police Dept</option>
<option value="454" >Scottsboro Police Dept</option>
<option value="456" >Selma Police Dept</option>
<option value="458" >Shelby County Sheriff Department</option>
<option value="470" >St. Clair County Sheriff Department</option>
<option value="478" >Sylacauga Police Dept</option>
<option value="481" >Talladega County Sheriff Department</option>
<option value="482" >Talladega Police Dept</option>
<option value="497" >Troy Police Dept</option>
<option value="500" >Trussville Police Dept</option>
<option value="501" >Tuscaloosa County Sheriff Department</option>
<option value="502" >Tuscaloosa Police Dept</option>
<option value="517" >Vestavia Hills Police Dept</option>
<option value="522" >Walker County Sheriff Department</option>
</select>
</font> </td>
<td width="34%" valign="top"><font size="2" class="text"><b>
<LABEL FOR="groups">b. Choose one or more variable groups:</LABEL>*
</b></font><BR>
<BR> <font size="2" class="text">
<select name="DataType" size="4" Multiple ID="groups">
<option value="1" >Number
of violent crimes</option>
<option value="2" >Number
of property crimes</option>
<option value="3" >Violent
crime rates</option>
<option value="4" >Property
crime rates</option>
</select>
</font> </td>
<td width="31%" rowspan="2" valign="top" NOWRAP><font size="2" class="text"><b>
<LABEL FOR="year">c. Choose one year:</LABEL>
</b></font><BR> <BR> <font size="2" class="text">
<SELECT Name="YearStart" Size="1" ID="year">
<OPTION Value="1985" >
1985 </OPTION>
<OPTION Value="1986" >
1986 </OPTION>
<OPTION Value="1987" >
1987 </OPTION>
<OPTION Value="1988" >
1988 </OPTION>
<OPTION Value="1989" >
1989 </OPTION>
<OPTION Value="1990" >
1990 </OPTION>
<OPTION Value="1991" >
1991 </OPTION>
<OPTION Value="1992" >
1992 </OPTION>
<OPTION Value="1993" >
1993 </OPTION>
<OPTION Value="1994" >
1994 </OPTION>
<OPTION Value="1995" >
1995 </OPTION>
<OPTION Value="1996" >
1996 </OPTION>
<OPTION Value="1997" >
1997 </OPTION>
<OPTION Value="1998" >
1998 </OPTION>
<OPTION Value="1999" >
1999 </OPTION>
<OPTION Value="2000" >
2000 </OPTION>
<OPTION Value="2001" >
2001 </OPTION>
<OPTION Value="2002" >
2002 </OPTION>
<OPTION Value="2003" >
2003 </OPTION>
<OPTION Value="2004" >
2004 </OPTION>
<OPTION Value="2005" >
2005 </OPTION>
<OPTION Value="2006" >
2006 </OPTION>
<OPTION Value="2007" >
2007 </OPTION>
<OPTION Value="2008" >
2008 </OPTION>
<OPTION Value="2009" >
2009 </OPTION>
<OPTION Value="2010" >
2010 </OPTION>
<OPTION Value="2011" >
2011 </OPTION>
<OPTION Value="2012" >
2012 </OPTION>
</SELECT>
</font> </td>
</tr>
<tr>
<td colspan="2" valign="top" NOWRAP><BR>
<table border="1" cellspacing="0" cellpadding="4" bordercolor="#999999" bgcolor="#FFFFCC" align="left" width="450">
<tr>
<td align="center" nowrap><font size="2" class="text" color="#330099"><b>Hold
down the control key to select more than one option.</b></font></td>
</tr>
</table> </td>
</tr>
<tr>
<td valign="top" NOWRAP> <BR> <BR> <p>
<input name="NextPage" type="submit" value="Get Table">
<input name="PreviousPage" type="submit" value="Previous">
<input name="Cancel" type="reset" value="Reset Form">
</p></td>
<td colspan="2" valign="top" NOWRAP><table width="300" border="0" cellspacing="0" cellpadding="3">
<tr align="left">
<td width="4%" valign="top"><strong>* </strong></td>
<td width="48%" valign="top">Violent crimes:</td>
<td colspan="2" valign="top">Property crimes :</td>
</tr>
<tr>
<td align="center" valign="top"></td>
<td valign="top"> <font class=text size=2> •murder<br>
•forcible rape<br>
•robbery<br>
•aggravated assault </font></td>
<td width="4%"> </td>
<td valign="top"> •burglary<br>
•larceny-theft<br> •motor
vehicle theft</td>
</tr>
<tr align="left">
<td colspan="4" valign="top"><FONT class=text size=2>Tables with
many variables may be very wide.</FONT> </td>
</tr>
</table>
<br> <FONT class=text
size=2>See <B><A
href="/offenses.cfm">UCR Offense Definitions</A></B>
for additional information about these crimes.</FONT> </td>
</tr>
</table>
</form>
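I'm trying to select all of the options in several of these multi-select fields (for example, all agencies, all crime types, and so on) and submit an HTTP POST request that contains all of them.
When I submit this form manually in Firefox and look at the output of Live HTTP Headers, I can see that the POST request contains the following query string: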
StateId=1&BJSPopulationGroupId=&CrimeCrossId=102&CrimeCrossId=104&CrimeCrossId=105&CrimeCrossId=110&CrimeCrossId=119&CrimeCrossId=120&CrimeCrossId=122&CrimeCrossId=127&CrimeCrossId=134&CrimeCrossId=136&CrimeCrossId=138&CrimeCrossId=156&CrimeCrossId=157&CrimeCrossId=174&CrimeCrossId=204&CrimeCrossId=205&CrimeCrossId=210&CrimeCrossId=213&CrimeCrossId=214&CrimeCrossId=218&CrimeCrossId=228&CrimeCrossId=229&CrimeCrossId=232&CrimeCrossId=233&CrimeCrossId=237&CrimeCrossId=238&CrimeCrossId=247&CrimeCrossId=248&CrimeCrossId=251&CrimeCrossId=259&CrimeCrossId=262&CrimeCrossId=281&CrimeCrossId=292&CrimeCrossId=296&CrimeCrossId=305&CrimeCrossId=306&CrimeCrossId=307&CrimeCrossId=308&CrimeCrossId=310&CrimeCrossId=314&CrimeCrossId=315&CrimeCrossId=318&CrimeCrossId=320&CrimeCrossId=321&CrimeCrossId=334&CrimeCrossId=335&CrimeCrossId=337&CrimeCrossId=338&CrimeCrossId=343&CrimeCrossId=358&CrimeCrossId=359&CrimeCrossId=365&CrimeCrossId=371&CrimeCrossId=374&CrimeCrossId=375&CrimeCrossId=381&CrimeCrossId=382&CrimeCrossId=383&CrimeCrossId=388&CrimeCrossId=391&CrimeCrossId=400&CrimeCrossId=406&CrimeCrossId=410&CrimeCrossId=411&CrimeCrossId=413&CrimeCrossId=414&CrimeCrossId=417&CrimeCrossId=426&CrimeCrossId=429&CrimeCrossId=431&CrimeCrossId=451&CrimeCrossId=454&CrimeCrossId=456&CrimeCrossId=458&CrimeCrossId=470&CrimeCrossId=478&CrimeCrossId=481&CrimeCrossId=482&CrimeCrossId=497&CrimeCrossId=500&CrimeCrossId=501&CrimeCrossId=502&CrimeCrossId=517&CrimeCrossId=522&DataType=1&DataType=2&DataType=3&DataType=4&YearStart=2010&NextPage=Get+Table
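A captured body like this can be split into its fields with the standard library, which makes it easier to compare against what a script sends. The following is only a sketch: `captured` is a shortened copy of the string above (two agencies instead of all of them), and `script_keys` simply lists the keys used in `post_data2` in the code further down. Whether the server actually requires the fields reported as missing is an assumption that the output alone does not confirm.

```python
from urllib.parse import parse_qs

# Shortened copy of the captured POST body (only two agencies kept).
captured = (
    "StateId=1&BJSPopulationGroupId=&CrimeCrossId=102&CrimeCrossId=104"
    "&DataType=1&DataType=2&DataType=3&DataType=4"
    "&YearStart=2010&NextPage=Get+Table"
)

# keep_blank_values=True keeps fields that are sent empty,
# such as BJSPopulationGroupId.
fields = parse_qs(captured, keep_blank_values=True)

for name, values in fields.items():
    print(name, values)

# Keys used by post_data2 in the script from this question.
script_keys = {"CrimeCrossId", "DataType", "YearStart"}

# Fields the browser sent that the script does not send.
missing = [name for name in fields if name not in script_keys]
print("not sent by the script:", missing)
```

Each repeated name, such as `CrimeCrossId`, ends up as one key with a list of values, which is the same shape as the dictionary the script builds.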
Here is the Python code I've written so far. Note the part where I try to construct post_data2. It doesn't work (it just sends me back to the "step one" page):
import requests
from bs4 import BeautifulSoup as BS

base_url = 'http://www.ucrdatatool.gov/Search/Crime/Local/'
dl_page_url = base_url + 'OneYearofData.cfm'
post_url = base_url + 'OneYearofDataStepTwo.cfm'

# Step one: load the state-selection page and read the available states.
r = requests.get(dl_page_url)
page = BS(r.content, 'html.parser')
select_states = page.find('form', id='CFForm_1').find('select', id='state')
state_choices = select_states.find_all('option')

state = state_choices[2]  # DEBUGGING: a single state for now
# for state in state_choices:
state_id = int(state.get('value'))
state_name = state.text

# Submit the state selection to reach the step-two form.
post_data = {'StateId': state_id, 'BJSPopulationGroupId': ''}
r2 = requests.post(post_url, data=post_data)
page2 = BS(r2.content, 'html.parser')

# Step two: collect every option of each select field.
step2_form = page2.find('form', id='CFForm_1')
select_agencies = step2_form.find('select', id='agencies')
select_crimes = step2_form.find('select', id='groups')
select_year = step2_form.find('select', id='year')
agency_choices = select_agencies.find_all('option')
crime_choices = select_crimes.find_all('option')
year_choices = select_year.find_all('option')

# This is the part that does not work: the response is the step-one page again.
post_data2 = {'CrimeCrossId': [a.get('value') for a in agency_choices],
              'DataType': [c.get('value') for c in crime_choices],
              'YearStart': '2010'}
post_url2 = base_url + 'RunCrimeOneYearofData.cfm'
r3 = requests.post(post_url2, data=post_data2)
state_results_page = BS(r3.content, 'html.parser')
What is the correct way to submit multi-select fields like this with the Python Requests module? Thanks!
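For context on the wire format: a multi-select field is sent as the same key repeated once per selected value. The sketch below shows that encoding with the standard library only. `encode_form` and `sample` are illustrative names, not part of the original script, and the values are a small subset of the form's options. The Requests documentation describes the `data=` argument as accepting either a dictionary with lists as values or a list of tuples for this purpose, so the list-valued dictionary in the script should produce the same repeated-key shape.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """Encode form fields; a list value becomes one key=value pair per item."""
    return urlencode(fields, doseq=True)


# Illustrative subset of the form's fields.
sample = {
    "CrimeCrossId": ["102", "104"],
    "DataType": ["1", "2"],
    "YearStart": "2010",
}

body = encode_form(sample)
print(body)
```

Comparing a body built this way with the one captured from the browser is a quick check that the multi-select fields themselves are encoded the same way.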
[Discussion]:
Tags: python http web-crawler python-requests forms