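[Posted]: 2016-11-23 13:09:13
[Question]:
I used the requests module to send a query to a URL, and the response is an HTML page containing a table. Now I want to extract that table from the output using BeautifulSoup.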
import urllib, requests, re
from bs4 import BeautifulSoup
def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""
payload = {
    'lon': '1:35:00',
    'lat': '-10:13:00',
    'radius': '18.0',
    'hconst': '73',
    'omegam': '0.27',
    'omegav': '0.73',
    'search_type': 'Near Position Search',
    'in_equinox': 'J2000.0',
    'ot_include': 'ANY',
    'in_csys': 'Equatorial',
    'in_objtypes1': ['GClusters', 'GGroups'],
}
r = requests.get('https://ned.ipac.caltech.edu/cgi-bin/objsearch', params=payload,verify=False)
print(r.url)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.table
# Find the column names
header = soup.find_all('strong')[-1]
# Use + rather than *: on Python 3.7+ a pattern that can match the
# empty string makes re.split split between every character
columns = re.split(r'\n+', header.text)[-2]
mylist = re.split(r'\s+', columns)
# Store the column names in a list, joining the pieces of a parenthesised name
mycolumns = []
flag = 0
for element in mylist:
    if element != u'' and flag < 1 and '(' not in element:
        mycolumns.append(element)
    if '(' in element and flag < 1:
        object = element
        flag = 1
    if '(' not in element and flag > 0:
        if ')' not in element:
            object += element
        else:
            object += element
            flag = 0
            mycolumns.append(object)
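The code above gives me more or less what I need, but I suspect it is not the best approach. The query result on the web page looks like this: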
Row Object Name EquJ2000.0 Object Velocity/Redshift Mag./ Separ. Number of Row
No. (* => Essential Note) RA DEC Type km/s z Qual Filter arcmin Refs Notes Phot Posn Vel/z Diam Assoc Images Spectra No.
1 GMBCG J023.72560-10.18783 01h34m54.1s -10d11m16s GClstr >30000 0.346000 PHOT ... 2.252 1 0 0 0 1 0 0 Retrieve Retrieve 1
2 SDSSCGB 21433 01h34m40.5s -10d14m14s GGroup ... ... ... 4.956 1 0 0 0 0 0 0 Retrieve Retrieve 2
3 WHL J013438.0-101743 01h34m38.0s -10d17m43s GClstr >30000 0.372800 PHOT ... 7.179 2 0 0 0 2 0 0 Retrieve Retrieve 3
4 SDSSCGB 18836 01h34m37.4s -10d20m25s GGroup ... ... ... 9.272 1 0 0 0 0 0 0 Retrieve Retrieve 4
5 GMBCG J023.65477-10.06935 01h34m37.1s -10d04m10s GClstr >30000 0.336000 PHOT ... 10.477 1 0 0 0 1 0 0 Retrieve Retrieve 5
6 GMBCG J023.95379-10.20892 01h35m48.9s -10d12m32s GClstr >30000 0.179000 PHOT ... 12.043 1 0 0 0 1 0 0 Retrieve Retrieve 6
7 SDSSCGB 11439 01h34m07.6s -10d12m01s GGroup ... ... ... 12.930 1 0 0 0 0 0 0 Retrieve Retrieve 7
8 GMBCG J023.53330-10.16959 01h34m08.0s -10d10m11s GClstr >30000 0.438000 PHOT ... 13.105 1 0 0 0 1 0 0 Retrieve Retrieve 8
9 WHL J013404.8-101438 01h34m04.8s -10d14m38s GClstr >30000 0.321800 PHOT ... 13.678 2 0 0 0 2 0 0 Retrieve Retrieve 9
10 GMBCG J023.90759-10.03946 01h35m37.8s -10d02m22s GClstr >30000 0.298000 PHOT ... 14.131 1 0 0 0 1 0 0 Retrieve Retrieve 10
11 SDSSCGB 20022 01h36m00.4s -10d09m21s GGroup ... ... ... 15.302 1 0 0 0 0 0 0 Retrieve Retrieve 11
12 GMBCG J024.00318-10.15744 01h36m00.7s -10d09m27s GClstr >30000 0.385000 PHOT ... 15.368 1 0 0 0 1 0 0 Retrieve Retrieve 12
13 MaxBCG J023.98788-10.04339 01h35m57.1s -10d02m36s GClstr >30000 0.297050 PHOT ... 17.479 2 0 0 0 1 0 0 Retrieve Retrieve 13
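I want to extract only the first eight columns of the query result, but doing that with BeautifulSoup is not straightforward. Any suggestions would be greatly appreciated.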
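To make the goal concrete, here is a minimal sketch of the kind of extraction I have in mind, working on the plain text of the result (for example what `soup.get_text()` returns) rather than on HTML cells. It assumes that every data line starts with the row number, that RA and DEC look like `01h34m54.1s` and `-10d11m16s`, and that the next four fields are whitespace-free tokens. The names `ROW_RE`, `COLUMNS` and `parse_rows` are hypothetical, not part of any library. If a cell such as Qual is blank in the real page, the later tokens would shift left and this sketch would mislabel them.

```python
import re

# Assumed layout of one data line: row number, object name (may contain
# spaces), RA, DEC, then four whitespace-free tokens.
ROW_RE = re.compile(
    r'^\s*(?P<row>\d+)\s+'
    r'(?P<name>.+?)\s+'
    r'(?P<ra>\d{2}h\d{2}m[\d.]+s)\s+'
    r'(?P<dec>[+-]\d{2}d\d{2}m[\d.]+s)\s+'
    r'(?P<objtype>\S+)\s+'
    r'(?P<velocity>\S+)\s+'
    r'(?P<z>\S+)\s+'
    r'(?P<qual>\S+)'
)

# Hypothetical keys for the first eight columns of the result table.
COLUMNS = ('row', 'name', 'ra', 'dec', 'objtype', 'velocity', 'z', 'qual')


def parse_rows(text):
    """Return one dict per data line; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows.append({key: match.group(key) for key in COLUMNS})
    return rows


sample = """\
Row Object Name EquJ2000.0 Object Velocity/Redshift Mag./ Separ. Number of Row
1 GMBCG J023.72560-10.18783 01h34m54.1s -10d11m16s GClstr >30000 0.346000 PHOT ... 2.252 1 0 0 0 1 0 0 Retrieve Retrieve 1
3 WHL J013438.0-101743 01h34m38.0s -10d17m43s GClstr >30000 0.372800 PHOT ... 7.179 2 0 0 0 2 0 0 Retrieve Retrieve 3
"""

for row in parse_rows(sample):
    print(row['row'], row['name'], row['ra'], row['dec'], row['objtype'])
```

Anchoring on the RA pattern is what lets the object name contain spaces; the lazy `.+?` simply grows until the coordinate token follows.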
[Comments]:
Tags: regex beautifulsoup html-parsing html-table python-requests