Parsing the columns from a URL query using BeautifulSoup

【Question title】: Parsing the columns from a URL query using BeautifulSoup
【Posted】: 2016-11-23 13:09:13
【Question】:

I send a query to a URL with the requests module and get back a table embedded in HTML. Now I would like to extract that table from the output using BeautifulSoup.

import urllib, requests, re
from bs4 import BeautifulSoup
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

payload = {'lon': '1:35:00', 'lat': '-10:13:00', 'radius':'18.0', 'hconst':'73', 'omegam':'0.27','omegav':'0.73','search_type':'Near Position Search','in_equinox':'J2000.0','ot_include':'ANY','in_csys':'Equatorial','in_objtypes1': ['GClusters', 'GGroups']}
r = requests.get('https://ned.ipac.caltech.edu/cgi-bin/objsearch', params=payload,verify=False)
print(r.url)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.table
# find the column names
header = soup.find_all('strong')[-1]
# use + rather than *: a pattern that can match the empty string splits between every character on Python 3.7+
columns = re.split(r'\n+', header.text)[-2]
mylist = re.split(r'\s+', columns)
# store the column names in a list, joining the tokens of a parenthesised name
mycolumns = []
flag = 0
for element in mylist:
    if element != u'' and flag < 1 and '(' not in element:
        mycolumns.append(element)
    if '(' in element and flag < 1:
        name = element  # renamed from `object`, which shadows the builtin
        flag = 1
    if '(' not in element and flag > 0:
        name += element
        if ')' in element:
            flag = 0
            mycolumns.append(name)

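The code above gives me more or less what I want, but I suspect it is not the best approach. The query result on the web page looks like this: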

Row          Object Name                 EquJ2000.0       Object  Velocity/Redshift    Mag./  Separ.               Number of                                 Row 
 No.     (* => Essential Note)       RA               DEC  Type     km/s       z   Qual Filter arcmin Refs Notes Phot Posn Vel/z Diam Assoc Images   Spectra  No. 
1    GMBCG J023.72560-10.18783      01h34m54.1s -10d11m16s GClstr >30000  0.346000 PHOT  ...    2.252    1     0    0    0     1    0     0 Retrieve Retrieve 1   
2    SDSSCGB 21433                  01h34m40.5s -10d14m14s GGroup    ...       ...       ...    4.956    1     0    0    0     0    0     0 Retrieve Retrieve 2   
3    WHL J013438.0-101743           01h34m38.0s -10d17m43s GClstr >30000  0.372800 PHOT  ...    7.179    2     0    0    0     2    0     0 Retrieve Retrieve 3   
4    SDSSCGB 18836                  01h34m37.4s -10d20m25s GGroup    ...       ...       ...    9.272    1     0    0    0     0    0     0 Retrieve Retrieve 4   
5    GMBCG J023.65477-10.06935      01h34m37.1s -10d04m10s GClstr >30000  0.336000 PHOT  ...   10.477    1     0    0    0     1    0     0 Retrieve Retrieve 5   
6    GMBCG J023.95379-10.20892      01h35m48.9s -10d12m32s GClstr >30000  0.179000 PHOT  ...   12.043    1     0    0    0     1    0     0 Retrieve Retrieve 6   
7    SDSSCGB 11439                  01h34m07.6s -10d12m01s GGroup    ...       ...       ...   12.930    1     0    0    0     0    0     0 Retrieve Retrieve 7   
8    GMBCG J023.53330-10.16959      01h34m08.0s -10d10m11s GClstr >30000  0.438000 PHOT  ...   13.105    1     0    0    0     1    0     0 Retrieve Retrieve 8   
9    WHL J013404.8-101438           01h34m04.8s -10d14m38s GClstr >30000  0.321800 PHOT  ...   13.678    2     0    0    0     2    0     0 Retrieve Retrieve 9   
10   GMBCG J023.90759-10.03946      01h35m37.8s -10d02m22s GClstr >30000  0.298000 PHOT  ...   14.131    1     0    0    0     1    0     0 Retrieve Retrieve 10  
11   SDSSCGB 20022                  01h36m00.4s -10d09m21s GGroup    ...       ...       ...   15.302    1     0    0    0     0    0     0 Retrieve Retrieve 11  
12   GMBCG J024.00318-10.15744      01h36m00.7s -10d09m27s GClstr >30000  0.385000 PHOT  ...   15.368    1     0    0    0     1    0     0 Retrieve Retrieve 12  
13   MaxBCG J023.98788-10.04339     01h35m57.1s -10d02m36s GClstr >30000  0.297050 PHOT  ...   17.479    2     0    0    0     1    0     0 Retrieve Retrieve 13

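I want to extract the information from only the first eight columns of the query result, but doing this with BeautifulSoup is not straightforward. Any suggestions would be appreciated.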

【Question comments】:

Tags: regex beautifulsoup html-parsing html-table python-requests

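【Solution 1】:

You can't rely on BeautifulSoup for this, because the result you get from the site is not well-structured HTML (at least for the part you want). The whole table content sits inside a single table/tr/td/pre element, mostly as plain text, so there are no per-cell tags for the parser to walk.

If you are willing to use regular expressions, which may break if the format of the data changes, you can use this approach (based on your current code):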

# coding: utf-8
import requests, re, pprint
from bs4 import BeautifulSoup
payload = {'lon': '1:35:00', 'lat': '-10:13:00', 'radius':'18.0', 'hconst':'73', 'omegam':'0.27','omegav':'0.73','search_type':'Near Position Search','in_equinox':'J2000.0','ot_include':'ANY','in_csys':'Equatorial','in_objtypes1': ['GClusters', 'GGroups']}
r = requests.get('https://ned.ipac.caltech.edu/cgi-bin/objsearch', params=payload,verify=False)
print(r.url)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.table
text = tables.text
rows = text.split("\n")

result = []
for row in rows:
    if re.match(r"^\d+\s", row):
        row = row.replace(u'\xa0', u' ')  # normalize non-breaking spaces

        # split the row into columns with a regex
        search = re.match(
            r"(\d+)\s+(.+?)\s+(\d+h\S+)\s+([-\w]+)\s+(\w+)\s+(\.{3}|[<>\d]+)\s+(\.{3}|[\d.]+)\s+(\w+)?\s+\s+(\.{3}|\w+)",
            # for all columns, add this part to the regex:
            # \s+([-.\d]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(\d+)
            row)
        if search is None:
            continue  # the row does not have the expected layout; skip it

        # map the regex groups to the table row names
        tmp_result = {
            "row": search.group(1),
            "objectName": search.group(2),
            "EquJ2000_ra": search.group(3),
            "EquJ2000_dec": search.group(4),
            "objectType": search.group(5),
            "Velocity": search.group(6),
            "Redshift": search.group(7),
            "Qual": search.group(8),
            "Filter": search.group(9),
            # further columns
            # "arcmin": search.group(10),
            # "Refs": search.group(11),
            # "Notes": search.group(12),
            # "Phot": search.group(13),
            # "Posn": search.group(14),
            # "Vel_z": search.group(15),
            # "Diam": search.group(16),
            # "Assoc": search.group(17),
            # "Images": search.group(18),
            # "Spectra": search.group(19),
        }

        # append the result with the row number as key
        n = int(search.group(1))
        result.append({ n: tmp_result })

pprint.pprint(result)  # pprint() prints by itself; wrapping it in print would also output "None"

The result is:

[{1: {'EquJ2000_dec': u'-10d11m16s',
      'EquJ2000_ra': u'01h34m54.1s',
      'Filter': u'...',
      'Qual': u'PHOT',
      'Redshift': u'0.346000',
      'Velocity': u'>30000',
      'objectName': u'GMBCG J023.72560-10.18783',
      'objectType': u'GClstr',
      'row': u'1'}},
 {2: {'EquJ2000_dec': u'-10d14m14s',
      'EquJ2000_ra': u'01h34m40.5s',
      'Filter': u'...',
      'Qual': None,
      'Redshift': u'...',
      'Velocity': u'...',
      'objectName': u'SDSSCGB 21433',
      'objectType': u'GGroup',
      'row': u'2'}},
 {3: {'EquJ2000_dec': u'-10d17m43s',
      'EquJ2000_ra': u'01h34m38.0s',
      'Filter': u'...',
      'Qual': u'PHOT',
      'Redshift': u'0.372800',
      'Velocity': u'>30000',
      'objectName': u'WHL J013438.0-101743',
      'objectType': u'GClstr',
      'row': u'3'}},
 {4: {'EquJ2000_dec': u'-10d20m25s',
      'EquJ2000_ra': u'01h34m37.4s',
      'Filter': u'...',
      'Qual': None,
      'Redshift': u'...',
      'Velocity': u'...',
      'objectName': u'SDSSCGB 18836',
      'objectType': u'GGroup',
      'row': u'4'}},
 {5: {'EquJ2000_dec': u'-10d04m10s',
      'EquJ2000_ra': u'01h34m37.1s',
      'Filter': u'...',
      'Qual': u'PHOT',
      'Redshift': u'0.336000',
      'Velocity': u'>30000',
      'objectName': u'GMBCG J023.65477-10.06935',
      'objectType': u'GClstr',
      'row': u'5'}},
 {6: {'EquJ2000_dec': u'-10d12m32s',
      'EquJ2000_ra': u'01h35m48.9s',
      'Filter': u'...',
      'Qual': u'PHOT',
      'Redshift': u'0.179000',
      'Velocity': u'>30000',
      'objectName': u'GMBCG J023.95379-10.20892',
      'objectType': u'GClstr',
      'row': u'6'}},
 {7: {'EquJ2000_dec': u'-10d12m01s',
      'EquJ2000_ra': u'01h34m07.6s',
      'Filter': u'...',
      'Qual': None,
      'Redshift': u'...',
      'Velocity': u'...',
      'objectName': u'SDSSCGB 11439',
      'objectType': u'GGroup',
      'row': u'7'}},
 {8: {'EquJ2000_dec': u'-10d10m11s',
      'EquJ2000_ra': u'01h34m08.0s',
      'Filter': u'...',
      'Qual': u'PHOT',
      'Redshift': u'0.438000',
      'Velocity': u'>30000',
      'objectName': u'GMBCG J023.53330-10.16959',
      'objectType': u'GClstr',
      'row': u'8'}},
 {9: {'EquJ2000_dec': u'-10d14m38s',
      'EquJ2000_ra': u'01h34m04.8s',
      'Filter': u'...',
      'Qual': u'PHOT',
      'Redshift': u'0.321800',
      'Velocity': u'>30000',
      'objectName': u'WHL J013404.8-101438',
      'objectType': u'GClstr',
      'row': u'9'}},
 {10: {'EquJ2000_dec': u'-10d02m22s',
       'EquJ2000_ra': u'01h35m37.8s',
       'Filter': u'...',
       'Qual': u'PHOT',
       'Redshift': u'0.298000',
       'Velocity': u'>30000',
       'objectName': u'GMBCG J023.90759-10.03946',
       'objectType': u'GClstr',
       'row': u'10'}},
 {11: {'EquJ2000_dec': u'-10d09m21s',
       'EquJ2000_ra': u'01h36m00.4s',
       'Filter': u'...',
       'Qual': None,
       'Redshift': u'...',
       'Velocity': u'...',
       'objectName': u'SDSSCGB 20022',
       'objectType': u'GGroup',
       'row': u'11'}},
 {12: {'EquJ2000_dec': u'-10d09m27s',
       'EquJ2000_ra': u'01h36m00.7s',
       'Filter': u'...',
       'Qual': u'PHOT',
       'Redshift': u'0.385000',
       'Velocity': u'>30000',
       'objectName': u'GMBCG J024.00318-10.15744',
       'objectType': u'GClstr',
       'row': u'12'}},
 {13: {'EquJ2000_dec': u'-10d02m36s',
       'EquJ2000_ra': u'01h35m57.1s',
       'Filter': u'...',
       'Qual': u'PHOT',
       'Redshift': u'0.297050',
       'Velocity': u'>30000',
       'objectName': u'MaxBCG J023.98788-10.04339',
       'objectType': u'GClstr',
       'row': u'13'}}]

Note that "Qual" can be None when the column is empty, as in row 2.
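The same idea can be sketched with named groups and re.VERBOSE, which makes it easier to see which part of the pattern belongs to which column. This is only an illustration under stated assumptions: ROW_RE, parse_row and SAMPLE_ROWS are hypothetical names, the two sample rows are copied from the output in the question, and treating "..." as a missing value is a guess based on that output rather than something the site documents here.

```python
import re

# Two rows copied from the query output shown in the question.
SAMPLE_ROWS = [
    "1    GMBCG J023.72560-10.18783      01h34m54.1s -10d11m16s GClstr >30000  0.346000 PHOT  ...    2.252    1     0    0    0     1    0     0 Retrieve Retrieve 1   ",
    "2    SDSSCGB 21433                  01h34m40.5s -10d14m14s GGroup    ...       ...       ...    4.956    1     0    0    0     0    0     0 Retrieve Retrieve 2   ",
]

# Hypothetical pattern: the row number plus the first eight columns.
ROW_RE = re.compile(r"""
    (?P<row>\d+)\s+
    (?P<name>.+?)\s+                    # object name, may contain spaces
    (?P<ra>\d+h\S+)\s+                  # right ascension, e.g. 01h34m54.1s
    (?P<dec>[-+\w]+)\s+                 # declination, e.g. -10d11m16s
    (?P<type>\w+)\s+
    (?P<velocity>\.{3}|[<>\d]+)\s+
    (?P<redshift>\.{3}|[\d.]+)\s+
    (?P<qual>\w+)?\s+                   # optional, blank in some rows
    (?P<filter>\.{3}|\w+)
""", re.VERBOSE)


def parse_row(line):
    """Return a dict of the matched columns, or None if the line does not match."""
    match = ROW_RE.match(line.replace(u"\xa0", u" "))
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: "..." marks a missing value, so map it to None like a blank Qual.
    return {key: (None if value == "..." else value) for key, value in fields.items()}


parsed = [parse_row(line) for line in SAMPLE_ROWS]
```

Lines that do not look like data rows (the header, for instance) simply yield None, so they can be filtered out instead of raising an error.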

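【Comments】:

Thanks for your answer. But why do you only get the first eight rows, and is it not possible to extend this to all rows?
How can I get the number of rows in the table?
I misread your post and thought you wanted the first eight rows rather than columns. I have edited my answer; now you get all rows with only the first eight columns (nine if you count the row number). You can get the number of rows with len(result).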
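A small sketch of working with that structure, assuming result has the shape printed in the answer (a list of one-entry dicts keyed by row number). The two entries below are copied from that output and shortened to a few keys; by_row and group_names are hypothetical names used only for illustration.

```python
# Two entries copied from the printed result in the answer (shortened to a few keys).
result = [
    {1: {"objectName": "GMBCG J023.72560-10.18783", "objectType": "GClstr", "Qual": "PHOT"}},
    {2: {"objectName": "SDSSCGB 21433", "objectType": "GGroup", "Qual": None}},
]

# Number of parsed table rows.
row_count = len(result)

# Merge the one-entry dicts so a row can be looked up directly by its number.
by_row = {}
for entry in result:
    by_row.update(entry)

# Example lookup: names of all objects classified as galaxy groups.
group_names = [row["objectName"] for row in by_row.values() if row["objectType"] == "GGroup"]
```

Merging into a single dict is optional; it just avoids scanning the list when a specific row number is needed.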