Parsing an html table with BeautifulSoup: answers

【Question title】: Parsing an html table with BeautifulSoup
【Posted】: 2012-01-12 07:29:52
【Question】:

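I am trying to get the data for a given day from this timetable: click here

Using Beautiful Soup, I have been able to add the whole row for any given day (in this case Monday, or 'Mon') to a list with this code: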

from BeautifulSoup import BeautifulSoup

day ='Mon'

with open('timetable.txt', 'rt') as input_file:
  html = input_file.read()
  soup = BeautifulSoup(html)
  #finds correct day tag
  starttag = soup.find(text=day).parent.parent
  print starttag
  nexttag = starttag
  row=[]
  x = 0
  #puts all td tags for that day in a list
  while x < 18:
    nexttag = nexttag.nextSibling.nextSibling
    row.append(nexttag)
    x += 1
print row

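As you can see, this returns a list of TD tags that make up the 'Mon' row of the timetable.

My problem is that I don't know how to further parse or search the returned list to find the relevant information (COMP1740 etc.).

If I can work out how to search each element of the list for the module codes, I can then join them with another list of times to give a timetable for the day.

All help welcome! (Including completely different approaches.)

【Comments】:

Tags: python beautifulsoup html-table html-parsing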

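【Solution 1】:

You can use regular expressions (i.e. pattern matching) to find things like the course codes.

I don't know how much experience you have with them, but Python includes the re module. Look for the pattern "the four letters C-O-M-P followed by one or more digits". That gives the regular expression COMP\d+, where \d is a single digit and the + after it means "match as many as possible" (4 in this case).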
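As a quick illustration of the pattern on its own, here is a minimal sketch that applies COMP\d+ to a single string. The cell markup is made up for the example and is not taken from the real timetable.

```python
import re

# Pattern from the answer: the letters COMP followed by one or more digits.
code_pat = re.compile(r'COMP\d+')

# Hypothetical cell markup, invented for illustration only.
cell = '<td rowspan="1" colspan="2">COMP1740 Lecture</td>'

# search() returns a match object, or None when the pattern is absent.
found = code_pat.search(cell)
code = found.group(0) if found else None
print(code)
```

Because search() returns None when nothing matches, checking the result before calling group(0) avoids an AttributeError on cells without a module code.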

from BeautifulSoup import BeautifulSoup
import re

day ='Mon'
codePat = re.compile(r'COMP\d+')

with open('timetable.txt', 'rt') as input_file:
  html = input_file.read()
  soup = BeautifulSoup(html)
  #finds correct day tag
  starttag = soup.find(text=day).parent.parent
#  print starttag
  nexttag = starttag
  row=[]
  x = 0
  #puts all td tags for that day in a list
  while x < 18:
    nexttag = nexttag.nextSibling.nextSibling
    found = codePat.search(repr(nexttag))
    if found:
      row.append(found.group(0))
    x += 1
print row

This gives me the output:

['COMP1940', 'COMP1550', 'COMP1740']

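Like I said, I don't know how familiar you are with regular expressions, so if you can describe the patterns, I can try to write them. Here's a good resource if you decide to do it yourself.

【Discussion】:

Thanks a lot for your help. It turns out that only my module codes start with 'COMP', so I just changed the search pattern to 'rowspan="1"', since that is the only other thing in the code that gives away a module at that point in the table. I will post the new code as an answer.
@Ben, regarding your new answer: when you go past the last sibling, nexttag will be None, so you can say if not nexttag: break. It's cleaner than try/catch.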
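The suggestion above can be sketched as a small generator that walks siblings until it runs out. To keep the sketch runnable without BeautifulSoup installed, the nodes here are stand-in objects that only mimic the nextSibling attribute; following_siblings and the node names are hypothetical and not part of the library.

```python
from types import SimpleNamespace

def following_siblings(tag):
    """Yield each sibling after `tag`, stopping once nextSibling is None."""
    node = tag.nextSibling
    while node is not None:
        yield node
        node = node.nextSibling

# Stand-in nodes (assumption: real tags expose nextSibling the same way).
c = SimpleNamespace(name='td3', nextSibling=None)
b = SimpleNamespace(name='td2', nextSibling=c)
a = SimpleNamespace(name='td1', nextSibling=b)

names = [n.name for n in following_siblings(a)]
print(names)
```

The sketch tests "is not None" rather than plain truthiness so that an empty string node is not mistaken for the end of the row. It also visits every sibling, whereas the code in this thread steps two siblings at a time, so a real loop would still need to skip the nodes in between.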

【Solution 2】:

from BeautifulSoup import BeautifulSoup
import re

#day input
day ='Thu'
#searches for a module (where html has rowspan="1")
module = re.compile(r'rowspan=\"1\"')
#lengths of module search (depending on html colspan attribute)
#1.5 hour
perlen15 = re.compile(r'colspan=\"3\"')
#2 hour
perlen2 = re.compile(r'colspan=\"4\"')
#2.5 hour etc.
perlen25 = re.compile(r'colspan=\"5\"')
perlen3 = re.compile(r'colspan=\"6\"')
perlen35 = re.compile(r'colspan=\"7\"')
perlen4 = re.compile(r'colspan=\"8\"')
#times correspond to first row of timetable.
times = ['8:00', '8:30', '9:00', '9:30', '10:00', '10:30', '11:00', '11:30', '12:00', '12:30', '13:00', '13:30', '14:00', '14:30', '15:00', '15:30']

#opens full timetable html
with open('timetable.txt', 'rt') as input_file:
  html = input_file.read()
  soup = BeautifulSoup(html)
  #finds correct day tag
  starttag = soup.find(text=day).parent.parent
  nexttag = starttag
  row=[]
  #movement of cursor iterating over times list
  curmv = 0
  #puts following td tags for that day in a list
  for time in times:
    nexttag = nexttag.nextSibling.nextSibling
    #detect if a module is found
    found = module.search(repr(nexttag))
    #detect length of that module
    hour15 = perlen15.search(repr(nexttag))
    hour2 = perlen2.search(repr(nexttag))
    hour25 = perlen25.search(repr(nexttag))
    hour3 = perlen3.search(repr(nexttag))
    hour35 = perlen35.search(repr(nexttag))
    hour4 = perlen4.search(repr(nexttag))
    if found: 
      row.append(times[curmv])
      row.append(nexttag)
      if hour15:
        curmv += 3
      elif hour2:
        curmv += 4
      elif hour25:
        curmv += 5
      elif hour3:
        curmv += 6
      elif hour35:
        curmv += 7
      elif hour4:
        curmv += 8
      else:
        curmv += 2
    else:
      curmv += 1
#write day to html file
with open('output.html', 'wt') as output_file:
  for e in row:
    output_file.write(str(e))

As you can see, the code can tell 1-hour and 2-hour lectures apart, as well as 1.5-hour, 2.5-hour lectures and so on.
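The six colspan patterns could probably be folded into one capturing pattern. The following is only a sketch under the assumption that each module cell's colspan equals the number of half-hour slots it covers, and that a module cell without a matching colspan counts as 2 slots, as in the else branch above. slots_spanned is a hypothetical helper name.

```python
import re

# One pattern with a capture group instead of one pattern per length.
colspan_pat = re.compile(r'colspan="(\d+)"')

def slots_spanned(cell_html, default=2):
    """Return how many half-hour slots a module cell covers."""
    m = colspan_pat.search(cell_html)
    # Fall back to the default when the cell has no colspan attribute.
    return int(m.group(1)) if m else default

# Hypothetical cell markup, invented for illustration only.
print(slots_spanned('<td rowspan="1" colspan="3">COMP1550</td>'))
print(slots_spanned('<td rowspan="1">COMP1940</td>'))
```

With a helper like this, the if/elif chain would reduce to curmv += slots_spanned(repr(nexttag)) for module cells, while non-module cells would still advance by 1.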

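My only problem now is line 32 (the for time in times: loop): I need a better way to tell the code to stop moving horizontally through the table, i.e. to know when to stop the for loop. (In the earlier code I had while x < 18:, which only works for Monday because that row has 18 td tags.) How can I make the loop stop when it reaches the parent </tr> tag?

Thanks!

EDIT: I'm going to try using a try and except block to catch the error I get if I set 'times' all the way up to 18:00.

EDIT2: It worked! :D

【Discussion】: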