【Posted】: 2018-10-30 02:17:08
【Question】:
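I'm working on a project to scrape catalog information for books from specific libraries. So far, my script can scrape all of the cells from the table. However, I'm confused about how to return only the cells for the New Britain library.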
import requests
from bs4 import BeautifulSoup

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)
Here is a sample of the script's output for the New Britain library:
["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']
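Instead of returning all of the cells, how would I return only the cells pertaining to the New Britain library? I also only want the library name and the checkout status.
The desired output is: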
["New Britain, Main Library - Children's Department", 'Check Shelf']
There can be multiple cells, since a book can have multiple copies at the same library.
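One possible way to get that output, sketched here as an illustration rather than a confirmed solution: filter the rows already collected in `data` by the text of their first cell. The helper name `filter_library` is hypothetical, and the sketch assumes the library name is always the first non-empty cell and the checkout status is always the last one, as in the sample output above. The second sample row is made up for the demonstration.

```python
def filter_library(rows, library="New Britain"):
    """Return [library name, status] for each row whose first cell starts with `library`.

    Assumption: the first cell holds the library name and the last cell
    holds the checkout status, matching the sample output in the question.
    """
    result = []
    for cells in rows:
        # Header rows contain no <td> cells, so they show up as empty lists; skip them.
        if len(cells) >= 2 and cells[0].startswith(library):
            result.append([cells[0], cells[-1]])
    return result


# Rows shaped like the script's `data` list. Only the New Britain row
# comes from the question; the other library row is hypothetical.
sample = [
    [],
    ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],
    ["Hypothetical Library - Adult", 'FIC PALACIO', 'Checked Out'],
]

matches = filter_library(sample)
for row in matches:
    print(row)
```

With the real page, the same call would be `filter_library(data)`. Because every matching row is appended, multiple copies at the same library would each produce their own entry.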
【Comments】:
Tags: python beautifulsoup screen-scraping