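【Posted】: 2022-06-10 20:25:10
【Question】:
I want to scrape multiple URLs that use two different HTML templates. I can scrape each template on its own without any problem, but I run into trouble when I try to combine the two scrapers. Here is my code: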
import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url1 = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'
page_url2 = 'https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286'

# One row per page: the URL and the template (1 or 2) it uses
page_url_lst = {'url': [page_url1, page_url2], 'template': [1, 2]}
page_url_df = pd.DataFrame(page_url_lst)

data = []

# This comparison yields a boolean Series, not a single bool; the `if` is where the error is raised
if page_url_df['template'] == 1:
    for url in page_url_df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # Template 1: the title is in <em>, authors precede it, journal follows it
        for e in soup.select('#tabs-publications em'):
            data.append({
                'author': e.previous.get_text(strip=True)[:-1],
                'title': e.get_text(strip=True),
                'journal': e.next_sibling.get_text(strip=True),
                'source': url
            })
else:
    for url_2 in page_url_df['url']:
        r_2 = requests.get(url_2)
        soup_2 = BeautifulSoup(r_2.text, 'lxml')
        # Template 2: authors in <b>, title in <i>, journal follows the <i>
        for a in soup_2.find_all('span', {'class': 'fac_citation'}):
            data.append({
                'author': a.find('b').get_text(),
                'title': a.find('i').get_text(strip=True),
                'journal': a.find('i').next_sibling.get_text(strip=True),
                'source': url_2
            })
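The intended logic is: if the 'template' column has the value 1, extract the data with the first template; otherwise extract it with the second template. However, this code raises the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks in advance!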
【Discussion】:
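The error quoted in the question most likely comes from `page_url_df['template'] == 1`: that expression compares the whole column at once and returns a boolean Series (one value per row), which `if` cannot reduce to a single True or False. One possible fix is to decide row by row instead, iterating over the URL and template together and picking the parser for each row. The following is only a minimal sketch under some assumptions: the names `_text`, `parse_template_1`, `parse_template_2`, `PARSERS`, `fetch_html` and `scrape` are hypothetical helpers introduced here, not part of the original code, and the built-in `html.parser` is swapped in for `lxml` to avoid an extra dependency (the two parsers can treat malformed HTML differently).

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def _text(node):
    """Return stripped text for a Tag, a text node, or None (hypothetical helper)."""
    if node is None:
        return ''
    if isinstance(node, str):  # bs4's NavigableString is a str subclass
        return node.strip()
    return node.get_text(strip=True)


def parse_template_1(soup, url):
    """Template 1: title in <em> inside #tabs-publications."""
    rows = []
    for e in soup.select('#tabs-publications em'):
        rows.append({
            # Node just before <em> holds the authors; drop the trailing separator
            'author': _text(e.previous_element)[:-1],
            'title': e.get_text(strip=True),
            'journal': _text(e.next_sibling),
            'source': url,
        })
    return rows


def parse_template_2(soup, url):
    """Template 2: authors in <b>, title in <i>, inside span.fac_citation."""
    rows = []
    for a in soup.find_all('span', {'class': 'fac_citation'}):
        title = a.find('i')
        rows.append({
            'author': _text(a.find('b')),
            'title': _text(title),
            'journal': _text(title.next_sibling if title is not None else None),
            'source': url,
        })
    return rows


# Map each template id from the 'template' column to its parser
PARSERS = {1: parse_template_1, 2: parse_template_2}


def fetch_html(url):
    """Download a page and return its HTML text."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return r.text


def scrape(page_url_df, fetch=fetch_html):
    """Scrape every row of page_url_df with the parser matching its template."""
    data = []
    # zip pairs each URL with its own template value, so no Series ends up in an `if`
    for url, template in zip(page_url_df['url'], page_url_df['template']):
        parser = PARSERS[int(template)]
        soup = BeautifulSoup(fetch(url), 'html.parser')
        data.extend(parser(soup, url))
    return pd.DataFrame(data)
```

With the `page_url_df` from the question, `scrape(page_url_df)` should return one DataFrame with the columns author, title, journal and source. The `fetch` parameter is there only so the parsing can be tried on saved HTML without hitting the network. Whether the selectors still match the live pages is not something this sketch checks.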
Tags: python pandas beautifulsoup