BeautifulSoup can't get 'tr' in table

【Question title】: BeautifulSoup can't get 'tr' in table
【Posted】: 2021-03-17 00:36:58
【Question description】:

I am trying to get a list of company names (e.g. 01 Ventures) and types (e.g. GENERAL PARTNER) from this website: https://www.bvca.co.uk/Member-Directory. I am using the code below:

import requests
from bs4 import BeautifulSoup
URL = 'https://www.bvca.co.uk/Member-Directory'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

table = soup.find('table', attrs={'id':'searchresults'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')

print(rows)

I get an empty list.

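【Comments】:

The table is loaded dynamically. Use the browser's Network tab to see where the data really comes from (an additional XHR request), or use Selenium.
You may have to use Selenium to get the rendered page source.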
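
One way to confirm the "loaded dynamically" diagnosis is to count the rows that are actually present in the HTML the server returns, before any JavaScript runs. The sketch below is only an illustration: `RowCounter`, `count_static_rows`, and the two sample snippets are hypothetical, and it uses the standard library's `html.parser` so it runs without network access. With the question's code, the input would be `page.text`.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> tags inside the <table> that has a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0  # > 0 while inside the target table (also tracks nested tables)
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth > 0:
                self.depth += 1
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1
        elif tag == "tr" and self.depth > 0:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_static_rows(html, table_id):
    """Return how many <tr> elements the raw HTML contains in the given table."""
    parser = RowCounter(table_id)
    parser.feed(html)
    return parser.rows


# Hypothetical snippets standing in for a server response.
empty_shell = '<table id="searchresults"><tbody></tbody></table>'
filled = ('<table id="searchresults"><tbody>'
          '<tr><td>A</td></tr><tr><td>B</td></tr></tbody></table>')

print(count_static_rows(empty_shell, "searchresults"))  # 0
print(count_static_rows(filled, "searchresults"))       # 2
```

A count of 0 for a table that shows rows in the browser suggests the rows are filled in by a script after the page loads, which matches the empty list in the question.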

Tags: python web-scraping beautifulsoup html-table

【Solution 1】:

Use the selenium package; you will also need to install chromedriver.

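import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

URL = 'https://www.bvca.co.uk/Member-Directory'

# Run Chrome without opening a window.
BrowserOptions = Options()
BrowserOptions.add_argument("--headless")
# Recent Selenium 4 releases take the driver path through a Service object
# instead of the executable_path argument.
Browser = webdriver.Chrome(service=Service(r'chromedriver.exe'), options=BrowserOptions)
Browser.get(URL)

# Poll until the JavaScript-rendered company names are present in the DOM.
while True:
    # find_elements(By.CLASS_NAME, ...) replaces find_elements_by_class_name.
    if Browser.find_elements(By.CLASS_NAME, 'companyName'):
        break
    time.sleep(0.5)  # pause between checks instead of spinning the CPU

html_source_code = Browser.execute_script("return document.body.innerHTML;")
Browser.quit()  # close the browser once the HTML has been captured

soup = BeautifulSoup(html_source_code, 'html.parser')

# Collect the text of every <h5 class="companyName"> element.
x = [r.text for r in soup.find_all('h5', class_='companyName')]
print(x)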

>>> ['01 Ventures', '01 Ventures', '17Capital LLP', '17Capital LLP', '1818 Venture Capital', ..., 'Zouk Capital LLP', 'Zouk Capital LLP']

The while loop waits until the company names have loaded, and only then is the HTML saved.
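
The loop in the answer has no upper bound, so it would run forever if the names never appear (for example, if the page layout changes). A bounded version of the same polling idea might look like the sketch below; `wait_until` and its parameters are hypothetical names, and the `checks` iterator only simulates a page that finishes loading on the third check.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)  # pause between checks


# Stand-in for a page whose elements show up on the third check.
checks = iter([[], [], ["01 Ventures"]])
found = wait_until(lambda: next(checks), timeout=2.0, interval=0.01)
print(found)  # ['01 Ventures']
```

In the answer's script, the predicate would be a lambda wrapping the `companyName` lookup on `Browser`. Selenium also ships `WebDriverWait` together with the `expected_conditions` module for this purpose, which is usually preferable to a hand-written loop.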

The output is too large to fit in the answer, so only part of it is shown.
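
In the sample output each name appears twice in a row, presumably because the page renders every company in two matching elements (an assumption; the answer does not explain it). If a list of unique names is wanted, one order-preserving way to collapse the duplicates is sketched here; `unique_in_order` is a hypothetical helper, and `names` is a short excerpt of the output shown in the answer.

```python
def unique_in_order(names):
    """Drop repeated names while keeping the first occurrence of each, in order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(names))


# Excerpt of the answer's output.
names = ['01 Ventures', '01 Ventures', '17Capital LLP', '17Capital LLP',
         '1818 Venture Capital']

print(unique_in_order(names))
# ['01 Ventures', '17Capital LLP', '1818 Venture Capital']
```

Note that this would also merge two genuinely different companies that happen to share a name.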

【Comments】: