Why can't I use ".text" while scraping table headers with BeautifulSoup to remove unwanted HTML?

【Question title】: Why can't I use ".text" while scraping table headers with BeautifulSoup to remove unwanted HTML?
【Posted】: 2021-05-27 09:19:56
【Question description】:

When I run this code, I can see the headers list populated with the results I want, but they are surrounded by some HTML that I don't want to keep.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# barchart.com uses javascript, so for now I need selenium to get full html
url = 'https://www.barchart.com/stocks/quotes/qqq/constituents'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
page = browser.page_source

#  BeautifulSoup find table
soup = BeautifulSoup(page, 'lxml')
table = soup.find("table")
browser.quit()

# create list headers, then populate with th tagged cells
headers = []

for i in table.find_all('th'):
    title = i()
    headers.append(title)

So I tried:

for i in table.find_all('th'):
    title = i.text()
    headers.append(title)

This returns "TypeError: 'str' object is not callable".

This seems to work in some example documentation, but the Wikipedia tables used there seem simpler than the one on Barchart. Any ideas?

【Comments】:

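Remove the parentheses (). Instead of i.text(), use i.text.
Good question, @Julian! Well written, well formatted, and you showed us what you tried and where it failed. Welcome to the StackOverflow family!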

Tags: python selenium web-scraping beautifulsoup html-table

【Solution 1】:

As @MendelG pointed out, the error is in i.text(): text is an attribute, not a function, so it has to be read without parentheses.

You can also use get_text(), which is a function.
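The difference can be reproduced without Selenium. The snippet below is a minimal sketch: the inline HTML and the variable name `th` are made up for illustration and only stand in for one header cell of the real Barchart table. It also shows why the original `i()` returned markup: calling a tag is BeautifulSoup shorthand for `find_all()`, so it returns the child tags rather than their text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a single header cell of the real table.
html = "<table><tr><th> <span>Symbol</span> </th></tr></table>"
th = BeautifulSoup(html, "html.parser").find("th")

# Calling a tag is shorthand for find_all(), so this returns child tags.
children = [str(tag) for tag in th()]
print(children)  # ['<span>Symbol</span>']

# .text already holds a plain string, so it is read without parentheses.
text_type = type(th.text).__name__
print(text_type)  # str

# Adding parentheses tries to call that string and raises the reported error.
try:
    th.text()
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # 'str' object is not callable

# get_text() is a regular method; with no arguments it matches .text.
same_result = th.get_text() == th.text
print(same_result)  # True
```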

I would also suggest adding strip() to remove the extra whitespace around the text. Or, if you prefer get_text(), it has this built in:

title = i.get_text(strip=True)
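The two options are not always interchangeable. As a small sketch on made-up HTML (the nested `<span>` header is an assumption, not taken from the Barchart page): `.text.strip()` trims only the outer ends of the combined text, while `get_text(strip=True)` strips every text fragment and joins them with an empty separator unless one is passed.

```python
from bs4 import BeautifulSoup

# Hypothetical header row; the second cell splits its label across two tags.
html = """
<table>
  <tr>
    <th>
      Symbol
    </th>
    <th> <span>% </span><span>Holding</span> </th>
  </tr>
</table>
"""
cells = BeautifulSoup(html, "html.parser").find_all("th")

# Trim only the leading and trailing whitespace of the whole cell text.
outer_only = [th.text.strip() for th in cells]
print(outer_only)  # ['Symbol', '% Holding']

# Strip each text fragment, then join the fragments with "" (the default).
per_piece = [th.get_text(strip=True) for th in cells]
print(per_piece)   # ['Symbol', '%Holding']

# Passing a separator keeps a space between the stripped fragments.
spaced = [th.get_text(" ", strip=True) for th in cells]
print(spaced)      # ['Symbol', '% Holding']
```

For header cells that contain a single piece of text, all three forms give the same result, so either approach from the answer should work there.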

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# barchart.com uses javascript, so for now I need selenium to get full html
url = 'https://www.barchart.com/stocks/quotes/qqq/constituents'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
page = browser.page_source

#  BeautifulSoup find table
soup = BeautifulSoup(page, 'lxml')
table = soup.find("table")
browser.quit()

# create list headers, then populate with th tagged cells
headers = []

for i in table.find_all('th'):
    title = i.text.strip()
    # Or alternatively:
    #title = i.get_text(strip=True)
    headers.append(title)

print(headers)

This prints:

['Symbol', 'Name', '% Holding', 'Shares', 'Links']
