Web Scraping w/ BeautifulSoup4 - How to filter a tag that contains a specific string? (Answer)

【Question title】: Web Scraping w/ BeautifulSoup4 - How to filter a tag that contains a specific string?
【Posted】: 2020-12-26 01:23:04
【Question description】:

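How can I filter the following HTML so that the span tags whose id contains "Codigo" are appended to list A, the span tags whose id contains "Acao" to list B, and so on?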

Expected output:

List A: ['ABEV3', 'AZUL4']
List B: ['AMBEV S/A', 'AZUL']
List C: ['ON', 'PN']
List D: [4355174839, 326903173]
List E: [2.948, 0.432]

[...]
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
[...]

【Question comments】:

Tags: python beautifulsoup request

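【Solution 1】:

To get the individual lists, you can use the CSS attribute selector [id$="..."], which matches tags whose id attribute ends with the specified string. For example: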

from bs4 import BeautifulSoup


html_data = '''
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
'''

soup = BeautifulSoup(html_data, 'html.parser')

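# [id$="..."] matches every element whose id ends with the given suffix.
list_a = [t.text for t in soup.select('[id$="_lblCodigo"]')]
list_b = [t.text for t in soup.select('[id$="_lblAcao"]')]
list_c = [t.text for t in soup.select('[id$="_lblTipo"]')]
# Quantities use "." as a thousands separator; strip it before converting to int.
list_d = [int(t.text.replace('.', '')) for t in soup.select('[id$="_lblQtdeTeorica_Formatada"]')]
# Percentages use "," as the decimal separator; swap it for "." before converting to float.
list_e = [float(t.text.replace(',', '.')) for t in soup.select('[id$="_lblPart_Formatada"]')]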

print(list_a)
print(list_b)
print(list_c)
print(list_d)
print(list_e)

Prints:

['ABEV3', 'AZUL4']
['AMBEV S/A', 'AZUL']
['ON', 'PN      N2']
[4355174839, 326903173]
[2.948, 0.432]
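
Two details differ from what the question asked for: the question speaks of ids that contain a string, and it expects List C to be ['ON', 'PN'] rather than ['ON', 'PN      N2']. The sketch below is one possible way to cover both, not part of the original answer. It uses the substring selector [id*="..."] instead of the suffix selector, and a hypothetical helper first_token that keeps only the first whitespace-separated word. It assumes the share type is always the first word of the Tipo text, which holds for the two sample rows but is not confirmed for the full page.

```python
from bs4 import BeautifulSoup

# The "Codigo" and "Tipo" spans from the question's HTML.
html_data = '''
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>
'''

soup = BeautifulSoup(html_data, 'html.parser')


def first_token(text):
    """Hypothetical helper: return the first whitespace-separated word, or '' if none."""
    parts = text.split()
    return parts[0] if parts else ''


# [id*="..."] matches elements whose id contains the substring anywhere.
list_a = [t.get_text() for t in soup.select('span[id*="Codigo"]')]

# Assumption: the share type is the first word ("PN" in "PN      N2").
list_c = [first_token(t.get_text()) for t in soup.select('span[id*="Tipo"]')]

print(list_a)  # ['ABEV3', 'AZUL4']
print(list_c)  # ['ON', 'PN']
```

The substring selector is looser than the suffix selector: it would also match any other id on the page that happens to contain "Codigo" or "Tipo", so the suffix form in the answer is the safer choice when the id endings are known.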

【Comments】: