【Question title】: Web Scraping w/ BeautifulSoup4 - How to filter a tag that contains a specific string?
【Posted】: 2020-12-26 01:23:04
【Question description】:

How can I filter the HTML below so that the span tags containing "Codigo" are appended to list A, the span tags containing "Acao" to list B, and so on?

Expected output:

List A: ['ABEV3', 'AZUL4']
List B: ['AMBEV S/A', 'AZUL']
List C: ['ON', 'PN']
List D: [4355174839, 326903173]
List E: [2.948, 0.432]
[...]
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
[...]

【Question comments】:

    Tags: python beautifulsoup request


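    【Solution 1】:

    To build the lists, you can use the CSS attribute selector [id$="..."], which matches tags whose id attribute ends with the specified string. For example: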

    from bs4 import BeautifulSoup
    
    
    html_data = '''
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
    <span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
    '''
    
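    # Parse the snippet with Python's built-in HTML parser
    soup = BeautifulSoup(html_data, 'html.parser')
    
    # [id$="..."] selects every tag whose id ends with the given suffix
    list_a = [t.text for t in soup.select('[id$="_lblCodigo"]')]
    list_b = [t.text for t in soup.select('[id$="_lblAcao"]')]
    list_c = [t.text for t in soup.select('[id$="_lblTipo"]')]
    # Drop the "." thousands separators before converting to int
    list_d = [int(t.text.replace('.', '')) for t in soup.select('[id$="_lblQtdeTeorica_Formatada"]')]
    # Swap the decimal comma for a dot before converting to float
    list_e = [float(t.text.replace(',', '.')) for t in soup.select('[id$="_lblPart_Formatada"]')]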
    
    print(list_a)
    print(list_b)
    print(list_c)
    print(list_d)
    print(list_e)
    

    Prints:

    ['ABEV3', 'AZUL4']
    ['AMBEV S/A', 'AZUL']
    ['ON', 'PN      N2']
    [4355174839, 326903173]
    [2.948, 0.432]
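
    The question's expected output shows 'PN' for the second Tipo value, while the code above returns the full text 'PN      N2'. The sketch below assumes that only the first whitespace-separated token is wanted, which the question does not state explicitly. It also shows [id*="..."], which matches an id that contains the string anywhere rather than only at the end. The shortened ids and the helper name first_token are hypothetical stand-ins used for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened ids standing in for the long ctl00_... ids
html_data = '''
<span id="grd_ctl04_lblCodigo">ABEV3</span>
<span id="grd_ctl04_lblTipo">ON</span>
<span id="grd_ctl06_lblCodigo">AZUL4</span>
<span id="grd_ctl06_lblTipo">PN      N2</span>
'''

soup = BeautifulSoup(html_data, 'html.parser')


def first_token(text):
    """Return the first whitespace-separated token, or '' for blank text."""
    parts = text.split()
    return parts[0] if parts else ''


# [id*="Codigo"] matches spans whose id contains "Codigo" anywhere
codes = [t.get_text(strip=True) for t in soup.select('span[id*="Codigo"]')]

# Assumption: only the leading token of each Tipo value is wanted
types = [first_token(t.get_text()) for t in soup.select('span[id$="_lblTipo"]')]

print(codes)
print(types)
```

    With the sample above this should print ['ABEV3', 'AZUL4'] and ['ON', 'PN']. The "contains" form is looser than the "ends with" form, so it may also pick up other ids on the real page that happen to include the same string.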
    

    【Comments】:
