【Question title】: With Beautiful Soup get specific string in <div> tag
【Posted】: 2021-10-15 04:23:48
【Question】:

I have a list of tags that I extracted with:

soup.findAll('div', {'class': 'formelement'})

The output is:

[<div class="formelement">
 <label class="libelle" for="field_tit">Etat :</label>
                    Publié              </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Type de produit :</label>
                    Plaque de plâtre                </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Numéro :</label>
                                            PP/48-05                                    </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Titulaire :</label>
                    CIA ESPAÑOLA DE AISLAMIENTOS SA             </div>,
 <div class="formelement">
 <label class="libelle" for="field_ref">Usine :</label>
                    39              </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Date d'admission :</label>
                        13/07/2017                      </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Date de reconduction :</label>
                        04/02/2021                      </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Date de fin de validité :</label>
                        04/05/2022                      </div>,
 <div class="formelement">
 <label class="libelle" for="field_tit">Certificat PDF :</label>
 <a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
 <img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
 </a>
 </div>]

My goal is to build a dict:

product_data = {
"Numéro": "PP/48-05",
"Titulaire": "CIA ESPAÑOLA DE AISLAMIENTOS SA",
"Usine": "39",
"Date de fin de validité": "04/05/2022",
"Certificat PDF": "application/docs/certificats/PP_48_05.pdf"
}

I tried:

for div in soup.findAll('div', {'class': 'formelement'}):
        product_data[div.text] = div.next_sibling

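But div.text returns all the strings inside the tag (obviously), and I can't find a way to get the two strings inside the div separately. How can I get these strings individually?

I hope my question is clear enough.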
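
To make the problem concrete, here is a minimal sketch (not part of the original attempt) run on two of the divs above. It shows that `div.text` merges the label and the value, that `div.next_sibling` is the node after the div rather than anything inside it, and that the two wanted strings are the label's text and the bare text node following the label. The variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<div class="formelement">
 <label class="libelle" for="field_tit">Numéro :</label>
                    PP/48-05            </div>
<div class="formelement">
 <label class="libelle" for="field_ref">Usine :</label>
                    39              </div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "formelement"})

# div.text concatenates every string inside the div, label included.
combined = div.text

# div.next_sibling is the node *after* the div (here, the whitespace
# between the two divs), not the value inside it.
after_div = div.next_sibling

# The two strings the question is after: the label's own text, and the
# bare text node that follows the label inside the div.
label = div.find("label")
label_text = label.get_text(strip=True)
value_text = label.next_sibling.strip()

print(repr(combined))
print(repr(after_div))
print(label_text, "->", value_text)
```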

【Comments】:

    Tags: python dictionary web-scraping beautifulsoup


    【Solution 1】:

    You can decompose (destroy) the inner label tag, so that only the value is left in the div:

    from bs4 import BeautifulSoup
    
    html="""
    <div class="formelement">
     <label class="libelle" for="field_tit">Etat :</label>
                        Publié              </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Type de produit :</label>
                        Plaque de plâtre                </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Numéro :</label>
                                                PP/48-05                                    </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Titulaire :</label>
                        CIA ESPAÑOLA DE AISLAMIENTOS SA             </div>,
     <div class="formelement">
     <label class="libelle" for="field_ref">Usine :</label>
                        39              </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Date d'admission :</label>
                            13/07/2017                      </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Date de reconduction :</label>
                            04/02/2021                      </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Date de fin de validité :</label>
                            04/05/2022                      </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Certificat PDF :</label>
     <a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
     <img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
     </a>
     </div>"""
    
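    soup = BeautifulSoup(html, 'html.parser')
    data = {}
    for div in soup.findAll('div', {'class': 'formelement'}):
        label = div.find('label')
        key = label.text[:-2]  # drop the trailing " :"
        label.decompose()  # remove the label so only the value remains in the div
        try:
            # the "Certificat PDF" entry holds a link rather than plain text
            value = div.find('a').get('href')
        except AttributeError:
            # no <a> tag: div.find('a') is None, so fall back to the remaining text
            value = div.text.strip()
        data[key] = value
    print(data)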
    

    Output:

    {'Etat': 'Publié', 'Type de produit': 'Plaque de plâtre',
     'Numéro': 'PP/48-05', 'Titulaire': 'CIA ESPAÑOLA DE AISLAMIENTOS SA', 
     'Usine': '39', "Date d'admission": '13/07/2017', 
     'Date de reconduction': '04/02/2021', 'Date de fin de validité': '04/05/2022', 
     'Certificat PDF': 'application/docs/certificats/PP_48_05.pdf'}
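
If modifying the parsed tree with decompose() is not desirable (for example because the same soup is reused later), a non-destructive variant is possible. The following is only a sketch under that assumption: the helper name `parse_form_elements` and the `sample` string are made up for illustration, and it collects the div's direct text nodes instead of deleting the label.

```python
from bs4 import BeautifulSoup, NavigableString


def parse_form_elements(html):
    """Map each label (without its trailing " :") to the div's value."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    for div in soup.find_all("div", class_="formelement"):
        label = div.find("label")
        if label is None:
            continue  # nothing to use as a key
        key = label.get_text(strip=True).rstrip(" :")
        link = div.find("a", href=True)
        if link is not None:
            # Entries such as "Certificat PDF" carry a link, not text.
            value = link["href"]
        else:
            # Join only the div's direct text nodes; the label's text
            # lives one level deeper and is therefore skipped.
            value = "".join(
                child for child in div.children
                if isinstance(child, NavigableString)
            ).strip()
        data[key] = value
    return data


# Hypothetical sample reusing two of the divs from the question.
sample = """
<div class="formelement">
 <label class="libelle" for="field_tit">Numéro :</label>
                    PP/48-05            </div>
<div class="formelement">
 <label class="libelle" for="field_tit">Certificat PDF :</label>
 <a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
 <img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
 </a>
 </div>
"""

result = parse_form_elements(sample)
print(result)
```

Since nothing is removed, the label tags are still present in the soup after the loop.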
    

    【Comments】:

    • Thanks! This is exactly what I was looking for; I didn't know about the decompose() function.
    【Solution 2】:

    Try:

    from bs4 import BeautifulSoup
    
    html_doc = """
    <div class="formelement">
     <label class="libelle" for="field_tit">Etat :</label>
                        Publié              </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Type de produit :</label>
                        Plaque de plâtre                </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Numéro :</label>
                                                PP/48-05                                    </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Titulaire :</label>
                        CIA ESPAÑOLA DE AISLAMIENTOS SA             </div>,
     <div class="formelement">
     <label class="libelle" for="field_ref">Usine :</label>
                        39              </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Date d'admission :</label>
                            13/07/2017                      </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Date de reconduction :</label>
                            04/02/2021                      </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Date de fin de validité :</label>
                            04/05/2022                      </div>,
     <div class="formelement">
     <label class="libelle" for="field_tit">Certificat PDF :</label>
     <a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
     <img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
     </a>
     </div>
     """
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    allowed_keys = {
        "Numéro",
        "Titulaire",
        "Usine",
        "Date de fin de validité",
        "Certificat PDF",
    }
    
    data = []
    for f in soup.select(".formelement"):
        key_value = f.get_text(strip=True, separator="|").split("|")
        if len(key_value) == 1:
            a = f.find("a")
            if a:
                key_value = [key_value[0], a["href"]]
            else:
                continue
        key_value[0] = key_value[0].strip(" :")
        if key_value[0] not in allowed_keys:
            continue
        data.append(key_value)
    
    
    out = dict(data)
    print(out)
    

    Prints:

    {
        "Numéro": "PP/48-05",
        "Titulaire": "CIA ESPAÑOLA DE AISLAMIENTOS SA",
        "Usine": "39",
        "Date de fin de validité": "04/05/2022",
        "Certificat PDF": "application/docs/certificats/PP_48_05.pdf",
    }
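
One caveat with splitting on "|": it assumes no label or value contains that character. As a hedged alternative sketch, `stripped_strings` yields the same whitespace-stripped pieces without needing a separator. The "ACME | EXAMPLE SA" value below is invented purely to exercise that case and does not come from the question's data.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the first value deliberately contains a "|".
html_doc = """
<div class="formelement">
 <label class="libelle" for="field_tit">Titulaire :</label>
                    ACME | EXAMPLE SA             </div>
<div class="formelement">
 <label class="libelle" for="field_tit">Certificat PDF :</label>
 <a href="application/docs/certificats/PP_48_05.pdf" target="_blank">
 <img src="public/images/pdf.gif" title="Télécharger le certificat au format PDF"/>
 </a>
 </div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

out = {}
for f in soup.select(".formelement"):
    # stripped_strings skips whitespace-only strings and strips the rest.
    parts = list(f.stripped_strings)
    if not parts:
        continue
    key = parts[0].strip(" :")
    if len(parts) > 1:
        # Everything after the label text is treated as the value.
        value = " ".join(parts[1:])
    else:
        # Only the label has text: fall back to a link, if there is one.
        a = f.find("a", href=True)
        if a is None:
            continue
        value = a["href"]
    out[key] = value

print(out)
```

The `allowed_keys` filter from the answer can be applied to `key` in the same loop if only some fields are wanted.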
    

    【Comments】:
