【Question title】: How to get href from HTML class?
【Posted】: 2021-10-07 22:38:48
【Question】:

I want to get the href values from a = soup.find_all('div', class_='email-messages'), which returns:

[<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, 'guidetuanhp@gmail.com')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>]

My code:

soup = BeautifulSoup(html_doc, 'lxml')
a = soup.find_all('div', class_='email-messages')
for link in a:
    print(link['href'])

I get this error:

in __getitem__
    return self.attrs[key]
KeyError: 'href'

【Comments】:

    Tags: python html web web-scraping beautifulsoup


    【Answer 1】:

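    For "single-purpose" scraping, customizing the parser with SoupStrainer is very useful. It is faster (or should be!), because only the required part of the document is parsed rather than the whole page. For details, see the "Parsing only part of a document" section of the Beautiful Soup documentation.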

    A SoupStrainer instance must always be passed to the BeautifulSoup constructor as a keyword argument with the key parse_only:

    from bs4 import BeautifulSoup, SoupStrainer
    
    html_doc = "..."  # placeholder: use the HTML shown in the question
    
    soup = BeautifulSoup(html_doc, 'lxml', parse_only=SoupStrainer('a', href=True))
    for tag in soup:
        print(tag['href'])
    

    Output

    /en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
    /en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete
    

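    Keep in mind

    1. The soup is "strained": you work with a soup object rather than a list, so the loop variable is a bs4.element.Tag object!
    2. SoupStrainer takes the same arguments as the find_all method.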
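
To illustrate note 2, here is a minimal sketch, assuming the strainer is given the same name and attrs filters one would pass to find_all. The html_doc below is a shortened, hypothetical stand-in for the HTML in the question (the footer div is invented for the demonstration), and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened stand-in for the question's HTML (hypothetical sample data).
html_doc = """
<div class="email-messages">
  <table>
    <tr>
      <td><a href="/en/msg/1">Subject</a></td>
      <td><a href="/en/msg/1/delete">[Delete]</a></td>
    </tr>
  </table>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

# Same filter one would give to find_all("div", attrs={"class": "email-messages"}).
only_messages = SoupStrainer("div", attrs={"class": "email-messages"})
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_messages)

# Only the matching <div> and its contents were parsed, so the footer link is absent.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

Straining on the div rather than on the a tags keeps links from other parts of the page out of the result, which is closer to what the question asks for.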

    【Comments】:

      【Answer 2】:

      You are trying to get "href" from the <div> tag, which has no such attribute. Instead, find all <a> tags inside the <div>s:

      from bs4 import BeautifulSoup
      
      html_doc = """<div class="email-messages">
      <table>
      <tr>
      <td id="email-title">Message Title</td>
      <td id="email-sender">Sender</td>
      <td id="email-control">Control </td>
      </tr>
      <tr>
      <td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
      <td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, 'guidetuanhp@gmail.com')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
      <td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
      </tr>
      <tr>
      <td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
      </tr>
      </table>
      </div>"""
      
      soup = BeautifulSoup(html_doc, "html.parser")
      
      
      divs = soup.find_all("div", class_="email-messages")
      for div in divs:
          for link in div.find_all("a"):
              print(link["href"])
      

      Prints:

      /en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
      /en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete
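
The KeyError in the question comes from indexing a tag with an attribute it does not have, and the same can happen with an <a> tag that lacks href. The following is a minimal sketch of two ways to guard against that. The sample HTML is hypothetical (the second anchor is invented to trigger the case), and which option fits better depends on whether the link-less anchors matter to you.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second anchor has no href attribute at all.
html_doc = """
<div class="email-messages">
  <a href="/en/msg/1">Subject</a>
  <a name="top">No link here</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Option 1: href=True makes find_all skip anchors without an href attribute.
with_href = [a["href"] for a in soup.find_all("a", href=True)]

# Option 2: Tag.get returns None for a missing attribute instead of raising KeyError.
maybe_href = [a.get("href") for a in soup.find_all("a")]

print(with_href)
print(maybe_href)
```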
      

      【Comments】:
