【Question title】: Get HTML href link that matches string from a list of strings with Beautiful Soup
【Posted】: 2020-04-24 12:07:34
【Question】:

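I am trying to get URLs from a web page that contains a list of URLs. I don't want all of them, only those whose link text matches a string in my list. The list of strings is a subset of the link texts on the page; I built it by scraping the page and removing the texts I don't want. The list is stored in filenames.

I am trying to extract the links whose text is in the list. The code below returns an empty list: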

    r = requests.get(url)

    soup = BeautifulSoup(r.content, 'html5lib')

    links = soup.findAll('a', string = filenames[0])

    file_links = [link['href'] for link in links if "export" in link['href']]

The tags look like this:

<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                            ECZ Mathematics Paper 2 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                            ECZ Mathematics Paper 1 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                            ECZ Science Paper 3 2009.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                            ECZ Civic Education Paper 2 2009.</a></p>

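I want the href links of the first three but not the last one, because the string 'ECZ Civic Education Paper 2 2009.' is not in my list of strings. The website link is here

My list of strings looks like this: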


filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

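I only want the first three links, because their link text is in my list (filenames). I don't want the fourth link, because the text next to its href (ECZ Civic Education Paper 2 2009) is not in my list and I don't want to download that file.

【Comments】:

  • Can you post a few examples from your list of strings?
  • I have edited the post to include an example of my list.

Tags: python html web-scraping beautifulsoup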


【Solution 1】:

Try it this way and see if it works:

    from bs4 import BeautifulSoup as bs

    html = """
    <p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                                ECZ Mathematics Paper 2 2019.</a></p>
    <p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                                ECZ Mathematics Paper 1 2019.</a></p>
    <p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                                ECZ Science Paper 3 2009.</a></p>
    <p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                                ECZ Civic Education Paper 2 2009.</a></p>
    """
    filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
                 'ECZ Science Paper 3 2009.']

    soup = bs(html, 'html5lib')

    all_links = soup.findAll('a')

    # Print the href of every link whose stripped text equals an entry in filenames
    for link in all_links:
        for nam in filenames:
            if link.text.strip() == nam:
                print(link['href'])

Output:

https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
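
Note that the first URL appears twice in this output because 'ECZ Mathematics Paper 2 2019.' is listed twice in filenames, and the Paper 1 link is absent because that title is not in the list at all. A minimal variant is sketched below, assuming the intended list holds Paper 1 and Paper 2 (as Solution 2 does). It puts the titles in a set and tests each link once, so a repeated title cannot produce a repeated URL. The names wanted and file_links are illustrative and not from the original answer.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
        ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
        ECZ Mathematics Paper 1 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
        ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
        ECZ Civic Education Paper 2 2009.</a></p>
"""

# Assumed intended list: Paper 1 and Paper 2 rather than Paper 2 twice.
wanted = {
    'ECZ Mathematics Paper 1 2019.',
    'ECZ Mathematics Paper 2 2019.',
    'ECZ Science Paper 3 2009.',
}

soup = BeautifulSoup(html, 'html.parser')

# One pass over the <a> tags that have an href; get_text(strip=True)
# removes the surrounding newline and indentation before the set lookup.
file_links = [
    a['href']
    for a in soup.find_all('a', href=True)
    if a.get_text(strip=True) in wanted
]

for href in file_links:
    print(href)
```

The set lookup keeps the work proportional to the number of links rather than links times filenames, which only matters for long lists but also removes the duplicate output.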

【Comments】:

【Solution 2】:

You can build a CSS selector and then select the links in one go. For example (html is the code snippet from the question):

    filenames = ['ECZ Mathematics Paper 1 2019.',
                 'ECZ Mathematics Paper 2 2019.',
                 'ECZ Science Paper 3 2009.']
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'html.parser')
    
    for a in soup.select(','.join('a:contains("{}")'.format(i) for i in filenames)):
        print(a['href'])
    

Prints:

    https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
    https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf
    https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
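
One caveat: in newer releases of Soup Sieve (the selector engine behind select), the :contains pseudo-class is deprecated in favour of :-soup-contains, which to my knowledge is available from Soup Sieve 2.1 onwards. Both do substring matching, so a title that is a prefix of a longer link text would also match. The sketch below is the same idea with the newer spelling; it assumes a Soup Sieve version that supports :-soup-contains and titles that contain no double quotes. The names selector and hrefs are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
        ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
        ECZ Mathematics Paper 1 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
        ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
        ECZ Civic Education Paper 2 2009.</a></p>
"""

filenames = ['ECZ Mathematics Paper 1 2019.',
             'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

soup = BeautifulSoup(html, 'html.parser')

# Build one selector list: a:-soup-contains("A"), a:-soup-contains("B"), ...
# Assumes no title contains a double quote, which would break the quoting.
selector = ', '.join('a:-soup-contains("{}")'.format(name) for name in filenames)

# select() returns matches in document order, not in the order of filenames.
hrefs = [a['href'] for a in soup.select(selector)]

for href in hrefs:
    print(href)
```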
    

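【Comments】:

  • This works well, but I have already accepted an answer. Thank you.

【Solution 3】:

If the request was received successfully, just parse the response with bs and use findAll to find the 'a' link tags. I don't think it is necessary to pass (string = filenames[0]) to findAll.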

    from bs4 import BeautifulSoup as bs
    temp = """<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                                ECZ Mathematics Paper 2 2019.</a></p>
    
    <p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                                ECZ Mathematics Paper 1 2019.</a></p>
    
    <p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                                ECZ Science Paper 3 2009.</a></p>
    
    <p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                                ECZ Civic Education Paper 2 2009.</a></p>"""
    
    soup = bs(temp, 'html5lib')
    links = soup.findAll('a')
    file_links = [link['href'] for link in links if "export" in link['href']]
    

Output:

    ['https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi',
     'https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf',
     'https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp',
     'https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc']
    

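【Comments】:

  • I only want the first three links, because their link text is in my list (filenames). I don't want the fourth link, because the text next to its href (ECZ Civic Education Paper 2 2009.) is not in my list and I don't want to download that file.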
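
A likely reason the question's original call returned an empty list is that string= with a plain string compares against the tag's exact .string, and in this markup each link text starts with a newline and indentation. One way to keep the "export" check and still restrict the result to the wanted titles is to pass a function as string= that strips whitespace before comparing. This is a sketch under that assumption; text_is_wanted and wanted are hypothetical names, and the list is assumed to hold Paper 1 and Paper 2.

```python
import re
from bs4 import BeautifulSoup

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
        ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
        ECZ Mathematics Paper 1 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
        ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
        ECZ Civic Education Paper 2 2009.</a></p>
"""

filenames = ['ECZ Mathematics Paper 1 2019.',
             'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']
wanted = set(filenames)

soup = BeautifulSoup(html, 'html.parser')

# An exact string match finds nothing, because the real .string carries
# the leading newline and indentation from the markup.
exact = soup.find_all('a', string=filenames[0])


def text_is_wanted(s):
    # Called with the tag's .string, which can be None for tags
    # that have no single text child.
    return s is not None and s.strip() in wanted


# Filter on the stripped text and on "export" appearing in the href.
links = soup.find_all('a', string=text_is_wanted, href=re.compile('export'))
file_links = [link['href'] for link in links]

print(exact)
print(file_links)
```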