【Question title】: How do I get hrefs from hrefs?
【Posted】: 2018-11-23 09:54:49
【Question description】:

How can I use Python, in a class-and-method structure, to get the hrefs found on the pages that other hrefs point to? I tried:

root_url = 'https://www.iea.org'

class IEAData:
       def __init__(self):
             try:--
             except:


       def get_links(self, url):
            all_links = []
            page = requests.get(root_url)
            soup = BeautifulSoup(page.text, 'html.parser')
            for href in soup.find_all(class_='omrlist'):
               all_links.append(root_url + href.find('a').get('href'))
            return all_links
            #print(all_links)

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []

for url in yearLinks:
    links =iea_obj.get_links(yearLinks)
    print(links)

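Expected result: the links variable should contain the hrefs for every month, but I am not getting them. Please tell me how I should do this.

【Comments】:

  • What is the problem here? Do you get an error? If so, which one? What I can see right away is that in the last loop you call iea_obj.get_links(yearLinks), where yearLinks is a list, but the function expects its argument to be a string. I think you meant links = iea_obj.get_links(url)
  • I need to parse all the links that sit behind other hrefs, in Python, using classes and methods. That is, clicking a year href leads to the month hrefs, and I want to collect those.

Tags: python web-scraping beautifulsoup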


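【Solution 1】:

Your code has a couple of problems. Your get_links() function does not use the url passed to it; it always requests root_url. And when looping over the returned links, you pass yearLinks (the whole list) instead of url.

The following should get you going: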

from bs4 import BeautifulSoup
import requests

root_url = 'https://www.iea.org'

class IEAData:
    def get_links(self, url):
        all_links = []
        # Fetch the page that was passed in, not root_url
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')

        # Take the first <a> inside each element with class 'omrlist'
        for li in soup.find_all(class_='omrlist'):
            all_links.append(root_url + li.find('a').get('href'))
        return all_links

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

# Pass each year URL, not the whole yearLinks list
for url in yearLinks:
    links = iea_obj.get_links(url)
    print(url, links)

This gives output starting with:

https://www.iea.org/oilmarketreport/reports/2018/ ['https://www.iea.org/oilmarketreport/reports/2018/0118/', 'https://www.iea.org/oilmarketreport/reports/2018/0218/', 'https://www.iea.org/oilmarketreport/reports/2018/0318/', 'https://www.iea.org/oilmarketreport/reports/2018/0418/', 'https://www.iea.org/oilmarketreport/reports/2018/0518/', 'https://www.iea.org/oilmarketreport/reports/2018/0618/', 'https://www.iea.org/oilmarketreport/reports/2018/0718/', 'https://www.iea.org/oilmarketreport/reports/2018/0818/', 'https://www.iea.org/oilmarketreport/reports/2018/1018/']
https://www.iea.org/oilmarketreport/reports/2017/ ['https://www.iea.org/oilmarketreport/reports/2017/0117/', 'https://www.iea.org/oilmarketreport/reports/2017/0217/', 'https://www.iea.org/oilmarketreport/reports/2017/0317/', 'https://www.iea.org/oilmarketreport/reports/2017/0417/', 'https://www.iea.org/oilmarketreport/reports/2017/0517/', 'https://www.iea.org/oilmarketreport/reports/2017/0617/', 'https://www.iea.org/oilmarketreport/reports/2017/0717/', 'https://www.iea.org/oilmarketreport/reports/2017/0817/', 'https://www.iea.org/oilmarketreport/reports/2017/0917/', 'https://www.iea.org/oilmarketreport/reports/2017/1017/', 'https://www.iea.org/oilmarketreport/reports/2017/1117/', 'https://www.iea.org/oilmarketreport/reports/2017/1217/']
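
The code above builds absolute URLs by concatenating root_url with each href, which works as long as every href starts with '/'. As a minimal sketch, assuming some hrefs might instead be page-relative or already absolute, the standard library's urllib.parse.urljoin can resolve all three forms. The helper name absolute_url and the sample hrefs are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

root_url = 'https://www.iea.org'


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical hrefs, for illustration only.
page = root_url + '/oilmarketreport/reports/'
print(absolute_url(page, '/oilmarketreport/reports/2018/'))  # root-relative
print(absolute_url(page, '2018/'))                           # page-relative
print(absolute_url(page, root_url + '/oilmarketreport/reports/2018/'))  # absolute
```

Inside get_links(), this would mean replacing root_url + li.find('a').get('href') with urljoin(url, li.find('a').get('href')), if the site ever mixes href styles.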


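    【Solution 2】:

    I'm fairly new to programming, still learning and trying to understand how classes and everything else fit together. But I gave it a shot (that's how we learn, right?).

    Not sure if this is the output you are looking for. I changed two things and was able to put all the links from yearLinks into a single list. Note that it will include the PDF links as well as the month links I think you want. If you don't want the PDF links and only want the months, filter the PDFs out.

    Here is the code I used; maybe you can adapt it to your structure.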

    import bs4
    import requests

    root_url = 'https://www.iea.org'


    class IEAData:

        def get_links(self, url):
            all_links = []
            page = requests.get(url)
            soup = bs4.BeautifulSoup(page.text, 'html.parser')
            for href in soup.find_all(class_='omrlist'):
                all_links.append(root_url + href.find('a').get('href'))
            return all_links


    iea_obj = IEAData()
    yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

    reportLinks = []

    for url in yearLinks:
        links = iea_obj.get_links(url)

        # uncomment line below if you do not want the .pdf links
        #links = [ x for x in links if ".pdf" not in x ]
        reportLinks += links
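
The loop above can be wrapped in a function so that the two-level traversal and the optional PDF filter can be tried without hitting the network. This is only a sketch: collect_report_links, its skip_pdf parameter, and the fake_site mapping are hypothetical names introduced here. In real use, get_links would be something like IEAData().get_links.

```python
def collect_report_links(get_links, start_url, skip_pdf=False):
    """Follow each link found on start_url and gather the links on those pages.

    get_links is any callable that maps a URL to a list of URLs.
    """
    report_links = []
    for year_url in get_links(start_url):
        links = get_links(year_url)
        if skip_pdf:
            # Same filter as the commented-out line in the answer's code
            links = [x for x in links if ".pdf" not in x]
        report_links += links
    return report_links


# Stand-in for the real site: a made-up mapping from page URL to its links.
fake_site = {
    'root/': ['root/2018/', 'root/2017/'],
    'root/2018/': ['root/2018/0118/', 'root/2018/0118.pdf'],
    'root/2017/': ['root/2017/0117/'],
}

print(collect_report_links(fake_site.get, 'root/'))
print(collect_report_links(fake_site.get, 'root/', skip_pdf=True))
```

Passing the link-fetching function in as an argument keeps the traversal logic separate from requests and BeautifulSoup, so it can be checked against a small dictionary first.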
    
