How do I get hrefs from hrefs? Answers

【Question title】: How do I get hrefs from hrefs?
【Posted】: 2018-11-23 09:54:49
【Question】:

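How can I use Python, in a class-and-method structure, to get hrefs from hrefs, that is, follow each link on a page and collect the links on the page it leads to? I tried: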

root_url = 'https://www.iea.org'

class IEAData:
       def __init__(self):
             try:--
             except:


       def get_links(self, url):
            all_links = []
            page = requests.get(root_url)
            soup = BeautifulSoup(page.text, 'html.parser')
            for href in soup.find_all(class_='omrlist'):
               all_links.append(root_url + href.find('a').get('href'))
            return all_links
            #print(all_links)

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []

for url in yearLinks:
    links =iea_obj.get_links(yearLinks)
    print(links)

Expected: the links variable should hold the hrefs for every month, but I am not getting them, so please tell me what I should do.

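【Comments】:

What is the problem here? Are you getting an error? If so, which one? What I can see right away is that in the last loop you call iea_obj.get_links(yearLinks), where yearLinks is a list, but the function expects its argument to be a string. I think you meant links = iea_obj.get_links(url).
I need to parse all the links that sit behind other hrefs: if you click a year's href, you get the month hrefs, and I need those. It has to be done in Python using a class and methods.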

Tags: python web-scraping beautifulsoup

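【Solution 1】:

Your code has two problems. First, your get_links() function ignores the url passed to it and always requests root_url. Second, when looping over the returned links, you pass the whole yearLinks list instead of the loop variable url.

The following should help you: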

from bs4 import BeautifulSoup                        
import requests

root_url = 'https://www.iea.org'

class IEAData:
    def get_links(self, url):
        """Return the absolute link found in each 'omrlist' element on url."""
        all_links = []
        page = requests.get(url)  # fetch the page passed in, not root_url
        soup = BeautifulSoup(page.text, 'html.parser')

        for li in soup.find_all(class_='omrlist'):
            all_links.append(root_url + li.find('a').get('href'))
        return all_links

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

for url in yearLinks:
    links = iea_obj.get_links(url)
    print(url, links)

This gives you output that begins:

https://www.iea.org/oilmarketreport/reports/2018/ ['https://www.iea.org/oilmarketreport/reports/2018/0118/', 'https://www.iea.org/oilmarketreport/reports/2018/0218/', 'https://www.iea.org/oilmarketreport/reports/2018/0318/', 'https://www.iea.org/oilmarketreport/reports/2018/0418/', 'https://www.iea.org/oilmarketreport/reports/2018/0518/', 'https://www.iea.org/oilmarketreport/reports/2018/0618/', 'https://www.iea.org/oilmarketreport/reports/2018/0718/', 'https://www.iea.org/oilmarketreport/reports/2018/0818/', 'https://www.iea.org/oilmarketreport/reports/2018/1018/']
https://www.iea.org/oilmarketreport/reports/2017/ ['https://www.iea.org/oilmarketreport/reports/2017/0117/', 'https://www.iea.org/oilmarketreport/reports/2017/0217/', 'https://www.iea.org/oilmarketreport/reports/2017/0317/', 'https://www.iea.org/oilmarketreport/reports/2017/0417/', 'https://www.iea.org/oilmarketreport/reports/2017/0517/', 'https://www.iea.org/oilmarketreport/reports/2017/0617/', 'https://www.iea.org/oilmarketreport/reports/2017/0717/', 'https://www.iea.org/oilmarketreport/reports/2017/0817/', 'https://www.iea.org/oilmarketreport/reports/2017/0917/', 'https://www.iea.org/oilmarketreport/reports/2017/1017/', 'https://www.iea.org/oilmarketreport/reports/2017/1117/', 'https://www.iea.org/oilmarketreport/reports/2017/1217/']
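The link extraction in get_links() can also be separated from the network request, which makes it easy to try on a small piece of HTML. The following is only a minimal sketch under stated assumptions: the function name extract_links and the sample markup are made up for illustration, and urljoin from the standard library is swapped in for the string concatenation used above, so that hrefs which are already absolute are left alone. It also skips any 'omrlist' element that has no <a> tag, a case in which li.find('a').get('href') would raise an AttributeError.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, css_class="omrlist"):
    """Return absolute URLs of the first <a> inside each element with css_class."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all(class_=css_class):
        anchor = item.find("a")
        # Skip items that have no <a> tag or no href attribute.
        if anchor is None or not anchor.get("href"):
            continue
        # urljoin resolves relative hrefs and leaves absolute ones untouched.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Made-up sample markup, only meant to mimic the structure the answer assumes.
SAMPLE_HTML = """
<ul>
  <li class="omrlist"><a href="/oilmarketreport/reports/2018/">2018</a></li>
  <li class="omrlist"><a href="https://www.iea.org/oilmarketreport/reports/2017/">2017</a></li>
  <li class="omrlist">no link here</li>
</ul>
"""

year_links = extract_links(SAMPLE_HTML, "https://www.iea.org")
print(year_links)
```

In a real run, the html argument would be the page.text returned by requests.get(url), as in the answer's code.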

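【Comments】:

【Solution 2】:

I am fairly new to programming; I am still learning and trying to understand how classes and everything else work together. But I gave it a try (that's how we learn, right?).

I am not sure whether this is the output you are looking for. I changed two things and was able to collect all the links reached from yearLinks into a single list. Note that it will include the PDF links as well as the month links that I think you want. If you only want the months, filter out the entries that contain .pdf.

Here is the code I used; perhaps you can adapt it to your structure.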

import bs4
import requests

root_url = 'https://www.iea.org'


class IEAData:

    def get_links(self, url):

       all_links = []
       page = requests.get(url)
       soup = bs4.BeautifulSoup(page.text, 'html.parser')
       for href in soup.find_all(class_='omrlist'):
           all_links.append(root_url + href.find('a').get('href'))
       return all_links
       #print(all_links)


iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []

for url in yearLinks:
    links = iea_obj.get_links(url)

    # uncomment line below if you do not want the .pdf links
    #links = [ x for x in links if ".pdf" not in x ]
    reportLinks += links
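The optional PDF filter above tests whether ".pdf" appears anywhere in the URL string. A slightly stricter variant, sketched here as an assumption rather than as part of the original answer, checks only the path component of the URL, case-insensitively. The helper names is_pdf_link and collect_report_links and the sample URLs are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def is_pdf_link(url):
    """Return True when the URL's path ends in .pdf (case-insensitive)."""
    return urlparse(url).path.lower().endswith(".pdf")


def collect_report_links(links_per_year, include_pdfs=True):
    """Flatten per-year link lists into one list, optionally dropping PDFs."""
    report_links = []
    for links in links_per_year:
        if not include_pdfs:
            links = [link for link in links if not is_pdf_link(link)]
        report_links.extend(links)
    return report_links


# Hypothetical sample data; the PDF URL is invented for illustration.
sample = [
    ["https://www.iea.org/oilmarketreport/reports/2018/0118/",
     "https://example.com/report.PDF?download=1"],
    ["https://www.iea.org/oilmarketreport/reports/2017/0117/"],
]

months_only = collect_report_links(sample, include_pdfs=False)
print(months_only)
```

Checking the parsed path means a query string after the file name does not hide the extension, and a URL that merely contains ".pdf" somewhere in its host or query is not dropped by mistake.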

【Comments】: