【Question title】: Downloading files by crawling sub-URLs in Python
【Posted】: 2021-06-11 00:13:40
【Question description】:

I'm trying to download documents (mostly PDFs) from a large number of web links like these:

https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects

https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects

https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects

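However, the PDF files can't be reached directly from these links; you have to click through to a sub-URL to get to each PDF. Is there a way to crawl the sub-URLs and download all the related files from them? I'm trying the code below, but so far it hasn't worked for the URLs listed here.

Let me know if you need any further clarification; I'm happy to provide it. Thanks.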

from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'download_pdf'
    allowed_domains = ["www.worldbank.org"]
    start_urls = [
        "https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects"
    ]  # Entry page

    def afterResponse(self, response, url, error=None, extra=None):
        if not extra:
            print ("The version of library simplified_scrapy is too old, please update.")
            SimplifiedMain.setRunFlag(False)
            return
        try:
            path = './pdfs'
            # create folder start
            srcUrl = extra.get('srcUrl')
            if srcUrl:
                index = srcUrl.find('year/')
                year = ''
                if index > 0:
                    year = srcUrl[index + 5:]
                    index = year.find('?')
                    if index>0:
                        path = path + year[:index]
                        utils.createDir(path)
            # create folder end

            path = path + url[url.rindex('/'):]
            index = path.find('?')
            if index > 0: path = path[:index]
            flag = utils.saveResponseAsFile(response, path, fileType="pdf")
            if flag:
                return None
            else:  # If it's not a pdf, leave it to the frame
                return Spider.afterResponse(self, response, url, error, extra)
        except Exception as err:
            print(err)

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lst = doc.selects('div.list >a').contains("documents/", attr="href")
        if not lst:
            lst = doc.selects('div.hidden-md hidden-lg >a')
        urls = []
        for a in lst:
            a["url"] = utils.absoluteUrl(url.url, a["href"])
            # Set root url start
            a["srcUrl"] = url.get('srcUrl')
            if not a['srcUrl']:
                a["srcUrl"] = url.url
            # Set root url end
            urls.append(a)

        return {"Urls": urls}

    # Download again by resetting the URL. Called when you want to download again.
    def resetUrl(self):
        Spider.clearUrl(self)
        Spider.resetUrlsTest(self)

SimplifiedMain.startThread(MySpider())  # Start download

【Question discussion】:

    Tags: python-3.x web-scraping python-requests scrapy web-crawler


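    【Solution 1】:

    There's an API endpoint that returns the entire response you see on the website, along with... the URLs of the document PDFs. :D

    So you can query the API, get the URLs, and finally fetch the documents.

    Here's how: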

    import requests
    
    pids = ["P167897", "P173997", "P166309"]
    
    for pid in pids:
        # Query the document-search API for this project ID.
        end_point = f"https://search.worldbank.org/api/v2/wds?" \
                    f"format=json&includepublicdocs=1&" \
                    f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                    f"os=0&rows=20&proid={pid}&apilang=en"
        documents = requests.get(end_point).json()["documents"]
        for document_data in documents.values():
            try:
                pdf_url = document_data["pdfurl"]
                print(f"Fetching: {pdf_url}")
                # Name the local file after the last path segment of the URL.
                with open(pdf_url.rsplit("/", 1)[-1], "wb") as pdf:
                    pdf.write(requests.get(pdf_url).content)
            except KeyError:
                # Entries without a "pdfurl" key have nothing to download.
                continue
    
    

    Output (along with the fully downloaded .pdf files):

    Fetching: http://documents.worldbank.org/curated/en/106981614570591392/pdf/Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.pdf
    Fetching: http://documents.worldbank.org/curated/en/331341614570579132/pdf/Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.pdf
    Fetching: http://documents.worldbank.org/curated/en/387211614570564353/pdf/Official-Documents-Amendment-to-the-Financing-Agreement-for-Grant-D6810-SL.pdf
    Fetching: http://documents.worldbank.org/curated/en/799541612993594209/pdf/Sierra-Leone-AFRICA-WEST-P167897-Sierra-Leone-Free-Education-Project-Procurement-Plan.pdf
    Fetching: http://documents.worldbank.org/curated/en/310641612199201329/pdf/Disclosable-Version-of-the-ISR-Sierra-Leone-Free-Education-Project-P167897-Sequence-No-02.pdf
    
    and more ...
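
The endpoint string in the answer can also be assembled from a dict of parameters, which makes it easier to vary the project ID or request further pages. This is only a sketch: `build_endpoint` is a hypothetical helper, and reading `os` as the result offset and `rows` as the page size is an assumption that the answer does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.worldbank.org/api/v2/wds"


def build_endpoint(pid, offset=0, rows=20):
    """Build the query URL for one project ID (hypothetical helper)."""
    params = {
        "format": "json",
        "includepublicdocs": 1,
        "fl": "docna,lang,docty,repnb,docdt,doc_authr,available_in",
        "os": offset,  # assumed: offset of the first result
        "rows": rows,  # assumed: number of results per request
        "proid": pid,
        "apilang": "en",
    }
    # safe="," keeps the commas in the field list unescaped, as in the answer.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"
```

With the defaults, `build_endpoint(pid)` yields the same URL as the f-string in the answer's loop.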
    

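    【Discussion】:

    • Thanks @baduker. I tried their API before, but the problem is that the download gets interrupted when a large number of documents is requested. Is there a way to resume from where the interruption happened, or does everything have to be downloaded again from scratch?
    • Sure there is. It's called retry logic, but you have to implement it yourself. Hint: use a while loop and a try/except block, and pause between requests.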
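
The hint in the last comment could be sketched roughly as follows. This is not the commenter's code: `download_with_retry` and `fetch_bytes` are hypothetical names, the attempt count and pause length are arbitrary, and `fetch_bytes` uses the standard library's `urllib.request` as a stand-in for `requests.get(url).content`. Files that already exist are skipped, so a rerun after an interruption continues with the missing ones.

```python
import os
import time
import urllib.request


def fetch_bytes(url, timeout=30):
    """Return the response body (stdlib stand-in for requests.get(url).content)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_retry(url, path, fetch=fetch_bytes, max_attempts=5, pause=2.0):
    """Save url to path; return True on success, False once attempts run out."""
    if os.path.exists(path):
        return True  # downloaded in an earlier run, so skip it
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        try:
            data = fetch(url)
            # Write under a temporary name first so an interrupted write
            # never leaves a partial file that looks complete.
            tmp_path = path + ".part"
            with open(tmp_path, "wb") as pdf:
                pdf.write(data)
            os.replace(tmp_path, path)
            return True
        except OSError as err:  # urllib's URLError and timeouts are OSError subclasses
            print(f"Attempt {attempt} failed for {url}: {err}")
            if attempt < max_attempts:
                time.sleep(pause)  # pause before the next attempt
    return False
```

In the answer's loop, the `open(...)` / `pdf.write(...)` pair would then become a single call such as `download_with_retry(pdf_url, pdf_url.rsplit("/", 1)[-1])`.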