【Question title】: Download all the links (related documents) on a webpage using Python
【Posted】: 2011-05-12 07:17:06
【Question description】:

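I have to download a lot of documents from a webpage. They are wmv files, PDFs, BMPs, etc. Of course, they all have links pointing to them. So each time, I have to right-click a file, select "Save Link As", and save it, and repeat that for every file of every type. Is it possible to do this in Python? I searched the SO database, and people have answered the question of how to get the links from a webpage. I want to download the actual files. Thanks in advance. (This is not a homework question :)).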

【Comments】:

    Tags: python


    【Solution 1】:
    • Follow the Python code in this link: wget-vs-urlretrieve-of-python
    • You can also do this easily with Wget. Try the --level, --recursive and --accept command-line options in Wget. For example: wget --accept wmv,doc --level 2 --recursive http://www.example.com/files/
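
A rough standard-library counterpart to the Wget command above might look like the sketch below: it collects the links whose path ends in a wanted extension and saves each one with urllib.request.urlretrieve. The names LinkCollector, wanted_links and download_all are hypothetical, and this is an illustration under those assumptions, not the code from the linked answer.

```python
import os
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def wanted_links(html, base_url, extensions):
    """Return absolute URLs whose path ends with one of the extensions."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).path.lower().endswith(tuple(extensions)):
            urls.append(url)
    return urls


def download_all(page_url, extensions, delay=1.0):
    """Fetch page_url and save every matching link into the current directory."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    for url in wanted_links(html, page_url, extensions):
        filename = os.path.basename(urlparse(url).path)
        urlretrieve(url, filename)
        print(filename, "has been downloaded")
        time.sleep(delay)  # throttle so the site is not hammered
```

It could be called as download_all("http://www.example.com/files/", [".wmv", ".doc"]), where the URL is only the placeholder from the Wget example. Unlike wget --recursive, this sketch looks at a single page and does not follow links to further pages.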

    【Comments】:

      【Solution 2】:

      Here is an example of how to download some selected files from http://pypi.python.org/pypi/xlwt

      You need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html

      import mechanize
      from time import sleep

      # Make a Browser (think of this as Chrome or Firefox etc.)
      br = mechanize.Browser()

      # Visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
      # for more ways to set up your br browser object, e.g. so it looks like Mozilla,
      # and if you need to fill out forms with passwords.

      # Open your site
      br.open('http://pypi.python.org/pypi/xlwt')

      # Save the page source; can be helpful for debugging
      with open("source.html", "wb") as f:
          f.write(br.response().read())

      # You will need to do some kind of pattern matching on your files
      filetypes = [".zip", ".exe", ".tar.gz"]
      myfiles = []
      for l in br.links():  # you can also iterate through br.forms() to print forms on the page!
          # Check if this link has a file extension we want
          # (you may choose to use regular expressions or something)
          if any(t in str(l) for t in filetypes):
              myfiles.append(l)


      def downloadlink(l):
          # click_link only builds a Request; br.open() actually fetches it
          br.open(br.click_link(l))
          # Binary mode keeps archives intact. Perhaps you should open in a better
          # way and ensure that the file doesn't already exist.
          with open(l.text, "wb") as f:
              f.write(br.response().read())
          print(l.text, "has been downloaded")
          # br.back()


      for l in myfiles:
          sleep(1)  # throttle so you don't hammer the site
          downloadlink(l)

      Note: in some cases you may want to use br.follow_link(l) instead of br.click_link(l). The difference is that click_link returns a Request object, whereas follow_link opens the link directly. See: Mechanize difference between br.click_link() and br.follow_link()
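
The comments in the code above suggest matching file types with a regular expression and making sure an existing file is not overwritten. A minimal sketch of both ideas follows, independent of mechanize; has_wanted_extension and unique_path are hypothetical helper names, and the extension list simply mirrors the filetypes list in the answer.

```python
import os
import re
from urllib.parse import urlparse

# Match a URL path ending in one of the wanted extensions (case-insensitive).
WANTED = re.compile(r"\.(zip|exe|tar\.gz)$", re.IGNORECASE)


def has_wanted_extension(url):
    """True if the URL's path (ignoring any query string) ends in a wanted extension."""
    return WANTED.search(urlparse(url).path) is not None


def unique_path(url, directory="."):
    """Derive a local file name from the URL; append a counter if the name is taken."""
    name = os.path.basename(urlparse(url).path) or "download"
    candidate = os.path.join(directory, name)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, "{}.{}".format(name, counter))
        counter += 1
    return candidate
```

With mechanize, one might filter on has_wanted_extension(l.url) and write to unique_path(l.url) rather than to l.text, since the link text is not guaranteed to be a valid or unique file name.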

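      【Comments】:

      • robert kink, I ran your code to download only the zip files. The code ran without errors, but I can't see the files in the Chrome downloads folder.
      • Hmm, I think the files will be downloaded into the folder you ran the Python script from. See stackoverflow.com/questions/5137497/…
      • One could also consider Pyppeteer? pypi.org/project/pyppeteer
      • @newGIS I had the same problem. Replacing br.click_link(l) with the following statement worked for me: br.retrieve(str(l.url), f'{l.text}.mp3')[0]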