Download website links having specific elements around: answers

【Question title】: Download website links having specific elements around
【Posted】: 2018-01-06 05:01:18
【Question description】:

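I need to recursively mirror wallpaper images from some websites, but only the links that have specific markup around them, such as: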

<div class="wb_more">
Original Resolution: <a href="//site.com/download/space_planet_sky_94434/4800x2700">4800x2700</a><br>
Views: <a href="/download/last">96661</a>
</div>

but not the others, such as:

<div class="wd_resolution">
<span class="wd_res_cat">Fullscreen</span>
<span class="wd_res_cat_raz"><a class="wb_res_select" href="//site.com/download/space_planet_sky_94434/1600x1200">1600x1200</a>
...
</span>
...
</span>
</div>

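Note that the URLs are identical except for the resolution, and the resolution of the original can vary, so only the surrounding markup differs, for example the text Original Resolution: placed in front of the link.

Is there a solution using wget, httrack, or some other tool?

Thanks.

【Question discussion】:

Tags: wget httrack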

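【Solution 1】:

You can do this with a scraping tool such as Scrapy. You can parse the HTML response with CSS selectors, XPath, regular expressions, or other means to get the links that match your rules.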
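As a rough illustration of matching on the surrounding markup, here is a minimal sketch that uses Python's standard-library html.parser instead of Scrapy selectors (a swapped-in technique, chosen only so the example runs without third-party packages). The class name OriginalLinkParser and the sample HTML string are assumptions made for illustration. It keeps only links inside div.wb_more that directly follow the text Original Resolution:.

```python
from html.parser import HTMLParser


class OriginalLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that directly follow the text
    'Original Resolution:' inside <div class="wb_more">."""

    def __init__(self):
        super().__init__()
        self.in_block = False     # currently inside <div class="wb_more">?
        self.depth = 0            # <div> nesting depth inside that block
        self.marker_seen = False  # was the last text 'Original Resolution:'?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.in_block:
                self.depth += 1
            elif "wb_more" in (attrs.get("class") or "").split():
                self.in_block = True
                self.depth = 1
        elif tag == "a" and self.in_block and self.marker_seen:
            if attrs.get("href"):
                self.links.append(attrs["href"])
            self.marker_seen = False

    def handle_endtag(self, tag):
        if tag == "div" and self.in_block:
            self.depth -= 1
            if self.depth == 0:
                self.in_block = False
                self.marker_seen = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_block and text:
            self.marker_seen = text.endswith("Original Resolution:")


# Sample markup adapted from the question (shortened, hypothetical page).
html = """
<div class="wb_more">
Original Resolution: <a href="//site.com/download/space_planet_sky_94434/4800x2700">4800x2700</a><br>
Views: <a href="/download/last">96661</a>
</div>
<div class="wd_resolution">
<span class="wd_res_cat_raz"><a class="wb_res_select" href="//site.com/download/space_planet_sky_94434/1600x1200">1600x1200</a></span>
</div>
"""

parser = OriginalLinkParser()
parser.feed(html)
parser.close()
print(parser.links)
```

Only the 4800x2700 link is collected; the Views link and the links in div.wd_resolution are skipped because no Original Resolution: text precedes them.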

I think it is best to write a separate scraper for each site. For example, for the first one:

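import scrapy


class ImageLink(scrapy.Spider):
    name = 'imageLink'

    # List of URLs to start scraping from
    start_urls = ['https://images.com']

    def parse(self, response):
        # Take the first link inside div.wb_more: the "Original Resolution" one
        # (the second link in that div is the "Views" link)
        href = response.css('div.wb_more > a::attr(href)').get()
        if href:
            # The href is protocol-relative (//site.com/...), so make it absolute,
            # then request it with a callback that saves the image
            yield scrapy.Request(url=response.urljoin(href), callback=self.save_image)

    def save_image(self, response):
        link = response.url
        # Guess the filename from the link, e.g. space_planet_sky_94434 in
        # https://site.com/download/space_planet_sky_94434/4800x2700
        filename = link.split('/')[4]
        # Save the image
        with open(filename, 'wb') as f:
            f.write(response.body)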

If the site paginates its images, you can add a callback that parses the link to the next page.

I have not tested the code.
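For the pagination case mentioned above, the idea can be sketched without Scrapy as a loop that keeps following a "next" link until none is found. Everything named here is an assumption for illustration: the function crawl_pages, the `<a class="next" href="...">` markup, and the fake pages that stand in for real HTTP requests. In Scrapy the same idea corresponds to yielding another request from the parse callback for as long as a next link is present.

```python
import re
from urllib.parse import urljoin

# Hypothetical "next page" markup; adjust the pattern to the real site.
NEXT_LINK = re.compile(r'<a\s+class="next"\s+href="([^"]+)"', re.I)


def crawl_pages(start_url, fetch, max_pages=1000):
    """Yield (url, html) for each page, following "next" links until
    no further link is found, a page repeats, or max_pages is reached.

    `fetch` is any callable that takes a URL and returns the HTML as str.
    """
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        match = NEXT_LINK.search(html)
        # Resolve relative links against the current page; stop if none.
        url = urljoin(url, match.group(1)) if match else None


# Fake three-page site used instead of real network access.
fake_site = {
    "https://example.com/page1": '<a class="next" href="/page2">next</a>',
    "https://example.com/page2": '<a class="next" href="/page3">next</a>',
    "https://example.com/page3": "<p>last page, no next link</p>",
}


def fake_fetch(url):
    return fake_site[url]


visited = [url for url, _ in crawl_pages("https://example.com/page1", fake_fetch)]
print(visited)
```

Stopping on the absence of a next link, rather than on a page count declared by the site, avoids depending on a count that may be wrong.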

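【Discussion】:

Could you give an example? I don't have much experience with Python, by the way.
I added some code to help you get started. However, if you don't have much Python experience, Scrapy may be hard to use. By the way, one of my first experiences with Python was writing a scraper to automate some tasks (Scrapy didn't exist then, so I started from scratch :D).
In my case I need to add some new functionality, and I don't have much experience with Python. So now I'm running into strange behavior where the bodies of some class functions are ignored (I know the indentation is correct).
I have a pagination problem: the site declares a certain number of pages but actually has fewer or more. Is there a technique for requesting pages until some condition is met in a request callback that uses yield?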

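【Solution 2】:

You can try plain wget and run a regular expression over the result (for example with sed or perl), then download the links you obtain (wget can do that).

A basic script looks like this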

wget [URL] -O MyPage.html   # capital -O saves the page; lowercase -o would write wget's log there
./GetTheFlag.pl -html MyPage.html > WallPaperList.txt
wget -i WallPaperList.txt   # here you can choose where to put your images

with GetTheFlag.pl looking like this

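use warnings;
use strict;
use Getopt::Long;

my $MyPage;
GetOptions("html=s" => \$MyPage);

# Read the whole HTML file into one string
open(my $fh, '<', $MyPage) or die "Cannot open $MyPage: $!";
my $content = do { local $/; <$fh> };
close($fh);

# Example pattern for the markup in the question; adjust the regex to the markup around your links
my @urls = ($content =~ /Original Resolution:\s*<a href="([^"]+)">/gi);
foreach my $i (@urls) {
    print "$i\n";
}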

For example, if your URL appears as <a href="url">New Wallpaper</a>, the regular expression would be

 =~ /<a href="(\w+)">New Wallpaper<\/a>/

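Be careful with \w: it only matches characters that can be used in variable names, so it misses characters such as -, as well as / and ., which real URLs contain.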
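To make the \w caveat concrete, here is a small Python sketch (Python's re module treats \w the same way for these characters). The sample string is taken from the markup in the question; the variable names narrow and broad are just for illustration. The first pattern finds nothing because the href contains / characters, while a character class that accepts anything except the closing quote captures the whole URL.

```python
import re

# One line of markup from the question.
html = ('Original Resolution: '
        '<a href="//site.com/download/space_planet_sky_94434/4800x2700">4800x2700</a>')

# \w+ cannot cross '/' or '.', so the whole href never matches.
narrow = re.findall(r'<a href="(\w+)">', html)

# [^"]+ accepts every character up to the closing quote, and the leading
# marker text restricts the match to the "Original Resolution" link.
broad = re.findall(r'Original Resolution:\s*<a href="([^"]+)">', html)

print(narrow)
print(broad)
```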

I hope this is clear enough.
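One detail that may matter for the pipeline above: the hrefs in the question's markup are protocol-relative (//site.com/...) or site-relative (/download/last), so they may need to be turned into absolute URLs before a downloader can fetch them from a list file. A minimal Python sketch, assuming a hypothetical page URL https://site.com/some/page and a helper name absolutize chosen for illustration:

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page URL; the hrefs are the two forms seen in the question.
links = absolutize(
    "https://site.com/some/page",
    ["//site.com/download/space_planet_sky_94434/4800x2700", "/download/last"],
)

# One URL per line, the format a list file such as WallPaperList.txt would use.
print("\n".join(links))
```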

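【Discussion】:

Well, a simple example would be needed to start with; I don't know how to do this.
By the way, the criterion is outside the element.
Sorry for the late reply. I'm currently new to Stack Overflow. Could you provide the type of markup so that I can create a regex example?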