Scrape only new links added after scraping the site (answers)

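【Question title】: Scrape only new links added after scraping the site
【Posted】: 2018-04-10 01:15:32
【Question】:

I have a script that scrapes all the links, titles and sizes of products matching certain keywords. After the first scrape is done, I want the script to keep checking, over and over, whether new items have been added. I tried while True:, but it doesn't seem to work, because it gives me the same data many times. The script looks like this: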

import requests
import csv
from bs4 import BeautifulSoup
import time

headers = {"user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

while True:
    for page in range(0,11):
        url = "https://www.julian-fashion.com/en-US/men/shoes/sneakerscurrPage={}".format(page)
        r = requests.get(url)
        soup = BeautifulSoup(r.content,"html.parser")
        all_links = soup.find_all("li", attrs={"class":"product in-stock"})
        for link in all_links:
            for s in keywords:
                if s not in link.a["href"]:
                    found = False
                    break
                else:
                    product = link.a["href"]
                    found = True
                    if found:
                        print("Product found.")
                        print(base_url+link.a["href"])
                        print(link.img["title"])
                        print(link.div.div.ul.text)

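【Comments on the question】:

Could you please change the URL back to what it was in the original question (before the edit)? My answer addresses that problem, and fixing the code in the question after getting an answer makes things unclear. The ending used to be sneakerscurrPage={}, and I commented on the missing ?.

Tags: python web-scraping beautifulsoup python-requests

【Solution 1】:

You are missing a ? before currPage. The URL should look like this: https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}. The ? marks the start of the query string. With that change your code works.

You can also skip page 0, because pagination on this site starts at 1 and requesting page 0 returns 404 Page not found. Besides that, you don't need while True, because you only want to execute this block of code once. The for loop takes care of moving through the pages, and that is enough.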
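To make the query-string point concrete, here is a minimal sketch that builds the page URLs for pages 1 to 10 with the standard library. The helper name build_page_url and the constant LISTING_URL are made up for this illustration; only the URL and the currPage parameter come from the question.

```python
from urllib.parse import urlencode

# Listing URL from the question; "?" separates the path from the query string.
LISTING_URL = "https://www.julian-fashion.com/en-US/men/shoes/sneakers"


def build_page_url(page):
    """Return the listing URL for one page (hypothetical helper)."""
    return LISTING_URL + "?" + urlencode({"currPage": page})


# Pagination starts at 1 on this site, so page 0 is skipped.
page_urls = [build_page_url(page) for page in range(1, 11)]

print(page_urls[0])
print(len(page_urls))
```

With requests you can also let the library build the query string by passing params={"currPage": page} to requests.get, which avoids this kind of typo altogether.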

There is a bug here:

for s in keywords:
    if s not in link.a["href"]:
        found = False
        break

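If a keyword is not in link.a['href'], you break out of the loop. Note that the first keyword in the list being absent does not mean the next one is absent too.

Your code with a few fixes: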
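A small sketch of the difference, runnable without any network access. The function question_loop reproduces the control flow of the loop from the question and counts how many times a product would be printed; both function names and the second href are invented for this illustration.

```python
keywords = ["nike", "air"]


def question_loop(href):
    """Control flow of the question's loop: count how often the product is printed."""
    printed = 0
    for s in keywords:
        if s not in href:
            break          # gives up at the first missing keyword
        printed += 1       # otherwise prints once per matching keyword
    return printed


def matches_any(href):
    """True if at least one keyword occurs in the href."""
    return any(key in href for key in keywords)


both = "/en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers"
# Made-up href that contains "air" but not "nike".
air_only = "/en-US/product/00000/example/sneakers/air_runner"

print(question_loop(both), matches_any(both))
print(question_loop(air_only), matches_any(air_only))
```

The first href is printed twice by the original loop (once per keyword), and the second is never printed even though it contains "air". With any() each product is reported at most once.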

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

for page in range(1, 11):
    print(f'[PAGE NR {page}]')
    url = "https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}".format(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    all_links = soup.find_all("li", attrs={"class": "product in-stock"})
    for link in all_links:
        if any(key in link.a["href"] for key in keywords):
            print("Product found.")
            print(base_url + link.a["href"])
            print(link.img["title"])
            print(link.div.div.ul.text)

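Here is my version of the code. I used .select() instead of .find_all(). This is more robust: if the page's authors add new classes to the elements you are searching for, .select() with a CSS selector will still match them. I also used urljoin to build absolute links, which is more reliable than concatenating strings by hand.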

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

for page in range(1, 11):
    print(f'[PAGE NR {page}]')
    url = "https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}".format(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    all_items = soup.select('li.product.in-stock')

    for item in all_items:
        link = urljoin(base_url, item.a['href'])

        if any(key in link for key in keywords):
            title = item.img["title"]
            sizes = [size.text for size in item.select('.sizes > ul > li')]

            print(f'ITEM FOUND: {title}\n'
                  f'sizes available: {", ".join(sizes)}\n'
                  f'find out more here: {link}\n')
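As a rough illustration of why urljoin is convenient here, the sketch below joins the base URL with a relative href and with an href that is already absolute. The relative href is taken from the sample output in this answer; the absolute one is only an assumption for demonstration.

```python
from urllib.parse import urljoin

base_url = "https://www.julian-fashion.com"

# Relative href, as seen in the sample output of this answer.
relative_href = "/en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers"
# Already-absolute href (assumed for illustration).
absolute_href = "https://www.julian-fashion.com/en-US/men/shoes/sneakers"

# A relative href is resolved against the base URL.
print(urljoin(base_url, relative_href))
# An absolute href is returned unchanged instead of being glued onto the base.
print(urljoin(base_url, absolute_href))
# A trailing slash on the base does not produce a double slash.
print(urljoin(base_url + "/", relative_href))
```

Plain concatenation (base_url + href) only works for the first case, so urljoin is the safer choice when you do not control the markup.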

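Perhaps you want the keywords to be brands used to filter the items. If so, you can use the code below instead of checking whether a keyword appears in the item's link.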

    if item.select_one('.brand').text.lower() in keywords:

instead of:

    if any(key in link for key in keywords):
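Note that the two checks behave differently: "in keywords" on a list is an exact match on the whole brand text, while the original check looks for a substring of the link. The sketch below shows the exact-match variant on plain strings. The helper brand_matches is hypothetical, and the strip() call assumes the element text may carry surrounding whitespace, which the answer does not state.

```python
keywords = ["nike", "air"]


def brand_matches(brand_text, brands):
    """Exact, case-insensitive comparison of a brand name (hypothetical helper)."""
    return brand_text.strip().lower() in brands


# Brand text with surrounding whitespace still matches after strip().
print(brand_matches("  Nike\n", keywords))
# A longer string is not an exact brand match, even though it contains "nike".
print(brand_matches("Nike Air Max", keywords))
```

In the scraper this would be called as brand_matches(item.select_one('.brand').text, keywords), assuming every item has a .brand element.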

Monitor:

To build a simple monitor that checks the site for new items, you can use the code below and adapt it to your needs:

from bs4 import BeautifulSoup
import requests
import time

item_storage = dict()

while True:
    print('scraping')
    html = requests.get('http://localhost:8000').text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('li.product.in-stock'):
        item_id = item.a['href']

        if item_id not in item_storage:
            item_storage[item_id] = item
            print(f'NEW ITEM ADDED: {item_id}')

    print('sleeping')
    time.sleep(5)  # here you can adjust the frequency of checking for new items

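You can test this locally by creating an index.html file that contains several <li class="product in-stock"> elements, which you can copy from the website. Open Chrome DevTools and find some of the li elements in the Elements tab. Right-click one -> Copy -> Copy outerHTML, then paste it into index.html. Then run python -m http.server 8000 in a console, in the directory that contains index.html, and run the script above. While it is running, you can add more items to the file and watch their href values get printed.

Sample output: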

scraping
NEW ITEM ADDED: /en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers
NEW ITEM ADDED: /en-US/product/47218/y3/sneakers/saikou_sneakers
sleeping
scraping
NEW ITEM ADDED: /en-US/product/47229/y3/sneakers/tangutsu_slip_on
sleeping
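The part of the monitor that decides what counts as new can be separated from the scraping and tried out without a server. This is only a sketch: find_new_items is a made-up helper, and it uses a set of hrefs rather than the dict from the script above, on the assumption that only the href needs to be remembered.

```python
def find_new_items(seen, hrefs):
    """Return the hrefs not seen before, in order, and record them in `seen`."""
    new_items = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            new_items.append(href)
    return new_items


seen = set()

# First pass: nothing has been seen yet, so every item is new.
first = find_new_items(seen, [
    "/en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers",
    "/en-US/product/47218/y3/sneakers/saikou_sneakers",
])

# Second pass: one item was added to the page since the last check.
second = find_new_items(seen, [
    "/en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers",
    "/en-US/product/47218/y3/sneakers/saikou_sneakers",
    "/en-US/product/47229/y3/sneakers/tangutsu_slip_on",
])

print(first)
print(second)
```

Inside the while True loop you would pass in the hrefs collected from soup.select('li.product.in-stock') on each pass and print whatever the function returns.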

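【Discussion】:

Hi! Thanks for your answer and suggestions! I don't know how to make the script wait for new links to be added to the site. I tried while True, but it didn't work. Thanks.
Do you mean a script that runs 24/7 and prints an item as soon as it is added?
@Phil I updated the answer with an example of a simple monitor.
Thanks for the reply. I re-edited the link. I tried the last script, but I get: iMac-di-Filippo:desktop phil$ python moni.py File "moni.py", line 17 print(f'NEW ITEM ADDED: {item_id}') ^ SyntaxError: invalid syntax
Oh right, I forgot to mention that I use f-strings to format strings. They only work in Python 3.6+. You can change it to print('NEW ITEM ADDED: {}'.format(item_id)).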