Trouble parsing product names out of some links with different depth

【Question title】: Trouble parsing product names out of some links with different depth
【Posted】: 2019-02-03 12:12:34
【Question】:

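I've written a script in Python to reach the target pages, where each category lists the names of its available items. My script below can get the product names from most of the links (generated by traversing the category links and then the subcategory links).

The script can parse the subcategory links that appear after clicking the + sign next to each category (visible in the image below), and then parse all the product names from the target pages. This is one of such target pages.

However, a few of the links do not have the same depth as the others. For example, this link and this one differ from usual links like this one.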
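
To make "depth" concrete, it can be read as the number of path segments in a URL. The following is only a minimal sketch: the helper name `url_depth` is an assumption made for illustration and is not part of the script below. The two sample URLs are the brand page and the subcategory page mentioned later in this thread.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Count the non-empty path segments of a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


# A brand page sits directly under the site root ...
shallow = url_depth("https://www.courts.com.sg/klipsch")
# ... while a regular subcategory page is nested under its category.
nested = url_depth("https://www.courts.com.sg/home-appliances/small-appliances")
print(shallow, nested)
```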

How can I get all the product names from all the links, irrespective of their different depth?

This is what I've tried so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = "https://www.courts.com.sg/"

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".nav-dropdown li a"):
    href = item.get("href")
    if not href or "#" in href:  # skip missing hrefs and in-page anchors
        continue
    newlink = urljoin(link, href)
    req = requests.get(newlink)
    sauce = BeautifulSoup(req.text, "lxml")
    # print every product name listed on the category page
    for elem in sauce.select(".product-item-info .product-item-link"):
        print(elem.get_text(strip=True))

How to find the target links:

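【Question comments】:

Why does the depth of the URL matter? What does it change?
If you read the post, @Cole, you would see why the depth matters. When I apply the existing logic in my scraper, it doesn't let me fetch the product names from those links (the ones with a different depth).
I read it, and I just re-read it. Given the requirement, the format of the links isn't the point of what you're trying to accomplish. Unless I'm missing something?
I want to parse all the product names, nothing else.

Tags: python python-3.x web-scraping beautifulsoup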

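【Solution 1】:

The site has six main product categories. Products that belong to a subcategory can also be found in the main category (for example, the products in /furniture/furniture/tables can also be found in /furniture), so you only have to collect products from the main categories. You could get the category links from the main page, but it is easier to use the sitemap.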

url = 'https://www.courts.com.sg/sitemap/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]

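As you mentioned, some links have a different structure, such as /televisions. However, if you click the View All Products link on that page, you are redirected to /tv-entertainment/vision/television. So you can get all the /televisions products from /tv-entertainment. Similarly, the products in brand links can be found in the main categories. For example, the /asus products can be found in /computing-mobile and other categories.

The code below collects products from all the main categories, so it should collect every product on the site.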

from bs4 import BeautifulSoup
import requests

url = 'https://www.courts.com.sg/sitemap/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]
products = []

for link in links:
    link += '?product_list_limit=24'
    while link:
        r = requests.get(link)
        soup = BeautifulSoup(r.text, 'html.parser')
        link = (soup.select_one('a.action.next') or {}).get('href')
        for elem in soup.select(".product-item-info .product-item-link"):
            product = elem.get_text(strip=True)
            products += [product]
            print(product)
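
The `while link:` loop above stops because of the `(... or {}).get('href')` idiom: `select_one` returns `None` when no "next" link is found, `None or {}` evaluates to an empty dict, and `{}.get('href')` is `None`. The sketch below shows that control flow offline. It is an illustration only: plain dicts stand in for the BeautifulSoup tags, and the `PAGES` table and the names `next_href` and `walk` are hypothetical.

```python
def next_href(tag):
    """Return the tag's href, or None when there is no tag at all."""
    return (tag or {}).get("href")


# Hypothetical stand-in for the site: each page maps to its "next" tag
# (a dict with an href), or to None on the last page.
PAGES = {
    "/furniture?p=1": {"href": "/furniture?p=2"},
    "/furniture?p=2": {"href": "/furniture?p=3"},
    "/furniture?p=3": None,
}


def walk(start):
    """Follow "next" links from start until a page has none."""
    visited = []
    link = start
    while link:
        visited.append(link)
        link = next_href(PAGES[link])
    return visited


print(walk("/furniture?p=1"))
```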

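I've increased the number of products per page to 24, but this code still takes a lot of time, because it collects products from all the main categories and their pagination links. However, we can make it much faster with threads.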

from bs4 import BeautifulSoup
import requests
from threading import Thread, Lock
from urllib.parse import urlparse, parse_qs

lock = Lock()
threads = 10
products = []

def get_products(link, products):
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    tags = soup.select(".product-item-info .product-item-link")
    with lock:
        products += [tag.get_text(strip=True) for tag in tags]
        print('page:', link, 'items:', len(tags))

url = 'https://www.courts.com.sg/sitemap/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]

for link in links:
    link += '?product_list_limit=24'
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    last_page = soup.select_one('a.page.last')['href']
    last_page = int(parse_qs(urlparse(last_page).query)['p'][0])
    threads_list = []

    for i in range(1, last_page + 1):
        page = '{}&p={}'.format(link, i)
        thread = Thread(target=get_products, args=(page, products))
        thread.start()
        threads_list.append(thread)
        # wait for the current batch to finish before starting the next one
        if i % threads == 0 or i == last_page:
            for t in threads_list:
                t.join()
            threads_list = []

print(len(products))
print('\n'.join(products))

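This code collected 18,466 products from 773 pages in about 5 minutes. I'm using 10 threads because I don't want to stress the server too much, but you could use more (most servers can handle 20 threads easily).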
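
The same batching could also be expressed with the standard library's `concurrent.futures.ThreadPoolExecutor`, which caps the number of concurrent requests without manual `join` calls. This is a sketch under stated assumptions, not the code that produced the numbers above: `last_page_number`, `page_urls` and `collect` are names made up here, and `fetch_names` is a hypothetical callable (for example a requests + BeautifulSoup function) that returns the product names of one page.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse, parse_qs


def last_page_number(href):
    """Read the `p` query parameter from a pagination link."""
    return int(parse_qs(urlparse(href).query)["p"][0])


def page_urls(category_url, last_page, per_page=24):
    """Build one URL per results page of a category."""
    return [
        "{}?product_list_limit={}&p={}".format(category_url, per_page, page)
        for page in range(1, last_page + 1)
    ]


def collect(urls, fetch_names, max_workers=10):
    """Run fetch_names(url) on a bounded pool and flatten the results.

    Each worker returns its own list, so no shared list or lock is used;
    pool.map yields the per-page lists in the order of `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(fetch_names, urls)
        return [name for names in pages for name in names]


# Offline demonstration with a fake fetcher instead of real HTTP requests.
demo_urls = page_urls("https://www.courts.com.sg/furniture", 3)
print(collect(demo_urls, lambda url: ["item from " + url]))
```

Passing the fetcher in as an argument keeps the pooling logic testable without touching the network; `max_workers=10` mirrors the thread count used above.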

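【Comments】:

Looks like this post is about to be solved. Always glad to find you in the loop, @t.m.adam. Where and how should I use the print statement in your script? I'll accept it when the time is right. Thanks.
I'm glad to see you're still around! I guess you can print each product in the loop (I've updated the code). By the way, what do you think of my solution? I've searched the site a bit and I think the main categories include all products. If you find a product that is not in the main categories, let me know so I can look into it. Also, I'm experimenting with multithreading because the script is too slow. I'll update the code if I can.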

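【Solution 2】:

I looked at the website and found that all the products are listed at the bottom left of the main page, https://www.courts.com.sg/. Clicking one of those links takes us to the promotional front page of a particular category, where we have to click All Products to get them.

The whole code follows: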

import requests
from bs4 import BeautifulSoup

def parser():
    parsing_list = []
    url = 'https://www.courts.com.sg/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    ul = soup.find('footer',{'class':'page-footer'}).find('ul')
    for l in ul.find_all('li'):
        nextlink = url + l.find('a').get('href')
        response = requests.get(nextlink)
        inner_soup = BeautifulSoup(response.text, "html.parser")
        parsing_list.append(url + inner_soup.find('div',{'class':'category-static-links ng-scope'}).find('a').get('href'))
    return parsing_list

This function returns a list of the all-products page links for every category, including the ones your code did not scrape from.
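
One caveat with building links as `url + href`: it only works when every `href` is relative and has no leading slash. Whether that holds for the footer links is not verified here, so the sketch below is an assumption-labelled illustration of the safer `urljoin` (already used in the question's script); the helper name `absolute` is hypothetical.

```python
from urllib.parse import urljoin

base = "https://www.courts.com.sg/"


def absolute(href):
    """Resolve href against the site root, whatever form it takes."""
    return urljoin(base, href)


# Root-relative, relative and already-absolute hrefs resolve to one URL.
print(absolute("/furniture"))
print(absolute("furniture"))
print(absolute("https://www.courts.com.sg/furniture"))

# Plain concatenation doubles the slash for a root-relative href.
print(base + "/furniture")
```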

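【Comments】:

【Solution 3】:

Since your main problem is finding the links, here is a generator that finds all the category and subcategory links, using the sitemap that krflol pointed out in his solution: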

from bs4 import BeautifulSoup
import requests


def category_urls():
    response = requests.get('https://www.courts.com.sg/sitemap')
    html_soup = BeautifulSoup(response.text, features='html.parser')
    categories_sitemap = html_soup.find(attrs={'class': 'xsitemap-categories'})

    for category_a_tag in categories_sitemap.find_all('a'):
        yield category_a_tag.attrs['href']

To find the product names, simply scrape each URL yielded by category_urls.
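
Because subcategory pages repeat products that also appear under their main category (as noted in Solution 1), scraping every yielded URL will probably return the same name more than once. A minimal sketch of order-preserving deduplication follows. The function name `unique_product_names`, the `scrape_names` callable and the fake catalogue are all hypothetical and only there to keep the example runnable offline.

```python
def unique_product_names(urls, scrape_names):
    """Scrape every URL and keep each product name once, in first-seen order.

    scrape_names is a hypothetical callable that returns the list of
    product names found on one category page.
    """
    seen = set()
    ordered = []
    for url in urls:
        for name in scrape_names(url):
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered


# Fake catalogue standing in for real pages: the subcategory repeats
# one product of its parent category.
FAKE_SITE = {
    "/furniture": ["Oak table", "Sofa"],
    "/furniture/tables": ["Oak table", "Side table"],
}

print(unique_product_names(FAKE_SITE, FAKE_SITE.get))
```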

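【Comments】:

This still runs into the same depth problem. Look at these links, https://www.courts.com.sg/klipsch and https://www.courts.com.sg/simmons: their depth differs from these two, https://www.courts.com.sg/home-appliances/small-appliances and https://www.courts.com.sg/smart-tech/smart-gadgets, and so on.
@Topto Are you parsing data from the URL itself?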

【Solution 4】:

I would recommend starting your scrape from the site's sitemap page.

Found here

If they add products, those will likely show up there as well.

【Comments】: