【Question title】: Web Scrape - Span Price issue. Printing the last item's price in loop instead of price for each item
【Posted】: 2021-06-17 08:54:58
【Question】:

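I'm almost happy with this script I wrote today. It got there with some help (thanks to everyone who has helped so far) and some dubious programming of my own, but it is functional to a degree.

I want to dump the data to JSON. Everything seems to be dumped correctly except the price (taken from a <span></span>). I think the problem is the indentation, but I'm not 100% sure.

Can anyone cast their eyes over this snippet and correct what I can't see? I think I've tried so many variations that I've gone blind to the right one.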

from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib.request as urllib2
import locale
import json
from selenium import webdriver
import lxml.html
import time
from selenium.webdriver.support.ui import Select

os.environ["PYTHONIOENCODING"] = "utf-8"

#selenium requests
browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)

#beautiful soup requests
#URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
#page = requests.get(URL)
#soup = BeautifulSoup(page.content, 'html.parser')
soup = BeautifulSoup(browser.page_source, features="lxml")
#products = soup.find_all("div", "GC62 Product")
products = soup.find_all("div", "GC62 Product")

for product in products:
    #barrel lengths
    barrels = product.find('select', attrs={'name': re.compile('length')})
    if barrels:
        barrels_list = [x['origvalue'][:2] for x in barrels.find_all('option')[1:]]

        for y in range(0, len(barrels_list)):
            #title
            title = product.find("h3") 
            titleText = title.text if title else ''

            #manufacturer name
            manufacturer = product.find("div", "GC5 ProductManufacturer")
            manuText = manufacturer.text if manufacturer else ''

            #image location
            img = product.find("div", "ProductImage")
            imglinks = img.find("a") if img else ''
            imglinkhref = imglinks.get('href')  if imglinks else ''
            imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
 
            #description
            description = product.find("div", "GC12 ProductDescription")
            descText = description.text if description else ''
            #descStr = str(descText)

            #more description
            more = product.find("div", "GC12 ProductDetailedDescription")
            moreText = more.text if more else ''

            #price
            spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
            for i in range(0,len(spans),2):
                span = spans[i].text
                i+=1 
                
                #print(span)
                #print(barrels_list[y])
                #print(titleText)
                #print(manuText)
                #print(descText)
                #print(moreText)
                #print(imgurl.replace('..', ''))
                #print("\n")

            x = {
                "price": span,
                "barrel length": barrels_list[y],
                "title": titleText,
                "manufacturer": manuText,
                "description": descText,
                "desc cont": moreText,
                "image Location": imgurl.replace('..', '')
            }

            dump = json.dumps(x)
            print(dump)
            y+=1    

【Comments】:

    Tags: python json selenium web-scraping beautifulsoup


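    【Solution 1】:

    I managed to get it working by modifying your code. Your last for loop isn't really needed: it walks every price span on the page, so span always ends up holding the last one, and you have already found the product's own tag anyway. So you can do the following: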

    from bs4 import BeautifulSoup
    import requests
    import shutil
    import csv
    import pandas
    from pandas import DataFrame
    import re
    import os
    import urllib.request as urllib2
    import locale
    import json
    from selenium import webdriver
    import lxml.html
    import time
    from selenium.webdriver.support.ui import Select 
    os.environ["PYTHONIOENCODING"] = "utf-8"
    
    #selenium requests
    browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
    browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
    time.sleep(2)
    
    #beautiful soup requests
    #URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
    #page = requests.get(URL)
    #soup = BeautifulSoup(page.content, 'html.parser')
    soup = BeautifulSoup(browser.page_source, features="lxml")
    #products = soup.find_all("div", "GC62 Product")
    products = soup.find_all("div", "GC62 Product")
    
    
    
    
    for product in products:
        
        #barrel lengths
        barrels = product.find('select', attrs={'name': re.compile('length')})
        if barrels:
            barrels_list = [x['origvalue'][:2] for x in barrels.find_all('option')[1:]]
            
            #title
            title = product.find("h3") 
            titleText = title.text if title else ''
    
            #manufacturer name
            manufacturer = product.find("div", "GC5 ProductManufacturer")
            manuText = manufacturer.text if manufacturer else ''
    
            #image location
            img = product.find("div", "ProductImage")
            imglinks = img.find("a") if img else ''
            imglinkhref = imglinks.get('href')  if imglinks else ''
            imgurl = 'https://www.mcavoyguns.co.uk/contents' + imglinkhref
    
            #description
            description = product.find("div", "GC12 ProductDescription")
            descText = description.text if description else ''
    
            #more description
            more = product.find("div", "GC12 ProductDetailedDescription")
            moreText = more.text if more else ''
    
            #price
            price = product.findChild(name="span")
            print("price : ", price)
            price_raw = price.text
            print("price_raw : ", price_raw)
            price_replaced = price_raw.replace(',', '').replace('£', '')
            print("price_replaced : ", price_replaced)
            price_float = float(price_replaced)
    
            for barrel in barrels_list:
                x = {
                    "price": price_float,
                    "barrel length": barrel,
                    "title": titleText,
                    "manufacturer": manuText,
                    "description": descText,
                    "desc cont": moreText,
                    "image Location": imgurl.replace('..', '')
                }
                dump = json.dumps(x)
                print(dump)
    

    If it still doesn't work, don't hesitate to ask!

    【Discussion】:
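
To illustrate why the question's script printed the same price for every item, here is a minimal sketch that uses plain lists instead of a live page. The data and the function names (`last_price_bug`, `price_per_product`) are hypothetical stand-ins, not part of the original script. It assumes one price per product, listed in page order, which is a simplification of what the real page returns.

```python
# Hypothetical stand-in data: one price per product, in page order.
product_titles = ["Product A", "Product B", "Product C"]
page_prices = ["£1,000.00", "£2,000.00", "£3,000.00"]


def last_price_bug(titles, prices):
    """Mimic the question's structure: scan ALL prices for every product."""
    rows = []
    for title in titles:
        for p in prices:
            span = p  # overwritten on every pass, so only the last one survives
        # The dict is built after the inner loop has finished.
        rows.append({"title": title, "price": span})
    return rows


def price_per_product(titles, prices):
    """Pair each product with its own price instead of rescanning the page."""
    return [{"title": t, "price": p} for t, p in zip(titles, prices)]


buggy = last_price_bug(product_titles, page_prices)
fixed = price_per_product(product_titles, page_prices)

for row in buggy:
    print("buggy:", row)
for row in fixed:
    print("fixed:", row)
```

The code above avoids the problem in a different way: it looks the price up inside each `product` element, so it does not depend on the prices and products lining up by position.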

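    • Well, it works for me! I tried to simplify the code; could you copy/paste it and see whether it works? By the way, when you use a for statement there is no need to add x += 1 at the end: the for statement advances on its own!
    • OK, let's debug. Can you give me everything that is printed on your screen for the price_... values (see my code and the new print() calls on lines 65, 67, 69 and 71)?
    • Interesting. Let's try something different: try the new code :)
    • Ahhh, I see what's happening: on my screen the £ symbol comes after the amount. I wanted to convert the amount to a float, but maybe you don't need that?
    • See the edited code for the conversion to float. If it doesn't work, can you give me the printed value of price.text so I can make it work?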
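
As the comments note, the £ symbol may appear before or after the amount depending on how the page is rendered. One way to make the float conversion independent of its position is to pull out just the numeric part with a regular expression. This is a sketch only: `parse_price` is a hypothetical helper that does not appear in the answer, and the sample strings are assumed formats rather than values captured from the site.

```python
import re


def parse_price(text):
    """Return the first number in text as a float, or None if there is none.

    Accepts thousands separators (commas) and ignores any currency symbol
    or whitespace before or after the amount.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


# Assumed example formats, with the symbol on either side of the amount.
print(parse_price("£1,895.00"))
print(parse_price("1,895.00 £"))
print(parse_price("POA"))
```

Returning `None` when no number is found lets the caller decide what to do with items that have no numeric price, rather than raising a `ValueError` partway through the loop.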