【Question title】: Issue concatenating DataFrames inside loops while scraping a website
【Posted】: 2021-01-13 11:27:15
【Question】:

I have the following code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import datetime
import time

# url = 'https://www.pccomponentes.com/procesadores?page='

url_list = [
    'https://www.pccomponentes.com/procesadores?page=',
    'https://www.pccomponentes.com/discos-duros/500-gb/conexiones-m-2/disco-ssd/internos?page=',
    'https://www.pccomponentes.com/discos-duros/1-tb/conexiones-m-2/disco-ssd/internos?page=',
    'https://www.pccomponentes.com/placas-base/amd-b550/atx?page=',
    'https://www.pccomponentes.com/placas-base/amd-x570/atx?page=',
    'https://www.pccomponentes.com/memorias-ram/16-gb/kit-2x8gb?page=',
    'https://www.pccomponentes.com/ventiladores-cpu?page=',
    'https://www.pccomponentes.com/fuentes-alimentacion/850w/fuente-modular?page=',
    'https://www.pccomponentes.com/fuentes-alimentacion/750w/fuente-modular?page=',
    'https://www.pccomponentes.com/cajas-pc/atx/con-ventana/sin-ventana?page='
    ]

# store = 'PCComponentes'
# df_hold_list = [] # capture dataframe for each link
# extraction_date = datetime.datetime.now()

for url in url_list:

    for x in range(1,2):

        headers = ({'User-Agent':
                    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
                    'Accept-Language': 'es-ES, es;q=0.5'})

        r = requests.get(url + str(x), headers = headers)
        print(r.status_code)
        soup = BeautifulSoup(r.content,'html.parser')
        # print(soup)

        items = soup.find_all('div',class_='col-xs-6 col-sm-4 col-md-4 col-lg-4')
        # print(product)

        store = ['PCComponentes']
        df_list =[] 
        df_hold_list = [] 
        df_final =[] 
        extraction_date = datetime.datetime.now()

        for item in items:
            
            product_name = item.find('h3',class_ = 'c-product-card__title').text
            try:
                price = item.find('div', class_ = 'c-product-card__prices-actual cy-product-price-normal').text[:-1]
            except AttributeError:
                price = item.find('div', class_ = 'c-product-card__prices-actual c-product-card__prices-actual--discount cy-product-price-discount').text[:-1]
            try:
                old_price = item.find('div',class_ = 'c-product-card__prices-pvp cy-product-price-normal').text[:-1]
            except AttributeError:
                old_price = "Sin descuento"
            # try:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-inmediata cy-product-availability-date').text.strip()
            # except AttributeError:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-moderada cy-product-availability-date').text.strip()
            # except AttributeError:
            #     availability = "Sin Fecha"
            try:
                rating = item.find('span',class_ = 'c-star-rating__text cy-product-text').text
            except AttributeError:
                "Sin valoracion"
            try:
                reviews = item.find('span',class_ = 'c-star-rating__text cy-product-rating-result').text
            except AttributeError:
                "Sin reviews"
            try:
                brand = item.find('article')['data-brand'] 
            except AttributeError:
                "Sin Marca"
            try:
                category = item.find('article')['data-category']
            except AttributeError:
                "Sin Categoria"

            # if None in (product_name, price, availability, rating, reviews, brand, category):
                # continue
            
            print(product_name, price, old_price, rating, reviews, brand, category, store, extraction_date)
            
            df = pd.DataFrame (
            {
                'product_name' : product_name,
                'price' : price,
                #'availability' : availability,
                'rating' : rating,
                'reviews' : reviews,
                'brand' : brand,
                'category' : category,
                'store' : store,
                'date_extraction' : extraction_date,
            })
            df_list.append(df)
    time.sleep(3)

    df_hold_list.append(df)

    data_PCCOMP = pd.concat(df_hold_list, axis=0)

    store = 'PCComponentes'
    # site = ‘mysite’
    path = '/home/pi/Documents/WebScraping Files/pccomp/'
    mydate = extraction_date.strftime('%Y%m%d')
    mytime = extraction_date.strftime('%H%M%S')
    filename = path+store+'_'+mydate+'_'+mytime+".csv"

    data_PCCOMP.to_csv(filename)

    print(data_PCCOMP)

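The code loops over a set of paginated web pages and extracts the data from each page into a DataFrame.

At the end, all of the collected data should be written to a single CSV.

The script runs without errors, but I cannot append the DataFrames so that I end up with just one CSV containing all the data.

I need help to achieve this; any help would be greatly appreciated.

Thanks in advance.

Regards.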

【Discussion】:

    Tags: python pandas dataframe web-scraping


    【Solution 1】:

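    Usually, when collecting data through web scraping, what I like to do is build either:

    • a list of dictionaries (each containing the metadata) (Option 1)
    • a single dictionary holding lists of metadata, keyed by the corresponding column names (data, title, price, etc.) (Option 2)

    (By "metadata" I mean all the information that describes a single item: in your case, that would be the item price, the extraction date, the reviews of the specific item, and so on.)

    Only once the scraping is finished do I build the DataFrame, as the very last step.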
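
As a minimal sketch of the two options described above, the snippet below builds the same DataFrame both ways. The `scraped` tuples are made-up values standing in for real scraped products, and the variable names are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for scraped products (hypothetical values).
scraped = [
    ("CPU A", "199,99", "BrandX"),
    ("CPU B", "329,00", "BrandY"),
]

# Option 1: a list of dictionaries, one dictionary per item.
rows = []
for name, price, brand in scraped:
    rows.append({"product_name": name, "price": price, "brand": brand})

# Option 2: a single dictionary of lists, one list per column.
columns = {"product_name": [], "price": [], "brand": []}
for name, price, brand in scraped:
    columns["product_name"].append(name)
    columns["price"].append(price)
    columns["brand"].append(brand)

# In both cases the DataFrame is built once, after collection is finished.
df_option1 = pd.DataFrame(rows)
df_option2 = pd.DataFrame(columns)
```

Both containers are plain Python objects that are cheap to append to inside a loop, which is why the DataFrame construction can wait until the end.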

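    One last point: I did not want to change your original script too much, but I think you should consider two things:

    • Build a function to wrap your scraping steps (or even a class, so that you can add a function for each of the similar things you are doing: collecting product metadata)
    • Replace "Sin Marca", "Sin reviews" and the like with np.nan, which will make your data processing and analysis easier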
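
The two suggestions above could look roughly like the sketch below. `text_or_nan` and `build_row` are hypothetical helpers, not part of the original script, and the `SimpleNamespace` objects are stand-ins for whatever `item.find(...)` returns (a tag with a `.text` attribute, or `None` when nothing is found), so the example runs without any network access.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def text_or_nan(node):
    """Return the stripped text of a tag-like object, or np.nan if it is None."""
    if node is None:
        return np.nan
    return node.text.strip()


def build_row(name_node, rating_node, reviews_node):
    """Collect the metadata of one product into a dictionary (hypothetical helper)."""
    return {
        "product_name": text_or_nan(name_node),
        "rating": text_or_nan(rating_node),
        "reviews": text_or_nan(reviews_node),
    }


# Stand-ins for the objects that item.find(...) would return;
# None mimics a tag that was not found on the page.
rows = [
    build_row(SimpleNamespace(text=" CPU A "), SimpleNamespace(text="4.5"), SimpleNamespace(text="12")),
    build_row(SimpleNamespace(text="CPU B"), None, None),
]

df = pd.DataFrame(rows)
# Missing values are real NaN, so pandas can count or drop them directly.
n_missing_ratings = int(df["rating"].isna().sum())
```

With NaN in place of placeholder strings, methods such as `isna()`, `dropna()` and `fillna()` work on the missing entries without any string comparison.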

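    In the modification I made to your script, I chose Option 2. I am not sure, but I guess Option 2 is more efficient than Option 1. However, when you work with more complex data, I find it is sometimes useful to build a dictionary for each item first and then put those dictionaries into a list (this would be Option 1): it makes it easier to keep track of each item, one at a time.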

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    import datetime
    import time
    
    # url = 'https://www.pccomponentes.com/procesadores?page='
    
    url_list = [
        'https://www.pccomponentes.com/procesadores?page=',
        'https://www.pccomponentes.com/discos-duros/500-gb/conexiones-m-2/disco-ssd/internos?page=',
        'https://www.pccomponentes.com/discos-duros/1-tb/conexiones-m-2/disco-ssd/internos?page=',
        'https://www.pccomponentes.com/placas-base/amd-b550/atx?page=',
        'https://www.pccomponentes.com/placas-base/amd-x570/atx?page=',
        'https://www.pccomponentes.com/memorias-ram/16-gb/kit-2x8gb?page=',
        'https://www.pccomponentes.com/ventiladores-cpu?page=',
        'https://www.pccomponentes.com/fuentes-alimentacion/850w/fuente-modular?page=',
        'https://www.pccomponentes.com/fuentes-alimentacion/750w/fuente-modular?page=',
        'https://www.pccomponentes.com/cajas-pc/atx/con-ventana/sin-ventana?page='
        ]
    
    # metadata: one list per column, created once before the loops so that
    # the rows from every URL and every page accumulate in the same lists
    prices = []
    product_names = []
    old_prices = []
    ratings = []
    reviews = []
    brands = []
    categories = []
    stores = []
    extraction_dates = []
    
    # the request headers never change, so they are defined once
    headers = ({'User-Agent':
                'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
                'Accept-Language': 'es-ES, es;q=0.5'})
    
    for url in url_list:
    
        for x in range(1,2):
    
            r = requests.get(url + str(x), headers = headers)
            print(r.status_code)
            soup = BeautifulSoup(r.content,'html.parser')
            # print(soup)
    
            items = soup.find_all('div',class_='col-xs-6 col-sm-4 col-md-4 col-lg-4')
            # print(product)
    
            for item in items:
                
                extraction_data = datetime.datetime.now()
                extraction_dates.append(extraction_data)
                
                store = 'PCComponentes'
                stores.append(store)
                
                product_name = item.find('h3',class_ = 'c-product-card__title').text
                product_names.append(product_name)
                
                try:
                    price = item.find('div', class_ = 'c-product-card__prices-actual cy-product-price-normal').text[:-1]
                    prices.append(price)
                except AttributeError:
                    price = item.find('div', class_ = 'c-product-card__prices-actual c-product-card__prices-actual--discount cy-product-price-discount').text[:-1]
                    prices.append(price)
                try:
                    old_price = item.find('div',class_ = 'c-product-card__prices-pvp cy-product-price-normal').text[:-1]
                    old_prices.append(old_price)
                except AttributeError:
                    # find() returned None: the item is not discounted
                    old_price = "Sin descuento"
                    old_prices.append(old_price)
                # try:
                #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-inmediata cy-product-availability-date').text.strip()
                # except AttributeError:
                #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-moderada cy-product-availability-date').text.strip()
                # except AttributeError:
                #     availability = "Sin Fecha"
                try:
                    rating = item.find('span',class_ = 'c-star-rating__text cy-product-text').text
                    ratings.append(rating)
                except AttributeError:
                    # find() returned None: the item has no rating
                    ratings.append("Sin valoracion")
                try:
                    review = item.find('span',class_ = 'c-star-rating__text cy-product-rating-result').text
                    reviews.append(review)
                except AttributeError:
                    reviews.append("Sin reviews")
                try:
                    brand = item.find('article')['data-brand']
                    brands.append(brand)
                except (TypeError, KeyError):
                    # TypeError: no <article> tag was found; KeyError: the attribute is missing
                    brands.append("Sin Marca")
                try:
                    category = item.find('article')['data-category']
                    categories.append(category)
                except (TypeError, KeyError):
                    categories.append("Sin Categoria")
    
                # if None in (product_name, price, availability, rating, reviews, brand, category):
                    # continue
                    
    # build the dictionary and the DataFrame only as the last step
    dict_metadata = {
        'product_name' : product_names,
        'price' : prices,
        #'availability' : availability,
        'rating' : ratings,
        'reviews' : reviews,
        'brand' : brands,
        'category' : categories,
        'store': stores,
        'extraction_date': extraction_dates
    }
    
    df = pd.DataFrame(dict_metadata)
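
If you would rather keep one DataFrame per page, as in the question, the sketch below shows one way to end up with a single CSV: the list of frames is created once, before both loops, and `pd.concat` is called once, after them. `scrape_page` is a hypothetical stand-in for the request and parsing step, and the `example.com` URLs are placeholders, so the snippet runs offline.

```python
import pandas as pd


def scrape_page(url, page):
    """Hypothetical stand-in for one request + parse step: returns the rows of one page."""
    return [
        {"product_name": f"item {page}-1 from {url}", "price": "10,00"},
        {"product_name": f"item {page}-2 from {url}", "price": "20,00"},
    ]


# Placeholder URLs, not the real site.
url_list = ["https://example.com/a?page=", "https://example.com/b?page="]

frames = []  # created once, before both loops, so nothing is overwritten
for url in url_list:
    for page in range(1, 3):
        frames.append(pd.DataFrame(scrape_page(url, page)))

# One concat after the loops; ignore_index gives a clean 0..n-1 index.
data = pd.concat(frames, axis=0, ignore_index=True)

# A single CSV for the whole run (returned as a string here; pass a path to write a file).
csv_text = data.to_csv(index=False)
```

Re-creating the list inside a loop, or calling `to_csv` inside it, is what produces one small result per iteration instead of one combined file.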
    
    

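    【Comments】:

    • Thanks for your answer. My knowledge of Python is still limited and needs to improve. I think I am missing something in your code, because when I run it with print(df) added, I only get the last row instead of the whole dataset.
    • It works fine for me :( Maybe go through the code at your own pace. If you rename the variables I created, be careful: I used plural names (categories, brands, reviews) for the lists and singular names (category, brand, review) for the single value that belongs to each row of the dataset.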