【Question title】: How can I keep only unique values in the header and get the values corresponding to them in different rows?
【Posted】: 2020-10-16 18:21:45
【Question】:

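I have a link that lists several products. Each product page contains a specification table in which the first column holds what should be the headers and the second column holds the corresponding data. The first column differs from table to table, with some overlapping categories. I want one big table that contains all of these categories as columns, with one row per product. I am able to get the data for one table (one product), as follows: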

import requests
import csv
from bs4 import BeautifulSoup 
def cpap_spider(max_pages):
    page=1
    while page<=max_pages:
        url= "https://www.1800cpap.com/cpap-masks/nasal?page=" +str(page)
        source_code= requests.get(url)
        plain_text= source_code.text
        soup= BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll("a", {"class":"facets-item-cell-grid-title"}):
            
            href="https://www.1800cpap.com"+link.get("href")
            title= link.string
            each_item(href)
            print(href)
            #print(title)
        page+=1
        
data=[] 
def each_item(item_url):
    source_code= requests.get(item_url)
    plain_text= source_code.text
    soup= BeautifulSoup(plain_text, 'html.parser')
    table=soup.find("table", {"class":"table"})
    
    table_rows= table.find_all('tr')
    for row in table_rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # Get rid of empty values
    b = open('all_appended.csv', 'w')
    a = csv.writer(b)
    a.writerows(data)
    b.close()
    
    
            
cpap_spider(1)            

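This code appends all the tables one after another. What I want instead is one big table whose first row holds the unique headers, with each product's values lined up under the matching header.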

【Comments】:

    Tags: python


    【Solution 1】:

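    Use xlsxwriter instead of csv. The concern with CSV is that columns are separated by ",", so a value that contains a comma with no space after it, such as text = "aa,bb", can be read as two columns, "aa" and "bb", by anything that splits lines on "," by hand. Python's csv.writer does quote such fields, so this only causes problems when the file is read by something that is not CSV-aware; writing an .xlsx file sidesteps the question.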
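
    Whether the comma concern applies depends on how the file is written and read. A minimal check, using only the standard library and a made-up row, of what csv.writer does with a value such as "aa,bb":

```python
import csv
import io

# Write one row whose second field contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Mask", "aa,bb"])
written = buffer.getvalue()

# Read the same text back with a CSV-aware reader: the field stays whole.
parsed = next(csv.reader(io.StringIO(written)))

# Splitting on "," by hand is what breaks the field apart.
naive = written.strip().split(",")

print(repr(written))
print(parsed)
print(naive)
```

    The field survives a round trip through csv.reader; only the hand-rolled split turns it into extra columns.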

    Here is what you need:

    import requests
    import xlsxwriter
    from bs4 import BeautifulSoup 
    def cpap_spider(max_pages):
        global row_i
        page=1
        while page<=max_pages:
            url= "https://www.1800cpap.com/cpap-masks/nasal?page=" +str(page)
            source_code= requests.get(url)
            plain_text= source_code.text
            soup= BeautifulSoup(plain_text, 'html.parser')
            for link in soup.findAll("a", {"class":"facets-item-cell-grid-title"}):
                href="https://www.1800cpap.com"+link.get("href")
                title = link.string
                worksheet.write(row_i, 0, title)
                each_item(href)
                print(href)
                #print(title)
            page+=1
    
    def each_item(item_url):
        global cols_names, row_i
        source_code= requests.get(item_url)
        plain_text= source_code.text
        soup= BeautifulSoup(plain_text, 'html.parser')
        table=soup.find("table", {"class":"table"})
        if table:
            table_rows = table.find_all('tr')
        else:
            return
        for row in table_rows:
          cols = row.find_all('td')
          for ele in range(0,len(cols)):
            temp = cols[ele].text.strip()
            if temp:
              # Strip unwanted trailing characters such as ":" here, so that
              # "Actual Weight" and "Actual Weight:" map to the same column
              if temp[-1:] == ":":
                temp = temp[:-1]
              # The first cell of a row is the column name
              if ele == 0:
                try:
                  cols_names_i = cols_names.index(temp)
                except ValueError:
                  # New column: register it and write its header cell
                  cols_names.append(temp)
                  cols_names_i = len(cols_names) - 1
                  worksheet.write(0, cols_names_i + 1, temp)
                # The name itself is never written into the product row
                continue
              worksheet.write(row_i, cols_names_i + 1, temp)
        row_i += 1
        
    cols_names=[]
    cols_names_i = 0
    row_i = 1
    workbook = xlsxwriter.Workbook('all_appended.xlsx')
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, "Title")
        
    cpap_spider(1)
    #each_item("https://www.1800cpap.com/viva-nasal-cpap-mask-by-3b-medical")       
    workbook.close()
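
    The column bookkeeping in each_item can also be looked at separately from the scraping. The following is only a sketch under assumptions: pivot_specs and the sample data are hypothetical, and each product is assumed to have already been reduced to a title plus a list of (name, value) pairs. It uses a dict lookup instead of list.index to find or create a column.

```python
def pivot_specs(products):
    """Turn [(title, [(name, value), ...]), ...] into a header row
    followed by one row per product."""
    header = ["Title"]
    column_of = {}  # spec name -> column index in header
    rows = []
    for title, specs in products:
        row = {0: title}
        for name, value in specs:
            name = name.rstrip(":")  # drop trailing colons from the name
            if name not in column_of:
                # First time this name is seen: give it the next column
                column_of[name] = len(header)
                header.append(name)
            row[column_of[name]] = value
        rows.append(row)
    # Cells a product has no value for become empty strings
    table = [header]
    for row in rows:
        table.append([row.get(i, "") for i in range(len(header))])
    return table


# Hypothetical sample data, not taken from the site
sample = [
    ("Mask A", [("Actual Weight:", "0.5 lbs"), ("Warranty", "90 days")]),
    ("Mask B", [("Warranty:", "1 year"), ("Headgear", "Included")]),
]
grid = pivot_specs(sample)
for line in grid:
    print(line)
```

    The resulting list of lists can be handed to csv.writer(...).writerows(grid) or written row by row to a worksheet.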
    

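    【Comments】:

    • Do I have to add a try/except, because this does not work for some links?
    • For some links there is no table at all to get information from. The error occurs on the line 'table_rows = table.find_all('tr')' and mentions a NoneType object.
    • You should check if table first.
    • Or try print(table) to check the value of table.
    • I think table is None. Add an if table check, process the rows when it passes, and otherwise move on to the next link.
    【Solution 2】: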

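    Assuming the header is always the first row of each table, you only need to skip that row in every table except the first one. A simple way is to keep the index of the first row to process in a variable initialized to 0 and set it to 1 after the first call to the processing function. Possible code: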

    def cpap_spider(max_pages):
        page=1
        start_row = 0
        while page<=max_pages:
            ...
            for link in soup.findAll("a", {"class":"facets-item-cell-grid-title"}):
                ...
                each_item(href, start_row)
                start_row = 1        # only the first call to each_item gets start_row=0
                print(href)
                #print(title)
            page+=1
    ...
    def each_item(item_url, start_row):
        ...    
        table_rows= table.find_all('tr')
        for row in table_rows[start_row:]:        # skip first row if start_row==1
            ...
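
    The slicing idea can be tried without any scraping. This is a small sketch with hypothetical data and a hypothetical helper, append_table, and it relies on the same assumption as the answer: every table repeats the same header as its first row.

```python
def append_table(all_rows, table_rows, start_row):
    """Append table_rows to all_rows, skipping the first start_row rows."""
    all_rows.extend(table_rows[start_row:])


# Hypothetical tables that each begin with the same header row
tables = [
    [["Name", "Weight"], ["Mask A", "0.5 lbs"]],
    [["Name", "Weight"], ["Mask B", "0.4 lbs"]],
]

combined = []
start_row = 0
for table_rows in tables:
    append_table(combined, table_rows, start_row)
    start_row = 1  # every later table skips its header row

for line in combined:
    print(line)
```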
    

    【Comments】:
