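【Question title】: How to format a scraper output
【Posted】: 2019-08-04 19:56:44
【Question】:

I am trying to extract prices from a site in order to build a scraper, and I wrote the program below. To get the full HTML I used BeautifulSoup with the default html.parser. I then tried to narrow the data down with a variable called generale, which is set to soup.findAll("span"). Now I need to clean up the result further (a list, I think) to get just the prices, and I am stuck. Any suggestions? I don't know how to approach the problem.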

import smtplib
import time

import requests
from bs4 import BeautifulSoup as bs

URL = "https://www.allkeyshop.com/blog/buy-battlefield-5-cd-key-compare-prices/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"}

def Check_page1():
    page = requests.get(URL, headers=headers)
    soup = bs(page.content, 'html.parser')

    # every <span> on the page, not only the ones holding prices
    generale = soup.findAll('span')

    # placeholder: this is the step I am stuck on
    price = None

    print(price)
    print(generale)

Check_page1()

【Comments】:

    Tags: python python-3.x


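    【Solution 1】:

    If you look at the page source, you can see that the prices you are looking for sit in <span> elements with the class name price, so you can parse them like this: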

    import time
    
    import requests
    from bs4 import BeautifulSoup as bs
    
    URL = "https://www.allkeyshop.com/blog/buy-battlefield-5-cd-key-compare-prices/"
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"}
    
    def CheckPage1():
        page = requests.get(URL, headers=headers)
        soup = bs(page.content, 'html.parser')
    
        # all spans with prices
        span_prices = soup.findAll("span", {"class": "price"})
    
        # to get all prices you need to extract text or content attribute
        for span in span_prices:
            price = span.text
            # remove whitespace and print price
            print(price.strip())
    
            # to get prices without money sign uncomment one of those lines
            # print(price.strip()[:-1])
            # print(price.strip().strip('€'))
    
    CheckPage1()
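
The commented-out lines in the answer strip the currency sign but still leave a string. If numbers are needed (for example to find the cheapest offer), the stripped text can be converted to float. This is a minimal sketch on made-up sample strings; the helper name parse_price and the sample values are assumptions for illustration and are not part of the original answer.

```python
def parse_price(text, symbol="€"):
    """Turn a scraped string such as ' 10.56€ ' into a float."""
    # Drop surrounding whitespace, then the currency sign, then any
    # whitespace that was sitting between the number and the sign.
    cleaned = text.strip().strip(symbol).strip()
    return float(cleaned)


# Hypothetical strings, shaped like the span.text values printed above.
raw_prices = [" 10.56€ ", "2.79€", "27.86 €"]

prices = [parse_price(p) for p in raw_prices]
print(prices)
print(min(prices))  # lowest price in the sample
```

Note that float() raises ValueError on text it cannot parse, so a real page with entries such as "Out of stock" would need that case handled.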
    

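    【Comments】:

    • Given that I have only just started coding in Python and don't know how all the parameters work, I have four more questions. How do I know how to use the commands correctly? In the call soup.findAll("span", {"class": "price"}), what is {"class": "price"} actually used for? Does it find all the spans and then pick out the ones with class="price"? Why the curly braces? What does the colon in between mean? Sorry, I'm curious and want to improve my knowledge.
    • One more thing: you use a for loop to go through the individual spans. Can span_prices be edited directly? What kind of variable is span_prices (a list, I suppose)?
    • If you want to start parsing web pages, you need some knowledge of HTML and CSS, so work through a few web tutorials online. If you don't know what curly braces mean, I suggest watching or reading some Python tutorials as well.
    • As for the answers: 1. Commands: you have to know what you want to get from the page, and then you try to find it. 2. It means: find all <span> elements on the given page that have class="price". 3. Something like that, yes. 4. Curly braces: a dictionary in Python. 5. The colon separates a key from its value inside the dictionary (function arguments are separated by commas). See also the docs for bs4.
    • What do you mean by editing span_prices? BeautifulSoup does not let you modify (edit) the page on the site; it is for parsing web pages easily to get data quickly. And yes, find_all returns a list of all spans matching the given criteria (the class name). You then get the text from <span>.text. Have a look at some BeautifulSoup examples and tutorials online as well. Good luck :)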
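
The points raised in these comments can be tried without touching the network. The sketch below is only an illustration: SAMPLE_HTML is a made-up snippet standing in for the real page, and find_all is used as the current spelling of findAll. It shows the dictionary acting as an attribute filter and the result behaving like a list that can be turned into a new list of strings.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page.
SAMPLE_HTML = """
<div>
  <span class="price">10.56€</span>
  <span class="label">Steam</span>
  <span class="price">2.79€</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The dict {"class": "price"} is a filter on attributes:
# keep only <span> tags whose class attribute is "price".
span_prices = soup.find_all("span", {"class": "price"})

# The result behaves like a list of Tag objects.
print(len(span_prices))

# "Editing" the result usually means building a new list from it.
price_texts = [span.text.strip() for span in span_prices]
print(price_texts)
```

The <span class="label"> element is skipped because its class does not match the filter.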
    【Solution 2】:

    There does not seem to be a <span class="price">. This is what I did.

    In [1]: import requests 
       ...:  
       ...: URL = "https://www.allkeyshop.com/blog/buy-battlefield-5-cd-key-compare-prices/" 
       ...:  
       ...: headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"} 
    Out[1]: {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
    
    In [2]: page = requests.get(URL, headers=headers)                                                        
    Out[2]: <Response [200]>
    
    In [3]: import re                                                                                        
    
    In [4]: re.findall(r'<span.*?</span>', page.text)
    

    There are a lot of spans. To me, the following look most like prices.

     '<span class="topclick-list-element-price">10.56&euro;</span>',
     '<span class="topclick-list-element-price">2.79&euro;</span>',
     '<span class="topclick-list-element-price">2.90&euro;</span>',
     '<span class="topclick-list-element-price">27.86&euro;</span>',
     '<span class="topclick-list-element-price">11.15&euro;</span>',
     '<span class="topclick-list-element-price">11.46&euro;</span>'
    

    So I refined the regular expression:

    In [7]: prices = [float(p) for p in re.findall(r'<span class="topclick-list-element-price">(.*)&euro;</span>', page.text)]
    
    In [8]: print(prices)                                                                                    
    [10.56, 2.79, 2.9, 27.86, 11.15, 11.46, 11.2, 18.67, 9.69, 24.25,
    20.25, 19.59, 44.21, 28.3, 31.92, 41.39, 4.76, 24.57, 8.75, 28.62, 
    27.14, 8.52, 31.95, 24.59, 27.93, 27.86, 5.5, 24.99, 37.99, 14.27, 
    36.0, 8.75, 35.99, 37.34, 23.4, 22.98, 31.95, 36.89, 25.57, 27.9, 
    35.88, 41.39, 33.22, 42.29, 31.29, 42.29, 38.09, 33.89, 33.59, 28.83,
    10.56, 2.79, 2.9, 27.86, 11.15, 11.46, 11.2, 18.67, 9.69, 24.25, 
    20.25, 19.59, 44.21, 28.3, 31.92, 41.39, 4.76, 24.57, 8.75, 28.62, 
    27.14, 8.52, 31.95, 24.59, 27.93, 27.86, 5.5, 24.99, 37.99, 14.27, 
    36.0, 8.75, 35.99, 37.34, 23.4, 22.98, 31.95, 36.89, 25.57, 27.9, 
    35.88, 41.39, 33.22, 42.29, 31.29, 42.29, 38.09, 33.89, 33.59, 28.83, 
    24.25, 12.11, 28.84, 37.36, 23.71, 2.19, 2.99, 34.25, 11.38, 14.99, 
    20.67, 4.99, 25.56, 1.81, 12.99, 19.73, 9.99, 9.99, 0.92, 11.99, 
    27.93, 22.94, 8.46, 32.78, 40.03, 11.19, 12.45, 13.29, 13.9, 26.22, 
    26.22, 23.34, 25.22, 32.78, 37.36, 21.5, 19.01, 26.53, 24.91, 17.96, 
    35.4, 17.05, 21.56, 16.39, 35.4, 8.98, 65.54, 13.45, 15.73, 22.39, 
    17.99, 40.17, 8.0, 11.34, 14.99, 17.99, 10.99, 24.99, 22.41, 17.99, 
    40.17, 7.2, 49.99, 41.1, 39.85, 16.99, 19.99, 21.99, 10.99, 19.73, 
    14.99, 22.39, 6.55, 32.98, 27.99, 29.89, 19.99, 29.99, 37.36, 19.99, 
    35.49, 15.99, 21.99, 46.71, 15.72, 42.97, 18.68, 18.87, 15.72, 19.99,
    29.99, 9.99, 28.02, 35.99, 39.99, 15.72, 15.72, 9.33, 44.48, 47.99, 
    43.99, 47.99, 38.8, 23.27, 20.69, 44.6, 41.97, 15.75, 44.49, 19.87, 
    51.99, 36.89, 15.99, 39.99, 27.99, 11.58, 43.99, 41.1, 19.99, 43.64, 
    19.99, 36.89, 25.69]
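
The same regular-expression idea can be written as a small reusable function. The sketch below runs on a made-up HTML string rather than the live page, and the names SAMPLE_HTML, PRICE_SPAN and extract_prices are assumptions for illustration. It differs from the session above in two ways: the capture group is non-greedy, (.*?), so several spans on one line are matched separately, and the &euro; entity is decoded with html.unescape before the sign is stripped.

```python
import html
import re

# Hypothetical markup, with all spans on one line on purpose.
SAMPLE_HTML = (
    '<span class="topclick-list-element-price">10.56&euro;</span>'
    '<span class="topclick-list-element-price">2.79&euro;</span>'
    '<span class="other">not a price</span>'
    '<span class="topclick-list-element-price">2.90&euro;</span>'
)

# Non-greedy group: stop at the first closing </span>.
PRICE_SPAN = re.compile(
    r'<span class="topclick-list-element-price">(.*?)</span>'
)


def extract_prices(page_text):
    """Return the prices found in page_text as a list of floats."""
    prices = []
    for raw in PRICE_SPAN.findall(page_text):
        text = html.unescape(raw)  # "&euro;" becomes "€"
        prices.append(float(text.strip().rstrip("€")))
    return prices


prices = extract_prices(SAMPLE_HTML)
print(prices)
```

With a real response, the call would be extract_prices(page.text). The class name is tied to the site's markup at the time of the answer, so it may need adjusting if the page layout has changed.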
    

    【Comments】:
