【Question title】: How to scrape data from flexbox element/container with Python and Beautiful Soup
【Posted】: 2020-09-21 17:25:46
【Question description】:

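I am trying to scrape data from a utility website using Python, Beautiful Soup, and Selenium. The data I want includes the time, cause, status, and so on. When I make a typical page request, parse the page, look up the data I want (the contents of id="OutageListTable"), and print it, neither the div nor its strings are found. When I inspect the page elements, the data is there, but it sits inside a flex container.

Here is the code I am using: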

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib3
from selenium import webdriver

my_url = 'https://www.pse.com/outage/outage-map'

browser = webdriver.Firefox()
browser.get(my_url)

html = browser.page_source
page_soup = soup(html, features='lxml')

outage_list = page_soup.find(id='OutageListTable')
print(outage_list)

browser.quit()

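How can I retrieve the information inside the flex/flexbox container? I have not found any resources online that help me figure this out.

【Question comments】:

    Tags: python selenium beautifulsoup flexbox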


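    【Solution 1】:

    You are overthinking the problem. First, there is no flexbox container to deal with. This is simply a matter of targeting the right div class. Look for the div with class_='col-xs-12 col-sm-6 col-md-4 listView-container'.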

    from time import sleep

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    base_url = 'https://www.pse.com/outage/outage-map'

    # create object for chrome options
    chrome_options = Options()
    chrome_options.add_argument('disable-notifications')
    chrome_options.add_argument('start-maximized')
    chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
    # To disable the message, "Chrome is being controlled by automated test software"
    chrome_options.add_argument('--disable-infobars')
    # Pass the argument 1 to allow and 2 to block
    chrome_options.add_experimental_option("prefs", {
        "profile.default_content_setting_values.notifications": 2
    })

    # invoke the webdriver (Selenium 4 takes the driver path via Service;
    # Selenium 3 used the executable_path argument instead)
    browser = webdriver.Chrome(
        service=Service(r'C:/Users/username/Documents/playground_python/chromedriver.exe'),
        options=chrome_options)
    browser.get(base_url)
    delay = 5  # seconds

    try:
        # Re-read the outage list until interrupted with Ctrl+C;
        # add a `break` after the inner for loop to scrape only once.
        while True:
            try:
                # Wait until the JavaScript-rendered list is present in the DOM
                WebDriverWait(browser, delay).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.listView-container')))
                print("Page is ready")
                sleep(5)
                html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
                soup = BeautifulSoup(html, "html.parser")
                for item_n in soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-4 listView-container'):
                    for item_n_text in item_n.find_all(name="span"):
                        print(item_n_text.text)
            except TimeoutException:
                print("Loading took too much time!-Try again")
    except KeyboardInterrupt:
        pass
    finally:
        # close the automated browser
        browser.quit()

    Prints (excerpt):
    
    Cause: 
    Accident
    Status: 
    Crew assigned
    Last updated: 
    06/02 11:00 PM
    9. Woodinville
    Start time: 
    06/02 08:29 PM
    Est. restoration time: 
    06/03 03:30 AM
    Customers impacted: 
    2
    Cause: 
    Under Investigation
    Status: 
    Crew assigned
    Last updated: 
    06/03 12:15 AM
    Page is ready
    1. Bellingham
    Start time: 
    06/02 06:09 PM
    Est. restoration time: 
    06/03 06:30 AM
    Customers impacted: 
    1
    Cause: 
    Trees/Vegetation
    Status: 
    Crew assigned
    Last updated: 
    06/02 11:50 PM
    2. Deming
    Start time: 
    06/02 07:10 PM
    Est. restoration time: 
    06/03 03:30 AM
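
The printed span texts arrive as a flat stream of labels and values. As a rough sketch only (the helper name `group_outage_lines` and the `Location` key are made up for illustration and are not part of the answer), the lines could be grouped into one dictionary per outage, assuming each record starts with a numbered heading such as "1. Bellingham" and each label ends with a colon:

```python
import re

# A numbered heading such as "9. Woodinville" starts a new outage record.
HEADING = re.compile(r"^\d+\.\s+(?P<place>.+)$")


def group_outage_lines(lines):
    """Group flat span texts into one dict per outage (hypothetical helper)."""
    records = []
    current = None        # record currently being filled
    pending_label = None  # label waiting for its value line
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        heading = HEADING.match(text)
        if heading and pending_label is None:
            current = {"Location": heading.group("place")}
            records.append(current)
        elif current is None:
            # Lines before the first heading belong to a record whose
            # start was cut off, so they are skipped.
            continue
        elif text.endswith(":"):
            pending_label = text[:-1]
        elif pending_label is not None:
            current[pending_label] = text
            pending_label = None
        # Anything else (for example "Page is ready") is ignored.
    return records


# Sample lines copied from the output above.
sample_lines = [
    "Cause: ", "Accident", "Page is ready",
    "1. Bellingham", "Start time: ", "06/02 06:09 PM",
    "Est. restoration time: ", "06/03 06:30 AM",
    "Customers impacted: ", "1", "Cause: ", "Trees/Vegetation",
    "2. Deming", "Start time: ", "06/02 07:10 PM",
]

for record in group_outage_lines(sample_lines):
    print(record)
```

In the scraping loop, the same helper could be fed the list of `item_n_text.text` values instead of printing them one by one.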
    

    【Comments】:

      【Solution 2】:

      The data is loaded dynamically via JavaScript. You can use the requests module to fetch it from the JSON endpoint.

      For example:

      import json
      import requests
      
      url = 'https://www.pse.com/api/sitecore/OutageMap/AnonymoussMapListView'
      
      data = requests.get(url).json()
      
      # uncomment this to print all data:
      #print(json.dumps(data, indent=4))
      
      for d in data['PseMap']:
          print('{} - {}'.format(d['DataProvider']['PointOfInterest']['Title'], d['DataProvider']['PointOfInterest']['MapType']))
          for info in d['DataProvider']['Attributes']:
              print(info['Name'], info['Value'])
          print('-' * 80)
      

      Prints:

      Bellingham - Outage
      Start time 06/02 06:09 PM
      Est. restoration time 06/03 06:30 AM
      Customers impacted 1
      Cause Trees/Vegetation
      Status Crew assigned
      Last updated 06/02 11:50 PM
      --------------------------------------------------------------------------------
      Deming - Outage
      Start time 06/02 07:10 PM
      Est. restoration time 06/03 03:30 AM
      Customers impacted 568
      Cause Accident
      Status Repair crew onsite
      Last updated 06/02 11:50 PM
      --------------------------------------------------------------------------------
      Everest - Outage
      Start time 06/02 10:42 AM
      Customers impacted 4
      Cause Scheduled Outage
      Status Repair crew onsite
      Last updated 06/02 10:50 AM
      --------------------------------------------------------------------------------
      Kenmore - Outage
      Start time 06/02 09:59 PM
      Est. restoration time 05/29 01:00 AM
      Customers impacted 2
      Cause Scheduled Outage
      Status Repair crew onsite
      Last updated 06/02 10:05 PM
      --------------------------------------------------------------------------------
      Kent - Outage
      Start time 06/02 06:43 PM
      Est. restoration time To Be Determined
      Customers impacted 26
      Cause Car/Equip Accident
      Status Waiting for repairs
      Last updated 06/02 10:15 PM
      --------------------------------------------------------------------------------
      Kent - Outage
      Start time 06/02 10:09 PM
      Est. restoration time To Be Determined
      Customers impacted 13
      Cause Under Investigation
      Status Repair crew onsite
      Last updated 06/02 10:15 PM
      --------------------------------------------------------------------------------
      Northwest Bellevue - Outage
      Start time 06/02 11:28 PM
      Est. restoration time To Be Determined
      Customers impacted 14
      Cause Under Investigation
      Status Repair crew onsite
      Last updated 06/02 11:30 PM
      --------------------------------------------------------------------------------
      Pacific - Outage
      Start time 06/02 06:19 PM
      Est. restoration time 06/03 02:30 AM
      Customers impacted 3
      Cause Accident
      Status Crew assigned
      Last updated 06/02 11:00 PM
      --------------------------------------------------------------------------------
      Woodinville - Outage
      Start time 06/02 08:29 PM
      Est. restoration time 06/03 03:30 AM
      Customers impacted 2
      Cause Under Investigation
      Status Crew assigned
      Last updated 06/03 12:15 AM
      --------------------------------------------------------------------------------
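
If the goal is a table rather than printed text, the nested JSON can be flattened into one row per outage. The sketch below is only an illustration: `flatten_outages` and `rows_to_csv` are hypothetical helper names, and `sample_data` is a hand-built stand-in that mirrors just the keys used in the answer (`PseMap`, `DataProvider`, `PointOfInterest`, `Attributes`), so the real response may contain more fields. It also shows one way to cope with outages that lack an attribute, such as the Everest entry with no estimated restoration time.

```python
import csv
import io


def flatten_outages(data):
    """Turn the nested payload into a list of flat dicts, one per outage."""
    rows = []
    for entry in data.get("PseMap", []):
        provider = entry["DataProvider"]
        poi = provider["PointOfInterest"]
        row = {"Title": poi["Title"], "MapType": poi["MapType"]}
        for attribute in provider.get("Attributes", []):
            row[attribute["Name"]] = attribute["Value"]
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, leaving missing attributes empty."""
    # Collect the union of column names in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-built sample using values from the printed output above (assumption:
# the real payload has this shape for the keys the answer reads).
sample_data = {"PseMap": [
    {"DataProvider": {
        "PointOfInterest": {"Title": "Bellingham", "MapType": "Outage"},
        "Attributes": [
            {"Name": "Start time", "Value": "06/02 06:09 PM"},
            {"Name": "Est. restoration time", "Value": "06/03 06:30 AM"},
            {"Name": "Customers impacted", "Value": "1"},
        ]}},
    {"DataProvider": {
        "PointOfInterest": {"Title": "Everest", "MapType": "Outage"},
        "Attributes": [
            {"Name": "Start time", "Value": "06/02 10:42 AM"},
            {"Name": "Customers impacted", "Value": "4"},
        ]}},
]}

csv_text = rows_to_csv(flatten_outages(sample_data))
print(csv_text)
```

With live data, `flatten_outages(requests.get(url).json())` would take the place of the sample payload.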
      

      【Comments】:
