【Question title】: How to collect specific data from HTML using Selenium Python
【Posted】: 2021-08-26 23:35:42
【Question】:

I am trying to create a weather forecast by scraping a web page. (My previous question)

My code:

import time
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from keyboard import press_and_release



def weather_forecast2():
    print('Hello, I can search up the weather for you.')
    while True:
        inp = input('Where shall I search? Enter a place :').capitalize()
        print('Alright, checking the weather in ' + inp + '...')

        URL = 'https://www.yr.no/nb'

        "Search for a place"
        driver = webdriver.Edge()  # Open Microsoft Edge
        driver.get(URL)  # Goes to the HTML-page of the given URL
        element = driver.find_element_by_id("søk")  # Find the search input box
        element.send_keys(inp)  # Enter input
        press_and_release('enter')  # Click enter

        cURL = driver.current_url  # Current URL

        "Find data"
        driver.get(cURL)  # Goes to the HTML-page that appeared after clicking button
        r = requests.get(cURL)  # Get request for contents of the page
        print(r.content)  # Outputs HTML code for the page
        soup = BeautifulSoup(r.content, 'html5lib')  # Parse the data with BeautifulSoup(HTML-string, HTML-parser)

I want to collect the temperatures from the page. I know that the XPaths of the elements I am looking for are

//*[@id="dailyWeatherListItem0"]/div[2]/div[1]/span[2]/span[1]/text()
//*[@id="dailyWeatherListItem0"]/div[2]/div[1]/span[2]/span[3]/text()
//*[@id="dailyWeatherListItem1"]/div[2]/div[1]/span[2]/span[1]/text()
//*[@id="dailyWeatherListItem1"]/div[2]/div[1]/span[2]/span[3]/text()
//*[@id="dailyWeatherListItem2"]/div[2]/div[1]/span[2]/span[1]/text()
//*[@id="dailyWeatherListItem2"]/div[2]/div[1]/span[2]/span[3]/text()
//*[@id="dailyWeatherListItem3"]/div[2]/div[1]/span[2]/span[1]/text()
//*[@id="dailyWeatherListItem3"]/div[2]/div[1]/span[2]/span[3]/text()

// and so on...

Basically, I want to collect the following two elements nine times:

//*[@id="dailyWeatherListItem{NUMBER 0-8}"]/div[2]/div[1]/span[2]/span[1]/text()
//*[@id="dailyWeatherListItem{NUMBER 0-8}"]/div[2]/div[1]/span[2]/span[3]/text()

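How can I use driver.find_element_by_xpath to do this? Or is there a more efficient function?

【Comments】:

  • Could you include the definition of press_and_release, along with all the necessary import statements?

Tags: python html selenium-webdriver web-scraping beautifulsoup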


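【Solution 1】:

Assuming you can retrieve the URL correctly, you can use it as the referer header, together with the location ID contained in that URL, to call the API that actually returns the forecast. I don't have your definition of press_and_release, so I tested the code without it.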

import re

import requests
from keyboard import press_and_release  # same import as in the question
from selenium import webdriver

# url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/2-6058560/Canada/Ontario/London'

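def get_forecast(url: str) -> object:
    """Call the forecast API for the location ID embedded in a daily-table URL."""
    # The location ID is the path segment right after 'daglig-tabell/', e.g. '2-6058560'
    location_id = re.search(r'daglig-tabell/(.*?)/', url).group(1)
    headers = {'user-agent': 'Mozilla/5.0', 'referer': url}
    forecasts = requests.get(f'https://www.yr.no/api/v0/locations/{location_id}/forecast', headers=headers).json()
    return forecasts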


def get_forecast_url():
    
    print('Hello, I can search up the weather for you.')

    driver = webdriver.Chrome()  # Opens Chrome (the question used webdriver.Edge())

    while True:

        inp = input('Where shall I search? Enter a place :').capitalize()
        print('Alright, checking the weather in ' + inp + '...')

        URL = 'https://www.yr.no/nb'

        "Search for a place"

        driver.get(URL)  # Goes to the HTML-page of the given URL
        driver.find_element_by_id("page-header__search-button").click() #open search 
        # Find the search input box
        element = driver.find_element_by_id("page-header__search-input")
        element.send_keys(inp)  # Enter input
        press_and_release('enter')  # Click enter

        cURL = driver.current_url  # Current URL
        print(get_forecast(cURL))

    driver.quit()
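
On the XPath part of the question, here is a minimal sketch, not tested against the live page. It assumes the element layout given in the question and a driver that exposes `find_element(by, value)` as in Selenium 4, where the string "xpath" is the value of `By.XPATH`. The names `XPATH_TEMPLATE`, `temperature_xpaths` and `collect_temperatures` are made up for illustration. The trailing `/text()` is dropped because Selenium locates elements rather than text nodes; the text is read from the element's `.text` attribute instead.

```python
# XPath layout taken from the question (an assumption about the page);
# the trailing /text() is dropped because Selenium returns elements, not text nodes.
XPATH_TEMPLATE = '//*[@id="dailyWeatherListItem{day}"]/div[2]/div[1]/span[2]/span[{span}]'


def temperature_xpaths(days=9, spans=(1, 3)):
    """Build (day, span, xpath) tuples for list items 0..days-1."""
    return [
        (day, span, XPATH_TEMPLATE.format(day=day, span=span))
        for day in range(days)
        for span in spans
    ]


def collect_temperatures(driver, days=9):
    """Read the text of each temperature element, grouped by day index.

    `driver` is assumed to behave like a Selenium 4 WebDriver.
    """
    readings = {}
    for day, span, xpath in temperature_xpaths(days):
        element = driver.find_element("xpath", xpath)  # "xpath" == By.XPATH
        readings.setdefault(day, []).append(element.text)
    return readings
```

With a live driver this would be called as `collect_temperatures(driver)` once the forecast page has loaded. If the list is rendered by JavaScript, an explicit wait may be needed before the elements can be found.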

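【Comments】:

  • I have now modified my code, but I need to install ChromeDriverManager to be able to open Chrome, so I added:
  • OK. You can switch back to Edge by changing the commented line,
  • Sorry, I see that my comment was cut off. I got a WebDriverException: Message: 'MicrosoftWebDriver.exe' executable needs to be in PATH. Please download from go.microsoft.com/fwlink/?LinkId=619687, so I switched to Chrome to see whether it would work. I got the same exception, so I installed ChromeDriverManager. But now, when I write the line driver = webdriver.Chrome(ChromeDriverManager.install()) to open Google Chrome, I get a TypeError: install() missing 1 required positional argument: 'self'
  • I tried looking at this question: stackoverflow.com/questions/17534345/…, but I don't know how to solve my problem
  • Your chromedriver etc. needs to be on the environment PATH; that is, the folder containing it needs to be on the PATH, or it needs to be in the folder you point to with the executable path argument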
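
The TypeError reported in the comments is what Python raises when an instance method is called on the class itself, so no object is passed as `self`. Below is a minimal sketch of that behaviour using a stand-in class, together with a standard-library check for whether a driver executable is on PATH. `FakeDriverManager`, its return value, and the helper names are hypothetical; the class only mimics the shape of the call and is not the real ChromeDriverManager.

```python
import shutil


class FakeDriverManager:
    """Hypothetical stand-in with an instance method named install()."""

    def install(self):
        # A real manager would fetch a driver; this returns a made-up path.
        return "/tmp/fake-driver"


def try_install_on_class():
    """Call install() on the class, as in the failing line, and return the error text."""
    try:
        FakeDriverManager.install()  # no instance, so `self` is missing
    except TypeError as exc:
        return str(exc)
    return None


def driver_on_path(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


print(try_install_on_class())         # the TypeError message
print(FakeDriverManager().install())  # instantiate first, then call
print(driver_on_path())               # None if chromedriver is not on PATH
```

By analogy, creating an instance first, as in `ChromeDriverManager().install()`, should avoid the error, assuming the package in use is webdriver_manager.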