【Question title】: How to scrape content from a div class based on the data-automation attribute in Python using BeautifulSoup?
【Posted】: 2020-02-07 19:31:50
【Question】:

I am trying to scrape a dynamic page using BeautifulSoup. I reach that page from https://www.nemlig.com/ with the help of Selenium (thanks to @cruisepandey for the code suggestion), as follows:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from bs4 import BeautifulSoup


driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
wait = WebDriverWait(driver,10)

driver.maximize_window()
driver.get("https://www.nemlig.com/")

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys('2300')  
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()

This takes me to the page I want to scrape.

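More precisely, at this point I want to scrape the rows on the right side of the page. If you look at the HTML behind them, you will notice that the div class time-block__row carries 3 different data-automation attributes, one for each of the 3 main times of the day.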

<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
                            <div class="time-block__row-header">Formiddag</div>

                            <div class="no-timeslots ng-hide" ng-show="$ctrl.timeslotDays[$ctrl.selectedDateIndex].morningHours == 0">
                                Ingen levering..
                            </div>

                            <!----><!----><div class="time-block__item duration-1 disabled" ng-repeat="item in $ctrl.selectedHours track by $index" ng-if="item.StartHour >= 0 &amp;&amp; item.StartHour < 12" ng-click="$ctrl.setActiveTimeslot(item, $index)" ng-class="['duration-1', {'cheapest': item.IsCheapHour, 'event': item.IsEventSlot, 'selected': $ctrl.selectedTimeId == item.Id || $ctrl.selectedTimeIndex == $index, 'disabled': item.isUnavailable()}]" data-automation="notActiveSltTmSlt">

                                <div class="time-block__inner-container">
                <div class="time-block__time">8-9</div>
                <div class="time-block__attributes">
                  <!----></div>
                                    <div class="time-block__cost">29&nbsp;kr.</div>

So Formiddag (morning) has data-automation = "beforDinnerRowTmSlt", Eftermiddag (afternoon) has data-automation = "afternoonRowTmSlt", and Aften (evening) has data-automation = "eveningRowTmSlt".

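page_source = driver.page_source  # wait.until() expects a callable, so read the page source directly
soup = BeautifulSoup(page_source, 'html.parser')  # name the parser explicitly

time_of_the_day = soup.find('div', class_='time-block__row').text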
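  • The problem

With the code above, time_of_the_day only contains the information from the morning row.

How can I scrape these rows properly using the data-automation attribute? How can I access the other 2 div classes and their child divs? My plan is to build a dataframe containing the following: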

Time_of_the_day          Hours          Price        Day
Formiddag                8-9            29kr.        Tor. 10/10
....                     ....           ....         ....
Eftermiddag              12-13          29kr.        Tor. 10/10
....                     ....           ....         ....

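The Day column will contain the output of: day = soup.find('div', class_='content').text

I know this is a long post, but I hope I have made the task easy to understand and that you will be able to help with advice, tips, or code!

【Question comments】:

    Tags: python html web-scraping beautifulsoup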


    【Solution 1】:

    Here is the code to get all of those values.

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup
    import time
    import pandas as pd
    
    driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
    wait = WebDriverWait(driver,10)
    driver.maximize_window()
    driver.get("https://www.nemlig.com/")
    
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys('2300')
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()
    time.sleep(3)
    soup=BeautifulSoup(driver.page_source,'html.parser')
    time_of_day=[]
    price=[]
    Hours=[]
    day=[]
    for morn in soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'):
        time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
        Hours.append(morn.text)
        price.append(morn.find_next(class_="time-block__cost").text)
        day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    
    df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
    print(df)
    
    time_of_day=[]
    price=[]
    Hours=[]
    day=[]
    
    for after in soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'):
        time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
        Hours.append(after.text)
        price.append(after.find_next(class_="time-block__cost").text)
        day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    
    df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
    print(df)
    
    time_of_day=[]
    price=[]
    Hours=[]
    day=[]
    
    for evenin in soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'):
        time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
        Hours.append(evenin.text)
        price.append(evenin.find_next(class_="time-block__cost").text)
        day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    
    df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
    print(df)
    

    Output:

             Day  Hours   price time_of_day
    0  fre. 11/10    8-9  29 kr.   Formiddag
    1  fre. 11/10   9-10  29 kr.   Formiddag
    2  fre. 11/10  10-11  39 kr.   Formiddag
    3  fre. 11/10  11-12  39 kr.   Formiddag
              Day  Hours   price  time_of_day
    0  fre. 11/10  12-13  29 kr.  Eftermiddag
    1  fre. 11/10  13-14  29 kr.  Eftermiddag
    2  fre. 11/10  14-15  29 kr.  Eftermiddag
    3  fre. 11/10  15-16  29 kr.  Eftermiddag
    4  fre. 11/10  16-17  29 kr.  Eftermiddag
    5  fre. 11/10  17-18  19 kr.  Eftermiddag
              Day  Hours   price time_of_day
    0  fre. 11/10  18-19  29 kr.       Aften
    1  fre. 11/10  19-20  19 kr.       Aften
    2  fre. 11/10  20-21  29 kr.       Aften
    3  fre. 11/10  21-22  19 kr.       Aften
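
The code above prints three separate frames. On the pd.append question raised in the comments below: DataFrame.append was removed in pandas 2.0, and pd.concat is the usual way to stack frames with identical columns. The following is only a sketch under that assumption; the three small frames are hypothetical stand-ins for the morning, afternoon, and evening frames, not scraped data.

```python
import pandas as pd

# Hypothetical stand-ins for the three per-row frames printed above.
columns = ["time_of_day", "Hours", "price", "Day"]
morning = pd.DataFrame([["Formiddag", "8-9", "29 kr.", "fre. 11/10"]], columns=columns)
afternoon = pd.DataFrame([["Eftermiddag", "12-13", "29 kr.", "fre. 11/10"]], columns=columns)
evening = pd.DataFrame([["Aften", "18-19", "29 kr.", "fre. 11/10"]], columns=columns)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# each frame's own index (which would be 0, 0, 0 here).
combined = pd.concat([morning, afternoon, evening], ignore_index=True)
print(combined)
```

Collecting all rows first and building the frame once is usually simpler than concatenating afterwards, but pd.concat is handy when the per-row frames already exist.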
    

    Edited

    soup=BeautifulSoup(driver.page_source,'html.parser')
    time_of_day=[]
    price=[]
    Hours=[]
    day=[]
    disabled=[]
    
    for morn,d in zip(soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__item')):
    
        time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
        Hours.append(morn.text)
        price.append(morn.find_next(class_="time-block__cost").text)
        day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
        if 'disabled' in d['class']:
            disabled.append('1')
        else:
            disabled.append('0')
    
    for after,d in zip(soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__item')):
        time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
        Hours.append(after.text)
        price.append(after.find_next(class_="time-block__cost").text)
        day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
        if 'disabled' in d['class']:
            disabled.append('1')
        else:
            disabled.append('0')
    
    for evenin,d in zip(soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__item')):
        time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
        Hours.append(evenin.text)
        price.append(evenin.find_next(class_="time-block__cost").text)
        day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
        if 'disabled' in d['class']:
            disabled.append('1')
        else:
            disabled.append('0')
    
    df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day,"Disabled" : disabled})
    print(df)
    

    Output

               Day Disabled  Hours   price  time_of_day
    0   fre. 11/10        1    8-9  29 kr.    Formiddag
    1   fre. 11/10        1   9-10  29 kr.    Formiddag
    2   fre. 11/10        0  10-11  39 kr.    Formiddag
    3   fre. 11/10        0  11-12  39 kr.    Formiddag
    4   fre. 11/10        0  12-13  29 kr.  Eftermiddag
    5   fre. 11/10        0  13-14  29 kr.  Eftermiddag
    6   fre. 11/10        0  14-15  19 kr.  Eftermiddag
    7   fre. 11/10        0  15-16  29 kr.  Eftermiddag
    8   fre. 11/10        0  16-17  29 kr.  Eftermiddag
    9   fre. 11/10        0  17-18  29 kr.  Eftermiddag
    10  fre. 11/10        0  18-19  29 kr.        Aften
    11  fre. 11/10        0  19-20  19 kr.        Aften
    12  fre. 11/10        0  20-21  29 kr.        Aften
    13  fre. 11/10        0  21-22  19 kr.        Aften
    

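    【Comments】:

    • Let me know if you are after a single dataframe; you can do that as well.
    • I really appreciate your help with all of this! Too bad I can only upvote you once... As for the dataframe, I am indeed looking for a single dataframe that contains all of these. I assume I would have to use something like pd.append to achieve that?
    • Give me an hour, I am outside right now. No append is needed. If you keep only the first set of list declarations, remove all the other list declarations and dataframes, and keep only the last dataframe, you will get that. If you can't manage it, I will be back in an hour.
    • One more thing - tell me if you can answer here or if I should post another question - I thought of adding a new column for when the hours...price is disabled and the div class name is time-block__item duration-1 disabled. Do you have any tips on how to achieve that as well? A column with 0 or 1 depending on whether it is disabled.
    • @Questieme: Updated the code with the disabled option.
    【Solution 2】:

    You can use soup.find_all: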

    from bs4 import BeautifulSoup as soup
    import re
    ... #rest of your current selenium code
    
    d = soup(driver.page_source, 'html.parser')
    r, _day = [[i.div.text, [['disabled' in k['class'], k.find_all('div', {'class':re.compile('time-block__time|time-block__cost')})] for k in i.find_all('div', {'class':'time-block__item'})]] for i in d.find_all('div', {'class':'time-block__row'})], d.find('div', {'class':'content'}).get_text(strip=True)
    new_r = [[a, [[int(j), *[i.text for i in b]] for j, b in k]] for a, k in r]
    new_data = [[a, *i, _day] for a, b in new_r for i in b]
    

    To convert the result to a dataframe:

    import pandas as pd
    df = pd.DataFrame([dict(zip(['Time_of_the_day', 'Disabled', 'Hours', 'Price', 'Day'], i)) for i in new_data])
    

    Output:

          Day  Disabled  Hours   Price Time_of_the_day
    0   fre.11/10         1    8-9  29 kr.       Formiddag
    1   fre.11/10         1   9-10  29 kr.       Formiddag
    2   fre.11/10         1  10-11  39 kr.       Formiddag
    3   fre.11/10         0  11-12  39 kr.       Formiddag
    4   fre.11/10         0  12-13  29 kr.     Eftermiddag
    ....
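
The nested comprehensions above are compact but hard to follow. An unrolled version of the same find_all idea might look like the sketch below. It assumes the same class names as the answer; SAMPLE_HTML is a trimmed, hypothetical stand-in for driver.page_source, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample standing in for driver.page_source.
SAMPLE_HTML = """
<div class="content">fre. 11/10</div>
<div class="time-block__row">
  <div class="time-block__row-header">Formiddag</div>
  <div class="time-block__item duration-1 disabled">
    <div class="time-block__time">8-9</div>
    <div class="time-block__cost">29 kr.</div>
  </div>
  <div class="time-block__item duration-1">
    <div class="time-block__time">11-12</div>
    <div class="time-block__cost">39 kr.</div>
  </div>
</div>
<div class="time-block__row">
  <div class="time-block__row-header">Eftermiddag</div>
  <div class="time-block__item duration-1">
    <div class="time-block__time">12-13</div>
    <div class="time-block__cost">29 kr.</div>
  </div>
</div>
"""

doc = BeautifulSoup(SAMPLE_HTML, "html.parser")
day = doc.find("div", class_="content").get_text(strip=True)

rows = []
# find() stops at the first match, which is why the question's code only saw
# the morning row; find_all() returns every time-block__row.
for block in doc.find_all("div", class_="time-block__row"):
    header = block.find("div", class_="time-block__row-header").get_text(strip=True)
    for item in block.find_all("div", class_="time-block__item"):
        rows.append([
            header,
            int("disabled" in item["class"]),  # class is a list of class names
            item.find("div", class_="time-block__time").get_text(strip=True),
            item.find("div", class_="time-block__cost").get_text(strip=True),
            day,
        ])

for row in rows:
    print(row)
```

Each entry in rows follows the column order Time_of_the_day, Disabled, Hours, Price, Day, so it can be passed to the dataframe conversion shown above in place of new_data.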
    

    【Comments】:

    • Thank you very much for your help! As I also asked @KunduK under his answer, I might have one more question related to this. Tell me if you can answer here or if I should post another question: I thought of adding a new column for when the hours...price is disabled and the div class name is time-block__item duration-1 disabled. Do you have any tips on how to achieve that as well? A column with 0 or 1 depending on whether it is disabled.