【Question title】: WebScrape A Field - Selenium/BeautifulSoup
【Posted】: 2021-09-29 09:20:21
【Question】:

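Reposting this, as the problem is still unresolved.

A website lists several rows of titles. Some of these titles (the ones shown in blue) expand when clicked and reveal more titles. An example is attached.

My goal is to run a single scrape that extracts all titles, dates, and times. If possible, I would also like all of the headers (for example, in row 1 it is where it says "On-demand").

Current code: it behaves inconsistently and does not collect all of the dropdown fields.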

from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')

new_titles = set()

productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown=driver.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(8)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                time.sleep(4)
                new_titles.add(title)

【Comments】:

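  • Does this answer your question? Selenium/Webscrape this field
  • It does not. It doesn't extract all of the dates and times or, if possible, all of the headers (an example in row 1 is where it says "On-demand"). I don't know how to extend the code you wrote to add this, as mentioned in the post.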

Tags: selenium web-scraping beautifulsoup request css-selectors


【Solution 1】:

I have tried to extract the data you need using BeautifulSoup.

This prints all the data you need, including the data in the dropdown lists.

import bs4 as bs
import requests


def scrape_sub_lists(s_url):
    # Fetch the page behind an expandable session and print its presentations.
    resp = requests.get(s_url)
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    main_div = soup.find('div', class_='item-content')
    divs = main_div.find_all('div', class_='card presentation')
    print('\n***** Sublist Data *****\n')
    for i in divs:
        print(i.find('span', attrs={'title': 'Session Name'}).text)
        print(i.find('h4', class_='card-title').text.strip())
        print(i.find('div', class_='details property-auto-width').find('div', class_='property').text)
        print('\n\n')
    print('\n***** End of Sublist Data *****\n')


# Only the first page of the session list is requested here.
url = 'https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p=1'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'html.parser')

# Each session row is a div carrying exactly this class string.
divs = soup.find_all('div', class_='card item-container session')
print(len(divs))


for i in divs:
    # The header is the Location if present, otherwise the Session Type.
    head = i.find('span', attrs={'title': 'Location'})
    if head is None:
        head = i.find('span', attrs={'title': 'Session Type'})
    header = head.text.strip()
    title = i.find('h4', class_='session-title card-title')
    title_name = title.text.strip()
    date = i.find('div', class_='internal_date').find('div', class_='property').text
    time = i.find('div', class_='internal_time').find('div', class_='property').text

    print(f'{header}\n{title_name}\n{date}\n{time}\n\n')

    # Scraping the drop-down data: expandable titles wrap a link to the sub-list page.
    a_exists = title.find('a', attrs={'class': 'item-expand-action expand'})
    if a_exists:
        scrape_sub_lists(a_exists['href'].strip())
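
If you would rather collect the rows than print them, the same lookups can be wrapped in a function that returns a list of dicts and tolerates missing elements. This is only a sketch: the names text_of, parse_sessions and SAMPLE are made up for illustration, and SAMPLE is a hand-written fragment that merely mimics the class names used in the code above, not real markup copied from the site.

```python
from bs4 import BeautifulSoup


def text_of(parent, name, **criteria):
    """Return the stripped text of the first matching descendant, or None."""
    if parent is None:
        return None
    tag = parent.find(name, **criteria)
    return tag.get_text(strip=True) if tag else None


def parse_sessions(html):
    """Turn the session cards found in an HTML string into a list of dicts."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for card in soup.find_all('div', class_='card item-container session'):
        # Same fallback as above: Location first, then Session Type.
        header = (text_of(card, 'span', attrs={'title': 'Location'})
                  or text_of(card, 'span', attrs={'title': 'Session Type'}))
        link = card.find('a', class_='item-expand-action expand')
        records.append({
            'header': header,
            'title': text_of(card, 'h4', class_='session-title card-title'),
            'date': text_of(card.find('div', class_='internal_date'),
                            'div', class_='property'),
            'time': text_of(card.find('div', class_='internal_time'),
                            'div', class_='property'),
            # URL of the dropdown page, or None when the row does not expand.
            'sub_url': link['href'].strip() if link else None,
        })
    return records


# Hypothetical fragment for demonstration only (assumed structure).
SAMPLE = '''
<div class="card item-container session">
  <span title="Session Type">On-demand</span>
  <h4 class="session-title card-title">Educational sessions on-demand</h4>
  <div class="internal_date"><div class="property">Thu, 16.09.2021</div></div>
  <div class="internal_time"><div class="property">08:30 - 09:40</div></div>
</div>
'''

if __name__ == '__main__':
    for row in parse_sessions(SAMPLE):
        print(row)
```

With the real page you would pass resp.text to parse_sessions instead of SAMPLE. A missing span or div then shows up as None in the record rather than raising AttributeError.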

See the sample output below. The content between ***** Sublist Data ***** and ***** End of Sublist Data ***** is the data from the dropdown list of the item directly above it.

Sample Output

On-demand
Educational sessions on-demand
Thu, 16.09.2021
08:30 - 09:40


On-demand
Special Symposia on-demand
Thu, 16.09.2021
12:30 - 13:40


On-demand
Multidisciplinary sessions on-demand
Thu, 16.09.2021
16:30 - 17:40


Channel 3
Illumina - Diagnosing Non-Small Cell Lung Cancer using Comprehensive Genomic Profiling
Fri, 17.09.2021
08:45 - 10:15

***** Sublist Data *****

Industry Satellite Symposium
Illumina gives an update on their IVD road map
08:45 - 08:50

Industry Satellite Symposium
The impact of Comprehensive Genomic Profiling
08:50 - 09:01

Industry Satellite Symposium
A day in the life of a pathologist using Comprehensive Genomic Profiling
09:01 - 09:29

Industry Satellite Symposium
Dealing with complexity through Comprehensive Genomic Profiling
09:29 - 09:57

Industry Satellite Symposium
Q & A (Live)
09:57 - 10:15

***** End of Sublist Data *****
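
The date and time values above are plain strings. If you need to sort or filter on them, they could be converted to datetime objects along these lines. This is a sketch that assumes the formats always look like the sample output ("Thu, 16.09.2021" and "08:30 - 09:40"); parse_session_times is a made-up helper name.

```python
from datetime import datetime


def parse_session_times(date_text, time_text):
    """Return (start, end) datetimes for one session row."""
    # Drop the weekday ("Thu, ") so parsing does not depend on the locale.
    day_part = date_text.split(',', 1)[-1].strip()
    day = datetime.strptime(day_part, '%d.%m.%Y')

    # "08:30 - 09:40" -> start and end clock times.
    start_text, end_text = (part.strip() for part in time_text.split('-', 1))
    start = datetime.strptime(start_text, '%H:%M')
    end = datetime.strptime(end_text, '%H:%M')

    return (day.replace(hour=start.hour, minute=start.minute),
            day.replace(hour=end.hour, minute=end.minute))


if __name__ == '__main__':
    start, end = parse_session_times('Thu, 16.09.2021', '08:30 - 09:40')
    print(start.isoformat(), end.isoformat())
```

The sublist rows only carry a time range, so for those you would reuse the date string of the parent session.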
