【Question title】: How to extract names and links from a given website - Python
【Posted】: 2021-03-19 19:13:49
【Question】:

For the website referenced below, I am trying to extract the event names and their corresponding links, but neither attempt returns any data.

Using BeautifulSoup

from bs4 import BeautifulSoup
import requests

source = requests.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

soup = BeautifulSoup(source.text, 'html.parser')
mains = soup.find_all("div", {"class": "list-container-wrapper"})

name = []
lnks = []

for main in mains:
        name.append(main.find("a").text)
        lnks.append(main.find("a").get('href'))

Using the Selenium WebDriver

from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"chromedriver_win32\chromedriver.exe")
driver.get("https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0")

lnks = []
name = []

for a in driver.find_elements_by_class_name('ng-star-inserted'):
    link = a.get_attribute('href')
    lnks.append(link)
    
    nm = driver.find_element_by_css_selector("#list-item-0 > div > h2 > a").text
    name.append(nm)

I have tried both of the approaches above.

Example of the desired output:

name = ['Friday Night Flicks Drive-In at the Roadium', 'Open: Butterfly Pavilion and Nature Gardens']
lnks = ['https://mommypoppins.com/los-angeles-kids/event/in-person/friday-night-flicks-drive-in-at-the-roadium','https://mommypoppins.com/los-angeles-kids/event/in-person/open-butterfly-pavilion-and-nature-gardens']

【Comments】:

  • Which names do you mean? Can you add an example?
  • Thanks @UtpalDutt. Added.

Tags: python-3.x selenium-webdriver beautifulsoup


【Solution 1】:

Here is a solution using webdriver:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

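# Fixed pause to give the page's JavaScript time to render the event list.
time.sleep(3)

# Each event title is an <a> element with angularticsaction="expanded-detail".
elements = driver.find_elements(By.XPATH, "//a[@angularticsaction='expanded-detail']")

# Build one {name: link} dictionary per event.
attributes = [{el.text: el.get_attribute('href')} for el in elements]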

print(attributes)
print(len(attributes))

driver.quit()

Here is a solution using webdriver and bs4:

import time

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')
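# Fixed pause to give the page's JavaScript time to render the event list.
time.sleep(3)

# Parse the rendered DOM from the browser, not the raw HTTP response.
soup = BeautifulSoup(driver.page_source, 'html.parser')
mains = soup.find_all("a", {"angularticsaction": "expanded-detail"})

# Build one {name: link} dictionary per event.
attributes = [{el.text: el.get('href')} for el in mains]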

print(attributes)
print(len(attributes))

driver.quit()
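
As to why `driver.page_source` works here while the HTML from a plain HTTP request does not: the event anchors are added by JavaScript after the initial response arrives. The sketch below illustrates the difference with two hypothetical HTML strings (both invented for illustration, not copied from the site), one standing in for the raw response and one for the rendered page. It counts matching anchors with the standard-library `html.parser`, so no browser is needed to run it.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw HTML a plain HTTP client receives:
# an empty container that JavaScript fills in later.
RAW_HTML = '<html><body><div class="list-container-wrapper"></div></body></html>'

# Hypothetical stand-in for the HTML after the browser has run the scripts.
RENDERED_HTML = (
    '<html><body><div class="list-container-wrapper">'
    '<a angularticsaction="expanded-detail" href="/event/example-one">Example One</a>'
    '<a angularticsaction="expanded-detail" href="/event/example-two">Example Two</a>'
    '</div></body></html>'
)


class EventLinkCollector(HTMLParser):
    """Collects (text, href) pairs for <a angularticsaction="expanded-detail">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._inside = False
        self._href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("angularticsaction") == "expanded-detail":
            self._inside = True
            self._href = attr_map.get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching anchor.
        if self._inside:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append(("".join(self._text_parts).strip(), self._href))
            self._inside = False


def collect_event_links(html):
    collector = EventLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


raw_links = collect_event_links(RAW_HTML)
rendered_links = collect_event_links(RENDERED_HTML)
print(len(raw_links), len(rendered_links))
```

The same selector finds nothing in the first string and two anchors in the second, which matches the symptom described in the question.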

Here is a solution using requests:

import requests

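url = "https://mommypoppins.com"
# The event list is served as JSON from this endpoint, so no HTML parsing is needed.
response = requests.get(f"{url}/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all").json()

# node_url is prefixed with the base URL to form the full link.
attributes = [{r.get('node_title'): f"{url}{r['node'][r['nid']]['node_url']}"} for r in response['results']]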

print(attributes)
print(len(attributes))
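
The question asks for two parallel lists, `name` and `lnks`, whereas the snippets above build a list of single-key dictionaries. A minimal sketch of the conversion follows; it uses the two events quoted in the question as sample data in place of live results.

```python
# Sample data in the same shape as `attributes` above, filled with the two
# events quoted in the question instead of live results.
attributes = [
    {"Friday Night Flicks Drive-In at the Roadium":
        "https://mommypoppins.com/los-angeles-kids/event/in-person/friday-night-flicks-drive-in-at-the-roadium"},
    {"Open: Butterfly Pavilion and Nature Gardens":
        "https://mommypoppins.com/los-angeles-kids/event/in-person/open-butterfly-pavilion-and-nature-gardens"},
]

name = []
lnks = []
for entry in attributes:
    # Each entry holds exactly one {title: url} pair.
    for title, url in entry.items():
        name.append(title)
        lnks.append(url)

print(name)
print(lnks)
```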

Cheers!

【Comments】:

    【Solution 2】:

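    The site loads its content dynamically, so a plain requests call to the page URL will not return the event data. However, the data can be obtained in JSON format by sending a GET request to the following address:

    https://mommypoppins.com/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all

    Neither BeautifulSoup nor Selenium is needed; requests alone is enough, which will make your code faster.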

    import requests
    
    URL = "https://mommypoppins.com/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all"
    BASE_URL = "https://mommypoppins.com"
    response = requests.get(URL).json()
    
    names = []
    links = []
    
    for json_data in response["results"]:
        data = json_data["node"][json_data["nid"]]
        names.append(data["title"])
        links.append(BASE_URL + data["node_url"])
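
The parsing step can also be kept separate from the network call, which makes it easy to check against a small sample. The sketch below assumes the response has the shape implied by the loop above (`results`, `nid`, `node`, `title`, `node_url`). The sample payload and the helper name `extract_events` are made up for illustration, and entries with missing fields are skipped rather than raising `KeyError`.

```python
from urllib.parse import urljoin

BASE_URL = "https://mommypoppins.com"

# Made-up payload mirroring the structure the loop above relies on.
SAMPLE_RESPONSE = {
    "results": [
        {
            "nid": "101",
            "node": {
                "101": {
                    "title": "Example Event",
                    "node_url": "/los-angeles-kids/event/example-event",
                }
            },
        },
        {"nid": "102", "node": {}},  # incomplete entry, should be skipped
    ]
}


def extract_events(payload, base_url=BASE_URL):
    """Return (names, links) from a decoded JSON payload (hypothetical helper)."""
    names, links = [], []
    for item in payload.get("results", []):
        data = item.get("node", {}).get(item.get("nid"))
        # Skip entries that lack the fields we need.
        if not data or "title" not in data or "node_url" not in data:
            continue
        names.append(data["title"])
        links.append(urljoin(base_url, data["node_url"]))
    return names, links


names, links = extract_events(SAMPLE_RESPONSE)
print(names)
print(links)
```

With live data, `extract_events(requests.get(URL).json())` would take the place of the sample call.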
    

    【Comments】:
