【Question title】: How to extract names and links from a given website - python
【Posted】: 2021-03-19 19:13:49
【Question description】:

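For the website referenced below, I am trying to extract the event names and their corresponding links, but neither of my attempts returns any data.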

Using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

soup = BeautifulSoup(source.text, 'html.parser')
mains = soup.find_all("div", {"class": "list-container-wrapper"})

name = []
lnks = []

for main in mains:
    name.append(main.find("a").text)
    lnks.append(main.find("a").get('href'))

Using Selenium WebDriver:

from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"chromedriver_win32\chromedriver.exe")
driver.get("https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0")

lnks = []
name = []

for a in driver.find_elements_by_class_name('ng-star-inserted'):
    link = a.get_attribute('href')
    lnks.append(link)
    
    nm = driver.find_element_by_css_selector("#list-item-0 > div > h2 > a").text
    name.append(nm)

I have tried both of the approaches above.

Example of the expected output:

name = ['Friday Night Flicks Drive-In at the Roadium', 'Open: Butterfly Pavilion and Nature Gardens']
lnks = ['https://mommypoppins.com/los-angeles-kids/event/in-person/friday-night-flicks-drive-in-at-the-roadium','https://mommypoppins.com/los-angeles-kids/event/in-person/open-butterfly-pavilion-and-nature-gardens']

【Comments】:

Which names do you mean? Could you add an example?
Thanks @UtpalDutt. Added.

Tags: python-3.x selenium-webdriver beautifulsoup

【Solution 1】:

Here is a solution using webdriver:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

time.sleep(3)

elements = driver.find_elements(By.XPATH, "//a[@angularticsaction='expanded-detail']")

attributes = [{el.text: el.get_attribute('href')} for el in elements]

print(attributes)
print(len(attributes))

driver.quit()

Here is a solution using webdriver together with bs4:

import time

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')
time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'html.parser')
mains = soup.find_all("a", {"angularticsaction": "expanded-detail"})

attributes = [{el.text: el.get('href')} for el in mains]

print(attributes)
print(len(attributes))

driver.quit()

Here is a solution using requests:

import requests

url = "https://mommypoppins.com"
response = requests.get(f"{url}/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all").json()


attributes = [{r.get('node_title'): f"{url}{r['node'][r['nid']]['node_url']}"} for r in response['results']]

print(attributes)
print(len(attributes))
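
The question asks for two parallel lists (`name` and `lnks`), whereas the snippets above build a list of single-entry dictionaries. A minimal sketch of one way to split that structure follows; the sample data is copied from the question's expected output rather than fetched from the site.

```python
# Sample data shaped like the `attributes` list built above; the entries are
# copied from the question's expected output, not fetched from the site.
attributes = [
    {"Friday Night Flicks Drive-In at the Roadium":
     "https://mommypoppins.com/los-angeles-kids/event/in-person/friday-night-flicks-drive-in-at-the-roadium"},
    {"Open: Butterfly Pavilion and Nature Gardens":
     "https://mommypoppins.com/los-angeles-kids/event/in-person/open-butterfly-pavilion-and-nature-gardens"},
]

# Flatten every {title: url} dictionary into (title, url) pairs, keeping order.
pairs = [(title, url) for entry in attributes for title, url in entry.items()]

# Split the pairs into two parallel lists; the guard covers an empty result.
name, lnks = (list(column) for column in zip(*pairs)) if pairs else ([], [])

print(name)
print(lnks)
```

Building the pairs first keeps each name aligned with its link, which a single dictionary would not guarantee if two events shared a title.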

Cheers!

【Discussion】:

【Solution 2】:

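The site loads its content dynamically, so fetching the page URL with requests returns HTML that does not contain the event list. However, the data can be retrieved in JSON format by sending a GET request to the following address:

https://mommypoppins.com/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all

Neither BeautifulSoup nor Selenium is needed; requests alone is enough, which will also make your code faster.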

import requests

URL = "https://mommypoppins.com/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all"
BASE_URL = "https://mommypoppins.com"
response = requests.get(URL).json()

names = []
links = []

for json_data in response["results"]:
    data = json_data["node"][json_data["nid"]]
    names.append(data["title"])
    links.append(BASE_URL + data["node_url"])
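
The loop above assumes every entry in `results` carries a `nid` that is also a key of its `node` mapping. A hedged sketch that wraps the same lookup in a function and skips malformed entries is shown below. The function name `extract_events` and the sample payload are hypothetical; the payload only mirrors the keys used in this answer (`results`, `nid`, `node`, `title`, `node_url`), not the full live response.

```python
from urllib.parse import urljoin

BASE_URL = "https://mommypoppins.com"


def extract_events(payload, base_url=BASE_URL):
    """Return (names, links) from a payload shaped like the JSON endpoint's response."""
    names, links = [], []
    for item in payload.get("results", []):
        # JSON object keys are always strings, so normalise the id before the lookup.
        data = item.get("node", {}).get(str(item.get("nid")))
        if not data or "title" not in data or "node_url" not in data:
            continue  # skip entries that lack the fields we need
        names.append(data["title"])
        # urljoin copes with a node_url that has or lacks a leading slash.
        links.append(urljoin(base_url, data["node_url"]))
    return names, links


# Hypothetical payload for illustration only.
sample = {
    "results": [
        {"nid": "1", "node": {"1": {"title": "Event A", "node_url": "/los-angeles-kids/event/a"}}},
        {"nid": "2", "node": {}},  # malformed entry, skipped
    ]
}

names, links = extract_events(sample)
print(names)
print(links)
```

With live data, the parsed `response` from the code above could be passed in directly as `extract_events(response)`.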

【Discussion】: