【Posted】: 2020-07-30 09:26:24
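【Question】:
I want to extract the details of this care home from the URL https://www.carehome.co.uk/carehome.cfm/searchazref/10001005FITA. The information is given on the site in the following format:
Group: Excelcare Holdings
Person in charge: Denise Marks (Registered Manager)
Local Authority / Social Services: London Borough of Tower Hamlets Council (click for contact details)
etc.
My get_deets function only outputs the first element of each of its 'tag' and 'sibling' lists. I want the full list of label texts and the corresponding information.
Script: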
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as soup
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r'C:\Users\Main\Documents\Work\Projects\chromedriver')
my_url = "https://www.carehome.co.uk/carehome.cfm/searchazref/10001005FITA"

def make_soup(url):
    driver.get(url)
    m_soup = soup(driver.page_source, features='html.parser')
    return m_soup

main_page = make_soup(my_url)
strongs = main_page.select(".blue")

def get_deets(strongs):
    tag = []
    sibling = []
    for strong_tag in strongs:
        if strong_tag.next_sibling == '\n':
            tag.append(strong_tag.text)
            sibling.append(strong_tag.next_sibling.next_sibling.text)
        else:
            tag.append(strong_tag.text)
            sibling.append(strong_tag.next_sibling.strip())
        return tag, sibling
My current output:
get_deets(strongs)
(['Group:'], ['Excelcare Holdings'])
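A one-element result like this is what you would see if the `return` sits inside the `for` loop. The pasted script lost its indentation, so that placement is an assumption, but it is consistent with the output. The sketch below uses two hypothetical helpers, `first_only` and `all_items`, to show the difference in isolation; it needs no scraping libraries.

```python
def first_only(items):
    """Mimics a loop whose return is indented one level too deep."""
    collected = []
    for item in items:
        collected.append(item)
        return collected  # inside the loop: exits during the first iteration
    return collected  # only reached when items is empty


def all_items(items):
    """Same loop, but the return runs once the loop has finished."""
    collected = []
    for item in items:
        collected.append(item)
    return collected  # after the loop: every item has been appended


labels = ["Group:", "Person in charge:", "Local Authority / Social Services:"]
print(first_only(labels))
print(all_items(labels))
```

The first call prints a list holding only `'Group:'`, while the second prints all three labels.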
Desired output:
Tags
['Group:', 'Person in charge:', 'Local Authority / Social Services:']
Siblings
['Excelcare Holdings', 'Denise Marks (Registered Manager)', 'London Borough of Tower Hamlets Council (click for contact details)']
Using this HTML:
<div class="profile-group-description col-xs-12 col-sm-8">
<p><strong class="blue">Group:</strong>
<a href="https://www.carehome.co.uk/care_search_results.cfm/searchgroup/36151505EXCA">Excelcare Holdings</a>
</p>
<p><strong class="blue">Person in charge:</strong>
Denise Marks (Registered Manager)</p>
<p><strong class="blue">Local Authority / Social Services:</strong>
London Borough of Tower Hamlets Council (<a href="https://www.carehome.co.uk/local-authorities/profile.cfm/id/Tower-Hamlets">click for contact details</a>)</p>
<p>
<strong class="blue">Type of Service:</strong>
Care Home only (Residential Care) – Privately Owned , Registered for a maximum of 44 Service Users
</p>
<p>
<strong class="blue">Registered Care Categories*:</strong>
Dementia • Learning Disability • Mental Health Condition • Old Age
</p>
【Discussion】:
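One possible fix, sketched under assumptions: return only after the loop has finished, and join the text of every node that follows each `<strong class="blue">`, so that values mixing plain text with a nested `<a>` come back whole. `SAMPLE_HTML` is a trimmed copy of the HTML in the question with the `href` values shortened, and the whitespace normalisation is an illustrative choice rather than something the question specifies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the question's HTML; href values are shortened placeholders.
SAMPLE_HTML = """
<div class="profile-group-description col-xs-12 col-sm-8">
<p><strong class="blue">Group:</strong>
<a href="#">Excelcare Holdings</a>
</p>
<p><strong class="blue">Person in charge:</strong>
Denise Marks (Registered Manager)</p>
<p><strong class="blue">Local Authority / Social Services:</strong>
London Borough of Tower Hamlets Council (<a href="#">click for contact details</a>)</p>
</div>
"""


def get_deets(strongs):
    """Return two parallel lists: label texts and the text following each label."""
    tags = []
    siblings = []
    for strong_tag in strongs:
        tags.append(strong_tag.get_text(strip=True))
        # Collect the text of every node after the <strong>, whether it is a
        # bare string or a nested tag such as <a>.
        parts = []
        for node in strong_tag.next_siblings:
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        # Collapse newlines and repeated spaces into single spaces.
        siblings.append(" ".join("".join(parts).split()))
    # Return after the loop so every label is visited.
    return tags, siblings


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
tags, siblings = get_deets(page.select(".blue"))
print(tags)
print(siblings)
```

With the live page, the same function could be called on `main_page.select(".blue")` from the original script. Because it reads all following siblings instead of only the first, it no longer needs the `'\n'` special case.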
Tags: list if-statement web-scraping beautifulsoup append