【问题标题】:Fetching lawyers details from different links in a website从网站中的不同链接获取律师详细信息
【发布时间】:2019-06-07 05:56:48
【问题描述】:

我是使用 Python 进行 Web Scraping 的绝对初学者,并且对 Python 编程知之甚少。我只是想提取田纳西州律师的信息。在网页中,有多个链接,其中还有更多的链接,里面是各个律师。

如果您能告诉我我应该遵循的步骤。

我已经完成了直到在第一页上提取他的链接,但我只需要城市的链接,而我已经获得了所有带有 href 标签的链接。现在我该如何迭代它们并进一步进行?

from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.select('a')]
print(links)```

It is printing
````C:\Users\laptop\AppData\Local\Programs\Python\Python36-32\python.exe C:/Users/laptop/.PyCharmCE2017.1/config/scratches/scratch_1.py
['https://www.superlawyers.com', 'https://attorneys.superlawyers.com', 'https://ask.superlawyers.com', 'https://video.superlawyers.com',.... ````

All the links are extracted whereas I only need the links of the cities. Kindly help.

【问题讨论】:

  • 如果可能的话提出一些解决方案。

标签: python-3.x web-scraping beautifulsoup


【解决方案1】:

没有正则表达式:

cities = soup.find('div', class_="three_browse_columns" )
for city in cities.find_all('a'):
   print(city['href'])

【讨论】:

  • 现在我提取城市的每个链接中的链接。我已经完成了以下代码,但它显示错误。 ```` res = requests.get('attorneys.superlawyers.com/tennessee', headers = {'User-agent': 'Super Bot 9000'}) soup = bs(res.content, 'lxml') links = [item[' href'] for 汤中的项目.find_all('a',href=re.compile('attorneys.superlawyers.com/tennessee/'))] for l1 in links: links2 = l1.find('div', class_="three_browse_columns") for l2 in links2. find_all('a'): lk1=l2['href'] print(lk1)```` 错误:find() 没有关键字参数
  • @ag2019 - 但您似乎正在使用正则表达式。不知道为什么,但这是一种不同的方法。
【解决方案2】:

Faster 将用于使用父 ID,然后在其中选择 a 标记

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://attorneys.superlawyers.com/tennessee/')
soup = bs(r.content, 'lxml')
cities = [item['href'] for item in soup.select('#browse_view a')]

【讨论】:

  • 你能解释一下browse_view前'#'的意义吗?
  • 现在每个链接里面都有更多的链接,怎么访问呢?例如,在阿拉莫地区,有三类律师可用,我想要这些链接,在这三类中是我想要获取的律师详细信息。可以做什么?如果可能的话建议。
  • 我试过这个import requests from bs4 import BeautifulSoup as bs res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'}) soup = bs(res.content, 'lxml') cities = [item['href'] for item in soup.select('#browse_view a')] for c1 in cities: categories = [item['href'] for item in c1.select('three_browse_columns a')] print(categories) 但它给出了一个错误categories = [item['href'] for item in c1.select('three_browse_columns a')] AttributeError: 'str' object has no attribute 'select'。可以做什么?如果可能的话建议。
  • cities 是指向这些城市的超链接列表,即字符串。所以,c1 是一个字符串而不是一个标签。
【解决方案3】:

使用正则表达式re并搜索城市的href值。

from bs4 import BeautifulSoup as bs
import re

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.find_all('a',href=re.compile('https://attorneys.superlawyers.com/tennessee/'))]
print(links)

输出:

['https://attorneys.superlawyers.com/tennessee/alamo/', 'https://attorneys.superlawyers.com/tennessee/bartlett/', 'https://attorneys.superlawyers.com/tennessee/brentwood/', 'https://attorneys.superlawyers.com/tennessee/bristol/', 'https://attorneys.superlawyers.com/tennessee/chattanooga/', 'https://attorneys.superlawyers.com/tennessee/clarksville/', 'https://attorneys.superlawyers.com/tennessee/cleveland/', 'https://attorneys.superlawyers.com/tennessee/clinton/', 'https://attorneys.superlawyers.com/tennessee/columbia/', 'https://attorneys.superlawyers.com/tennessee/cookeville/', 'https://attorneys.superlawyers.com/tennessee/cordova/', 'https://attorneys.superlawyers.com/tennessee/covington/', 'https://attorneys.superlawyers.com/tennessee/dayton/', 'https://attorneys.superlawyers.com/tennessee/dickson/', 'https://attorneys.superlawyers.com/tennessee/dyersburg/', 'https://attorneys.superlawyers.com/tennessee/elizabethton/', 'https://attorneys.superlawyers.com/tennessee/franklin/', 'https://attorneys.superlawyers.com/tennessee/gallatin/', 'https://attorneys.superlawyers.com/tennessee/germantown/', 'https://attorneys.superlawyers.com/tennessee/goodlettsville/', 'https://attorneys.superlawyers.com/tennessee/greeneville/', 'https://attorneys.superlawyers.com/tennessee/henderson/', 'https://attorneys.superlawyers.com/tennessee/hendersonville/', 'https://attorneys.superlawyers.com/tennessee/hixson/', 'https://attorneys.superlawyers.com/tennessee/huntingdon/', 'https://attorneys.superlawyers.com/tennessee/huntsville/', 'https://attorneys.superlawyers.com/tennessee/jacksboro/', 'https://attorneys.superlawyers.com/tennessee/jackson/', 'https://attorneys.superlawyers.com/tennessee/jasper/', 'https://attorneys.superlawyers.com/tennessee/johnson-city/', 'https://attorneys.superlawyers.com/tennessee/kingsport/', 'https://attorneys.superlawyers.com/tennessee/knoxville/', 'https://attorneys.superlawyers.com/tennessee/la-follette/', 'https://attorneys.superlawyers.com/tennessee/lafayette/', 'https://attorneys.superlawyers.com/tennessee/lafollette/', 'https://attorneys.superlawyers.com/tennessee/lawrenceburg/', 'https://attorneys.superlawyers.com/tennessee/lebanon/', 'https://attorneys.superlawyers.com/tennessee/lenoir-city/', 'https://attorneys.superlawyers.com/tennessee/lewisburg/', 'https://attorneys.superlawyers.com/tennessee/lexington/', 'https://attorneys.superlawyers.com/tennessee/madisonville/', 'https://attorneys.superlawyers.com/tennessee/manchester/', 'https://attorneys.superlawyers.com/tennessee/maryville/', 'https://attorneys.superlawyers.com/tennessee/memphis/', 'https://attorneys.superlawyers.com/tennessee/millington/', 'https://attorneys.superlawyers.com/tennessee/morristown/', 'https://attorneys.superlawyers.com/tennessee/murfreesboro/', 'https://attorneys.superlawyers.com/tennessee/nashville/', 'https://attorneys.superlawyers.com/tennessee/paris/', 'https://attorneys.superlawyers.com/tennessee/pleasant-view/', 'https://attorneys.superlawyers.com/tennessee/pulaski/', 'https://attorneys.superlawyers.com/tennessee/rogersville/', 'https://attorneys.superlawyers.com/tennessee/sevierville/', 'https://attorneys.superlawyers.com/tennessee/sewanee/', 'https://attorneys.superlawyers.com/tennessee/shelbyville/', 'https://attorneys.superlawyers.com/tennessee/somerville/', 'https://attorneys.superlawyers.com/tennessee/spring-hill/', 'https://attorneys.superlawyers.com/tennessee/springfield/', 'https://attorneys.superlawyers.com/tennessee/tullahoma/', 'https://attorneys.superlawyers.com/tennessee/white-house/', 'https://attorneys.superlawyers.com/tennessee/winchester/', 'https://attorneys.superlawyers.com/tennessee/woodlawn/']

如果您想使用 css 选择器,请使用以下代码。

from bs4 import BeautifulSoup as bs
import requests
res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
links = [item['href'] for item in soup.select('a[href^="https://attorneys.superlawyers.com/tennessee"]')]
print(links)

输出:

['https://attorneys.superlawyers.com/tennessee/alamo/', 'https://attorneys.superlawyers.com/tennessee/bartlett/', 'https://attorneys.superlawyers.com/tennessee/brentwood/', 'https://attorneys.superlawyers.com/tennessee/bristol/', 'https://attorneys.superlawyers.com/tennessee/chattanooga/', 'https://attorneys.superlawyers.com/tennessee/clarksville/', 'https://attorneys.superlawyers.com/tennessee/cleveland/', 'https://attorneys.superlawyers.com/tennessee/clinton/', 'https://attorneys.superlawyers.com/tennessee/columbia/', 'https://attorneys.superlawyers.com/tennessee/cookeville/', 'https://attorneys.superlawyers.com/tennessee/cordova/', 'https://attorneys.superlawyers.com/tennessee/covington/', 'https://attorneys.superlawyers.com/tennessee/dayton/', 'https://attorneys.superlawyers.com/tennessee/dickson/', 'https://attorneys.superlawyers.com/tennessee/dyersburg/', 'https://attorneys.superlawyers.com/tennessee/elizabethton/', 'https://attorneys.superlawyers.com/tennessee/franklin/', 'https://attorneys.superlawyers.com/tennessee/gallatin/', 'https://attorneys.superlawyers.com/tennessee/germantown/', 'https://attorneys.superlawyers.com/tennessee/goodlettsville/', 'https://attorneys.superlawyers.com/tennessee/greeneville/', 'https://attorneys.superlawyers.com/tennessee/henderson/', 'https://attorneys.superlawyers.com/tennessee/hendersonville/', 'https://attorneys.superlawyers.com/tennessee/hixson/', 'https://attorneys.superlawyers.com/tennessee/huntingdon/', 'https://attorneys.superlawyers.com/tennessee/huntsville/', 'https://attorneys.superlawyers.com/tennessee/jacksboro/', 'https://attorneys.superlawyers.com/tennessee/jackson/', 'https://attorneys.superlawyers.com/tennessee/jasper/', 'https://attorneys.superlawyers.com/tennessee/johnson-city/', 'https://attorneys.superlawyers.com/tennessee/kingsport/', 'https://attorneys.superlawyers.com/tennessee/knoxville/', 'https://attorneys.superlawyers.com/tennessee/la-follette/', 'https://attorneys.superlawyers.com/tennessee/lafayette/', 'https://attorneys.superlawyers.com/tennessee/lafollette/', 'https://attorneys.superlawyers.com/tennessee/lawrenceburg/', 'https://attorneys.superlawyers.com/tennessee/lebanon/', 'https://attorneys.superlawyers.com/tennessee/lenoir-city/', 'https://attorneys.superlawyers.com/tennessee/lewisburg/', 'https://attorneys.superlawyers.com/tennessee/lexington/', 'https://attorneys.superlawyers.com/tennessee/madisonville/', 'https://attorneys.superlawyers.com/tennessee/manchester/', 'https://attorneys.superlawyers.com/tennessee/maryville/', 'https://attorneys.superlawyers.com/tennessee/memphis/', 'https://attorneys.superlawyers.com/tennessee/millington/', 'https://attorneys.superlawyers.com/tennessee/morristown/', 'https://attorneys.superlawyers.com/tennessee/murfreesboro/', 'https://attorneys.superlawyers.com/tennessee/nashville/', 'https://attorneys.superlawyers.com/tennessee/paris/', 'https://attorneys.superlawyers.com/tennessee/pleasant-view/', 'https://attorneys.superlawyers.com/tennessee/pulaski/', 'https://attorneys.superlawyers.com/tennessee/rogersville/', 'https://attorneys.superlawyers.com/tennessee/sevierville/', 'https://attorneys.superlawyers.com/tennessee/sewanee/', 'https://attorneys.superlawyers.com/tennessee/shelbyville/', 'https://attorneys.superlawyers.com/tennessee/somerville/', 'https://attorneys.superlawyers.com/tennessee/spring-hill/', 'https://attorneys.superlawyers.com/tennessee/springfield/', 'https://attorneys.superlawyers.com/tennessee/tullahoma/', 'https://attorneys.superlawyers.com/tennessee/white-house/', 'https://attorneys.superlawyers.com/tennessee/winchester/', 'https://attorneys.superlawyers.com/tennessee/woodlawn/']

【讨论】:

  • 添加re后,输出保持不变,所有链接都像以前一样显示。
  • 我按照你说的做了,但是输出还是一样的。
  • 对不起,而不是选择使用 find_all
猜你喜欢
  • 1970-01-01
  • 2019-10-23
  • 1970-01-01
  • 2019-10-23
  • 1970-01-01
  • 1970-01-01
  • 2023-03-08
  • 2012-02-29
  • 2016-10-20
相关资源
最近更新 更多