How to get the links under a specific class

【Question title】: How can I get the links under a specific class
【Posted】: 2018-05-05 19:09:29
【Question】:

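So two days ago I was trying to parse the data between two identical classes, and Keyur helped me a lot, then he left the other problems behind.. :D

Now I want to get the links under a specific class. Here is my code, and here is the error.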

from bs4 import BeautifulSoup
import urllib.request
import datetime

headers = {}  # Headers gives information about you like your operation system, your browser etc.
headers['User-Agent'] = 'Mozilla/5.0'  # I defined a user agent because HLTV perceive my connection as bot.
hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers)  # Basically connecting to website
session = urllib.request.urlopen(hltv)
sauce = session.read()  # Getting the source of website
soup = BeautifulSoup(sauce, 'lxml')

a = 0
b = 1
# Getting the match pages' links.
for x in soup.find('span', text=datetime.date.today()).parent:
    print(x.find('a'))

Error:

Actually there is no error as such, but the output looks like this:

None

None
None
-1
None
None
-1

Then I did some research and found that find gives you nothing back when there is no matching data. So I tried find_all instead.

Code:

print(x.find_all('a'))

Output:

AttributeError: 'NavigableString' object has no attribute 'find_all'

This is the class name:

<div class="standard-headline">2018-05-01</div>

I don't want to post all of the markup here, so here is the link, hltv.org/matches/, so that you can check the classes more easily.

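【Comments】:

import bs4 as BeautifulSoup is not correct. What does your code actually look like?
Did you forget, by accident or on purpose, to mention the class name you are talking about?
Do you mean for a in soup.find('span', text=(datetime.date.today())).parent.find_all("a"): print(a)?
There are no links under the standard-headline class name. It is all text. At least I couldn't find any. Please be specific.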
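
A minimal sketch of what seems to be going on in the question, using a small hypothetical document rather than the real HLTV page. Iterating over a tag yields its direct children, including the whitespace strings between tags. Those are NavigableString objects, which subclass str, so they have no find_all, and calling .find on them is most likely plain str.find, which would explain the -1 values in the output. Calling find_all("a") on the parent tag itself, as suggested in the comments, avoids the loop entirely.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup for illustration only, not the real HLTV page.
html = """
<div class="match-day">
  <span class="standard-headline">2018-05-05</span>
  <a href="/matches/1/a-vs-b">A vs B</a>
  <a href="/matches/2/c-vs-d">C vs D</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
day = soup.find("span", string="2018-05-05").parent

# Iterating over a tag walks its direct children:
# the child tags and the text (here whitespace) between them.
children = list(day)
strings = [c for c in children if isinstance(c, NavigableString)]
tags = [c for c in children if isinstance(c, Tag)]

# Searching from the parent tag looks through all of its descendants at once,
# so there is no need to call find/find_all on each child.
hrefs = [a["href"] for a in day.find_all("a")]
print(hrefs)
```

If the loop over children is kept, filtering with isinstance(child, Tag) before calling find_all is one way to skip the text nodes.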

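Tags: python-3.x web-scraping beautifulsoup urllib

【Solution 1】:

I'm not quite sure I understand which links the OP really wants to get. However, I took a guess. The links are within the compound class a-reset block upcoming-match standard-box. If you pick the right one, a single class from that list is enough to fetch the data, just as selectors do. Give it a shot.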

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin
import datetime

url = 'https://www.hltv.org/matches'

req = Request(url, headers={"User-Agent":"Mozilla/5.0"}) 
res = urlopen(req).read()
soup = BeautifulSoup(res, 'lxml')
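# Find today's date headline, step up to its parent block,
# and collect the match links inside that block.
today = str(datetime.date.today())  # e.g. '2018-05-05'
headline = soup.find(class_="standard-headline", text=today)
for link in headline.find_parent().find_all(class_="upcoming-match")[:-2]:
    print(urljoin(url, link.get('href')))  # hrefs are relative, so join them with the page URL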

Output:

https://www.hltv.org/matches/2322508/yeah-vs-sharks-ggbet-ascenso
https://www.hltv.org/matches/2322633/team-australia-vs-team-uk-showmatch-csgo
https://www.hltv.org/matches/2322638/sydney-saints-vs-control-fe-lil-suzi-winner-esl-womens-sydney-open-finals
https://www.hltv.org/matches/2322426/faze-vs-astralis-iem-sydney-2018
https://www.hltv.org/matches/2322601/max-vs-fierce-tiger-starseries-i-league-season-5-asian-qualifier

and so on ------
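
To illustrate the two ideas the answer relies on, here is a small sketch on hypothetical markup (the real page structure may differ): matching on one class out of a compound class attribute, either with class_ or with a CSS selector, and turning relative href values into absolute URLs with urljoin. The fixed date and the match-day wrapper are assumptions made so the example is reproducible.

```python
from datetime import date
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://www.hltv.org/matches"

# Hypothetical markup for illustration only.
html = """
<div class="match-day">
  <span class="standard-headline">2018-05-05</span>
  <a class="a-reset block upcoming-match standard-box" href="/matches/1/a-vs-b">A vs B</a>
  <a class="a-reset block upcoming-match standard-box" href="/matches/2/c-vs-d">C vs D</a>
  <a class="a-reset other-link" href="/events/9/some-event">Event</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# str() of a date is its ISO form, which is what the headline text contains.
today = str(date(2018, 5, 5))  # fixed date instead of date.today(), for reproducibility

headline = soup.find(class_="standard-headline", string=today)
day_block = headline.find_parent()

# One class out of the compound class attribute is enough to match.
by_class = [urljoin(base, a["href"]) for a in day_block.find_all(class_="upcoming-match")]

# The same query written as a CSS selector.
by_selector = [urljoin(base, a["href"]) for a in day_block.select("a.upcoming-match")]

print(by_class)
```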

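【Comments】:

But the point is to get only today's matches, and I have already managed to do that. By the way, I like the way you shortened the code.
If you only want today's links, then the edit I just made should get you there. Thanks.
Glad we both got it solved in the end. Thanks.
You know SIM, everything is fine, but why did you use [:-2]? I know what it does, but I mean, why? :D
If you run the script without the index, you will probably get two extra links that you don't need. That's why I used it, to match your requirement. Thanks.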
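
For readers unsure about the [:-2] slice discussed above, a short sketch with made-up link values: the slice drops the last two items of the list, whatever they are. The prefix filter shown next to it is only a hypothetical alternative, assuming the unwanted links can be told apart by their path.

```python
# Made-up values for illustration; the last two stand in for the unwanted links.
links = ["/matches/1/a-vs-b", "/matches/2/c-vs-d", "/matches/3/e-vs-f", "/extra/x", "/extra/y"]

# [:-2] keeps everything except the last two items.
wanted = links[:-2]

# Hypothetical alternative: keep links by what they look like, not by position.
filtered = [link for link in links if link.startswith("/matches/")]

# Note that slicing a list of two or fewer items this way leaves nothing.
short = ["/matches/1/a-vs-b", "/extra/x"][:-2]

print(wanted)
```

A positional slice is compact but depends on the page always ending with exactly two extra links, so a filter on the link itself may hold up better if the layout changes.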