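【Posted】: 2020-07-17 17:28:03
【Question】:
Here is the link: https://www.mobihealthnews.com/news?page=0
For each article on the news page, I am trying to scrape the title, its short summary, the link, the publication date, and the author's name.
I ran into a problem because the rows on the site have different class names. For example: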
<div class="views-row views-row-1 views-row-odd views-row-first">...</div>
<div class="views-row views-row-2 views-row-even">...</div>
<div class="views-row views-row-3 views-row-odd">...</div>
<div class="views-row views-row-4 views-row-even">...</div>
<div class="views-row views-row-5 views-row-odd">...</div>
<div class="views-row views-row-6 views-row-even">...</div>
<div class="views-row views-row-7 views-row-odd">...</div>
<div class="views-row views-row-8 views-row-even">...</div>
<div class="views-row views-row-9 views-row-odd">...</div>
<div class="views-row views-row-10 views-row-even views-row-last">...</div>
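Is there a way to select these elements by class other than writing out a long chain of if-else statements?
Additional info: I am currently using the BeautifulSoup4 and requests libraries.
Thanks in advance for your time.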
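One way to avoid the if-else chain, sketched here under the assumption that every article row carries the shared `views-row` class as in the markup above: BeautifulSoup treats `class` as a multi-valued attribute, so searching for a single class name matches any element that has that name among its classes. The sample HTML and its inner text below are made up for illustration and are not the real page.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the rows in the question; the inner text is invented.
html = """
<div class="views-row views-row-1 views-row-odd views-row-first">first</div>
<div class="views-row views-row-2 views-row-even">second</div>
<div class="views-row views-row-10 views-row-even views-row-last">last</div>
<div class="sidebar">not an article row</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ is compared against each individual class token,
# so "views-row" selects every row regardless of its other classes.
rows = soup.find_all("div", class_="views-row")

# The same selection written as a CSS selector.
rows_css = soup.select("div.views-row")

print(len(rows))
print([row.get_text(strip=True) for row in rows])
```

By contrast, passing a string with several classes, such as `attrs={'class': 'group-left list-wrapper'}`, is matched against the exact attribute value, so it is more brittle if the class order or set differs on the page.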
Edit: here is my approach so far, but I am fairly sure something in the links variable needs to change.
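import requests
from bs4 import BeautifulSoup

# Fetch the news page linked above
page = requests.get("https://www.mobihealthnews.com/news?page=0")
soup = BeautifulSoup(page.text, 'html.parser')
frame = []
upperframe = []  # accumulates rows across pages; assumed to be defined earlier in the full script
# This is the selector I suspect needs to change
links = soup.find_all('div', attrs={'class': 'group-left list-wrapper'})
print(len(links))
filename = "mobi_health_news.csv"
f = open(filename, "w", encoding='utf-8')
# Header order matches the order in which the fields are written below
headers = "Title,Link,Author,Content,Date\n"
f.write(headers)
for j in links:
    Title = j.find("div", attrs={'class': 'views-field views-field-title'}).text.strip()
    Link = "https://www.mobihealthnews.com"
    Link += j.find("div", attrs={'class': 'views-field views-field-title'}).find('a')['href'].strip()
    Date = j.find('span', attrs={'class': 'day_list'}).text.strip()
    Content = j.find('div', attrs={'class': 'views-field views-field-body'}).text.strip()
    Author = j.find('span', attrs={'class': 'author_list'}).text.strip()
    frame.append((Title, Content, Date, Link, Author))
    # Commas inside fields are replaced with "^" so they do not break the CSV columns
    f.write(Title.replace(",", "^") + "," + Link + "," + Author.replace(",", "^") + "," + Content.replace(",", "^") + "," + Date.replace(",", "^") + "\n")
upperframe.extend(frame)
f.close()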
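As a side note on the CSV step, the code above replaces commas with "^" to protect the columns. A minimal sketch of an alternative, using the standard-library `csv` module, which quotes such fields instead of altering them. The sample row and its URL are hypothetical placeholders, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical sample row; real values would come from the scraping loop.
rows = [
    (
        "Title, with a comma",
        "https://www.mobihealthnews.com/news/example-article",  # placeholder URL
        "Jane Doe",
        'Summary with "quotes", and a comma',
        "July 17, 2020",
    ),
]

# In-memory stand-in for:
# open("mobi_health_news.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Link", "Author", "Content", "Date"])
writer.writerows(rows)  # fields containing commas or quotes are quoted automatically

# Read the data back to confirm the fields survive unchanged.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1][0])
```

With a real file, opening it with `newline=""` lets the `csv` module control line endings itself.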
【Comments】:
-
Can you share your code? What have you tried?
-
@Umair I just edited my question.
Tags: html python-3.x beautifulsoup python-requests