【Posted】: 2020-03-04 08:41:27
【Question】:
I am parsing a news website with Beautifulsoup4, but I can't work out how to strip the HTML elements and get plain text.
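To make the goal concrete, here is a minimal sketch of the kind of tag stripping meant here, using `get_text()` from BeautifulSoup. The HTML snippet is made up for illustration; only the `h1.title` and `div.text` selectors come from the code in this post, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for an article page; the real markup may differ.
SAMPLE_HTML = """
<h1 class="title">Sample <b>headline</b></h1>
<div class="text">
  <p>First paragraph with a <a href="#">link</a>.</p>
  <p>Second paragraph.</p>
</div>
"""


def clean(tag):
    """Return the text inside a tag with all markup removed and whitespace collapsed."""
    return " ".join(tag.get_text().split())


def extract_text(html):
    """Return (header, content) as plain strings without any tags."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="title")
    header = clean(title) if title else ""
    paragraphs = []
    for block in soup.find_all("div", class_="text"):
        for p in block.find_all("p"):
            paragraphs.append(clean(p))
    return header, "\n".join(paragraphs)


header, content = extract_text(SAMPLE_HTML)
print(header)
print(content)
```

Calling `get_text()` on each tag, rather than printing the result of `findAll`, is what drops the surrounding `<h1>` and `<p>` markup.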
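Another problem is that the publication date of each news item is not in a date format. I want to convert it to a date so that I can filter out the news I don't need.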
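A rough sketch of the conversion and filtering described above, using `datetime.strptime`. The layout `DD.MM.YYYY HH:MM` is only an assumption, since the actual text shown on the site is not included in this post. `DATE_FORMAT`, `parse_date` and `is_recent` are hypothetical names.

```python
from datetime import datetime

# Assumed layout of the scraped date text; adjust it to what the site really shows.
DATE_FORMAT = "%d.%m.%Y %H:%M"


def parse_date(raw):
    """Turn scraped date text into a datetime, or None if it does not match."""
    try:
        return datetime.strptime(raw.strip(), DATE_FORMAT)
    except ValueError:
        return None


def is_recent(raw, cutoff):
    """True when the text parses as a date that is on or after the cutoff."""
    parsed = parse_date(raw)
    return parsed is not None and parsed >= cutoff


cutoff = datetime(2020, 1, 1)
samples = ["  04.03.2020 08:41 ", "15.11.2019 10:00", "not a date"]
kept = [s.strip() for s in samples if is_recent(s, cutoff)]
print(kept)
```

Returning `None` for text that does not match keeps one malformed date from stopping the whole scrape.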
I would also like to know which format would be suitable for storing the data. I will use it to train a model in ML.
import requests
from bs4 import BeautifulSoup as bs
URL = 'http://marja.az/search?q='
# if a keyword contains a space, join its words with a + sign
KEYWORDS = ['Valizada',
]
for key in KEYWORDS:
    search_url = URL + key
    print(search_url)
    r = requests.get(search_url)
    soup = bs(r.content, "lxml")
    for data in soup.find_all("div", {"class": "searchNews"}):
        for a in data.find_all("a"):
            href = a.get("href")
            # print(href)
            link = "http://marja.az/" + href
            print(link)
            r1 = requests.get(link)
            soup1 = bs(r1.content, "lxml")
            header = soup1.findAll("h1", attrs={"class": "title"})
            print(header)
            paragraph = soup1.findAll("div", attrs={"class": "text"})
            for p in paragraph:
                print(p.findAll('p', text=True, recursive=False))
            date = soup1.findAll("div", attrs={"class": "left"})
            for d in date:
                print(soup1.find('div', {'style': 'color: #af0000; margin:10px 0px 10px 0px; font-size:12px; '
                                                  'font-weight:bold; text-align:left;'}))
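
The comment about joining keywords with a + sign, and the string concatenation used to build `link`, could both be handled by `urllib.parse`. This is a small sketch, not tested against the site. The `/news/123` path is a made-up example, and `build_search_url` and `absolute_link` are hypothetical helper names.

```python
from urllib.parse import quote_plus, urljoin

BASE_URL = "http://marja.az/"
SEARCH_URL = BASE_URL + "search?q="


def build_search_url(keyword):
    """Encode the keyword for a query string; spaces become + signs."""
    return SEARCH_URL + quote_plus(keyword)


def absolute_link(href):
    """Resolve an href against the site root, with or without a leading slash."""
    return urljoin(BASE_URL, href)


print(build_search_url("Valizada"))
print(build_search_url("two words"))
print(absolute_link("/news/123"))
print(absolute_link("news/123"))
```

With plain concatenation, an `href` that starts with `/` would produce a double slash after the host, which `urljoin` avoids.
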
Expected result:
Date, Header, Content
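
As an illustration of that layout, the three columns could be written to CSV with the standard `csv` module. This is only one possible storage format, and whether it suits a given ML pipeline is an open question. The row below is made-up sample data, and the sketch writes to an in-memory buffer instead of a file.

```python
import csv
import io

FIELDS = ["Date", "Header", "Content"]


def write_rows(rows, out):
    """Write article dicts to a file-like object as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row; Date is stored as ISO 8601 text so it sorts and parses easily.
rows = [
    {
        "Date": "2020-03-04T08:41:00",
        "Header": "Sample headline",
        "Content": "First paragraph.\nSecond paragraph.",
    },
]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())

# Reading the data back shows that multi-line content survives the round trip.
restored = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(restored[0]["Content"])
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""` so that line breaks inside quoted fields are preserved.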
【Comments】:
Tags: python python-3.x beautifulsoup python-requests