【Posted】: 2020-11-07 19:47:17
【Question】:
I am trying to scrape data as follows:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

daterange = pd.date_range('02-25-2015', '09-16-2020', freq='D')

def main(req, date):
    r = req.get(f"https://it.sputniknews.com/politica/{date.strftime('%Y%m%d')}")
    print(r, r.content)
    soup = BeautifulSoup(r.content, 'html.parser')
    tag = None
    print(soup.select("b-plainlist"))
    #for tag in soup.select(".b-plainlist "):
        #print(tag.select_one(".b-plainlist__date").text)
        #print(tag.select_one(".b-plainlist__title").text)
        #print(tag.find_next(class_="b-plainlist__announce").text.strip())
    return tag.select_one(".b-plainlist__date").text, tag.select_one(".b-plainlist__title").text, tag.find_next(class_="b-plainlist__announce").text.strip()

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, date) for date in daterange]
        allin = []
        for f in fs:
            allin.append(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Content"])
print(df)
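I am trying to build a DataFrame with date, title, and content columns.
I expected this code to work, but I cannot get a "clean" DataFrame out of it, so I suspect the tags (selectors) are wrong. Could you take a look? Thanks.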
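To make the suspected selector problem concrete, here is a minimal offline sketch. It assumes each article sits in an element with class `b-plainlist` that contains the `b-plainlist__date`, `b-plainlist__title`, and `b-plainlist__announce` children; the `SAMPLE_HTML` markup and the helper names `text_or_none` and `parse_items` are hypothetical and not taken from the real page. It shows two things: a selector without the leading dot is read as a tag name and matches nothing, and a page with several articles is easier to handle by returning one row per article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure is an assumption here.
SAMPLE_HTML = """
<div class="b-plainlist">
  <span class="b-plainlist__date">10:15 25.02.2015</span>
  <h2 class="b-plainlist__title"><a href="#">First title</a></h2>
  <div class="b-plainlist__announce"> First summary </div>
</div>
<div class="b-plainlist">
  <span class="b-plainlist__date">12:30 25.02.2015</span>
  <h2 class="b-plainlist__title"><a href="#">Second title</a></h2>
  <div class="b-plainlist__announce"> Second summary </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_items(html):
    """Return one (date, title, content) tuple per article block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # The leading dot makes this a class selector.
    for item in soup.select(".b-plainlist"):
        rows.append((
            text_or_none(item, ".b-plainlist__date"),
            text_or_none(item, ".b-plainlist__title"),
            text_or_none(item, ".b-plainlist__announce"),
        ))
    return rows


rows = parse_items(SAMPLE_HTML)

# Without the dot, "b-plainlist" is treated as a tag name and matches nothing.
no_dot = BeautifulSoup(SAMPLE_HTML, "html.parser").select("b-plainlist")
```

If `main` returned a list like `rows` for each date, the collecting loop would likely need `allin.extend(f.result())` rather than `allin.append(f.result())`, so that `pd.DataFrame.from_records(allin, columns=["Date", "Title", "Content"])` receives one flat list of tuples.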
【Comments】:
Tags: python pandas web-scraping beautifulsoup