【Posted】: 2019-11-28 10:03:16
【Question】:
This is my code for scraping a single page, but I have 11,000 of them. The only difference between the pages is their id.
https://www.rlsnet.ru/mkb_index_id_1.htm
https://www.rlsnet.ru/mkb_index_id_2.htm
https://www.rlsnet.ru/mkb_index_id_3.htm
....
https://www.rlsnet.ru/mkb_index_id_11000.htm
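How can I loop my code to scrape all 11,000 pages? Is that even feasible with so many pages? I could put all the URLs in a list and scrape from that, but with 11,000 of them that would be very tedious.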
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download and parse a single page
page_sc = requests.get('https://www.rlsnet.ru/mkb_index_id_1.htm')
soup_sc = BeautifulSoup(page_sc.content, 'html.parser')

# Collect the link text of every item in the subcategory list
items_sc = soup_sc.find_all(class_='subcatlist__item')
mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]

# Store the names in a one-column DataFrame and write it to CSV
mkb_stuff_sce = pd.DataFrame({'first': mkb_names_sc})
mkb_stuff_sce.to_csv('/Users/gfidarov/Desktop/Python/MKB/mkb.csv')
【Comments】:
- Simple: pass the loop counter into the URL:
'https://www.rlsnet.ru/mkb_index_id_{}.htm'.format(i)
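A minimal sketch of how that suggestion could be expanded into a full loop, reusing the selectors from the question. The function names (page_urls, scrape_names, scrape_all), the 10-second timeout, the 0.5-second delay between requests, and the mkb.csv output path are all assumptions made for illustration, not part of the original code. Rows are written to the CSV as each page finishes, so an interrupted run does not lose what was already scraped.

```python
import csv
import time

URL_TEMPLATE = 'https://www.rlsnet.ru/mkb_index_id_{}.htm'


def page_urls(first_id=1, last_id=11000):
    """Yield one URL per page id, inclusive on both ends."""
    for i in range(first_id, last_id + 1):
        yield URL_TEMPLATE.format(i)


def scrape_names(url, session):
    """Return the link texts from one page (same selectors as the question)."""
    # Third-party import kept local so the URL helper works without it.
    from bs4 import BeautifulSoup

    response = session.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    names = []
    for item in soup.find_all(class_='subcatlist__item'):
        link = item.find(class_='subcatlist__link')
        if link is not None:  # skip items that have no link element
            names.append(link.get_text())
    return names


def scrape_all(out_path='mkb.csv', first_id=1, last_id=11000, delay=0.5):
    """Scrape every page in the id range and append the names to one CSV."""
    import requests

    with requests.Session() as session, \
            open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'first'])
        for page_id in range(first_id, last_id + 1):
            url = URL_TEMPLATE.format(page_id)
            try:
                names = scrape_names(url, session)
            except requests.RequestException as exc:
                # A failed page is reported and skipped, not fatal.
                print('skipping', url, exc)
                continue
            for name in names:
                writer.writerow([page_id, name])
            time.sleep(delay)  # pause between requests (assumed value)
```

Calling scrape_all() starts the full run; scrape_all(first_id=1, last_id=20) is a quick way to try it on a few pages first. With a 0.5-second pause, 11,000 pages take at least about an hour and a half, so the number of pages is not a problem in itself, only a matter of time.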
Tags: python pandas web web-scraping beautifulsoup