Web scraping a large number of links? Answers

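【Question title】: Web scraping large amount of links?
【Posted】: 2020-11-17 09:22:53
【Question description】:

I am very new to web scraping. I have started using BeautifulSoup in Python. I wrote code that loops through a list of URLs and gets the data I need. It works fine for 10-12 links, but I am not sure the same code will hold up if the list has more than 100 links. Is there an alternative approach, or another library, that can take a large list of URLs and fetch the data without harming the website in any way? Here is my code so far.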

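from requests import get
from bs4 import BeautifulSoup

url_list = [url1, url2, url3, url4, url5]  # placeholders for the real URLs
mylist = []
for url in url_list:
    res = get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    data = soup.find('pre').text  # text of the first <pre> tag on the page
    mylist.append(data)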

【Comments】:

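Making 100 requests will not "harm" the website. Much larger numbers might start to cause problems. The library you use makes no difference: the site has to handle as many requests as you send it. If you want to be gentler on the server, add time.sleep(number_of_seconds) between requests.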
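As a rough sketch of that suggestion, the loop from the question could be wrapped like this. The function names (`fetch_html`, `extract_pre_text`, `scrape_politely`), the 1-second default delay, and the 10-second timeout are all my own assumptions, not something from the question; tune them to the site you are scraping.

```python
import time


def fetch_html(url, timeout=10):
    """Download one page with requests; raises on network or HTTP errors."""
    import requests  # imported lazily so the loop below can be tried without it
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_pre_text(html):
    """Return the text of the first <pre> tag, or None if the page has none."""
    from bs4 import BeautifulSoup
    pre = BeautifulSoup(html, 'html.parser').find('pre')
    return pre.text if pre is not None else None


def scrape_politely(urls, delay_seconds=1.0, fetch=fetch_html, extract=extract_pre_text):
    """Visit each URL in turn, pausing between requests and skipping failures."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        try:
            results[url] = extract(fetch(url))
        except Exception as error:  # one bad link should not abort the whole run
            print(f'Skipping {url}: {error}')
            results[url] = None
    return results
```

Calling `scrape_politely(url_list)` returns a dict mapping each URL to its extracted text, with `None` for any link that failed, so a long run still finishes and you can retry the failures afterwards.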

Tags: python web-scraping beautifulsoup

【Solution 1】:

Here is an example that may work for you.

from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['url1']
    # refresh_urls = True  # Uncomment to download links again even if they were already downloaded
    def __init__(self):
        # If your links are stored elsewhere, load them here.
        self.start_urls = utils.getFileLines('your url file name.txt')
        Spider.__init__(self,self.name) # Necessary

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        data = doc.select('pre>text()') # Extract the data you want.
        return {'Urls': None, 'Data':{'data':data} } # Return the data to the framework, which will save it for you.

SimplifiedMain.startThread(MySpider())  # Start download

You can find more examples, along with the source code of the simplified_scrapy library, here: https://github.com/yiyedata/simplified-scrapy-demo
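If you would rather keep your links in a text file but load them with plain Python, something like the following might do. This is only a sketch: `load_urls` is a hypothetical helper of mine, it assumes one URL per line in a UTF-8 file, and it is not a description of how `utils.getFileLines` behaves.

```python
from pathlib import Path


def load_urls(path):
    """Read one URL per line, dropping blank lines and duplicates but keeping order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding='utf-8').splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Dropping duplicates up front means a long list never hits the same page twice, which is one easy way to go lighter on the server.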

【Discussion】: