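【Posted】:2020-12-02 18:27:45
【Question】:
Can Scrapy print results in real time? I plan to crawl large websites, and I am worried that if my VPN connection drops, the crawl effort will be wasted because no results will have been written out.
I am currently using a VPN with rotating user agents. I know that rotating proxies would be better than a VPN, but that is planned for a future upgrade of the script.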
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

results = open('results.csv','w')

class TestSpider(CrawlSpider):
    name = "test"
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]
    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]
    f.close()
    rules = (Rule(LinkExtractor(allow=('/'), deny=('9','10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            print(response.url,">",pattern,'>',len(result), file = results)
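Thank you very much.
Update
harada's script works perfectly, with no changes apart from the output file. All I needed to do was make a few modifications to my current files, as shown below, so that everything works.
Spider - defined the items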
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"
    # Both files are read once, when the class body is executed.
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]
    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]
    rules = (Rule(LinkExtractor(allow=('/'), deny=('9','10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            # Create a fresh item per pattern so each yielded item is independent.
            item = TestItem()
            item['url'] = response.url
            item['pattern'] = pattern
            item['count'] = len(result)
            yield item
items.py - added the items as fields
import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()
settings.py - uncommented ITEM_PIPELINES
ITEM_PIPELINES = {
    'test.pipelines.TestPipeline': 300,
}
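The ITEM_PIPELINES entry above points at a TestPipeline class in pipelines.py, which is not shown in this post. For reference, here is a minimal sketch of what such a pipeline might look like; it is not harada's actual script. It assumes the three item fields defined above and writes each item to a CSV file as soon as it arrives, flushing after every row so that rows already scraped are not lost if the connection drops. The `path` argument and its `results.csv` default are assumptions made for illustration.

```python
import csv


class TestPipeline:
    """Hypothetical pipeline: write every item to a CSV file as it is scraped."""

    def __init__(self, path="results.csv"):
        # `path` is an assumed default, not taken from the original post.
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["url", "pattern", "count"])
        self.file.flush()

    def process_item(self, item, spider):
        # Scrapy calls this for every item the spider yields.
        self.writer.writerow([item["url"], item["pattern"], item["count"]])
        # Push the row out of Python's buffer right away, so it is not
        # lost if the process is interrupted later in the crawl.
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.file is not None:
            self.file.close()
```

Because the class defines no `from_crawler` method, Scrapy would instantiate it with no arguments and the default path would be used. The flush after each row is the part that addresses the "real time" concern: the file can be inspected while the crawl is still running.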
【Comments】: