Scrapy crawler framework (1)
Creating a project
scrapy startproject <project-name>
Creating a spider file
First cd into the newly created project directory, then run:
scrapy genspider <spider-name> <site-domain>
Editing the configuration file settings.py
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
# Obey robots.txt rules (set to False here so pages disallowed by robots.txt are still fetched)
ROBOTSTXT_OBEY = False
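Besides editing settings.py globally, Scrapy also lets a single spider override settings through the `custom_settings` class attribute. A sketch of the fragment as it would sit inside a spider class (the values simply mirror the global ones above):

```python
# Inside the spider class: per-spider settings that Scrapy merges over
# the project's settings.py when this particular spider runs.
custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    },
}
```

This keeps request headers scoped to one spider, which is handy once a project contains several spiders with different needs.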
The first example
Scraping Qiushibaike (糗事百科) jokes; the spider below targets www.yicommunity.com, the domain used in its start_urls.
# -*- coding: utf-8 -*-
import scrapy


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['www.yicommunity.com']
    start_urls = ['http://www.yicommunity.com/']

    def parse(self, response):
        print("=" * 80)
        # Each joke lives in a direct child <div> of <div class="col1">
        contents = response.xpath('//div[@class="col1"]/div')
        print(contents)
        print("=" * 80)
        for content in contents:
            # Relative XPaths: look inside the current joke <div>
            author = content.xpath("./div[@class='author']/text()").get()
            word = content.xpath("./div[@class='content']/text()").get()
            print(author, word)
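The two XPath expressions in parse() can be tried outside Scrapy. Below is a minimal sketch using the standard library's xml.etree.ElementTree in place of Scrapy's selector (an assumption made for illustration; Scrapy actually uses the parsel library, and the sample HTML here is invented):

```python
# Stand-alone demo of the XPath logic from parse(), on invented HTML.
import xml.etree.ElementTree as ET

sample_html = """
<html><body>
<div class="col1">
  <div>
    <div class="author">Alice</div>
    <div class="content">First joke</div>
  </div>
  <div>
    <div class="author">Bob</div>
    <div class="content">Second joke</div>
  </div>
</div>
</body></html>
"""

root = ET.fromstring(sample_html)
pairs = []
# Same shape as the spider: direct child <div>s of <div class="col1">,
# then relative lookups for the author and content inside each one.
for content in root.findall(".//div[@class='col1']/div"):
    author = content.find("./div[@class='author']").text
    word = content.find("./div[@class='content']").text
    pairs.append((author, word))

print(pairs)  # [('Alice', 'First joke'), ('Bob', 'Second joke')]
```

ElementTree only supports a subset of XPath, but it covers the class-attribute predicates this spider relies on, so it is a quick way to sanity-check selectors before a crawl.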
Running from the command line
Inside the project directory, run:
scrapy crawl qsbk
Running in PyCharm
Create a start.py file in the same directory as pyvenv.cfg and run it:
from scrapy import cmdline
cmdline.execute("scrapy crawl qsbk".split())